Discussion: Finding out if 'vacuum --analyze-in-stages' has generated enough stats


Finding out if 'vacuum --analyze-in-stages' has generated enough stats

From
Vineet Naik
Date:
Hello,

Let me provide some context. I am evaluating `pg_upgrade` for
upgrading a pg 9.3 server to 9.6. There are multiple dbs, only one of
which is large (~3.5 TB with indexes) and the others are pretty small.

For a trial run and to understand the approximate duration of
downtime, I restored a basebackup on a standalone node and ran
`pg_upgrade` with `--link` flag. This command completed in only ~30s.
But after that the `analyze_new_cluster.sh` script took 130 hours to
complete. This much downtime is definitely not acceptable for us.

Looking inside the `analyze_new_cluster.sh` script, it runs

"/usr/lib/postgresql/9.6/bin/vacuumdb" --all --analyze-in-stages

to generate statistics which are used for determining efficient query
execution plans. So I'm assuming that resuming reads to the server
without running ANALYZE would result in considerable performance
degradation.

Now, only 2 of the tables across all dbs account for most of the
size (738 GB + 916 GB). If `vacuumdb` is run separately for the smaller
tables, the stats for those will be generated quickly, rather than
waiting for each stage to complete on the 2 large tables. So I'm
planning to use this approach.
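As a sketch of that approach (the database and table names below are
hypothetical), `vacuumdb` accepts `-t`/`--table`, so the two large
tables can be staged separately from everything else:

# stage the two large tables on their own, in the background
vacuumdb --analyze-in-stages -d mydb -t big_table_1 -t big_table_2 &

# meanwhile, stage the small databases so they become usable quickly
vacuumdb --analyze-in-stages -d small_db_1
vacuumdb --analyze-in-stages -d small_db_2

Note that a single `--analyze-in-stages` invocation runs all three
stages for the named tables before returning.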

If I run `vacuumdb --analyze-in-stages` for the largest table, it prints:

vacuumdb: processing database "<dbname>": Generating minimal optimizer statistics (1 target)

And then nothing gets printed for several hours. It's not clear if
it's taking that long to process 1 target or if it has started
processing for 10 targets. Since it says "processing database" I'm
assuming it has not yet completed processing for 1 target.

As per the documentation about `--analyze-in-stages`:

> This option will try to create some statistics as fast as possible, to make the database usable, and then produce full statistics in the subsequent stages.

So my question is, after which stage is it ok to send traffic to the
server without much performance impact? Also, is it possible to query
the stats view in pg and determine if enough stats have been
generated?
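
(By "the stats view" I mean something along the lines of
`pg_stat_user_tables` / `pg_stats` — e.g., with hypothetical table
names:

-- when was each table last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('big_table_1', 'big_table_2');

— though that only shows when an ANALYZE ran, not how detailed the
resulting statistics are.)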

Thanks,
Vineet



Re: Finding out if 'vacuum --analyze-in-stages' has generated enough stats

From
Bruce Momjian
Date:
On Wed, Mar 31, 2021 at 12:38:14PM +0530, Vineet Naik wrote:
> > This option will try to create some statistics as fast as possible, to make the database usable, and then produce full statistics in the subsequent stages.
> 
> So my question is, after which stage is it ok to send traffic to the
> server without much performance impact? Also, is it possible to query
> the stats view in pg and determine if enough stats have been
> generated?

That is a very hard question to answer.  Some queries have only a few
possible plans, so few or no statistics are fine.  Other queries require
detailed statistics to generate efficient plans.  I think you need to
set up a dummy server with no stats and try your workload on that, then
run the stages manually and test after each stage to see how the
performance is.
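
For reference, the stages are equivalent to ANALYZE runs with an
increasing default_statistics_target (1, 10, then the default), so they
can be reproduced manually per table — a sketch, with hypothetical
database and table names:

PGOPTIONS='-c default_statistics_target=1' vacuumdb --analyze-only -d mydb -t big_table_1
# ...test workload performance here...
PGOPTIONS='-c default_statistics_target=10' vacuumdb --analyze-only -d mydb -t big_table_1
# ...test again...
vacuumdb --analyze-only -d mydb -t big_table_1    # full default target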

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Finding out if 'vacuum --analyze-in-stages' has generated enough stats

From
Vineet Naik
Date:
On Wed, Mar 31, 2021 at 6:46 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Mar 31, 2021 at 12:38:14PM +0530, Vineet Naik wrote:
> > > This option will try to create some statistics as fast as possible, to make the database usable, and then produce full statistics in the subsequent stages.
> >
> > So my question is, after which stage is it ok to send traffic to the
> > server without much performance impact? Also, is it possible to query
> > the stats view in pg and determine if enough stats have been
> > generated?
>
> That is a very hard question to answer.  Some queries have only a few
> possible plans, so few or no statistics are fine.  Other queries require
> detailed statistics to generate efficient plans.  I think you need to
> set up a dummy server with no stats and try your workload on that, then
> run the stages manually and test after each stage to see how the
> performance is.

That's a good idea. And it should be easily doable as we already have
a dummy server and there are only 2 tables of concern.

Thanks,
Vineet