Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system
Дата
Msg-id 5123F1D3.30303@fuzzy.cz
обсуждение исходный текст
Ответ на Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: PATCH: Split stats file per database WAS: autovacuum stress-testing our system  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Список pgsql-hackers
On 19.2.2013 11:27, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Dne 19.02.2013 05:46, Alvaro Herrera napsal:
>>> Mastodon failed:
>>> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2013-02-19%2000%3A00%3A01
>>>
>>> probably worth investigating a bit; we might have broken something.
> 
>> Hmmm, interesting. A single Windows machine, while the other Windows 
>> machines seem to work fine (although some of them were not built for a 
>> few weeks).
> 
> Could be random chance --- we've seen the same failure before, eg
> 
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mastodon&dt=2012-11-25%2006%3A00%3A00
> 
>             regards, tom lane

I'm looking at that test, and I'm not really sure about a few details.

First, this function seems pretty useless to me:

=======================================================================
create function wait_for_stats() returns void as $$ declare   start_time timestamptz := clock_timestamp();   updated
bool;begin   -- we don't want to wait forever; loop will exit after 30 seconds   for i in 1 .. 300 loop
 
     -- check to see if indexscan has been sensed     SELECT (st.idx_scan >= pr.idx_scan + 1) INTO updated       FROM
pg_stat_user_tablesAS st, pg_class AS cl, prevstats AS pr      WHERE st.relname='tenk2' AND cl.relname='tenk2';
 
     exit when updated;
     -- wait a little     perform pg_sleep(0.1);
     -- reset stats snapshot so we can test again     perform pg_stat_clear_snapshot();
   end loop;
   -- report time waited in postmaster log (where it won't change test
output)   raise log 'wait_for_stats delayed % seconds',     extract(epoch from clock_timestamp() - start_time); end $$
languageplpgsql;
 
=======================================================================

AFAIK the stats remain the same within a transaction, and as a function
runs within a transaction, it will either get new data on the first
iteration, or it will run all 300 of them. I've checked several
buildfarm members and I'm yet to see a single duration between 12ms and
30sec.

So IMHO we can replace the function call with pg_sleep(30) and we'll get
about the same effect.

But this obviously does not answer the question why it failed, although
on both occasions there's this log message:

[50b1b7fa.0568:14] LOG:  wait_for_stats delayed 34.75 seconds

which essentialy means the stats were not updated before the call to
wait_for_stats().

Anyway, there are these two failing queries:

=======================================================================
-- check effects
SELECT st.seq_scan >= pr.seq_scan + 1,      st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,      st.idx_scan >=
pr.idx_scan+ 1,      st.idx_tup_fetch >= pr.idx_tup_fetch + 1 FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats
ASprWHERE st.relname='tenk2' AND cl.relname='tenk2';?column? | ?column? | ?column? | ?column?
 
----------+----------+----------+----------t        | t        | t        | t
(1 row)

SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,      st.idx_blks_read + st.idx_blks_hit >=
pr.idx_blks+ 1 FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS prWHERE st.relname='tenk2' AND
cl.relname='tenk2';?column?| ?column?
 
----------+----------t        | t
(1 row)
=======================================================================

The first one returns just falses, the second one retuns either (t,f) or
(f,f) - for the two failures posted by Alvaro and TL earlier today.

I'm really wondering how that could happen. The only thing that I can
think of is some strange timing issue, causing lost requests to write
the stats or maybe some of the stats updates. Hmmm, IIRC the stats are
sent over UDP - couldn't that be related?

Tomas



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: Call for Google Summer of Code mentors, admins
Следующее
От: Gurjeet Singh
Дата:
Сообщение: Patch to make pgindent work cleanly