Re: [HACKERS] WAL logging problem in 9.4.3?
From | Noah Misch |
---|---|
Subject | Re: [HACKERS] WAL logging problem in 9.4.3? |
Date | |
Msg-id | 20200321224920.GB1763544@rfd.leadboat.com |
In reply to | Re: [HACKERS] WAL logging problem in 9.4.3? (Noah Misch <noah@leadboat.com>) |
Responses |
Re: [HACKERS] WAL logging problem in 9.4.3?
|
List | pgsql-hackers |
On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
>   Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
>   Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
>   Mar 21 13:16:37 # got: '1'
>   Mar 21 13:16:37 # expected: '2'
>   Mar 21 13:16:46 # Looks like you failed 1 test of 34.
>   Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
>   -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.

Setting force_parallel_mode = regress was what it took to reproduce this:

  printf '%s\n%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
  make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:

  printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress'>/tmp/minimal_parallel.conf
  make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
  find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs. For example, get_relation_info() behaves fine:

  /* Temporary and unlogged relations are inaccessible during recovery. */
  if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.
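
To make the 2GB point concrete, here is a rough shell sketch of the arithmetic. It assumes, without the message saying so, that wal_skip_threshold is stored as a kB count whose ceiling is INT_MAX on builds where size_t and long are both wider than 4 bytes, and INT_MAX / 1024 otherwise. The helper names max_kilobytes and fits are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Sketch only: models the assumed wal_skip_threshold ceiling, in kB.
# Assumption: ceiling is INT_MAX kB when size_t and long are wider than
# 4 bytes, and INT_MAX / 1024 kB (just under 2GB) otherwise.

INT_MAX=2147483647

# max_kilobytes WIDTH: print the assumed ceiling in kB for a platform whose
# size_t and long are WIDTH bytes wide (hypothetical helper).
max_kilobytes() {
    if [ "$1" -gt 4 ]; then
        echo "$INT_MAX"
    else
        echo $((INT_MAX / 1024))
    fi
}

# fits KB WIDTH: succeed if KB does not exceed the assumed ceiling
# (hypothetical helper).
fits() {
    [ "$1" -le "$(max_kilobytes "$2")" ]
}

ONE_GB_KB=$((1024 * 1024))          # 1GB expressed in kB
ONE_TB_KB=$((1024 * 1024 * 1024))   # 1TB expressed in kB

for width in 4 8; do
    for kb in "$ONE_GB_KB" "$ONE_TB_KB"; do
        if fits "$kb" "$width"; then verdict=accepted; else verdict=rejected; fi
        echo "width=$width value=${kb}kB $verdict"
    done
done
```

Under those assumptions, only the 1TB value on the 4-byte platform is rejected, which is consistent with replacing 1TB by 1GB in the test.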
Attachments