Discussion: BUG #17245: Index corruption involving deduplicated entries
The following bug has been logged on the website:

Bug reference:      17245
Logged by:          Andrew Gierth
Email address:      andrew@tao11.riddles.org.uk
PostgreSQL version: 14.0
Operating system:   FreeBSD 13.0-RELEASE
Description:

From a report from IRC, from a user running a mediawiki instance on PG:

  select ctid, page_title from "page"
   WHERE page_title = 'Isokaze' and page_namespace = 0;

returned two rows rather than the expected 1:

  (181,53) | Ying_Swei
  (722,3)  | Isokaze

Disabling index scans restored the expected output of 1 row.

Inspecting the index revealed this:

000114a0: 2f47 616c 6c65 7279 0000 1000 0320 2860  /Gallery..... (`
000114b0: 1149 736f 6b61 7a65 0000 7600 1800 0000  .Isokaze..v.....
000114c0: b500 0900 0000 d202 3800 0000 0000 0000  ........8.......
000114d0: 0000 c501 2d00 1840 1749 7365 2f51 756f  ....-..@.Ise/Quo

which appears to me to be a deduplicated index entry pointing to heap rows (181,9) and (722,56). Inspecting the heap showed that both of those ctids are just pointer entries, to (181,53) and (722,3) respectively, which explains the 2-row result. What I don't know is how this could have happened in the first place.

The definition of the affected index is:

  CREATE INDEX page_main_title ON mediawiki.page
    USING btree (page_title text_pattern_ops)
    WHERE (page_namespace = 0);

The real entries for 'Ying_Swei' in the index are not in the same page or even any nearby page, and it seems unlikely that the page_title was updated. The original reporter (who I will CC on a followup message) still has a snapshot of the corrupt data. (A REINDEX naturally fixed the data.)
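[Editor's note: the "disabling index scans" check mentioned in the report can be reproduced with standard planner settings alone; a sketch, run in the reporter's session:]

```sql
-- Force a sequential scan so results come from the heap rather than
-- the (possibly corrupt) btree index; these settings are session-local.
SET enable_indexscan = off;
SET enable_bitmapscan = off;

SELECT ctid, page_title
FROM "page"
WHERE page_title = 'Isokaze' AND page_namespace = 0;

RESET enable_indexscan;
RESET enable_bitmapscan;
```

If the seqscan returns one row while a plain index scan returns two, the index (not the heap) is the side giving wrong answers.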
The following bug has been logged on the website:

Bug reference: 17245

CC'ing the original reporter. See report at
https://www.postgresql.org/message-id/17245-ddf06aaf85735f36%40postgresql.org

--
Andrew (irc:RhodiumToad)
On Sun, Oct 24, 2021 at 4:57 AM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> CC'ing to original reporter.

I suggest that pg_amcheck be run against the database, to try to establish the overall extent of the corruption -- including all indexes and heap relations from the database. This should start with the original known-corrupt backup, from before the REINDEX. Then report back the output of pg_amcheck to the list. See:

https://www.postgresql.org/docs/14/app-pgamcheck.html

Please specify the --heapallindexed and --parent-check options when running pg_amcheck, so that it will perform the most thorough possible checks against indexes. --heapallindexed is particularly important here, because the issue that Andrew highlighted is probably not detectable just by looking at the index itself. It's purely an inconsistency between the index and the table. This class of problem has many possible causes.

Thanks
--
Peter Geoghegan
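[Editor's note: for a single suspect index, the equivalent of these pg_amcheck options can be run from SQL via the amcheck extension, as the next message in the thread does; a sketch using the PostgreSQL 14 function signature:]

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Parent/child relationship checks plus heap-vs-index verification for
-- one index; takes a ShareLock, so run it against the restored snapshot
-- rather than a busy production server.
SELECT bt_index_parent_check('mediawiki.page_main_title',
                             heapallindexed => true,
                             rootdescend => true);
```

pg_amcheck with --heapallindexed and --parent-check effectively runs this for every btree index in the database.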
During investigation, `SELECT bt_index_parent_check('mediawiki.page_main_title', true, true);` displayed no errors.

I have installed the same build of 14.0 in a VM and started it with the data directory from before the reindex. pg_amcheck does report errors, but these are not related to the index in question (page_main_title):

$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
    ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
    DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
btree index "azurlane_wiki.mediawiki.page_len":
    ERROR:  posting list contains misplaced TID in index "page_len"
    DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
btree index "azurlane_wiki.mediawiki.transcode_key_idx":
    ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
    DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

This instance displays the same behaviour with page_main_title:

azurlane_wiki=# select ctid, page_title from mediawiki.page
                WHERE page_title = 'Isokaze' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (181,53) | Ying_Swei
 (722,3)  | Isokaze
(2 rows)

azurlane_wiki=# select ctid, page_title from mediawiki.page
                WHERE page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga
(2 rows)

After REINDEX'ing the three indices listed by pg_amcheck, there is no apparent change to the state of the page_main_title index:

azurlane_wiki=# reindex index mediawiki.page_redirect_namespace_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.page_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.transcode_key_idx;
REINDEX

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
[xiatian@freebsd pg_amcheck]$

azurlane_wiki=# select ctid, page_title from mediawiki.page
                WHERE page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga
(2 rows)

P. S. My previous copy of this e-mail was sent from an invalid address; I apologise for the possible noise.

--
K. R.
On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> PostgreSQL version: 14.0
> Operating system: FreeBSD 13.0-RELEASE

PostgreSQL 14 is the first version where the default wal_sync_method is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit f900a79e. Perhaps that has something to do with the problem seen here. I have no concrete reason to suspect fdatasync() here -- I'm just pattern matching. But the inconsistency that Andrew reported looks like any other scenario where the system's wal_sync_method doesn't provide the expected guarantees. This looks like generic corruption to me.

Andrew said that the "real entries for 'Ying_Swei' in the index are not in the same page or even any nearby page". It is reasonable to suppose that the TID in the index pointing to the wrong tuple in the heap/table is from an index tuple that has no business being anywhere near the leaf page that we find it in -- I surmise that it's too far away in the keyspace for that. (I think that this was Andrew's exact point, in fact, but I just want to make sure.)

I suspect that there was TID recycling by VACUUM involved here. Originally, the leaf page image that Andrew examined might well have actually pointed to the right thing in the heap. Perhaps VACUUM ran, and recycled the relevant heap TID, while somehow leaving an older pre-VACUUM leaf page behind (any kind of storage misbehavior could do this).

There doesn't have to have been a VACUUM, actually. A simple hard crash (with a misbehaving wal_sync_method or whatever) could easily lead to the same symptoms. You just need to have heapam "allocate new heap TIDs" and then forget about having done so, without the index structure also forgetting.

--
Peter Geoghegan
On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:
> $ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
> btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
>     ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
>     DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
> btree index "azurlane_wiki.mediawiki.page_len":
>     ERROR:  posting list contains misplaced TID in index "page_len"
>     DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
> btree index "azurlane_wiki.mediawiki.transcode_key_idx":
>     ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
>     DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Thanks for getting back to me with that so quickly.

I suspect that these posting list tuples have duplicate TIDs, which is never supposed to happen -- nbtree expects that heapam (and the broader system) will never allow a duplicate TID to be present in the whole index. It's possible that I'm wrong, and the corrupt posting list TIDs are actually in the wrong order (i.e. they're all unique, but are somehow not in perfect TID-wise order). But I doubt it.

> After REINDEX'ing the three indices listed by pg_amcheck, there is no
> apparent change to the state of the page_main_title index:
> azurlane_wiki=# select ctid, page_title from mediawiki.page
>                 WHERE page_title = 'Kaga' and page_namespace = 0;
>    ctid   | page_title
> ----------+------------
>  (446,32) | Belfast
>  (720,53) | Kaga
> (2 rows)

I'm a little confused. Do you mean that even a REINDEX isn't enough to stop the index from giving this particular wrong answer, where it thinks that 'Belfast' is the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or, are you concerned that pg_amcheck doesn't detect a remaining problem with one index?

Thanks
--
Peter Geoghegan
On Sun, Oct 24, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:
> > $ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend azurlane_wiki
> > btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
> >     ERROR:  posting list contains misplaced TID in index "page_redirect_namespace_len"
> >     DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
> > btree index "azurlane_wiki.mediawiki.page_len":
> >     ERROR:  posting list contains misplaced TID in index "page_len"
> >     DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
> > btree index "azurlane_wiki.mediawiki.transcode_key_idx":
> >     ERROR:  posting list contains misplaced TID in index "transcode_key_idx"
> >     DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Does it make any difference if you run "./pg_amcheck -U postgres --heapallindexed azurlane_wiki" instead?

Typically I would expect this to be less interesting, because you're asking pg_amcheck to do less checking. But this is not a typical case.

--
Peter Geoghegan
>>>>> "Peter" == Peter Geoghegan <pg@bowt.ie> writes:

 Peter> PostgreSQL 14 is the first version where the default
 Peter> wal_sync_method is fdatasync() on FreeBSD -- though only with
 Peter> FreeBSD 13. See commit f900a79e. Perhaps that has something to
 Peter> do with the problem seen here.

Nope. The database in question was restored from a pg_dumpall backup three weeks ago and I'm told there have been no crashes or even unclean restarts since then.

 Peter> I suspect that there was TID recycling by VACUUM involved here.

That also doesn't seem likely from the heap page images I saw, which suggested a fairly low update rate, HOT updates, and page cleanup rather than much vacuum activity (in particular there were several LP_REDIRECT pointers and few free pointers).

--
Andrew (irc:RhodiumToad)
Do you happen to have WAL traffic for this table and index? -- Álvaro Herrera
On Sun, Oct 24, 2021 at 5:56 PM Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Nope. The database in question was restored from a pg_dumpall backup
> three weeks ago and I'm told there have been no crashes or even unclean
> restarts since then.

Anything is possible, but this doesn't look like logical index corruption. If we assume that it was something like a faulty deduplication pass, then we have to explain how the spurious TID ended up in that particular leaf page to begin with. We also have to explain why the spurious TID is nevertheless a valid TID for a totally distinct heap tuple, which is unlikely to occur by chance.

I find it slightly suspicious that the "correct" heap blocks (the blocks where the correct heap tuples are located) are close together in the heap -- 720 (in the case of 'Kaga'), 722 (in the case of 'Isokaze'). Meanwhile the "incorrect" heap blocks (the blocks where the non-matching heap tuples are located, which are nevertheless returned by the index scan) are earlier in the table -- 181 and 446.

> That also doesn't seem likely from the heap page images I saw, which
> suggested a fairly low update rate, HOT updates, and page cleanup rather
> than much vacuum activity (in particular there were several LP_REDIRECT
> pointers and few free pointers).

FWIW Postgres 14 can truncate a heap page's line pointer array during VACUUM's second pass -- see commit 3c3b8a4b. So a lack of observable free line pointers is not quite the signal it once was.

What does pg_stat_user_tables say about autovacuum?

--
Peter Geoghegan
On 25.10.2021 1:28, Peter Geoghegan wrote:
> Does it make any difference if you run "./pg_amcheck -U postgres
> --heapallindexed azurlane_wiki" instead?
>
> Typically I would expect this to be less interesting, because you're
> asking pg_amcheck to do less checking. But this is not a typical case.

No, there seem to be no differences between using different sets of arguments:

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed azurlane_wiki
[xiatian@freebsd pg_amcheck]$

> Do you mean that even a REINDEX isn't enough to stop the index from
> giving this particular wrong answer, where it thinks that 'Belfast' is
> the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or,
> are you concerned that pg_amcheck doesn't detect a remaining problem
> with one index?

The latter. REINDEX fixes the problem with page_main_title. I apologise for not being clear enough; I have never looked too deep into PostgreSQL's inner workings before and have been using it purely as a regular application developer.

> What does pg_stat_user_tables say about autovacuum?

For "mediawiki.page" specifically:

last_vacuum:      2021-10-24 12:21:22.256194+09
last_autovacuum:  2021-10-05 18:57:43.850106+09
last_analyze:     2021-10-24 18:48:04.706226+09
last_autoanalyze: 2021-10-09 19:38:36.441795+09

vacuum_count is 20 (the periodic daily script), autovacuum_count is 1.

Considering the constraints for the index in question, I can tell that changes affecting it were made on the 22nd of October and before, and then yesterday on the 24th. The first user report regarding the problem was on the 23rd, unless, of course, it went unnoticed before.

--
K. R.
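[Editor's note: the per-table vacuum statistics quoted above can be read with a query along these lines, using the column names of the stock pg_stat_user_tables view:]

```sql
-- Vacuum/analyze history for the affected table only.
SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       vacuum_count, autovacuum_count
FROM pg_stat_user_tables
WHERE schemaname = 'mediawiki' AND relname = 'page';
```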
>>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

 Alvaro> Do you happen to have WAL traffic for this table and index?

Unfortunately there's no WAL archiving, and no available WAL files from before the state was known to be corrupt. We'll look into enabling archiving in case the problem recurs or can be reproduced.

--
Andrew (irc:RhodiumToad)
On 25.10.2021 0:52, Peter Geoghegan wrote:
> On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> PostgreSQL version: 14.0
>> Operating system: FreeBSD 13.0-RELEASE
> PostgreSQL 14 is the first version where the default wal_sync_method
> is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
> f900a79e. Perhaps that has something to do with the problem seen here.

To clarify: I had been running PostgreSQL 13.3 since the 18th of May (upgraded from 13.2) until the upgrade to 14.0.

The REL_13_3 tag seems to include the change from f900a79e (matching the commit message: ‘Like commit 576477e73c4, which did the same for Linux, back-patch to all supported releases.’), and both the 13.3 package from my package manager's cache and the 13.4 package currently available from the FreeBSD repositories use fdatasync() as well:

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.3
[root@lincle-backup ~]# postgres --describe-config | grep wal_sync_method
wal_sync_method	sighup	Write-Ahead Log / Settings	ENUM	fdatasync	Selects the method used for forcing WAL updates to disk.

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.4
[root@lincle-backup ~]# postgres --describe-config | grep wal_sync_method
wal_sync_method	sighup	Write-Ahead Log / Settings	ENUM	fdatasync	Selects the method used for forcing WAL updates to disk.

According to the logs I have (daily reports), there have been no crashes or unclean shutdowns of PostgreSQL on the server since the database jail re-creation time (August of 2014 or so). Vacuum has been performed daily on all databases since at least around 2017 via the daily periodic script.

--
K. R.
On Mon, Oct 25, 2021 at 10:50 PM K. R. <iijima.yun@koumakan.jp> wrote:
> To clarify: I had been running PostgreSQL 13.3 since the 18th of May
> (upgraded from 13.2) until the upgrade to 14.0.
>
> The REL_13_3 tag seems to include the change from f900a79e (matching the
> commit message: ‘Like commit 576477e73c4, which did the same for Linux,
> back-patch to all supported releases.’) and both the 13.3 package from
> my package manager's cache and the 13.4 package available from FreeBSD
> repositories at the moment uses fdatasync() as well:

For the record, commit f900a79e was about preserving pre-existing behaviour dating back a few years (ie avoiding a change). The history is like this:

On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.

If built on FreeBSD 11.1+ (July 2017), it would select wal_sync_method=fdatasync as the default, because the fdatasync() system call was added and would be detected by the configure script.

If built on FreeBSD 13.0+ (April 2021), it would traditionally have selected wal_sync_method=open_datasync, because the O_DSYNC flag was added to <fcntl.h>. But commit f900a79e changed PostgreSQL 13.3+ (April 2021) and sibling releases to prefer wal_sync_method=fdatasync still, despite the presence of O_DSYNC.

Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2 around the time of FreeBSD 13.0's release on the 13th of April, it would have been running with wal_sync_method=open_datasync (assuming the default was used) in that 5-week window. I could talk about what exact effects that would have if I knew which file system we're talking about here, but it doesn't seem to be relevant to this case, given that "The database in question was restored from a pg_dumpall backup _three weeks ago_ and I'm told there have been no crashes or even unclean restarts since then" (my emphasis).
On 25.10.2021 23:49, Thomas Munro wrote:
> For the record, commit f900a79e was about preserving pre-existing
> behaviour dating back a few years (ie avoiding a change). The history
> is like this:
>
> On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.
>
> If built on FreeBSD 11.1+ (July 2017), it would select
> wal_sync_method=fdatasync as the default, because the fdatasync()
> system call was added and would be detected by the configure script.
>
> If built on FreeBSD 13.0+ (April 2021), it would traditionally have
> selected wal_sync_method=open_datasync, because the O_DSYNC flag was
> added to <fcntl.h>. But commit f900a79e changed PostgreSQL 13.3+
> (April 2021) and sibling releases to prefer wal_sync_method=fdatasync
> still, despite the presence of O_DSYNC.

Thank you, this confirms the assumptions I had after reading through the sources of different releases and testing the --describe-config output from old packages in my package manager's cache.

> Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of
> May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2
> around the time of FreeBSD 13.0's release on the 13th of April, it
> would have been running with wal_sync_method=open_datasync (assuming
> default used) in that 5 week window. I could talk about what exact
> effects that would have if I knew which file system we're talking
> about here, but it doesn't seems to be relevant to this case, given
> that "The database in question was restored from a pg_dumpall backup
> _three weeks ago_ and I'm told there have been no crashes or even
> unclean restarts since then" (my emphasis).

In case this might be important later: everything is on ZFS (upgraded from FreeBSD's old zfs port to openzfs, as imported in FreeBSD 13.0).
Log entries related to the upgrade:

Oct  5 09:55:27 db pkg[7101]: postgresql13-server-13.3_1 deinstalled
Oct  5 09:55:41 db pkg[7144]: postgresql14-server-14.0 installed

This was preceded by pg_dumpall and followed by initdb with the exact same parameters (using the unchanged rc.conf setting postgresql_initdb_flags="--encoding=utf-8 --lc-collate=en_GB.UTF-8") and the import of the dump using psql. I do not use pg_upgrade, since different versions of PostgreSQL cannot be installed at the same time in FreeBSD using the ports/packages system, and I do not find dump/restore much of a chore (plus it solves issues with possible locale changes).

There have been no crashes since; there was one reload (pg_hba.conf edits) and several restarts (to snapshot the file structure with the corrupted index, plus another enabling WAL archiving today in the morning).

I have postgresql-rum installed, if this matters; it is used by a Pleroma instance in a separate database.

--
K. R.
On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:
> There have been no crashes since; there was one reload (pg_hba.conf
> edits) and several restarts (to snapshot the file structure with the
> corrupted index, plus another enabling WAL archiving today in the morning).

Thank you for your help.

I wonder if you can show me a page that amcheck reports as having an incorrect posting list? I am interested in posting list tuples that are not just pointing to the wrong thing, but actually look wrong without even looking at the heap. You must have done this for Andrew already, but note that the procedure is outlined here:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

For example, the amcheck error with "DETAIL: Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8" for the btree index "azurlane_wiki.mediawiki.page_len" has an "index TID" of 14, meaning that block number 14 from that index is interesting to me. It could help me with debugging.

I can treat this page image as confidential. You could send it to me privately.

Thanks again
--
Peter Geoghegan
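[Editor's note: the wiki procedure linked above boils down to dumping the raw block with the pageinspect extension; a sketch against block 14 of the first index that amcheck complained about, assuming pageinspect is available on the VM:]

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Raw page image as hex, suitable for sharing privately.
SELECT encode(get_raw_page('mediawiki.page_redirect_namespace_len', 14), 'hex');

-- Decoded index tuples for the same block; for a posting list tuple
-- the "tids" column lists every heap TID in the posting list, which
-- is where duplicate or out-of-order TIDs would show up.
SELECT itemoffset, ctid, itemlen, dead, htid, tids
FROM bt_page_items('mediawiki.page_redirect_namespace_len', 14);
```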
On Tue, Oct 26, 2021 at 10:08 AM K. R. <iijima.yun@koumakan.jp> wrote:
> [...] I do not use pg_upgrade since
> different versions of PostgreSQL cannot be installed at the same time in
> FreeBSD using the ports/packages system, and I do not find dump/restore
> much of a chore (plus it solves issues with possible locale changes).

It really would be nice to address that. (FWIW, a few years ago I had a mostly-working experiment-grade patch for the ports tree that could install and run multiple versions in parallel, and the bones of a port of the Debian wrapper scripts and cluster control tools so you can list, start, stop, clone clusters etc, but I ran out of steam while contemplating the squillions of other packages that interact with PostgreSQL... happy to share the starter bits and pieces I have off-list if anyone is interested...)

But in this particular case it significantly reduces the search area for the problem, so that's a useful deficiency!
On Sun, Oct 24, 2021 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Anything is possible, but this doesn't look like logical index
> corruption. If we assume that it was something like a faulty
> deduplication pass, then we have to explain how the spurious TID ended
> up in that particular leaf page to begin with. We also have to explain
> why it is a valid TID for a totally distinct heap tuple, which is
> unlikely to occur by chance.

I discussed the robustness of some of the posting list split code privately with Heikki earlier today. Heikki suggested that I more consistently invalidate the insert state that is used across nbtinsert.c. Although I don't have any reason to believe that that makes a difference, I still agree with his conclusions. The attached patch does this.

It also makes the special posting-split-related call to _bt_binsrch_insert() at the end of _bt_findinsertloc() more robust. There is a similar defensive elog() within _bt_swap_posting(), but that only kicks in when the posting offset is totally out of bounds for the tuple. _bt_binsrch_insert() could still give wrong answers for some other reason, at least in theory, and so it seems like a good idea to be extra careful, and throw an error at the earliest opportunity.

--
Peter Geoghegan
Attachments
On Mon, Oct 25, 2021 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:
> > There have been no crashes since; there was one reload (pg_hba.conf
> > edits) and several restarts (to snapshot the file structure with the
> > corrupted index, plus another enabling WAL archiving today in the morning).
>
> Thank you for your help.

Thank you for sharing page images with me privately -- that was very helpful. I can see the same basic issue in all 3 of the pages that you said amcheck reports as corrupt (the 3 pages that you shared). As I suspected, there are posting lists that contain duplicate TIDs, but look normal in every other way. This is very subtle.

A few more questions for you, if you don't mind:

1. Can you tell me which tables the 4 known-corrupt indexes (page_redirect_namespace_len, page_len, transcode_key_idx, and page_main_title) are defined against?
2. Can you show me the definition of the table or tables, using \d from psql?
3. Do you notice any consistent pattern here? For example, are foreign keys involved?
4. How do the table or tables tend to get updated?

Thanks again

Executive summary on my progress: This now looks like it might be related to code on the heapam side, perhaps in heapam_index_build_range_scan(), which has a history of getting confused about duplicate TIDs with HOT chains (e.g., see Alvaro's commit a811ea5bde from late 2020).

Here is why I now suspect heapam_index_build_range_scan(): It disturbed me that even "heapallindexed" amcheck verification did not detect the original reported problem, where we see index scans that give wrong answers (a scan of the page_main_title index shows an extra non-matching row, per Andrew's original message). We saw that amcheck does notice a few corrupt posting list tuples in other nearby indexes, but not this bigger problem.
That eventually made me think about table_index_build_scan()/heapam_index_build_range_scan(), which is used by amcheck for heapallindexed verification -- most of the important work is outsourced to that core code. amcheck more or less performs heapallindexed verification by comparing the tuples that a REINDEX would see against the actual tuples it fingerprints from the index.

I wonder if heapallindexed verification doesn't detect corruption in the page_main_title index because "the table seems to agree with the index", even though the actual details are clearly wrong. In other words, perhaps even the table_index_build_scan() call inside amcheck's verify_nbtree.c file says that there should be duplicate TIDs in the index, even though that's always wrong, by definition. What if table_index_build_scan() is actually the problem?

If my hypothesis is true, then we might expect problems to not even go away following a REINDEX. Or maybe they would go away sometimes, but not other times. This idea is supported by remarks from Herman Verschooten on Slack [1], when discussing a similar problem case on Postgres 14 [2]. I'll quote him directly:

"""
tranman_production=# CREATE INDEX index_freights_on_cmr_received
  ON public.freights USING btree (cmr_received)
  with (deduplicate_items = off);
CREATE INDEX
tranman_production=# \set VERBOSITY verbose
tranman_production=# update freights set cmr_received=false where id=49632;
ERROR:  XX000: posting list tuple with 20 items cannot be split at offset 168
LOCATION:  _bt_swap_posting, nbtdedup.c:1037
tranman_production=# drop index index_freights_on_cmr_received;
DROP INDEX
tranman_production=# update freights set cmr_received=false where id=49632;
UPDATE 1
"""

Notice that the problem seems to be confined to one index, which has deduplication explicitly disabled. We see an error message about a posting list tuple all the same. I think that that's just a downstream consequence of duplicate TIDs appearing in the index.
It looks like the corruption exists independently of whether the index is present or not. When the index is present the problem becomes obvious, but it's still there either way.

The supposed posting list tuple with 20 items that the error message goes on about is actually just a plain tuple. The number 20 comes from its IndexTupleData.t_tid offset number. In other words, it's an offset number from the index tuple's pointed-to heap TID, which has been misinterpreted as something else by BTreeTupleGetNPosting(). BTreeTupleGetNPosting() expects to never be called against such a tuple, but it can happen once we assume duplicate TIDs are possible. (You'd see an assertion failure if asserts were enabled, but of course they're not enabled here.)

Note that this independently reported "freights" case from Herman has the same amcheck issue we see with the mediawiki indexes on this thread: amcheck won't actually complain about the clearly corrupt index_freights_on_cmr_received index, but will complain about other indexes on the same table. This time around amcheck complains about out-of-order index tuples, but I strongly suspect that that's practically the same condition as the "posting list contains misplaced TID" error messages we see here. In other words, I think that both cases just have duplicate heap TIDs, and that the variation in error messages is totally unrelated to the true underlying problem.

[1] https://postgresteam.slack.com/archives/C0FS3UTAP/p1635161173202100?thread_ts=1635154585.197300&cid=C0FS3UTAP
[2] https://postgr.es/m/8CDB73C1-E3AF-40A6-BA81-8AFE174C6402@verschooten.net

--
Peter Geoghegan
On Tue, Oct 26, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> A few more questions for you, if you don't mind:
>
> 1. Can you tell me which tables the 4 known-corrupt indexes
> (page_redirect_namespace_len, page_len, transcode_key_idx, and
> page_main_title) are defined against?
> 2. Can you show me the definition of the table or tables, using \d from psql?
> 3. Do you notice any consistent pattern here? For example, are foreign
> keys involved?
> 4. How do the table or tables tend to get updated?

Also:

5. Do you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY much?

--
Peter Geoghegan
On 26.10.2021 23:36, Peter Geoghegan wrote:
> A few more questions for you, if you don't mind:
>
> 1. Can you tell me which tables the 4 known-corrupt indexes
> (page_redirect_namespace_len, page_len, transcode_key_idx, and
> page_main_title) are defined against?

page and transcode.

> 2. Can you show me the definition of the table or tables, using \d from psql?

azurlane_wiki=> \d mediawiki.page
                                            Table "mediawiki.page"
       Column       |           Type           | Collation | Nullable |                     Default
--------------------+--------------------------+-----------+----------+-------------------------------------------------
 page_id            | integer                  |           | not null | nextval('mediawiki.page_page_id_seq'::regclass)
 page_namespace     | integer                  |           | not null |
 page_title         | text                     |           | not null |
 page_restrictions  | text                     |           |          |
 page_is_redirect   | smallint                 |           | not null | 0
 page_is_new        | smallint                 |           | not null | 0
 page_random        | double precision         |           | not null | random()
 page_touched       | timestamp with time zone |           | not null |
 page_links_updated | timestamp with time zone |           |          |
 page_latest        | integer                  |           | not null |
 page_len           | integer                  |           | not null |
 page_content_model | text                     |           |          |
 page_lang          | text                     |           |          |
 titlevector        | tsvector                 |           |          |
Indexes:
    "page_pkey" PRIMARY KEY, btree (page_id)
    "name_title" UNIQUE, btree (page_namespace, page_title)
    "page_len" btree (page_len)
    "page_main_title" btree (page_title text_pattern_ops) WHERE page_namespace = 0
    "page_mediawiki_title" btree (page_title text_pattern_ops) WHERE page_namespace = 8
    "page_project_title" btree (page_title text_pattern_ops) WHERE page_namespace = 4
    "page_random" btree (page_random)
    "page_redirect_namespace_len" btree (page_is_redirect, page_namespace, page_len)
    "page_talk_title" btree (page_title text_pattern_ops) WHERE page_namespace = 1
    "page_user_title" btree (page_title text_pattern_ops) WHERE page_namespace = 2
    "page_utalk_title" btree (page_title text_pattern_ops) WHERE page_namespace = 3
    "ts2_page_title" gist (titlevector)
Referenced by:
    TABLE "mediawiki.cu_changes" CONSTRAINT "cu_changes_cuc_page_id_fkey" FOREIGN KEY (cuc_page_id) REFERENCES mediawiki.page(page_id) ON DELETE SET NULL
    TABLE "mediawiki.revision" CONSTRAINT "revision_rev_page_fkey" FOREIGN KEY (rev_page) REFERENCES mediawiki.page(page_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED
Triggers:
    page_deleted AFTER DELETE ON mediawiki.page FOR EACH ROW EXECUTE FUNCTION mediawiki.page_deleted()
    ts2_page_title BEFORE INSERT OR UPDATE ON mediawiki.page FOR EACH ROW EXECUTE FUNCTION mediawiki.ts2_page_title()

azurlane_wiki=> \d mediawiki.transcode
                                              Table "mediawiki.transcode"
          Column          |           Type           | Collation | Nullable |                          Default
--------------------------+--------------------------+-----------+----------+-----------------------------------------------------------
 transcode_id             | integer                  |           | not null | nextval('mediawiki.transcode_transcode_id_seq'::regclass)
 transcode_image_name     | character varying(255)   |           | not null |
 transcode_key            | character varying(48)    |           | not null |
 transcode_error          | text                     |           | not null |
 transcode_time_addjob    | timestamp with time zone |           |          |
 transcode_time_startwork | timestamp with time zone |           |          |
 transcode_time_success   | timestamp with time zone |           |          |
 transcode_time_error     | timestamp with time zone |           |          |
 transcode_final_bitrate  | integer                  |           | not null |
Indexes:
    "transcode_key_idx" btree (transcode_key)
    "transcode_name_key" UNIQUE, btree (transcode_image_name, transcode_key)
    "transcode_time_inx" btree (transcode_time_addjob, transcode_time_startwork, transcode_time_success, transcode_time_error)

> 3. Do you notice any consistent pattern here? For example, are foreign
> keys involved?
It does not look like it:

[xiatian@freebsd ~]$ pg_dump -U azurlane_wiki -d azurlane_wiki -s | grep -i FOREIGN
    ADD CONSTRAINT account_credentials_acd_user_fkey FOREIGN KEY (acd_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT account_requests_acr_user_fkey FOREIGN KEY (acr_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_changes_cuc_page_id_fkey FOREIGN KEY (cuc_page_id) REFERENCES mediawiki.page(page_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_changes_cuc_user_fkey FOREIGN KEY (cuc_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_log_cul_target_id_fkey FOREIGN KEY (cul_target_id) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT cu_log_cul_user_fkey FOREIGN KEY (cul_user) REFERENCES mediawiki.mwuser(user_id) ON DELETE SET NULL;
    ADD CONSTRAINT revision_rev_page_fkey FOREIGN KEY (rev_page) REFERENCES mediawiki.page(page_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED;
    ADD CONSTRAINT user_newtalk_user_id_fkey FOREIGN KEY (user_id) REFERENCES mediawiki.mwuser(user_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED;

> 4. How do the table or tables tend to get updated?

Regarding page: New rows are added rarely, usually once a week or two, in small batches (separate transactions). The page_title field is almost never updated (pages are moved extremely rarely). Updates to other fields, like page_len, happen multiple times daily, during page edits, and sometimes happen in large batches but as separate transactions.

Regarding transcode: I assume inserts happen with each file upload, often in large batches (separate transactions). Updates happen whenever a transcode job runs, which is usually also in batches. I can check the wiki's recent changes and upload log for more precise data.

> Also:
> 5. Do you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY much?

I assume not; `find . -name "*.php" | xargs grep -i "CONCURRENTLY"` against MediaWiki includes/ gives no SQL results, source code comments only.

> If my hypothesis is true, then we might expect problems to not even go
> away following a REINDEX. Or maybe they would go away sometimes, but
> not other times.

REINDEX *seems* to have helped in my case (tested with both page_main_title and the three that showed up in pg_amcheck output). I am considering writing a script that would SELECT COUNT(*) for each page_title to detect the next possible page_main_title index corruption.

-- K. R.
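The detection script K. R. describes could be sketched as follows (a hypothetical query, not code from the thread; on an uncorrupted database it returns zero rows). It fetches the qualifying rows twice, once via page_main_title and once via a forced heap scan, and reports any rows the two passes disagree on:

```sql
-- Sketch of a corruption detector for the partial index page_main_title.
BEGIN;
SET LOCAL enable_seqscan = off;          -- encourage a scan of the index
CREATE TEMP TABLE via_index AS
SELECT ctid, page_title
FROM mediawiki.page
WHERE page_namespace = 0;

SET LOCAL enable_seqscan = on;           -- now force the heap pass
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;

SELECT v.ctid, v.page_title              -- rows only the index returned
FROM via_index v
EXCEPT
SELECT ctid, page_title
FROM mediawiki.page
WHERE page_namespace = 0;
ROLLBACK;
```

Comparing on ctid relies on the tid comparison support added in Postgres 14; the SET LOCAL settings only nudge the planner, so EXPLAIN should be used to confirm each pass really took the intended path.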
On Wed, Oct 27, 2021 at 12:00 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > page and transcode. Thanks! One more question: Did you use pg_restore with -j (jobs) to make the restore faster? Just one more request: Can I have some more pages from the same indexes? You have already provided me with "mediawiki.transcode_key_idx.block1.page". But could you now also provide me with blocks 2, 8, and 16 from the same index? This is a pretty random choice. I just want to see if they all contain corrupt posting list tuples, like page 1 (I suspect that they all will). Note that I pushed some additional hardening for posting list splits today. This will be in Postgres 14.1. The new code won't fix the problem, but it will detect problems earlier, limiting the damage. > > If my hypothesis is true, then we might expect problems to not even go > > away following a REINDEX. Or maybe they would go away sometimes, but > > not other times. > > REINDEX *seems* to have helped in my case (tested with both > page_main_title and the three that showed up in pg_amcheck output). I am > considering writing a script that would SELECT COUNT(*) for each > page_title to detect the next possible page_main_title index corruption. I took another look at your 3 page images today. I noticed some new things. Out of your 3 pages, I saw that 2 pages were the leftmost leaf page in the indexes (mediawiki.transcode_key_idx.block1.page and mediawiki.page_len.block1.page), and 1 was a very low block number (block 14 from mediawiki.page_redirect_namespace_len.block14.page). Also, the amount of posting list corruption on each page was higher than I first realized. For example, almost all of the tuples on the page mediawiki.page_len.block1.page had duplicate TIDs (corruption). We know that the underlying table (for 2 out of the 3 indexes) is at least ~720 blocks, so these are not tiny indexes. And yet amcheck typically complains on the first leaf page it looks at, or not at all. 
So the particular indexes I've taken a look at are probably completely full of corrupt posting lists. And so it seems as if most indexes have no detectable corruption at all, but at least a few are almost totally corrupt. It seems to be either one extreme or the other.

Another thing that I noticed today is that all 3 of the pages (from 3 indexes) looked like they have probably never had a regular deduplication pass, even once. It's also unlikely that they were ever split. I can tell because the limit on posting list size is different during CREATE INDEX. The pages looked like there were a few posting list splits, which is normal when you have some non-HOT updates, and probably not relevant. These pages look like CREATE INDEX was run recently, despite all the corruption. Plus Herman (CC'd) has said that sometimes CREATE INDEX doesn't correct his own similar problem on Postgres 14. And so this still very much looks like a bug in CREATE INDEX, somewhere. This suspicion started with the fact that corruption appeared concentrated in only 2 (really just 1) tables.

I'll take another guess: I wonder if commit 56788d21 ("Allocate consecutive blocks during parallel seqscans") is somehow causing parallel CREATE INDEX to produce wrong results. The author (David Rowley) is CC'd. Does a bug in that commit seem like it might explain this problem, David? Might parallel workers in a parallel index build somehow become confused about which worker is supposed to scan which heap block, leading to duplicate TIDs in the final index? I notice that the faulty duplicate TIDs tend to be high-ish heap block offset numbers, and often appear in groups of 3 or even 4 duplicates. Does that seem like it might be consistent with my (admittedly not strong) theory about commit 56788d21 being behind all this?

-- Peter Geoghegan
On Wed, Oct 27, 2021 at 6:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Just one more request: Can I have some more pages from the same > indexes? You have already provided me with > "mediawiki.transcode_key_idx.block1.page". But could you now also > provide me with blocks 2, 8, and 16 from the same index? This is a > pretty random choice. I just want to see if they all contain corrupt > posting list tuples, like page 1 (I suspect that they all will). Also, page 0 of the table "page" itself would be useful. The same pageinspect recipe will work with a table. Many thanks -- Peter Geoghegan
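The exact pageinspect recipe used earlier in the thread is not quoted here; one plausible form of it (an assumption on my part; requires superuser and the pageinspect extension) is:

```sql
-- Dump one raw 8kB block as base64, suitable for pasting into a mail.
-- get_raw_page() works the same way for heap and index relations.
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT encode(get_raw_page('mediawiki.page', 0), 'base64');
```

Substituting the index name and block number (e.g. 'mediawiki.transcode_key_idx', 2) yields the index pages requested above.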
On 28.10.2021 4:36, Peter Geoghegan wrote: > On Wed, Oct 27, 2021 at 12:00 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > One more question: Did you use pg_restore with -j (jobs) to make the > restore faster? I am afraid I have not used pg_restore (as far as I know), unless psql replicates its behaviour and/or calls it. My typical procedure is `pg_dump[all] -U … -d … > file.sql`, followed by `psql -U … -d … < file.sql`. It did not seem like psql used any kind of parallel query execution, judging by the console output. > Just one more request: Can I have some more pages from the same > indexes? You have already provided me with > "mediawiki.transcode_key_idx.block1.page". But could you now also > provide me with blocks 2, 8, and 16 from the same index? This is a > pretty random choice. I just want to see if they all contain corrupt > posting list tuples, like page 1 (I suspect that they all will). Despatched in a separate e-mail along with page 0 from "mediawiki.page" requested in your follow-up message. > Note that I pushed some additional hardening for posting list splits > today. This will be in Postgres 14.1. The new code won't fix the > problem, but it will detect problems earlier, limiting the damage. Would you recommend applying the patch to my ‘production’ instance, or should I keep it as is to see if that undetectable corruption case (leading to weird SELECT etc. results) shows up again? > These pages look like CREATE INDEX was run > recently, despite all the corruption. Plus Herman (CC'd) has said that > sometimes CREATE INDEX doesn't correct his own similar problem on > Postgres 14. CREATE INDEX was run during the import of the 13.3-generated dump; there have been no schema changes since. The transcode table gets updated every time an audio file is uploaded, which has happened quite a few times since the upgrade. -- K. R.
On 28.10.2021 7:21, Peter Geoghegan wrote: > Also, page 0 of the table "page" itself would be useful. The same > pageinspect recipe will work with a table. I forgot to mention: I have the data directory from the PostgreSQL 13.3 instance that ran inside the same jail on the same FreeBSD system since May, so if that might be useful for the investigation, I can pull it into another VM. -- K. R.
On Thu, 28 Oct 2021 at 14:36, Peter Geoghegan <pg@bowt.ie> wrote:
> I'll take another guess: I wonder if commit 56788d21 ("Allocate
> consecutive blocks during parallel seqscans") is somehow causing
> parallel CREATE INDEX to produce wrong results. The author (David
> Rowley) is CC'd. Does a bug in that commit seem like it might explain
> this problem, David? Might parallel workers in a parallel index build
> somehow become confused about which worker is supposed to scan which
> heap block, leading to duplicate TIDs in the final index?

I stared at that code for a while and didn't really see how it could be broken, unless the atomics implementation on that machine were broken. Thomas and I had a look at that earlier and on his FreeBSD machine, it uses the arch-x64.h implementation of pg_atomic_fetch_add_u64_impl(). I also noted that pg_atomic_fetch_add_u64() is not really getting much use; regress.c is the only other location. However, I really doubt that this is the issue here.

Just to see what it would look like if this was broken, I went and mocked up such a bug by adding the following code just above "return page;" at the end of table_block_parallelscan_nextpage():

    if (page == 1000)
        page = 999;

I then did:

    create table b (b int not null, t text not null);
    insert into b select x,x::text from generate_series(1,200000)x;
    set max_parallel_workers_per_gather=0;
    select sum(b), sum(length(t)) from b;
    set max_parallel_workers_per_gather=2;
    select sum(b), sum(length(t)) from b;

I noted that I get different results between the parallel and non-parallel query due to page 999 being read twice.

I then created the following index:

    set max_parallel_maintenance_workers = 2;
    create index on b(t);

If I run a query to perform an index scan to find any value of "t" that's on page 999, then I get:

    postgres=# select ctid,* from b where t = '185050';
       ctid   |   b    |   t
    ----------+--------+--------
     (999,54) | 185050 | 185050
     (999,54) | 185050 | 185050
    (2 rows)

We just get the same tid twice in the index. That's quite different from another value of "t" getting into the list of tids for '185050'.

David
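Whether amcheck catches the duplicate produced by this mock-up depends on the limitations Peter describes later in the thread (duplicates are only reliably reported when key values and heap TID coincide, as they do here). One way to try it against David's reproducer (a sketch; `b_t_idx` is the default name Postgres gives the index created above):

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- heapallindexed => true additionally re-finds every heap tuple in the
-- index, the most thorough check available short of a manual diff.
SELECT bt_index_parent_check('b_t_idx', heapallindexed => true);
```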
On Thu, Oct 28, 2021 at 12:34 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > > Just one more request: Can I have some more pages from the same > > indexes? You have already provided me with > > "mediawiki.transcode_key_idx.block1.page". But could you now also > > provide me with blocks 2, 8, and 16 from the same index? This is a > > pretty random choice. I just want to see if they all contain corrupt > > posting list tuples, like page 1 (I suspect that they all will). > > Despatched in a separate e-mail along with page 0 from "mediawiki.page" > requested in your follow-up message. Thanks! > > Note that I pushed some additional hardening for posting list splits > > today. This will be in Postgres 14.1. The new code won't fix the > > problem, but it will detect problems earlier, limiting the damage. > > Would you recommend applying the patch to my ‘production’ instance or > should keep it as is to see if that undetectable corruption case > (leading to weird SELECT etc. results) shows up again? Yes, that's definitely a good idea, if you have a convenient way to build the REL_14_STABLE branch from source easily (or you can just wait for 14.1). The new defensive checks will at least give you more information about any problems that happen in the B-Tree code. It will also limit possible harms. It would be great from my point of view if you had the extra checks today, because it might lead to better information about what's really going on here. > CREATE INDEX was run during the import of the 13.3-generated dump; there > were no schema changes since. The transcode table gets updated every > time an audio file is uploaded, which happened quite a few times since > the upgrade. Got it. Thanks. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 1:04 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> I forgot to mention: I have the data directory from the PostgreSQL 13.3
> instance that ran inside the same jail on the same FreeBSD system since
> May, so if that might be useful for the investigation, I can pull it
> into another VM.

I know I keep saying it's the last thing, but...I need more pages! Can you please provide me with additional pages from the heap relation "page", as follows: 5, 7, 12, 14, 16, 21, 22, 26

Your heap page (page 0 of the "page" heap relation) was very interesting. I saw that the TID that is incorrectly repeated in the index "page_len" is marked LP_UNUSED in the heap page. It's one of only two such TIDs/line pointers on your heap/table page, out of 61 total, so I doubt that this is a coincidence. That's why I'm asking for more heap pages -- I would like to verify that this is a consistent thing, and not just a remarkable coincidence.

Now this is looking like a problem in VACUUM (pruning?), not a CREATE INDEX thing. It looks like somehow an item that should be LP_DEAD ends up being LP_UNUSED during pruning. I have CC'd Andres, to get his thoughts on this. This explanation fits quite well, because:

1. Even with every verification option enabled, amcheck only ever complains about duplicate TIDs (the exact details vary in not-very-interesting ways). I would expect to only see this if my new hypothesis was correct, because in practice, with many workloads, unused line pointers tend to get reused by later non-HOT updates that create new versions without changing the value of indexed columns. Because amcheck doesn't exhaustively verify that the index has no duplicate TIDs, you're only going to see complaints about duplicates that happen to be duplicated on all key values, up to and including the heap TID itself.

2. If this theory was correct, then we'd occasionally notice TIDs that are recycled by totally unrelated logical rows (or maybe the key values do change).
This is what we see in the original problem report, where the index seems to think that 'Isokaze' and 'Ying_Swei' are the same value. The TID originally *did* point to some version that had the correct value, but since VACUUM never removed the TID from the index (because it was never made LP_DEAD in the heap by pruning when it should have been), the "pointed-to value" changed without the index having the obsolete TID removed. I am suggesting that 1 and 2 are the same thing, really. They look a little different, but that's only due to limitations in our tooling. In addition to the "amcheck cannot reliably detect duplicate TIDs in an index" issue I mentioned, there are also heapallindexed limitations to consider. We should not expect the heapallindexed stuff to detect the case where an index has TIDs that currently point to LP_UNUSED items in the heap -- heapallindexed verification can only detect the absence of a needed index tuple, not the presence of a faulty/corrupt index tuple. That's probably why heapallindexed doesn't detect one single problem here, even though the original complaint describes exactly the kind of scenario you might expect heapallindexed to catch. 3. Multiple indexes on the same table "independently" have very similar corruption. I noticed that the heap TID (21,41) is duplicated in posting lists in both the page_redirect_namespace_len index (on block 14, posting list tuple at page offset 13), as well as the page_len index (on block 1, posting list tuple at page offset 15). If this was just a coincidence (if the problem was just in nbtree) then it would be a remarkable coincidence. -- Peter Geoghegan
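The LP_UNUSED line pointers described above can be listed with pageinspect (a sketch; in `heap_page_items` output, lp_flags 0 is LP_UNUSED, 1 LP_NORMAL, 2 LP_REDIRECT, 3 LP_DEAD; tuple-header columns come back NULL for unused slots):

```sql
-- Which line pointers on block 0 of the table are LP_UNUSED?
SELECT lp, lp_flags, lp_off, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('mediawiki.page', 0))
WHERE lp_flags = 0;
```

Any TID of the form (0, lp) found in an index but listed by this query is exactly the inconsistency being discussed: an index entry pointing to an unused heap slot.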
On Thu, Oct 28, 2021 at 2:50 AM David Rowley <dgrowleyml@gmail.com> wrote: > I stared at that code for a while and didn't really see how it could > be broken, unless the atomics implementation on that machine were > broken. Thomas and I had a look at that earlier and on his FreeBSD > machine, it uses the arch-x64.h implementation of > pg_atomic_fetch_add_u64_impl(). Thank you for going through that exercise. I now strongly doubt that it's CREATE INDEX at all. My suspicion is now falling on the snapshot scalability work, and some interaction with VACUUM. Probably only with shared row locks and concurrency. More on this later. > We just get the same tid twice in the index. That's quite different > from another value of "t" getting into the list of tids for '185050'. FWIW I thought that it might have been possible for the index to become even more corrupt, due to a lack of defensive measures inside nbtree. But I now see that that can't have been it. Apologies for the noise. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 10:16 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Now this is looking like a problem in VACUUM (pruning?), not a CREATE
> INDEX thing. It looks like somehow an item that should be LP_DEAD ends
> up being LP_UNUSED during pruning. I have CC'd Andres, to get his
> thoughts on this.

Further analysis based on lots more heap pages provided by Kamigishi Rei (thanks a lot!) strongly supports this theory. The problem looks like an interaction between the snapshot scalability work, VACUUM, and shared row locks. I cannot tell for sure if the table in question has actually allocated MultiXacts in its lifetime, but it has certainly had quite a few shared locks. Note that the other people that had similar complaints about Postgres 14 all used foreign keys on affected tables.

I'm attaching my personal notes on this. They have a little commentary, but are mostly useful because they outline the exact ways in which the data is corrupt, which is pretty tedious to put together manually. There are some very clear patterns here:

* Most of the heap pages I've looked at have rows that were never updated or locked. There are usually 2 or 3 such tuples on each heap page, at least among those known to be corrupt -- 1 or 2 of them usually tie back to corruption in the index.

* Most individual duplicated-in-index heap TIDs point to heap tuples that are HEAP_XMAX_KEYSHR_LOCK|HEAP_XMAX_LOCK_ONLY. These heap tuples have the same xmin and xmax.

* The transaction ID 365637 is very over-represented, appearing in several corrupt heap tuple headers, located across several heap pages.

* Its "neighbor" transaction ID is 365638, which appears once more. To me this suggests some kind of confusion with an OldestXmin style cutoff during VACUUM.

* As suspected, there are a smaller number of TIDs in the index that point to LP_UNUSED items in the heap -- a distinct form of corruption to the more noticeable duplicate TIDs (superficially distinct, at least). These aren't usually duplicated in the index, though they can be.

This all but confirms that the original complaint was in fact just a result of a TID/item pointer being recycled in the heap "too early". It also explains why amcheck's heapallindexed option didn't ever complain about any index in the whole database (only the index structure itself looked corrupt).

-- Peter Geoghegan
Attachments
Hi, On 2021-10-28 10:16:35 -0700, Peter Geoghegan wrote: > Your heap page (page 0 of the "page" heap relation) was very > interesting. I saw that the TID that is incorrectly repeated in the > index "page_len" is marked LP_UNUSED in the heap page. It's one of > only two such TIDs/line pointers on your heap/table page, out of 61 > total, so I doubt that this is a coincidence. That's why I'm asking > for more heap pages -- I would like to verify that this is a > consistent thing, and not just a remarkable coincidence. This makes me wonder if the issue could be that we're losing writes / that something is reading old page versions (e.g. due to a filesystem bug). If both heap and index are vacuumed, but the index write is lost, this'd be what we see, right? Another way this could happen is if we got the wrong relation size for either index or table, and a vacuum scan doesn't scan the whole table or index. I've not yet read the whole thread, but if not done, it seems like a good idea to use pg_waldump and grep for changes to the relevant heap / index pages. That might give us more information about what could have happened. > Now this is looking like a problem in VACUUM (pruning?), not a CREATE > INDEX thing. It looks like somehow an item that should be LP_DEAD ends > up being LP_UNUSED during pruning. I have CC'd Andres, to get his > thoughts on this. Hm. There were a fair bit of changes around the separation between heap and index vacuuming in 14. I wonder if there's potentially something broken around repeatedly vacuuming the heap without doing index vacuums or such. It's also possible that there's something wrong in that darned path that handles recently-dead tuples. Greetings, Andres Freund
Hi, On 2021-10-28 14:18:32 -0700, Peter Geoghegan wrote: > Note that the other people that had similar complaints about Postgres 14 all > used foreign keys on affected tables. Do you have links to the other reports? > * The transaction ID 365637 is very over-represented, appearing in > several corrupt heap tuple headers, located across several heap pages. > > * Its "neighbor" transaction ID is 365638, which appears once more. To > me this suggests some kind of confusion with an OldestXmin style > cutoff during VACUUM. I'm not quite following this bit, could you expand on that? Greetings, Andres Freund
On Fri, Oct 29, 2021 at 10:31 AM Andres Freund <andres@anarazel.de> wrote: > On 2021-10-28 10:16:35 -0700, Peter Geoghegan wrote: > > Your heap page (page 0 of the "page" heap relation) was very > > interesting. I saw that the TID that is incorrectly repeated in the > > index "page_len" is marked LP_UNUSED in the heap page. It's one of > > only two such TIDs/line pointers on your heap/table page, out of 61 > > total, so I doubt that this is a coincidence. That's why I'm asking > > for more heap pages -- I would like to verify that this is a > > consistent thing, and not just a remarkable coincidence. > > This makes me wonder if the issue could be that we're loosing writes / that > something is reading old page versions (e.g. due to filesystem bug). If both > heap and index are vacuumed, but the index write is lost, this'd be what we > see, right? Relevant to this hypothesis, it's interesting that FreeBSD 13.0 is the first release that switched to OpenZFS, a big rebase to a common ZFS implementation shared with eg Ubuntu (whereas previously multiple groups targeting different OSes maintained different forks deriving from OpenSolaris code). I see there is another suspected related report[1], and it's on Ubuntu, but filesystem is not mentioned. [1] https://www.postgresql.org/message-id/flat/8CDB73C1-E3AF-40A6-BA81-8AFE174C6402%40verschooten.net
One hypothesis, just out of thin air:
If a heap page's visibility map bit is wrongly set to all_visible, an index-only scan can give duplicates (in effect the new and old versions of the same row), with no apparent corruption of either the index or the heap relation. It simply skips checking the visibility of tuples in the heap. Not sure it's the case now, but I've seen this previously.
-- It could be diagnosed by deleting the VM file for the relation: the error (if it is this error) will disappear.
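A less destructive way to test this hypothesis than deleting the VM file is the pg_visibility extension (a sketch; requires superuser):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- TIDs that the visibility map claims are all-visible but are not:
SELECT * FROM pg_check_visible('mediawiki.page');

-- Raw VM bits for a single block, e.g. block 181 from the original report:
SELECT * FROM pg_visibility_map('mediawiki.page', 181);
```

If pg_check_visible returns no rows, the VM is consistent with the heap and this hypothesis can be ruled out without touching any files.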
Hi, On 2021-10-25 09:43:18 +0100, Andrew Gierth wrote: > >>>>> "Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Alvaro> Do you happen to have WAL traffic for this table and index? > > Unfortunately there's no WAL archiving, and no available WAL files from > before the state was known to be corrupt. We'll look into enabling > archiving in case the problem recurs or can be reproduced. Any chance autovacuum logging was enabled and that those logs still exist? Not quite as good as still having the WAL, but it still might allow us to reconstruct the index / vacuum scans, and the horizons used. Greetings, Andres Freund
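For any future recurrence, autovacuum logging can be enabled like so (a sketch; a value of 0 logs every autovacuum run, including the horizons and index-scan counts Andres is asking about):

```sql
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();
```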
On 29.10.2021 0:42, Andres Freund wrote: >> Unfortunately there's no WAL archiving, and no available WAL files from >> before the state was known to be corrupt. We'll look into enabling >> archiving in case the problem recurs or can be reproduced. > > Any chance autovacuum logging was enabled and that those logs still exist? Not > quite as good as still having the WAL, but it still might allow us to > reconstruct the index / vacuum scans, and the horizons used. The issue manifested again earlier today *after* a REINDEX followed by enabling WAL replica logging on the 24th of October. I saved a snapshot of the filesystem holding the data directory. Would that be useful for further analysis? (The daily vacuum runs have not been disabled so there have been at least 3 since.) -- K. R.
Hi, On 2021-10-29 00:46:29 +0300, Kamigishi Rei wrote: > The issue manifested again earlier today *after* a REINDEX followed by > enabling WAL replica logging on the 24th of October. I saved a snapshot of > the filesystem holding the data directory. Would that be useful for further > analysis? Yes, that's *quite* useful. I assume you can't just share that snapshot? Once we identified an affected heap and index page with the corruption, we should use pg_waldump to scan for all changes to that table. Do you have the log file(s) from between the 24th and now? That might give us a good starting point for the LSN range to scan. Greetings, Andres Freund
On Thu, Oct 28, 2021 at 2:40 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Relevant to this hypothesis, it's interesting that FreeBSD 13.0 is the > first release that switched to OpenZFS, a big rebase to a common ZFS > implementation shared with eg Ubuntu (whereas previously multiple > groups targeting different OSes maintained different forks deriving > from OpenSolaris code). I see there is another suspected related > report[1], and it's on Ubuntu, but filesystem is not mentioned. Anything is possible. But Kamigishi Rei has said that this database has never had a hard crash or unclean shutdown, which I definitely believe. Also, they are using ECC on a Xeon processor. This is the kind of hardware that is generally assumed to be very reliable. Kamigishi Rei has set an exemplary example of how to report a bug to an open source community. I want to thank him again. Thanks! A second similar complaint from Herman Verschooten on Slack didn't mention ZFS at all. A third similar-seeming report on Slack was from somebody named Brandon Ros, who used Ubuntu (I believe 20.04, like Herman Verschooten). Also no indication that ZFS was used. I find it slightly hard to believe that it's ZFS, simply because all 3 complaints involve Postgres 14, and have a lot of common factors. For example, Herman also used foreign keys -- a lot of users never bother with them. And like Kamigishi Rei, Herman found that a REINDEX (or was it VACUUM FULL?) seemingly made the problem go away. Brandon Ros gave less information (never talked to him directly), but did say that he saw extensive corruption, which this looks just like. Most of the indexes are not corrupt at all, but a small number (usually all on the same one or two tables) are very corrupt indeed -- even though the usual issues with collation could be ruled out. I don't want to overstate my confidence here (I could easily be wrong), but this all seems to add up to a pattern. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 3:23 PM Peter Geoghegan <pg@bowt.ie> wrote: > I find it slightly hard to believe that it's ZFS, simply because all 3 > complaints involve Postgres 14. And have a lot of common factors. For > example, Herman also used foreign keys -- a lot of users never bother > with them. And like Kamigishi Rei, Herman found that a REINDEX (or was > it VACUUM FULL?) seemingly made the problem go away. I mean it would seem to go away, only to return some time later. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 2:36 PM Andres Freund <andres@anarazel.de> wrote: > On 2021-10-28 14:18:32 -0700, Peter Geoghegan wrote: > > Note that the other people that had similar complaints about Postgres 14 all > > used foreign keys on affected tables. > > Do you have links to the other reports? You've already found the other one, so no need to tell you where that is. A third from Brandon Ros was only ever on Slack, but that didn't add new information IMV. I haven't talked to Brandon myself (just the other 2). BTW, it's possible that Brandon wasn't using foreign keys or shared locks at all -- I didn't verify that part. I can only say for sure that that was the case with the other 2 reports. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 2:31 PM Andres Freund <andres@anarazel.de> wrote: > This makes me wonder if the issue could be that we're loosing writes / that > something is reading old page versions (e.g. due to filesystem bug). If both > heap and index are vacuumed, but the index write is lost, this'd be what we > see, right? Right, but that just doesn't seem to fit. That was the first question I asked. > Another way this could happen is if we got the wrong relation size for either > index or table, and a vacuum scan doesn't scan the whole table or index. I doubt that, since the heap blocks involved include heap block 0. On the table/indexes actually affected by this, the indexes are riddled with corruption. But every other table seems fine (at least as far as anybody knows). > I've not yet read the whole thread, but if not done, it seems like a good idea > to use pg_waldump and grep for changes to the relevant heap / index > pages. That might give us more information about what could have happened. I think that there is a fairly high likelihood that that alone will be enough to diagnose the bug. > There were a fair bit of changes around the separation between heap and index > vacuuming in 14. I wonder if there's potentially something broken around > repeatedly vacuuming the heap without doing index vacuums or such. I did ask myself that question earlier today, but quickly rejected the idea. There is very little mechanism involved with that stuff. It's very hard to imagine what could break. The code for this in lazy_vacuum() is quite simple. > It's also possible that there's something wrong in that darned path that > handles recently-dead tuples. That sounds much more likely to me. -- Peter Geoghegan
Hi, On 2021-10-28 15:23:38 -0700, Peter Geoghegan wrote: > Anything is possible. But Kamigishi Rei has said that this database > has never had a hard crash or unclean shut down, which I definitely > believe. Also, they are using ECC on a Xeon processor. This is the > kind of hardware that is generally assumed to be very reliable. That wouldn't protect against e.g. a logic bug in ZFS. Given its copy-on-write nature corruption could very well manifest as seeing an older version of the data when re-reading data from disk. Which could very well lead to the type of corruption we're seeing here. A few years back I tried to help somebody investigate corruption that turned out to be caused by something roughly along those lines (IIRC several bugs in ZFS on linux, although I don't remember the details anymore). Not saying that that is the most likely explanation, just something worth checking. > Kamigishi Rei has been an exemplary example of how to report a bug to > an open source community. I want to thank him again. Thanks! +1 > A second similar complaint from Herman Verschooten on Slack didn't > mention ZFS at all. A third similar-seeming report on Slack was from > somebody named Brandon Ros, who used Ubuntu (I believe 20.04, like > Herman Verschooten). Also no indication that ZFS was used. > > I find it slightly hard to believe that it's ZFS, simply because all 3 > complaints involve Postgres 14. And have a lot of common factors. For > example, Herman also used foreign keys -- a lot of users never bother > with them. And like Kamigishi Rei, Herman found that a REINDEX (or was > it VACUUM FULL?) seemingly made the problem go away. Didn't 14 change the logic when index vacuums are done? That could cause previously existing issues to manifest with a higher likelihood. Greetings, Andres Freund
On Thu, Oct 28, 2021 at 3:48 PM Andres Freund <andres@anarazel.de> wrote: > That wouldn't protect against e.g. a logic bug in ZFS. > Not saying that that is the most likely explanation, just something worth > checking. True. It's too early to rule that out. Though note that a full pg_amcheck of the database mostly didn't complain about anything -- it was just a handful of indexes, associated with just 2 tables. And this is mediawiki, which has lots of tables. None of the new heapam verification functionality found any problems (nor did the older index-matches-table heapallindexed checks). > Didn't 14 change the logic when index vacuums are done? That could cause > previously existing issues to manifest with a higher likelihood. I don't follow. The new logic that skips index vacuuming kicks in 1) in an anti-wraparound vacuum emergency, and 2) when there are very few LP_DEAD line pointers in the heap. We can rule 1 out, I think, because the XIDs we see are in the low millions, and our starting point was a database that was upgraded via a dump and reload. The second criterion for skipping index vacuuming (the "less than 2% of heap pages have any LP_DEAD items" thing) might well have been hit on these tables -- it is after all very common. But I don't see how that could matter. We're never going to get to a code path inside vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array to LP_UNUSED (how could we reach such a code path without also vacuuming indexes, given the way things are set up inside lazy_vacuum()?). We're always going to have the opportunity to do index vacuuming with any left-behind LP_DEAD line pointers in the next VACUUM -- right after the later VACUUM successfully returns from lazy_vacuum_all_indexes(). -- Peter Geoghegan
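The ordering Peter describes in lazy_vacuum() can be sketched as a toy model. This is illustrative Python, not PostgreSQL source code: the names and data structures are invented for the sketch, but it captures the invariant that LP_DEAD line pointers only become LP_UNUSED after the indexes that reference them have been vacuumed, so bypassing index vacuuming merely defers reclamation to a later VACUUM.

```python
# Toy model (not PostgreSQL source) of lazy_vacuum()'s ordering:
# index vacuuming must happen before LP_DEAD -> LP_UNUSED in the heap.

LP_NORMAL, LP_DEAD, LP_UNUSED = "normal", "dead", "unused"

def lazy_vacuum(heap, index, bypass_index_vacuum):
    """One simplified VACUUM pass. heap: list of line-pointer states,
    indexed by offset. index: set of heap offsets the index points to."""
    dead = [off for off, lp in enumerate(heap) if lp == LP_DEAD]
    if bypass_index_vacuum:
        # Skipping index vacuuming leaves LP_DEAD items untouched; a
        # later VACUUM can still reclaim them after an index vacuum pass.
        return heap, index
    index = {off for off in index if off not in dead}  # index vacuum first
    for off in dead:                                   # only then heap pass 2
        heap[off] = LP_UNUSED
    return heap, index

heap = [LP_NORMAL, LP_DEAD, LP_NORMAL, LP_DEAD]
index = {0, 1, 2, 3}
# Bypassed: nothing reclaimed, nothing lost.
heap, index = lazy_vacuum(heap, index, bypass_index_vacuum=True)
assert heap.count(LP_DEAD) == 2 and index == {0, 1, 2, 3}
# Next VACUUM does index vacuuming, then reclaims the line pointers.
heap, index = lazy_vacuum(heap, index, bypass_index_vacuum=False)
assert heap.count(LP_DEAD) == 0 and index == {0, 2}
```

The point of the sketch is that there is no path in which a line pointer becomes LP_UNUSED while an index still references it, regardless of how many VACUUMs bypass index vacuuming first.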
On Thu, Oct 28, 2021 at 2:36 PM Andres Freund <andres@anarazel.de> wrote: > > * The transaction ID 365637 is very over-represented, appearing in > > several corrupt heap tuple headers, located across several heap pages. > > > > * Its "neighbor" transaction ID is 365638, which appears once more. To > > me this suggests some kind of confusion with an OldestXmin style > > cutoff during VACUUM. > > I'm not quite following this bit, could you expand on that? I think it's odd that we use both a GlobalVisState and an OldestXmin for VACUUM's first pass over the heap as of Postgres 14 (before the snapshot scalability stuff, we just had OldestXmin). I have in the past wondered if that might cause problems. For example, how do we know that GlobalVisState won't ever slightly disagree with OldestXmin? For example, about which tuples are dead, rather than recently dead? This scenario reminds me of the tupgone mess that used to exist inside the code that's called lazy_scan_prune() in Postgres 14. Actually, I probably thought of it when working on removing tupgone (which happened in commit 8523492d4e). I am of course just guessing. Perhaps this is unfair. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 2:40 PM Pavel Borisov <pashkin.elfe@gmail.com> wrote: > One hypothesis just out-of-the air: > If heap page visibility map bit is wrongly set to all_visible, index-only scan can give duplicates (in effect new and oldversions of the same row), with no apparent corruption of neither index nor heap relation. That can't be it, because affected physical index pages directly contain duplicate heap TIDs (and plenty of them). That's never allowed to happen, no matter what. -- Peter Geoghegan
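Peter's rebuttal rests on the B-Tree invariant that amcheck later reports as violated: index tuples on a page are ordered by (key, heap TID), with the heap TID acting as a tiebreaker column, so a duplicate TID necessarily breaks strict ordering. A toy sketch of that check (illustrative Python, not amcheck source; the tuple representation is invented):

```python
# Illustrative sketch (not amcheck source): leaf-page items must be
# strictly increasing by (key, heap TID), so duplicate TIDs are detectable
# purely from the index, without consulting the heap at all.

def check_item_order(items):
    """items: list of (key, heap_tid) pairs in page offset order.
    Returns the first offending adjacent pair, or None if ordering holds."""
    for lower, higher in zip(items, items[1:]):
        if not lower < higher:  # strict: equal pairs also fail
            return lower, higher
    return None

ok = [("Ise", (540, 4)), ("Isokaze", (540, 5)), ("Isokaze", (722, 3))]
bad = [("Isokaze", (540, 5)), ("Isokaze", (540, 5))]  # duplicate heap TID

assert check_item_order(ok) is None
assert check_item_order(bad) == (("Isokaze", (540, 5)), ("Isokaze", (540, 5)))
```

This mirrors the later pg_amcheck error in this thread ("item order invariant violated ... Lower index tid=(17,157) (points to heap tid=(540,5)) higher index tid=(17,158) (points to heap tid=(540,5))"): the two adjacent index tuples point to the same heap TID.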
This doesn't look too healthy: CREATE TABLE page ( page_id SERIAL NOT NULL, page_namespace INT NOT NULL, page_title TEXT NOT NULL, page_restrictions TEXT DEFAULT NULL, page_is_redirect SMALLINT DEFAULT 0 NOT NULL, page_is_new SMALLINT DEFAULT 0 NOT NULL, page_random FLOAT NOT NULL, page_touched TIMESTAMPTZ NOT NULL, page_links_updated TIMESTAMPTZ DEFAULT NULL, page_latest INT NOT NULL, page_len INT NOT NULL, page_content_model TEXT DEFAULT NULL, page_lang TEXT DEFAULT NULL, PRIMARY KEY(page_id) ); INSERT INTO page (page_namespace, page_title, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len) select case when random() > 0.5 then 1 else 0 end, random()::text, 0, 0, 42.0, now(), 0, 42 from generate_series(1, 170000); CREATE INDEX page_main_title ON page USING btree(page_title text_pattern_ops) WHERE (page_namespace = 0); postgres=# select count(*) from page; count -------- 170000 (1 row) postgres=# select count(*) from page where page_namespace = 0; count ------- 84934 (1 row) postgres=# explain select count(*) from page where page_namespace = 0; QUERY PLAN ---------------------------------------------------------------------------------------------- Aggregate (cost=3179.75..3179.76 rows=1 width=8) -> Index Only Scan using page_main_title on page (cost=0.42..2967.85 rows=84762 width=0) (2 rows)
On Fri, Oct 29, 2021 at 12:54 PM Thomas Munro <thomas.munro@gmail.com> wrote: > This doesn't look too healthy: Erm. Sorry, please ignore, I made a mistake.
Hi, It's not the cause of this problem, but I did find a minor issue: the retry path in lazy_scan_prune() loses track of the deleted tuple count when retrying. The retry codepath also made me wonder if there could be problems if we do FreezeMultiXactId() multiple times due to retry. I think we can end up creating multiple multixactids for the same tuple (if the members change, which is likely in the retry path). But that should be fine, I think. On 2021-10-28 16:04:44 -0700, Peter Geoghegan wrote: > > Didn't 14 change the logic when index vacuums are done? That could cause > > previously existing issues to manifest with a higher likelihood. > > I don't follow. The new logic that skips index vacuuming kicks in 1) > in an anti-wraparound vacuum emergency, and 2) when there are very few > LP_DEAD line pointers in the heap. We can rule 1 out, I think, because > the XIDs we see are in the low millions, and our starting point was a > database that was upgraded via a dump and reload. Right. > The second criterion for skipping index vacuuming (the "less than 2% of > heap pages have any LP_DEAD items" thing) might well have been hit on > these tables -- it is after all very common. But I don't see how that > could matter. We're never going to get to a code path inside > vacuumlazy.c that sets LP_DEAD items from VACUUM's dead_tuples array > to LP_UNUSED (how could we reach such a code path without also > vacuuming indexes, given the way things are set up inside lazy_vacuum()?). > We're always going to have the opportunity to do index vacuuming with > any left-behind LP_DEAD line pointers in the next VACUUM -- right > after the later VACUUM successfully returns from > lazy_vacuum_all_indexes(). Shrug. It doesn't seem that hard to believe that repeatedly trying to prune the same page could unearth some bugs. E.g. via the heap_prune_record_unused() path in heap_prune_chain(). Hm. I assume somebody checked and verified that old_snapshot_threshold is not in use? 
Seems unlikely, but wrongly entering that heap_prune_record_unused() path could certainly cause issues like we're observing. Greetings, Andres Freund
On Thu, Oct 28, 2021 at 6:19 PM Andres Freund <andres@anarazel.de> wrote: > It's not the cause of this problem, but I did find a minor issue: the retry > path in lazy_scan_prune() loses track of the deleted tuple count when > retrying. Matthias van de Meent (now my coworker) pointed this out several months back. I don't see any reason to prefer remembering it to not remembering it. In any case it's not too important, since the retry behavior is inherently very rare. But we can change it around later, if you prefer. > The retry codepath also made me wonder if there could be problems if we do > FreezeMultiXactId() multiple times due to retry. I think we can end up > creating multiple multixactids for the same tuple (if the members change, > which is likely in the retry path). But that should be fine, I think. That's not possible, because we fully decide what we're going to do with the page as soon as we break out of the for() loop. The closest thing to calling FreezeMultiXactId() that can happen before that point (before we "commit" to a certain course of action for the page) is the calls made to heap_prepare_freeze_tuple(), from inside the loop. heap_prepare_freeze_tuple() doesn't actually modify anything -- it just decides what to do later on, in the "nfrozen > 0" critical section back in lazy_scan_prune(). > Shrug. It doesn't seem that hard to believe that repeatedly trying to prune > the same page could unearth some bugs. E.g. via the heap_prune_record_unused() > path in heap_prune_chain(). I wasn't thinking of lazy_scan_prune() before, when you brought up index vacuuming -- I was thinking of lazy_vacuum(). But FWIW I'm not sure what you mean here. I believe that the goto retry logic inside lazy_scan_prune() can restart and expect a clean slate. Sure, we're pruning again, but is that appreciably different to a concurrent opportunistic prune that could happen at almost any time during the VACUUM, anyway? 
(Well, maybe it is in that we get the accounting for things like VACUUM VERBOSE very slightly wrong, but even that's debatable.) > Hm. I assume somebody checked and verified that old_snapshot_threshold is not > in use? Seems unlikely, but wrongly entering that heap_prune_record_unused() > path could certainly cause issues like we're observing. We should inquire about settings used, on general principle. -- Peter Geoghegan
On Thu, Oct 28, 2021 at 2:46 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > The issue manifested again earlier today *after* a REINDEX followed by > enabling WAL replica logging on the 24th of October. I saved a snapshot > of the filesystem holding the data directory. Would that be useful for > further analysis? Access to the WAL archive would be great. Now might be a good time to set "log_min_autovacuum_duration=0" in postgresql.conf -- that information might be very helpful later on. This will appear in the system logs for PostgreSQL. We'll be particularly interested in the "page" table, for the obvious reason. Also, can you please post output of the following on your server: "select * from pg_settings where setting != boot_val;" Thanks -- Peter Geoghegan
On Thu, Oct 28, 2021 at 6:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > heap_prepare_freeze_tuple() doesn't actually modify anything -- it > just decides what to do later on, in the "nfrozen > 0" critical > section back in lazy_scan_prune(). Oh, wait. heap_prepare_freeze_tuple() is advertised as not modifying anything, but that's kind of a lie, since FreezeMultiXactId() is called and can sometimes allocate a new MultiXactId. The comments above heap_prepare_freeze_tuple() never got the memo about this following one of the big MultiXact bug fixes, "Rework tuple freezing protocol" (commit 3b97e6823b). My mistake. Presumably this isn't very different to a hard crash that also allocates a MultiXactId that never gets used. It should be even rarer than that. We hardly ever hit the goto path, with or without freezing or almost-freezing. I see that coverage.postgresql.org says we don't hit it even once. So I think that this is fine. -- Peter Geoghegan
On 29.10.2021 1:01, Andres Freund wrote: >> The issue manifested again earlier today *after* a REINDEX followed by >> enabling WAL replica logging on the 24th of October. I saved a snapshot of >> the filesystem holding the data directory. Would that be useful for further >> analysis? > Yes, that's *quite* useful. I assume you can't just share that snapshot? I am afraid it contains personal data (the mwuser table with e-mail addresses, passwords, and so on) for multiple different MediaWiki instances' databases. I will look into scrubbing that kind of data out later today. I assume dropping the other databases from the cluster should be fine and will not affect further analysis? With the personal data scrubbed I will likely be able to provide SSH access (with su/sudo available) to the VM if needed, though this will take time (I will need to make a DMZ for that VM). Please inform me if this would be desirable. > Once we identified an affected heap and index page with the corruption, we > should use pg_waldump to scan for all changes to that table. > > Do you have the log file(s) from between the 24th and now? That might give us > a good starting point for the LSN range to scan. There are multiple WAL log files, the first of them with the timestamp of Oct 25 09:45. I am currently moving the snapshot over from my server to the VM I made for this investigation. I will look into pg_waldump documentation as soon as possible; I have not had to deal with WAL logs before. P. S. To possibly make some things simpler: I am on #postgresql on Libera as Remilia (or IijimaYun in case of disconnects) and am generally available from 06:30 UTC to around 21:00 UTC. -- K. R.
I think we experienced something similar.
- AWS, i3.8xlarge
- Ubuntu 18.04
- ext4
- It is a shared database, with 8 clusters in total
- Size of each cluster ~1TB
- Each cluster produces ~3TB of WAL every day (plenty of UPDATEs, about 90% of which are HOT updates).
Corruption was found on all shards, but the list of affected indexes varies a bit from shard to shard.
Database schema:
- mostly PRIMARY or UNIQUE keys
- a couple of non-unique btree indexes
- plenty of foreign keys
The timeline:
2021-10-11 - we did the major upgrade from 9.6 to 14
2021-10-14 - executed reindexdb -a --concurrently, which finished successfully. In order to speed up reindexing we were using PGOPTIONS="-c maintenance_work_mem=64GB -c max_parallel_maintenance_workers=4"
2021-10-25 - I noticed that some of the indexes are corrupted, and these are mostly UNIQUE indexes on int and/or bigint.
After that, I identified affected indexes with amcheck, found and removed duplicated rows, and ran pg_repack on affected tables. The pg_repack was running with max_parallel_maintenance_workers=0
Since we keep an archive of WALs and backups only for the past 6 days, it will not be possible to find the respective files that produced the corruption.
As of today (2021-10-29), amcheck doesn't report any problems.
I hope this information could give you some hints.
Regards,
--
Alexander Kukushkin
On 29.10.2021 4:50, Peter Geoghegan wrote: > Access to the WAL archive would be great. Now might be a good time to > set "log_min_autovacuum_duration=0" in postgresql.conf -- that > information might be very helpful later on. This will appear in the > system logs for PostgreSQL. We'll be particularly interested in the > "page" table, for the obvious reason. It looks like this parameter is named "log_autovacuum_min_duration" instead. Done. > Also, can you please post output of the following on your server: > "select * from pg_settings where setting != boot_val;" Attached; this was done before the above. -- K. R.
Attachments
On 29.10.2021 1:48, Andres Freund wrote: > That wouldn't protect against e.g. a logic bug in ZFS. Given its copy-on-write > nature corruption could very well manifest as seeing an older version of the > data when re-reading data from disk. Which could very well lead to the type of > corruption we're seeing here. > Didn't 14 change the logic when index vacuums are done? That could cause > previously existing issues to manifest with a higher likelihood. Note that I ran 13.2 (that came with open_datasync()) and 13.3 (which switched back to fdatasync()) on openzfs for almost a year since upgrading to 13.0-RELEASE, with no discernible issues. The 13.3 data directory has been kept intact since the upgrade to 14. -- K. R.
On 29.10.2021 10:55, Kamigishi Rei wrote: >> Do you have the log file(s) from between the 24th and now? That might >> give us >> a good starting point for the LSN range to scan. > > There are multiple WAL log files, the first of them with the timestamp > of Oct 25 09:45. The newly manifested issue is caught by pg_amcheck: btree index "azurlane_wiki.mediawiki.page_main_title": ERROR: item order invariant violated for index "page_main_title" DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78. The weird part about this is that the WAL archive does not seem to contain any data for 157 and 158 above (in 1663/19243/274869 blk 17). The last two entries are rmgr: Btree len (rec/tot): 53/ 4885, tx: 2085600, lsn: 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel 1663/19243/274869 blk 17 FPW rmgr: Btree len (rec/tot): 72/ 72, tx: 2085602, lsn: 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel 1663/19243/274869 blk 17 The WAL file in data14/pg_wal does not have anything related to 157 and 158 for this filenode/blk as well. -- K. R.
On Fri, Oct 29, 2021 at 11:36 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > The newly manifested issue is caught by pg_amcheck: > > btree index "azurlane_wiki.mediawiki.page_main_title": > ERROR: item order invariant violated for index "page_main_title" > DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) > higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78. Great! I'm not surprised to see that it's the page table, once again. It's not particularly big, right? Are there other tables that are much larger? > The weird part about this is that the WAL archive does not seem to > contain any data for 157 and 158 above (in 1663/19243/274869 blk 17). > The last two entries are > > rmgr: Btree len (rec/tot): 53/ 4885, tx: 2085600, lsn: > 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel > 1663/19243/274869 blk 17 FPW > rmgr: Btree len (rec/tot): 72/ 72, tx: 2085602, lsn: > 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel > 1663/19243/274869 blk 17 > > The WAL file in data14/pg_wal does not have anything related to 157 and > 158 for this filenode/blk as well. If this were a heap relation then that would be true, because the offset number of a heap tuple needs to be stable, at least within a "VACUUM cycle" (otherwise indexes would point to the wrong tuples). However, this relation is a B-Tree index, where TIDs/page offset numbers are not stable at all. Almost all individual index tuple inserts onto a B-Tree page put the new index tuple "between" existing index tuples. This will "shift" whatever index tuples are to the right of the position of the new tuple. For example, with "INSERT_LEAF off 156", the atomic insert operation will shift any existing index tuple at page offset 156 to page offset 157, and any index tuple that was at page offset 157 to page offset 158. And so on. 
We don't physically shift the index tuples themselves, but we do shift the item ID/line pointer array at the start of the page, so it's not too expensive. -- Peter Geoghegan
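The line-pointer shift that Peter describes can be sketched as a toy model (illustrative Python, not PostgreSQL source; the page representation is invented): an INSERT_LEAF at a given 1-based offset moves every item ID at or after that offset one position to the right, which is why offsets 157 and 158 can exist on the page without ever appearing in a WAL record.

```python
# Toy illustration (not PostgreSQL source) of what "INSERT_LEAF off 156"
# does to a B-Tree page's item ID array: the new entry is inserted at the
# given 1-based offset and later entries shift right by one.

def insert_leaf(item_ids, off, tid):
    """Insert tid at 1-based page offset `off`, shifting later entries."""
    item_ids.insert(off - 1, tid)
    return item_ids

# A page with items at offsets 1..157 before the insert.
page = [(540, i) for i in range(1, 158)]
insert_leaf(page, 156, (999, 1))

assert page[155] == (999, 1)    # new tuple now sits at offset 156
assert page[156] == (540, 156)  # old offset 156 shifted to offset 157
assert page[157] == (540, 157)  # old offset 157 shifted to offset 158
```

So a WAL record mentioning only "off 156" can be responsible for the current contents of offsets 157 and 158, which explains why grepping the WAL for those offset numbers finds nothing.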
On 2021-Oct-29, Kamigishi Rei wrote: > The weird part about this is that the WAL archive does not seem to contain > any data for 157 and 158 above (in 1663/19243/274869 blk 17). The last two > entries are > > rmgr: Btree len (rec/tot): 53/ 4885, tx: 2085600, lsn: > 2/A0195AE0, prev 2/A01943F0, desc: INSERT_LEAF off 155, blkref #0: rel > 1663/19243/274869 blk 17 FPW > rmgr: Btree len (rec/tot): 72/ 72, tx: 2085602, lsn: > 2/A019DD30, prev 2/A019DCF0, desc: INSERT_LEAF off 156, blkref #0: rel > 1663/19243/274869 blk 17 > > The WAL file in data14/pg_wal does not have anything related to 157 and 158 > for this filenode/blk as well. Hmm, I don't remember precisely how index tuple removal works, but maybe these 157 and 158 entries existed previously and were truncated away. It might be useful to have a look at the page header in the page image contained in the FPW for 2/A0195AE0. pg_waldump doesn't do it, but I have this patch sitting around ... probably outdated, but it may be a useful starting point for somebody. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Attachments
On 2021-Oct-29, Peter Geoghegan wrote: > However, this relation is a B-Tree index, where TIDs/page offset > numbers are not stable at all. > > Almost all individual index tuple inserts onto a B-Tree page put the > new index tuple "between" existing index tuples. This will "shift" > whatever index tuples are to the right of the position of the new > tuple. Bah, of course. Sorry for the noise. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"
On Fri, Oct 29, 2021 at 11:52 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Bah, of course. Sorry for the noise. It would be nice if users could follow what's going on in the index by looking at pg_waldump. Theoretically you could compensate for every individual change to the page (i.e. track the current page offset number for each tuple dynamically), but that's just too tedious to ever really work. It would be nice if pg_waldump printed the contents of index tuples, a little like pg_filedump can. That would actually offer Kamigishi Rei a practical way of following what's going on at a low level. It's a pity that we don't have some kind of generic facility for dumping a random index tuple, that doesn't require knowledge of the structure of the index, which isn't easy to get into a tool like pg_waldump. It's usually possible to see what's in the index without needing to apply metadata about the "shape" of tuples -- the generic tuple headers show a lot. -- Peter Geoghegan
On Fri, Oct 29, 2021 at 1:12 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > It looks like this parameter is named "log_autovacuum_min_duration" > instead. Done. Great. > > Also, can you please post output of the following on your server: > > "select * from pg_settings where setting != boot_val;" > > Attached; this was done before the above. Looks like there are no changes that could possibly be relevant to our investigation, which is not surprising. Thanks -- Peter Geoghegan
On Fri, Oct 29, 2021 at 1:17 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > Note that I ran 13.2 (that came with open_datasync()) and 13.3 (which > switched back to fdatasync()) on openzfs for almost a year since > upgrading to 13.0-RELEASE, with no discernible issues. The 13.3 data > directory has been kept intact since the upgrade to 14. I now think that this is very likely to be a bug in Postgres 14. Note that another user with very similar corruption (same symptoms, similar workload) happened to be using ext4 on Linux. -- Peter Geoghegan
On 29.10.2021 21:45, Peter Geoghegan wrote: >> btree index "azurlane_wiki.mediawiki.page_main_title": >> ERROR: item order invariant violated for index "page_main_title" >> DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) >> higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78. > > I'm not surprised to see that it's the page table, once again. It's > not particularly big, right? Are there other tables that are much > larger? This table is 15 MB in size and it contains wiki page titles along with their namespace IDs and such. The largest table is mediawiki.text, which is 807 MB and holds article text. Here are some of the largest tables: log_search | 11 MB imagelinks | 11 MB revision_comment_temp | 13 MB page | 15 MB content | 15 MB slots | 19 MB watchlist | 24 MB revision_actor_temp | 25 MB image | 33 MB revision | 37 MB logging | 39 MB pagelinks | 56 MB text | 807 MB -- K. R.
On Fri, Oct 29, 2021 at 11:45 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Oct 29, 2021 at 11:36 AM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > > The newly manifested issue is caught by pg_amcheck: > > > > btree index "azurlane_wiki.mediawiki.page_main_title": > > ERROR: item order invariant violated for index "page_main_title" > > DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) > > higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78. > I'm not surprised to see that it's the page table, once again. It's > not particularly big, right? Are there other tables that are much > larger? BTW, can you look for autovacuum log output for the "page" table from the logs? Then share it with the list, or at least share it with me privately? Thanks -- Peter Geoghegan
On 29.10.2021 22:20, Peter Geoghegan wrote: > BTW, can you look for autovacuum log output for the "page" table from > the logs? Then share it with the list, or at least share it with me > privately? I am afraid the severity level of autovacuum was too low for the default syslog configuration on FreeBSD. I changed it just now to log *.* to a separate file and I can see autovacuum messages for other databases appear there now. The page table has not been autovacuum'd since the 5th of October (the 13.3 to 14 upgrade day). The only tables that underwent autovacuum since that time were relname | last_vacuum | last_autovacuum -----------------------+-------------------------------+------------------------------- user_properties | 2021-10-29 12:31:09.849931+09 | 2021-10-22 12:41:37.727299+09 cargo_pages | 2021-10-29 12:31:08.925664+09 | 2021-10-24 21:53:27.063671+09 querycache_info | 2021-10-29 12:31:11.49013+09 | 2021-10-28 14:53:44.448064+09 site_stats | 2021-10-29 12:31:12.296138+09 | 2021-10-29 00:16:27.778438+09 job | 2021-10-29 12:31:11.897731+09 | 2021-10-30 00:46:30.479516+09 objectcache | 2021-10-29 12:31:05.71646+09 | 2021-10-30 01:00:33.519683+09 Should I disable the daily vacuum script? -- K. R.
On Fri, Oct 29, 2021 at 12:37 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote: > I am afraid the severity level of autovacuum was too low for the default > syslog configuration on FreeBSD. I changed it just now to log *.* to a > separate file and I can see autovacuum messages for other databases > appear there now. Oh well. It's enabled now, and it doesn't seem to be too hard for you to reproduce the issue, given time. We can just wait a few more days. > Should I disable the daily vacuum script? Please don't change anything. The daily VACUUM script is probably where the problem actually starts. -- Peter Geoghegan
On 29.10.2021 21:36, Kamigishi Rei wrote: > btree index "azurlane_wiki.mediawiki.page_main_title": > ERROR: item order invariant violated for index "page_main_title" > DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) > higher index tid=(17,158) (points to heap tid=(540,5)) page > lsn=2/A019DD78. I forgot to include WAL archive grepped for the "page" table's heap page 540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.* FROM heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3 for lp=5 for the snapshot (though I am not sure if lp=5 here is the same 5 as in the tid=(540,5) tuple). 00000001000000020000008D: rmgr: Heap len (rec/tot): 59/ 7939, tx: 2013796, lsn: 2/8DA7C498, prev 2/8DA7C470, desc: LOCK off 5: xid 2013796: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW rmgr: Heap len (rec/tot): 98/ 98, tx: 2013796, lsn: 2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 blk 540 rmgr: Heap2 len (rec/tot): 56/ 56, tx: 0, lsn: 2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540 rmgr: Heap len (rec/tot): 54/ 54, tx: 2013798, lsn: 2/8DABF590, prev 2/8DABF558, desc: LOCK off 22: xid 2013798: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540 rmgr: Heap len (rec/tot): 83/ 83, tx: 2013798, lsn: 2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540 rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540 rmgr: Heap len (rec/tot): 59/ 7939, tx: 2014289, lsn: 2/8DE8A460, prev 2/8DE8A438, desc: LOCK off 41: xid 2014289: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 
FPW
rmgr: Heap len (rec/tot): 99/ 99, tx: 2014289, lsn: 2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60 KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2014291, lsn: 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 82/ 82, tx: 2014291, lsn: 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 59/ 7939, tx: 2014784, lsn: 2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 100/ 100, tx: 2014784, lsn: 2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60 KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2014786, lsn: 2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 82/ 82, tx: 2014786, lsn: 2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 0x60 ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DFFA420, prev 2/8DFFA3F8, desc: PRUNE latestRemovedXid 2014786 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
000000010000000200000090:
rmgr: Heap2 len (rec/tot): 53/ 7937, tx: 0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 2/90CF1448, prev 2/90CEF528, desc: VISIBLE cutoff xid 2014786 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
000000010000000200000093:
rmgr: Heap len (rec/tot): 59/ 7943, tx: 2045060, lsn: 2/93FF90E8, prev 2/93FF90B0, desc: LOCK off 41: xid 2045060: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
000000010000000200000094:
rmgr: Heap len (rec/tot): 99/ 99, tx: 2045060, lsn: 2/94026D80, prev 2/94025150, desc: UPDATE off 41 xmax 2045060 flags 0x61 KEYS_UPDATED ; new off 5 xmax 2045060, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/94031F28, prev 2/94031EF8, desc: PRUNE latestRemovedXid 2045060 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2045062, lsn: 2/94031F68, prev 2/94031F28, desc: LOCK off 5: xid 2045062: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 83/ 83, tx: 2045062, lsn: 2/94031FA0, prev 2/94031F68, desc: HOT_UPDATE off 5 xmax 2045062 flags 0x60 ; new off 22 xmax 2045062, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/94043C50, prev 2/94043C28, desc: PRUNE latestRemovedXid 2045062 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540
000000010000000200000097:
rmgr: Heap len (rec/tot): 73/ 8133, tx: 2065110, lsn: 2/97207F18, prev 2/97207EE0, desc: UPDATE off 8 xmax 2065110 flags 0x00 ; new off 41 xmax 2065110, blkref #0: rel 1663/19243/19560 blk 540 FPW, blkref #1: rel 1663/19243/19560 blk 318
rmgr: Heap2 len (rec/tot): 53/ 8109, tx: 0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 2/97ED4560, prev 2/97ED2598, desc: VISIBLE cutoff xid 2065110 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
0000000100000002000000A0:
rmgr: Heap len (rec/tot): 59/ 8115, tx: 2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 54/ 54, tx: 2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 73/ 8237, tx: 2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 2085602, lsn: 2/A019DC20, prev 2/A019BC08, desc: PRUNE latestRemovedXid 2085600 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

-- K. R.
On Fri, Oct 29, 2021 at 12:52 PM Kamigishi Rei <iijima.yun@koumakan.jp> wrote:
> I forgot to include WAL archive grepped for the "page" table's heap page 540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.* FROM heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3 for lp=5 for the snapshot (though I am not sure if lp=5 here is the same 5 as in the tid=(540,5) tuple).

Can you send me the heap page privately, please? As well as the index leaf block 17 from the corrupt index "page_main_title"?

This looks like exactly what we need! We're close.

-- Peter Geoghegan
On Fri, Oct 29, 2021 at 12:58 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This looks like exactly what we need! We're close.

Also, it would be good to see everything that pg_waldump shows about transaction IDs 2014289, 2014291, 2014784, 2014786, 2045060, 2045062, 2065110, 2085600, and 2085602. You can use "pg_waldump --xid=" to make this easier.

-- Peter Geoghegan
Hi,

On 2021-10-29 22:52:39 +0300, Kamigishi Rei wrote:
> On 29.10.2021 21:36, Kamigishi Rei wrote:
> > btree index "azurlane_wiki.mediawiki.page_main_title":
> > ERROR: item order invariant violated for index "page_main_title"
> > DETAIL: Lower index tid=(17,157) (points to heap tid=(540,5)) higher index tid=(17,158) (points to heap tid=(540,5)) page lsn=2/A019DD78.
>
> I forgot to include WAL archive grepped for the "page" table's heap page 540. Note that 540,5 is mentioned below just before VACUUM. `SELECT h.* FROM heap_page_items(get_raw_page('page',540)) h;` returns lp_flags=3 for lp=5 for the snapshot (though I am not sure if lp=5 here is the same 5 as in the tid=(540,5) tuple).

Could you search for btree WAL records before the following records? Most importantly for the corrupted page in the corrupted index, but other ones might be interesting as well

> rmgr: Heap2 len (rec/tot): 53/ 7937, tx: 0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
> rmgr: Heap2 len (rec/tot): 53/ 8109, tx: 0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW

The currently most likely seeming explanation is that somehow the index entries pointing to the heap items removed here weren't removed themselves, subsequently leading to visible corruption once those tids are reused.

Greetings,

Andres Freund
Hi,

There's something odd going on with 540,41, see below.

On 2021-10-29 22:52:39 +0300, Kamigishi Rei wrote:
> rmgr: Heap len (rec/tot): 98/ 98, tx: 2013796, lsn: 2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 blk 540

Does a non-HOT update, from 540,5 to 540,22.

> rmgr: Heap2 len (rec/tot): 56/ 56, tx: 0, lsn: 2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

This presumably marks 540,5 dead, given that the removed xid is the same.

> rmgr: Heap len (rec/tot): 83/ 83, tx: 2013798, lsn: 2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540

HOT update of 540,22 (which was 540,5), now at 540,41.

> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540

Presumably redirecting 540,22 -> 540,41.

> rmgr: Heap len (rec/tot): 99/ 99, tx: 2014289, lsn: 2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60 KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 blk 540

Non-HOT update of 540,41 (which was 540,22, 540,5), now at 540,53.

> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540

Likely marks 540,41 dead.

> rmgr: Heap len (rec/tot): 54/ 54, tx: 2014291, lsn: 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap len (rec/tot): 82/ 82, tx: 2014291, lsn: 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540

HOT update of 540,53, now at 540,41.

Here I am confused. 540,41 was presumably marked dead in 2/8DEC0420, but not marked unused? So this shouldn't be possible. What am I missing?

> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291 nredirected 1 ndead 0, blkref #0: rel 1663/19243/19560 blk 540

Likely redirecting 540,53 -> 540,41.

> rmgr: Heap len (rec/tot): 59/ 7939, tx: 2014784, lsn: 2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
> rmgr: Heap len (rec/tot): 100/ 100, tx: 2014784, lsn: 2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60 KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784 nredirected 0 ndead 1, blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap len (rec/tot): 54/ 54, tx: 2014786, lsn: 2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap len (rec/tot): 82/ 82, tx: 2014786, lsn: 2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 0x60 ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540

Same deal. I don't understand why this is possible?

> rmgr: Heap len (rec/tot): 59/ 8115, tx: 2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
> rmgr: Heap len (rec/tot): 54/ 54, tx: 2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> rmgr: Heap len (rec/tot): 73/ 8237, tx: 2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540

This is also odd. Why are we locking the same row twice, in the same transaction?

Greetings,

Andres Freund
Hi,

On 2021-10-29 13:49:45 -0700, Andres Freund wrote:
> > rmgr: Heap len (rec/tot): 54/ 54, tx: 2014291, lsn: 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
> > rmgr: Heap len (rec/tot): 82/ 82, tx: 2014291, lsn: 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
>
> HOT update of 540,53, now at 540,41.
>
> Here I am confused. 540,41 was presumably marked dead in 2/8DEC0420, but not marked unused? So this shouldn't be possible.
>
> What am I missing?

Oh. Likely the issue is that heap2_desc() doesn't print the number of redirects.

I'm considering writing a patch that
1) displays the number of tuples marked unused in HEAP2_PRUNE. This might only be possible if no FPW was used
2) if a HEAP2_PRUNE or HEAP2_VACUUM isn't an FPW, displays the offsets

For 15 I think it might be worth explicitly storing the number of offsets marked unused, rather than inferring it. It's hard to believe that the 16 bits for that would be a relevant overhead, and having that more readily available seems like a significant improvement in debuggability.

Greetings,

Andres Freund
On 29.10.2021 23:18, Andres Freund wrote:
> Could you search for btree WAL records before the following records? Most importantly for the corrupted page in the corrupted index, but other ones might be interesting as well
>
>> rmgr: Heap2 len (rec/tot): 53/ 7937, tx: 0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
>> rmgr: Heap2 len (rec/tot): 53/ 8109, tx: 0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW

These are daily vacuum runs (on the 27th and on the 28th) with dozens of preceding VACUUM lines for other tables. Should I try to filter out those?

-- K. R.
Hi,

On 2021-10-30 00:08:25 +0300, Kamigishi Rei wrote:
> On 29.10.2021 23:18, Andres Freund wrote:
> > Could you search for btree WAL records before the following records? Most importantly for the corrupted page in the corrupted index, but other ones might be interesting as well
> >
> > > rmgr: Heap2 len (rec/tot): 53/ 7937, tx: 0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
> > > rmgr: Heap2 len (rec/tot): 53/ 8109, tx: 0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW
>
> These are daily vacuum runs (on the 27th and on the 28th) with dozens of preceding VACUUM lines for other tables. Should I try to filter out those?

I think it's moot for now, because of the discovery that waldump for PRUNE doesn't show records marked unused. I'll try to write a patch to change that.

After that it'd mostly be interesting to see just enough records for btree vacuums not affecting the target table to establish which indexes were vacuumed. I.e. only records back until a heap VACUUM for a different table started, and only a single record for each other index. Does that make sense?

Greetings,

Andres Freund
On Fri, Oct 29, 2021 at 1:59 PM Andres Freund <andres@anarazel.de> wrote:
> I'm considering writing a patch that
> 1) displays the number of tuples marked unused in HEAP2_PRUNE. This might only be possible if no FPW was used
> 2) if a HEAP2_PRUNE or HEAP2_VACUUM isn't an FPW, displays the offsets
> For 15 I think it might be worth explicitly storing the number of offsets marked unused, rather than inferring it. It's hard to believe that the 16 bits for that would be a relevant overhead, and having that more readily available seems like a significant improvement in debuggability.

I like that idea.

Attached is output from pg_filedump for heap page 540, with the page header LSN reading 2/A019DC60. I have output with and without the -i option. This doesn't have any of the real user data, which won't matter to you.

-- Peter Geoghegan
Attachments
On 2021-10-29 14:30:48 -0700, Andres Freund wrote:
> I think it's moot for now, because of the discovery that waldump for PRUNE doesn't show records marked unused. I'll try to write a patch to change that.

Attached is the quick version.
Attachments
On Fri, Oct 29, 2021 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > rmgr: Heap len (rec/tot): 59/ 8115, tx: 2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
> > rmgr: Heap len (rec/tot): 54/ 54, tx: 2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > rmgr: Heap len (rec/tot): 73/ 8237, tx: 2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
>
> This is also odd. Why are we locking the same row twice, in the same transaction?

My guess is that mediawiki is supposed to be usable with at least MySQL, Postgres, and SQLite, and so uses SELECT FOR UPDATE in ways that seem excessive to us.

-- Peter Geoghegan
Hi,

On 2021-10-29 15:06:52 -0700, Peter Geoghegan wrote:
> On Fri, Oct 29, 2021 at 1:49 PM Andres Freund <andres@anarazel.de> wrote:
> > > rmgr: Heap len (rec/tot): 59/ 8115, tx: 2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
> > > rmgr: Heap len (rec/tot): 54/ 54, tx: 2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
> > > rmgr: Heap len (rec/tot): 73/ 8237, tx: 2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
> >
> > This is also odd. Why are we locking the same row twice, in the same transaction?
>
> My guess is that mediawiki is supposed to be usable with at least MySQL, Postgres, and SQLite, and so uses SELECT FOR UPDATE in ways that seem excessive to us.

I was wondering from a different angle: The two locks are from the same transaction. As far as I remember that code, we actually shouldn't end up emitting a separate lock if the lock level is the same.

Just briefly looking over the code, the second heap_lock_tuple() should get a TM_BeingModified from HTSU, which should then end up in the TransactionIdIsCurrentTransactionId(xwait) branch, resulting in a goto out_unlocked.

Greetings,

Andres Freund
On 30.10.2021 0:59, Andres Freund wrote:
> On 2021-10-29 14:30:48 -0700, Andres Freund wrote:
>> I think it's moot for now, because of the discovery that waldump for PRUNE doesn't show records marked unused. I'll try to write a patch to change that.
>
> Attached the quick version.

Here are WAL archive entries related to page 540 of mediawiki.page, with the patch applied:

# grep "rel 1663/19243/19560 blk 540" ~/waldump-everything.txt
rmgr: Heap len (rec/tot): 59/ 7939, tx: 2013796, lsn: 2/8DA7C498, prev 2/8DA7C470, desc: LOCK off 5: xid 2013796: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 98/ 98, tx: 2013796, lsn: 2/8DAAA0A0, prev 2/8DAA8548, desc: UPDATE off 5 xmax 2013796 flags 0x61 KEYS_UPDATED ; new off 22 xmax 2013796, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 56/ 56, tx: 0, lsn: 2/8DABF558, prev 2/8DABF528, desc: PRUNE latestRemovedXid 2013796 nredirected 0 ndead 1 nunused 0, dead: [5], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2013798, lsn: 2/8DABF590, prev 2/8DABF558, desc: LOCK off 22: xid 2013798: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 83/ 83, tx: 2013798, lsn: 2/8DABF5C8, prev 2/8DABF590, desc: HOT_UPDATE off 22 xmax 2013798 flags 0x60 ; new off 41 xmax 2013798, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DACCBB0, prev 2/8DACCB88, desc: PRUNE latestRemovedXid 2013798 nredirected 1 ndead 0 nunused 0, redirected: [22->41], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 59/ 7939, tx: 2014289, lsn: 2/8DE8A460, prev 2/8DE8A438, desc: LOCK off 41: xid 2014289: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 99/ 99, tx: 2014289, lsn: 2/8DEB5250, prev 2/8DEB36D8, desc: UPDATE off 41 xmax 2014289 flags 0x60 KEYS_UPDATED ; new off 53 xmax 2014289, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DEC0420, prev 2/8DEC03F0, desc: PRUNE latestRemovedXid 2014289 nredirected 0 ndead 1 nunused 1, unused: [41], dead: [22], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2014291, lsn: 2/8DEC0460, prev 2/8DEC0420, desc: LOCK off 53: xid 2014291: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 82/ 82, tx: 2014291, lsn: 2/8DEC0498, prev 2/8DEC0460, desc: HOT_UPDATE off 53 xmax 2014291 flags 0x60 ; new off 41 xmax 2014291, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DED6A10, prev 2/8DED69E8, desc: PRUNE latestRemovedXid 2014291 nredirected 1 ndead 0 nunused 0, redirected: [53->41], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 59/ 7939, tx: 2014784, lsn: 2/8DFB15D0, prev 2/8DFB1598, desc: LOCK off 41: xid 2014784: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 100/ 100, tx: 2014784, lsn: 2/8DFDAD20, prev 2/8DFD9180, desc: UPDATE off 41 xmax 2014784 flags 0x60 KEYS_UPDATED ; new off 57 xmax 2014784, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DFE5F90, prev 2/8DFE5F60, desc: PRUNE latestRemovedXid 2014784 nredirected 0 ndead 1 nunused 1, unused: [41], dead: [53], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2014786, lsn: 2/8DFE5FD0, prev 2/8DFE5F90, desc: LOCK off 57: xid 2014786: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 82/ 82, tx: 2014786, lsn: 2/8DFE6020, prev 2/8DFE5FD0, desc: HOT_UPDATE off 57 xmax 2014786 flags 0x60 ; new off 41 xmax 2014786, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/8DFFA420, prev 2/8DFFA3F8, desc: PRUNE latestRemovedXid 2014786 nredirected 1 ndead 0 nunused 0, redirected: [57->41], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 53/ 7937, tx: 0, lsn: 2/90CEF528, prev 2/90CEF4E8, desc: VACUUM nunused 3, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 2/90CF1448, prev 2/90CEF528, desc: VISIBLE cutoff xid 2014786 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 59/ 7943, tx: 2045060, lsn: 2/93FF90E8, prev 2/93FF90B0, desc: LOCK off 41: xid 2045060: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 99/ 99, tx: 2045060, lsn: 2/94026D80, prev 2/94025150, desc: UPDATE off 41 xmax 2045060 flags 0x61 KEYS_UPDATED ; new off 5 xmax 2045060, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/94031F28, prev 2/94031EF8, desc: PRUNE latestRemovedXid 2045060 nredirected 0 ndead 1 nunused 1, unused: [41], dead: [57], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 54/ 54, tx: 2045062, lsn: 2/94031F68, prev 2/94031F28, desc: LOCK off 5: xid 2045062: flags 0x00 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 83/ 83, tx: 2045062, lsn: 2/94031FA0, prev 2/94031F68, desc: HOT_UPDATE off 5 xmax 2045062 flags 0x60 ; new off 22 xmax 2045062, blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 0, lsn: 2/94043C50, prev 2/94043C28, desc: PRUNE latestRemovedXid 2045062 nredirected 1 ndead 0 nunused 0, redirected: [5->22], blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 73/ 8133, tx: 2065110, lsn: 2/97207F18, prev 2/97207EE0, desc: UPDATE off 8 xmax 2065110 flags 0x00 ; new off 41 xmax 2065110, blkref #0: rel 1663/19243/19560 blk 540 FPW, blkref #1: rel 1663/19243/19560 blk 318
rmgr: Heap2 len (rec/tot): 53/ 8109, tx: 0, lsn: 2/97ED2598, prev 2/97ED2558, desc: VACUUM nunused 1, blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap2 len (rec/tot): 59/ 59, tx: 0, lsn: 2/97ED4560, prev 2/97ED2598, desc: VISIBLE cutoff xid 2065110 flags 0x01, blkref #0: rel 1663/19243/19560 fork vm blk 0, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 59/ 8115, tx: 2085600, lsn: 2/A0165A70, prev 2/A0165A38, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540 FPW
rmgr: Heap len (rec/tot): 54/ 54, tx: 2085600, lsn: 2/A018E858, prev 2/A018D7D8, desc: LOCK off 22: xid 2085600: flags 0x00 LOCK_ONLY EXCL_LOCK KEYS_UPDATED , blkref #0: rel 1663/19243/19560 blk 540
rmgr: Heap len (rec/tot): 73/ 8237, tx: 2085600, lsn: 2/A018E890, prev 2/A018E858, desc: UPDATE off 22 xmax 2085600 flags 0x03 KEYS_UPDATED ; new off 21 xmax 2085600, blkref #0: rel 1663/19243/19560 blk 328 FPW, blkref #1: rel 1663/19243/19560 blk 540
rmgr: Heap2 len (rec/tot): 58/ 58, tx: 2085602, lsn: 2/A019DC20, prev 2/A019BC08, desc: PRUNE latestRemovedXid 2085600 nredirected 0 ndead 1 nunused 1, unused: [22], dead: [5], blkref #0: rel 1663/19243/19560 blk 540

-- K. R.
Hi,

Amit, this looks to be a parallel vacuum related bug, see below.

Whew, we found the bug, I think. Or at least one that can create exactly this situation.

The problem is that parallel vacuum doesn't have correct handling for small indexes, when there are other indexes that are big enough to use parallelism.

CREATE TABLE multiple_indexes(id serial primary key, data text, other serial unique);
CREATE INDEX ON multiple_indexes(data);
INSERT INTO multiple_indexes(data) SELECT 1 FROM generate_series(1, 50000);

\di+ multiple_indexes_*
                                                       List of relations
┌────────┬────────────────────────────┬───────┬────────┬──────────────────┬─────────────┬───────────────┬─────────┬─────────────┐
│ Schema │ Name                       │ Type  │ Owner  │ Table            │ Persistence │ Access method │ Size    │ Description │
├────────┼────────────────────────────┼───────┼────────┼──────────────────┼─────────────┼───────────────┼─────────┼─────────────┤
│ public │ multiple_indexes_data_idx  │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 336 kB  │             │
│ public │ multiple_indexes_other_key │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 1112 kB │             │
│ public │ multiple_indexes_pkey      │ index │ andres │ multiple_indexes │ permanent   │ btree         │ 1112 kB │             │
└────────┴────────────────────────────┴───────┴────────┴──────────────────┴─────────────┴───────────────┴─────────┴─────────────┘
(3 rows)

DELETE FROM multiple_indexes;

postgres[1253065][1]=# SHOW min_parallel_index_scan_size;
┌──────────────────────────────┐
│ min_parallel_index_scan_size │
├──────────────────────────────┤
│ 512kB                        │
└──────────────────────────────┘
(1 row)

I.e. we have two indexes that are big enough to use parallelism, and one that isn't.

first heap scan:

VACUUM (VERBOSE,PARALLEL 4) multiple_indexes;
INFO: 00000: vacuuming "public.multiple_indexes"
LOCATION: lazy_scan_heap, vacuumlazy.c:937

index vacuum:

INFO: 00000: launched 1 parallel vacuum worker for index vacuuming (planned: 1)
LOCATION: do_parallel_vacuum_or_cleanup, vacuumlazy.c:2768
INFO: 00000: scanned index "multiple_indexes_pkey" to remove 50000 row versions
DETAIL: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
LOCATION: lazy_vacuum_one_index, vacuumlazy.c:3057
INFO: 00000: scanned index "multiple_indexes_other_key" to remove 50000 row versions
DETAIL: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s
LOCATION: lazy_vacuum_one_index, vacuumlazy.c:3057

second heap scan:

INFO: 00000: table "multiple_indexes": removed 50000 dead item identifiers in 271 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
LOCATION: lazy_vacuum_heap_rel, vacuumlazy.c:2384
INFO: 00000: table "multiple_indexes": found 50000 removable, 0 nonremovable row versions in 271 out of 271 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 748
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.02 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1668
...

So the two bigger indexes were scanned, but the small one was not.

The reason for this is that begin_parallel_vacuum() does:

    /*
     * Initialize variables for shared index statistics, set NULL bitmap and
     * the size of stats for each index.
     */
    memset(shared->bitmap, 0x00, BITMAPLEN(nindexes));
    for (int idx = 0; idx < nindexes; idx++)
    {
        if (!can_parallel_vacuum[idx])
            continue;

        /* Set NOT NULL as this index does support parallelism */
        shared->bitmap[idx >> 3] |= 1 << (idx & 0x07);
    }

which causes do_parallel_processing() to skip over the index, because parallel_stats_for() returns NULL due to checking that bitmap.

And do_serial_processing_for_unsafe_indexes() skips over the index because:

    /*
     * We're only here for the unsafe indexes
     */
    if (parallel_processing_is_safe(indrel, lvshared))
        continue;

but the index is actually not unsafe.

Which leaves us with a severely corrupted index. Unfortunately one that amcheck at this stage will not recognize as corrupted :(. It'll only get there later, if new rows for the same tids are inserted.

I haven't yet checked whether this is a bug introduced in 14, or whether it was possible to hit before as well.

Greetings,

Andres Freund
Hi,

On 2021-10-29 19:37:40 -0700, Andres Freund wrote:
> Amit, this looks to be a parallel vacuum related bug, see below.
>
> Whew, we found the bug, I think. Or at least one that can create exactly this situation.
> [explanation]
> I haven't yet checked whether this is a bug introduced in 14, or whether it was possible to hit before as well.

It looks like a v14 issue. I can't reproduce it in 13, and I think a change in

commit b4af70cb210393c9c8f41643acf6b213e21178e7
Author: Peter Geoghegan <pg@bowt.ie>
Date:   2021-04-05 13:21:44 -0700

    Simplify state managed by VACUUM.

which changed the logic for which relations are done in the leader:

-       /* Process the indexes skipped by parallel workers */
-       if (shared_indstats == NULL ||
-           skip_parallel_vacuum_index(Irel[i], lps->lvshared))
-           vacuum_one_index(Irel[i], &(vacrelstats->indstats[i]), lps->lvshared,
-                            shared_indstats, vacrelstats->dead_tuples,
-                            vacrelstats);
+       /* Skip already-complete indexes */
+       if (shared_istat != NULL)
+           continue;
+
+       indrel = vacrel->indrels[idx];
+
+       /*
+        * We're only here for the unsafe indexes
+        */
+       if (parallel_processing_is_safe(indrel, lvshared))
+           continue;

without realizing that the "shared_indstats == NULL || " piece is important to handle parallel-safe but too-small indexes correctly.

Going a bit further than this, ISTM that once we decide to use parallelism for any index, there's no point not using parallelism for all the parallel-safe indexes...

Greetings,

Andres Freund
On 30.10.2021 5:37, Andres Freund wrote:
> The problem is that parallel vacuum doesn't have correct handling for small indexes, when there are other indexes that are big enough to use parallelism.

Index sizes in use (pg_total_relation_size), btree only, with only the largest, the smallest, and those reported by pg_amcheck listed:

┌───────────────────────────────────┬──────────────┬────────────┐
│ index                             │ index_method │ index_size │
├───────────────────────────────────┼──────────────┼────────────┤
│ pl_backlinks_namespace            │ btree        │ 18 MB      │
│ pl_namespace                      │ btree        │ 16 MB      │
│ wl_user                           │ btree        │ 6416 kB    │
│ …                                 │ …            │ …          │
│ page_redirect_namespace_len       │ btree        │ 544 kB     │
│ …                                 │ …            │ …          │
│ page_len                          │ btree        │ 488 kB     │
│ …                                 │ …            │ …          │
│ transcode_key_idx                 │ btree        │ 192 kB     │
│ …                                 │ …            │ …          │
│ page_main_title                   │ btree        │ 144 kB     │
│ …                                 │ …            │ …          │
│ ir_type_value                     │ btree        │ 8192 bytes │
│ ipc_hex_time                      │ btree        │ 8192 bytes │
└───────────────────────────────────┴──────────────┴────────────┘

-- K. R.
On 30.10.2021 7:12, Andres Freund wrote:
>> Whew, we found the bug, I think. Or at least one that can create exactly this situation.
>> [explanation]
>> I haven't yet checked whether this is a bug introduced in 14, or whether it was possible to hit before as well.
>
> It looks like a v14 issue.
>
> Going a bit further than this, ISTM that once we decide to use parallelism for any index, there's no point not using parallelism for all the parallel-safe indexes...

Probably of use: on Andrew Gierth's suggestion I performed `VACUUM VERBOSE mediawiki.page` with the snapshot database:

azurlane_wiki=# vacuum verbose mediawiki.page;
INFO: vacuuming "mediawiki.page"
INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO: scanned index "page_pkey" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO: scanned index "page_random" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO: scanned index "page_redirect_namespace_len" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "name_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.02 s
INFO: scanned index "ts2_page_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO: table "page": removed 251 dead item identifiers in 215 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO: table "page": found 14 removable, 44250 nonremovable row versions in 876 out of 912 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.18 s.
INFO: vacuuming "pg_toast.pg_toast_19560"
INFO: table "pg_toast_19560": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Here is the list of indices associated with mediawiki.page:

┌────────────────┬─────────────────────────────┬──────────────┐
│ table          │ index                       │ index_method │
├────────────────┼─────────────────────────────┼──────────────┤
│ mediawiki.page │ page_pkey                   │ btree        │
│ mediawiki.page │ name_title                  │ btree        │
│ mediawiki.page │ page_len                    │ btree        │
│ mediawiki.page │ page_main_title             │ btree        │
│ mediawiki.page │ page_mediawiki_title        │ btree        │
│ mediawiki.page │ page_project_title          │ btree        │
│ mediawiki.page │ page_random                 │ btree        │
│ mediawiki.page │ page_redirect_namespace_len │ btree        │
│ mediawiki.page │ page_talk_title             │ btree        │
│ mediawiki.page │ page_user_title             │ btree        │
│ mediawiki.page │ page_utalk_title            │ btree        │
│ mediawiki.page │ ts2_page_title              │ gist         │
└────────────────┴─────────────────────────────┴──────────────┘

VACUUM lists page_redirect_namespace_len, however pg_amcheck still complains about it afterwards. page_len and page_main_title (and others) are not listed by VACUUM.

-- K. R.
On Fri, Oct 29, 2021 at 9:12 PM Andres Freund <andres@anarazel.de> wrote:
> It looks like a v14 issue. I can't reproduce it in 13, and I think a change in
>
> commit b4af70cb210393c9c8f41643acf6b213e21178e7
> Author: Peter Geoghegan <pg@bowt.ie>
> Date:   2021-04-05 13:21:44 -0700
>
>     Simplify state managed by VACUUM.
>
> which changed the logic for which relations are done in the leader:
> without realizing that the "shared_indstats == NULL || " piece is important to handle parallel-safe but too-small indexes correctly.

This was a simple blunder on my part. I didn't intend to change anything about the behavior of parallel VACUUM, and so this commit was just about the only one of mine from Postgres 14 that I didn't consider as a possibility, even once. Mea culpa.

Attached is a draft patch that fixes the problem.

Also attached is a second patch. This adds assertions to heap_index_delete_tuples() to catch cases where a heap TID in an index points to an LP_UNUSED item in the heap (which is what this bug looked like, mostly). It also checks for certain more or less equivalent inconsistencies: the case where a heap TID in an index points to a line pointer that's past the end of the heap page's line pointer array, and the case where a heap TID in an index points directly to a heap-only tuple.

Putting these new assertions in heap_index_delete_tuples() makes a lot of sense. The interlocking rules between an index and a heap relation are very simple here. They're also uniform across all supported index access methods (of those that support index tuple deletion, currently just nbtree, hash, and GiST). This is one of the very few code paths where there is a buffer lock (not just a pin) on the index page held throughout checking the heap; there is clearly no question of spurious assertion failures due to a concurrent vacuum.
Note that the index deletion stuff I worked on in Postgres 14 batches up TIDs in various ways, and so in practice heap_index_delete_tuples() will tend to pass over many of the TIDs from the index page, in many important cases (not so much with GiST and hash right now, but I expect that to change in the future).

I want to once again thank Kamigishi Rei for her tireless work on this. If she hadn't gone to the trouble of providing us with so much help, then it would probably have taken significantly longer to figure out what was going on here. I also want to thank Andres.

-- 
Peter Geoghegan
Attachments
On 30.10.2021 21:46, Peter Geoghegan wrote:
> Attached is a draft patch that fixes the problem.

Thank you! Going to install the patched version on the production system to see how it fares (and I expect good things).

Tried against the re-extracted snapshot, looks good (everything still needs reindexing, obviously, but that is beside the point right now):

azurlane_wiki=# vacuum verbose mediawiki.page;
INFO: vacuuming "mediawiki.page"
INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO: scanned index "page_len" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_main_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
INFO: scanned index "page_mediawiki_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_project_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_talk_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_user_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_utalk_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index "page_pkey" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s
INFO: scanned index "page_redirect_namespace_len" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.08 s
INFO: scanned index "page_random" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.14 s
INFO: scanned index "name_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.19 s
INFO: scanned index "ts2_page_title" to remove 251 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.04 s
INFO: table "page": removed 251 dead item identifiers in 215 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: index "page_pkey" now contains 46056 row versions in 131 pages
DETAIL: 190 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "name_title" now contains 46056 row versions in 330 pages
DETAIL: 211 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_len" now contains 51848 row versions in 61 pages
DETAIL: 681 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_main_title" now contains 3569 row versions in 18 pages
DETAIL: 147 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_mediawiki_title" now contains 15 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_project_title" now contains 1 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_random" now contains 46056 row versions in 127 pages
DETAIL: 251 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_redirect_namespace_len" now contains 47663 row versions in 68 pages
DETAIL: 265 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_talk_title" now contains 35 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_user_title" now contains 735 row versions in 6 pages
DETAIL: 6 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "page_utalk_title" now contains 44 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "ts2_page_title" now contains 46056 row versions in 211 pages
DETAIL: 251 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: table "page": found 14 removable, 44250 nonremovable row versions in 876 out of 912 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.50 s.
INFO: vacuuming "pg_toast.pg_toast_19560"
INFO: table "pg_toast_19560": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 2097658
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

-- 
K. R.
Hi, On 2021-10-30 11:46:22 -0700, Peter Geoghegan wrote: > Attached is a draft patch that fixes the problem. I think it probably is worth adding an error check someplace that verifies that problems of this kind will be detected with, uh, less effort. I think it'd also be good to add a test that specifically verifies that parallel vacuum doesn't have a bug around "parallel worthy" and not "parallel worthy" indexes. It's too easy a mistake to make, and because visible corruption is delayed, it's likely that we won't detect such cases. > Also attached is a second patch. This adds assertions to > heap_index_delete_tuples() to catch cases where a heap TID in an index > points to an LP_UNUSED item in the heap (which is what this bug looked > like, mostly). It also checks for certain more or less equivalent > inconsistencies: the case where a heap TID in an index points to a > line pointer that's past the end of the heap page's line pointer > array, and the case where a heap TID in an index points directly to a > heap-only tuple. ISTM that at least a basic version of this is worth doing as a check throwing an ERROR, rather than an assertion. It's hard to believe this'd be a significant portion of the cost of heap_index_delete_tuples(), and I think it would help catch problems a lot earlier. Greetings, Andres Freund
On Sat, Oct 30, 2021 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: > I think it probably is worth adding an error check someplace that verifies > that problems of this kind will be detected with, uh, less effort. The draft fix is something I wanted to get out quickly to signal that the bug is definitely going to be fixed soon. I'll need to carefully think about this some more on Monday, to make sure that it becomes much harder to break in roughly the same way once again. > I think it'd also be good to add a test that specifically verifies that > parallel vacuum doesn't have a bug around "parallel worthy" and not "parallel > worthy" indexes. It's too easy a mistake to make, and because visible > corruption is delayed, it's likely that we won't detect such cases. I agree. > ISTM that at least a basic version of this is worth doing as a check throwing > an ERROR, rather than an assertion. It's hard to believe this'd be a > significant portion of the cost of heap_index_delete_tuples(), and I think it > would help catch problems a lot earlier. I like the idea of doing that, but only on master. I think that we could do this as an ERROR, provided we avoid adding a new inner loop to heap_index_delete_tuples() --- we can definitely afford that. And we can even have a nice, detailed error message that blames a particular tuple from a specific index page, pointing to a specific heap TID. Separately, we should add an assertion that catches cases where a TID in the index points to an LP_REDIRECT line pointer, which does not point to a heap tuple with storage. That check has much less practical value, and necessitates adding a new inner loop, which is why an assert seems good enough to me. (The patch I've posted doesn't have any of this LP_REDIRECT business, but my v2 revision will.) -- Peter Geoghegan
On 2021-10-30 15:11:18 -0700, Peter Geoghegan wrote: > On Sat, Oct 30, 2021 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: > > I think it probably is worth adding an error check someplace that verifies > > that problems of this kind will be detected with, uh, less effort. > > The draft fix is something I wanted to get out quickly to signal that > the bug is definitely going to be fixed soon. I'll need to carefully > think about this some more on Monday, to make sure that it becomes > much harder to break in roughly the same way once again. Makes sense. > > ISTM that at least a basic version of this is worth doing as a check throwing > > an ERROR, rather than an assertion. It's hard to believe this'd be a > > significant portion of the cost of heap_index_delete_tuples(), and I think it > > would help catch problems a lot earlier. > > I like the idea of doing that, but only on master. Hm. I wonder if it's not actually good to do something like it in 14, given that we know of a path to have corrupted indexes out there. > Separately, we should add an assertion that catches cases where a TID > in the index points to an LP_REDIRECT line pointer, which does not > point to a heap tuple with storage. Is it actually guaranteed that that's impossible to happen and wasn't ever possible? It's not obvious to me that a LP_REDIRECT pointing to an LP_DEAD tuple would be a real problem. Greetings, Andres Freund
On Sat, Oct 30, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote: > Hm. I wonder if it's not actually good to do something like it in 14, given > that we know of a path to have corrupted indexes out there. My concern is that people might not be too happy about errors that occur due to corruption in tuples that are only vaguely related to their own incoming inserted tuple. > > Separately, we should add an assertion that catches cases where a TID > > in the index points to an LP_REDIRECT line pointer, which does not > > point to a heap tuple with storage. > > Is it actually guaranteed that that's impossible to happen and wasn't ever > possible? It's not obvious to me that a LP_REDIRECT pointing to an LP_DEAD > tuple would be a real problem. As far as I know an LP_REDIRECT item can never point to an LP_DEAD item. As far as I know it has worked that way since HOT first went in. There certainly seems to be far fewer problems with a rule that says it can never happen, and so I think we should introduce that rule (or clear things up, if you prefer). ISTM that this is all about TID stability for indexes. The only reason we have LP_DEAD items in heap pages is to have something that reliably tells index scans that the heap tuple they're looking for is logically dead (and so we can't recycle TIDs except during VACUUM). Similarly, the only reason we have LP_REDIRECT items is so that index scans have lightweight stable forwarding information from the heap page that contains the HOT chain. The advantage of LP_REDIRECT items is that they allow pruning to avoid keeping around the original heap tuple (the tuple that was not a heap-only tuple) after pruning. Also, pruning doesn't have to "merge line pointers" such that the new first member of a HOT chain has the same page offset number as the original first member had -- imagine how messy that would have to be. 
VACUUM only ever deletes index tuples from indexes when their pointed-to TIDs were found to be LP_DEAD stub line pointers in the first heap pass. Such an LP_DEAD item represents a (former) whole HOT chain, which could just be a "single tuple degenerate HOT chain", that never actually had any heap-only tuples (say because all UPDATEs for the table will modify indexed columns). If an LP_REDIRECT item points to an LP_DEAD item, then what is VACUUM supposed to do about it when it comes time to vacuum indexes? Which TID is it supposed to delete from indexes? The LP_DEAD item, the LP_REDIRECT item that points to the LP_DEAD item, or both? While it's okay if the link from a tuple header in a HOT chain points to an LP_UNUSED item (that just indicates that the chain is "broken" at that point), it's not okay if a link from an LP_REDIRECT line pointer points to an LP_UNUSED item -- that's virtually the same condition as having a TID from an index point directly to an LP_UNUSED item, which is of course always wrong. We can do no "last tuple's xmax == current tuple's xmin" validation during chain traversal when the "last tuple" was actually just an LP_REDIRECT item. And so we need another rule for LP_REDIRECT items, to compensate. That rule seems to be: HOT chains cannot be allowed to "break" between the LP_REDIRECT item and the first tuple with storage. The only way it could be okay for an LP_REDIRECT item to point to an LP_DEAD item would be if you knew for sure that the LP_REDIRECT item would actually become LP_DEAD at the same time as the LP_DEAD item (so both get removed from indexes) -- which is a contradiction in terms. Why wouldn't pruning just mark the LP_REDIRECT item LP_DEAD instead, while making the would-be LP_DEAD item skip straight to being an LP_UNUSED item? That approach is strictly better. It just makes sense, which leads me to believe that we must have always done it that way. 
It would be nice to be able to say for sure that we have a simple rule: "marking a heap-only tuple LP_DEAD is always not just unnecessary, but wrong, because LP_DEAD items in heap pages are supposed to have a 1:1 mapping with dead index tuples". -- Peter Geoghegan
On Sun, Oct 31, 2021 at 5:19 PM Peter Geoghegan <pg@bowt.ie> wrote: > The only way it could be okay for an LP_REDIRECT item to point to an > LP_DEAD item would be if you knew for sure that the LP_REDIRECT item > would actually become LP_DEAD at the same time as the LP_DEAD item (so > both get removed from indexes) -- which is a contradiction in terms. > Why wouldn't pruning just mark the LP_REDIRECT item LP_DEAD instead, > while making the would-be LP_DEAD item skip straight to being an > LP_UNUSED item? That approach is strictly better. Attached patch adds assertions and comments to heap_page_prune_execute() that document my understanding of things. This passes "make check-world" for me. How do you feel about doing something like this too? -- Peter Geoghegan
Attachments
> Attached patch adds assertions and comments to
> heap_page_prune_execute() that document my understanding of things.
> This passes "make check-world" for me.
> How do you feel about doing something like this too?
I've applied patch v1, built with assertion checking. Installcheck-world passes on my machine (OSX) too.
On Sun, Oct 31, 2021 at 7:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > Attached patch adds assertions and comments to > heap_page_prune_execute() that document my understanding of things. > This passes "make check-world" for me. Attached is a more worked out fix for the bug. It includes a test case that will lead to an assertion failure in nbtdedup.c (when the fix itself is absent). The test is loosely based on one from Masahiko's patch on the dedicated parallel VACUUM thread. The testing here isn't quite as extensive as we talked about, but only because it's hard to do that within the confines of the current design, per my remarks earlier on the other thread. It's not like we have long to go before 14.1, and so I'd like to commit something like this in the next couple of days. I've also combined my heap pruning assertion patch (from the other thread) with the heap_index_delete_tuples() assertions we talked about on this thread -- they're very similar, and so might as well just be handled as a single commit. I can delay committing this second patch until I hear back about the details. Note that there still aren't any new defensive ERRORs here, for now. -- Peter Geoghegan
Attachments
On Tue, Nov 2, 2021 at 9:14 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sun, Oct 31, 2021 at 7:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Attached patch adds assertions and comments to
> > heap_page_prune_execute() that document my understanding of things.
> > This passes "make check-world" for me.
>
> Attached is a more worked out fix for the bug. It includes a test case
> that will lead to an assertion failure in nbtdedup.c (when the fix
> itself is absent). The test is loosely based on one from Masahiko's
> patch on the dedicated parallel VACUUM thread.

Thank you for updating the patch. The parallel vacuum fix looks good to me.

Regarding the regression tests:

+ * XXX: Actually, there might be no statistics at all -- index AMs
+ * sometimes return NULL from amvacuumcleanup. That makes it hard to
+ * assert that we really called amvacuumcleanup in the leader from
+ * here.

That's true. We cannot know whether indexes were actually vacuumed by parallel workers by checking the shared index stats.

While I think it's a good idea to rely on an assertion in nbtdedup.c, I'm a bit concerned about depending on another module to detect a vacuum bug. If we need to change the nbtdedup.c code in the future, the premise that the parallel vacuum bug leads to the assertion failure might no longer hold. That might not happen in back branches, though.

The problem with this bug is that the leader process fails to vacuum indexes that must be processed by the leader. So, another idea (though not a comprehensive approach) to detect this issue would be to check, at the end of do_serial_processing_for_unsafe_indexes(), that the leader has processed all indexes that must be processed by the leader. That is, we check that all indexes whose entries in will_parallel_vacuum[] are false, as well as all unsafe indexes, have been processed by the leader.
> I've also combined my heap pruning assertion patch (from the other > thread) with the heap_index_delete_tuples() assertions we talked about > on this thread -- they're very similar, and so might as well just be > handled as a single commit. I can delay committing this second patch > until I hear back about the details. Note that there still aren't any > new defensive ERRORs here, for now. I'll look at the second patch. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Mon, Nov 1, 2021 at 9:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch. The parallel vacuum fix looks good
> to me.

I pushed the same patch earlier (with small tweaks).

> We cannot know whether indexes were actually vacuumed by parallel
> workers by checking the shared index stats. While I think it's a good
> idea to rely on an assertion in nbtdedup.c, I'm a bit concerned about
> depending on another module to detect a vacuum bug.

I agree that it's not ideal that we're relying on the nbtdedup.c assertion failure, but there are lots of ways that the test could fail (with assertions enabled). We're relying on nbtdedup.c, but it's not too delicate in practice.

> The problem with this bug is that the leader process fails to vacuum
> indexes that must be processed by the leader.

I think that the underlying problem is that there is too much indirection, for no real benefit. A more explicit state machine approach would make testing easy, and would minimize divergence between parallel and serial VACUUM. I think that it would be easier and less risky overall to do this than it would be to have comprehensive testing of the existing design.

As I said on the other thread, we should decide *what we're allowed to do* up front, in the leader, probably when compute_parallel_vacuum_workers() is called. After that point, parallel workers (and the parallel leader) don't have to care about *why* the required/allowed work is a certain way -- they just need to actually do the work. We can explicitly remember "can [or cannot] have ambulkdelete() performed by parallel worker", and "can [or cannot] have amvacuumcleanup() performed by parallel worker" as per-index immutable state, stored in shared memory. This immutable state is a little like a query plan.
We can also have mutable state that describes the current status of each index, like "still needs vacuuming" and "still needs cleanup" -- this tracks the overall progress for each index, plus maybe extra instrumentation. It will then be easy to notice when invalid state transitions seem to be necessary -- just throw an error at that point. Also, we can directly check "Does what I asked for match the work that parallel workers said they completed using the same data structure?" at the end of each parallel VACUUM. That nails things down.

BTW, do we really need separate do_serial_processing_for_unsafe_indexes() and do_parallel_processing() functions? I think that we could just have one event loop function if it was structured as a state machine. Parallel workers could see indexes that can only be processed by the leader, but that would be okay -- they'd understand not to touch those indexes (without caring about the true reason).

Right now we really have two separate ambulkdelete (!for_cleanup) and amvacuumcleanup (for_cleanup) parallel operations -- not a single parallel VACUUM operation. That may be part of the problem. It would also be a good idea to have just one parallel VACUUM operation, and to handle the transition from ambulkdelete to amvacuumcleanup using state transitions. That's more ambitious than adding more state to shared memory, but it would be nice if workers (or the leader acting like a worker) could just consume "a unit of work" in any order that's convenient (as long as it's also in an order that's correct, based on state transition rules like "go from ambulkdelete directly to amvacuumcleanup, but only when we know that it's the last call to ambulkdelete for the VACUUM operation as a whole").

> So, another idea (though not a comprehensive approach) to detect this
> issue would be to check, at the end of
> do_serial_processing_for_unsafe_indexes(), that the leader has
> processed all indexes that must be processed by the leader. That is,
> we check that all indexes whose entries in will_parallel_vacuum[] are
> false, as well as all unsafe indexes, have been processed by the
> leader.

I don't think that that would help very much. It still spreads out knowledge of what is supposed to happen, rather than concentrating that knowledge into just one function.

-- 
Peter Geoghegan
On Sun, Oct 31, 2021 at 5:19 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Sat, Oct 30, 2021 at 3:18 PM Andres Freund <andres@anarazel.de> wrote: > > Hm. I wonder if it's not actually good to do something like it in 14, given > > that we know of a path to have corrupted indexes out there. > > My concern is that people might not be too happy about errors that > occur due to corruption in tuples that are only vaguely related to > their own incoming inserted tuple. Attached patch adds ERRORs in the event of detecting index-tuple-TID-points-to-LP_UNUSED conditions, as well as other similar conditions -- not just assertions, as before. I do think that this would be a good idea on HEAD. Still have reservations about doing that for 14, but you're welcome to try and change my mind. Thoughts? -- Peter Geoghegan
Attachments
Hi,

On 2021-11-02 19:03:05 -0700, Peter Geoghegan wrote:
> Attached patch adds ERRORs in the event of detecting
> index-tuple-TID-points-to-LP_UNUSED conditions, as well as other
> similar conditions -- not just assertions, as before.
>
> I do think that this would be a good idea on HEAD. Still have
> reservations about doing that for 14, but you're welcome to try and
> change my mind.

I'm not quite sure either - I'm worried about follow-on corruption based on the bug. That'll be much more likely to be detected with something like this in place. But it's also easy to get some subtlety wrong...

> +static inline void
> +index_delete_check_htid(TM_IndexDeleteOp *delstate,
> +                        Page page, OffsetNumber maxoff,
> +                        ItemPointer htid, TM_IndexStatus *istatus)
> +{
> +    OffsetNumber indexpagehoffnum = ItemPointerGetOffsetNumber(htid);
> +    ItemId       iid;
> +    HeapTupleHeader htup;
> +
> +    Assert(OffsetNumberIsValid(istatus->idxoffnum));
> +
> +    if (indexpagehoffnum > maxoff)
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INDEX_CORRUPTED),
> +                 errmsg_internal("heap tid from index tuple (%u,%u) points past end of heap page line pointer array at offset %u of block %u in index \"%s\"",
> +                                 ItemPointerGetBlockNumber(htid),
> +                                 indexpagehoffnum,
> +                                 istatus->idxoffnum, delstate->iblknum,
> +                                 RelationGetRelationName(delstate->irel))));
> +    iid = PageGetItemId(page, indexpagehoffnum);
> +    if (!ItemIdIsUsed(iid))
> +        ereport(ERROR,
> +                (errcode(ERRCODE_INDEX_CORRUPTED),
> +                 errmsg_internal("heap tid from index tuple (%u,%u) points to unused heap page item at offset %u of block %u in index \"%s\"",
> +                                 ItemPointerGetBlockNumber(htid),
> +                                 indexpagehoffnum,
> +                                 istatus->idxoffnum, delstate->iblknum,
> +                                 RelationGetRelationName(delstate->irel))));
> +
> +    if (ItemIdHasStorage(iid))
> +    {
> +        Assert(ItemIdIsNormal(iid));
> +        htup = (HeapTupleHeader) PageGetItem(page, iid);
> +
> +        if (HeapTupleHeaderIsHeapOnly(htup))
> +            ereport(ERROR,
> +                    (errcode(ERRCODE_INDEX_CORRUPTED),
> +                     errmsg_internal("heap tid from index tuple (%u,%u) points to heap-only tuple at offset %u of block %u in index \"%s\"",
> +                                     ItemPointerGetBlockNumber(htid),
> +                                     indexpagehoffnum,
> +                                     istatus->idxoffnum, delstate->iblknum,
> +                                     RelationGetRelationName(delstate->irel))));
> +    }

I'd make the error paths unlikely().

> diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
> index db6912e9f..43549be04 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -844,39 +844,115 @@ heap_page_prune_execute(Buffer buffer,
>  {
>      Page        page = (Page) BufferGetPage(buffer);
>      OffsetNumber *offnum;
> -    int         i;
> +    HeapTupleHeader htup PG_USED_FOR_ASSERTS_ONLY;
>
>      /* Shouldn't be called unless there's something to do */
>      Assert(nredirected > 0 || ndead > 0 || nunused > 0);
>
>      /* Update all redirected line pointers */
>      offnum = redirected;
> -    for (i = 0; i < nredirected; i++)
> +    for (int i = 0; i < nredirected; i++)
>      {
>          OffsetNumber fromoff = *offnum++;
>          OffsetNumber tooff = *offnum++;
>          ItemId       fromlp = PageGetItemId(page, fromoff);
> +        ItemId       tolp PG_USED_FOR_ASSERTS_ONLY;
> +
> +#ifdef USE_ASSERT_CHECKING

These I'd definitely only do in HEAD.

> +        /*
> +         * The existing items that we set as redirects must be the first tuple
> +         * of a HOT chain that has not been pruned before now (can't be a
> +         * heap-only tuple) when the target item has tuple storage. When the
> +         * item has no storage, then we must just be maintaining the
> +         * LP_REDIRECT of a HOT chain that has been pruned at least once
> +         * before now.
> +         */
> +        if (!ItemIdIsRedirected(fromlp))
> +        {
> +            Assert(ItemIdHasStorage(fromlp) && ItemIdIsNormal(fromlp));
> +
> +            htup = (HeapTupleHeader) PageGetItem(page, fromlp);
> +            Assert(!HeapTupleHeaderIsHeapOnly(htup));
> +        }

I was a bit confused because I initially read this as saying that the tuple redirected *to* can't be a heap-only tuple. Perhaps this could be rephrased a bit ("redirecting item"?).

Greetings,

Andres Freund
On Thu, Nov 4, 2021 at 3:07 PM Andres Freund <andres@anarazel.de> wrote: > I'm not quite sure either - I'm worried about followon corruption based on the > bug. That'll be much more likely be detected with something like this in > place. True, but the additional hardening I added to nbtree in commit a5213adf is definitely worth something. Sure, tools like amcheck disappointed when it came to detecting the parallel VACUUM bug -- I don't deny that. To some degree the nbtree side hardening will be a little like that too. But let's not overstate it; pg_amcheck was still pretty useful. It has a decent chance of detecting that there are certain problems with one or more indexes. > But it's also easy to get some subtlety wrong... I'm not worried about incorrectly identifying TIDs as corrupt in this hardening patch. I'm more worried about the patch throwing ERRORs in cases where it's arguably not really necessary -- which is not a problem for the nbtree side hardening we have already. With this new hardening, the insert or update that we end up throwing an error for might be pretty far removed from the ERROR. And so it seems like something that would be better left till Postgres 15. Of course I don't believe for a second that ignorance is bliss. At the same time, I don't want to act rashly. > I'd make the error paths unlikely(). I was under the impression that all ERRORs were unlikely(), in effect. At least following David Rowley's work in commit 913ec71d -- the "ereport(ERROR) pg_attribute_cold() shim" patch. But now that I think about it some more, I imagine that compilers are free to interpret these any way they want. They're certainly not officially equivalent. And so it couldn't hurt to add some unlikely()s, for good luck. This is a pretty tight, performance sensitive loop, after all. > These I'd definitely only do in HEAD. Works for me. I see the changes to pruneheap.c as developer documentation more than anything else. 
It's a nice way to explain what is possible, without leaving much room for misunderstanding.

> I was a bit confused because I initially read this as saying that the
> tuple redirected *to* can't be a heap-only tuple. Perhaps this could
> be rephrased a bit ("redirecting item"?).

That makes sense. What do you think of this?:

/*
 * The item that we're about to set as an LP_REDIRECT (the 'from'
 * item) must point to an existing item (the 'to' item) that is
 * already a heap-only tuple. There can be at most one LP_REDIRECT
 * item per HOT chain.
 *
 * We need to keep around an LP_REDIRECT item (after original
 * non-heap-only root tuple gets pruned away) so that it's always
 * possible for VACUUM to easily figure out what TID to delete from
 * indexes when an entire HOT chain becomes dead. A heap-only tuple
 * can never become LP_DEAD; an LP_REDIRECT item or a regular heap
 * tuple can.
 */
tolp = PageGetItemId(page, tooff);
Assert(ItemIdHasStorage(tolp) && ItemIdIsNormal(tolp));
htup = (HeapTupleHeader) PageGetItem(page, tolp);
Assert(HeapTupleHeaderIsHeapOnly(htup));

(I changed the first paragraph, not the second. Just showing everything together for your convenience.)

-- 
Peter Geoghegan
On 2021-11-04 15:57:42 -0700, Peter Geoghegan wrote: > That makes sense. What do you think of this?: WFM.
On Thu, Nov 4, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
> WFM.

Okay, everything has been pushed. Thanks

--
Peter Geoghegan
On Sun, Nov 7, 2021 at 4:28 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> The index in one case is now 448 kB, in the other case is 632 kB, so I suspect
> they're due to this bug (for the 2nd one, a new index created from scratch with
> the same definition is 144 kB).

That seems quite likely. If you REINDEX, everything should be okay (barring unique violations, perhaps).

> I tried looking for duplicate TIDs like this (which I wrote to search every
> suspect table before I found the above issue) - I'm not sure this is right,
> though.

The problem that we have with detecting this in amcheck is that it doesn't detect duplicate TIDs, except when they happen to appear in fully duplicate index tuples. In practice this means that amcheck has a decent chance of detecting some problems, but cannot reliably detect problems caused by this particular bug. Even with the heapallindexed stuff.

> But this doesn't find any issue in that index. If I'm not wrong, "duplicate
> TIDs" are just one likely symptom of the bug, but not its only consequence.
> I donno how the TIDs can get out of order, though?

They don't. It's just that amcheck considers two adjacent duplicate TIDs to involve a "misplaced TID" (in a posting list), which is just a catch-all term. It's true that you could in theory see the same error message with a posting list whose TIDs were in entirely the wrong order, but there is no evidence that that ever happened as a result of this bug -- and there is no reason to even suspect it. Not that it makes much difference, mind you -- corruption is corruption.

> I thought you might like to know about another instance of this in the wild.
> I'll disable those indexes and leave them around for a bit in case there's any
> question about them.

Thanks, but there is no further need for this.

--
Peter Geoghegan
On Fri, Oct 29, 2021 at 1:10 AM Alexander Kukushkin <cyberdemn@gmail.com> wrote:
> - Each cluster produces ~3TB of WAL every day (plenty of UPDATEs, about 90% of which are HOT updates).
>
> Corruption was found on all shards, but the list of affected indexes a bit varies from shard to shard.
>
> Database schema:
> - mostly PRIMARY or UNIQUE keys
> - a couple of non-unique btree indexes
> - plenty of foreign keys

You have said elsewhere that you're sure that this isn't the parallel VACUUM bug, since you know that you didn't run a manual VACUUM, even once. So I wonder what the issue might be. Since you deleted duplicate rows from a unique index, there probably weren't very many affected rows in total. It sounds like a pretty subtle issue to me (particularly compared to the parallel VACUUM bug, which wasn't all that subtle when it hit at all).

If I had to guess, I'd guess that it has something to do with the snapshot scalability work. Specifically, a recently reported issue involving confusion about the structure of HOT chains during pruning:

https://www.postgresql.org/message-id/flat/20211110192010.ckvfzz352hsba5xf%40alap3.anarazel.de#4c3d9c9988164f5ea3c15999bcf50ce7

Please join in on the other thread if you have anything more to add. I could easily be wrong about that, though. You upgraded using pg_upgrade, right? That's certainly a confounding factor here.

--
Peter Geoghegan
Hi Peter,
On Fri, 12 Nov 2021 at 03:15, Peter Geoghegan <pg@bowt.ie> wrote:
> You have said elsewhere that you're sure that this isn't the parallel
> VACUUM bug, since you know that you didn't run a manual VACUUM, even
> once. So I wonder what the issue might be. Since you deleted duplicate
> rows from a unique index, there probably weren't very many affected
> rows in total.
It is a sharded setup, 16 shards in total. The table structure is similar between shards.
Also, more or less the same tables were affected across all shards.
Depending on the table, the number of affected rows ranged from one to a few hundred.
The picture looked similar across all shards, although the set of affected tables varied a bit from shard to shard.
I resolved the first duplicates manually, but once the pattern became clear I simply wrote queries that removed everything but either the first or the last version of each row, based on the timestamp.
> It sounds like a pretty subtle issue to me
> (particularly compared to the parallel VACUUM bug, which wasn't all
> that subtle when it hit at all).
>
> If I had to guess, I'd guess that it has something to do with the
> snapshot scalability work. Specifically, a recently reported issue
> involving confusion about the structure of HOT chains during pruning:
>
> https://www.postgresql.org/message-id/flat/20211110192010.ckvfzz352hsba5xf%40alap3.anarazel.de#4c3d9c9988164f5ea3c15999bcf50ce7
>
> Please join in on the other thread if you have anything more to add.
Unfortunately, not really. I only ran bt_index_check(index => r.oid, heapallindexed => true), because bt_index_parent_check takes a ShareLock on the index, which blocks DML, and we have a lot of DML.
All errors looked similar:

ERROR: heap tuple (314222,250) from table "${table_name}" lacks matching index tuple within index "${table_name}_pkey"
HINT: Retrying verification using the function bt_index_parent_check() might provide a more specific error.
CONTEXT: SQL statement "SELECT bt_index_check(index => r.oid, heapallindexed => true)"
> I could easily be wrong about that, though. You upgraded using
> pg_upgrade, right? That's certainly a confounding factor here.
Correct: pg_upgrade --link, from 9.6 to 14. Any other upgrade method is too resource- and time-consuming.
--
Alexander Kukushkin