pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

Поиск
Список
Период
Сортировка
От Nikolay Samokhvalov
Тема pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
Дата
Msg-id CAM527d8heqkjG5VrvjU3Xjsqxg41ufUyabD9QZccdAxnpbRH-Q@mail.gmail.com
обсуждение исходный текст
Ответы Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
Hi!

(posting this to -hackers rather than to -docs since it seems a deeper problem than just adjusting the docs)

I recently observed a case with standby corruption after upgrading pg12 to pg14, which was presented in the form of XX001 errors on the new cluster's standby nodes. e.g.:
  ERROR: missing chunk number 0 for toast value 3228893903 in pg_toast_79504413

Comparing the content of the data directory and checking files with md5sum, I noticed that some files for some TOAST index have different content on new standby nodes compared to the new primary – and interesting was the fact all standbys had the same content. Just different compared to the primary.

We used the "rsync --size-only" snippet from the docs https://www.postgresql.org/docs/current/pgupgrade.html to upgrade standbys.

With "--size-only", 1 GiB files for tables and indexes obviously cannot be reliably synchronized. In our case, we perform additional steps involving logical replication, advancing primary to certain LSN position -- and during that, we keep standbys down. This explains the increased corruption risks. But I think these risks are present for those who just follow the steps in the docs as is, and probably some fixes or improvements are needed here.

The main question: why do we consider "rsync --size-only" as reliable in the general case? May standby corruption happen if we use follow steps from https://www.postgresql.org/docs/current/pgupgrade.html?

Considering several general situations:
1. For streaming replication:
    a. if we shut down the primary first, based on the code in walsender.c defining how shutdown even is handled, replicas should receive all the changes?
    b. if shut down standbys first (might be preferred if we run cluster under Patroni control, to avoid unnecessary failover), then some changes from the primary won't be received by standbys – and we do have standby corruption risks 
2. For replication based on WAL shipping, I don't think we can guarantee that all changes are propagated to standbys.

The docs also have this:

> 9. Prepare for standby server upgrades 
> If you are upgrading standby servers using methods outlined in section Step 11, verify that the old standby servers are caught up by running pg_controldata against the old primary and standby clusters. Verify that the “Latest checkpoint location” values match in all clusters. (There will be a mismatch if old standby servers were shut down before the old primary or if the old standby servers are still running.) Also, make sure wal_level is not set to minimal in the postgresql.conf file on the new primary cluster.

– admitting that there might be mismatch. But if there is mismatch, rsync --size-only is not going to help synchronize properly, right?

I was thinking about how to improve here, some ideas:
- "rsync --checksum" doesn't seem to be a good idea, it's, unfortunately, very, very slow, though it would be the most reliable approach (but since it's slow, I guess it's not worth even mentioning, crossing this out)
- we could remove "--size-only" and rely on default rsync behavior – checking size and modification time; but how reliable would it be in general case?
- make the step verifying “Latest checkpoint location” *after* shutting down all nodes as mandatory, with instructions on how to avoid mismatch: e.g., shut down primary first, disabling automated failover software, if any, then run pg_controldata on standbys while they are running, and on primary while it's already shut down (probably, different instructions are needed for WAL shipping and streaming cases) 
- probably, we should always run "rsync --checksum" for pg_wal 
- I think, it's time to provide a snippet to run "rsync" in multiple threads. A lot of installations today have many vCPUs and fast SSDs, and running single-threaded rsync seems to be very slow (especially if we do need to move away from "--size-only"). If it makes sense, I could come up with some patch proposal for the docs
-  it's probably time to implement support for standby upgrade in pg_upgrade itself, finding some way to take care of standbys and moving away from the need to run rsync or to rebuild standby nodes? Although, this is just a raw idea without a proper proposal yet.

Does this make sense or I'm missing something and the current docs describe a reliable process? (As I said, we have deviated from the process, to involve logical replication, so I'm not 100% sure I'm right suspecting the original procedure in having standby corruption risks.)

Thanks,
Nikolay Samokhvalov
Founder, Postgres.ai

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Meson build updates
Следующее
От: "Tristan Partin"
Дата:
Сообщение: Re: Meson build updates