Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?

From Nikolay Samokhvalov
Subject Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?
Date
Msg-id CAM527d8Rzbz1HkA_ptiWPYdJyCmBPJ=UZfHhg13Myv=5ANr1Dg@mail.gmail.com
In reply to Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?  (Bruce Momjian <bruce@momjian.us>)
Responses Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?  (Bruce Momjian <bruce@momjian.us>)
Re: pg_upgrade instructions involving "rsync --size-only" might lead to standby corruption?  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers


On Fri, Jun 30, 2023 at 14:33 Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Jun 30, 2023 at 04:16:31PM -0400, Robert Haas wrote:
> > I'm not quite clear on how Nikolay got into trouble here. I don't
> > think I understand under exactly what conditions the procedure is
> > reliable and under what conditions it isn't. But there is no way in
> > heck I would ever advise anyone to use this procedure on a database
> > they actually care about. This is a great party trick or something to
> > show off in a lightning talk at PGCon, not something you ought to be
> > doing with valuable data that you actually care about.
>
> Well, it does get used, and if we remove it perhaps we can have it on
> our wiki and point to it from our docs.

In my case, we performed some additional writes on the primary before running "pg_upgrade -k", and we did so *after* we had shut down all the standbys. So those changes were not replicated, and then "rsync --size-only" ignored them. (By the way, that cluster has wal_log_hints=on to allow Patroni to run pg_rewind when needed.)

But this can happen to anyone who follows the procedure from the docs as is and doesn't take any additional steps, because in step 9, "Prepare for standby server upgrades":

1) there is no requirement to shut down the nodes in a specific order
   - "Streaming replication and log-shipping standby servers can remain running until a later step" should probably be changed to a requirement, such as "keep them running"

2) checking the latest checkpoint position with pg_controldata now looks like a good thing to do, but its purpose is unclear: it does not seem to be used to support any decision
   - "There will be a mismatch if old standby servers were shut down before the old primary or if the old standby servers are still running" should probably be rephrased to say that if there is a mismatch, it's a big problem
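
To make the checkpoint comparison an actual go/no-go decision, something like the following could be used. This is only a rough sketch, not part of the documented procedure: the function names are made up for illustration, and it assumes pg_controldata is on PATH and prints English labels (hence LC_ALL=C), with the line "Latest checkpoint location" holding an LSN such as 0/3000028.

```python
import os
import re
import subprocess

# Matches the LSN on the "Latest checkpoint location" line of pg_controldata output.
CHECKPOINT_RE = re.compile(
    r"^Latest checkpoint location:\s*([0-9A-Fa-f]+/[0-9A-Fa-f]+)\s*$",
    re.MULTILINE,
)


def latest_checkpoint_location(controldata_output: str) -> str:
    """Extract the latest checkpoint LSN (e.g. '0/3000028') from pg_controldata text."""
    match = CHECKPOINT_RE.search(controldata_output)
    if match is None:
        raise ValueError("no 'Latest checkpoint location' line found")
    return match.group(1)


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN of the form 'X/Y' (two hex numbers) into a single integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def read_controldata(pgdata: str) -> str:
    """Run pg_controldata against a cleanly shut down data directory (hypothetical helper)."""
    env = {**os.environ, "LC_ALL": "C"}  # assumption: force English labels
    result = subprocess.run(
        ["pg_controldata", pgdata],
        check=True, capture_output=True, text=True, env=env,
    )
    return result.stdout


def mismatched_standbys(primary_output: str, standby_outputs: dict) -> list:
    """Return the names of standbys whose latest checkpoint differs from the primary's."""
    primary_lsn = lsn_to_int(latest_checkpoint_location(primary_output))
    return sorted(
        name
        for name, output in standby_outputs.items()
        if lsn_to_int(latest_checkpoint_location(output)) != primary_lsn
    )
```

If mismatched_standbys() returns a non-empty list, the idea would be to stop and not proceed to the rsync step for those standbys.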

So, following the steps as is, if some writes on the primary are not replicated (for whatever reason) before pg_upgrade -k and rsync --size-only are executed, those writes are going to be silently lost on the standbys.
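
To illustrate why such writes go unnoticed, here is a toy model of a size-only comparison. It is not rsync itself, just a sketch of the same decision rule: the file names are hypothetical, and the 8192-byte page size is the PostgreSQL default. An in-place change inside a page leaves the file size untouched, so a size-only check treats the two copies as identical.

```python
import hashlib
import os
import tempfile

PAGE_SIZE = 8192  # default PostgreSQL block size


def same_by_size(path_a: str, path_b: str) -> bool:
    """Simplified model of a size-only check: files 'match' if their sizes are equal."""
    return os.path.getsize(path_a) == os.path.getsize(path_b)


def sha256_of(path: str) -> str:
    """Hash the full file contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def same_by_content(path_a: str, path_b: str) -> bool:
    """Compare the actual bytes of both files via their SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)


def demo() -> tuple:
    """Create a 'standby' file and a 'primary' file that differ only inside one page."""
    with tempfile.TemporaryDirectory() as tmp:
        primary = os.path.join(tmp, "primary_relfile")  # hypothetical file names
        standby = os.path.join(tmp, "standby_relfile")

        page = bytes(PAGE_SIZE)
        with open(standby, "wb") as f:
            f.write(page)

        # Simulate a write on the primary that was never replicated:
        # the bytes change, the file length does not.
        changed = bytearray(page)
        changed[100:105] = b"tuple"
        with open(primary, "wb") as f:
            f.write(changed)

        return same_by_size(primary, standby), same_by_content(primary, standby)


if __name__ == "__main__":
    size_match, content_match = demo()
    print("size-only says identical:", size_match)
    print("contents really identical:", content_match)
```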

I wonder: if we ensure that the standbys are fully caught up before upgrading the primary, and we check the latest checkpoint positions, are we good to use "rsync --size-only", or are there still some concerns? It seems safe to me, but maybe I'm missing something.
--

Thanks,
Nikolay Samokhvalov
Founder, Postgres.ai
