Re: Race condition with restore_command on streaming replica
From | Dilip Kumar |
---|---|
Subject | Re: Race condition with restore_command on streaming replica |
Date | |
Msg-id | CAFiTN-t7-1TOF-x2bc1h1CMyVqzJxKhzcMgsQxWqxAi2nN0gdw@mail.gmail.com |
In reply to | Race condition with restore_command on streaming replica ("Brad Nicholson" <bradn@ca.ibm.com>) |
Responses | RE: Race condition with restore_command on streaming replica |
List | pgsql-general |
On Thu, Nov 5, 2020 at 1:39 AM Brad Nicholson <bradn@ca.ibm.com> wrote:
>
> Hi,
>
> I've recently been seeing a race condition with the restore_command on replicas using streaming replication.
>
> On the primary, we are archiving WAL files to S3-compatible storage via pgBackRest. In the recovery.conf section of the postgresql.conf file on the replicas, we define the restore command as follows:
>
> restore_command = '/usr/bin/pgbackrest --config /conf/postgresql/pgbackrest_restore.conf --stanza=formation archive-get %f "%p"'
>
> We have a three-member setup: m-0, m-1, m-2. Consider the case where m-0 is the primary and m-1 and m-2 are replicas connected to m-0.
>
> When issuing a switchover (via Patroni) from m-0 to m-1, the connection from m-2 to m-0 is terminated. The restore_command on m-2 is run, and it looks for the .history file for the new timeline. If this happens before the history file is created and pushed to the archive, m-2 will look for the next WAL file on the existing timeline in the archive. It will never be created as the source has moved on, so m-2 hangs waiting on that file. The restore_command on the replica looking for this non-existent file is only run once. This seems like an odd state to be in. The replica is waiting on a new file, but it's not actually looking for it. Is this expected, or should the restore_command be polling for that file?

I am not sure how Patroni does it internally. Can you explain the scenario in more detail? Suppose you execute the promote on m-1: if the promotion is successful, it will switch the timeline and create the timeline history file. Once the promotion has succeeded, if we change the primary_conninfo on m-2, it will restart the walreceiver and look for the latest .history file, which it should find either through direct streaming or through the restore_command.

If you are saying that m-2 tried to look for the history file before m-1 created it, then it seems that you changed the primary_conninfo on m-2 before the m-1 promotion had completed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
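
As a rough illustration of the ordering discussed above, the sketch below waits for the new timeline's history file to appear before m-2 would be repointed at the new primary. It is only a sketch under stated assumptions, not how Patroni or pgBackRest actually behave: the archive is modeled as a local directory rather than S3 storage, and the function names and the timeout and polling parameters are hypothetical. The only PostgreSQL-specific detail it relies on is that a timeline history file is named with the timeline ID as eight hexadecimal digits followed by .history.

```python
import os
import time


def history_file_name(timeline_id: int) -> str:
    """Name of a timeline history file: the timeline ID as 8 hex digits."""
    return f"{timeline_id:08X}.history"


def wait_for_history(archive_dir: str, new_timeline: int,
                     timeout_s: float = 30.0, poll_s: float = 0.5) -> bool:
    """Poll a directory until the new timeline's history file shows up.

    Hypothetical helper: the archive is assumed to be a local directory.
    Returns True once the file exists, False if the timeout expires first.
    """
    target = os.path.join(archive_dir, history_file_name(new_timeline))
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(target):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

An orchestration step could call wait_for_history(archive_dir, old_timeline + 1) after promoting m-1 and change primary_conninfo on m-2 only once it returns True.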