Corner-case bug in pg_rewind
From | Ian Barwick |
---|---|
Subject | Corner-case bug in pg_rewind |
Date | |
Msg-id | CABvVfJU-LDWvoz4-Yow3Ay5LZYTuPD7eSjjE4kGyNZpXC6FrVQ@mail.gmail.com |
Responses | Re: Corner-case bug in pg_rewind |
List | pgsql-hackers |
Hi

Take the following cluster with:

- node1 (initial primary)
- node2 (standby)
- node3 (standby)

Following activity takes place (greatly simplified from a real-world
situation):

1. node1 is shut down.
2. node3 is promoted.
3. node2 is attached to node3.
4. node1 is attached to node3.
5. node1 is then promoted (creating a split brain situation with
   node1 and node3 as primaries).
6. node2 and node3 are shut down (in that order).
7. pg_rewind is executed to reset node2 so it can reattach
   to node1 as a standby. pg_rewind claims:

    pg_rewind: servers diverged at WAL location X/XXXXXXX on timeline 2
    pg_rewind: no rewind required

8. Based off that assurance, node2 is restarted with replication
   configuration pointing to node1 - but it is unable to attach,
   with node2's log reporting something like:

    new timeline 3 forked off current database system timeline 2
    before current recovery point X/XXXXXXX

The cause is that pg_rewind is assuming that if the node's last
checkpoint matches the divergence point, no rewind is needed:

    if (chkptendrec == divergerec)
        rewind_needed = false;

but in this case there *are* records beyond the last checkpoint, which can be
inferred from "minRecoveryPoint" - but this is not checked.

Attached patch addresses this. It includes a test, which doesn't make use of
the RewindTest module, as that hard-codes a primary and a standby, while here
three nodes are needed (I can't come up with a situation where this can be
reproduced with only two nodes). The test sets "wal_keep_size" so would need
modification for Pg12 and earlier.

Regards

Ian Barwick

--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
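
For illustration only, the missing check described in the message could be sketched roughly as below. This is not the attached patch and not pg_rewind's actual code: the function name `target_needs_rewind`, the parameter `min_recovery_point`, and the local `XLogRecPtr` typedef are assumptions made for the example. The idea is simply that "no rewind required" should only be concluded when the last checkpoint ends at the divergence point and minRecoveryPoint does not indicate WAL beyond it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t XLogRecPtr;

/*
 * Hypothetical helper (an assumption for this sketch, not pg_rewind code).
 *
 * chkptendrec        - end of the target's last checkpoint record
 * min_recovery_point - the target's minRecoveryPoint (0 if unset)
 * divergerec         - WAL position where source and target diverged
 *
 * The original check only compared chkptendrec with divergerec. Here the
 * target is additionally treated as needing a rewind when
 * minRecoveryPoint shows that WAL past the divergence point was replayed.
 */
bool
target_needs_rewind(XLogRecPtr chkptendrec,
                    XLogRecPtr min_recovery_point,
                    XLogRecPtr divergerec)
{
    bool rewind_needed = true;

    if (chkptendrec == divergerec && min_recovery_point <= divergerec)
        rewind_needed = false;

    return rewind_needed;
}
```

With the scenario from the message (checkpoint ending exactly at the divergence point, but minRecoveryPoint further ahead), this sketch would report that a rewind is needed, whereas the original comparison alone would not.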