Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly
From | Palle Girgensohn |
---|---|
Subject | Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly |
Date | |
Msg-id | 3029E88A-83B1-49BC-B2BA-DB3709AA26F7@pingpong.net |
In response to | Re: [BUGS] BUG #14702: Streaming replication broken after server closed connection unexpectedly (Michael Paquier <michael.paquier@gmail.com>) |
List | pgsql-bugs |
> On 13 June 2017 at 03:57, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Tue, Jun 13, 2017 at 6:52 AM, <girgen@pingpong.net> wrote:
>> Setup is simple streaming replication: master -> slave. There is a
>> replication slot at the master, so xlogs should not be removed until the
>> client has received them properly.
>
> Hm. There has been the following discussion as well, which refers to a
> legit bug where WAL segments could be removed even if a slot is used:
> https://www.postgresql.org/message-id/CACJqAM3xVz0JY1XFDKPP+JoJAjoGx=GNuOAshEDWCext7BFvCQ@mail.gmail.com
> The circumstances to trigger the problem are quite particular though
> as it needs an incomplete WAL record at the end of a segment.
>
>> After this, the slave could not be started again, each time the same error
>> about "invalid memory alloc request size 1600487424".
>
> Hm. That smells of data corruption.. Be careful going forward.

I believe that corruption was in the broken WAL file though. I saw some
notes pointing in that direction on the list, but sure, I could be mistaken.

>
>> Looking more closely, the last xlog file, let's call it 0000EB, is corrupt
>> on the slave, having a different checksum from the proper one at the master.
>
> To which checksum are you referring here? Did you do yourself a
> calculation using what is on-disk? Note that during streaming
> replication the content of an unfinished segment may be different than
> what is on the primary.

I calculated that myself using sha256 from the command line. As you say, it
was probably an unfinished segment. The problem is that the slave expects
the *previous* WAL file to still be saved on the master, but it had already
been removed. The slave *has* it though, so why would it require it to be
transferred again? 0000EA was requested, although it had already been
completely transferred to the slave. I had to copy 0000EA back to the
master so it could be transferred again.

>
>> Now, I don't know exactly what happened when the slave lost track, but the
>> bug, I think, is that the streamed WAL was corrupt, and still accepted by
>> the slave *and* hence removed from the master. It required too much
>> experience to fix that. The slave should not accept a not fully transported
>> WAL file. It seems it happened during some connection failure between the
>> slave and master, but still it should preferably fail more gracefully. Are
>> the mechanisms implemented to support that, and they failed, or is it just
>> not implemented?
>
> There is a per-record CRC calculation to check the validity of each
> record, and this is processed when fetching each record at recovery as
> a sanity check. That's one way to prevent applying an incorrect
> record. In the event of such an error you would have seen "incorrect
> resource manager data checksum in record at" or similar. It seems to
> me that you should be careful with the primary as well.

OK. "Be careful" is somewhat vague, but I get it. Would a pg_dump +
pg_restore, for example, reveal any data corruption? Or is it just not
possible to be totally sure unless checksums had been activated (they're
not; this is an old database)?

> --
> Michael
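
For anyone who wants to repeat the sha256 comparison mentioned above, here is a minimal sketch. It assumes copies of the same WAL segment from the master and the slave are available as local files. The function names and the `limit` parameter are illustrative only and are not part of PostgreSQL. Since an unfinished segment may legitimately differ between the two servers, `limit` restricts the comparison to the first N bytes, on the assumption that you know how far the segment was actually written on both sides.

```python
import hashlib
from pathlib import Path
from typing import Optional


def segment_sha256(path: Path, limit: Optional[int] = None,
                   chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, or of only its first `limit` bytes."""
    digest = hashlib.sha256()
    remaining = limit
    with open(path, "rb") as fh:
        # Read in chunks so a 16 MB segment is never held in memory at once.
        while remaining is None or remaining > 0:
            to_read = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = fh.read(to_read)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def segments_match(primary_copy: Path, standby_copy: Path,
                   limit: Optional[int] = None) -> bool:
    """Compare two copies of a segment, optionally only up to `limit` bytes."""
    return segment_sha256(primary_copy, limit) == segment_sha256(standby_copy, limit)
```

A mismatch over the whole file does not by itself prove corruption, for the reason Michael gives. A mismatch within the range both servers are known to have written would be more suspicious.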