Re: Race condition in recovery?

Поиск
Список
Период
Сортировка
От Dilip Kumar
Тема Re: Race condition in recovery?
Дата
Msg-id CAFiTN-tQnJBWEj-juQSZuVnLq8g-xCxezyFF_PmWobSbuPfLPQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Race condition in recovery?  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Race condition in recovery?  (Dilip Kumar <dilipbalaut@gmail.com>)
Re: Race condition in recovery?  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I had trouble following it completely, but I didn't really spot
> > anything that seemed definitely wrong. However, I don't understand
> > what it has to do with where we are now. What I want to understand is:
> > under exactly what circumstances does it matter that
> > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > will stream from receiveTLI rather than recoveryTargetTLI?
>
> Ah ha! I think I figured it out. To hit this bug, you need to meet the
> following conditions:
>
> 1. Both streaming and archiving have to be configured.
> 2. You have to promote a new primary.
> 3. After promoting the new primary you have to start a new standby
> that doesn't have local WAL and for which the backup was taken from
> the previous timeline. In Dilip's original scenario, this new standby
> is actually the old primary, but that's not required.

No, in my original scenario also the new standby was not old primary,
I had 3 nodes
node1-> primary, node2 -> standby1, node3-> standby2
node2 promoted as a new primary and node3's local WAL was removed (so
that it has to stream checkpoint record from new primary and then
remaining everything happens as you explain in remaining steps).

> 4. The new standby has to be able to find the history file it needs in
> the archive but not the WAL files.
> 5. The new standby needs to have recovery_target_timeline='latest'
> (which is the default)
>
> When you start the new standby, it will fetch the current TLI from its
> control file. Then, since recovery_target_timeline=latest, the system
> will try to figure out the latest timeline, which only works because
> archiving is configured. There seems to be no provision for detecting
> the latest timeline via streaming. With archiving enabled, though,
> findNewestTimeLine() will be able to restore the history file created
> by the promotion of the new primary, which will cause
> validateRecoveryParameters() to change recoveryTargetTLI.

Right

 Then we'll
> try to read the WAL segment containing the checkpoint record and fail
> because, by stipulation, only history files are available from the
> archive. Now, because streaming is also configured, we'll try
> streaming. That will work, so we'll be able to read the checkpoint
> record, but now, because WaitForWALToBecomeAvailable() initialized
> expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> switch to the correct timeline and it all goes wrong.

exactly

> The attached test script, test.sh seems to reliably reproduce this.
> Put that file and the recalcitrant_cp script, also attached, into an
> empty directory, cd to that directory, and run test.sh. Afterwards
> examine pgcascade.log. Basically, these scripts just set up the
> scenario described above. We set up primary and a standby that use
> recalcitrant_cp as the archive command, and because it's recalcitrant,
> it's only willing to copy history files, and always fails for WAL
> files.Then we create a cascading standby by taking a base backup from
> the standby, but before actually starting it, we promote the original
> standby. So now it meets all the conditions described above. I tried a
> couple variants of this test. If I switch the archive command from
> recalcitrant_cp to just regular cp, then there's no problem. And if I
> switch it to something that always fails, then there's also no
> problem. That's because, with either of those changes, condition (4)
> above is no longer met. In the first case, both files end up in the
> archive, and in the second case, neither file.

I haven't tested this, but I will do that.  But now we are on the same
page about the cause of the actual problem I reported.

> What about hitting this in real life, with a real archive command?
> Well, you'd probably need the archive command to be kind of slow and
> get unlucky on the timing, but there's nothing to prevent it from
> happening.

Right

> But, it will be WAY more likely if you have Dilip's original scenario,
> where you try to repurpose an old primary as a standby. It would
> normally be unlikely that the backup used to create a new standby
> would have an older TLI, because you typically wouldn't switch masters
> in between taking a base backup and using it to create a new standby.
> But the old master always has an older TLI. So (3) is satisfied. For
> (4) to be satisfied, you need the old master to fail to archive all of
> its WAL when it shuts down.

For my original case, both standby1 and standby2 are connected to the
primary.  Now, standby1 is promoted and standby2 is shut down. And,
before restarting, all the local WAL of the standby2 is removed so
that it can follow the new primary. The primary info and restore
command for standby2 are changed as per the new primary(standby1).

Now the scenario is that the standby1 has switched the timeline in the
middle of the segment which contains the checkpoint record, so the
segment with old TL is renamed to (.partial) and the same segment with
new TL is not yet archived but the history file for the new TL has
been archived.

Now, when standby2 restart the remaining things happened as you
explained, basically it restores the history file and changes the
recoveryTargetTLI but it doesn't get the WAL file from the archive.
So try to stream checkpoint record from the primary using
"ControlFile->checkPointCopy.ThisTimeLineID", which is old timeline.

Now, we may ask that if the WAL segment with old TL on standby1(new
primary) which contains the checkpoint is already renamed to
".partial" then how can it stream using the old TL then the answer is
below code[1] in the walsender.  Basically, the checkpoint record is
present in both new and old TL as TL switched in the middle of the
segment, it will send you the record from the new TL even if the
walreciever asks to stream with old TL.  Now walrecievr is under
impression that it has read from the old TL.  And, we know the rest of
the story that we will set the expectedTLEs based on the old history
file and never be able to go to the new TL.

Anyways now we understand the issue and there are many ways we can
reproduce it. Still, I thought of explaining the exact steps how it
happed for me because now we understand it well so I think it is easy
to explain :)

[1]
WalSndSegmentOpen()
{
/*-------
* When reading from a historic timeline, and there is a timeline switch
* within this segment, read from the WAL segment belonging to the new
* timeline.
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Julien Rouhaud
Дата:
Сообщение: Re: compute_query_id and pg_stat_statements
Следующее
От: Dilip Kumar
Дата:
Сообщение: Re: Logical Replication - behavior of TRUNCATE ... CASCADE