Обсуждение: pg_rewind with cascade standby doesn't work well

Поиск
Список
Период
Сортировка

pg_rewind with cascade standby doesn't work well

От
Kuwamura Masaki
Дата:
Hi there,

I tested pg_rewind behavior and found a suspicious one.

Consider a scenario like this,

Server A: primary
Server B :replica of A
Server C :replica of B

and somehow A down ,so B gets promoted.

Server A: down
Server B :new primary
Server C :replica of B

In this case, pg_rewind can be used to reconstruct the cascade; the source is C and the target is A.
However, we get error as belows by running pg_rewind.

```
pg_rewind: fetched file "global/pg_control", length 8192
pg_rewind: source and target cluster are on the same timeline
pg_rewind: no rewind required
```
Though A's timeline is 1 and C's is 2 ideally,  it says they're on the same timeline.

This is because `pg_rewind` currently uses minRecoveryPointTLI and latest checkpoint's TimelineID to compare the TLI between source and target[1].
Both C's minRecoveryPointTLI and Latest checkpoint's TimelineID are not modified until checkpointing. (even though B's are modified).
And then, if you run pg_rewind immediately, pg_rewind won't work because C and A appear to be on the same timeline. So we have to CHECKPOINT on C before running pg_rewind;

BTW, immediate pg_rewind with cascade standby seems to be already concerned in another discussion[2], but unfortunately missed.

Anyway, I don't think this behavior is kind.
To fix this, should we use another variable to compare TLI?
Or, modify the cascade standby's minRecoveryPointTLI somehow?

Masaki Kuwamura

[1] https://www.postgresql.org/message-id/flat/9f568c97-87fe-a716-bd39-65299b8a60f4%40iki.fi
[2] https://www.postgresql.org/message-id/flat/aeb5f31a-8de2-40a8-64af-ab659a309d6b%40iki.fi

Re: pg_rewind with cascade standby doesn't work well

От
Kuwamura Masaki
Дата:
> Consider a scenario like this,

> Server A: primary
> Server B :replica of A
> Server C :replica of B

> and somehow A down ,so B gets promoted.
> Server A: down
> Server B :new primary
> Server C :replica of B

> In this case, pg_rewind can be used to reconstruct the cascade; the source is C and the target is A.
> However, we get error as belows by running pg_rewind.

>  ```
>  pg_rewind: fetched file "global/pg_control", length 8192
>  pg_rewind: source and target cluster are on the same timeline
>  pg_rewind: no rewind required
>  ```

To fix the above mentioned behavior of pg_rewind, I suggest to change the cascade standby's (i.e. server C's) minRecoveryPointTLI when it receives the new timeline information from the new primary (i.e. server B).

When server B is promoted, it creates an end-of-recovery record by calling CreateEndOfRecoveryRecord(). (in xlog.c)
And also updates B's minRecoveryPoint and minRecoveryPointTLI.
```
/*
      * Update the control file so that crash recovery can follow the timeline
      * changes to this point.
      */
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
     ControlFile->minRecoveryPoint = recptr;
     ControlFile->minRecoveryPointTLI = xlrec.ThisTimeLineID;
     UpdateControlFile();
     LWLockRelease(ControlFileLock);
```
Since C is a replica of B, the end-of-recovery record is replicated from B to C, so the record is replayed in C by xlog_redo().
The attached patch updates minRecoveryPoint and minRecoveryPointTLI at this point by mimicking CreateEndOfRecoveryRecord().
With this patch, you can run pg_rewind with cascade standby immediately. (without waiting for checkpointing)

Thoughts?

Masaki Kuwamura
Вложения

Re: pg_rewind with cascade standby doesn't work well

От
Aleksander Alekseev
Дата:
Hi,

> The attached patch updates minRecoveryPoint and minRecoveryPointTLI at this point by mimicking
CreateEndOfRecoveryRecord().
> With this patch, you can run pg_rewind with cascade standby immediately. (without waiting for checkpointing)

Many thanks for submitting the patch. I added it to the nearest open
commitfest [1].

IMO a test is needed that makes sure no one is going to break this in
the future.

[1]: https://commitfest.postgresql.org/45/4559/

-- 
Best regards,
Aleksander Alekseev



Re: pg_rewind with cascade standby doesn't work well

От
Michael Paquier
Дата:
On Mon, Sep 11, 2023 at 07:04:30PM +0300, Aleksander Alekseev wrote:
> Many thanks for submitting the patch. I added it to the nearest open
> commitfest [1].
>
> IMO a test is needed that makes sure no one is going to break this in
> the future.

You definitely need more complex test scenarios for that.  If you can
come up with new ways to make the TAP tests of pg_rewind mode modular
in handling more complicated node setups, that would be a nice
addition, for example.

> [1]: https://commitfest.postgresql.org/45/4559/

@@ -7951,6 +7951,15 @@ xlog_redo(XLogReaderState *record)
             ereport(PANIC,
                     (errmsg("unexpected timeline ID %u (should be %u) in end-of-recovery record",
                             xlrec.ThisTimeLineID, replayTLI)));
+        /*
+         * Update the control file so that crash recovery can follow the timeline
+         * changes to this point.
+         */
+        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+        ControlFile->minRecoveryPoint = lsn;
+        ControlFile->minRecoveryPointTLI = xlrec.ThisTimeLineID;

This patch is at least incorrect in its handling of crash recovery,
because these two should *never* be set in this case as we want to
replay up to the end of WAL.  For example, see xlog.c or the top of
xlogrecovery.c about the assumptions behind these variables:
    /* crash recovery should always recover to the end of WAL */
    ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
    ControlFile->minRecoveryPointTLI = 0;

If an end-of-recovery record is replayed during crash recovery, these
assumptions are plain broken.

One thing that we could consider is to be more aggressive with
restartpoints when replaying this record for a standby, see a few
lines above the lines added by your patch, for example.  And we could
potentially emulate a post-promotion restart point to get a refresh of
the control file as it should, with the correct code paths involved in
the updates of minRecoveryPoint when the checkpointer does the job.
--
Michael

Вложения

Re: pg_rewind with cascade standby doesn't work well

От
Kuwamura Masaki
Дата:
>> IMO a test is needed that makes sure no one is going to break this in
>> the future.
>
> You definitely need more complex test scenarios for that.  If you can
> come up with new ways to make the TAP tests of pg_rewind mode modular
> in handling more complicated node setups, that would be a nice
> addition, for example.

I'm sorry for lacking tests. For now, I started off with a simple test
that cause the problem I mentioned. The updated WIP patch 0001 includes
the new test for pg_rewind.
And also, I'm afraid that I'm not sure what kind of tests I have to make
for fix this behavior. Would you mind giving me some advice?

> This patch is at least incorrect in its handling of crash recovery,
> because these two should *never* be set in this case as we want to
> replay up to the end of WAL.  For example, see xlog.c or the top of
> xlogrecovery.c about the assumptions behind these variables:
>     /* crash recovery should always recover to the end of WAL */
>     ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
>     ControlFile->minRecoveryPointTLI = 0;
>
> If an end-of-recovery record is replayed during crash recovery, these
> assumptions are plain broken.

That make sense! I really appreciate your knowledgeable review.

> One thing that we could consider is to be more aggressive with
> restartpoints when replaying this record for a standby, see a few
> lines above the lines added by your patch, for example.  And we could
> potentially emulate a post-promotion restart point to get a refresh of
> the control file as it should, with the correct code paths involved in
> the updates of minRecoveryPoint when the checkpointer does the job.

I'm not confident but you meant we could make restartpoint
(i.e., call `RequestCheckpoint()`) instead of my old patch?
The patch 0001 also contains my understanding.

I also found a bug (maybe). If we call `CreateRestartPoint()` during
crash-recovery, the assertion fails at ComputeXidHorizon() in procarray.c.
It's inherently orthogonal to the problem I already reported. So you can
reproduce this at HEAD with this procedure.

1. Start primary and standby server
2. Modify checkpoint_timeout to 1h on standby
3. Insert 10^10 records and concurrently run CHECKPOINT every second on primary
4. Do an immediate stop on both standby and primary at the end of the insert
5. Modify checkpoint_timeout to 30 on standby
6. Remove standby.signal on standby
7. Restart standby (it will start crash-recovery)
8. Assertion failure is raised shortly

I think this is because `TruncateSUBTRANS();`  in `CreateRestartPoint()` is
called but `StartupSUBTRANS()` isn't called yet. In `StartupXLOG()`, we call
`StartupSUBTRANS()` if `(ArchiveRecoveryRequested && EnableHotStandby)`.
However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if
`(EnableHotStandby)`. I guess the difference causes this bug. The latter
possibly be called even crash-recovery while former isn't.
The attached patch 0002 fixes it. I think we could discuss about this bug in
another thread if needed.

Best regards!

Masaki Kuwamura
Вложения

Re: pg_rewind with cascade standby doesn't work well

От
Michael Paquier
Дата:
On Wed, Sep 20, 2023 at 11:46:45AM +0900, Kuwamura Masaki wrote:
> I also found a bug (maybe). If we call `CreateRestartPoint()` during
> crash-recovery, the assertion fails at ComputeXidHorizon() in procarray.c.
> It's inherently orthogonal to the problem I already reported. So you can
> reproduce this at HEAD with this procedure.
>
> 1. Start primary and standby server
> 2. Modify checkpoint_timeout to 1h on standby
> 3. Insert 10^10 records and concurrently run CHECKPOINT every second on
> primary
> 4. Do an immediate stop on both standby and primary at the end of the insert
> 5. Modify checkpoint_timeout to 30 on standby
> 6. Remove standby.signal on standby
> 7. Restart standby (it will start crash-recovery)
> 8. Assertion failure is raised shortly
>
> I think this is because `TruncateSUBTRANS();`  in `CreateRestartPoint()` is
> called but `StartupSUBTRANS()` isn't called yet. In `StartupXLOG()`, we
> call
> `StartupSUBTRANS()` if `(ArchiveRecoveryRequested && EnableHotStandby)`.
> However, in `CreateRestartPoint()`, we call `TruncateSUBTRANS()` if
> `(EnableHotStandby)`. I guess the difference causes this bug. The latter
> possibly be called even crash-recovery while former isn't.
> The attached patch 0002 fixes it. I think we could discuss about this bug
> in
> another thread if needed.

This is a known issue.  I guess that the same as this thread and this
CF entry:
https://commitfest.postgresql.org/44/4244/
https://www.postgresql.org/message-id/flat/ZArVOMifjzE7f8W7@paquier.xyz
--
Michael

Вложения

Re: pg_rewind with cascade standby doesn't work well

От
Fujii Masao
Дата:

On 2023/09/20 12:04, Michael Paquier wrote:
> This is a known issue.  I guess that the same as this thread and this
> CF entry:
> https://commitfest.postgresql.org/44/4244/
> https://www.postgresql.org/message-id/flat/ZArVOMifjzE7f8W7@paquier.xyz

I think this is a separate issue, and we should still use Kuwamura-san's patch
even after the one you posted on the thread gets accepted. BTW, I was able to
reproduce the assertion failure Kuwamura-san reported, even after applying
your latest patch from the thread.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: pg_rewind with cascade standby doesn't work well

От
Aleksander Alekseev
Дата:
Hi,

> >> IMO a test is needed that makes sure no one is going to break this in
> >> the future.
> >
> > You definitely need more complex test scenarios for that.  If you can
> > come up with new ways to make the TAP tests of pg_rewind mode modular
> > in handling more complicated node setups, that would be a nice
> > addition, for example.
>
> I'm sorry for lacking tests. For now, I started off with a simple test
> that cause the problem I mentioned. The updated WIP patch 0001 includes
> the new test for pg_rewind.

Many thanks for a quick update.

> And also, I'm afraid that I'm not sure what kind of tests I have to make
> for fix this behavior. Would you mind giving me some advice?

Personally I would prefer not to increase the scope of work. Your TAP
test added in 0001 seems to be adequate.

> BTW, I was able to
> reproduce the assertion failure Kuwamura-san reported, even after applying
> your latest patch from the thread.

Do you mean that the test fails or it doesn't but there are other
steps to reproduce the issue?

-- 
Best regards,
Aleksander Alekseev



Re: pg_rewind with cascade standby doesn't work well

От
Michael Paquier
Дата:
On Tue, Sep 26, 2023 at 06:44:50PM +0300, Aleksander Alekseev wrote:
>> And also, I'm afraid that I'm not sure what kind of tests I have to make
>> for fix this behavior. Would you mind giving me some advice?
>
> Personally I would prefer not to increase the scope of work. Your TAP
> test added in 0001 seems to be adequate.

Yeah, agreed.  I'm OK with what you are proposing, basically (the
test could be made a bit cheaper actually).

         /*
-         * For Hot Standby, we could treat this like a Shutdown Checkpoint,
-         * but this case is rarer and harder to test, so the benefit doesn't
-         * outweigh the potential extra cost of maintenance.
+         * For Hot Standby, we could treat this like an end-of-recovery checkpoint
          */
+        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);

I don't understand what you want to change here.  Archive recovery and
crash recovery are two different things, still this code would be
triggered even if there is no standby.signal, aka the node is not a
standby.  Why isn't this stuff conditional?

>> BTW, I was able to
>> reproduce the assertion failure Kuwamura-san reported, even after applying
>> your latest patch from the thread.
>
> Do you mean that the test fails or it doesn't but there are other
> steps to reproduce the issue?

I get it as Fujii-san testing the patch from [1], still failing the
test from [2]:
[1]: https://www.postgresql.org/message-id/ZArVOMifjzE7f8W7@paquier.xyz
[2]: https://www.postgresql.org/message-id/CAMyC8qryE7mKyvPvGHCt5GpANAmp8sS_tRbraqXcPBx14viy6g@mail.gmail.com

I would be surprised, actually, because the patch from [1] would cause
step 7 of the test to fail: the patch causes standby.signal or
recovery.signal to be required.  Anyway, this specific issue, if any,
had better be discussed on the other thread.  I need to address a few
comments there as well and was planning to get back to it.  It is
possible that I've missed something on the other thread with the
restrictions I was proposing in the latest version of the patch.

For this thread, let's focus on the pg_rewind case and how we want to
treat these records to improve the cascading case.
--
Michael

Вложения

Re: pg_rewind with cascade standby doesn't work well

От
Kuwamura Masaki
Дата:
Thanks for your review!

2023年9月27日(水) 8:33 Michael Paquier <michael@paquier.xyz>:
On Tue, Sep 26, 2023 at 06:44:50PM +0300, Aleksander Alekseev wrote:
>> And also, I'm afraid that I'm not sure what kind of tests I have to make
>> for fix this behavior. Would you mind giving me some advice?
>
> Personally I would prefer not to increase the scope of work. Your TAP
> test added in 0001 seems to be adequate.

Yeah, agreed.  I'm OK with what you are proposing, basically (the
test could be made a bit cheaper actually).

I guess you meant that it contains an unnecessary insert and wait.
I fixed this and some incorrect comments caused by copy & paste.
Please see the attached patch.

 
         /*
-         * For Hot Standby, we could treat this like a Shutdown Checkpoint,
-         * but this case is rarer and harder to test, so the benefit doesn't
-         * outweigh the potential extra cost of maintenance.
+         * For Hot Standby, we could treat this like an end-of-recovery checkpoint
          */
+        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);

I don't understand what you want to change here.  Archive recovery and
crash recovery are two different things, still this code would be
triggered even if there is no standby.signal, aka the node is not a
standby.  Why isn't this stuff conditional?
 
You are absolutely right. It should only run in standby mode.
Also, according to the document[1], a server can be "Hot Standby" even if it is 
not in standby mode (i.e. when it is in archive recovery mode).
So I fixed the comment above `RequestCheckpoint()`.


I hope you will review it again.

Regards,

Masaki Kuwamura
Вложения