Обсуждение: An improvement of ProcessTwoPhaseBuffer logic

Поиск
Список
Период
Сортировка

An improvement of ProcessTwoPhaseBuffer logic

От
"Vitaly Davydov"
Дата:
Dear Hackers,

I would like to discuss ProcessTwoPhaseBuffer function. It reads two-phase transaction states from disk or the WAL. It
takesxid as well as some other input parameters and executes the following steps: 

Step #1: Check if xid is committed or aborted in clog (TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check if xid is not equal or greater than ShmemVariableCache->nextXid
Step #3: Read two-phase state for the specified xid from memory or the corresponding file and returns it

In some, very rare scenarios, the postgres instance will newer recover because of such logic. Imagine, that the
two_phasedirectory contains some files with two-phase states of transactions of distant future. I assume, it can happen
ifsome WAL segments are broken and ignored (as well as clog data) but two_phase directory was not broken. In recovery,
postgresqlreads all the files in two_phase and tries to recover two-phase states. 

The problem appears in the functions TransactionIdDidCommit or TransactionIdDidAbort. These functions may fail with the
FATALmessage like below when no clog state on disk is available for the xid: 

FATAL:  could not access status of transaction 286331153
DETAIL:  Could not open file "pg_xact/0111": No such file or directory.

Such error do not allow the postgresql instance to be started.

My guess, if to swap Step #1 with Step #2 such error will disappear because transactions will be filtered when
comparingxid with ShmemVariableCache->nextXid before accessing clog. The function will be more robust. In general, it
worksbut I'm not sure that such logic will not break some rare boundary cases. Another solution is to catch and ignore
sucherror, but the original solution is the simpler one. I appreciate any thoughts concerning this topic. May be, you
knowsome cases when such change in logic is not relevant? 

Thank you in advance!

With best regards,
Vitaly




Re: An improvement of ProcessTwoPhaseBuffer logic

От
"Vitaly Davydov"
Дата:
Hi Michael,

Thank you for the explanation and the patch! I'm happy, that I seem to be on the right way.

On Wednesday, December 25, 2024 08:04 MSK, Michael Paquier <michael@paquier.xyz> wrote:
> Vitaly, have you seen that in the wild as an effect of future 2PC files?

I haven't heard about this problem in production, but I've encountered it when I did some development in two-phase
functionality.I fixed it by swapping the if blocks and it solved my problem. 

It is pretty easy to reproduce it on the master branch:

1. Start an instance with enabled prepared transactions
2. Create a simple prepared transaction
3. Do checkpoint to write the transaction two-phase state into a file in pg_twophase subdirectory
4. Copy the created file, change its name to reflect a future xid (in my case: cp 00000000000002E8 00000000000FF2E8)
5. Commit the prepared transaction
6. Stop the instance with -m immediate
7. Start the instance

After starting, you can get an error like "could not start server". In the log file you can find a message like:

LOG:  database system was not properly shut down; automatic recovery in progress
FATAL:  could not access status of transaction 1045224
DETAIL:  Could not read from file "pg_xact/0000" at offset 253952: read too few bytes.
2024-12-25 18:38:30.606 MSK [795557] LOG:  startup process (PID 795560) exited with exit code 1

I tried your patch and it seems the server is started successfully. But I've found another problem in my synthetic test
-it can not remove the file with the following message: 

LOG:  database system was not properly shut down; automatic recovery in progress
WARNING:  removing future two-phase state file for transaction 1045224
WARNING:  could not remove file "pg_twophase/FFFFFFFF000FF2E8": No such file or directory
LOG:  redo starts at 0/1762978

The fill will never be removed automatically.
I guess, it is because we incorrectly calculate the two-phase file name using TwoPhaseFilePath in RemoveTwoPhaseFile in
thisscenario. It can be fixed if to pass file path directly from RecoverPreparedTransactions or
StandbyRecoverPreparedTransactioninto ProcessTwoPhaseBuffer -> RemoveTwoPhaseFile. I did it in the proposed patch
https://www.postgresql.org/message-id/cedbe-65e0c000-1-6db17700%40133269862(it is incomplete now). 

With best regards,
Vitaly




Re: An improvement of ProcessTwoPhaseBuffer logic

От
Давыдов Виталий
Дата:
Hi Michael,

> Here are two patches to address both issues:

Thank you for the preparing the patches and test simplification. My bad, I overcompilcated it.

I concerned about twophase file name generation. The function TwoPhaseFilePath() is pretty straitforward and
unambiguousin 16 and earlier versions. The logic of this function in 17+ seems to be more complex. I do not understand
itclearly. But, I guess, it will work incorrectly after turning to a newer epoch, because the epoch is calculated from
TransamVariables->nextXid,but not from problem xid. The same problem may happen if we are in epoch 1 or greater. It
willproduce a wrong file name, because the epoch will be obtained from the last xid, not from file name xid. In another
words,the function AdjustToFullTransactionId assumes that if xid > TransamVariables->nextXid, then the xid from the
previousepoch. I may be not the case in our scenario. 

> I suspect that this can be still dangerous as-is while complicating the code with more possible paths for the removal
ofthe 2PC files 
Agree, but we may pass file name into TwoPhaseFilePath if any, instead of creation of two functions as in the patch.
Thecost - two new if conditions. Using file names is pretty safe. Once we read the file and extract xid from its name,
justpass this file name to TwoPhaseFilePath(). If not, try to generate it. Anyway, I do not insist on it, just try to
discuss.

With best regards,
Vitaly




Re: An improvement of ProcessTwoPhaseBuffer logic

От
Давыдов Виталий
Дата:
On Monday, December 30, 2024 04:08 MSK, Michael Paquier <michael@paquier.xyz> wrote:

> Instead, I have gone with a new
> FullTransactionIdFromCurrentEpoch() that replaces
> AdjustToFullTransactionId().  It cleans up most of the calls to
> ReadNextFullTransactionId() compared to the previous patch.  It is
> true that these couple with calls to ProcessTwoPhaseBuffer(), but the
> result felt OK this way in the scope of this fix.

Could you please send the latest version of the patch, if any? I haven't found any new patches in the latest email.

Happy New Year!

With best regards,
Vitaly




Re: An improvement of ProcessTwoPhaseBuffer logic

От
Noah Misch
Дата:
On Thu, Jan 16, 2025 at 12:52:54PM -0800, Noah Misch wrote:
> On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:
> > As far as I understand, the most important point of the logic is to
> > detect and discard the future files first in restoreTwoPhaseData() ->
> > ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
> > the beginning of recovery.  Once this filtering is done, it should be
> > safe to use your FullTransactionIdFromAllowableAt() when doing
> > the fxid <-> xid transitions between the records and the files on disk
> > flushed by a restartpoint which store an XID, and the shmem state of
> > GlobalTransactionData with a fxid.
> 
> (I did not expect that a function called restoreTwoPhaseData() would run before
> a function called PrescanPreparedTransactions(), but so it is.)
> 
> How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
> call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
> consistent?  What can it assume about the value of TransamVariables->nextXid
> at that early time?

I think it can't use CLOG safely.  I regret answering my own question now as
opposed to refraining from asking it, but hopefully this will save you time.

Since EndPrepare() flushes the XLOG_XACT_PREPARE record before anything writes
to pg_twophase, presence of a pg_twophase file tells us its record once
existed in the flushed WAL stream.  Even before we've started the server on a
base backup, each pg_twophase file pertains to an XLOG_XACT_PREPARE LSN no
later than the end-of-backup checkpoint.  If a later file exists, the backup
copied a file after the end-of-backup checkpoint started, which is a violation
of the base backup spec.

Elsewhere in recovery, we have the principle that data directory content means
little until we reach a consistent state by replaying the end-of-backup
checkpoint or by crash recovery reaching the end of WAL.  For twophase, that
calls for ignoring pg_twophase content.  If a restartpoint has a pg_twophase
file to write, we should allow that to overwrite an old file iff we've not
reached consistency.  (We must not read the old pg_twophase file, which may
contain an unfinished write.)

By the time we reach consistency, every file in pg_twophase will be applicable
(not committed or aborted).  Each file either predates the start-of-backup
checkpoint (or crash recovery equivalent), or recovery wrote that file.  We
need readdir(pg_twophase) only at end of recovery, when we need
_twophase_recover callbacks to restore enough state to accept writes.  (During
hot standby, state from XLOG_RUNNING_XACTS suffices[1] for the actions
possible in that mode.)

In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency.  Is that right?  Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks.  We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.

Thanks,
nm

[1] Incidentally, the comment lock_twophase_standby_recover incorrectly claims
it runs "when starting up into hot standby mode."



Re: An improvement of ProcessTwoPhaseBuffer logic

От
Noah Misch
Дата:
On Fri, Jan 17, 2025 at 11:04:03AM +0900, Michael Paquier wrote:
> On Thu, Jan 16, 2025 at 04:52:21PM -0800, Noah Misch wrote:
> > In other words, the root problem that led to commits e358425 and 7e125b2 was
> > recovery interpreting pg_twophase file content before reaching consistency.
> > We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
> > consistency.  Is that right?  Once ProcessTwoPhaseBuffer() stops running
> > before consistency, it won't strictly need the range checks.  We may as well
> > have something like those range checks as a best-effort way to detect base
> > backup spec violations.
> 
> Oh, yeah, it seems like you have a very good point here.
> ProcessTwoPhaseBuffer() is an effort to use a unique path for the 
> handling of this 2PC data, and we cannot do that safely at this stage.
> The problem is restoreTwoPhaseData(), which is called very early in

Agreed.

> the recovery process, while all the other code paths calling
> ProcessTwoPhaseBuffer() are done when we're doing reading WAL after
> we've reached consistency as far as I can see.  This is a wrong
> concept since 728bd991c3c4.  5a1dfde8334b has made that much harder to
> think about, while not integrating fully things.  It also looks like
> we may be able to just get rid of the CLOG checks entirely if we
> overwrite existing files as long as we are not in a consistent state,
> as you say.  Hmm.
> 
> I agree that e358425 and 7e125b2 are weird attempts that just try to
> work around the real issue, and that they're making some cases wrong
> while on it with potential file removals.
> 
> +typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
>                                    void *recdata, uint32 len);
> 
> Based on your latest patch, I doubt that we'll be able to do any of
> that in the back-branches.  That's much nicer in the long term to show
> what this code relies on.

The twophase.c bits of that patch turned out to be orthogonal to the key
issue, so that's fine.  I suspect the signature changes would be okay to
back-patch to v17, subject to PGXN confirmation of no extension using those
APIs.  (The TwoPhaseCallback list is hard-coded; extensions can't add
callbacks.)  That said, the rationale for back-patching fxid stuff is gone.

> I've been thinking about the strategy to use here, and here are some
> rough steps:
> - Let's first revert e358425 and 7e125b2.  This indeed needs to be
> reworked, and there is a release coming by.

+1.  e358425 is borderline, but it's hard to rule out ways of it making
recovery unlink pg_twophase files improperly.  Released code may also be doing
that, but that code's seven years of history suggests it does so infrequently
if at all.  We don't have as much evidence about what the frequency would be
under e358425.

> - Your refactoring around xid8funcs.c is a good idea on its own.  This
> can be an independent patch.

I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt().  If
so, I'll include it in that thread's patch series.

> - Let's figure out how much surgery is required with a focus on HEAD
> for now (I'm feeling that more non-backpatchable pieces are going to
> be needed here), then extract the relevant pieces that could be
> backpatchable.  Hard to say if this is going to be doable at this
> stage, or even worth doing, but it will be possible to dig into the
> details once we're happy with the state of HEAD.  My first intuition
> is that the risk of touching the back-branches may not be worth the
> potential reward based on the lack of field complaints (not seen
> customer cases, tbh): high risk, low reward.

Let's see how it turns out.  My first intuition would be to target a
back-patch, because a corner-case PITR failure is the kind of thing that can
go unreported and yet be hard to forgive.

I'm not confident I could fix both that other thread's data loss bug and
$SUBJECT in time for the 2024-02 releases.  (By $SUBJECT, I mean the
seven-year-old bug of interpreting pg_twophase before recovery consistency,
which has a higher chance of causing spurious recovery failure starting in
v17.)  Would your or someone else like to fix $SUBJECT, before or after
2024-02 releases?  I'd make time to review a patch.

> Another point to consider is if we'd better switch 2PC files to store
> a fxid rather than a xid..  Perhaps that's not necessary, but if we're
> using FullTransactionIds across the full stack of twophase.c that
> could make the whole better with even less xid <-> fxid translations
> in all these paths.  There is always the counter-argument of the extra
> 4 bytes in the 2PC files and the records, though.

Yes, perhaps so.  It could also be an option to store it only in the
pg_twophase file, not in the WAL record.  Files are a lot rarer.

Similarly, I wondered if pg_twophase files should contain the LSN of the
XLOG_XACT_PREPARE record, which would make the file's oldness unambiguous in a
way fxid doesn't achieve.  I didn't come up with a problem whose solution
requires it, though.

Thanks,
nm



Re: An improvement of ProcessTwoPhaseBuffer logic

От
"Vitaly Davydov"
Дата:
It seems, there are much deeper problems with twophase transactions as I thought. I'm interested in fixing twophase
transactions,because I support a solution which actively uses twophase transactions. I'm interested to get more deeply
intothe twophase functionality. Below, I just want to clarify some ideas behind twophase transactions. I appreciate, if
youcomment my point of view, or just ignore this email if you find it too naive and boring. 

Two phase files are created after checkpoint to keep twophase states on disk after WAL truncation. For transactions,
thatare inside the checkpoint horizon, we do not create two phase files because the states are currently stored in the
WAL.

Based on the thesis above, I guess, we have to read only those twophase files which are related to the transactions
beforethe latest checkpoint. Its full xid should be lesser than TransamVariables->nextXid (which is the same as
ControlFile->checkPointCopy.nextXidat the moment of StartupXLOG -> restoreTwoPhaseData call). The files with greater
(orequal) full xids, should be ignored and removed. That's all what we need in restoreTwoPhaseData, I believe. 

In the current implementation, such check is applied in restoreTwoPhaseData -> ProcessTwoPhaseBuffer but after checking
forxid in CLOG. I'm not sure, why we check CLOG here. Once we truncate the WAL on checkpoint and save twophase states
intopg_twophase, these files must store states of real transactions from past. I mean, if someone creates a stub file
withfull xid < TransamVariables->nextXid, we have no means (except CLOG ?) to check that this file belongs to a real
transactionfrom past. CLOG check seems to be a weak attempt to deal with it. At this point, I'm not sure that CLOG may
containstates for all full xids of existing twophase files. I guess, we should call restoreTwoPhaseData at start of
recoverybut we shouldn't check CLOG at this stage. May be it is reasonable to check some not so old xids which are
greaterthan the current CLOG horizon, but I'm not sure how CLOG segments are managed and how the horizon is moving. 

There is another question about the loading order of twophase files but I think it doesn't matter in which order we
loadthese files. But I would prefer to load it in full xid ascending order. 

On Tuesday, January 28, 2025 08:02 MSK, Michael Paquier <michael@paquier.xyz> wrote:

> Noah's
> proposal at [1] is much closer to the long-term picture that would
> look adapted.
> - The CLOG lookups that can happen in ProcessTwoPhaseBuffer() during
> recovery while a consistent state is not reached are still possible
> (planning to start a different thread about this specific issue).
>
> [1]: https://www.postgresql.org/message-id/20250116205254.65.nmisch@google.com

Agree, thank you, but my simple patch with some adjustments and swapping of checks in ProcessTwoPhaseBuffer may be
back-ported.It doesn't fix all the problems but may help to fix the problem with twophase files related to broken
latestWAL segments. 

With best regards,
Vitaly