Re: Minimal logical decoding on standbys

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Minimal logical decoding on standbys
Дата
Msg-id 20230408032645.t2st7y3tnp3eixto@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: Minimal logical decoding on standbys  (Melanie Plageman <melanieplageman@gmail.com>)
Список pgsql-hackers
Hi,

On 2023-04-07 18:32:04 -0400, Melanie Plageman wrote:
> Code review only of 0001-0005.
> 
> I noticed you had two 0008, btw.

Yea, sorry for that. One was the older version. Just before sending the patch
I saw another occurance of a test failure, which I then fixed. In the course
of that I changed something in the patch subject.


> On Fri, Apr 07, 2023 at 11:12:26AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-04-07 08:47:57 -0700, Andres Freund wrote:
> > > Integrated all of these.
> > 
> > From 0e038eb5dfddec500fbf4625775d1fa508a208f6 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Thu, 6 Apr 2023 20:00:07 -0700
> > Subject: [PATCH va67 1/9] Replace a replication slot's invalidated_at LSN with
> >  an enum
> > 
> > diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h
> > index 8872c80cdfe..ebcb637baed 100644
> > --- a/src/include/replication/slot.h
> > +++ b/src/include/replication/slot.h
> > @@ -37,6 +37,17 @@ typedef enum ReplicationSlotPersistency
> >      RS_TEMPORARY
> >  } ReplicationSlotPersistency;
> >  
> > +/*
> > + * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the
> > + * 'invalidated' field is set to a value other than _NONE.
> > + */
> > +typedef enum ReplicationSlotInvalidationCause
> > +{
> > +    RS_INVAL_NONE,
> > +    /* required WAL has been removed */
> 
> I just wonder if RS_INVAL_WAL is too generic. Something like
> RS_INVAL_WAL_MISSING or similar may be better since it seems there are
> other inavlidation causes that may be related to WAL.

Renamed to RS_INVAL_WAL_REMOVED



> > From 52c25cc15abc4470d19e305d245b9362e6b8d6a3 Mon Sep 17 00:00:00 2001
> > From: Andres Freund <andres@anarazel.de>
> > Date: Fri, 7 Apr 2023 09:32:48 -0700
> > Subject: [PATCH va67 3/9] Support invalidating replication slots due to
> >  horizon and wal_level
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=UTF-8
> > Content-Transfer-Encoding: 8bit
> > 
> > Needed for supporting logical decoding on a standby. The new invalidation
> > methods will be used in a subsequent commit.
> > 
> 
> You probably are aware, but applying 0003 and 0004 both gives me two
> warnings:
> 
> warning: 1 line adds whitespace errors.
> Warning: commit message did not conform to UTF-8.
> You may want to amend it after fixing the message, or set the config
> variable i18n.commitEncoding to the encoding your project uses.

I did see the whitespace error, but not the encoding error. We have a bunch of
other commit messages with Fabrizio's name "properly spelled" in, so I think
that's ok.



> > diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
> > index df23b7ed31e..c2a9accebf6 100644
> > --- a/src/backend/replication/slot.c
> > +++ b/src/backend/replication/slot.c
> > @@ -1241,8 +1241,58 @@ ReplicationSlotReserveWal(void)
> >  }
> >  
> >  /*
> > - * Helper for InvalidateObsoleteReplicationSlots -- acquires the given slot
> > - * and mark it invalid, if necessary and possible.
> > + * Report that replication slot needs to be invalidated
> > + */
> > +static void
> > +ReportSlotInvalidation(ReplicationSlotInvalidationCause cause,
> > +                       bool terminating,
> > +                       int pid,
> > +                       NameData slotname,
> > +                       XLogRecPtr restart_lsn,
> > +                       XLogRecPtr oldestLSN,
> > +                       TransactionId snapshotConflictHorizon)
> > +{
> > +    StringInfoData err_detail;
> > +    bool        hint = false;
> > +
> > +    initStringInfo(&err_detail);
> > +
> > +    switch (cause)
> > +    {
> > +        case RS_INVAL_WAL:
> > +            hint = true;
> > +            appendStringInfo(&err_detail, _("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes."),
> > +                             LSN_FORMAT_ARGS(restart_lsn),
> 
> I'm not sure what the below cast is meant to do. If you are trying to
> protect against overflow/underflow, I think you'd need to cast before
> doing the subtraction.


> > +                             (unsigned long long) (oldestLSN - restart_lsn));
> > +            break;

That's our current way of passing 64bit numbers to format string
functions. It ends up as a 64bit number everywhere, even windows (with its
stupid ILP32 model).



> > +        case RS_INVAL_HORIZON:
> > +            appendStringInfo(&err_detail, _("The slot conflicted with xid horizon %u."),
> > +                             snapshotConflictHorizon);
> > +            break;
> > +
> > +        case RS_INVAL_WAL_LEVEL:
> > +            appendStringInfo(&err_detail, _("Logical decoding on standby requires wal_level to be at least logical
onthe primary server"));
 
> > +            break;
> > +        case RS_INVAL_NONE:
> > +            pg_unreachable();
> > +    }
> 
> This ereport is quite hard to read. Is there any simplification you can
> do of the ternaries without undue duplication?

I tried a bunch, and none really seemed an improvement.


> > @@ -1286,10 +1340,45 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >          restart_lsn = s->data.restart_lsn;
> >  
> >          /*
> > -         * If the slot is already invalid or is fresh enough, we don't need to
> > -         * do anything.
> > +         * If the slot is already invalid or is a non conflicting slot, we
> > +         * don't need to do anything.
> >           */
> > -        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
> > +        if (s->data.invalidated == RS_INVAL_NONE)
> > +        {
> > +            switch (cause)
> > +            {
> > +                case RS_INVAL_WAL:
> > +                    if (s->data.restart_lsn != InvalidXLogRecPtr &&
> > +                        s->data.restart_lsn < oldestLSN)
> > +                        conflict = cause;
> > +                    break;
> 
> Should the below be an error? a physical slot with RS_INVAL_HORIZON
> invalidation cause?

InvalidatePossiblyObsoleteSlot() gets called for all existing slots, so it's
normal for RS_INVAL_HORIZON to encounter a physical slot.


> > +                case RS_INVAL_HORIZON:
> > +                    if (!SlotIsLogical(s))
> > +                        break;
> > +                    /* invalid DB oid signals a shared relation */
> > +                    if (dboid != InvalidOid && dboid != s->data.database)
> > +                        break;
> > +                    if (TransactionIdIsValid(s->effective_xmin) &&
> > +                        TransactionIdPrecedesOrEquals(s->effective_xmin,
> > +                                                      snapshotConflictHorizon))
> > +                        conflict = cause;
> > +                    else if (TransactionIdIsValid(s->effective_catalog_xmin) &&
> > +                             TransactionIdPrecedesOrEquals(s->effective_catalog_xmin,
> > +                                                           snapshotConflictHorizon))
> > +                        conflict = cause;
> > +                    break;
> > +                case RS_INVAL_WAL_LEVEL:
> > +                    if (SlotIsLogical(s))
> > +                        conflict = cause;
> > +                    break;
> 
> All three of default, pg_unreachable(), and break seems a bit like
> overkill. Perhaps remove the break?

I always get nervous about case statements without a break, due to the
fallthrough behaviour of switch/case. So I add it very habitually. I replaced
it with
                case RS_INVAL_NONE:
                    pg_unreachable();
that way we get warnings about unhandled cases too. Not sure why I hadn't done
that.

> > @@ -1390,14 +1476,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >              ReplicationSlotMarkDirty();
> >              ReplicationSlotSave();
> >              ReplicationSlotRelease();
> > +            pgstat_drop_replslot(s);
> >  
> > -            ereport(LOG,
> > -                    errmsg("invalidating obsolete replication slot \"%s\"",
> > -                           NameStr(slotname)),
> > -                    errdetail("The slot's restart_lsn %X/%X exceeds the limit by %llu bytes.",
> > -                              LSN_FORMAT_ARGS(restart_lsn),
> > -                              (unsigned long long) (oldestLSN - restart_lsn)),
> > -                    errhint("You might need to increase max_slot_wal_keep_size."));
> > +            ReportSlotInvalidation(conflict, false, active_pid,
> > +                                   slotname, restart_lsn,
> > +                                   oldestLSN, snapshotConflictHorizon);
> >  
> >              /* done with this slot for now */
> >              break;
> > @@ -1410,19 +1493,33 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
> >  }
> >  
> >  /*
> > - * Mark any slot that points to an LSN older than the given segment
> > - * as invalid; it requires WAL that's about to be removed.
> > + * Invalidate slots that require resources about to be removed.
> >   *
> >   * Returns true when any slot have got invalidated.
> >   *
> > + * Whether a slot needs to be invalidated depends on the cause. A slot is
> > + * removed if it:
> > + * - RS_INVAL_WAL: requires a LSN older than the given segment
> > + * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon, in the given db
> > +     dboid may be InvalidOid for shared relations
> 
> the comma above reduces readability
> 
> is this what you mean?
> 
> RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
> db; dboid may be InvalidOid for shared relations

Yep.


> > @@ -1443,7 +1443,13 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
> >                                         slotname, restart_lsn,
> >                                         oldestLSN, snapshotConflictHorizon);
> >  
> > -                (void) kill(active_pid, SIGTERM);
> > +                if (MyBackendType == B_STARTUP)
> 
> Is SendProcSignal() marked warn_unused_result or something? I don't see
> other callers who don't use its return value void casting it.

I went back and forth about it. I think Bertrand added. It looks a bit odd to
have it, for the reason you say. It also looks a bit odd to not have, given
the parallel (void) kill().


> > diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
> > index 9f56b4e95cf..3b5d654347e 100644
> > --- a/src/backend/storage/ipc/standby.c
> > +++ b/src/backend/storage/ipc/standby.c
> > @@ -24,6 +24,7 @@
> >  #include "access/xlogutils.h"
> >  #include "miscadmin.h"
> >  #include "pgstat.h"
> > +#include "replication/slot.h"
> >  #include "storage/bufmgr.h"
> >  #include "storage/lmgr.h"
> >  #include "storage/proc.h"
> > @@ -466,6 +467,7 @@ ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
> >   */
> >  void
> >  ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> > +                                    bool isCatalogRel,
> >                                      RelFileLocator locator)
> >  {
> >      VirtualTransactionId *backends;
> > @@ -491,6 +493,16 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> >                                             PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
> >                                             WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT,
> >                                             true);
> > +
> > +    /*
> > +     * Note that WaitExceedsMaxStandbyDelay() is not taken into account here
> > +     * (as opposed to ResolveRecoveryConflictWithVirtualXIDs() above). That
> > +     * seems OK, given that this kind of conflict should not normally be
> 
> do you mean "when using a physical replication slot"?

> > +     * reached, e.g. by using a physical replication slot.
> > +     */
> > +    if (wal_level >= WAL_LEVEL_LOGICAL && isCatalogRel)
> > +        InvalidateObsoleteReplicationSlots(RS_INVAL_HORIZON, 0, locator.dbOid,
> > +                                           snapshotConflictHorizon);
> >  }

No. I mean that normally a physical replication slot, or some other approach,
should prevent such conflicts. I replaced 'by' with 'due to'


Thanks a lot for the review!

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: daitch_mokotoff module
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: Minimal logical decoding on standbys