Обсуждение: ReplicationSlotRelease() crashes when the instance is in the single user mode

Поиск
Список
Период
Сортировка

ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

I found $SUBJECT when I'm playing with the single user mode.

How to reproduce
===========
You can reproduce the failure with below steps.

```
# Initialize an instance
$ initdb -D data -U postgres
# Start it as single user mode
$ postgres --single -D data/ postgres

PostgreSQL stand-alone backend 18devel
backend> SELECT pg_create_physical_replication_slot(slot_name := 'physical_slot', immediately_reserve := true);
...
backend> SELECT pg_replication_slot_advance('physical_slot', pg_current_wal_lsn());
         1: pg_replication_slot_advance (typeid = 2249, len = -1, typmod = -1, byval = f)
        ----
TRAP: failed Assert("slot != NULL && (slot->active_pid != 0)"), File: "../postgres/src/backend/replication/slot.c",
Line:674, PID: 430860 
postgres(ExceptionalCondition+0xab)[0xb86a2a]
postgres(ReplicationSlotRelease+0x5a)[0x8df10b]
postgres(pg_replication_slot_advance+0x330)[0x8e46ed]
...
```

Analysis
=====
We trapped at below assertion in ReplicationSlotRelease(). IIUC, `slot->active_pid` is set
only when the process is under the postmaster, but ReplicationSlotRelease() always requires it.

```
    Assert(slot != NULL && slot->active_pid != 0);
```

Possible fix
=======

Naively considered, there are two approaches to fix this. 1) set active_pid when even in the single
user mode [1], or 2) ease the condition to accept the situation [2]. I'm not familiar with the mode,
but [1] seems better if we want to unify codes.

Thought?

[1]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -599,7 +599,7 @@ retry:
                SpinLockRelease(&s->mutex);
        }
        else
-               active_pid = MyProcPid;
+               s->active_pid = active_pid = MyProcPid;
        LWLockRelease(ReplicationSlotControlLock);

        /*
```
[2]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -671,7 +671,8 @@ ReplicationSlotRelease(void)
        bool            is_logical = false; /* keep compiler quiet */
        TimestampTz now = 0;

-       Assert(slot != NULL && slot->active_pid != 0);
+       Assert(slot != NULL &&
+                  (slot->active_pid != 0 || !IsUnderPostmaster));

        if (am_walsender)
        {
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"David G. Johnston"
Дата:
On Monday, February 17, 2025, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:

backend> SELECT pg_create_physical_replication_slot(slot_name := 'physical_slot', immediately_reserve := true);

Since this function releases the slot when it returns, re-acquisition, even by the same backend, must always re-associate MyProcPid to the named slot.
 

[1]:
```
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -599,7 +599,7 @@ retry:
                SpinLockRelease(&s->mutex);
        }
        else
-               active_pid = MyProcPid;
+               s->active_pid = active_pid = MyProcPid;
        LWLockRelease(ReplicationSlotControlLock);

        /*
```

This, but you cannot modify the slot without holding the spinlock.

I’d probably add an assert that the existing state of s->active_pid is either 0 or MyProcPid already.  In single-user mode it mustn’t, really cannot, be anything else.  But the failure here is because the SQL function does a slot release; there are probably other reasonable paths where the assignment of MyProcPid during slot creation is retained and encountered during a subsequent slot acquire.

David J.

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

> Perhaps a very naive question, but is there any point in authorizing
> manipulations of MyReplicationSlot in single-user mode, to begin with?
> With this remark, I would mean to apply a rule to
> ReplicationSlotAcquire(), so as all its callers would know about that.

According to the original thread [1], there was a wide consensus replication-related
operations can be rejected, except the slot removal. I feel this is reasonable.

Currently pg_drop_replication_slot() requires the droping slot can be acquired,
so we cannot reject it in single user mode as-is. Maybe we should revive the 0002
patch in [1] then try to do that. Thought?

[1]: https://www.postgresql.org/message-id/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED




RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

> Ah, good point for the slot drop.  So 0ce5cf2ef24f is claiming that
> some of these manipulations are OK.  I didn't suspect this one.

Yeah, I think so.

> Slot advancing is a very different beast, unfortunately, that may
> depend on many other subsystems.  For example with logical slots we
> would finish by calling rm_decode, which could be outside of core.
> Justifying that this operation is supported in single-user mode is
> larger than what you are suggesting here..

OK. Actually, I do not have any use-cases in my mind. I played because
I found a special path for single user mode in ReplicationSlotAcquire()
so that I wanted to see the behavior.

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Michael Paquier
Дата:
On Wed, Feb 19, 2025 at 02:57:34AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Based on the discussion, I feel it is enough to add quick error out
> for SQL functions. PSA attached.

I did not check how these call behave individually, just a few
comments while putting my eyes on the patch.

+    if (!IsUnderPostmaster)
+        elog(ERROR,
+             "slot operation is prohibited in the single user mode");

elog() should not be used for failures that can be user-facing as this
would not provide any translation.

I'd suggest rewording the error message to provide some more context,
as well, say:
"cannot use %s in single-user mode", "function_name"
--
Michael

Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

> I did not check how these call behave individually, just a few
> comments while putting my eyes on the patch.
>
> +    if (!IsUnderPostmaster)
> +        elog(ERROR,
> +             "slot operation is prohibited in the single user mode");
>
> elog() should not be used for failures that can be user-facing as this
> would not provide any translation.

I intentionally used elog() because I thought single user mode is not user-facing.
But it is OK for me to use ereport() instead.

> I'd suggest rewording the error message to provide some more context,
> as well, say:
> "cannot use %s in single-user mode", "function_name"

Fixed. PSA new version

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"David G. Johnston"
Дата:
On Wed, Feb 19, 2025 at 7:23 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
I intentionally used elog() because I thought single user mode is not user-facing.
But it is OK for me to use ereport() instead.

Single-user mode is also known as "Oh crap!" mode, something used when starting the server in multi-user model fails for non-trivial reasons.

It is also what at least one PostgreSQL Online Service (a.k.a. fiddle) uses as an implementation choice.

David J.

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Bertrand Drouvot
Дата:
Hi,

On Thu, Feb 20, 2025 at 02:22:41AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Dear Michael,
> 

Thanks for the report and the patch!

> > I did not check how these call behave individually, just a few
> > comments while putting my eyes on the patch.
> > 
> > +    if (!IsUnderPostmaster)
> > +        elog(ERROR,
> > +             "slot operation is prohibited in the single user mode");
> > 
> > elog() should not be used for failures that can be user-facing as this
> > would not provide any translation.
> 
> I intentionally used elog() because I thought single user mode is not user-facing.
> But it is OK for me to use ereport() instead.

Yeah, I think it's also about using ereport for cases that we think can happen (
and so needs to be translated). In this case, it clearly can happen so ereport()
is better.

> > I'd suggest rewording the error message to provide some more context,
> > as well, say:
> > "cannot use %s in single-user mode", "function_name"
> 
> Fixed. PSA new version

=== 1

Nit:

"cannot use function %s in single-user mode", "function_name"" maybe? (that would
somehow match the existing ""user-defined types cannot use subscripting function %s")

=== 2

+ "pg_copy_replication_slot")));

s/pg_copy_replication_slot/copy_replication_slot/ maybe? As it is the function
that would report the error actually (could be called from
pg_copy_logical_replication_slot_[a|b|c] and pg_copy_physical_replication_slot_[a|b]
though).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Zhijie Hou (Fujitsu)"
Дата:
On Thursday, February 20, 2025 10:23 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Michael,
>
> > I did not check how these call behave individually, just a few
> > comments while putting my eyes on the patch.
> >
> > +    if (!IsUnderPostmaster)
> > +        elog(ERROR,
> > +             "slot operation is prohibited in the single user
> mode");
> >
> > elog() should not be used for failures that can be user-facing as this
> > would not provide any translation.
>
> I intentionally used elog() because I thought single user mode is not
> user-facing.
> But it is OK for me to use ereport() instead.
>
> > I'd suggest rewording the error message to provide some more context,
> > as well, say:
> > "cannot use %s in single-user mode", "function_name"
>
> Fixed. PSA new version

I'm curious about the scope of the restrictions we plan to add. For example,
the current patch does not include checks in the functions used for consuming
changes (such as pg_logical_slot_get_changes). Was this omission intentional?

Best Regards,
Hou zj



Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Amit Kapila
Дата:
On Thu, Feb 20, 2025 at 4:26 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, February 20, 2025 10:23 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Michael,
> >
> > > I did not check how these call behave individually, just a few
> > > comments while putting my eyes on the patch.
> > >
> > > +   if (!IsUnderPostmaster)
> > > +           elog(ERROR,
> > > +                    "slot operation is prohibited in the single user
> > mode");
> > >
> > > elog() should not be used for failures that can be user-facing as this
> > > would not provide any translation.
> >
> > I intentionally used elog() because I thought single user mode is not
> > user-facing.
> > But it is OK for me to use ereport() instead.
> >
> > > I'd suggest rewording the error message to provide some more context,
> > > as well, say:
> > > "cannot use %s in single-user mode", "function_name"
> >
> > Fixed. PSA new version
>
> I'm curious about the scope of the restrictions we plan to add. For example,
> the current patch does not include checks in the functions used for consuming
> changes (such as pg_logical_slot_get_changes). Was this omission intentional?
>

Also, what about pg_replication_origin_* APIs? Do we want to restrict
those as well if we are restricting slot operations? I don't see any
solid theory presented in this thread on why we should add new checks
in multiple APIs restricting those in single-user mode.

--
With Regards,
Amit Kapila.



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Amit,

> Also, what about pg_replication_origin_* APIs? Do we want to restrict
> those as well if we are restricting slot operations? I don't see any
> solid theory presented in this thread on why we should add new checks
> in multiple APIs restricting those in single-user mode.

As David [1] and documentation [2] described, single user mode is typically used
for initialization, debugging and recovery purpose - not for normal purposes.
I think doing replication stuffs is not intended. Tom also considered like that [4].

The error I reported seemed to be introduced seven years ago (0ce5cf2), and no
one has reported till now. This also implies that there are no reasons to support
it.

Ideally, functions described in [5] can be banned in the single-user mode, except
the pg_drop_replication_slot(). Thought?

[1]: https://www.postgresql.org/message-id/CAKFQuwbnBkGZAM%2B5b1DWmbqU5W7b1r-nQsw87BukrUC5WLrJXg%40mail.gmail.com
[2]: https://www.postgresql.org/docs/devel/app-postgres.html
[3]: https://www.postgresql.org/message-id/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com
[4]: https://www.postgresql.org/message-id/25024.1525789919%40sss.pgh.pa.us
[5]: https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION

Best regards,
Hayato Kuroda
FUJITSU LIMITED

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear hackers,

Thanks everyone for giving comments! PSA new version.
What's new:

- Message format was modified to {"cannot use function %s in single-user mode", "function_name"}
- Reporting funcname was adjusted based on the parameters. ternary operator was used.
- Guard was added for functions in logicalfunction.c.

For now, functions for replication origin and replication messages were retained.
I can handle them after the discussion.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Amit Kapila
Дата:
On Thu, Feb 20, 2025 at 6:21 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> Thanks everyone for giving comments! PSA new version.
> What's new:
>
> - Message format was modified to {"cannot use function %s in single-user mode", "function_name"}
> - Reporting funcname was adjusted based on the parameters. ternary operator was used.
> - Guard was added for functions in logicalfunction.c.
>

Shouldn't such a check be present in the CheckSlotPermissions() kind
of function to perform it in the central place?

> For now, functions for replication origin and replication messages were retained.
> I can handle them after the discussion.
>

Which other functions do we see similar restrictions? I checked
"sequence manipulation functions" (1), and "Transaction ID and
Snapshot Information Functions" (2) but couldn't see similar
restrictions.

(1) - https://www.postgresql.org/docs/current/functions-sequence.html
(2) - https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-SNAPSHOT

--
With Regards,
Amit Kapila.



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Amit,

> Shouldn't such a check be present in the CheckSlotPermissions() kind
> of function to perform it in the central place?

OK. I checked whether we can reuse pre-existing functions, but it seems not appropriate.
CheckSlotPermissions() is called even by pg_drop_replication_slot(), and
CheckSlotRequirements() is not called by pg_sync_replication_slots(). I defined new
function CheckSlotIsInSingleUserMode() and put the check there.
I removed function name from the ereport(), but I feel it is enough - CheckSlotPermissions()
and CheckSlotRequirements() do not have.

> > For now, functions for replication origin and replication messages were
> retained.
> > I can handle them after the discussion.
> >
> 
> Which other functions do we see similar restrictions? I checked
> "sequence manipulation functions" (1), and "Transaction ID and
> Snapshot Information Functions" (2) but couldn't see similar
> restrictions.
> 
> (1) - https://www.postgresql.org/docs/current/functions-sequence.html
> (2) -
> https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INF
> O-SNAPSHOT

I grepped sources and could not find explicit limitations neither. So...this might
be overkill. But I still think the restriction is OK for the slot - no need to do
some efforts for accepting single-user mode, just close the cover.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Amit Kapila
Дата:
On Thu, Feb 27, 2025 at 1:29 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> >
> > Which other functions do we see similar restrictions? I checked
> > "sequence manipulation functions" (1), and "Transaction ID and
> > Snapshot Information Functions" (2) but couldn't see similar
> > restrictions.
> >
> > (1) - https://www.postgresql.org/docs/current/functions-sequence.html
> > (2) -
> > https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INF
> > O-SNAPSHOT
>
> I grepped sources and could not find explicit limitations neither. So...this might
> be overkill.
>

Yes, this is my worry.

> But I still think the restriction is OK for the slot - no need to do
> some efforts for accepting single-user mode, just close the cover.
>

I understand that we may not have a clear use case for this to work in
single-user mode. But how will we define the boundary in similar
cases? I mean, we should have some rule for such exposed functions,
and it should be followed uniformly. Now, if one needs a bigger or
complex change to make the function work in single-user mode, then it
is easy to take an exception from what is currently being followed in
the code. However, if the change is as trivial as you proposed in the
first email, why not go with that and make this function work in
single-user mode?

--
With Regards,
Amit Kapila.



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Amit,

> I understand that we may not have a clear use case for this to work in
> single-user mode. But how will we define the boundary in similar
> cases? I mean, we should have some rule for such exposed functions,
> and it should be followed uniformly. Now, if one needs a bigger or
> complex change to make the function work in single-user mode, then it
> is easy to take an exception from what is currently being followed in
> the code. However, if the change is as trivial as you proposed in the
> first email, why not go with that and make this function work in
> single-user mode?

Hmm, the opinion about this is completely opposite with other reviewers. I want
to ask them again how you feel. I also added Tom who pointed out in the initial
thread.

Question: how you feel to restrict SQL functions for slot during the single-user
mode? Nobody has considered use cases for it; we do not have concrete theory and
similar cases. And needed band-aid to work seems very small.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Paul A Jungwirth
Дата:
Mutaamba (cc'd) and I reviewed this patch together.

To summarize the patch and thread so far: The patch adds a new
function, CheckSlotIsInSingleUserMode. If true then we raise an error.
Otherwise we would trip an assert in ReplicationSlotRelease requiring
the slot to have an active_pid, which is never set in single-user
mode. We do want to support pg_drop_replication_slot in single-user
mode, but not other replication slot functions. In particular,
advancing the slot may call non-core code, which seems risky in
single-user mode.

The patch does not apply. The attached v5 fixes it with a small update
to the context for the slot.h hunk.

The commit message needs some explanation. Why are we doing this? What
does the patch do? What alternatives did we consider?

The current error message seems reasonable.

The patch has no docs. If we plan to forbid some functions in
single-user mode, we should document which ones (e.g. in func.sgml).

There are no tests. Do we have any tests at all for single-user mode?
The only one I see is in recovery/t/017_shm.pl. What if we had tests
that ran the regress suite in single-user mode? Basically instead of
psql we run postgres --single? Do we expect a lot of it would fail? Is
it small enough the test could maintain a diff that expects those
failures? I'm not saying we should do that as part of this patch, but
is there any interest in that? Since single-user mode is most often
used when things are broken, it's a harsh place to hit a bug.

The patch lacks source comments. Something on
CheckSlotIsInSingleUserMode explaining why would be good.

In ReplicationSlotRelease, why only assert that `slot->active_pid !=
0`? Why not assert that it's MyProcPid (even outside single-user
mode)? Can one backend really release a slot while another is using
it? Can you drop it? Maybe you can copy it?

Are we calling CheckSlotIsInSingleUserMode from everywhere that needs
it? We tried to check all the functions that call
ReplicationSlotCreate, ReplicationSlotRelease, and
update_synced_slots_inactive_since (since they all have these
asserts). To call out a few:

The pg_replication_origin_* functions don't call the Assert-ing functions.

binary_upgrade_logical_slot_has_caught_up - Not available from the command line.

WalSndErrorCleanup - Probably not used in single-user mode?

We also see that PostgresMain calls ReplicationSlotRelease from its
error handler. Presumably single-user mode would be executing
PostgresSingleUserMain instead, but still perhaps we should call
CheckSlotIsInSingleUserMode here just as a sanity-check?

On Thu, Feb 27, 2025 at 9:42 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> > I understand that we may not have a clear use case for this to work in
> > single-user mode. But how will we define the boundary in similar
> > cases? I mean, we should have some rule for such exposed functions,
> > and it should be followed uniformly. Now, if one needs a bigger or
> > complex change to make the function work in single-user mode, then it
> > is easy to take an exception from what is currently being followed in
> > the code. However, if the change is as trivial as you proposed in the
> > first email, why not go with that and make this function work in
> > single-user mode?
>
> Hmm, the opinion about this is completely opposite with other reviewers. I want
> to ask them again how you feel. I also added Tom who pointed out in the initial
> thread.
>
> Question: how you feel to restrict SQL functions for slot during the single-user
> mode? Nobody has considered use cases for it; we do not have concrete theory and
> similar cases. And needed band-aid to work seems very small.

No one has replied yet, but I vote for forbidding these functions. I
can't articulate a full theory for which functions we restrict in
single-user mode, and I think we should permit as much as possible.
But any theory would weigh usefulness against risk. Here no one has
found a use-case for several years, and executing user-supplied code
in an unusual, higher-risk scenario seems like a good reason to be
cautious. Also without tests for single-user mode, I'm extra wary. If
single-user support is desired by someone, they could submit a patch.

Yours,

--
Paul              ~{:-)
pj@illuminatedcomputing.com

Вложения

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Robert Haas
Дата:
On Tue, Aug 5, 2025 at 12:51 PM Paul A Jungwirth
<pj@illuminatedcomputing.com> wrote:
> No one has replied yet, but I vote for forbidding these functions. I
> can't articulate a full theory for which functions we restrict in
> single-user mode, and I think we should permit as much as possible.
> But any theory would weigh usefulness against risk. Here no one has
> found a use-case for several years, and executing user-supplied code
> in an unusual, higher-risk scenario seems like a good reason to be
> cautious. Also without tests for single-user mode, I'm extra wary. If
> single-user support is desired by someone, they could submit a patch.

I don't feel good about the direction from which this patch is
attacking the problem. The original stack trace looks like this:

postgres(ExceptionalCondition+0xab)[0xb86a2a]
postgres(ReplicationSlotRelease+0x5a)[0x8df10b]
postgres(pg_replication_slot_advance+0x330)[0x8e46ed]

ReplicationSlotRelease() is called near the end of
pg_replication_slot_advance. IIUC, this means that we successfully
acquired the slot and did all the work, and then failed a sanity check
afterward. So, I see two possibilities: either it's not OK to acquire
a replication slot in single-user mode, in which case this should have
failed earlier, or it's also OK to release it, in which case this
should not have failed at all. Interestingly, I see that
ReplicationSlotAcquire() features a special case whose specific
purpose is to allow it to work in single-user mode, which makes me
rather suspect that the latter is correct. It is possible that it was
intentional that acquiring works in single-user mode and dropping
(other than via ReplicationSlotDropAcquired) does not, but so far I
see nothing in the comments suggesting that.

In any event, that inconsistency between how ReplicationSlotAcquire()
works and how ReplicationSlotRelease() works seems to me to be the
core thing that should be fixed here. If the selected fix also
requires error checks in higher-level functions, as added by this
patch, then well and good. But right now, the proposed patch isn't
trying to fix the definitional inconsistency in the underlying layer
at all, and is instead just installing guard rails to try to keep
anyone from bumping into it. That seems like a bad direction to go --
it feels like we're blocking access to potentially-useful
functionality because of an assertion failure that might just be
slightly sloppy coding.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Robert, Paul, Mutaamba,

Sorry for the late reply. I was in the business trip.

> I don't feel good about the direction from which this patch is
> attacking the problem. The original stack trace looks like this:
> 
> postgres(ExceptionalCondition+0xab)[0xb86a2a]
> postgres(ReplicationSlotRelease+0x5a)[0x8df10b]
> postgres(pg_replication_slot_advance+0x330)[0x8e46ed]
> 
> ReplicationSlotRelease() is called near the end of
> pg_replication_slot_advance. IIUC, this means that we successfully
> acquired the slot and did all the work, and then failed a sanity check
> afterward.

Exactly. The slot is trying to be released after most of tasks are done.

```
#6  0x00000000008fd20b in pg_replication_slot_advance (fcinfo=0x23b4428)
    at ../postgres/src/backend/replication/slotfuncs.c:588
588             ReplicationSlotRelease();
(gdb) list
583              * advancing potentially done.
584              */
585             ReplicationSlotsComputeRequiredXmin(false);
586             ReplicationSlotsComputeRequiredLSN();
587 
588             ReplicationSlotRelease();
```

> So, I see two possibilities: either it's not OK to acquire
> a replication slot in single-user mode, in which case this should have
> failed earlier, or it's also OK to release it, in which case this
> should not have failed at all. Interestingly, I see that
> ReplicationSlotAcquire() features a special case whose specific
> purpose is to allow it to work in single-user mode, which makes me
> rather suspect that the latter is correct.

Agreed.

> It is possible that it was
> intentional that acquiring works in single-user mode and dropping
> (other than via ReplicationSlotDropAcquired) does not, but so far I
> see nothing in the comments suggesting that.
>
> In any event, that inconsistency between how ReplicationSlotAcquire()
> works and how ReplicationSlotRelease() works seems to me to be the
> core thing that should be fixed here. If the selected fix also
> requires error checks in higher-level functions, as added by this
> patch, then well and good. But right now, the proposed patch isn't
> trying to fix the definitional inconsistency in the underlying layer
> at all, and is instead just installing guard rails to try to keep
> anyone from bumping into it. That seems like a bad direction to go --

To confirm; your point is that we should firstly fix to allow acquiring/releasing
slots in the mode, then consider additional guards, is it right? Valid point.
We've not deeply considered but ideas described in [1] could be used.

> it feels like we're blocking access to potentially-useful
> functionality because of an assertion failure that might just be
> slightly sloppy coding.

I still cannot find enough use-cases to allow manipulating slots, though.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB14966ED588A0328DAEBE8CB25F5FA2%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Robert Haas
Дата:
On Wed, Aug 13, 2025 at 3:21 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
> To confirm; your point is that we should firstly fix to allow acquiring/releasing
> slots in the mode, then consider additional guards, is it right? Valid point.

Yes.

> I still cannot find enough use-cases to allow manipulating slots, though.

The use case for single-user mode is quite limited in general, but
most things work in single-user mode unless they are something that
intrinsically can't. For example, parallel query cannot be used in
single-user mode, because there is only one process. Replication
cannot work, for the same reason. But manual slot operations can work,
so I do not think it is good to arbitrarily prohibit them. We do not
need a reason to specifically allow them; it is enough that there is
no good reason for them to be blocked.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Paul, Mutaamba,

Here are updated patches. Based on the Robert's suggestion, I separated into two parts.
0001 fixed the original issue and 0002 prohibited the slot manipulation in
single-user mode. I want to focus on 0001 first because on one would argue it.

All comments from you were included in 0002.

> The patch does not apply. The attached v5 fixes it with a small update
> to the context for the slot.h hunk.

Oh, I didn't realize because cfbot said OK. Anyway, new patch could be applied atop HEAD.

> The commit message needs some explanation. Why are we doing this? What
> does the patch do? What alternatives did we consider?

Updated. How do you feel?

> The patch has no docs. If we plan to forbid some functions in
> single-user mode, we should document which ones (e.g. in func.sgml).

A statement was added.

> There are no tests. Do we have any tests at all for single-user mode?
> The only one I see is in recovery/t/017_shm.pl. What if we had tests
> that ran the regress suite in single-user mode? Basically instead of
> psql we run postgres --single? Do we expect a lot of it would fail? Is
> it small enough the test could maintain a diff that expects those
> failures? I'm not saying we should do that as part of this patch, but
> is there any interest in that? Since single-user mode is most often
> used when things are broken, it's a harsh place to hit a bug.
> 
> The patch lacks source comments. Something on
> CheckSlotIsInSingleUserMode explaining why would be good.

Comments were added in all caller functions.

> In ReplicationSlotRelease, why only assert that `slot->active_pid !=
> 0`? Why not assert that it's MyProcPid (even outside single-user
> mode)? Can one backend really release a slot while another is using
> it? Can you drop it? Maybe you can copy it?

You meant we should assert `slot->active_pid == MyProcPid`, right?
Naively considered it can be fix, but it is out-of-scope of the thread. It is
existing code, should be discussed verified in another place.

> Are we calling CheckSlotIsInSingleUserMode from everywhere that needs
> it? We tried to check all the functions that call
> ReplicationSlotCreate, ReplicationSlotRelease, and
> update_synced_slots_inactive_since (since they all have these
> asserts). To call out a few:
> 
> The pg_replication_origin_* functions don't call the Assert-ing functions.

You asked that whether we should call CheckSlotIsInSingleUserMode in
pg_replication_origin_* APIs, right? It depends on the policy. If we want to
prohibit all replication-related command, it should be. It is still under
discussion, so I did not touch.

> binary_upgrade_logical_slot_has_caught_up - Not available from the command
> line.

I found that we can in both single-user and binary-upgrade mode. At that time
the function could be called:

```
$ postgres --single -D data/ postgres -b

PostgreSQL stand-alone backend 19devel
backend> SELECT binary_upgrade_logical_slot_has_caught_up('test');
         1: binary_upgrade_logical_slot_has_caught_up   (typeid = 16, len = 1, typmod = -1, byval = t)
        ----
LOG:  starting logical decoding for slot "test"
DETAIL:  Streaming transactions committing after 0/0182C640, reading WAL from 0/0182C608.
STATEMENT:  SELECT binary_upgrade_logical_slot_has_caught_up('test');

LOG:  logical decoding found consistent point at 0/0182C608
DETAIL:  There are no running transactions.
STATEMENT:  SELECT binary_upgrade_logical_slot_has_caught_up('test');
 
         1: binary_upgrade_logical_slot_has_caught_up = "f"     (typeid = 16, len = 1, typmod = -1, byval = t)
        ----
```

I don't think this is a realistic case, but the check was added.

> WalSndErrorCleanup - Probably not used in single-user mode?

IIUC we won't reach in the single-user mode. Hence it was not called intentionally.

> We also see that PostgresMain calls ReplicationSlotRelease from its
> error handler. Presumably single-user mode would be executing
> PostgresSingleUserMain instead,

ISTM, PostgresMain() would be called from the PostgresSingleUserMain().

> but still perhaps we should call
> CheckSlotIsInSingleUserMode here just as a sanity-check?

I feel this code is to release the acquired slot when ERROR was raised. Since we
have already covered the entrance, it is not needed.

[1]: https://www.postgresql.org/message-id/CA%2BTgmobspWhoMysNHL9b3C9xi%3DOpHdBYhtAgDH4N_A2foyjN-w%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Robert,

> The use case for single-user mode is quite limited in general, but
> most things work in single-user mode unless they are something that
> intrinsically can't. For example, parallel query cannot be used in
> single-user mode, because there is only one process. Replication
> cannot work, for the same reason. But manual slot operations can work,
> so I do not think it is good to arbitrarily prohibit them. We do not
> need a reason to specifically allow them; it is enough that there is
> no good reason for them to be blocked.

Thanks for the confirmation. Till now, it is mostly fifty-fifty in the community
whether we should allow the manipulation in the single-user mode.

I did split a patch to fix the initial issue [1]. Let's focus on the 0001 patch
firstly.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB14966B39B3ED21229EAEFA10FF531A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Michael Paquier
Дата:
On Mon, Aug 18, 2025 at 08:47:27AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Here are updated patches. Based on the Robert's suggestion, I separated into two parts.
> 0001 fixed the original issue and 0002 prohibited the slot manipulation in
> single-user mode. I want to focus on 0001 first because on one would argue it.

FWIW, I had my doubts at the beginning on the thread about the
use-case, but after re-reading the whole thing I am going to side with
Robert's opinion that if a fix to authorize some of the operations is
simple, then let's just authorize the case(s) and have the fix.

Echoing with Paul, what I find as critically lacking from the proposed
patches are regression tests to check and validate the behaviors you
are looking for in the patches.  Please add some in the shape of perl
commands that use direct --single commands of postgres.  We have cases
that do so currently in the tree: sepgsql and shm.  I don't see a
reason why we could not do that as well here with run_log().  It is
also possible to pass a -c exit_on_error=true to force failures.

Not having a test for the single-user case of a slot drop is of course
something that we are lacking now.  As we are playing with this area
of the code, let's add something for this case as well.
--
Michael

Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael, Paul, Mutaamba,

> FWIW, I had my doubts at the beginning on the thread about the
> use-case, but after re-reading the whole thing I am going to side with
> Robert's opinion that if a fix to authorize some of the operations is
> simple, then let's just authorize the case(s) and have the fix.

OK, thanks for the clarification. Let's focus on 0001 now.

> Echoing with Paul, what I find as critically lacking from the proposed
> patches are regression tests to check and validate the behaviors you
> are looking for in the patches.  Please add some in the shape of perl
> commands that use direct --single commands of postgres.  We have cases
> that do so currently in the tree: sepgsql and shm.  I don't see a
> reason why we could not do that as well here with run_log().  It is
> also possible to pass a -c exit_on_error=true to force failures.
>
> Not having a test for the single-user case of a slot drop is of course
> something that we are lacking now.  As we are playing with this area
> of the code, let's add something for this case as well.

I added a first version of the test. It could work on my env (Linux).
Since I cannot come up the appropriate place, I introduced new test file in
recovery test. creating/dropping/advancing a slot, and doing a logical decoding
is tested. Since standby mode is not supported by the single-user instance,
I did not test the slot synchronization. I have not known that.

0002 was updated accordingly just in case. In the patch some functions were expected
to fail. Also, a check for pg_sync_replication_slots() was removed because we cannot call
the function in the first place.

IIUC, 0001 can be backpatched for all supported branches. I will create for them
after patch would be a good shape.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
> I added a first version of the test. It could work on my env (Linux).

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Michael Paquier
Дата:
On Tue, Aug 19, 2025 at 10:26:22AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I added a first version of the test. It could work on my env (Linux).

You could also validate that using the CI, assuming that you have a
cloned repo on github with the CI enabled.  See src/tools/ci/README.
I use that a lot with local branches for pre-commit validations where
things could break across the board.

> Since I cannot come up the appropriate place, I introduced new test file in
> recovery test. creating/dropping/advancing a slot, and doing a logical decoding
> is tested. Since standby mode is not supported by the single-user instance,
> I did not test the slot synchronization. I have not known that.

Hmm.  src/test/modules/test_misc/ makes more sense to me here.  (No
need to send an updated patch just for that.)

> IIUC, 0001 can be backpatched for all supported branches. I will create for them
> after patch would be a good shape.

Thanks, I'll check all that tomorrow.
--
Michael

Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

> You could also validate that using the CI, assuming that you have a
> cloned repo on github with the CI enabled.  See src/tools/ci/README.
> I use that a lot with local branches for pre-commit validations where
> things could break across the board.

I have run the tests on my CI and found that windows cannot accept the test.
Per attached output, starting the postgres with the single-user mode was failed
because the user had admin permissions [1]. However, pg_ctl start command could be
done by the same user.
I'm not familiar with Windows, but according to your blogpost [2], Windows seems
to allows running some server command by generating the restricted context for
running postgres commands. In case of single-user mode, however, we directly run
the given command with the current user. So, there is a possibility that only
instance can boot only by pg_ctl.
Based on that, I want to skip the test on windows platform rather than modifying
the ci environment.

IIUC, other tests which uses --single cannot work on windows as well.

> Hmm.  src/test/modules/test_misc/ makes more sense to me here.  (No
> need to send an updated patch just for that.)

I did not notice the directory and looked nice. Updated.

[1]:
```
# Running: postgres --single -F -c exit_on_error=true -D
C:\cirrus\build/testrun/recovery/049_slots_in_single_user_mode\data/t_049_slots_in_single_user_mode_node_data/pgdata
postgres
Execution of PostgreSQL by a user with administrative permissions is not
permitted.
The server must be started under an unprivileged user ID to prevent
possible system security compromises.  See the documentation for
more information on how to properly start the server.
```

[2]: https://paquier.xyz/postgresql-2/postgres-utilities-restricted-token/

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения

Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Michael Paquier
Дата:
On Tue, Aug 19, 2025 at 12:52:24PM +0000, Hayato Kuroda (Fujitsu) wrote:
> I have run the tests on my CI and found that windows cannot accept the test.
> Per attached output, starting the postgres with the single-user mode was failed
> because the user had admin permissions [1]. However, pg_ctl start command could be
> done by the same user.
> I'm not familiar with Windows, but according to your blogpost [2], Windows seems
> to allows running some server command by generating the restricted context for
> running postgres commands. In case of single-user mode, however, we directly run
> the given command with the current user. So, there is a possibility that only
> instance can boot only by pg_ctl.
> Based on that, I want to skip the test on windows platform rather than modifying
> the ci environment.
>
> IIUC, other tests which uses --single cannot work on windows as well.

This has reminded me of 1a9d80282811, where I have used a trick with
pg_ctl to allow single-user mode executions to bypass the privileged
account permission failure, but I don't see what we could do here as
we should pass down commands to postgres for execution via stdin for
IPC::Run.  We don't really have anything WIN32-specific, so skipping
the tests on Windows sounds fine by me.

One thing that you have forgotten is to update EXTRA_INSTALL to add
test_decoding, or a make check in the module fails.  I have expanded a
bit more the tests, as for example we have paths based on active_pid
for temporary slots, which could matter at some point.  I have added a
few more things with physical slots.

With all that in mind, I don't really see the point of 0002.  There
was also a mention of replication origins upthread, but we don't
really have anything specific to a shared state or a validation in
this area AFAIK, so I don't think that test additions are worth it in
this case.
--
Michael

Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael,

I found you've pushed 0001. Thanks!
I've considered to create patches for other branches after you said OK, but
you seemed to push all of them directly.

> This has reminded me of 1a9d80282811, where I have used a trick with
> pg_ctl to allow single-user mode executions to bypass the privileged
> account permission failure, but I don't see what we could do here as
> we should pass down commands to postgres for execution via stdin for
> IPC::Run.  We don't really have anything WIN32-specific, so skipping
> the tests on Windows sounds fine by me.

OK, thanks for the confirmation.

> One thing that you have forgotten is to update EXTRA_INSTALL to add
> test_decoding, or a make check in the module fails.  I have expanded a
> bit more the tests, as for example we have paths based on active_pid
> for temporary slots, which could matter at some point.  I have added a
> few more things with physical slots.

Oh, I didn't recognize because I usually use meson, and my CI said OK.
Good update!

> With all that in mind, I don't really see the point of 0002.  There
> was also a mention of replication origins upthread, but we don't
> really have anything specific to a shared state or a validation in
> this area AFAIK, so I don't think that test additions are worth it in
> this case.

Let me clarify your point. For now, there are no decisions to prohibit origin
manipulations. 0002 only restricts to handle slots in the single-user mode. Did
you say that we do not have to do tests to ensure these SQL functions are
prohibited in the mode?

(Since senior members in this thread are against to restrict, I'm planning to
stop working 0002 after creating a final patch for the reference purpose)

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
Michael Paquier
Дата:
On Wed, Aug 20, 2025 at 07:42:11AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Let me clarify your point. For now, there are no decisions to prohibit origin
> manipulations. 0002 only restricts to handle slots in the single-user mode. Did
> you say that we do not have to do tests to ensure these SQL functions are
> prohibited in the mode?

I am saying that there is little point in having tests for the origin
functions in single-user mode as these don't do anything really
fancy with global states (there's a acquired_by of course, and no
specific IsUnderPostmaster patch), contrary to the replication slots.
I am not convinced that there is any need to restrict them, either.
If somebody shows a reason to justify such a move, we could argue
about it, of course.
--
Michael

Вложения

RE: ReplicationSlotRelease() crashes when the instance is in the single user mode

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Michael and hackers,

This is a wrap-up of this thread. I'm planning to close the thread once.

> I am saying that there is little point in having tests for the origin
> functions in single-user mode as these don't do anything really
> fancy with global states (there's a acquired_by of course, and no
> specific IsUnderPostmaster patch), contrary to the replication slots.
> I am not convinced that there is any need to restrict them, either.
> If somebody shows a reason to justify such a move, we could argue
> about it, of course.

OK, thanks for the clarification. Yes, there are no special path for
single-user mode for origins.

Now, senior members recently appeared here are not so motivated to restrict some
operations in single-user mode unless it cannot be done in principle. Based on
that, I decided to close the thread once until someone raised another point.
For the reference purpose, I attached a patch which prohibits slot manipulation
in single-user mode. Anyone can update and use the patch for any purposes.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Вложения