Обсуждение: AssertLog instead of Assert in some places

Поиск

Список

Период

Сортировка

AssertLog instead of Assert in some places

От

Ashutosh Bapat

Дата:

11 августа 2023 г., 15:29:37

Hi All,
PostgreSQL code uses Assert() as a way to
1. document the assumption or conditions which must be true at a given
place in code
2. make sure that some bug elsewhere does not break these assumptions or rules
3. conditions which can not be (easily) induced by user actions

E.g. following conditions in adjust_appendrel_attrs()
/* If there's nothing to adjust, don't call this function. */
Assert(nappinfos >= 1 && appinfos != NULL);

/* Should never be translating a Query tree. */
Assert(node == NULL || !IsA(node, Query));

These conditions do not make it to non-assert builds and thus do not
make it to the production binary. That saves some CPU cycles.

When an Assert fails, and it fails quite a lot for developers, the
Postgres backend that caused the Assert is Aborted, restarting the
server. So a developer testing code that caused the Assert has to
start a psql again, run any session setup before running the faulting
query, gdb needs to be reattached to the new backend process. That's
some waste of time and frustration, esp. when the Assert does not
damage the backend itself e.g. by corrupting memory.

Most of the Asserts are recoverable by rolling back the transaction
without crashing the backend. So an elog(ERROR, ) is enough. But just
by themselves elogs are compiled into non-debug binary and the
condition check can waste CPU cycles esp. conditions are mostly true
like the ones we use in Assert.

Attached patch combines Assert and elog(ERROR, ) so that when an
Assert is triggered in assert-enabled binary, it will throw an error
while keeping the backend intact. Thus it does not affect gdb session
or psql session. These elogs do not make their way to non-assert
binary so do not make it to production like Assert.

I have been using AssertLog for my work for some time. It is quite
convenient. With AssertLog I get
```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1
= c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
ERROR:  failed Assert("equal(child_rinfo, adjust_appendrel_attrs(root,
(Node *) rinfo, nappinfos, appinfos))"), File: "relnode.c", Line: 997,
PID: 568088
```
instead of
```
#explain (summary on) select * from a, b, c where a.c1 = b.c1 and b.c1
= c.c1 and a.c2 < b.c2 and a.c3 + b.c3 < c.c3;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
@!#\q
```

If there is an interest in accepting the patch, I will add it to the
commitfest and work on it further.

--
Best Wishes,
Ashutosh Bapat

Вложения

0001-Add-AssertLog-to-report-an-error-instead-of-20230811.patch

Re: AssertLog instead of Assert in some places

От

Andres Freund

Дата:

11 августа 2023 г., 20:57:23

Hi,

On 2023-08-11 17:59:37 +0530, Ashutosh Bapat wrote:
> Most of the Asserts are recoverable by rolling back the transaction
> without crashing the backend. So an elog(ERROR, ) is enough. But just
> by themselves elogs are compiled into non-debug binary and the
> condition check can waste CPU cycles esp. conditions are mostly true
> like the ones we use in Assert.
> 
> Attached patch combines Assert and elog(ERROR, ) so that when an
> Assert is triggered in assert-enabled binary, it will throw an error
> while keeping the backend intact. Thus it does not affect gdb session
> or psql session. These elogs do not make their way to non-assert
> binary so do not make it to production like Assert.

I am quite strongly against this. This will lead to assertions being hit in
tests without that being noticed, e.g. because they happen in a background
process that just restarts.

Greetings,

Andres Freund

Re: AssertLog instead of Assert in some places

От

Peter Geoghegan

Дата:

11 августа 2023 г., 21:14:34

On Fri, Aug 11, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> I am quite strongly against this. This will lead to assertions being hit in
> tests without that being noticed, e.g. because they happen in a background
> process that just restarts.

Couldn't you say the same thing about defensive "can't happen" ERRORs?
They are essentially a form of assertion that isn't limited to
assert-enabled builds.

I have sometimes thought that it would be handy if there was a variant
of "can't happen" ERRORs that took on the structure of an assertion.
(This is quite different to what Ashutosh has proposed, though, since
it would still look like a conventional assertion failure on
assert-enabled builds.)

--
Peter Geoghegan

Re: AssertLog instead of Assert in some places

От

Andres Freund

Дата:

11 августа 2023 г., 21:23:50

Hi,

On 2023-08-11 11:14:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > I am quite strongly against this. This will lead to assertions being hit in
> > tests without that being noticed, e.g. because they happen in a background
> > process that just restarts.
> 
> Couldn't you say the same thing about defensive "can't happen" ERRORs?
> They are essentially a form of assertion that isn't limited to
> assert-enabled builds.

Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
our transaction state machinery is confused. Yes, let's just continue going
through the same machinery again, that'll resolve it.". Even worse are the
ones that are just WARNINGS. Like "Oh, something wrote beyond the length of
allocated memory, that's something to issue a WARNING about and happily
continue".

There've been people arguing in the past that it's good for robustness to do
stuff like that. I think it's exactly the opposite.

Now, don't get me wrong, there are plenty cases where a "this shouldn't
happen" elog(ERROR) is appropriate...

> I have sometimes thought that it would be handy if there was a variant
> of "can't happen" ERRORs that took on the structure of an assertion.

What are you imagining? Basically something that generates an elog(ERROR) with
the stringified expression as part of the error message?

Greetings,

Andres Freund

Re: AssertLog instead of Assert in some places

От

Peter Geoghegan

Дата:

11 августа 2023 г., 21:56:27

On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > They are essentially a form of assertion that isn't limited to
> > assert-enabled builds.
>
> Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> our transaction state machinery is confused. Yes, let's just continue going
> through the same machinery again, that'll resolve it.".

I am not unsympathetic to Ashutosh's point about conventional ERRORs
being easier to deal with when debugging your own code, during initial
development work. But that seems like a problem with the tooling in
other areas.

For example, dealing with core dumps left behind by the regression
tests can be annoying. Don't you also hate it when there's a
regression.diffs that just shows 20k lines of subtractions? Perhaps
you don't -- perhaps your custom setup makes it quick and easy to get
relevant information about what actually went wrong. But it seems like
that sort of thing could be easier to deal with by default, without
using custom shell scripts or anything -- particularly for those of us
that haven't been Postgres hackers for eons.

> There've been people arguing in the past that it's good for robustness to do
> stuff like that. I think it's exactly the opposite.
>
> Now, don't get me wrong, there are plenty cases where a "this shouldn't
> happen" elog(ERROR) is appropriate...

Right. They're not bad -- they just need to be backed up by some kind
of reasoning, which will be particular to each case. The default
approach should be to crash whenever any invariants are violated,
because all bets are off at that point.

> What are you imagining? Basically something that generates an elog(ERROR) with
> the stringified expression as part of the error message?

I'd probably start with a new elevel, that implied an assertion
failure in assert-enabled builds but otherwise acted just like ERROR.
I remember multiple cases where I added an "Assert(false)" right after
a "can't happen" error, which isn't a great approach.

It might also be useful to offer something along the lines you've
described, which I was sort of thinking of myself. But now that I've
thought about it a little more, I think that such an approach might
end up being overused. If you're going to add what amounts to a "can't
happen" ERROR then you should really be obligated to write a halfway
reasonable error message. As I said, you should have to own the fact
that you think it's better to not just crash for this one particular
callsite, based on some fairly specific chain of reasoning. Ideally
it'll be backed up by practical experience/user reports.

--
Peter Geoghegan

Re: AssertLog instead of Assert in some places

От

Andres Freund

Дата:

11 августа 2023 г., 22:26:17

Hi,

On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > They are essentially a form of assertion that isn't limited to
> > > assert-enabled builds.
> >
> > Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> > our transaction state machinery is confused. Yes, let's just continue going
> > through the same machinery again, that'll resolve it.".
> 
> I am not unsympathetic to Ashutosh's point about conventional ERRORs
> being easier to deal with when debugging your own code, during initial
> development work.

Oh, I am as well - I just don't think it's a good idea to introduce "log + error"
assertions to core postgres, because it seems very likely that they'll end up
getting used a lot.

> But that seems like a problem with the tooling in other areas.

Agreed.

> For example, dealing with core dumps left behind by the regression
> tests can be annoying.

Hm. I don't have a significant problem with that. But I can see it being
problematic. Unfortunately, short of preventing core dumps from happening,
I don't think we really can do much about that - whatever is running the tests
shouldn't have privileges to change system wide settings about where core
dumps end up etc.

> Don't you also hate it when there's a regression.diffs that just shows 20k
> lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> it quick and easy to get relevant information about what actually went
> wrong.

I do really hate that. At the very least we should switch to using
restart-after-crash by default, and not start new tests once the server has
crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
if the reason the test failed is that the backend died.

> But it seems like that sort of thing could be easier to deal with by
> default, without using custom shell scripts or anything -- particularly for
> those of us that haven't been Postgres hackers for eons.

Yes, wholeheartedly agreed.

Greetings,

Andres Freund

Re: AssertLog instead of Assert in some places

От

Peter Geoghegan

Дата:

11 августа 2023 г., 23:19:34

On Fri, Aug 11, 2023 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > For example, dealing with core dumps left behind by the regression
> > tests can be annoying.
>
> Hm. I don't have a significant problem with that. But I can see it being
> problematic. Unfortunately, short of preventing core dumps from happening,
> I don't think we really can do much about that - whatever is running the tests
> shouldn't have privileges to change system wide settings about where core
> dumps end up etc.

I was unclear. I wasn't talking about managing core dumps. I was
talking about using core dumps to get a simple backtrace, that just
gives me some very basic information. I probably shouldn't have even
mentioned core dumps, because what I'm really concerned about is the
workflow around getting very basic information about assertion
failures. Not around core dumps per se.

The inconsistent approach needed to get a simple, useful backtrace for
assertion failures (along with other basic information associated with
the failure) is a real problem. Particularly when running the tests.
Most individual assertion failures that I see are for code that I'm
practically editing in real time. So shaving cycles here really
matters.

For one thing the symbol mangling that we have around the built-in
backtraces makes them significantly less useful. I really hope that
your libbacktrace patch gets committed soon, since that looks like it
would be a nice quality of life improvement, all on its own.

It would also be great if the tests spit out information about
assertion failures that was reasonably complete (backtrace without any
mangling, query text included, other basic context), reliably and
uniformly -- it shouldn't matter if it's from TAP or pg_regress test
SQL scripts. Which kind of test happened to be involved is usually not
interesting to me here (even the query text won't usually be
interesting), so being forced to think about it slows me down quite a
lot.

> > Don't you also hate it when there's a regression.diffs that just shows 20k
> > lines of subtractions? Perhaps you don't -- perhaps your custom setup makes
> > it quick and easy to get relevant information about what actually went
> > wrong.
>
> I do really hate that. At the very least we should switch to using
> restart-after-crash by default, and not start new tests once the server has
> crashed and do a waitpid(postmaster, WNOHANG) after each failing test, to see
> if the reason the test failed is that the backend died.

+1
--
Peter Geoghegan

Re: AssertLog instead of Assert in some places

От

Andres Freund

Дата:

12 августа 2023 г., 00:04:47

Hi,

On 2023-08-11 13:19:34 -0700, Peter Geoghegan wrote:
> On Fri, Aug 11, 2023 at 12:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > For example, dealing with core dumps left behind by the regression
> > > tests can be annoying.
> >
> > Hm. I don't have a significant problem with that. But I can see it being
> > problematic. Unfortunately, short of preventing core dumps from happening,
> > I don't think we really can do much about that - whatever is running the tests
> > shouldn't have privileges to change system wide settings about where core
> > dumps end up etc.
>
> I was unclear. I wasn't talking about managing core dumps. I was
> talking about using core dumps to get a simple backtrace, that just
> gives me some very basic information. I probably shouldn't have even
> mentioned core dumps, because what I'm really concerned about is the
> workflow around getting very basic information about assertion
> failures. Not around core dumps per se.
>
> The inconsistent approach needed to get a simple, useful backtrace for
> assertion failures (along with other basic information associated with
> the failure) is a real problem. Particularly when running the tests.
> Most individual assertion failures that I see are for code that I'm
> practically editing in real time. So shaving cycles here really
> matters.

Ah, yes. Agreed, obviously.

> For one thing the symbol mangling that we have around the built-in
> backtraces makes them significantly less useful. I really hope that
> your libbacktrace patch gets committed soon, since that looks like it
> would be a nice quality of life improvement, all on its own.

I've been hacking a further on it:
https://github.com/anarazel/postgres/tree/backtrace

Haven't yet posted a new version. Doing it properly for fronted tools has some
dependencies on threading stuff. I'm hoping Thomas' patch for that will go in.

Now it also intercepts segfaults and prints
backtraces for them, if that's possible (it requires libbacktrace to be async
signal safe, which it isn't on all platforms).

Where supported, a crash (distinguishing from assertion failures) will now
report something like:

process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
    [0x5628ec45212f] pg_fatalsig_handler+0x20f: ../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
    [0x7fc4b743650f] [unknown]
    [0x5628ec14897c] check_root+0x19c (inlined): ../../../../home/andres/src/postgresql/src/backend/main/main.c:393
    [0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183

after I added
    *(volatile int*)0xdeadbeef = 1;
to check_root().

For signals sent by users, it'd show the pid of the process sending the signal
on most OSs. I really would like some generalized infrastructure for that, so
that we can report for things like query cancellations.

As the patch stands, the only OS that doesn't yet have useful "backtrace on
crash" support with that is windows, as libbacktrace unfortunately isn't
signal safe on windows. But it'd still provide useful backtraces on
Assert()s. I haven't yet figured out whether/when it's required to be signal
safe on windows though - crashes are intercepted by
SetUnhandledExceptionFilter() - I don't understand the precise constraints of
that. Plenty people seem to generate backtraces on crashes on windows using
that, without concerns for signal safety like things.

Currently Macos CI doesn't use libbacktrace, but as it turns out
backtrace_symbols() on windows is a heck of a lot better than on glibc
platforms.  CI for windows with visual studio doesn't have libbacktrace
installed yet (and has the aforementioned signal safety issue), I think it's
installed for windows w/ mingw.

> It would also be great if the tests spit out information about
> assertion failures that was reasonably complete (backtrace without any
> mangling, query text included, other basic context), reliably and
> uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> SQL scripts.

Hm. What other basic context are you thinking of? Pid is obvious. I guess
backend type could be useful too, but normally be inferred from the stack
trace pretty easily. Application name could be useful to know which test
caused the crash.

I'm a bit wary about trying to print things like query text - what if that
string points to memory not terminated by a \0? I guess we could use an
approach similar to pgstat_get_crashed_backend_activity().

One issue with reporting crashes from signal handlers is that the obvious way
to make that signal safe (lots of small writes) leads to the potential for
interspersed lines.  It's probably worth having a statically sized buffer that
will commonly be large enough to print a whole backtrace. When too small, it
should include the pid at the start of every "chunk".

> Which kind of test happened to be involved is usually not interesting to me
> here (even the query text won't usually be interesting), so being forced to
> think about it slows me down quite a lot.

Interesting - I quite often end up spending time trying to dig out which query
from what sql file triggered a crash, so I can try to trigger it in
isolation. I often wished the server knew the source line associated with the
query. Enough that I pondered ways to have psql transport that knowledge to the
the server.

Greetings,

Andres Freund

Re: AssertLog instead of Assert in some places

От

Peter Geoghegan

Дата:

12 августа 2023 г., 00:39:45

On Fri, Aug 11, 2023 at 2:04 PM Andres Freund <andres@anarazel.de> wrote:
> Where supported, a crash (distinguishing from assertion failures) will now
> report something like:
>
> process with pid: 2900527 received signal: SIGSEGV, si_code: 1, si_addr: 0xdeadbeef
>         [0x5628ec45212f] pg_fatalsig_handler+0x20f:
../../../../home/andres/src/postgresql/src/common/pg_backtrace.c:615
>         [0x7fc4b743650f] [unknown]
>         [0x5628ec14897c] check_root+0x19c (inlined):
../../../../home/andres/src/postgresql/src/backend/main/main.c:393
>         [0x5628ec14897c] main+0x19c: ../../../../home/andres/src/postgresql/src/backend/main/main.c:183
>
> after I added
>         *(volatile int*)0xdeadbeef = 1;
> to check_root().

It'll be like living in the future!

> For signals sent by users, it'd show the pid of the process sending the signal
> on most OSs. I really would like some generalized infrastructure for that, so
> that we can report for things like query cancellations.

That sounds great.

> > It would also be great if the tests spit out information about
> > assertion failures that was reasonably complete (backtrace without any
> > mangling, query text included, other basic context), reliably and
> > uniformly -- it shouldn't matter if it's from TAP or pg_regress test
> > SQL scripts.
>
> Hm. What other basic context are you thinking of? Pid is obvious. I guess
> backend type could be useful too, but normally be inferred from the stack
> trace pretty easily. Application name could be useful to know which test
> caused the crash.
>
> I'm a bit wary about trying to print things like query text - what if that
> string points to memory not terminated by a \0? I guess we could use an
> approach similar to pgstat_get_crashed_backend_activity().

I agree that being less verbose by default is good. On second thought
even query text isn't all that important.

> One issue with reporting crashes from signal handlers is that the obvious way
> to make that signal safe (lots of small writes) leads to the potential for
> interspersed lines.  It's probably worth having a statically sized buffer that
> will commonly be large enough to print a whole backtrace. When too small, it
> should include the pid at the start of every "chunk".

Good idea.

> > Which kind of test happened to be involved is usually not interesting to me
> > here (even the query text won't usually be interesting), so being forced to
> > think about it slows me down quite a lot.
>
> Interesting - I quite often end up spending time trying to dig out which query
> from what sql file triggered a crash, so I can try to trigger it in
> isolation. I often wished the server knew the source line associated with the
> query. Enough that I pondered ways to have psql transport that knowledge to the
> the server.

I actually do plenty of that too. My overall point was this: there is
likely some kind of pareto principle here. That should guide the sorts
of scenarios we optimize for.

If you actually benchmarked where I spent time while writing code,
minute to minute, I bet it would show that most of the individual
debug-compile cycles were triggered by issues that had a fairly simple
and immediate cause. Cases where I improve one small thing, and then
rerun the tests, which show an assertion failure in nearby code. As
soon as I see very basic details I immediately think "duh, of course"
in these cases, at which point I'll come up with a likely-good fix in
seconds. And then I'll rinse and repeat. My fix might just work (at
least to the extent that all tests now pass), but it also might lead
to another assertion failure of the same general nature.

There are also lots of cases where I really do have to think about
recreating the details from the test in order to truly understand
what's going on, of course. But there are still way way more
individual "duh, of course" assertion failures in practice. Those are
where productivity wins are still possible, because the bottleneck
isn't just that I have an incomplete mental model that I need to work
to expand.

Perhaps my experiences here aren't universal. But it seems like they
might be roughly the same as everybody else that works on Postgres?
Assuming that they are, then the information that is output should be
optimized for the "duh, of course" scenarios. Not to an absurd degree,
mind you. But the output shouldn't be too verbose. Ideally there'd be
a still fairly straightforward way of getting extra information, for
the cases where debugging is likely to take a few minutes, and require
real focus. The extra work in those other cases is relatively
insignificant, because the "startup costs" are relatively large -- a
little extra indirection (though only a little) can't hurt too much.

--
Peter Geoghegan

Re: AssertLog instead of Assert in some places

От

Ashutosh Bapat

Дата:

14 августа 2023 г., 17:37:34

On Fri, Aug 11, 2023 at 11:27 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Attached patch combines Assert and elog(ERROR, ) so that when an
> > Assert is triggered in assert-enabled binary, it will throw an error
> > while keeping the backend intact. Thus it does not affect gdb session
> > or psql session. These elogs do not make their way to non-assert
> > binary so do not make it to production like Assert.
>
> I am quite strongly against this. This will lead to assertions being hit in
> tests without that being noticed, e.g. because they happen in a background
> process that just restarts.

Fair point. Our regression doesn't check server error logs for
unwanted errors. How about restricting it to only client backends? I
don't know how to identify those from others but there must be a way.

--
Best Wishes,
Ashutosh Bapat

Re: AssertLog instead of Assert in some places

От

Ashutosh Bapat

Дата:

14 августа 2023 г., 17:47:19

On Sat, Aug 12, 2023 at 12:56 AM Andres Freund <andres@anarazel.de> wrote:
> On 2023-08-11 11:56:27 -0700, Peter Geoghegan wrote:
> > On Fri, Aug 11, 2023 at 11:23 AM Andres Freund <andres@anarazel.de> wrote:
> > > > Couldn't you say the same thing about defensive "can't happen" ERRORs?
> > > > They are essentially a form of assertion that isn't limited to
> > > > assert-enabled builds.
> > >
> > > Yes. A lot of them I hate them with the passion of a thousand suns ;). "Oh,
> > > our transaction state machinery is confused. Yes, let's just continue going
> > > through the same machinery again, that'll resolve it.".
> >
> > I am not unsympathetic to Ashutosh's point about conventional ERRORs
> > being easier to deal with when debugging your own code, during initial
> > development work.
>
> Oh, I am as well - I just don't think it's a good idea to introduce "log + error"
> assertions to core postgres, because it seems very likely that they'll end up
> getting used a lot.
>
>

I am open to ideas which allow the same backend to recover after
meeting an easily recoverable but "can't happen" condition rather than
losing that backend and starting all over with a new backend. Not all
Assert'ed conditions are recoverable so a blanket GUC or compile time
option won't help. Those might make things worse. We need two separate
incantations for non-recoverable and recoverable Asserts respectively.

I like Peter's idea of having a new elevel, however it still requires
adding conditional USE_ASSERT, an if testing the condition and then
writing an error message. AssertLog() in the patch uses just a few
more letters.

It won't help to expand the scope of the problem since that will
reduce the chances of getting anything done.

--
Best Wishes,
Ashutosh Bapat

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: AssertLog instead of Assert in some places

Вложения