Re: RFC: Logging plan of the running query

From: James Coleman
Subject: Re: RFC: Logging plan of the running query
Date:
Msg-id: CAAaqYe_hyWLs-ObFQfu54yA-ZOjNCd-L_Z=1TnYjYNqrLuGiXQ@mail.gmail.com
In reply to: Re: RFC: Logging plan of the running query  (torikoshia <torikoshia@oss.nttdata.com>)
Responses: Re: RFC: Logging plan of the running query  (torikoshia <torikoshia@oss.nttdata.com>)
List: pgsql-hackers
On Tue, Sep 5, 2023 at 9:59 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> On 2023-08-28 22:47, James Coleman wrote:
> > On Mon, Aug 28, 2023 at 3:01 AM torikoshia <torikoshia@oss.nttdata.com>
> > wrote:
> >>
> >> On 2023-08-26 21:03, James Coleman wrote:
> >> > On Fri, Aug 25, 2023 at 7:43 AM James Coleman <jtc331@gmail.com> wrote:
> >> >>
> >> >> On Thu, Aug 17, 2023 at 10:02 AM torikoshia
> >> >> <torikoshia@oss.nttdata.com> wrote:
> >> >> >
> >> >> > On 2023-06-16 01:34, James Coleman wrote:
> >> >> > > Attached is v28
> >> >> > > which sets ProcessLogQueryPlanInterruptActive to false in errfinish
> >> >> > > when necessary. Once built with those two patches I'm simply running
> >> >> > > `make check`.
> >> >> >
> >> >> > With the v28-0001 and v28-0002 patches, I confirmed that backend
> >> >> > processes consume a huge amount of memory, and in some environments
> >> >> > they were terminated by the OOM killer.
> >> >> >
> >> >> > This was because memory was allocated from existing memory contexts
> >> >> > and was not freed after ProcessLogQueryPlanInterrupt().
> >> >> > I updated the patch to use a dedicated memory context for
> >> >> > ProcessLogQueryPlanInterrupt().
> >> >> >
> >> >> > With the attached patch and the v28-0002 patch applied, `make check`
> >> >> > completed successfully after 20 minutes and 50GB of logs in my
> >> >> > environment.
> >> >> >
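[For readers following along: the dedicated-context approach described above
generally looks something like the sketch below. This is a minimal
illustration, not code from the patch; the context name and
explain_current_query() are placeholders, not identifiers from the patch.]

```
#include "postgres.h"
#include "utils/memutils.h"

extern void explain_current_query(void);	/* placeholder for the real work */

/*
 * Minimal sketch of the dedicated-memory-context pattern described above.
 * Only the context handling is the point here.
 */
void
ProcessLogQueryPlanInterrupt(void)
{
	MemoryContext cxt;
	MemoryContext oldcxt;

	/* Do all plan-logging allocations in a context of our own... */
	cxt = AllocSetContextCreate(CurrentMemoryContext,
								"LogQueryPlan context",
								ALLOCSET_DEFAULT_SIZES);
	oldcxt = MemoryContextSwitchTo(cxt);

	explain_current_query();	/* build and emit the EXPLAIN output */

	/*
	 * ...so it can all be released at once, instead of accumulating in
	 * whatever context the interrupted code happened to be running in.
	 */
	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(cxt);
}
```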
> >> >> > >>> On 2023-06-15 01:48, James Coleman wrote:
> >> >> > >>> > The tests have been running since last night, but have been apparently
> >> >> > >>> > hung now for many hours.
> >> >> >
> >> >> > I don't know if this has anything to do with the hang you faced, but
> >> >> > I thought it might be possible that the large memory usage resulted
> >> >> > in swapping, which caused a significant delay in processing.
> >> >>
> >> >> Ah, yes, I think that could be a possible explanation. I was holding
> >> >> off on this thread because I wasn't comfortable with having caused an
> >> >> issue once (even if I couldn't easily reproduce it) without at least
> >> >> some theory as to the cause (and a fix).
> >> >>
> >> >> > If possible, I would be very grateful if you could try to reproduce this
> >> >> > with
> >> >> > the v29 patch.
> >> >>
> >> >> I'll kick off some testing.
> >> >>
> >> >
> >> > I don't have time to investigate what's happening here, but 24 hours
> >> > later the first "make check" is still running, and at first glance it
> >> > seems to have the same behavior I'd seen that first time. The test
> >> > output is to this point:
> >> >
> >> > # parallel group (5 tests):  index_including create_view
> >> > index_including_gist create_index create_index_spgist
> >> > ok 66        + create_index                            26365 ms
> >> > ok 67        + create_index_spgist                     27675 ms
> >> > ok 68        + create_view                              1235 ms
> >> > ok 69        + index_including                          1102 ms
> >> > ok 70        + index_including_gist                     1633 ms
> >> > # parallel group (16 tests):  create_aggregate create_cast errors
> >> > roleattributes drop_if_exists hash_func typed_table create_am
> >> > infinite_recurse
> >> >
> >> > and it hasn't progressed past that point since at least ~16 hours ago
> >> > (I wasn't monitoring the first several hours of the run).
> >> >
> >> > I haven't connected up gdb yet, and won't be able to until maybe
> >> > tomorrow, but here's the ps output for postgres processes that are
> >> > running:
> >> >
> >> > admin    3213249  0.0  0.0 196824 20552 ?        Ss   Aug25   0:00
> >> > /home/admin/postgresql-test/bin/postgres -D
> >> > /home/admin/postgresql-test-data
> >> > admin    3213250  0.0  0.0 196964  7284 ?        Ss   Aug25   0:00
> >> > postgres: checkpointer
> >> > admin    3213251  0.0  0.0 196956  4276 ?        Ss   Aug25   0:00
> >> > postgres: background writer
> >> > admin    3213253  0.0  0.0 196956  8600 ?        Ss   Aug25   0:00
> >> > postgres: walwriter
> >> > admin    3213254  0.0  0.0 198424  7316 ?        Ss   Aug25   0:00
> >> > postgres: autovacuum launcher
> >> > admin    3213255  0.0  0.0 198412  5488 ?        Ss   Aug25   0:00
> >> > postgres: logical replication launcher
> >> > admin    3237967  0.0  0.0   2484   516 pts/4    S+   Aug25   0:00
> >> > /bin/sh -c echo "# +++ regress check in src/test/regress +++" &&
> >> >
PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/bin:/home/admin/postgres/src/test/regress:$PATH"
> >> > LD_LIBRARY_PATH="/home/admin/postgres/tmp_install/home/admin/postgresql-test/lib"
> >> > INITDB_TEMPLATE='/home/admin/postgres'/tmp_install/initdb-template
> >> > ../../../src/test/regress/pg_regress --temp-instance=./tmp_check
> >> > --inputdir=. --bindir=     --dlpath=. --max-concurrent-tests=20
> >> > --schedule=./parallel_schedule
> >> > admin    3237973  0.0  0.0 197880 20688 pts/4    S+   Aug25   0:00
> >> > postgres -D /home/admin/postgres/src/test/regress/tmp_check/data -F -c
> >> > listen_addresses= -k /tmp/pg_regress-7mmGUa
> >> > admin    3237976  0.0  0.1 198332 44608 ?        Ss   Aug25   0:00
> >> > postgres: checkpointer
> >> > admin    3237977  0.0  0.0 198032  4640 ?        Ss   Aug25   0:00
> >> > postgres: background writer
> >> > admin    3237979  0.0  0.0 197880  8580 ?        Ss   Aug25   0:00
> >> > postgres: walwriter
> >> > admin    3237980  0.0  0.0 199484  7608 ?        Ss   Aug25   0:00
> >> > postgres: autovacuum launcher
> >> > admin    3237981  0.0  0.0 199460  5488 ?        Ss   Aug25   0:00
> >> > postgres: logical replication launcher
> >> > admin    3243644  0.0  0.2 252400 74656 ?        Ss   Aug25   0:01
> >> > postgres: admin regression [local] SELECT waiting
> >> > admin    3243645  0.0  0.1 205480 33992 ?        Ss   Aug25   0:00
> >> > postgres: admin regression [local] SELECT waiting
> >> > admin    3243654 99.9  0.0 203156 31504 ?        Rs   Aug25 1437:49
> >> > postgres: admin regression [local] VACUUM
> >> > admin    3243655  0.0  0.1 212036 38504 ?        Ss   Aug25   0:00
> >> > postgres: admin regression [local] SELECT waiting
> >> > admin    3243656  0.0  0.0 206024 30892 ?        Ss   Aug25   0:00
> >> > postgres: admin regression [local] DELETE waiting
> >> > admin    3243657  0.0  0.1 205568 32232 ?        Ss   Aug25   0:00
> >> > postgres: admin regression [local] ALTER TABLE waiting
> >> > admin    3243658  0.0  0.0 203740 21532 ?        Ss   Aug25   0:00
> >> > postgres: admin regression [local] ANALYZE waiting
> >> > admin    3243798  0.0  0.0 199884  8464 ?        Ss   Aug25   0:00
> >> > postgres: autovacuum worker
> >> > admin    3244733  0.0  0.0 199492  5956 ?        Ss   Aug25   0:00
> >> > postgres: autovacuum worker
> >> > admin    3245652  0.0  0.0 199884  8468 ?        Ss   Aug25   0:00
> >> > postgres: autovacuum worker
> >> >
> >> > As you can see, there are a bunch of backends presumably waiting, and
> >> > the VACUUM process has been pegging a single CPU core since at least
> >> > that ~16-hours-ago mark.
> >> >
> >> > I hope to be able to do more investigation later, but I wanted to at
> >> > least give you this information now.
> >>
> >> Thanks a lot for testing the patch!
> >> I really appreciate your cooperation.
> >>
> >> Hmm, I also tested the current HEAD (165d581f146b09) again on Ubuntu
> >> 22.04 and macOS, but unfortunately (fortunately?) both succeeded, as
> >> shown below:
> >>
> >> ```
> >> $ git apply v29-0001-Add-function-to-log-the-plan-of-the-query.patch
> >> $ git apply
> >> v28-0002-Testing-attempt-logging-plan-on-ever-CFI-call.patch
> >> $ ./configure --enable-debug --enable-cassert
> >> $ make
> >> $ make check
> >>
> >> ...(snip)...
> >>
> >> # parallel group (5 tests):  index_including index_including_gist
> >> create_view create_index create_index_spgist
> >> ok 66        + create_index                            25033 ms
> >> ok 67        + create_index_spgist                     26144 ms
> >> ok 68        + create_view                              3061 ms
> >> ok 69        + index_including                           976 ms
> >> ok 70        + index_including_gist                     2998 ms
> >> # parallel group (16 tests):  create_cast errors create_aggregate
> >> roleattributes drop_if_exists hash_func typed_table
> >> create_am select constraints updatable_views inherit triggers vacuum
> >> create_function_sql infinite_recurse
> >> ok 71        + create_aggregate                          225 ms
> >> ok 72        + create_function_sql                     18874 ms
> >> ok 73        + create_cast                               168 ms
> >>
> >> ...(snip)...
> >>
> >> # All 215 tests passed.
> >> ```
> >>
> >> If you notice any difference, I would be grateful if you could let me
> >> know.
> >
> > I've never been able to reproduce it (haven't tested the new version,
> > but v28 at least) on my M1 Mac; where I've reproduced it is on Debian
> > (first buster and now bullseye).
> >
> > I'm attaching several stack traces in the hope that they provide some
> > clues. These all match the ps output I sent earlier, though note that in
> > that output both the regress instance and my test instance (pid 3213249)
> > are running (on different ports, of course, and they are from the exact
> > same compilation run). I've attached ps output for the postgres
> > processes under the make check process to simplify cross-referencing.
> >
> > A few interesting things:
> > - There's definitely a lock on a relation that seems to be what's
> > blocking the processes.
> > - When I try to connect with psql the process forks but then hangs
> > (see the ps output with task names stuck in "authentication"). I've
> > also included a trace from one of these.
>
> Thanks for sharing them!
>
> Many processes are waiting to acquire an LWLock, including the process
> trying to output the plan (select1.trace).
>
> I suspect that this is due to a lock that was acquired prior to being
> interrupted by ProcessLogQueryPlanInterrupt(), but I have not been able
> to reproduce the same situation.
>
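To make the suspected sequence concrete, here is a purely hypothetical sketch
of that hypothesis (not code from the patch; the relation, lock mode, and
function name are placeholders): the interrupted backend may already hold a
strong lock when the plan-logging interrupt is serviced.

```
#include "postgres.h"
#include "access/table.h"
#include "miscadmin.h"			/* CHECK_FOR_INTERRUPTS() */
#include "storage/lockdefs.h"	/* AccessExclusiveLock, NoLock */

/*
 * Hypothetical sequence only: the backend already holds a strong relation
 * lock when the plan-logging interrupt is serviced at the next
 * CHECK_FOR_INTERRUPTS().
 */
static void
do_some_locked_work(Oid some_relid)
{
	Relation	rel;

	rel = table_open(some_relid, AccessExclusiveLock);	/* lock held from here */

	CHECK_FOR_INTERRUPTS();		/* ProcessLogQueryPlanInterrupt() may run here;
								 * building the EXPLAIN output needs catalog
								 * access (e.g. pg_class, relid 1259), so it
								 * could end up waiting on, or holding up, the
								 * backends seen in the ps output above */

	table_close(rel, NoLock);	/* keep the lock until end of transaction */
}
```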

I don't have time immediately to dig into this, but perhaps loading up
the core dumps would allow us to see what query is running in each
backend process (if it hasn't already been discarded by that point)
and thereby determine what point in each test process led to the error
condition?

Alternatively we might be able to apply the same trick to the test
client instead...

BTW, for my own easy reference in this thread: relid 1259 is pg_class
if I'm not mistaken.
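(A quick way to double-check that against the source tree, using the generated
catalog header; just an illustrative compile-time assertion, not anything from
the patch:)

```
#include "postgres.h"
#include "catalog/pg_class_d.h"		/* generated header: RelationRelationId */

/* Note to self: relid 1259 in the traces is pg_class. */
StaticAssertDecl(RelationRelationId == 1259,
				 "pg_class is the catalog with OID 1259");
```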

Thoughts?
James Coleman


