Обсуждение: Re: query hangs out

Поиск
Список
Период
Сортировка

Re: query hangs out

От
Антон Глушаков
Дата:
Thanks for the advice.
I tried to remove all indexes and constraints from the table - it did not help.
I have a copy of the data (before truncate) - I can test any hypothesis


вт, 20 мая 2025 г. в 18:25, Laurenz Albe <laurenz.albe@cybertec.at>:
On Tue, 2025-05-20 at 16:48 +0300, Антон Глушаков wrote:
> I encountered a very strange behavior.
> For any query (even a simple count(*) to one specific table (a small 30MB table with 3 indexes,
> without any specific data types - everything is standard out of the box vanilla Postgres) -
> the query hangs dead. Waited more than 24 hours - the query did not complete).
>
>
> Similarly, the vacuum process to the table hangs.
> Only Kill -9 with a full restart helps
>
> I get a backtrace, from it - I then examined the pg_multixact directory, which at the time of
> the problem had swelled to 900MB and had several thousand files.
> I excluded long and inactive transactions, as well as prepared statements.
>
> The workaround in the end was this - truncate the table (it was successful), then vacuum freeze
> each DB, and after that the files from pg_multixact disappeared.
>
> What could it be? vacuum\freeze\mulitxact  settings are default.
> At the same time, the value pg_database.datminmxid=1
> Could the problem with the hang be related to the many old files in pg_multixact ? (judging by the backtrace - yes)

I can't say for certain, but I have seen cases like that where index corruption sent
processes into an endless loop.  Next time you could try to rebuild the indexes.

Yours,
Laurenz Albe

Re: query hangs out

От
Антон Глушаков
Дата:
The problem is with only one table.

As a result, I determined that the problem is on page 5 of the table (I made SELECT ctid selections until it hangs).
Then I tried to delete rows by ctid (5, 0-100) from the table until I found the problematic row.

Content through pageinspect:
# SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
lp          | 51
lp_off      | 3760
lp_flags    | 1
lp_len      | 100
t_xmin      | 136269917
t_xmax      | 66664135
t_field3    | 0
t_ctid      | (47,13)
t_infomask2 | 8203
t_infomask  | 4929
t_hoff      | 32
t_bits      | 1111011000000000
t_oid       |
t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000


I couldn't delete it in the standard way (delete from "InboxState" where ctid = '(5,51)') - it also hangs.

But I can freeze it through pg_surgery.

# select heap_force_freeze('"InboxState"'::regclass, ARRAY['(5, 51)']::tid[]);

Output after freeze:
digitalarchive=# SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
lp          | 51
lp_off      | 3760
lp_flags    | 1
lp_len      | 100
t_xmin      | 2
t_xmax      | 0
t_field3    | 0
t_ctid      | (5,51)
t_infomask2 | 11
t_infomask  | 2817
t_hoff      | 32
t_bits      | 1111011000000000
t_oid       |
t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000


After that, queries to the table started to work normally.
I'll note that there are absolutely no errors in the postgres logs, checksums are enabled, there are no errors for them either.

It seems that this is a bug.

ср, 21 мая 2025 г. в 02:52, ikramuddin <ikram.amani815@gmail.com>:
Is it taking too long only for this table or other tables also? If the issue is with this single table then check when it started to happened , mean after creating one index or whatever change you perform just get back to that point and now the query should run fine




On Tue, 20 May 2025 at 9:14 PM, Антон Глушаков <a.glushakov86@gmail.com> wrote:
Thanks for the advice.
I tried to remove all indexes and constraints from the table - it did not help.
I have a copy of the data (before truncate) - I can test any hypothesis


вт, 20 мая 2025 г. в 18:25, Laurenz Albe <laurenz.albe@cybertec.at>:
On Tue, 2025-05-20 at 16:48 +0300, Антон Глушаков wrote:
> I encountered a very strange behavior.
> For any query (even a simple count(*) to one specific table (a small 30MB table with 3 indexes,
> without any specific data types - everything is standard out of the box vanilla Postgres) -
> the query hangs dead. Waited more than 24 hours - the query did not complete).
>
>
> Similarly, the vacuum process to the table hangs.
> Only Kill -9 with a full restart helps
>
> I get a backtrace, from it - I then examined the pg_multixact directory, which at the time of
> the problem had swelled to 900MB and had several thousand files.
> I excluded long and inactive transactions, as well as prepared statements.
>
> The workaround in the end was this - truncate the table (it was successful), then vacuum freeze
> each DB, and after that the files from pg_multixact disappeared.
>
> What could it be? vacuum\freeze\mulitxact  settings are default.
> At the same time, the value pg_database.datminmxid=1
> Could the problem with the hang be related to the many old files in pg_multixact ? (judging by the backtrace - yes)

I can't say for certain, but I have seen cases like that where index corruption sent
processes into an endless loop.  Next time you could try to rebuild the indexes.

Yours,
Laurenz Albe

Re: query hangs out

От
Laurenz Albe
Дата:
On Wed, 2025-05-21 at 12:06 +0300, Антон Глушаков wrote:
> The problem is with only one table.
>
> As a result, I determined that the problem is on page 5 of the table (I made SELECT ctid selections until it hangs).
> Then I tried to delete rows by ctid (5, 0-100) from the table until I found the problematic row.
>
> Content through pageinspect:
> # SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
> -[ RECORD 1
]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 51
> lp_off      | 3760
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 136269917
> t_xmax      | 66664135
> t_field3    | 0
> t_ctid      | (47,13)
> t_infomask2 | 8203
> t_infomask  | 4929
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      |
\x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000

So the tuple is frozen AND updated, with the new version at (47,13).  Odd.
What do you see at (47,13)?

> I couldn't delete it in the standard way (delete from "InboxState" where ctid = '(5,51)') - it also hangs.
>
> But I can freeze it through pg_surgery.
>
> # select heap_force_freeze('"InboxState"'::regclass, ARRAY['(5, 51)']::tid[]);
>
> Output after freeze:
> digitalarchive=# SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
> -[ RECORD 1
]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 51
> lp_off      | 3760
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 2
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (5,51)
> t_infomask2 | 11
> t_infomask  | 2817
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      |
\x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000

Now it is only frozen.

> After that, queries to the table started to work normally.
> I'll note that there are absolutely no errors in the postgres logs, checksums are enabled, there are no errors for
themeither. 

I would still recommend a dump and restore to get rid of the data corruption.

> It seems that this is a bug.

Possible.

Yours,
Laurenz Albe



Re: query hangs out

От
Антон Глушаков
Дата:
> What do you see at (47,13)?

I have restored the data before the freeze, here is the content (47.13)
SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 47)) where lp = 13;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
lp          | 13
lp_off      | 4632
lp_flags    | 1
lp_len      | 100
t_xmin      | 136385644
t_xmax      | 136385520
t_field3    | 0
t_ctid      | (47,13)
t_infomask2 | 11
t_infomask  | 8337
t_hoff      | 32
t_bits      | 1111011000000000
t_oid       |
t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000ad43d6a5273608dd947a749769b943ab3cd8020001000000


> I would still recommend a dump and restore to get rid of the data corruption.
Yes, thank you. It was a non-production environment, the data was recreated.

ср, 21 мая 2025 г. в 15:38, Laurenz Albe <laurenz.albe@cybertec.at>:
On Wed, 2025-05-21 at 12:06 +0300, Антон Глушаков wrote:
> The problem is with only one table.
>
> As a result, I determined that the problem is on page 5 of the table (I made SELECT ctid selections until it hangs).
> Then I tried to delete rows by ctid (5, 0-100) from the table until I found the problematic row.
>
> Content through pageinspect:
> # SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
> -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 51
> lp_off      | 3760
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 136269917
> t_xmax      | 66664135
> t_field3    | 0
> t_ctid      | (47,13)
> t_infomask2 | 8203
> t_infomask  | 4929
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000

So the tuple is frozen AND updated, with the new version at (47,13).  Odd.
What do you see at (47,13)?

> I couldn't delete it in the standard way (delete from "InboxState" where ctid = '(5,51)') - it also hangs.
>
> But I can freeze it through pg_surgery.
>
> # select heap_force_freeze('"InboxState"'::regclass, ARRAY['(5, 51)']::tid[]);
>
> Output after freeze:
> digitalarchive=# SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 5)) where lp = 51;
> -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 51
> lp_off      | 3760
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 2
> t_xmax      | 0
> t_field3    | 0
> t_ctid      | (5,51)
> t_infomask2 | 11
> t_infomask  | 2817
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000fd55f20f44ec08dd9460f88969b943ab3cd8020000000000

Now it is only frozen.

> After that, queries to the table started to work normally.
> I'll note that there are absolutely no errors in the postgres logs, checksums are enabled, there are no errors for them either.

I would still recommend a dump and restore to get rid of the data corruption.

> It seems that this is a bug.

Possible.

Yours,
Laurenz Albe

Re: query hangs out

От
Laurenz Albe
Дата:
On Wed, 2025-05-21 at 16:27 +0300, Антон Глушаков wrote:
> > What do you see at (47,13)?
>
> I have restored the data before the freeze, here is the content (47.13)
> SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 47)) where lp = 13;
> -[ RECORD 1
]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 13
> lp_off      | 4632
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 136385644
> t_xmax      | 136385520
> t_field3    | 0
> t_ctid      | (47,13)
> t_infomask2 | 11
> t_infomask  | 8337
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      |
\x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000ad43d6a5273608dd947a749769b943ab3cd8020001000000

That tuple should be visible.

So I'd say you should killed the other tuple with heap_force_kill().

Yours,
Laurenz Albe



Re: query hangs out

От
Антон Глушаков
Дата:
Thank you.
I have investigated the logs to find the cause.
The only thing I found is that the problem started after a failover (we use patroni with synchronous replication).
The problem is unlikely on the hardware level, since we use a private cloud with enterprise-level hardware, and the checksum verification in postgres did not find any problems.
If this is a bug in postgres, and hackers are willing to investigate the problem, I can provide an archive with pgdata

ср, 21 мая 2025 г. в 18:09, Laurenz Albe <laurenz.albe@cybertec.at>:
On Wed, 2025-05-21 at 16:27 +0300, Антон Глушаков wrote:
> > What do you see at (47,13)?
>
> I have restored the data before the freeze, here is the content (47.13)
> SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 47)) where lp = 13;
> -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 13
> lp_off      | 4632
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 136385644
> t_xmax      | 136385520
> t_field3    | 0
> t_ctid      | (47,13)
> t_infomask2 | 11
> t_infomask  | 8337
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000ad43d6a5273608dd947a749769b943ab3cd8020001000000

That tuple should be visible.

So I'd say you should killed the other tuple with heap_force_kill().

Yours,
Laurenz Albe