Обсуждение: broken reading on standby (PostgreSQL 16.2)

Поиск
Список
Период
Сортировка

broken reading on standby (PostgreSQL 16.2)

От
Pavel Stehule
Дата:
Hi

yesterday, I had to fix strange issue on standby server

The query to freshly updated data fails

select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR:  58P01: could not access status of transaction 1393466389
DETAIL:  Could not open file "pg_xact/0530": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:947

amcheck

select * from verify_heapam('seller_success_rate');
 blkno | offnum | attnum |                                msg                                
-------+--------+--------+-------------------------------------------------------------------
  5763 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5863 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5863 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5868 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5868 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5875 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5895 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  5895 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
  6245 |    108 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6245 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6245 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
  6245 |    112 |        | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
  6378 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6378 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6382 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6590 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  6590 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  7578 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  7581 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
  8390 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
 10598 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
 10598 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078

I verified xmin against the primary server, and it was the same. There was not any replication gap.

I checked the fields from pg_database table, and looks same too

These rows were valid (and visible) on primary.

On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server.  One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed the issue.

The customer has archived wals.

My question - is it possible to do some diagnostics from SQL level? I didn't find a way to get values that are used for comparison by amcheck from SQL.

Regards

Pavel

Re: broken reading on standby (PostgreSQL 16.2)

От
"Andrey M. Borodin"
Дата:

> On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> yesterday, I had to fix strange issue on standby server

It’s not just broken reading, if this standby is promoted in HA cluster - this would lead to data loss.
Recently I’ve observed some lost heap updates ofter OOM-ing cluster on 14.11. This might be unrelated most probably,
butI’ll post a link here, just in case [0]. In February and March we had 3 clusters with similar problem, and this is
unusuallybig number for us in just 2 months. 

Can you check LSN of blocks with corrupted tuples with pageinpsect on primary and on standby? I suspect they are frozen
onprimary, but have usual xmin on standby. 


Best regards, Andrey Borodin.

[0]
https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47


Re: broken reading on standby (PostgreSQL 16.2)

От
Pavel Stehule
Дата:


čt 25. 4. 2024 v 8:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru> napsal:


> On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> yesterday, I had to fix strange issue on standby server

It’s not just broken reading, if this standby is promoted in HA cluster - this would lead to data loss.
Recently I’ve observed some lost heap updates ofter OOM-ing cluster on 14.11. This might be unrelated most probably, but I’ll post a link here, just in case [0]. In February and March we had 3 clusters with similar problem, and this is unusually big number for us in just 2 months.

Can you check LSN of blocks with corrupted tuples with pageinpsect on primary and on standby? I suspect they are frozen on primary, but have usual xmin on standby.

Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help


 


Best regards, Andrey Borodin.

[0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47

Re: broken reading on standby (PostgreSQL 16.2)

От
"Andrey M. Borodin"
Дата:

> On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE
DISABLE_PAGE_SKIPPINGon master didn't help 
>

If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change
anything.Primary will not emit WAL record to freeze tuples again. 


Best regards, Andrey Borodin.


Re: broken reading on standby (PostgreSQL 16.2)

От
Pavel Stehule
Дата:


čt 25. 4. 2024 v 10:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru> napsal:


> On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help
>

If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything. Primary will not emit WAL record to freeze tuples again.

yes, this is possible. I git just info about broken xmin, so I expected broken xlog, and then I first tested FREEZE

 


Best regards, Andrey Borodin.

Re: broken reading on standby (PostgreSQL 16.2)

От
Melanie Plageman
Дата:
On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> yesterday, I had to fix strange issue on standby server
>
> The query to freshly updated data fails
>
> select * from seller_success_rate where create_time::date = '2024-04-23';
> ERROR:  58P01: could not access status of transaction 1393466389
> DETAIL:  Could not open file "pg_xact/0530": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:947
>
> amcheck
>
> select * from verify_heapam('seller_success_rate');
>  blkno | offnum | attnum |                                msg
> -------+--------+--------+-------------------------------------------------------------------
>   5763 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5863 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5863 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5868 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5868 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5875 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5895 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5895 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
>   6245 |    108 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6245 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6245 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
>   6245 |    112 |        | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
>   6378 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6378 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6382 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6590 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6590 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   7578 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   7581 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   8390 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>  10598 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>  10598 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>
> I verified xmin against the primary server, and it was the same. There was not any replication gap.
>
> I checked the fields from pg_database table, and looks same too
>
> These rows were valid (and visible) on primary.
>
> On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this
server. One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed
theissue. 
>
> The customer has archived wals.

Did you mention what version of Postgres this is?

- Melanie



Re: broken reading on standby (PostgreSQL 16.2)

От
Pavel Stehule
Дата:


čt 25. 4. 2024 v 12:53 odesílatel Melanie Plageman <melanieplageman@gmail.com> napsal:
On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> yesterday, I had to fix strange issue on standby server
>
> The query to freshly updated data fails
>
> select * from seller_success_rate where create_time::date = '2024-04-23';
> ERROR:  58P01: could not access status of transaction 1393466389
> DETAIL:  Could not open file "pg_xact/0530": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:947
>
> amcheck
>
> select * from verify_heapam('seller_success_rate');
>  blkno | offnum | attnum |                                msg
> -------+--------+--------+-------------------------------------------------------------------
>   5763 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5863 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5863 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5868 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5868 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5875 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5895 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   5895 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
>   6245 |    108 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6245 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6245 |    110 |        | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
>   6245 |    112 |        | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
>   6378 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6378 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6382 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6590 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   6590 |    111 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   7578 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   7581 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>   8390 |    112 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>  10598 |    109 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>  10598 |    110 |        | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>
> I verified xmin against the primary server, and it was the same. There was not any replication gap.
>
> I checked the fields from pg_database table, and looks same too
>
> These rows were valid (and visible) on primary.
>
> On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server.  One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed the issue.
>
> The customer has archived wals.

Did you mention what version of Postgres this is?

16.2 and related rows was today one day old

 

- Melanie