Обсуждение: broken reading on standby (PostgreSQL 16.2)
Hi
yesterday, I had to fix strange issue on standby server
The query to freshly updated data fails
select * from seller_success_rate where create_time::date = '2024-04-23';
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947
ERROR: 58P01: could not access status of transaction 1393466389
DETAIL: Could not open file "pg_xact/0530": No such file or directory.
LOCATION: SlruReportIOError, slru.c:947
amcheck
select * from verify_heapam('seller_success_rate');
blkno | offnum | attnum | msg
-------+--------+--------+-------------------------------------------------------------------
5763 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5875 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 108 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 112 | | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
6378 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6378 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6382 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7578 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7581 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
8390 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
blkno | offnum | attnum | msg
-------+--------+--------+-------------------------------------------------------------------
5763 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5863 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5868 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5875 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
5895 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 108 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6245 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
6245 | 112 | | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
6378 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6378 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6382 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
6590 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7578 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
7581 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
8390 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
10598 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
I verified xmin against the primary server, and it was the same. There was not any replication gap.
I checked the fields from pg_database table, and looks same too
These rows were valid (and visible) on primary.
On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server. One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed the issue.
The customer has archived wals.
My question - is it possible to do some diagnostics from SQL level? I didn't find a way to get values that are used for comparison by amcheck from SQL.
Regards
Pavel
> On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote: > > yesterday, I had to fix strange issue on standby server It’s not just broken reading, if this standby is promoted in HA cluster - this would lead to data loss. Recently I’ve observed some lost heap updates ofter OOM-ing cluster on 14.11. This might be unrelated most probably, butI’ll post a link here, just in case [0]. In February and March we had 3 clusters with similar problem, and this is unusuallybig number for us in just 2 months. Can you check LSN of blocks with corrupted tuples with pageinpsect on primary and on standby? I suspect they are frozen onprimary, but have usual xmin on standby. Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
čt 25. 4. 2024 v 8:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru> napsal:
> On 25 Apr 2024, at 11:12, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> yesterday, I had to fix strange issue on standby server
It’s not just broken reading, if this standby is promoted in HA cluster - this would lead to data loss.
Recently I’ve observed some lost heap updates ofter OOM-ing cluster on 14.11. This might be unrelated most probably, but I’ll post a link here, just in case [0]. In February and March we had 3 clusters with similar problem, and this is unusually big number for us in just 2 months.
Can you check LSN of blocks with corrupted tuples with pageinpsect on primary and on standby? I suspect they are frozen on primary, but have usual xmin on standby.
Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help
Best regards, Andrey Borodin.
[0] https://www.postgresql.org/message-id/flat/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE%40yandex-team.ru#1266dd8b898ba02686c2911e0a50ab47
> On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote: > > Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPINGon master didn't help > If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything.Primary will not emit WAL record to freeze tuples again. Best regards, Andrey Borodin.
čt 25. 4. 2024 v 10:52 odesílatel Andrey M. Borodin <x4mmm@yandex-team.ru> napsal:
> On 25 Apr 2024, at 12:06, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Unfortunately, I have not direct access to backup, so I am not able to test it. But VACUUM FREEZE DISABLE_PAGE_SKIPPING on master didn't help
>
If Primary considers all freezable tuples frozen, but a standby does not, "disable page skipping" won't change anything. Primary will not emit WAL record to freeze tuples again.
yes, this is possible. I git just info about broken xmin, so I expected broken xlog, and then I first tested FREEZE
Best regards, Andrey Borodin.
On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote: > > Hi > > yesterday, I had to fix strange issue on standby server > > The query to freshly updated data fails > > select * from seller_success_rate where create_time::date = '2024-04-23'; > ERROR: 58P01: could not access status of transaction 1393466389 > DETAIL: Could not open file "pg_xact/0530": No such file or directory. > LOCATION: SlruReportIOError, slru.c:947 > > amcheck > > select * from verify_heapam('seller_success_rate'); > blkno | offnum | attnum | msg > -------+--------+--------+------------------------------------------------------------------- > 5763 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5863 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5863 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5868 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5868 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5875 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5895 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 5895 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078 > 6245 | 108 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6245 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6245 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078 > 6245 | 112 | | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078 > 6378 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6378 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6382 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6590 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 6590 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 7578 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 7581 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 8390 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 10598 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > 10598 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078 > > I verified xmin against the primary server, and it was the same. There was not any replication gap. > > I checked the fields from pg_database table, and looks same too > > These rows were valid (and visible) on primary. > > On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server. One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed theissue. > > The customer has archived wals. Did you mention what version of Postgres this is? - Melanie
čt 25. 4. 2024 v 12:53 odesílatel Melanie Plageman <melanieplageman@gmail.com> napsal:
On Thu, Apr 25, 2024 at 2:13 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> yesterday, I had to fix strange issue on standby server
>
> The query to freshly updated data fails
>
> select * from seller_success_rate where create_time::date = '2024-04-23';
> ERROR: 58P01: could not access status of transaction 1393466389
> DETAIL: Could not open file "pg_xact/0530": No such file or directory.
> LOCATION: SlruReportIOError, slru.c:947
>
> amcheck
>
> select * from verify_heapam('seller_success_rate');
> blkno | offnum | attnum | msg
> -------+--------+--------+-------------------------------------------------------------------
> 5763 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5863 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5863 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5868 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5868 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5875 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5895 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 5895 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
> 6245 | 108 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6245 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6245 | 110 | | xmin 1439564642 precedes oldest valid transaction ID 3:1523885078
> 6245 | 112 | | xmin 1424677216 precedes oldest valid transaction ID 3:1523885078
> 6378 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6378 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6382 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6590 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 6590 | 111 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 7578 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 7581 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 8390 | 112 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 10598 | 109 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
> 10598 | 110 | | xmin 1393466389 precedes oldest valid transaction ID 3:1523885078
>
> I verified xmin against the primary server, and it was the same. There was not any replication gap.
>
> I checked the fields from pg_database table, and looks same too
>
> These rows were valid (and visible) on primary.
>
> On this server there was not any long session (when I was connected), unfortunately I cannot test restart of this server. One wal sender is executing on standby. Fortunately, there was a possibility to run VACUUM FULL, and it fixed the issue.
>
> The customer has archived wals.
Did you mention what version of Postgres this is?
16.2 and related rows was today one day old
- Melanie