Discussion: Why are some WAL files in pg_xlog symlinks to old files?
Hello,
We're running PG 8.3 in a warm standby configuration. About 3 weeks ago we had to fail over from the primary to the standby. That worked fine, but we're having problems getting standby mode set up again. On the new standby, everything works fine for a little while: WALs are rsynced over and processed correctly as far as I can tell. But every 65-75 minutes (very regularly), a WAL file is copied that's actually a symlink. When the standby tries to read the rsynced symlink, it hangs indefinitely, presumably because the target of the link doesn't exist on the standby.
In the primary's pg_xlog, I see the expected WAL files with increasing numbers and recent modification dates, but every 65-75 files there's one of these symlinks. For example:
Sep 28 16:13 0000000300000A5C00000070
Sep 28 16:15 0000000300000A5C00000071
Sep 28 16:12 0000000300000A5C00000072
Sep 5 01:00 0000000300000A5C00000073 -> /srv/db/chdbprod_wal_archives/00000001000009D6000000D6
Sep 28 16:21 0000000300000A5C00000074
Sep 28 16:19 0000000300000A5C00000075
The "/srv/db/chdbprod_wal_archives" directory is where incoming WAL files used to go, back when the current primary server was the standby. The September 5 date you see above is shortly before the failover was done. It confused me at first until I remembered that it's the mod date of the target of the symlink, not the link itself (which in this case was presumably created around 16:20). The target of the symlinks is always the same.
pg_xlog also contains a 00000003.history file, which references the target of the symlinks. Here's its contents:
1 00000001000009D6000000D6 before transaction 0 at 2000-01-01 00:00:00+00
I gather that my problems here are due to having a primary server that was itself formerly a standby, but I'm not sure what action to take. I don't know enough about how the history files work and what the significance of the symlinks is. What purpose do the symlinks serve? Why are they recreated regularly at slightly more than hourly intervals? Why do they point to a directory that was only used back when the primary was a standby? (If it makes any difference, back when the primary server was a standby, it was running pg_standby with the -l option.) Does their presence mean that something's wrong on the primary, or should they be ignored when copying to the standby?
Thanks in advance for any information!
Chris
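The listing above can be checked programmatically. The following is a minimal Python sketch, not anything from the thread: it scans a pg_xlog directory for WAL segment names that are symlinks and reports whether each target exists. It assumes the usual 24-hex-digit segment naming (8 digits each for timeline ID, log file ID, and segment number); the function and type names are hypothetical.

```python
import os
import re
from typing import List, NamedTuple, Tuple

# 24 hex digits: 8 for the timeline ID, 8 for the log file ID, 8 for the segment.
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def parse_wal_name(name: str) -> Tuple[int, int, int]:
    """Split a WAL segment file name into (timeline, log, segment)."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


class SymlinkedSegment(NamedTuple):
    name: str            # file name inside pg_xlog
    target: str          # what the symlink points to
    target_exists: bool  # False if the link is dangling on this host


def find_symlinked_segments(xlog_dir: str) -> List[SymlinkedSegment]:
    """Return every WAL-named entry in xlog_dir that is a symlink."""
    found = []
    for name in sorted(os.listdir(xlog_dir)):
        path = os.path.join(xlog_dir, name)
        if WAL_NAME.match(name) and os.path.islink(path):
            # os.path.exists follows the link, so it is False for a dangling one.
            found.append(
                SymlinkedSegment(name, os.readlink(path), os.path.exists(path))
            )
    return found
```

Applied to the names in the listing, `parse_wal_name` puts the link name 0000000300000A5C00000073 on timeline 3 and its target 00000001000009D6000000D6 on timeline 1, which matches the target being a file from before the failover.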
On Wed, Sep 29, 2010 at 10:15 AM, Nigel <nigelspleen@gmail.com> wrote:
> I gather that my problems here are due to having a primary server that was
> itself formerly a standby, but I'm not sure what action to take. I don't
> know enough about how the history files work and what the significance of
> the symlinks is. What purpose do the symlinks serve? Why are they
> recreated regularly at slightly more than hourly intervals? Why do they
> point to a directory that was only used back when the primary was a
> standby? (If it makes any difference, back when the primary server was a
> standby, it was running pg_standby with the -l option.) Does their presence
> mean that something's wrong on the primary, or should they be ignored when
> copying to the standby?
I guess that the cause is the -l option. The symlink to the archived WAL file is created in pg_xlog by "pg_standby -l". At the failover, unfortunately, that symlink in pg_xlog is renamed to the new name for WAL recycling. Then, the symlink to the old archived WAL file remains in pg_xlog.
AFAIR, because of this problem, the -l option was removed from pg_standby.
http://archives.postgresql.org/pgsql-committers/2009-06/msg00323.php
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
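The mechanism described in this reply can be reproduced in miniature. The sketch below is a toy model in Python, not PostgreSQL's actual recycling code: it only shows that renaming a symlink renames the link itself, so a "recycled" segment keeps pointing at the old archive file. The directory layout and function names are assumptions made for illustration; the file names are taken from the listing in the original message.

```python
import os
import tempfile
from typing import Tuple


def recycle_by_rename(xlog_dir: str, old_name: str, new_name: str) -> str:
    """Give an existing segment file a new name, as a stand-in for recycling."""
    old_path = os.path.join(xlog_dir, old_name)
    new_path = os.path.join(xlog_dir, new_name)
    os.rename(old_path, new_path)  # renames the link itself, not its target
    return new_path


def demo() -> Tuple[bool, str]:
    """Build a throwaway archive + pg_xlog pair and recycle a symlinked segment."""
    with tempfile.TemporaryDirectory() as root:
        archive = os.path.join(root, "wal_archives")
        xlog = os.path.join(root, "pg_xlog")
        os.mkdir(archive)
        os.mkdir(xlog)

        # An archived segment, plus a symlink to it in pg_xlog
        # (the situation the reply attributes to "pg_standby -l").
        archived = os.path.join(archive, "00000001000009D6000000D6")
        with open(archived, "wb") as f:
            f.write(b"old WAL data")
        os.symlink(archived, os.path.join(xlog, "00000001000009D6000000D6"))

        new_path = recycle_by_rename(
            xlog, "00000001000009D6000000D6", "0000000300000A5C00000073"
        )
        # Still a symlink, still pointing at the old archived file.
        return os.path.islink(new_path), os.path.basename(os.readlink(new_path))
```

After the rename, the entry carries a current-looking segment name but is still a link into the old archive directory, which is the pattern seen in the primary's pg_xlog.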
Thank you for your response!
So I guess what's happening is that the old symlink from 3 weeks ago (generated by pg_standby -l) is now stuck in the primary's pg_xlog, and gets repeatedly recycled and renamed to be a new WAL file. I checked the mod date of the target of the symlink, and confirmed that it's being updated as that file is rewritten with recycled WAL data.
To get out of this situation, I guess I should replace the symlink in pg_xlog with the file that's the target of the symlink, renamed with the name of the symlink? (In other words, "follow" the symlink by hand so the file in pg_xlog is an ordinary file again.) That would break us out of having that symlink recycled over and over. And then we'll change the new standby server to not use the -l option with pg_standby anymore. (-:
Thanks,
Chris
On Tue, Sep 28, 2010 at 10:50 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I guess that the cause is the -l option. The symlink to the archived WAL file is
> created in pg_xlog by "pg_standby -l". At the failover, unfortunately, that
> symlink in pg_xlog is renamed to the new name for WAL recycling. Then, the symlink
> to the old archived WAL file remains in pg_xlog.
>
> AFAIR, because of this problem, the -l option was removed from pg_standby.
> http://archives.postgresql.org/pgsql-committers/2009-06/msg00323.php
On Wed, Sep 29, 2010 at 11:17 PM, Nigel <nigelspleen@gmail.com> wrote:
> To get out of this situation, I guess I should replace the symlink in
> pg_xlog with the file that's the target of the symlink, renamed with the
> name of the symlink?
Yes.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
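The fix confirmed above ("follow" the symlink by hand) could be sketched in Python as follows. This is an illustration under assumptions, not a procedure given in the thread: it copies the target rather than moving it, uses a hypothetical temporary file name, and says nothing about when it is safe to touch a segment the server may be using, which would need to be decided separately.

```python
import os
import shutil


def materialize_symlink(link_path: str) -> None:
    """Replace a symlink with a regular-file copy of its target, same name."""
    if not os.path.islink(link_path):
        raise ValueError(f"{link_path} is not a symlink")

    target = os.path.realpath(link_path)
    if not os.path.isfile(target):
        raise FileNotFoundError(f"symlink target missing: {target}")

    # Copy next to the link so the final rename stays on one filesystem.
    tmp_path = link_path + ".tmp-materialize"  # hypothetical temp name
    shutil.copy2(target, tmp_path)   # copy contents and timestamps
    os.replace(tmp_path, link_path)  # swap the regular file in over the link
```

Copying leaves the original file in the archive directory untouched, so it can be inspected or removed afterwards; moving the target instead, as proposed in the question, would avoid the extra disk space.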