Thread: wal exist in slave but getting err requested WAL segment has already been removed

wal exist in slave but getting err requested WAL segment has already been removed

From
Mariel Cherkassky
Date:
Hi,
I have 3 nodes in my cluster (1 master on version 9.6.3 + 2 slaves on version 9.6.3). I configured repmgr v4.0.4 (with repmgrd active).

Suddenly today, after a few good weeks, I noticed a lag on one of the slaves, and the error in its log indicated that the slave didn't get the WAL:

could not receive data from WAL stream: ERROR:  requested WAL segment 0000000900002E61000000BD has already been removed

However, when I check whether the WAL was received:
postgres=# select pg_is_in_recovery(),pg_is_xlog_replay_paused(),pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
 pg_is_in_recovery | pg_is_xlog_replay_paused | pg_last_xlog_receive_location | pg_last_xlog_replay_location 
-------------------+--------------------------+-------------------------------+------------------------------
 t                 | f                        | 2E61/BDF5C000                 | 2E61/BDF5B930
(1 row)
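
As a cross-check on the numbers above, here is a minimal shell sketch of how an xlog location maps to a segment file name and how far replay is behind receive. It assumes the default 16 MiB segment size and takes timeline 9 from the file name in the error message; the function names are made up for illustration.

```shell
#!/bin/sh
# Assumption: default WAL segment size of 16 MiB.
WAL_SEG_SIZE=$((16 * 1024 * 1024))

# lsn_to_segment TIMELINE LSN
# Prints the 24-hex-digit segment file name that contains the given location.
lsn_to_segment() (
    tli=$1
    hi=${2%%/*}    # part before the slash: high 32 bits of the location
    lo=${2##*/}    # part after the slash: low 32 bits of the location
    seg=$(( 0x$lo / WAL_SEG_SIZE ))
    printf '%08X%08X%08X\n' "$tli" "0x$hi" "$seg"
)

# lsn_diff LSN_A LSN_B
# Prints the distance in bytes from LSN_B to LSN_A.
lsn_diff() (
    a=$(( (0x${1%%/*} << 32) + 0x${1##*/} ))
    b=$(( (0x${2%%/*} << 32) + 0x${2##*/} ))
    echo $(( a - b ))
)
```

With the values from the query, `lsn_to_segment 9 2E61/BDF5C000` gives 0000000900002E61000000BD, the same segment the error complains about, and `lsn_diff 2E61/BDF5C000 2E61/BDF5B930` gives 1744, so replay is only 1744 bytes behind the receive location.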

and I checked in the pg_xlog directory:
ls -l ../pg_xlog/0000000900002E61000000BD
-rw------- 1 postgres postgres 16777216 Jul 11 11:13 ../pg_xlog/0000000900002E61000000BD

and the xlog exists.
Now my question is: why wasn't the WAL replayed?
In my repmgr.conf I don't have any parameters regarding recovery, just some basic things. The recovery.conf file in the data directory:

standby_mode = 'on'
primary_conninfo = 'host=xxxxxxx user=repmgr application_name=''psgsqldb2'' connect_timeout=2'
recovery_target_timeline = 'latest'


Any idea?

Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Achilleas Mantzios
Date:
On 11/07/2018 16:09, Mariel Cherkassky wrote:
> and I checked in the pg_xlog directory:
> ls -l ../pg_xlog/0000000900002E61000000BD
> -rw------- 1 postgres postgres 16777216 Jul 11 11:13 ../pg_xlog/0000000900002E61000000BD
>
> and the xlog exists.

On which node did you check for the file?
If the file is still available on the primary, try comparing the md5sum of the two copies.
If you have a working WAL shipping method in place, then add the appropriate line to the recovery.conf of your standby:
restore_command = 'rsync somemachine:/somepath/pitr/%f "%p" '
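
For illustration only, the same idea can be sketched as a small shell function that fetches a segment from an archive directory instead of calling rsync directly. The archive directory and the function name are hypothetical; the one firm requirement it tries to honor is that a restore_command must exit non-zero when the requested file is not available, because the server will also ask for files that were never archived.

```shell
#!/bin/sh
# restore_wal ARCHIVE_DIR WAL_FILE DEST
# Copies WAL_FILE from ARCHIVE_DIR to DEST (the values the server passes as %f and %p).
# ARCHIVE_DIR is an assumption: a local or mounted directory holding shipped WAL.
restore_wal() (
    archive_dir=$1
    wal_file=$2
    dest=$3
    src="$archive_dir/$wal_file"

    # Missing file: report failure so recovery can fall back to streaming.
    [ -f "$src" ] || exit 1

    # Propagate cp's exit status as the result of the restore.
    cp "$src" "$dest"
)
```

Wrapped in a script that calls `restore_wal /some/archive "$1" "$2"`, it would be referenced from recovery.conf with `%f` and `"%p"` as arguments, in the same way as the rsync line.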


-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Mariel Cherkassky
Date:
The wal is available on the standby, not on the primary. It is already in the pg_xlog directory of the slave...


Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Achilleas Mantzios
Date:
On 11/07/2018 16:32, Mariel Cherkassky wrote:
> The wal is available on the standby, not on the primary. It is already in the pg_xlog directory of the slave...

OK, but apparently it is not complete. Can you see its contents with pg_waldump (or pg_xlogdump)?
Do you have any backup mechanism in place? Any WAL shipping / archiving mechanism?



-- 
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt

Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Mariel Cherkassky
Date:
Yes, I can see its content. However, at the end of the output I'm getting the following message:
pg_xlogdump: FATAL:  error in WAL record at 2E61/BDF59950: invalid magic number 0000 in log segment 0000000000002E61000000BD, offset 16105472
Maybe this is the reason behind it?
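
One detail worth checking: the offset in that message, 16105472, is 0xF5C000, which matches the low 24 bits of the receive location 2E61/BDF5C000 from the first message. That would be consistent with (though it does not prove) the standby's copy being filled only up to the point where streaming stopped, with zeroed pages after it. A rough sketch to find the first page whose header starts with a zero magic number follows; it assumes the default 8192-byte WAL page size, and the function name is made up.

```shell
#!/bin/sh
# first_zero_page FILE [PAGE_SIZE]
# Prints the byte offset of the first page whose first two bytes (the page
# magic) are 0000, or returns non-zero if no such page is found.
# Assumption: PAGE_SIZE defaults to 8192, the default WAL block size.
first_zero_page() (
    file=$1
    page_size=${2:-8192}
    size=$(( $(wc -c < "$file") ))   # arithmetic strips any padding from wc
    offset=0
    while [ "$offset" -lt "$size" ]; do
        # Read two bytes at the start of the page as hex, without separators.
        magic=$(od -An -tx1 -j "$offset" -N 2 "$file" | tr -d ' \n')
        if [ "$magic" = "0000" ]; then
            echo "$offset"
            exit 0
        fi
        offset=$(( offset + page_size ))
    done
    exit 1
)
```

On a full 16 MiB segment this calls od up to 2048 times, so it takes a moment. If it prints 16105472 for the file in the standby's pg_xlog, the invalid magic number is simply the start of the part that was never received.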


Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Kenneth Marshall
Date:
On Wed, Jul 11, 2018 at 04:44:24PM +0300, Mariel Cherkassky wrote:
> Yes i can see its content. However in the end of its content I'm getting
> the next msg :
> pg_xlogdump: FATAL:  error in WAL record at 2E61/BDF59950: invalid magic
> number 0000 in log segment 0000000000002E61000000BD, offset 16105472
> Maybe this is the reason behind it ?
> 

Hi Mariel,

I do not know if this applies to your case, but 9.6.9 has this
in the release notes:

Fix a corner case where a streaming standby gets stuck at a WAL continuation record (Kyotaro Horiguchi)

Regards,
Ken


Re: wal exist in slave but getting err requested WAL segment has already been removed

From
Mariel Cherkassky
Date:
How can I get more info regarding this bug? I would like to be sure that I faced a real bug.
