Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
From | Mark Dilger
---|---
Subject | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Date |
Msg-id | 4DBA9267-A668-4DC9-8908-7183ABBA3081@gmail.com
In reply to | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
List | pgsql-hackers
> On Apr 23, 2017, at 7:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Fair enough. But I'd still like an explanation of why only about
>>> half of the population is showing a failure here. Seems like every
>>> machine should be seeing the LSN as moving backwards in this test.
>>> So (a) why aren't they all failing, and (b) should we change the
>>> test to make sure every platform sees that happening?
>
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion is only reached if it decides to interpolate, which is
>> timing dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would. Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.
>
> While chasing after this earlier today, I turned on some debug logging
> and noted that the standby's reports look like
>
> 2017-04-23 15:46:46.206 EDT [34829] LOG: database system is ready to accept read only connections
> 2017-04-23 15:46:46.212 EDT [34834] LOG: fetching timeline history file for timeline 2 from primary server
> 2017-04-23 15:46:46.212 EDT [34834] LOG: started streaming WAL from primary at 0/3000000 on timeline 1
> 2017-04-23 15:46:46.213 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470
> 2017-04-23 15:46:46.214 EDT [34834] LOG: replication terminated by primary server
> 2017-04-23 15:46:46.214 EDT [34834] DETAIL: End of WAL reached on timeline 1 at 0/3028470.
> 2017-04-23 15:46:46.214 EDT [34834] LOG: sending write 0/3028470 flush 0/3028470 apply 0/3028470
> 2017-04-23 15:46:46.214 EDT [34830] LOG: new target timeline is 2
> 2017-04-23 15:46:46.214 EDT [34834] LOG: restarted WAL streaming at 0/3000000 on timeline 2
> 2017-04-23 15:46:46.228 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too. Is there a bug there?
>
> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

After the recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f, the local
variable overwriteOK in RecoverPreparedTransactions() in twophase.c is
assigned but never used. I'm not sure whether that's future-proofing or an
oversight; the variable does appear to be used in other functions. Just FYI.

Mark Dilger
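
For readers following the interpolation point quoted above, the sketch below is a minimal, purely illustrative model of LSN-to-time interpolation in a lag tracker. It is not the actual walsender.c LagTracker code: the LagSample struct, estimate_lag(), and the sample values in main() are invented for this example. It only shows why a reply whose LSN has moved backwards (as can happen across a timeline switch) needs to be handled as a normal case rather than tripping an assertion.

```c
/*
 * Illustrative sketch only -- NOT the PostgreSQL walsender.c LagTracker.
 * LagSample, estimate_lag(), and the sample values below are invented.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* WAL position (LSN) */
typedef int64_t  TimestampTz;   /* microseconds, as PostgreSQL uses */

typedef struct LagSample
{
    XLogRecPtr  lsn;            /* WAL position when the sample was taken */
    TimestampTz time;           /* local time at which that LSN was sent */
} LagSample;

/*
 * Estimate replication lag for a reply reporting 'lsn', given two buffered
 * samples with prev->lsn <= next->lsn.  If 'lsn' falls between the samples,
 * interpolate linearly to estimate when that LSN was generated; the lag is
 * then 'now' minus that estimate.  A reply whose LSN has moved backwards
 * past 'prev' is reported as -1 instead of being asserted away.
 */
static TimestampTz
estimate_lag(const LagSample *prev, const LagSample *next,
             XLogRecPtr lsn, TimestampTz now)
{
    double      fraction;
    TimestampTz sample_time;

    if (lsn < prev->lsn)
        return -1;                  /* LSN went backwards: caller decides */
    if (lsn >= next->lsn)
        return now - next->time;    /* no newer sample: nothing to interpolate */

    fraction = (double) (lsn - prev->lsn) / (double) (next->lsn - prev->lsn);
    sample_time = prev->time +
        (TimestampTz) (fraction * (double) (next->time - prev->time));

    return now - sample_time;
}

int
main(void)
{
    /* Hypothetical samples: one LSN sent at t = 0 us, a later one at t = 500000 us */
    LagSample   prev = { 0x3000000, 0 };
    LagSample   next = { 0x3028470, 500000 };

    /* Reply for an LSN halfway between the samples, received at t = 800000 us */
    printf("estimated lag: %lld us\n",
           (long long) estimate_lag(&prev, &next, 0x3014238, 800000));

    /* Reply whose LSN moved backwards, e.g. after a timeline switch */
    printf("backwards LSN: %lld\n",
           (long long) estimate_lag(&prev, &next, 0x2FFFFFF, 900000));

    return 0;
}
```

In these terms, the timing dependence Thomas describes corresponds to whether a buffered sample newer than the reported LSN exists at all: without one there is nothing to interpolate against, so the code path containing the assertion is never reached on some runs.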