Re: Cascading replication: should we detect/prevent cycles?

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Cascading replication: should we detect/prevent cycles?
Дата
Msg-id CA+TgmoZy+sqw2fzq69CRaJzUE2ETdGJ2758KckXcX=fV9q40uQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Cascading replication: should we detect/prevent cycles?  (Joshua Berkus <josh@agliodbs.com>)
Ответы Re: Cascading replication: should we detect/prevent cycles?  (Joshua Berkus <josh@agliodbs.com>)
Re: Cascading replication: should we detect/prevent cycles?  (Peter Geoghegan <peter@2ndquadrant.com>)
Список pgsql-hackers
On Thu, Dec 20, 2012 at 5:28 PM, Joshua Berkus <josh@agliodbs.com> wrote:
>> > What would such a test look like?  It's not obvious to me that
>> > there's any rapid way for a user to detect this situation, without
>> > checking each server individually.
>>
>> Change something on the master and observe that none of the supposed
>> standbys notice?
>
> That doesn't sound like an infallible test, or a 60-second one.
>
> My point is that in a complex situation (imagine a shop with 9 replicated servers in 3 different cascaded groups,
immediatelyafter a failover of the original master), it would be easy for a sysadmin, responding to middle of the night
page,to accidentally fat-finger an IP address and create a cycle instead of a new master.  And once he's done that, a
longishtroubleshooting process to figure out what's wrong and why writes aren't working, especially if he goes to bed
andsome other sysadmin picks up the "Writes failing to PostgreSQL" ticket. 
>
> *if* it's relatively easy for us to detect cycles (that's a big if, I'm not sure how we'd do it), then it would help
alot for us to at least emit a WARNING.  That would short-cut a lot of troubleshooting. 

I'm sure it's possible; I don't *think* it's terribly easy.  The usual
algorithm for cycle detection is to have each node send to the next
node the path that the data has taken.  But, there's no unique
identifier for each slave that I know of - you could use IP address,
but that's not really unique.  And, if the WAL passes through an
archive, how do you deal with that?  I'm sure somebody could figure
all of this stuff out, but it seems fairly complicated for the benefit
we'd get.  I just don't think this is going to be a terribly common
problem; if it turns out I'm wrong, I may revise my opinion.  :-)

To me, it seems that lag monitoring between master and standby is
something that anyone running a complex replication configuration
should be doing - and yeah, I think anything involving four standbys
(or cascading) qualifies as complex.  If you're doing that, you should
notice pretty quickly that your replication lag is increasing
steadily.  You might also check pg_stat_replication the master and
notice that there are no connections there any more.  Could someone
miss those tell-tale signs?  Sure.  But they could also set
autovacuum_naptime to an hour and then file a support ticket
complaining that about table bloat - and they do.  Personally, as user
screw-ups go, I'd consider that scenario (and its fourteen cousins,
twenty-seven second cousins, and three hundred and ninety two other
extended family members) as higher-priority and lower effort to fix
than this particular thing.

YMMV, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: need a function to extract list items from pg_node_tree
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Feature Request: pg_replication_master()