Re: emergency outage requiring database restart
From | Merlin Moncure |
---|---|
Subject | Re: emergency outage requiring database restart |
Date | |
Msg-id | CAHyXU0xr+PcufmcbJk5hvz9w+H5R2Sc65NJ8-B+MFGqqT98EkQ@mail.gmail.com |
In response to | Re: emergency outage requiring database restart (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: emergency outage requiring database restart
|
List | pgsql-hackers |
On Tue, Oct 25, 2016 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> What if the subsequent dataloss was in fact a symptom of the first
>> outage? Is in theory possible for data to appear visible but then be
>> eaten up as the transactions making the data visible get voided out by
>> some other mechanic? I had to pull a quick restart the first time and
>> everything looked ok -- or so I thought. What I think was actually
>> happening is that data started to slip into the void. It's like
>> randomly sys catalogs were dropping off. I bet other data was, too. I
>> can pull older backups and verify that. It's as if some creeping xmin
>> was snuffing everything out.
>
> Might be interesting to look at age(xmin) in a few different system
> catalogs. I think you can ignore entries with age = 2147483647;
> those should be frozen rows. But if you see entries with very large
> ages that are not that, it'd be suspicious.

nothing really stands out.

The damage did re-occur after a dump/restore -- not sure about a
cluster level rebuild. No problems previous to that. This suggests
that if this theory holds the damage would have had to have been under
the database level -- perhaps in clog. Maybe hint bits and clog did
not agree as to commit or delete status for example. clog has plenty
of history leading past the problem barrier:

-rwx------ 1 postgres postgres 256K Jul 10 16:21 0000
-rwx------ 1 postgres postgres 256K Jul 21 12:39 0001
-rwx------ 1 postgres postgres 256K Jul 21 13:19 0002
-rwx------ 1 postgres postgres 256K Jul 21 13:59 0003
<snip>

Confirmation of problem re-occurrence will come in a few days. I'm
much more likely to believe 6+sigma occurrence (storage, freak bug,
etc) should it prove the problem goes away post rebuild.

merlin
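
For anyone wanting to repeat the age(xmin) check from the quoted text, the filtering rule can be sketched roughly as below. This is only an illustration, not something posted in the thread: it assumes the (catalog, age) pairs were already collected, for example with a query along the lines of `SELECT age(xmin) FROM pg_class`. The function name `suspicious_rows`, the cutoff `SUSPICIOUS_AGE`, and the sample values are all hypothetical, since the message gives no number for "very large".

```python
# Sketch of the age(xmin) check suggested in the quoted text.
# Assumption: rows are (catalog_name, age) pairs gathered beforehand.

FROZEN_AGE = 2147483647  # age value to ignore (frozen rows), per the quoted message

# Hypothetical cutoff for "very large"; the thread does not give a number.
SUSPICIOUS_AGE = 1_000_000_000


def suspicious_rows(rows, threshold=SUSPICIOUS_AGE):
    """Return (catalog, age) pairs whose age is very large but not frozen."""
    return [
        (catalog, age)
        for catalog, age in rows
        if age != FROZEN_AGE and age >= threshold
    ]


# Made-up sample values, for illustration only (not from the affected cluster).
sample = [
    ("pg_class", 2147483647),      # frozen, ignored
    ("pg_class", 1234),            # recent, ignored
    ("pg_attribute", 1900000000),  # very large but not frozen, flagged
]
flagged = suspicious_rows(sample)
print(flagged)
```

An empty result corresponds to the "nothing really stands out" outcome reported above; any flagged row would be a candidate for closer inspection.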