Re: bug, bad memory, or bad disk?
From | Amit Kapila |
---|---|
Subject | Re: bug, bad memory, or bad disk? |
Date | |
Msg-id | 00a601ce0b85$f1fcb730$d5f62590$@kapila@huawei.com |
In response to | bug, bad memory, or bad disk? (Ben Chobot <bench@silentmedia.com>) |
List | pgsql-general |
On Friday, February 15, 2013 1:33 AM Ben Chobot wrote:
> 2013-02-13T23:13:18.042875+00:00 pgdb18-vpc postgres[20555]: [76-1]  ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:18.956173+00:00 pgdb18-vpc postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641
> 2013-02-13T23:13:19.025971+00:00 pgdb18-vpc postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.847422+00:00 pgdb18-vpc postgres[28333]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:19.913595+00:00 pgdb18-vpc postgres[28894]: [8-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:20.043527+00:00 pgdb18-vpc postgres[20917]: [72-1]  ERROR:  invalid memory alloc request size 1968078400
> 2013-02-13T23:13:21.548259+00:00 pgdb18-vpc postgres[23318]: [54-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory
> 2013-02-13T23:13:28.405529+00:00 pgdb18-vpc postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627
> 2013-02-13T23:13:29.199447+00:00 pgdb18-vpc postgres[25513]: [46-1]  ERROR:  invalid page header in block 2368 of relation pg_tblspc/16435/PG_9.1_201105231/188417/60418945
>
> There didn't seem to be much correlation to which files were affected, and this was a critical server, so once we
> realized a simple reindex wasn't going to solve things, we shut it down and brought up a slave as the new master db.
> While that seemed to fix these issues, we soon noticed problems with missing clog files. The missing clogs were outside
> the range of the existing clogs, so we tried using dummy clog files. It didn't help, and running pg_check we found that
> one block of one table was definitely corrupt. Worse, that corruption had spread to all our replicas.

Can you check whether the corrupted block is from one of the relations mentioned in your errors? This is just to reconfirm.

> I know this is a little sparse on details, but my questions are:
> 1. What kind of fault should I be looking to fix? Because it spread to all the replicas, both those that stream and
> those that replicate by replaying wals in the wal archive, I assume it's not a storage issue. (My understanding is that
> streaming replicas stream their changes from memory, not from wals.)

Streaming replicas stream their changes from WAL.

> 2. Is it possible that the corruption that was on the master got replicated to the slaves when I tried to cleanly shut
> down the master before bringing up a new slave as the new master and switching the other slaves over to replicating
> from that?

At shutdown, the master will send all WAL (up to the shutdown checkpoint).

I think there are 2 issues in your mail:

1. Access to corrupted blocks. There are 2 parts to this: how the block got corrupted on the master, and why it was replicated to the other servers. The corrupted block can be replicated through WAL, because WAL contains backup copies of blocks if full_page_writes=on, which is the default configuration. So I think the main remaining question is how the block(s) got corrupted on the master. For that, some more information is required, such as what kind of operations are being done on the relation that has the corrupted block. If we drop and recreate that relation, does the problem remain? Is there any chance that the block got corrupted due to a hardware problem?

2. Missing clog files. How did you find the missing clog files: did an operation fail, or is it just an observation? Do you see any problems in the system due to it?

With Regards,
Amit Kapila.
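
The cross-check requested in this reply (does the block reported by pg_check belong to one of the relations named in the errors?) can be sketched in Python. This is only an illustration: the function names `summarize` and `segment_of` are hypothetical, the regular expressions assume the exact error wording shown in the quoted log excerpt, and `segment_of` assumes a default build with 8 kB blocks and 1 GB segment files (131072 blocks per segment).

```python
import re
from collections import defaultdict

# Patterns for the two error shapes in the quoted log excerpt.
# The wording is assumed to match that excerpt exactly.
PAGE_HEADER = re.compile(r"invalid page header in block (\d+) of relation (\S+)")
OPEN_FILE = re.compile(r'could not open file "([^"]+?)(?:\.(\d+))?" \(target block (\d+)\)')

# Assumption: default build, 1 GB segments of 8 kB blocks.
BLOCKS_PER_SEGMENT = 131072


def relfilenode(path):
    """Return the last path component, i.e. the file name of the relation."""
    return path.rsplit("/", 1)[-1]


def summarize(lines):
    """Map each relation file name to the set of block numbers reported for it."""
    blocks = defaultdict(set)
    for line in lines:
        m = PAGE_HEADER.search(line)
        if m:
            blocks[relfilenode(m.group(2))].add(int(m.group(1)))
            continue
        m = OPEN_FILE.search(line)
        if m:
            blocks[relfilenode(m.group(1))].add(int(m.group(3)))
    return dict(blocks)


def segment_of(block, blocks_per_segment=BLOCKS_PER_SEGMENT):
    """Segment file number that would hold the given block."""
    return block // blocks_per_segment


# Three entries copied from the log excerpt quoted in this thread.
SAMPLE = [
    "postgres[23880]: [58-1]  ERROR:  invalid page header in block 2948 of relation pg_tblspc/16435/PG_9.1_201105231/188417/56951641",
    'postgres[25027]: [36-1]  ERROR:  could not open file "pg_tblspc/16435/PG_9.1_201105231/188417/58206627.1" (target block 3936767042): No such file or directory',
    "postgres[28055]: [12-1]  ERROR:  invalid page header in block 38887 of relation pg_tblspc/16435/PG_9.1_201105231/188417/58206627",
]

if __name__ == "__main__":
    for name, nums in sorted(summarize(SAMPLE).items()):
        print(name, sorted(nums))
```

The file names collected this way can be compared against `pg_class.relfilenode` (or the output of `pg_relation_filepath()`) for the table that pg_check flagged. Under the default sizes assumed above, target block 3936767042 would sit in segment 30035, tens of terabytes into the relation, which may suggest that the block number itself came from a damaged pointer rather than from a genuinely missing file.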