Обсуждение: BUG #3110: Online Backup introduces Duplicate OIDs
The following bug has been logged online:
Bug reference: 3110
Logged by: Randy Isbell
Email address: jisbell@cisco.com
PostgreSQL version: 8.2.3
Operating system: FreeBSD 6.1 i386
Description: Online Backup introduces Duplicate OIDs
Details:
This issue is observed in version 8.0.7 on FreeBSD 6.0 and 8.0.7 on FreeBSD
6.1 as well.
When restoring the output of an online backup, many tables now have
duplicate OID values / primary keys, viz:
--- (cut here) ---
# psql -Upostgres sn
Welcome to psql 8.2.3, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
sn=# \d at_dns
Table "public.at_dns"
Column | Type | Modifiers
---------------------+--------------------------+-----------
ac_b | text | not null
ac_zone | text | not null
ac_host | text | not null
ac_ttl | bigint | not null
ac_type | text | not null
ac_data | text | not null
ac_password | text |
ac_desc | text |
ac_lastupdate | timestamp with time zone |
ac_lastddns | timestamp with time zone |
ac_mx_priority | bigint |
ac_soa_contact | text |
ac_soa_serial | bigint |
ac_soa_refresh | bigint |
ac_soa_retry | bigint |
ac_soa_expire | bigint |
ac_soa_minimum | bigint |
ac_srv_priority | integer |
ac_srv_weight | integer |
ac_srv_port | integer |
ac_naptr_flags | text |
ac_naptr_order | integer |
ac_naptr_preference | integer |
ac_naptr_service | text |
ac_naptr_regexp | text |
ac_naptr_ac_n | text |
Indexes:
"ai_dns_host" PRIMARY KEY, btree (ac_zone, ac_host, ac_type, ac_data)
"ai_dns_b_host" UNIQUE, btree (ac_b, ac_zone, ac_host, ac_type,
ac_data)
"ai_oid_dns" UNIQUE, btree (oid)
Triggers:
at_dns_delete_serial AFTER DELETE ON at_dns FOR EACH ROW EXECUTE
PROCEDURE at_delete_serial()
at_dns_touch AFTER INSERT OR DELETE OR UPDATE ON at_dns FOR EACH
STATEMENT EXECUTE PROCEDURE af_file_touch()
at_dns_update_serial AFTER INSERT OR UPDATE ON at_dns FOR EACH ROW
EXECUTE PROCEDURE at_update_serial()
sn=# select ac_zone,ac_host,ac_type,ac_data, count(oid) from at_dns group by
ac_zone,ac_host,ac_type,ac_data having count(oid) > 1;
ac_zone | ac_host | ac_type | ac_data | count
-------------------+---------+---------+------------------------+-------
nqa5.l1.cisco.com | @ | soa | ns1.nqa5.l1.cisco.com. | 2
(1 row)
sn=# reindex table at_dns;
ERROR: could not create unique index
DETAIL: Table contains duplicated values.
--- (cut here) ---
I verified the source database does NOT have duplicate values prior to the
online dump.
Note that this problem seems to occur when the database is under load. The
above error results when approximately 30 transactions per second are active
during the online backup. A quiescent database does NOT exhibit this
problem.
Thoughts?
- r.
"Randy Isbell" <jisbell@cisco.com> writes:
> When restoring the output of an online backup, many tables now have
> duplicate OID values / primary keys, viz:
> ...
> sn=# reindex table at_dns;
> ERROR: could not create unique index
I'm confused. You're claiming that the reload succeeds, but after that
a reindex fails? Are there actually duplicate OIDs in the dump file,
as evidenced by looking at it? What pg_dump options are you using
(I suppose -o at least)? If using pg_restore, what options there?
regards, tom lane
Randy Isbell <jisbell@cisco.com> writes:
> This is not a pg_dump. I am performing an online backup ala the
> "pg_start_backup()" and "pg_stop_backup()" mechanism.
Oh, OK. Could we see the system columns (ctid,xmin,xmax,cmin,cmax) for
the rows with duplicated OIDs? Also, would you look at the user data in
these rows and see if it's the same in each pair? I'm wondering if the
pairs might be both old and new states of an UPDATE.
regards, tom lane
Randy Isbell <jisbell@cisco.com> writes:
> Here you go:
> SELECT
> ctid,xmin,xmax,cmin,cmax,oid,*
Thanks. This is real interesting, because none of the rows have
xmax/cmax set, so it doesn't appear that they were meant to have been
updated out of existence.
> For the at_dns table, it appears one column (ac_soa_serial) changes.
Does that correspond to something your application does, ie UPDATE
ac_soa_serial to a new value without changing anything else? I'm trying
to guess if the duplicates arose by means of a misfiring UPDATE, or if
they were independent insertions. Is it plausible that two rows that
are the same except for ac_soa_serial would be inserted by your app?
If the latter, a possible theory is that the OID counter is somehow
being rolled back by the dump/reload process.
regards, tom lane
Tom Lane wrote: > Randy Isbell <jisbell@cisco.com> writes: > > Here you go: > > SELECT > > ctid,xmin,xmax,cmin,cmax,oid,* > > Thanks. This is real interesting, because none of the rows have > xmax/cmax set, so it doesn't appear that they were meant to have been > updated out of existence. Question for Randy -- is this plain PostgreSQL, or are you using some, hmmm, proprietary derivate of it? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Thanks for the reply.
This is not a pg_dump. I am performing an online backup ala the
"pg_start_backup()" and "pg_stop_backup()" mechanism.
Here is another look at the same table but with the OID values
included (this is from a different dump, not the initial bug report):
sn=# select oid, ac_zone,ac_host,ac_type,ac_data, count(oid) from
at_dns group by
oid,ac_zone,ac_host,ac_type,ac_data having count(oid) > 1;
oid | ac_zone | ac_host | ac_type
| ac_data | count
---------+-------------------------------+---------+---------
+------------------------+-------
7049453 | e164.lr0007.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
7049454 | e164.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
7049503 | e164.lr0008.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
7049512 | e164.lr0006.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
7049515 | e164.lr0005.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
7049531 | e164.lr0009.nqa5.l1.cisco.com | @ | soa |
ns1.nqa5.l1.cisco.com. | 2
(6 rows)
So the scenerio is:
1. Verify that the original database is void of duplicate primary key
values. This is done be running queries similar to the above for
each primary key on all user tables.
2. Create a load on the database, about 30 transactions / sec.
3. Issue pg_start_backup()
4. Save off the data cluster
5. Issue pg_stop_backup()
6. Collect the WAL files
7. Create a big hairy tar file with the stuff from items 4 and 6.
8. Take the big hairy tar file to another server running the same pg
8.2.3, untar and start postgres
9. On the server from item 8, run the same queries from item 1.
Viola, behold the duplicates.
I've run a number of these scenerios and I've found some common
features which may be clues:
a. The duplicate records result from a DELETE or UPDATE query
b. They occur near the time of the pg_start_backup() or pg_stop_backup
(). That is, the DELETE or UPDATE query is issued within 90 to 110
seconds of either the start or stop, and this backup takes 20 min to
create.
Is there a transaction log file dump utility? It would be nice to
see if the corruption is actually in the WAL files.
Regards,
- r.
On Mar 6, 2007, at 11:43 AM, Tom Lane wrote:
> "Randy Isbell" <jisbell@cisco.com> writes:
>> When restoring the output of an online backup, many tables now have
>> duplicate OID values / primary keys, viz:
>> ...
>> sn=# reindex table at_dns;
>> ERROR: could not create unique index
>
> I'm confused. You're claiming that the reload succeeds, but after
> that
> a reindex fails? Are there actually duplicate OIDs in the dump file,
> as evidenced by looking at it? What pg_dump options are you using
> (I suppose -o at least)? If using pg_restore, what options there?
>
> regards, tom lane
Answers inline. On Mar 7, 2007, at 3:54 PM, Tom Lane wrote: > Randy Isbell <jisbell@cisco.com> writes: >> Here you go: >> SELECT >> ctid,xmin,xmax,cmin,cmax,oid,* > > Thanks. This is real interesting, because none of the rows have > xmax/cmax set, so it doesn't appear that they were meant to have been > updated out of existence. > >> For the at_dns table, it appears one column (ac_soa_serial) changes. > > Does that correspond to something your application does, ie UPDATE > ac_soa_serial to a new value without changing anything else? Yes. In fact, the app may perform an update even when no columns change. > I'm trying > to guess if the duplicates arose by means of a misfiring UPDATE, or if > they were independent insertions. Is it plausible that two rows that > are the same except for ac_soa_serial would be inserted by your app? Very unlikely. I have the requirement to keep an audit trail of all changes made to the database. This is done by triggers on each table which update a corresponding mirror table in an audit schema. There is a trigger for each insert, update, and delete. I'm using this audit information to try and isolate the problem. What I found is that no duplication ever exists for INSERTs, only UPDATEs and DELETEs. Also, recall my previous note that the duplication happens near the time of the pg_start_backup() and pg_stop_backup(). Based on the audit schema information I collected, I see numerous updates and deletes happen with no duplication problems when they are in the middle of the backup time. In my environment the duplication happens within 2 minutes of the start or stop. This may be incidental, but I've seen it on 8 of 10 backup/restore runs. > If the latter, a possible theory is that the OID counter is somehow > being rolled back by the dump/reload process. The reload process is simply: start postgres and let it replay the necessary WAL files. Is there a way to determine if the WAL file data is bad? It would be helpful to know if the problem is caused by the backup, or if something is wrong in the replay of the WAL files. > > regards, tom lane - r.