Обсуждение: COPY FROM command v8.1.4

Поиск
Список
Период
Сортировка

COPY FROM command v8.1.4

От
"Mr. Dan"
Дата:
This behavior is on big tables.
6.5 GB - 12+ million records



Re: COPY FROM command v8.1.4

От
"Mr. Dan"
Дата:
Hey Tom,

I've finished some vacuums on the table with the following results.  I'll
continue to run a couple more of these vacuums, but if you don't see another
post from me in the next couple of hours, you can assume that they continue
to look clean as well.

Thanks again for any clues on this.  I'd be interested in any alternatives,
like a trace if that's an option.  One thing I'm going to try next is to run
the COPY TO/FROM test on a redhat version of postgresql v8.1.4 that was
compiled with gcc 3.4.5 instead of gcc 3.4.3 & gmake version 3.80.
We are considering a test with the data on a Solaris build of postgres as
well, but that's coming slowly.

~DjK

1st vacuum - 70 minutes

INFO:  vacuuming "_test"
INFO:  "_test": found 0 removable, 12149265 nonremovable row versions in
492247 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.25s/0.05u sec elapsed 2374.30 sec.
INFO:  analyzing "_test"
INFO:  "_test": scanned 3000 of 492247 pages, containing 74030 live rows and
0 dead rows; 3000 rows in sample, 12147015 estimated total rows

Total query runtime: 4175635 ms.

2nd vacuum - 40 minutes

INFO:  vacuuming "_test"
INFO:  "_test": found 0 removable, 12149265 nonremovable row versions in
492247 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 1.98s/0.33u sec elapsed 2419.27 sec.
INFO:  analyzing "_test"
INFO:  "_est": scanned 3000 of 492247 pages, containing 74065 live rows and
0 dead rows; 3000 rows in sample, 12152758 estimated total rows

Total query runtime: 2433517 ms.


>From: Tom Lane <tgl@sss.pgh.pa.us>
>To: "Mr. Dan" <bitsandbytes88@hotmail.com>
>CC: pgsql-admin@postgresql.org
>Subject: Re: [ADMIN] COPY FROM command v8.1.4 Date: Wed, 13 Sep 2006
>12:50:26 -0400
>
>"Mr. Dan" <bitsandbytes88@hotmail.com> writes:
> >> How are you doing the copies, exactly?  SQL COPY command, psql \copy,
> >> something else?
>
> > We've tried SQL COY and psql \copy and always get random results - 0,1,
>or 2
> > blocks of 25 rows missing.
>
>Hmph.  If it happens with a SQL COPY command then psql seems to be off
>the hook, and that also eliminates some theories about dropped TCP
>packets and such.
>
>Would you check back in the source table for the COPY and see what the
>ctid values are for the missing rows?  I'm wondering about a pattern
>like "the dropped rows of a group are all on the same disk page", ie,
>what's being missed is one whole page at a time.
>
>If that's what's happening, the only very plausible theory I can think
>of is that your disk drive is sometimes glitching and returning a page
>of all-zeroes instead of what it should return.  Postgres will not
>complain about this in normal operation (because there are legitimate
>error-recovery scenarios where a zero page can be in a table); it'll
>just treat the page as empty.  VACUUM will complain though, so the next
>step would be to set up a test table by copying your large table and
>then repeatedly run plain VACUUM on the test table.  If you get sporadic
>warnings "relation foo page N is uninitialized --- fixing" then we have
>the smoking gun.  Don't run this test directly on a valuable table, as
>each such message would mean you just lost another page of data :-(
>
>FWIW, I spent several hours yesterday evening copying 6GB tables around
>to see if I could reproduce any such problem, and I couldn't...
>
>            regards, tom lane



Re: COPY FROM command v8.1.4

От
Tom Lane
Дата:
"theman" <bitsandbytes88@hotmail.com> writes:
> lseek(10, 0, SEEK_END)                  = 913072128
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END)                  = 913080320
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END)                  = 913088512
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END)                  = 913088512
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> lseek(10, 0, SEEK_END)                  = 913096704
> write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192

Boy, that looks like a smoking gun to me.  We're dealing with a kernel
bug.

What's the platform here, including the exact kernel version?  What
filesystem are you storing the database on?

            regards, tom lane

Re: COPY FROM command v8.1.4

От
Ray Stell
Дата:
Curious how you get to an OS kernel bug from these writes?  Aren't we
missing the associated db reads in this trc data?  Maybe there been
other data provided off the list?  Would love to see it.

Thanks.



On Fri, Sep 22, 2006 at 05:59:31PM -0400, Tom Lane wrote:
> "theman" <bitsandbytes88@hotmail.com> writes:
> > lseek(10, 0, SEEK_END)                  = 913072128
> > write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> > lseek(10, 0, SEEK_END)                  = 913080320
> > write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> > lseek(10, 0, SEEK_END)                  = 913088512
> > write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> > lseek(10, 0, SEEK_END)                  = 913088512
> > write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
> > lseek(10, 0, SEEK_END)                  = 913096704
> > write(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
>
> Boy, that looks like a smoking gun to me.  We're dealing with a kernel
> bug.
>
> What's the platform here, including the exact kernel version?  What
> filesystem are you storing the database on?
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

Re: COPY FROM command v8.1.4

От
Tom Lane
Дата:
Ray Stell <stellr@cns.vt.edu> writes:
> Curious how you get to an OS kernel bug from these writes?

Note the two successive lseek's returning the same result.  With a
write() in between that should have extended the file, that is clearly
the wrong answer.

Subsequent discussion reveals that Dan is running

> Linux pike 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
> SLES 9 suse linux enterprise 9 service pack 3

which is probably overdue for an update ...

            regards, tom lane

Re: COPY FROM command v8.1.4

От
"theman"
Дата:


Thanks a million Tom!

SLES support helped upgrade our SLES 9, sp3, linux kernel from 2.6.5-7.244
to 2.6.5-7.282.  Since that we haven't had any blocks of rows beign
re-written or blanked out by the kernel.  The new kernel is handling the
wirtes much better.








-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Saturday, September 23, 2006 1:41 PM
To: Ray Stell
Cc: theman; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] COPY FROM command v8.1.4

Ray Stell <stellr@cns.vt.edu> writes:
> Curious how you get to an OS kernel bug from these writes?

Note the two successive lseek's returning the same result.  With a
write() in between that should have extended the file, that is clearly the
wrong answer.

Subsequent discussion reveals that Dan is running

> Linux pike 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64
> x86_64 x86_64 GNU/Linux SLES 9 suse linux enterprise 9 service pack 3

which is probably overdue for an update ...

            regards, tom lane