Discussion: Mass updates on a large table
Good afternoon,
I am attempting an update on two new date field columns on a 17 million row table. Every row gets updated.
The update statement is a simple one:
UPDATE EMAILRCPTS SET ID = ID
And the update of the new date fields themselves occurs as the result of a before trigger.
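(For readers following along: the trigger itself isn't shown anywhere in this thread, but a BEFORE UPDATE trigger of the kind described might look roughly like the sketch below. Every identifier in it, the function, the trigger, and the date column names, is invented for illustration, and on 7.4 the function body would have to be written as a single-quoted string, since dollar quoting only arrived in 8.0.)

    -- Hypothetical sketch only: the real trigger is not shown in the thread,
    -- and every identifier below is invented for illustration.
    CREATE OR REPLACE FUNCTION emailrcpts_set_dates() RETURNS trigger AS $$
    BEGIN
        -- Fill the two new date columns from existing ones on each UPDATE.
        NEW.new_date1 := COALESCE(NEW.date_a, NEW.date_b, NEW.date_c);
        NEW.new_date2 := COALESCE(NEW.date_b, NEW.date_c, NEW.date_a);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER emailrcpts_set_dates_trg
        BEFORE UPDATE ON emailrcpts
        FOR EACH ROW EXECUTE PROCEDURE emailrcpts_set_dates();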
The update took 3 days, 10 hours to complete on the testing box. I have already adjusted the CHECKPOINT_SEGMENTS parameter up by 3X to minimize the impact of checkpoints. The SHARED_BUFFERS parameter has been bumped up to 140000 on a 20meg RAM box.
There are about 9 indexes on this table, although none of them reference the date fields, so since there are no inserts I don't think they would have an impact on the update (I've been wrong before, though).
Would an update statement referencing the date fields work faster than a trigger? Do you have any other suggestions to speed this up?
We simply cannot afford this table to be down for 3+ days during a production update. The production box is a 32meg RAM box.
We are at Postgres 7.4.5.
Thanks for your time and brainpower
Mark Steben
Senior Database Administrator
@utoRevenue™
A Dominion Enterprises Company
480 Pleasant Street
Suite B200
Lee, MA 01238
413-243-4800 Home Office
413-243-4809 Corporate Fax
msteben@autorevenue.com
Visit our new website at
www.autorevenue.com
On 8/10/07, Mark Steben <msteben@autorevenue.com> wrote:
>
> Good afternoon,
>
> I am attempting an update on two new date field columns on a 17 million row
> table. Every row gets updated.
>
> The update statement is a simple one:
>
> UPDATE EMAILRCPTS SET ID = ID
>
> And the update of the new date fields themselves occurs as the result of a
> before trigger.
>
> The update took 3 days, 10 hours to complete on the testing box.

That's quite a while for only 17 million rows. Are these rows particularly wide?

Is it possible to do it by groups with a vacuum in between each group? That would keep the bloat down.

You don't mention your vacuuming strategy. That might affect performance here.

Also, are there any FKs to / from this table?

> To minimize the impact of checkpoints. The SHARED_BUFFERS parameter has
> been bumped up to 140000 on a 20meg RAM box.

I assume you meant 20Gig box. Under 7.4 larger shared_buffers may not be a good thing. That's a very large shared buffer setting for 7.4 to handle.

> There are about 9 indexes on this table although none of them reference the
> date fields so since there are no inserts I don't think they would have an
> impact on the update (I've been wrong before though)

Doesn't matter. PostgreSQL's implementation of MVCC means that each update results in a new row, and therefore each index has to be updated for each row updated.

> Would an update statement referencing the date fields work faster than a
> trigger?

Possibly.

> Do you have any other suggestions to speed this up?
> We are at Postgres 7.4.5.

Upgrade to a modern version? 7.4 is getting old fast, and 7.4.5 has a LOT of bugs that have been fixed in later versions. It's up to like 7.4.17, so you're missing a LOT of updates just in the branch you're in. But upgrading to 8.2.4 would definitely be a help.

> We simply cannot afford this table to be down for 3+ days during a
> production update. The production box is a 32meg RAM box.

I would question the business process that requires an entire 17 million row table be updated.

Also, posting your schema and your triggers might help a bit as well.
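A concrete way to act on the "groups with a vacuum in between" suggestion, assuming purely for illustration that the table has a dense integer primary key named id, is to slice the key range and vacuum after each slice, roughly:

    -- Sketch of batched updates; a dense integer key named "id" is an assumption.
    -- Each slice is its own statement/transaction, and the VACUUM between slices
    -- reclaims the dead row versions the previous slice created, limiting bloat.
    UPDATE emailrcpts SET id = id WHERE id >       0 AND id <= 1000000;
    VACUUM emailrcpts;

    UPDATE emailrcpts SET id = id WHERE id > 1000000 AND id <= 2000000;
    VACUUM emailrcpts;

    -- ...and so on, slice by slice, up to the table's maximum id.

These have to be run outside an explicit transaction block, since VACUUM cannot run inside one.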
"Mark Steben" <msteben@autorevenue.com> writes: > I am attempting an update on two new date field columns on a 17 million row > table. Every row gets updated. > The update statement is a simple one: > UPDATE EMAILRCPTS SET ID = ID > And the update of the new date fields themselves occurs as the result of a > before trigger. When you don't show us the trigger, it's hard to make any useful comment ... but 60 rows/second seems slow enough to suspect that the trigger is very inefficient. > There are about 9 indexes on this table although none of them reference the > date fields so since there are no inserts I don't think they would have an > impact on the update (I've been wrong before though) And you are again ... but still, it's awfully slow. > We are at Postgres 7.4.5. That's a big problem right there. One thing I can tell you is that it is sheer folly to set shared_buffers so high on PG 7.x. It wasn't till about 8.1 that we had buffer management algorithms that were good enough for lots of buffers. Dial it down to something under 100K buffers. And think about an update. At the very least you should be on a far more current 7.4.x release. regards, tom lane
On 8/10/07, Mark Steben <msteben@autorevenue.com> wrote:
>
> Hi Scott, thanks for the response.
>
> 4. The business reason for the update: we are replacing within a query a
> COALESCE function that gave us the most recent date between 3 dates. We
> are instead doing the COALESCE as part of a new update trigger, populating
> the new date fields with the most recent dates and referencing the new dates

So this is a one time data migration. OK, then I'm not as worried about the business case being a bit crazy.

> 5. We are planning an upgrade to 8.2.4 in the near future – hopefully
> in a couple months.

Cool. Start testing now. You'd be surprised how many test cases you'll find where your laptop running 8.2 will outrun your server running 7.x...

Nothing too out of the ordinary in your schema, but those multi-dimensional arrays might be a bit expensive, especially for updates.
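Since this is a one-time migration, one alternative to firing the trigger 17 million times is to fold the COALESCE into the UPDATE itself and drop the dummy SET ID = ID entirely; a sketch, with invented column names standing in for the real schema:

    -- Hypothetical sketch: column names are placeholders, not the real schema.
    -- COALESCE returns the first non-null of the listed dates, which is the
    -- logic the trigger is described as applying.
    UPDATE emailrcpts
       SET new_date1 = COALESCE(date_a, date_b, date_c),
           new_date2 = COALESCE(date_b, date_c, date_a);

Combined with the slicing approach shown earlier, each batch then does the real work directly instead of a no-op update whose only purpose is to fire the trigger.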