Re: Practical error logging for very large COPY statements
From:        Christopher Kings-Lynne
Subject:     Re: Practical error logging for very large COPY statements
Date:
Msg-id:      43827BC9.7050105@familyhealth.com.au
In reply to: Practical error logging for very large COPY statements (Simon Riggs <simon@2ndquadrant.com>)
Responses:   Re: Practical error logging for very large COPY
List:        pgsql-hackers
Seems similar to the pgloader project on pgfoundry.org.

Chris

Simon Riggs wrote:
> If you've ever loaded 100 million rows, you'll know just how annoying it
> is to find that you have a duplicate row somewhere in there. Experience
> shows that there is always one, whatever oath the analyst swears
> beforehand.
>
> It's hard to find out which row is the duplicate, plus you've just
> screwed up a big table. It needs a VACUUM, then a reload.
>
> I'd like to find a way to handle this manual task programmatically.
>
> What I'd like to do is add an ERRORTABLE clause to COPY. The main
> problem is how we detect a duplicate row violation, yet prevent it from
> aborting the transaction.
>
> What I propose is to log uniqueness violations only when there is a
> single unique index on the table.
>
> Flow of control would be to:
>
>     locate page of index where value should go
>     lock index block
>     _bt_check_unique, but don't error
>     if violation then insert row into ERRORTABLE
>     else
>         insert row into data block
>         insert row into unique index
>     unlock index block
>     do other indexes
>
> which is very similar code to the recently proposed MERGE logic.
>
> With that logic, a COPY will run to completion, yet be able to report
> the odd couple of unique index violations it found along the way. More
> importantly, we can then handle those rows with another program to
> locate where those errors came from and resolve them.
>
> In most cases with a single unique index, the index inserts are
> rightmost index entries anyway, so there is scope here for an additional
> optimisation: keep both index and data blocks locked across multiple row
> inserts until either the unique index or the data block fills. That's
> better than taking a full table lock, since it allows concurrent access
> to the rest of the table, but it's also more efficient than continually
> re-requesting the same blocks (which looks like about a 10-15% saving in
> performance from hash lookups, lock/unlock, etc).
>
> Best Regards, Simon Riggs
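For illustration only, a sketch of how the proposed clause might look from the user's side; ERRORTABLE is proposed syntax that does not exist in PostgreSQL, and the table and file names below are invented:

    -- Hypothetical usage of the proposed ERRORTABLE clause (not actual syntax).
    -- Assumes big_load has a single unique index and big_load_errors shares its
    -- row layout so rejected duplicates can be stored there instead of aborting.
    COPY big_load FROM '/data/load.dat' WITH CSV ERRORTABLE big_load_errors;

    -- After the load runs to completion, the rejected rows can be inspected
    -- and resolved separately:
    SELECT * FROM big_load_errors;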