Re: Improving performance of merging data between tables

From: Pawel Veselov
Subject: Re: Improving performance of merging data between tables
Date:
Msg-id: CAMnJ+BfQOBzv8L_VzdKeWWXgJW1Y2pjXq6EHu2DpvqPpZpfFNg@mail.gmail.com
In reply to: Re: Improving performance of merging data between tables  (Andy Colson <andy@squeakycode.net>)
Responses: Re: Improving performance of merging data between tables
Re: Improving performance of merging data between tables
List: pgsql-general

Andy,

thanks for looking into this.

On Mon, Dec 29, 2014 at 9:00 AM, Andy Colson <andy@squeakycode.net> wrote:
On 12/28/2014 3:49 PM, Pawel Veselov wrote:
Hi.

I was wondering if anybody would have any ideas on how to improve
certain operations that we are having.

<SNIP>

Besides "can somebody please look at this and let me know if I'm doing
something utterly stupid", here are my questions.

1) How do I find out what exactly is consuming the CPU in a PL/pgSQL
function? All I see is that the calls to the merge_all() function take a
long time, and the CPU is high while this is going on.



First, I'll admit I didn't read your entire post.

I can think of a couple methods:

1) try each of the statements in merge_all by hand with an "explain analyze" in front to see which is slow.  Look for things that hit big tables without an index.  Check that FK lookups use indexes.
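
For example, a minimal sketch, wrapped in a transaction so the "analyze" run doesn't actually change any data (the table and values here are made up):

begin;
explain (analyze, buffers)
  update main_table
     set total = total + 1
   where id = 42;
rollback;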

If I didn't miss anything, that all seems to be OK, even for the queries inside the functions.
 
2) try pg_stat_statements, setting "pg_stat_statements.track = all".  see:
http://www.postgresql.org/docs/9.4/static/pgstatstatements.html

I have used this to profile some functions, and it worked pretty well. Mostly I use it on a test box, but I once ran it on the live server, which was scary, but worked great.

That looks promising. Turned it on; waiting until I can restart the server at the next "quiet time".
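
(For reference, the setup boils down to this, per the docs above — note that loading the module requires a restart:)

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

-- then, in the database:
create extension pg_stat_statements;
-- after a run, list the most expensive statements:
select query, calls, total_time
  from pg_stat_statements
 order by total_time desc
 limit 10;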
 
3) try auto-explain:
http://www.postgresql.org/docs/9.4/static/auto-explain.html

I've never used it, so I don't know if it'll show each statement inside a function.  It dumps stuff to the log AFAIK, so you'll have to dig the info out by hand.
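
(It does, via auto_explain.log_nested_statements. A minimal session-level sketch, which needs no restart:)

LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log a plan for every statement
SET auto_explain.log_analyze = on;            -- include actual run times
SET auto_explain.log_nested_statements = on;  -- statements inside functions too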

> 2) Is there a better way to merge individual rows, except doing
> UPDATE/INSERT in a loop, and would that be CPU expensive?
>

Not that I know of.  I use pretty much the same thing.  Soon! We will have merge/upsert support.  Hopefully it'll be fast.
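
(For the curious, the upsert syntax proposed for 9.5 looks roughly like this — table and columns made up:)

insert into main_table as m (key, total)
values ('k1', 1)
on conflict (key) do update
   set total = m.total + excluded.total;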

Well, any time I cancelled the PID that was executing this whole mess, it had always stopped at an UPDATE ... SET ... WHERE on the main table, which does make me believe that a bulk update would really help.
 
> 3) Is there a better way to merge whole tables? However, note that I
> need to translate primary keys from the node main table into the common
> main table, as they are used as foreign keys, hence the loops. I suspect
> the looping is CPU intensive.

Avoiding loops and doing things as sets is the best way, if possible. The only loop I saw was looping over the merge_xxx tables, which is probably the only way.

There is an endless loop that is just a device for merging, but then there are loops going over each record in all the tables that are being merged, feeding them into the function that actually does the merge. That table iteration is what I want to eliminate (especially if I knew it would help :) )
 
If possible (if you haven't already) you could add an extra column to your secondary table that you can set to the main table's key.

bulk insert into second;
update second set magic = (select key from main where ... );

Then, maybe, you can do two ops in batch:

update main (where key exists in main)
insert into main (where key not exists in main)
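
Spelled out as SQL, with hypothetical names (second.magic holds the matching main.key, and f() stands in for whatever combines the two values):

-- rows that found a match: update in place
update main
   set val = f(main.val, second.val)
  from second
 where main.key = second.magic;

-- rows that found no match: insert
insert into main (key, val)
select key, val
  from second
 where magic is null;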


I was thinking along the same lines. I can't really do a bulk insert at any point, because any key can be inserted by another process at any time, and with a good probability. However, there will be a lot fewer inserts than updates. So, in general, I'm making it do this:

with pivot as (
  select main_table.id as main_id, node_table.id as node_id
    from node_table left join main_table using (key fields)
)
update node_table
   set translate_id = pivot.main_id
  from pivot
 where node_table.id = pivot.node_id;

(here, "missing" is a cursor over: select * from node_table where translate_id is null)

for row in missing loop
  -- merge_function returns the PK of either the updated or the inserted record.
  -- Pass zero data values, so it's an identity update if the merge results in
  -- an update, or "empty" data if it has to insert.
  select merge_function(row.key_fields, 0) into use_id;
  update node_table set translate_id = use_id where current of missing;
end loop;
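
(merge_function itself is assumed here to be the usual update-then-insert PL/pgSQL idiom that returns the main table's PK — a sketch with made-up columns, not the actual function from this thread:)

create or replace function merge_function(p_key text, p_val int)
returns bigint as $$
declare
  v_id bigint;
begin
  loop
    -- try the update first; most calls are updates
    update main_table set val = val + p_val where key = p_key
      returning id into v_id;
    if found then
      return v_id;
    end if;
    begin
      insert into main_table (key, val) values (p_key, p_val)
        returning id into v_id;
      return v_id;
    exception when unique_violation then
      -- another process inserted the key concurrently; retry the update
    end;
  end loop;
end;
$$ language plpgsql;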

At this point, I have a guarantee that I can update all records, and there is nothing to insert.
So,

with new as ( select * from node_table )
update main_table old
   set val = f(old.val, new.val)
  from new
 where new.translate_id = old.id;

So I don't need full key matching anymore; I can use PKs instead.

