Re: Parallel CREATE INDEX for GIN indexes
От | Andy Fan |
---|---|
Тема | Re: Parallel CREATE INDEX for GIN indexes |
Дата | |
Msg-id | 87o77gveoy.fsf@163.com обсуждение исходный текст |
Ответ на | Re: Parallel CREATE INDEX for GIN indexes (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Ответы |
Re: Parallel CREATE INDEX for GIN indexes
|
Список | pgsql-hackers |
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: Hi Tomas, I am in a incompleted review process but I probably doesn't have much time on this because of my internal tasks. So I just shared what I did and the non-good-result patch. What I tried to do is: 1) remove all the "sort" effort for the state->bs_sort_state tuples since its input comes from state->bs_worker_state which is sorted already. 2). remove *partial* "sort" operations between accum.rbtree to state->bs_worker_state because the tuple in accum.rbtree is sorted already. Both of them can depend on the same API changes. 1. struct Tuplesortstate { .. + bool input_presorted; /* the tuples are presorted. */ + new_tapes; // writes the tuples in memory into a new 'run'. } and user can set it during tuplesort_begin_xx(, presorte=true); 2. in tuplesort_puttuple, if memory is full but presorted is true, we can (a) avoid the sort. (b) resuse the existing 'runs' to reduce the effort of 'mergeruns' unless new_tapes is set to true. once it switch to a new tapes, the set state->new_tapes to false and wait 3) to change it to true again. 3. tuplesort_dumptuples(..); // dump the tuples in memory and set new_tapes=true to tell the *this batch of input is presorted but they are done, the next batch is just presort in its own batch*. In the gin-parallel-build case, for the case 1), we can just use for tuple in bs_worker_sort: tuplesort_putgintuple(state->bs_sortstate, ..); tuplesort_dumptuples(..); At last we can get a). only 1 run in the worker so that the leader can have merge less runs in mergeruns. b). reduce the sort both in perform_sort_tuplesort and in sortstate_puttuple_common. for the case 2). we can have: for tuple in RBTree.tuples: tuplesort_puttuples(tuple) ; // this may cause a dumptuples internally when the memory is full, // but it is OK. tuplesort_dumptuples(..); we can just remove the "sort" into sortstate_puttuple_common but probably increase the 'runs' in sortstate which will increase the effort of mergeruns at last. But the test result is not good, maybe the 'sort' is not a key factor of this. I do missed the perf step before doing this. or maybe my test data is too small. Here is the patch I used for the above activity. and I used the following sql to test. CREATE TABLE t(a int[], b numeric[]); -- generate 1000 * 1000 rows. insert into t select i, n from normal_rand_array(1000, 90, 1::int4, 10000::int4) i, normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n; alter table t set (parallel_workers=4); set debug_parallel_query to on; set max_parallel_maintenance_workers to 4; create index on t using gin(a); create index on t using gin(b); I found normal_rand_array is handy to use in this case and I register it into https://commitfest.postgresql.org/48/5061/. Besides the above stuff, I didn't find anything wrong in the currrent patch, and the above stuff can be categoried into "furture improvement" even it is worthy to. -- Best Regards Andy Fan
Вложения
В списке pgsql-hackers по дате отправления: