Обсуждение: Speeding Up Bitmapset

Поиск
Список
Период
Сортировка

Speeding Up Bitmapset

От
Ranier Vilela
Дата:
Hi,

Yuya Watari presented a series of patches, with the objective of improving the Bitmapset [1].
After reading through the patches, I saw a lot of good ideas and thought I'd help.
Unfortunately, my suggestions were not well received.
Even so, I decided to work on these patches and see what could be improved.

Eventually it arrived at the attached patch, which I'm naming v7, because of the sequence it had established.

Those who follow the other thread will see that the original patch actually improves the overall performance of Bitmapset,
but I believe there is room for further improvement.

I ran the same tests provided by Yuya Watari and the results are attached.
Both on Windows and Linux Ubuntu, the performance of v7 outperforms head and v4.
At Ubuntu 64 bits:
v7 outperforms v4 by 29% (Query A)
v7 outperforms v4 by 19% (Query B)

At Windows 64 bits:
v7 outperforms v4 by 22% (Query A)
v7 outperforms v4 by 33% (Query B)

I believe patch v7 leaves the Bitmapset in good shape and readable, well written.

Questions arose regarding possible regression when using the backward loop, but in all the tests I ran this version with backward performed better.
Possibly because of the smaller number of variables and the efficient test (--i <= 0), which both the msvc and gcc compilers successfully optimize.

If the v7 version with loop forward performs worse on x86_64 cpus,
I don't see how it will perform better on other architectures, since the vast majority of modern ones and with great cache support.

Another question is related to an alleged "spurius test", whose objective is to avoid test and set instructions for each element of the array.
Again I ran the tests without the test and the performance was worse, showing its value.

regards,
Ranier Vilela

Вложения

Re: Speeding Up Bitmapset

От
David Rowley
Дата:
On Mon, 26 Jun 2023 at 00:29, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Yuya Watari presented a series of patches, with the objective of improving the Bitmapset [1].
> After reading through the patches, I saw a lot of good ideas and thought I'd help.
> Unfortunately, my suggestions were not well received.

For the future, I recommend if you do have ideas that you wish to
convey and think a patch is the easiest way to do that, then it's best
to send those with an extension that won't cause the CFbot to pickup
your patch instead. You patch did and still does seem to have a load
of extra changes that are unjustified by your claims.  There are quite
a lot of white space changes going on. There's no reason in the world
that those will speed up Bitmapsets, so why include them?

> v7 outperforms v4 by 29% (Query A)

I tested v7 with query-a and I also see additional gains. However,
it's entirely down to your changes to bms_is_subset(). It seems, by
chance, with the given Bitmapsets that looping backwards for the given
sets is able to determine the result more quickly

Here's some results from "perf top"

query-a
v4

  30.08%  postgres          [.] bms_is_subset
  15.84%  postgres          [.] create_join_clause
  13.54%  postgres          [.] bms_equal
  11.03%  postgres          [.] get_eclass_for_sort_expr
   8.53%  postgres          [.] generate_implied_equalities_for_column
   3.11%  postgres          [.] generate_join_implied_equalities_normal
   1.03%  postgres          [.] add_child_rel_equivalences
   0.82%  postgres          [.] SearchCatCacheInternal
   0.73%  postgres          [.] AllocSetAlloc
   0.53%  postgres          [.] find_ec_member_matching_expr
   0.40%  postgres          [.] hash_search_with_hash_value
   0.36%  postgres          [.] palloc
   0.36%  postgres          [.] palloc0

latency average = 452.480 ms

v7
  20.51%  postgres          [.] create_join_clause
  15.33%  postgres          [.] bms_equal
  14.17%  postgres          [.] get_eclass_for_sort_expr
  12.05%  postgres          [.] bms_is_subset
  10.40%  postgres          [.] generate_implied_equalities_for_column
   3.90%  postgres          [.] generate_join_implied_equalities_normal
   1.34%  postgres          [.] add_child_rel_equivalences
   1.06%  postgres          [.] AllocSetAlloc
   1.00%  postgres          [.] SearchCatCacheInternal
   0.72%  postgres          [.] find_ec_member_matching_expr
   0.58%  postgres          [.] palloc0
   0.49%  postgres          [.] palloc
   0.47%  postgres          [.] hash_search_with_hash_value
   0.44%  libc.so.6         [.] __memmove_avx_unaligned_erms


latency average = 350.543 ms

modified v7's bms_is_subset to go forwards then I get latency average
= 445.987 ms.

If I add some debugging to bms_is_subset to have it record how many
words it checks, I see:

postgres=# select sum(nwords) from forward;
    sum
-----------
 181490660
(1 row)

postgres=# select sum(nwords) from backwards;
   sum
----------
 11322564
(1 row)

So, it took about 181 million loops in bms_is_member to plan query-a
when looping forward, but only 11 million when looping backwards.

I think unless you've got some reason that you're able to justify why
we're always more likely to have to perform fewer word checks in
bms_is_subset() by looping backwards instead of forwards then I think
the performance gains you're showing here only happen to show better
results due to the given workload.  It's just as easy to imagine that
you'll apply the equivalent slowdown for some other workload where the
initial words differ but all the remaining words all match.

This seems equivalent to someone suggesting that looping backwards in
list_member() is better than looping forwards because it'll find the
member more quickly.  I can't imagine us ever considering that would
be a good change to make and I think the same for bms_is_member().

David



Re: Speeding Up Bitmapset

От
Ranier Vilela
Дата:
Em dom., 25 de jun. de 2023 às 17:49, David Rowley <dgrowleyml@gmail.com> escreveu:
There's no reason in the world
that those will speed up Bitmapsets, so why include them?
Of course optimization is the most important thing,
but since you're going to touch the source, why not make it more readable.
 

> v7 outperforms v4 by 29% (Query A)

I tested v7 with query-a and I also see additional gains. However,
it's entirely down to your changes to bms_is_subset(). It seems, by
chance, with the given Bitmapsets that looping backwards for the given
sets is able to determine the result more quickly
I redid the tests and it seems that most of the difference comes from bms_subset.
 

Here's some results from "perf top"

query-a
v4

  30.08%  postgres          [.] bms_is_subset
  15.84%  postgres          [.] create_join_clause
  13.54%  postgres          [.] bms_equal
  11.03%  postgres          [.] get_eclass_for_sort_expr
   8.53%  postgres          [.] generate_implied_equalities_for_column
   3.11%  postgres          [.] generate_join_implied_equalities_normal
   1.03%  postgres          [.] add_child_rel_equivalences
   0.82%  postgres          [.] SearchCatCacheInternal
   0.73%  postgres          [.] AllocSetAlloc
   0.53%  postgres          [.] find_ec_member_matching_expr
   0.40%  postgres          [.] hash_search_with_hash_value
   0.36%  postgres          [.] palloc
   0.36%  postgres          [.] palloc0

latency average = 452.480 ms

v7
  20.51%  postgres          [.] create_join_clause
  15.33%  postgres          [.] bms_equal
  14.17%  postgres          [.] get_eclass_for_sort_expr
  12.05%  postgres          [.] bms_is_subset
  10.40%  postgres          [.] generate_implied_equalities_for_column
   3.90%  postgres          [.] generate_join_implied_equalities_normal
   1.34%  postgres          [.] add_child_rel_equivalences
   1.06%  postgres          [.] AllocSetAlloc
   1.00%  postgres          [.] SearchCatCacheInternal
   0.72%  postgres          [.] find_ec_member_matching_expr
   0.58%  postgres          [.] palloc0
   0.49%  postgres          [.] palloc
   0.47%  postgres          [.] hash_search_with_hash_value
   0.44%  libc.so.6         [.] __memmove_avx_unaligned_erms


latency average = 350.543 ms

modified v7's bms_is_subset to go forwards then I get latency average
= 445.987 ms.

If I add some debugging to bms_is_subset to have it record how many
words it checks, I see:

postgres=# select sum(nwords) from forward;
    sum
-----------
 181490660
(1 row)

postgres=# select sum(nwords) from backwards;
   sum
----------
 11322564
(1 row)

So, it took about 181 million loops in bms_is_member to plan query-a
when looping forward, but only 11 million when looping backwards.

I think unless you've got some reason that you're able to justify why
we're always more likely to have to perform fewer word checks in
bms_is_subset() by looping backwards instead of forwards then I think
the performance gains you're showing here only happen to show better
results due to the given workload.  It's just as easy to imagine that
you'll apply the equivalent slowdown for some other workload where the
initial words differ but all the remaining words all match.
Have you seen bms_compare?
For some reason someone thought it would be better to loop backward the array.

Since bms_subset performed very well with backward, I think that's a good reason to leave it as bms_compare.
As in most cases, the array size is small, in general in both modes the performance will be equivalent.
 
Anyway, I made a new patch v8, based on v4 but with some changes that I believe improve it.

Windows 64 bits
msvc 2019 64 bits

== Query A ==
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\create-tables-a.sql
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\query-a.sql
=============

patched v4:
Time: 2305,445 ms (00:02,305)
Time: 2185,972 ms (00:02,186)
Time: 2177,434 ms (00:02,177)
Time: 2169,883 ms (00:02,170)

patched v8:
Time: 2143,532 ms (00:02,144)
Time: 2140,313 ms (00:02,140)
Time: 2138,481 ms (00:02,138)
Time: 2130,290 ms (00:02,130)

== Query B ==
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\create-tables-b.sql
psql -U postgres -f c:\postgres_bench\tmp\bitmapset\query-b.sql

patched v4:
Time: 2684,360 ms (00:02,684)
Time: 2482,571 ms (00:02,483)
Time: 2452,699 ms (00:02,453)
Time: 2465,223 ms (00:02,465)

patched v8:
Time: 2493,281 ms (00:02,493)
Time: 2490,090 ms (00:02,490)
Time: 2432,515 ms (00:02,433)
Time: 2426,860 ms (00:02,427)


linux Ubuntu 64 bit
gcc 64 bits

== Query A ==
/usr/local/pgsql/bin/psql -U postgres -f /home/ranier/Documentos/benchmarks/bitmapset/create-tables-a.sql
/usr/local/pgsql/bin/psql -U postgres -f /home/ranier/Documentos/benchmarks/bitmapset/query-a.sql
=============

patched v4:
Time: 933,181 ms
Time: 931,520 ms
Time: 906,496 ms
Time: 872,446 ms

patched v8:
Time: 937,079 ms
Time: 930,408 ms
Time: 865,548 ms
Time: 865,382 ms

== Query B ==
/usr/local/pgsql/bin/psql -U postgres -f /home/ranier/Documentos/benchmarks/bitmapset/create-tables-b.sql
/usr/local/pgsql/bin/psql -U postgres -f /home/ranier/Documentos/benchmarks/bitmapset/query-b.sql

patched v4:
Time: 1581,317 ms (00:01,581)
Time: 1568,371 ms (00:01,568)
Time: 1468,036 ms (00:01,468)
Time: 1445,698 ms (00:01,446)

patched v8:
Time: 1437,997 ms (00:01,438)
Time: 1437,435 ms (00:01,437)
Time: 1440,422 ms (00:01,440)
Time: 1436,112 ms (00:01,436)

regards,
Ranier Vilela
Вложения

Re: Speeding Up Bitmapset

От
David Rowley
Дата:
On Mon, 26 Jun 2023 at 12:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
> Have you seen bms_compare?
> For some reason someone thought it would be better to loop backward the array.

That's nothing to do with efficiency. It's related to behaviour. Have
a look at the function's header comment, it's trying to find the set
with the highest value member.  It wouldn't make much sense to start
at the lowest-order words for that.

David



Re: Speeding Up Bitmapset

От
David Rowley
Дата:
On Mon, 26 Jun 2023 at 12:55, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em dom., 25 de jun. de 2023 às 17:49, David Rowley <dgrowleyml@gmail.com> escreveu:
>>
>> There's no reason in the world
>> that those will speed up Bitmapsets, so why include them?
>
> Of course optimization is the most important thing,
> but since you're going to touch the source, why not make it more readable.

Have a look at [1]. In particular "Reasons your patch might be
returned" and precisely "Reformatting lines that haven't changed"

That entire page is quite valuable and I'd recommend having a read of all of it.

David

[1] https://wiki.postgresql.org/wiki/Submitting_a_Patch