Thread: [HACKERS] [WIP] Zipfian distribution in pgbench


[HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:
Hello!

PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of a random row by PK) when benchmarking
with a big number of clients using a Zipfian distribution. MySQL also has a decline, but it is not as significant as in
PostgreSQL. MongoDB does not have a decline at all. And if pgbench had a Zipfian distribution random number
generator, everyone would be able to do research on this topic without using YCSB.

This is the reason why I am currently working on random_zipfian function.

The bottleneck of the algorithm that I use is that it calculates the zeta function, which has linear complexity
(https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause problems when generating a huge number of big values.

That’s why I added caching for the zeta value. It works well for cases when random_zipfian is called with the same
parameters throughout the script. For example:

…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 100, 1.2)
…

Otherwise, the second call will override the cache of the first, and caching makes no sense:
…
\set a random_zipfian(1, 100, 1.2)
\set b random_zipfian(1, 200, 1.4)
…

That’s why I have a question: should I implement caching of zeta values for calls with different parameters, or
not?

P.S. I am attaching the patch and a script, an analogue of YCSB Workload A.
Run benchmark with command:
$ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql

On scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):
    nclients    tps
    1        8842.401870
    2        18358.140869
    4        45999.378785
    8        88713.743199
    16        170166.998212
    32        290069.221493
    64        178128.030553
    128        88712.825602
    256        38364.937573
    512        13512.765878
    1000     6188.136736

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Fabien COELHO
Date:
Hello Alik,

> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50%
> UPDATE of a random row by PK) when benchmarking with a big number of
> clients using a Zipfian distribution. MySQL also has a decline but it is
> not as significant as in PostgreSQL. MongoDB does not have a decline at
> all. And if pgbench had a Zipfian distribution random number generator,
> everyone would be able to do research on this topic without using YCSB.

Your description is not very precise. What version of Postgres is used? If 
there is a decline, compared to which version? Is there a link to these 
results?

> This is the reason why I am currently working on random_zipfian function.

> The bottleneck of algorithm that I use is that it calculates zeta 
> function (it has linear complexity - 
> https://en.wikipedia.org/wiki/Riemann_zeta_function). It may cause
> problems when generating a huge number of big values.

Indeed, the function computation is overly expensive, and the numerical
precision of the implementation is doubtful.

If there is no better way to compute this function, ISTM that it should be
summed in reverse order to accumulate small values first, from (1/n)^s +
... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, no point in
calling pow for this one, so it could be:

       double ans = 0.0;
       for (i = n; i >= 2; i--)
             ans += pow(1. / i, theta);
       return 1.0 + ans;

> That’s why I added caching for zeta value. And it works good for cases 
> when random_zipfian called with same parameters in script. For example:

> That’s why I have a question: should I implement support of caching zeta 
> values for calls with different parameters, or not?

I do not envision the random_zipfian function to be used widely, but if it 
is useful to someone this is fine with me. Could an inexpensive 
exponential distribution be used instead for the same benchmarking 
purpose?

If the function, when actually used, is likely to be called with different
parameters, then some caching beyond the last value would seem in order.
Maybe a small fixed-size array?

However, it should be somehow thread safe, which does not seem to be the 
case with the current implementation. Maybe a per-thread cache? Or use a 
lock only to update a shared cache? At least it should avoid locking to 
read values...

> P.S. I am attaching the patch and a script, an analogue of YCSB Workload A.
> Run the benchmark with the command:
> $ pgbench -f ycsb_read_zipf.sql -f ycsb_update_zipf.sql

Given the explanations, the random draw mostly hits values at the
beginning of the interval, so when the number of clients goes higher one
just gets locking contention on the updated rows?

ISTM that also having the tps achieved with a flat distribution would
allow checking this hypothesis.

> On scale = 10 (1 million rows) it gives the following results on a machine
> with 144 cores (with synchronous_commit=off):

>     nclients    tps
>     1        8842.401870
>     2        18358.140869
>     4        45999.378785
>     8        88713.743199
>     16        170166.998212
>     32        290069.221493
>     64        178128.030553
>     128        88712.825602
>     256        38364.937573
>     512        13512.765878
>     1000     6188.136736

-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Robert Haas
Date:
On Fri, Jul 7, 2017 at 3:45 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of random row by PK) on benchmarking
> with big number of clients using Zipfian distribution. MySQL also has decline but it is not significant as it is in
> PostgreSQL. MongoDB does not have decline at all.

How is that possible?  In a Zipfian distribution, no matter how big
the table is, almost all of the updates will be concentrated on a
handful of rows - and updates to any given row are necessarily
serialized, or so I would think.  Maybe MongoDB can be fast there
since there are no transactions, so it can just lock the row, slam in
the new value, and unlock the row, all (I suppose) without writing WAL
or doing anything hard.  But MySQL is going to have to hold the row
lock until transaction commit just like we do, or so I would think.
Is it just that their row locking is way faster than ours?

I'm more curious about why we're performing badly than I am about a
general-purpose random_zipfian function.  :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Fri, Jul 7, 2017 at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> How is that possible?  In a Zipfian distribution, no matter how big
> the table is, almost all of the updates will be concentrated on a
> handful of rows - and updates to any given row are necessarily
> serialized, or so I would think.  Maybe MongoDB can be fast there
> since there are no transactions, so it can just lock the row slam in
> the new value and unlock the row, all (I suppose) without writing WAL
> or doing anything hard.

If you're not using the WiredTiger storage engine, then the locking
is at the document level, which means that a Zipfian distribution is
no worse than any other as far as lock contention goes. That's one
possible explanation. Another is that index-organized tables
naturally have much better locality, which matters at every level of
the memory hierarchy.

> I'm more curious about why we're performing badly than I am about a
> general-purpose random_zipfian function.  :-)

I'm interested in both. I think that a random_zipfian function would
be quite helpful for modeling certain kinds of performance problems,
like CPU cache misses incurred at the page level.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Fri, Jul 7, 2017 at 12:45 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> On scale = 10 (1 million rows) it gives the following results on a machine with 144 cores (with synchronous_commit=off):
>         nclients        tps
>         1               8842.401870
>         2               18358.140869
>         4               45999.378785
>         8               88713.743199
>         16              170166.998212
>         32              290069.221493
>         64              178128.030553
>         128             88712.825602
>         256             38364.937573
>         512             13512.765878
>         1000    6188.136736

Is it possible for you to instrument the number of B-Tree page
accesses using custom instrumentation for pgbench_accounts_pkey?

If that seems like too much work, then it would still be interesting
to see what the B-Tree keyspace looks like before and after varying
the "nclient" count from, say, 32 to 128. Maybe there is a significant
difference in how balanced or skewed it is in each case. Or, the index
could simply be more bloated.

There is a query that I sometimes use, that itself uses pageinspect,
to summarize the keyspace quickly. It shows you the highkey for every
internal page, starting from the root and working down to the lowest
internal page level (the one just before the leaf level -- level 1),
in logical/keyspace order. You can use it to visualize the
distribution of values. It could easily include the leaf level, too,
but that's less interesting and tends to make the query take ages. I
wonder what the query will show here.

Here is the query:

with recursive index_details as (
  select 'pgbench_accounts_pkey'::text idx
),
size_in_pages_index as (
  select (pg_relation_size(idx::regclass) / (2^13))::int4 size_pages
  from index_details
),
page_stats as (
  select index_details.*, stats.*
  from index_details, size_in_pages_index,
    lateral (select i from generate_series(1, size_pages - 1) i) series,
    lateral (select * from bt_page_stats(idx, i)) stats
),
internal_page_stats as (
  select * from page_stats where type != 'l'
),
meta_stats as (
  select * from index_details s,
    lateral (select * from bt_metap(s.idx)) meta
),
internal_items as (
  select * from internal_page_stats order by btpo desc
),
-- XXX: Note ordering dependency within this CTE, on internal_items
ordered_internal_items(item, blk, level) as (
  select 1, blkno, btpo
  from internal_items
  where btpo_prev = 0 and btpo = (select level from meta_stats)
  union
  select case when level = btpo then o.item + 1 else 1 end, blkno, btpo
  from internal_items i, ordered_internal_items o
  where i.btpo_prev = o.blk or (btpo_prev = 0 and btpo = o.level - 1)
)
select idx, btpo as level, item as l_item, blkno, btpo_prev, btpo_next,
  btpo_flags, type, live_items, dead_items, avg_item_size, page_size,
  free_size,
  -- Only non-rightmost pages have a high key.
  case when btpo_next != 0 then
    (select data from bt_page_items(idx, blkno) where itemoffset = 1)
  end as highkey
from ordered_internal_items o join internal_items i on o.blk = i.blkno
order by btpo desc, item;

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> Here is the query:
> 
> with recursive index_details as (
>   select
>     'pgbench_accounts_pkey'::text idx
> ), [...]

Hmm, this seems potentially very useful.  Care to upload it to
https://wiki.postgresql.org/wiki/Category:Snippets ?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Fri, Jul 7, 2017 at 12:59 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Hmm, this seems potentially very useful.  Care to upload it to
> https://wiki.postgresql.org/wiki/Category:Snippets ?

Sure. I've added it here, under "index maintenance":

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index

It would be a nice additional touch if there was an easy way of taking
the on-disk representation of index tuples (in this case that would be
little-endian signed integers from bt_page_items()), and from that
output actual typed values. Maybe just for a few select datatypes.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:
Hello, Fabien!

> Your description is not very precise. What version of Postgres is used? If there is a decline, compared to which version? Is there a link to these results?

The benchmark was done on master (v10). I am attaching an image with the results.

> Indeed, the function computation is over expensive, and the numerical precision of the implementation is doubtful.
>
> If there is no better way to compute this function, ISTM that it should be summed in reverse order to accumulate small values first, from (1/n)^s + ... + (1/2)^s. As 1/1 == 1, the corresponding term is 1, no point in calling pow for this one, so it could be:
>
>       double ans = 0.0;
>       for (i = n; i >= 2; i--)
>             ans += pow(1. / i, theta);
>       return 1.0 + ans;

You are right, it’s better to reverse the order.

> If the function, when actually used, is likely to be called with different parameters, then some caching beyond the last value would seem in order. Maybe a small fixed-size array?
>
> However, it should be somehow thread safe, which does not seem to be the case with the current implementation. Maybe a per-thread cache? Or use a lock only to update a shared cache? At least it should avoid locking to read values…

Yeah, I forgot about thread safety. I will implement a per-thread cache with a small fixed array.

> Given the explanations, the random draw mostly hits values at the beginning of the interval, so when the number of clients goes higher one just gets locking contention on the updated row?

Yes, exactly. 

> ISTM that also having the tps achieved with a flat distribution would allow checking this hypothesis.

On Workload A with a uniform distribution, PostgreSQL shows better results than MongoDB and MySQL (see attachment). Also, you can notice that for a small number of clients the type of distribution does not affect tps on MySQL.


And it’s important to mention that Postgres ran with synchronous_commit=off; to match that durability, MongoDB ran with writeConcern=1&journaled=false. In this mode there is a possibility of losing all changes from the last second. If we ran Postgres with maximum durability, MongoDB would lag far behind.
---
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Amit Kapila
Date:
On Mon, Jul 10, 2017 at 12:19 PM, Alik Khilazhev <a.khilazhev@postgrespro.ru> wrote:
> Hello, Fabien!
>
>> Your description is not very precise. What version of Postgres is used? If there is a decline, compared to which version? Is there a link to these results?
>
> The benchmark was done on master (v10). I am attaching an image with the results.


It will be interesting to see what the profiling data with perf says about this for PostgreSQL.  Can you try to get the perf report?


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Fabien COELHO
Date:
Hello Alik,

>> Your description is not very precise. What version of Postgres is used? 
>> If there is a decline, compared to which version? Is there a link to 
>> these results?
>
> The benchmark was done on master (v10). I am attaching an image with the results.

Ok, thanks.

More precision would be helpful, such as the exact pgbench options used (e.g.
how many clients per thread in pgbench, how long the run lasts, prepared
transactions, ...).

Intuitively, contention should explain a saturation of the tps
performance, because additional clients do not improve tps while they
wait for other clients, and the latency would degrade.

But it is unclear to me why the tps would be reduced even with lock 
contention, so something seems amiss.

Performance debugging by mail is an uneasy task.

Maybe you could try zipf with unlogged tables, to check whether skipping 
the WAL write does something.

Also Amit advice about the perf report looks useful.

>> Given the explanations, the random draw mostly hits values at the 
>> beginning of the interval, so when the number of client goes higher one 
>> just get locking contention on the updated row?
>
> Yes, exactly.

Ok. The uniform distribution run, if all other parameters are equal, gives 
a hint about the potential performance when the performance bottleneck is 
hit.

> On Workload A with uniform distribution PostgreSQL shows better results
> than MongoDB and MySQL (see attachment). Also you can notice that for a
> small number of clients the type of distribution does not affect tps on
> MySQL.

Ok. I assume that you use pgbench for pg and other ad-hoc tools for the 
other targets (mysql & mongodb).

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:

On 7 Jul 2017, at 21:53, Peter Geoghegan <pg@bowt.ie> wrote:

> Is it possible for you to instrument the number of B-Tree page
> accesses using custom instrumentation for pgbench_accounts_pkey?
>
> If that seems like too much work, then it would still be interesting
> to see what the B-Tree keyspace looks like before and after varying
> the "nclient" count from, say, 32 to 128. Maybe there is a significant
> difference in how balanced or skewed it is in each case. Or, the index
> could simply be more bloated.
>
> There is a query that I sometimes use, that itself uses pageinspect,
> to summarize the keyspace quickly. It shows you the highkey for every
> internal page, starting from the root and working down to the lowest
> internal page level (the one just before the leaf level -- level 1),
> in logical/keyspace order. You can use it to visualize the
> distribution of values. It could easily include the leaf level, too,
> but that's less interesting and tends to make the query take ages. I
> wonder what the query will show here.
>
> Here is the query:

I am attaching the results of the query you sent. They show that nothing has changed after executing the tests.
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:
Hello!

I want to say that our company is already investigating the causes of the
problem and their solution. We also have a few experimental patches that
increase performance for 1000 clients by several times.

In addition, I have fixed the thread-safety issues and implemented a per-thread cache for zeta values. See the attached patch.
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> I am attaching the results of the query you sent. They show that nothing
> has changed after executing the tests.

But something did change! In the case where performance was good, all
internal pages on the level above the leaf level have exactly 285
items, excluding the rightmost page, which unsurprisingly didn't fill.
However, after increasing client count to get the slow case, the "hot"
part of the keyspace (the leaf pages owned by the first/leftmost
internal page on that level) has 353 items rather than 285.

Now, that might not seem like that much of a difference, but if you
consider how duplicates are handled in the B-Tree code, and how unique
index enforcement works, I think it could be. It could lead to heavy
buffer lock contention, because we sometimes do a lot of work with an
exclusive buffer lock held.

This is something that I go into a lot of detail on in the Wiki page
on Key normalization:
https://wiki.postgresql.org/wiki/Key_normalization#Avoiding_unnecessary_unique_index_enforcement

Now, I'm not saying that I know for sure that that's what it is. It
seems like a good area to investigate, though. Even if it wasn't
buffer lock contention, we're still talking about the difference
between the hot part of the B-Tree being about 353 pages, versus 285.
Buffer lock contention could naturally limit the growth in size to
"only" 353, by slowing everything down.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Wed, Jul 12, 2017 at 12:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Jul 12, 2017 at 4:28 AM, Alik Khilazhev
> <a.khilazhev@postgrespro.ru> wrote:
>> I am attaching the results of the query you sent. They show that nothing
>> has changed after executing the tests.
>
> But something did change! In the case where performance was good, all
> internal pages on the level above the leaf level have exactly 285
> items, excluding the rightmost page, which unsurprisingly didn't fill.
> However, after increasing client count to get the slow case, the "hot"
> part of the keyspace (the leaf pages owned by the first/leftmost
> internal page on that level) has 353 items rather than 285.

Could you please run the query again for both cases, but this time
change it to get items from the leaf level too, and not just internal
levels? Just remove the "where type != 'l'" clause in one of the CTEs.
That should do it.

It's probably the case that the hot part of the B-tree is actually
significantly less than 353 items (or 285 items), and so the "before"
and "after" difference in how many pages are used for the hot part of
the keyspace could be quite a lot larger than my initial estimate. It
could be that the granularity that is shown on the internal pages is
insufficient to see just how bad the problem is.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alvaro Herrera
Date:
Peter Geoghegan wrote:

> Now, that might not seem like that much of a difference, but if you
> consider how duplicates are handled in the B-Tree code, and how unique
> index enforcement works, I think it could be. It could lead to heavy
> buffer lock contention, because we sometimes do a lot of work with an
> exclusive buffer lock held.

Not to mention work done with a "buffer cleanup lock" held -- which is
compounded by the fact that acquiring such a lock is prone to starvation
if there are many scanners of that index.  I've seen a case where a very
hot table is scanned so heavily that vacuum is starved for days waiting
to acquire cleanup on a single page (vacuum was only able to finish
because the app using the table was restarted).  I'm sure that a uniform
distribution of keys, with a uniform distribution of values scanned,
would give a completely different behavior than a highly skewed
distribution where a single key receives a large fraction of the scans.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Wed, Jul 12, 2017 at 1:55 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Not to mention work done with a "buffer cleanup lock" held -- which is
> compounded by the fact that acquiring such a lock is prone to starvation
> if there are many scanners of that index.  I've seen a case where a very
> hot table is scanned so heavily that vacuum is starved for days waiting
> to acquire cleanup on a single page (vacuum was only able to finish
> because the app using the table was restarted).  I'm sure that a uniform
> distribution of keys, with a uniform distribution of values scanned,
> would give a completely different behavior than a highly skewed
> distribution where a single key receives a large fraction of the scans.

Good point.

I'd be interested in seeing the difference it makes if Postgres is
built with the call to _bt_check_unique() commented out within
nbtinsert.c. The UPDATE query doesn't ever change indexed columns, so
there should be no real need for the checking it does in this case.
We're seeing a lot of duplicates in at least part of the keyspace, and
yet the queries themselves should be as HOT-safe as possible.

Another possibly weird thing that I noticed is latency standard
deviation. This is from Alik's response to my request to run that SQL
query:

latency average = 1.375 ms
tps = 93085.250384 (including connections establishing)
tps = 93125.724773 (excluding connections establishing)
SQL script 1: /home/nglukhov/ycsb_read_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2782999 transactions (49.8% of total, tps = 46364.447705)
 - latency average = 0.131 ms
 - latency stddev = 0.087 ms

SQL script 2: /home/nglukhov/ycsb_update_zipf.sql
 - weight: 1 (targets 50.0% of total)
 - 2780197 transactions (49.8% of total, tps = 46317.766703)
 - latency average = 2.630 ms
 - latency stddev = 14.092 ms
 

This is from the 128 client case -- the slow case.

Notice that the standard deviation is very high for
ycsb_update_zipf.sql. I wonder if this degrades because of some kind
of feedback loop, that reaches a kind of tipping point at higher
client counts.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Wed, Jul 12, 2017 at 2:17 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I'd be interested in seeing the difference it makes if Postgres is
> built with the call to _bt_check_unique() commented out within
> nbtinsert.c.

Actually, I mean that I wonder how much of a difference it would make
if this entire block was commented out within _bt_doinsert():

if (checkUnique != UNIQUE_CHECK_NO)
{
    ...
}


-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:

On 13 Jul 2017, at 00:20, Peter Geoghegan <pg@bowt.ie> wrote:

Actually, I mean that I wonder how much of a difference it would make
if this entire block was commented out within _bt_doinsert():

if (checkUnique != UNIQUE_CHECK_NO)
{
   …
}

I am attaching the results of the test for 32 and 128 clients, for the original and patched (_bt_doinsert) variants.
— 
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Fabien COELHO
Date:
Hello Alik,

A few comments about the patch v2.

Patch applies and compiles.

Documentation says that the closer theta is to 0, the flatter the distribution,
but the implementation requires at least 1, including a strange error message:

   zipfian parameter must be greater than 1.000000 (not 1.000000)

Could theta be allowed between 0 & 1 ? I've tried forcing with theta = 0.1
and it worked well, so I'm not sure that I understand the restriction.
I also tried with theta=0.001 but it seemed less good.

I have also tried to check the distribution wrt the explanations, with the 
attached scripts, n=100, theta=1.000001/1.5/3.0: It does not seem to work, 
there is repeatable +15% bias on i=3 and repeatable -3% to -30% bias for 
values in i=10-100, this for different values of theta (1.000001,1.5, 
3.0).

If you try the script, beware to set parameters (theta, n) consistently.

About the code:

Remove spaces and tabs at end of lines or on empty lines.

zipfn: I would suggest moving the pg_erand48 out and passing the result
instead of passing the thread; the erand call would move to getZipfianRand.

I'm wondering whether the "nearest" hack could be removed so as to simplify
the cache functions code...

Avoid editing lines without changes (spaces/tabs?)
   -  thread->logfile = NULL; /* filled in later */
   +  thread->logfile = NULL; /* filled in later */

The documentation explaining the intuitive distribution talks about a N
but does not provide any hint about its value depending on the parameters.

There is a useless empty line before "</para>" after that.

-- 
Fabien.

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Thu, Jul 13, 2017 at 4:38 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of test for 32 and 128 clients for original and
> patched(_bt_doinsert) variants.

Thanks.

The number of leaf pages at the left hand side of the leaf level seems
to be ~50 less than the unpatched 128 client case was the first time
around, which seems like a significant difference. I wonder why. Maybe
autovacuum ran at the right/wrong time this time around?

Did you happen to record TPS for this most recent set of tests?

I notice one possibly interesting thing from these new numbers: the 32
client case is slightly more bloated when unique index enforcement is
removed.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Thu, Jul 13, 2017 at 10:02 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> The number of leaf pages at the left hand side of the leaf level seems
> to be ~50 less than the unpatched 128 client case was the first time
> around, which seems like a significant difference. I wonder why. Maybe
> autovacuum ran at the right/wrong time this time around?

To reiterate what I say above:

The number of leaf pages with dead items is 20 with this most recent
run (128 clients, patched + unpatched). The leftmost internal page one
level up from the leaf level contains 289 items. Whereas last time it
was 353 items.

That's a difference between having 20 hot/bloated leaf pages, and
probably 84 hot/bloated pages, which I infer must have been the total
number of bloated leaf pages within "result.txt". I think that
something about all the "pgbench_index_*txt" tests are very different
to what we see within "result.txt". It's as if there was a problem
when "result.txt" ran, but that problem somehow didn't come up when
you did new tests.


-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Peter Geoghegan
Date:
On Thu, Jul 13, 2017 at 12:49 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> To reiterate what I say above:
>
> The number of leaf pages with dead items is 20 with this most recent
> run (128 clients, patched + unpatched). The leftmost internal page one
> level up from the leaf level contains 289 items. Whereas last time it
> was 353 items.
>
> That's a difference between having 20 hot/bloated leaf pages, and
> probably 84 hot/bloated pages, which I infer must have been the total
> number of bloated leaf pages within "result.txt". I think that
> something about all the "pgbench_index_*txt" tests are very different
> to what we see within "result.txt". It's as if there was a problem
> when "result.txt" ran, but that problem somehow didn't come up when
> you did new tests.

I just figured out that "result.txt" is only a 60 second pgbench run.
Is the same true for other tests?

It would be much more interesting to see runs that lasted 10 minutes
or more. Maybe with pgbench-tools. I would expect that a decline in
performance due to bloating the index could happen only after a
certain threshold was crossed. Things like HOT pruning and LP_DEAD
cleanup could be pretty effective, until finally a tipping point is
reached and they're much less effective.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From:
Alik Khilazhev
Date:

On 13 Jul 2017, at 19:14, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Documentation says that the closer theta is from 0 the flatter the distribution
but the implementation requires at least 1, including strange error messages:

 zipfian parameter must be greater than 1.000000 (not 1.000000)

Could theta be allowed between 0 & 1 ? I've tried forcing with theta = 0.1
and it worked well, so I'm not sure that I understand the restriction.
I also tried with theta=0.001 but it seemed less good.

The algorithm works with theta less than 1. The only problem is that theta cannot be 1, because of the following line of code:

cell->alpha = 1. / (1 - theta);

That’s why I put such a restriction. Now I see two possible solutions:
1) Exclude 1, and allow everything in the range (0; +∞).
2) Or just increase/decrease theta by a very small number if it is 1.

I have also tried to check the distribution against the explanations, with the attached scripts, n=100, theta=1.000001/1.5/3.0: it does not seem to work; there is a repeatable +15% bias at i=3 and a repeatable -3% to -30% bias for values in i=10-100, for different values of theta (1.000001, 1.5, 3.0).

If you try the script, beware to set parameters (theta, n) consistently.

I've executed the scripts that you attached with different theta and numbers of outcomes (not n, n remains the same = 100), and I found out that for theta = 0.1 and a big number of outcomes it gives a distribution very similar to zipfian (for number of outcomes = 100,000 the bias is -6% to 8% over the whole range, and for NOO = 1,000,000 the bias is -2% to 2%).

By number of outcomes (NOO) I mean how many times random_zipfian was called. For example:
pgbench -f compte_bench.sql -t 100000

So, I think it works, but works worse for a small number of outcomes. We also need to find an optimal theta for better results.

— 
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
> On 13 Jul 2017, at 23:13, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I just figured out that "result.txt" is only a 60 second pgbench run.
> Is the same true for other tests?

Yes, the other tests ran for 60 seconds too.

>
> It would be much more interesting to see runs that lasted 10 minutes
> or more.


I am attaching results of tests for 32 and 128 clients that were running for 10 minutes, and TPS remains 305 and 115 ktps respectively.

Tests were executed with the configuration set for YCSB. Autovacuum is configured very aggressively, which can be the reason why no decline appears with increasing run time.
Autovacuum config:

    autovacuum_max_workers = 8
    autovacuum_naptime = 10s
    autovacuum_vacuum_scale_factor = 0.1
    autovacuum_vacuum_cost_delay = 0ms
    autovacuum_vacuum_cost_limit = 10000


—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
> Algorithm works with theta less than 1. The only problem here is that 
> theta can not be 1, because of next line of code
>
> cell->alpha = 1. / (1 - theta);
>
> That’s why I put such restriction. Now I see 2 possible solutions for that:
> 1) Exclude 1, and allow everything in range (0;+∞).

Yep.

> 2) Or just increase/decrease theta by very small number if it is 1.

Nope, this seems quite arbitrary.

> I've executed scripts that you attached with different theta and number 
> of outcomes(not n, n remains the same = 100) and I found out that for 
> theta = 0.1 and big number of outcomes it gives distribution very 
> similar to zipfian(for number of outcomes = 100 000, bias -6% to 8% in 
> whole range and for NOO = 1000 000, bias is -2% to 2%).

Ok, so you did not get the large bias for i=3. Strange.

> By, number of outcomes(NOO) I mean how many times random_zipfian was 
> called. For example: pgbench -f compte_bench.sql -t 100000 So, I think 
> it works but works worse for small number of outcomes. And also we need 
> to find optimal theta for better results.

Hmmm. I've run one million outcomes each time.

I'll check on the next version.

-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Peter Geoghegan
Date:
On Fri, Jul 14, 2017 at 6:37 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> I am attaching results of tests for 32 and 128 clients that were running for 10 minutes, and TPS remains 305 and 115
> ktps respectively.
>
> Tests was executed with configuration set for YCSB. And there is very aggressively autovacuum, this can be reason why
> there is no decline appears with increasing working time.
> Autovacuum config:
>
>         autovacuum_max_workers = 8
>         autovacuum_naptime = 10s
>         autovacuum_vacuum_scale_factor = 0.1
>         autovacuum_vacuum_cost_delay = 0ms
>         autovacuum_vacuum_cost_limit = 10000

I think that what this probably comes down to, more than anything
else, is that you have leftmost hot/bloated leaf pages like this:

          idx          | level | l_item | blkno | btpo_prev | btpo_next | btpo_flags | type | live_items | dead_items | avg_item_size | page_size | free_size |         highkey
 -----------------------+-------+--------+-------+-----------+-----------+------------+------+------------+------------+---------------+-----------+-----------+-------------------------
  pgbench_accounts_pkey |     0 |      1 |     1 |         0 |      2751 |         65 | l    |        100 |         41 |            16 |      8192 |      5328 | 11 00 00 00 00 00 00 00
  pgbench_accounts_pkey |     0 |      2 |  2751 |         1 |      2746 |         65 | l    |         48 |         90 |            16 |      8192 |      5388 | 32 00 00 00 00 00 00 00
 ...

The high key for the leftmost shows that only values below 0x11 belong
on the first page. This is about 16 or 17 possible distinct values,
and yet the page has 100 live items, and 41 dead items; in total,
there is room for 367 items. It's only slightly better with other
nearby pages. This is especially bad because once the keyspace gets
split up this finely, it's *impossible* to reverse it -- it's more or
less a permanent problem, at least until a REINDEX. You cannot merge
pages back together until one is completely empty, which in this case
and many cases will in fact never happen. Aggressive vacuuming is
probably helpful in part because it prevents the problem from ever
getting completely out of hand. That doesn't seem like a very
practical solution, though.

We should probably be treating unique indexes in a special way, since
inserting a new conflicting tuple necessarily supersedes whatever it
conflicted with. Postgres holds on to the old tuple today, but only
because the transaction might abort, or because an old snapshot might
be interested in old tuple versions. However, the general pattern with
unique indexes is that there must be one tuple visible to new
snapshots, and old versions are almost certainly going to become
garbage very quickly. Unique indexes really are quite special, which
nbtree doesn't take advantage of at all. If you read the coverage of
B-Trees within "Transaction Processing: Concepts and Techniques", and
many other publications, the general trend seems to be that unique
indexes have truly unique keys, based only on the user-visible key
values. They make a sharp distinction between primary and secondary
indexes, which doesn't really exist in Postgres (at least, not at the
access method level).

I suspect that the best long term solution is to add GIN-style
duplicate handling within nbtree for unique indexes, with special
pruning style optimizations to the posting list structure. This would
probably only happen with unique indexes. The useful thing about this
approach is it separates these two problems:

1. Representing what values are in the index for lookup, and their
latest row version.

2. Finding old row versions, in the less common case where you have an
old snapshot.

With a design like that, nbtree would never "cut up the keyspace too
finely" because of a temporary surge of UPDATE insertions. You still
get bloat, but you add it to a place where garbage collection can be
much better targeted. Under this scheme, it doesn't really matter so
much if garbage collection doesn't happen very frequently, because
there could be LP_DEAD style hints for the auxiliary posting list
structure. That could probably work based on atomic ops, and would
greatly reduce the number of pages that UPDATE workloads like this
dirtied.

It probably wouldn't make sense to add things like GIN's compression
of item pointers, since most data within the auxiliary posting list
structure is removed very quickly. I wouldn't expect btree_gin to be
faster for this workload today, because it doesn't have
kill_prior_tuple/LP_DEAD support, and because it doesn't support
unique indexes, and so cannot exploit the special situation that
exists with unique indexes.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,

On 14 Jul 2017, at 17:51, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Ok, so you did not get the large bias for i=3. Strange.

I got a large bias for i=3 and theta > 1 even with a million outcomes, but for theta < 1 (I have tested theta = 0.1 and 0.3) it showed quite good results.


I am attaching patch v3. Among other things, I fixed a small typo in the description of the random_exponential function in the pgbench.sgml file.


Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
>> Ok, so you did not get the large bias for i=3. Strange.
>
> I got large bias for i=3 and theta > 1 even with a million outcomes,

Ok. So this is similar to what I got.

Is this bias expected from the drawing method, say because it is 
approximated and the approximation is weak at some points, or is there an 
issue with its implementation, say some shift which gets smoothed down 
for higher indexes?

> I am attaching patch v3. Among other things I fixed small typo in 
> description of random_exponential function in pgbench.sgml file.

I'll look into that.

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
> On 17 Jul 2017, at 13:51, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Is this bias expected from the drawing method, say because it is approximated and the approximation is weak at some
> points, or is there an issue with its implementation, says some shift which gets smoothed down for higher indexes?
>

I have checked the paper where this implementation was proposed, and there theta is allowed only in the range between 0 and 1. It seems it is not guaranteed to work well when theta is more than 1.

I am attaching paper, see page 23.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello,

>> Is this bias expected from the drawing method, say because it is 
>> approximated and the approximation is weak at some points, or is there 
>> an issue with its implementation, says some shift which gets smoothed 
>> down for higher indexes?
>
> I have checked paper where such implementation was proposed and there 
> theta allowed only on range between 0 and 1. It seems like it is not 
> guaranteed that it should work well when theta is more than 1.

Ok.

I see a significant issue with having a random_zipfian function which does 
not really return a zipfian distribution under some parameter values. If 
there is no better alternative, I would suggest restricting the parameter 
to values between 0 and 1, or finding a better approximation for theta >= 0.

> I am attaching paper, see page 23.

Thanks for the paper. It reminds me that I intended to propose a 
parametric pseudo-random permutation for pgbench, some day.

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

> I am attaching patch v3.

> Among other things I fixed small typo in description of 
> random_exponential function in pgbench.sgml file.

Ok. Probably this typo should be committed separately and independently.

A few comments about v3:

Patch applies cleanly, compiles, works.

About the maths: As already said, I'm not at ease with a random_zipfian 
function which does not display a (good) zipfian distribution. At the 
minimum the documentation should be clear about the approximations implied 
depending on the parameter value.

In the literature the theta parameter seems to be often called alpha
or s (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest
sticking to "s" instead of "theta"?

About the code: looks simpler than the previous version, which is good.

Double constants should be used when the underlying type is a double,
instead of relying on implicit int to double promotion (0 -> 0.0, 1 -> 1.0).

The function zipfZeta(n, theta) does not really compute the zeta(n) function,
so I think a better name should be chosen. It seems to compute
H_{n,theta}, the generalized harmonic number. Idem the "thetan" field in the struct.

The handling of cache overflow by randomly removing one entry looks like
a strange idea. Rather remove the oldest entry?

ISTM that it should print a warning once if the cache array overflows as 
performance would drop heavily.

If the zipf cache is constant size, there is no point in using dynamic 
allocation, just declare an array...

Comments need updating: eg "theta parameter of previous execution", which
dates back to when there was only one value.

There should be a comment to explain that the structure is in the thread
for thread safety.

There should be non-regression tests somehow. If the "improve pgbench
tap test infrastructure" patch gets through, things should be added there.

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,

I am attaching patch v4. 

On 19 Jul 2017, at 17:21, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

About the maths: As already said, I'm not at ease with a random_zipfian function which does not display a (good) zipfian distribution. At the minimum the documentation should be clear about the approximations implied depending on the parameter value.

I added one more sentence to the documentation to emphasize that the degree of proximity depends on the parameter. I also made a restriction on the parameter; now it can only be in the range (0; 1).

In the litterature the theta parameter seems to be often called alpha
or s (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would suggest to
stick to "s" instead of "theta”?

I have renamed it to “s”.

Functions zipfZeta(n, theta) does not really computes the zeta(n) function,
so I think that a better name should be chosen. It seems to compute
H_{n,theta}, the generalized harmonic number. Idem "thetan" field in struct.

Renamed zipfZeta to zipfGeneralizedHarmonic, zetan to harmonicn.

The handling of cache overflow by randomly removing one entry looks like
a strange idea. Rather remove the oldest entry?

Replaced with Least Recently Used replacement algorithm.
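A minimal sketch of such an LRU scheme with a logical clock might look like this. The field and function names are assumptions for illustration, not the patch's exact code:

```c
#include <string.h>

#define ZIPF_CACHE_SIZE 15		/* cache cells number */

/* One cached computation, keyed on (n, s); harmonicn stands for the
 * expensive generalized harmonic number H_{n,s}. */
typedef struct
{
	long		n;
	double		s;
	double		harmonicn;
	long		last_used;		/* logical time of last access */
} ZipfCell;

typedef struct
{
	ZipfCell	cells[ZIPF_CACHE_SIZE];
	int			nb_cells;		/* number of initialized cells */
	long		current;		/* logical clock, bumped on each lookup */
} ZipfCache;

/* Return the cell for (n, s), evicting the least recently used cell on
 * overflow; the caller recomputes harmonicn when a cell is recycled. */
static ZipfCell *
zipf_cache_lookup(ZipfCache *cache, long n, double s)
{
	int			lru = 0;

	cache->current++;

	for (int i = 0; i < cache->nb_cells; i++)
	{
		ZipfCell   *cell = &cache->cells[i];

		if (cell->n == n && cell->s == s)
		{
			cell->last_used = cache->current;
			return cell;		/* cache hit */
		}
		if (cell->last_used < cache->cells[lru].last_used)
			lru = i;
	}

	/* cache miss: use a free cell if any, else evict the LRU cell */
	if (cache->nb_cells < ZIPF_CACHE_SIZE)
		lru = cache->nb_cells++;

	cache->cells[lru].n = n;
	cache->cells[lru].s = s;
	cache->cells[lru].harmonicn = 0.0;
	cache->cells[lru].last_used = cache->current;
	return &cache->cells[lru];
}
```

A counter-based "time" avoids any real clock calls on the hot path, which matters since this runs once per draw.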

ISTM that it should print a warning once if the cache array overflows as performance would drop heavily.

Now it prints a warning message if the array overflows. To print the message only once, it uses a global flag that is visible to all threads; theoretically, the message can still be printed more than once.
It could be solved easily using pg_atomic_test_set_flag() from src/include/port/atomics.h, but that cannot be used in pgbench because of the following lines of code there:

#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Or it can be fixed by using mutexes from pthread, but I think the code becomes less readable and more complex in that case.
So, should I spend time on solving this issue?


If the zipf cache is constant size, there is no point in using dynamic allocation, just declare an array…

Fixed. Is ZIPF_CACHE_SIZE = 15 OK?

There should be non regression tests somehow. If the "improve pgbench
tap test infrastructure" get through, things should be added there.

I will send tests later, as separate patch.


Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

>> About the maths: As already said, I'm not at ease with a random_zipfian 
>> function which does not display a (good) zipfian distribution. At the 
>> minimum the documentation should be clear about the approximations 
>> implied depending on the parameter value.
>
> I add one more sentence to documentation to emphasize that degree of 
> proximity depends on parameter .
> And also I made restriction on 
> parameter, now it can be only in range (0; 1)

Hmmm. On second thought, maybe one or the other is enough: either restrict 
the parameter to values where the approximation is good, or put out 
clear documentation about when the approximation is not very good, but it 
may still be useful even if not precise.

So I would be in favor of expanding the documentation but not restricting 
the parameter beyond avoiding value 1.0.

>> [...]
> Now it prints warning message if array overflowed. To print message only 
> one time, it uses global flag, which is available for all threads. And 
> theoretically message can be printed more than one time. [...]
> So, should I spend time on solving this issue?

No. Just write a comment.

>> If the zipf cache is constant size, there is no point in using dynamic 
>> allocation, just declare an array…
>
> Fixed. Does ZIPF_CACHE_SIZE = 15 is ok?

My guess is yes, till someone complains that it is not the case;-)

>> There should be non regression tests somehow. If the "improve pgbench
>> tap test infrastructure" get through, things should be added there.
>
> I will send tests later, as separate patch.

I think that developing a test would be much simpler with the improved 
tap test infrastructure, so I would suggest waiting to know the result of 
the corresponding patch.

Also, could you record the patch in CF 2017-09?
https://commitfest.postgresql.org/14/

-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:



I think that developing a test would be much simpler with the improved tap test infrastructure, so I would suggest to wait to know the result of the corresponding patch.

Ok, I will wait then.

Also, could you record the patch in CF 2017-09?
https://commitfest.postgresql.org/14/

Yea, I will send it.


Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hmmm. On second thought, maybe one or the other is enough, either restrict the parameter to values where the approximation is good, or put out a clear documentation about when the approximation is not very good, but it may be still useful even if not precise.

So I would be in favor of expanding the documentation but not restricting the parameter beyond avoiding value 1.0.

I have removed the restriction and expanded the documentation in the attached patch v5.

I have also recorded the patch in CF 2017-09 — https://commitfest.postgresql.org/14/1206/


Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello!

I realized that I was sending emails as HTML, and the latest patch is not visible in the archive now.
That’s why I am attaching it again.

I am sorry for that.


—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Peter Geoghegan
Date:
On Fri, Jul 21, 2017 at 4:51 AM, Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> (Latest version of pgbench Zipfian patch)

While I'm +1 on this idea, I think that it would also be nice if there
was an option to make functions like random_zipfian() actually return
a value that has undergone perfect hashing. When this option is used,
any given value that the function returns would actually be taken from
a random mapping to some other value in the same range. So, you can
potentially get a Zipfian distribution without the locality.

While I think that Zipfian distributions are common, it's probably
less common for the popular items to be clustered together within the
primary key, unless the PK has multiple attributes, and the first is
the "popular attribute". For example, there would definitely be a lot
of interest among contiguous items within an index on "(artist,
album)" where "artist" is a popular artist, which is certainly quite
possible. But, if "album" is the primary key, and it's a SERIAL PK,
then there will only be weak temporal locality within the PK, and only
because today's fads are more popular than the fads from previous
years.

Just another idea, that doesn't have to hold this patch up.

-- 
Peter Geoghegan



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Peter,

> I think that it would also be nice if there was an option to make 
> functions like random_zipfian() actually return a value that has 
> undergone perfect hashing. When this option is used, any given value 
> that the function returns would actually be taken from a random mapping 
> to some other value in the same range. So, you can potentially get a 
> Zipfian distribution without the locality.

I definitely agree. This is a standard problem with all non-uniform random 
generators in pgbench, namely random_{gaussian,exponential}.

However hashing is not a good solution on a finite domain because of the 
significant collision rate, so that typically 1/3 of values are empty and 
collisions cause spikes. Also, collisions would break PKs.

The solution is to provide a (good) pseudo-random parametric permutation, 
which is non trivial especially for non powers of two, so ISTM that it 
should be a patch on its own.
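To illustrate the idea only (this is not the permutation design Fabien has in mind, and it is deliberately simplistic): even an affine map x -> (mult*x + add) mod n is a collision-free permutation of [0, n) whenever gcd(mult, n) = 1, so popular zipfian values get scattered without breaking PKs. A real pseudo-random permutation would mix the bits far more thoroughly:

```c
/* gcd via Euclid's algorithm */
static long
gcd_long(long a, long b)
{
	while (b != 0)
	{
		long		t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/* Affine permutation of [0, n): a bijection iff gcd(mult, n) == 1.
 * Weakly scrambling, shown only to illustrate a collision-free
 * parametric mapping; not a proposed implementation. */
static long
affine_permute(long x, long n, long mult, long add)
{
	return (mult * (x % n) + add) % n;
}
```

Because the map is a bijection, a zipfian draw fed through it keeps its frequency profile while losing its locality in the key space.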

The good news is that it is on my todo list and I have some ideas on how 
to do it.

The bad news is that given the rate at which I succeed in getting things 
committed in pgbench, it might take some years:-( For instance, simplistic 
functions and operators to extend the current set have been in the pipe 
for 15 months and missed pg10.

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

>> So I would be in favor of expanding the documentation but not 
>> restricting the parameter beyond avoiding value 1.0.
>
> I have removed restriction and expanded documentation in attaching patch v5.

I've done some math investigations, which consisted in spending one hour 
with Christian, a statistician colleague of mine. He took an old book out 
of a shelf, opened it to page 550 (roughly in the middle), and explained 
to me how to build a real zipfian distribution random generator.

The iterative method is for parameter a>1 and works for unbounded values. 
It is simple to add a bound. In practice the iterative method is quite 
effective, i.e. number of iterations is typically small, at least if the 
bound is large and if parameter a is not too close to 1.

I've attached a python3 script which implements the algorithm. It looks 
like magic. Beware that a C implementation should take care of float and 
int overflows.

   # usage: a, #values, #tests

   sh> zipf.py 1.5 1000 1000000
   # after 1.7 seconds
   c = [391586, 138668, 75525, 49339, 35222, 26621, ...
        ... 11, 13, 12, 11, 16] (1.338591 iterations per draw)

   sh> zipf.py 1.1 1000 1000000
   # after 3.1 seconds
   c = [179302, 83927, 53104, 39015, 30557, 25164, ...
        ... 82, 95, 93, 81, 80] (2.681451 iterations per draw)

I think that this method should be used for a>1, and the other very rough 
one can be kept for parameter a in [0, 1), a case which does not make much 
sense to a mathematician as it diverges if unbounded.

-- 
Fabien.

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,


> On 5 Aug 2017, at 12:15, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello Alik,
>
> I've done some math investigations, which consisted in spending one hour with Christian, a statistician colleague of
> mine. He took an old book out of a shelf, opened it to page 550 (roughly in the middle), and explained to me how to
> build a real zipfian distribution random generator.
>
> The iterative method is for parameter a>1 and works for unbounded values. It is simple to add a bound. In practice
> the iterative method is quite effective, i.e. number of iterations is typically small, at least if the bound is large
> and if parameter a is not too close to 1.

> I've attached a python3 script which implements the algorithm. It looks like magic. Beware that a C implementation
> should take care of float and int overflows.
>
>
Thank you for the script. I will rewrite it to C and add to the patch soon.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,

>
> I think that this method should be used for a>1, and the other very rough one can be kept for parameter a in [0, 1),
> a case which does not make much sense to a mathematician as it diverges if unbounded.

Now “a” does not have an upper bound; that’s why, when using the iterative algorithm with a >= 10000, the program gets stuck in an infinite loop because of the following line of code:
double b = pow(2.0, s - 1.0);
After the overflow, “b” becomes “+Inf”.

So should an upper bound for “a” be set?

Should I mention in the docs that two algorithms are used depending on the value of a (s/theta)?

In the attached patch, I have added a computeIterativeZipfian method and its usage in getZipfianRand.
Is it better to move the cache-based computation into a new method, so that getZipfianRand contains only the two computeXXXZipfian method calls?
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

> Now “a” does not have upper bound, that’s why on using iterative algorithm with a >= 10000 program will stuck on
> infinite loop because of following line of code:
> double b = pow(2.0, s - 1.0);
> Because after overflow “b” becomes “+Inf”.

Yep, overflow can happen.

> So should upper bound for “a" be set?

Yes, I agree. a >= 10000 does not make much sense... If you want uniform 
you should use random(), not call random_zipfian with a = 10000. Basically 
it suggests that too large values of "a" should be rejected. Not sure 
where to put the limit, though.

> Should I mention in docs that there are two algorithms are used 
> depending on values of a(s/theta)?

Yes, as a general principle I think that the documentation should reflect 
the implementation.

> In attaching patch, I have added computeIterativeZipfian method and it’s 
> usage in getZipfianRand. Is it better to move code of computing via 
> cache to new method, so that getZipfianRand will contain only 2 
> computeXXXZipfian method calls?

I have not looked in detail, but from what you say I would agree that the 
implementation should be symmetric, so having one function calling one 
method or the other sounds good.

-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello, Fabien

I am attaching patch v7.

> Yes, I agree. a >= 10000 does not make much sense... If you want uniform you should use random(), not call
> random_zipfian with a = 10000. Basically it suggests that too large values of "a" should be rejected. Not sure where
> to put the limit, though.

I set the upper bound for the parameter to 1000.
>
> Yes, as a general principle I think that the documentation should reflect the implementation.

The documentation has been updated: I removed the algorithm descriptions and put references to them there.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

> I am attaching patch v7.

Patch generates multiple warnings with "git apply", apparently because of 
end-of-line spaces, and fails:
  pgbench-zipf-07v.patch:52: trailing whitespace.
          {
  pgbench-zipf-07v.patch:53: trailing whitespace.
          "random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
  pgbench-zipf-07v.patch:54: trailing whitespace.
          },
  pgbench-zipf-07v.patch:66: trailing whitespace.
  #define ZIPF_CACHE_SIZE 15              /* cache cells number */
  pgbench-zipf-07v.patch:67: trailing whitespace.

  error: patch failed: doc/src/sgml/ref/pgbench.sgml:1013
  error: doc/src/sgml/ref/pgbench.sgml: patch does not apply
  error: patch failed: src/bin/pgbench/exprparse.y:191
  error: src/bin/pgbench/exprparse.y: patch does not apply
  error: patch failed: src/bin/pgbench/pgbench.c:93
  error: src/bin/pgbench/pgbench.c: patch does not apply
  error: patch failed: src/bin/pgbench/pgbench.h:75
  error: src/bin/pgbench/pgbench.h: patch does not apply

It also complains with "patch", although it seems to work:
  (Stripping trailing CRs from patch; use --binary to disable.)
  patching file doc/src/sgml/ref/pgbench.sgml
  (Stripping trailing CRs from patch; use --binary to disable.)
  patching file src/bin/pgbench/exprparse.y
  (Stripping trailing CRs from patch; use --binary to disable.)
  patching file src/bin/pgbench/pgbench.c
  (Stripping trailing CRs from patch; use --binary to disable.)
  patching file src/bin/pgbench/pgbench.h
  patch unexpectedly ends in middle of line
  patch unexpectedly ends in middle of line
 

Please do not put spaces at the end of lines, eg there is a lonely tab on 
the second line of "computeHarmonicZipfian".

It seems that "CR" characters are used:
  sh> sha1sum ~/pgbench-zipf-07v.patch
  439ad1de0a960b3b415ff6de9412b3cc4d6922e7
  sh> file ~/pgbench-zipf-07v.patch
  pgbench-zipf-07v.patch: unified diff output, ASCII text, with CRLF line terminators

Compilation issues one warning:
  pgbench.c: In function ‘computeHarmonicZipfian’:
  pgbench.c:894:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
    double  uniform = pg_erand48(thread->random_state);

Please conform to pg standard for portability, even if it is a very old one:-)


About the documentation:

I would suggest to replace 0.5 in the first example by 1.5, as a zipfian 
distribution with a lower-than-one parameter is not that well defined.

I'm fine with using references to papers or books for the algorithm.

The documentation lines in the sgml file are too long. At least
limit text paragraphs to maybe 80 columns as done elsewhere.

typo: "<entry> Zipfian" (remove space)

typo: "used(described" (missing space)

typo: "numbers(algorithm" (again)

The sentence "It's recommended to pick <replaceable>parameter</> in range 
(0, 1) for better accuracy." does not make sense with the updated 
generation algorithm.

The text would benefit from a review by a native English speaker, which I 
am not. Anyway here is an attempt at improving/simplifying the text (I have 
removed sgml formatting). If some native English speaker could review it, 
that would be great.

"""  "random_zipfian" generates an approximated bounded zipfian distribution.  For parameter in (0,1), an approximated
algorithmis taken from  "Quickly Generating Billion-Record Synthetic Databases", Jim Gray et al, SIGMOD 1994.  For
parameterin (1, 1000), a rejection method is used, based on  "Non-Uniform Random Variate Generation", Luc Devroye, p.
550-551,Springer 1986.  The distribution is not defined when the parameter's value is 1.0.  The drawing performance is
poorfor parameter values close and above 1.0  and on a small range.
 
  "parameter" defines how skewed the distribution is.  The larger the "parameter", the more frequently values close to
thebeginning  of the interval are drawn.  The closer to 0 "parameter" is, the flatter (more uniform) the access
distribution.
"""


English in comments and code:

"inited" is not English, I think you want "initialized". Maybe "nb_cells" 
would do.

"it is not exist" (multiple instances) -> "it does not exist".

I think that the references to the paper/book should appear as comments in 
the code, for instance in front of the functions which implement the 
algorithm.

"current_time", "last_touch_time" are counter based, they are not "time". 
I suggest to update the name to "current" and "last_touch" or "last_used".

"time when cell was used last time" -> "last used logical time"


About the code:

I would prefer that you use "1.0" instead of "1." for double constants.

I think that getZipfianRand should be symmetric. Maybe a direct formula 
could do:

  return min - 1 + ((s > 1) ? xxx() : yyy());

or at least use an explicit "else" for symmetry.

Function "zipfn" asks for s and n as arguments, but ISTM that they are 
already include as fields in cell, so this seems redundant. Moreover I do 
not think that the zipfn function is that useful. I would suggest to 
inline it in its caller.

"zipfGeneralizedHarmonic" is not really related to zipf, it is just used 
there. I suggest to call it "generalizedHarmonicNumber" to reflect what it 
does.
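For reference, the quantity behind the suggested rename is the generalized harmonic number H(n, s) = 1/1^s + 1/2^s + ... + 1/n^s, which the zipfian generator uses as its normalization constant. A one-line Python equivalent (the name follows the suggestion above; the C version in the patch remains authoritative):

```python
def generalizedHarmonicNumber(n: int, s: float) -> float:
    """H(n, s) = sum of 1/k**s for k = 1..n; for s == 1 this reduces
    to the ordinary harmonic number H(n)."""
    return sum(1.0 / k ** s for k in range(1, n + 1))
```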

I suggest to put all LRU management in the getCell* function, that is to 
move the last_use update from computeHarmonicZipfian to getCell.
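The suggested structure can be sketched as follows: a Python illustration of keeping every last-used update inside the cell-lookup function, driven by a logical clock rather than wall time (all names and the dict-based layout here are illustrative, not the patch's actual C identifiers):

```python
class ZipfCache:
    """Fixed-size cache of zeta values keyed by (n, s), with
    least-recently-used eviction driven by a logical clock."""

    def __init__(self, size=8):
        self.size = size
        self.current = 0          # logical clock, bumped on each access
        self.cells = {}           # (n, s) -> {"zetan": ..., "last_used": ...}
        self.overflow_count = 0   # evictions, for a possible final warning

    def get_cell(self, n, s, compute_zetan):
        self.current += 1
        key = (n, s)
        cell = self.cells.get(key)
        if cell is None:
            if len(self.cells) >= self.size:
                # Evict the least recently used cell and count the overflow.
                lru = min(self.cells, key=lambda k: self.cells[k]["last_used"])
                del self.cells[lru]
                self.overflow_count += 1
            cell = {"zetan": compute_zetan(n, s), "last_used": 0}
            self.cells[key] = cell
        cell["last_used"] = self.current   # the only place LRU state changes
        return cell
```

With this shape, callers such as the harmonic drawing function never touch the LRU fields directly.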

computeHarmonicZipfian and computeIterativeZipfian should have the same 
signature, not one with a random state and the other with the thread. I 
suggest to do the same as other random functions, whatever it is.


Otherwise, no action required:

The patch probably conflicts with other patches in the CF, which means 
that a rebase will be needed. That is life. The queue is long and 
slow.

Also, this function deserves some TAP tests, which should be added if the 
"pgbench tap test" patch gets through. Otherwise it is as well tested as 
everything else around, which is basically not tested at all.

-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,

Thank you for the detailed review. I hope I have fixed all the issues you mentioned in your letter.


—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

Applies, compiles, works for me.

Some minor comments and suggestions.

Two typos:
- "usinng" -> "using"
- "a rejection method used" -> "a rejection method is used"

I'm not sure of "least_recently_used_i", this naming style is not used in 
pgbench. "least_recently_used" would be ok.

"..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
even if the result is the same?

I would put the parameter value check in getZipfianRand, so that if 
someone reuses the function elsewhere the check is also performed. That 
would also simplify a bit the already very large expression evaluation 
function.

When/if the pgbench tap test patch gets through, coverage tests should
be added.

Maybe the cache overflow could be counted, to allow for a possible warning 
message in the final report?

-- 
Fabien.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Daniel Gustafsson
Date:
> On 06 Sep 2017, at 08:42, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
> Hello Alik,
>
> Applies, compiles, works for me.
>
> Some minor comments and suggestions.
>
> Two typos:
> - "usinng" -> "using"
> - "a rejection method used" -> "a rejection method is used"
>
> I'm not sure of "least_recently_used_i", this naming style is not used in pgbench. "least_recently_used" would be ok.
>
> "..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
> even if the result is the same?
>
> I would put the parameter value check in getZipfianRand, so that if someone reuses the function elsewhere the check is
> also performed. That would also simplify a bit the already very large expression evaluation function.
>
> When/if the pgbench tap test patch get through, coverage tests should
> be added.
>
> Maybe the cache overflow could be counted, to allow for a possible warning message in the final report?

Since this patch has been Waiting for author and no update on this patch has
been posted during the commitfest, it is Returned with feedback.  When you have
a new version of the patch, please re-submit to a future commitfest.

cheers ./daniel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
Hello Fabien,

Sorry for not responding for a long time.

> Two typos:
> - "usinng" -> "using"
> - "a rejection method used" -> "a rejection method is used"
>
> I'm not sure of "least_recently_used_i", this naming style is not used in pgbench. "least_recently_used" would be ok.
>
> "..nb_cells.. != ZIPF_CACHE_SIZE", ISTM that "<" is more logical,
> even if the result is the same?

Fixed.

> Maybe the cache overflow could be counted, to allow for a possible warning message in the final report?
Also done. But one question: should it be done by printing to stdout or stderr?

> When/if the pgbench tap test patch get through, coverage tests should
> be added.

I have added a few tests and I am not sure whether they are OK or not. It was my first experience with Perl and tests in
PostgreSQL.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

Patch applies with "patch", but not with "git apply", probably because it 
is in CR-NL eol format. No big deal.

Documentation has been switched from SGML to XML, so now tags must be 
explicitly closed, i.e. use <foo>x</foo> instead of <foo>x</>.

Patch compiles with a warning:

  pgbench.c: In function ‘getZipfianRand’:
  pgbench.c:905:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]

Declaration must move before assertion.

The random_zipfian(1,100,2) call result is not tested.

"1 times" -> "once".

"array overflow on \random_zipfian" remove spurious "\".



-- 
Fabien.

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
> Patch applies with "patch", but not with "git apply", probably because it is in CR-NL eol format. No big deal.
>
> Documentation has been switched from SGML to XML, so now tags must be explicitly closed, i.e. use <foo>x</foo>
> instead of <foo>x</>.
>
> Patch compiles with a warning:
>
> Declaration must move before assertion.
>
> "array overflow on \random_zipfian" remove spurious "\".

Fixed.

> The random_zipfian(1,100,2) call result is not tested.

I have reduced range to [1, 10] and updated the test.

> "1 times" -> "once”.

I am not sure that it needs to be fixed in this way, because there can be another number instead of 1.
So, I updated it to "%d time(s)".


—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
Hello Alik,

Patch applies, compiles, works (I checked the distribution). Doc gen ok. 
make check ok.

> I have reduced range to [1, 10] and updated the test.

I would suggest to use [1, 9] and the simpler regex [1-9] instead of the 
full list.

Also, there should be an empty line between functions. I just noticed that 
one is missing after zipfFindOrCreateCacheCell().

-- 
Fabien.


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
> I would suggest to use [1, 9] and the simpler regex [1-9] instead of the full list.
>
> Also, there should be an empty line between functions. I just noticed that one is missing after
zipfFindOrCreateCacheCell().

Fixed.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company


Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
> Fixed.

Applies, compiles, make check ok, doc ok.

Note that the patch may interact with other patches which add functions to 
pgbench, so it might need a rebase depending on the order in which the 
patches are applied. We'll see.

I have added the thread to the next CF (January 2018), and marked it as 
"ready for committer".

-- 
Fabien.


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Alik Khilazhev
Date:
> I have added the thread to the next CF (January 2018), and marked it as "ready for committer".

That’s cool, thank you for review!
—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Fabien COELHO
Date:
> Note that the patch may interact with other patches which add functions to 
> pgbench, so might need a rebase depending on the order in which the patch are 
> applied.

Attached a minor rebase after 16827d4424.

-- 
Fabien.
Attachments

Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Teodor Sigaev
Date:
Thank you for all, pushed

Fabien COELHO wrote:
> 
>> Note that the patch may interact with other patches which add functions to 
>> pgbench, so might need a rebase depending on the order in which the patch are 
>> applied.
> 
> Attached a minor rebase after 16827d4424.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

From
Peter Geoghegan
Date:
On Fri, Jul 7, 2017 at 12:45 AM Alik Khilazhev
<a.khilazhev@postgrespro.ru> wrote:
> PostgreSQL shows very bad results in YCSB Workload A (50% SELECT and 50% UPDATE of random row by PK) on benchmarking
> with big number of clients using Zipfian distribution. MySQL also has decline but it is not significant as it is in
> PostgreSQL. MongoDB does not have decline at all. And if pgbench would have Zipfian distribution random number
> generator, everyone will be able to make research on this topic without using YCSB.

I've been thinking about this again, in light of my recent work to
improve the B-Tree code by making all entries have fully unique keys,
adding suffix truncation, and making better choices about split points
[1].

Somebody posted a patch that fixed the issue by making the heavyweight
lock manager lock a value rather than a heap TID within the last year.
I can't find that thread right now -- perhaps someone can point it out
now. I suspect that my patch will have a similar effect to this other
patch I'm thinking of with the Zipfian workload, though it will do so
without touching anything outside of the B-Tree code, and without
changing the user-visible semantics of locking.

Can we possibly find a way to rerun this Zipfian experiment/test case?
I bet Alexander's recent B-Tree patch will also help.

Here is why I suspect my patch will help with the Zipfian workload,
particularly on a multi-socket machine:

The logic for following downlinks in Postgres is that downlinks are a
lower bound, but we still cannot follow them when the key is equal to
our insertion scankey -- we must go left, even when we have all keys
filled out. This means that we sometimes end up holding an exclusive
buffer lock on "the first leaf page the value could be on", even
though that page doesn't have any such values (they're all in later
pages, to the left). So we accidentally lock out insertions of the
value 1 when we insert the value 2, and of the value 2 when we insert
the value 3, and of the value 3 when we insert the value 4, and so on
down the line. This is the worst possible thing for the Zipfian test
case, I think, since most of the contention is artificial.

I think that my patch may ameliorate the problem, because:

* We make the tie-breaker heap TID attribute have DESC sort order, so
generally speaking contention starts on the leftmost page among leaf
pages full of duplicates -- for reads, and for writes.

* The patch works very hard to make sure that large groups of
duplicates are not spread across pages. There would be no mixing of
leaf pages containing the value 1 and the value 2 with the Zipfian
test case, for example. Old duplicates would be packed into full
pages.

* We can invent a fake lower-than-any-real-value heap TID in the
insertion scankey in certain cases that don't have one. This way, the
scan key as a whole is *not* equal to pivot tuples that are at least
equal on non-heap-TID attributes, so we *can* follow a downlink that
is "equal" to our scankey, provided it has a truncated-away heap TID
attribute value.

In short, the artificial "false sharing" that we see in the B-Tree for
the Zipfian test case may go away. There will be *no* contention
between an inserter (or really UPDATE-er) that inserts the value 2
with concurrent inserters of the value 1. All of the ways in which
that was likely to happen are each fixed by the patch. There is still
buffer lock contention on heap pages, and the contention of waiting
for a row lock. Maybe the busy waiting will automatically be fixed,
though, since you have one point of contention for each of the really
hot values at the far left of the leaf level.

[1] https://commitfest.postgresql.org/20/1787/
--
Peter Geoghegan


