Re: logical decoding and replication of sequences, take 2

From: Tomas Vondra
Subject: Re: logical decoding and replication of sequences, take 2
Date:
Msg-id: 8a2f25d8-23d4-4ab8-2a23-7c8d2d976209@enterprisedb.com
In reply to: RE: logical decoding and replication of sequences, take 2  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses: Re: logical decoding and replication of sequences, take 2  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: logical decoding and replication of sequences, take 2  (Amit Kapila <amit.kapila16@gmail.com>)
Re: logical decoding and replication of sequences, take 2  (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On 12/3/23 13:55, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
> 
>>> I did also performance tests (especially case 3). First of all, there are some
>>> variants from yours.
>>>
>>> 1. patch 0002 was reverted because it has an issue. So this test checks whether
>>>    refactoring around ReorderBufferSequenceIsTransactional seems really
>> needed.
>>
>> FWIW I also did the benchmarks without the 0002 patch, for the same
>> reason. I forgot to mention that.
> 
> Oh, good news. So your benchmark results are quite meaningful.
> 
>>
>> Interesting, so what exactly does the transaction do?
> 
> It is quite simple - please see the attached script file. It was executed with 64 concurrent clients.
> The definition of alter_sequence() is the same as you described.
> (I used a plain bash script to run them, but your approach may be smarter.)
> 
>> Anyway, I don't think this is very surprising - I believe it behaves like
>> this because of having to search in many hash tables (one in each
>> top-level xact). And I think the solution I explained before (maintaining
>> a single top-level hash, instead of many per-top-level hashes) would
>> address that.
> 
> Agreed. I can benchmark again with the new patches once we decide on the
> new approach.
> 

Thanks for the script. Are you also measuring the time it takes to
decode this using test_decoding?
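
(To be clear, I mean timing something like this - a rough sketch with an
arbitrary slot name, not the exact commands from the attached scripts:)

  -- create a logical slot using the test_decoding output plugin
  SELECT pg_create_logical_replication_slot('seq_slot', 'test_decoding');

  -- ... run the workload ...

  -- time how long it takes to consume all the decoded changes
  \timing on
  SELECT count(*) FROM pg_logical_slot_get_changes('seq_slot', NULL, NULL);
  \timing off

  SELECT pg_drop_replication_slot('seq_slot');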

FWIW I did a more comprehensive suite of tests over the weekend, with a
couple more variations. I'm attaching the updated scripts; running them
should be as simple as

  ./run.sh BRANCH TRANSACTIONS RUNS

so perhaps

  ./run.sh master 1000 3

to do 3 runs with 1000 transactions per client. It'll run a bunch of
combinations hard-coded in the script and write the timings into a CSV
file (with "master" in each row).

I ran this on two machines (an i5 with 4 cores, a xeon with 16/32 cores),
with current master, the basic patch (without the 0002 part), and with the
optimized approach (single global hash table, see the 0004 part). That's
what master / patched / optimized in the results means.

Interestingly enough, the i5 handled this much faster; it seems to be
better at single-core tasks. The xeon is still running, so the results
for "optimized" only have one run (out of 3), but that shouldn't change much.

Also attached is a table summarizing this and visualizing the timing
change (vs. master) in the last couple of columns. Green means "faster"
than master (which we don't really expect), and red means slower than
master (the more red, the slower).

The results are grouped by script (see the attached .tgz), with either
32 or 96 clients (which does affect the timing, but not the difference
between master and the patches). Some executions have no pg_sleep() calls,
some have a 0.001s wait (but that doesn't seem to make much difference).

Overall, I'd group the results into about three groups:

1) good cases [nextval, nextval-40, nextval-abort]

These cases slow down a bit, but the slowdown is mostly within reasonable
bounds (we're making the decoding do more work, so it'd be a bit silly to
require that extra work to have no impact). And I do think this is
reasonable, because this is pretty much extreme / worst-case behavior.
People don't really do just nextval() calls without doing anything else,
not to mention aborting 100% of transactions.

So in practice this is going to be within noise (and in some of these
cases the results even show a speedup, which seems a bit surprising).
It's somewhat dependent on the CPU too - on the xeon there's hardly any
regression.
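
For illustration, the nextval cases boil down to transactions of roughly
this shape (a simplified sketch, not the exact attached scripts; the
sequence name is made up):

  -- "nextval" does a single call per transaction, "nextval-40" does more
  -- calls per transaction, and the "-abort" variants end with ROLLBACK
  -- instead of COMMIT; some runs also add a SELECT pg_sleep(0.001)
  BEGIN;
  SELECT nextval('test_seq');
  COMMIT;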


2) nextval-40-abort

Here the slowdown is clear, but I'd argue it generally falls in the same
group as (1). Yes, I'd be happier if it didn't behave like this, but if
someone can show me a practical workload affected by this ...


3) irrelevant cases [all the alters taking insane amounts of time]

I absolutely refuse to care about these extreme cases where decoding
100k transactions takes 5-10 minutes (on the i5), or up to 30 minutes (on
the xeon). If this were a problem for some practical workload, we'd have
heard about it already, I guess. And even if there were such a workload,
it wouldn't be up to this patch to fix it. There's clearly something
misbehaving in the snapshot builder.
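
(For completeness, the alter cases look roughly like this - again a
simplified sketch, the exact alter_sequence() definition is in the
attached scripts:)

  -- each transaction restarts the sequence and then consumes a value;
  -- ALTER SEQUENCE ... RESTART assigns a new relfilenode, so the
  -- nextval() that follows is decoded as a transactional change
  BEGIN;
  ALTER SEQUENCE test_seq RESTART WITH 1;
  SELECT nextval('test_seq');
  COMMIT;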


I was hopeful the global hash table would be an improvement, but that
doesn't seem to be the case. I haven't done much profiling yet, but I'd
guess most of the overhead is due to ReorderBufferQueueSequence()
starting and aborting a transaction in the non-transactional case. That's
unfortunate, but I don't know if there's a way to optimize it.

Some time ago I floated the idea of "queuing" the sequence changes and
only replaying them at the next commit, somehow. But we ran into problems
with which snapshot to use, which I didn't know how to solve. Maybe we
should try again. The idea is we'd queue the non-transactional changes
somewhere (it can't be in the transaction, because we must keep them even
if it aborts), and then "inject" them into the next commit. That'd mean
we wouldn't do the separate start/abort for each change.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
