Thread: GUC-ify walsender MAX_SEND_SIZE constant


GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
Hello all,

Following the threads here and there, I decided to submit this patch.

Following is the description, which is also in the commit message:
The MAX_SEND_SIZE constant is used in the WAL sender to limit the maximum
size of a WAL data packet sent to a WAL receiver during replication.
Although its default value (128kB) was reasonable when it was chosen in
2010, hardware has changed a lot since then, so a PostgreSQL user might
need to customize this value.
For example, if a database's disk has both high bandwidth and high
latency, it makes more sense to read bigger chunks of data from disk in
each iteration. One example of such a disk is a remote disk (e.g. an RBD
volume).
However, this value does not need to be larger than wal_segment_size, so
its check function returns false if a larger value is set.

This is my first patch. So, I hope I haven't done something wrong. :'D

Best regards
Majid
Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Michael Paquier
Date:
On Fri, Jan 19, 2024 at 11:04:50PM +0330, Majid Garoosi wrote:
> However, this value does not need to be larger than wal_segment_size, so
> its check function returns false if a larger value is set.
>
> This is my first patch. So, I hope I haven't done something wrong. :'D

You've done nothing wrong.  Thanks for the patch!

+   if (*newval > wal_segment_size)
+       return false;
+   return true;

I was not sure at first that we need a dynamic check, but I can see why
somebody may want to make it higher than 1MB these days.

The patch is missing a couple of things:
- Documentation in doc/src/sgml/config.sgml, which has a section for
"Sending Servers".
- It is not listed in postgresql.conf.sample.  I would suggest putting
it under REPLICATION -> Sending Servers.
The default value of 128kB should be mentioned in both cases.
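As a sketch only, the postgresql.conf.sample entry being asked for might look something like this (the GUC name is taken from the patch; the comment wording and exact placement are assumptions for the author to adjust):

```
# - Sending Servers -

#wal_sender_max_send_size = 128kB       # maximum amount of WAL read and
                                        # sent in one cycle; capped at
                                        # wal_segment_size
```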

- * We don't have a good idea of what a good value would be; there's some
- * overhead per message in both walsender and walreceiver, but on the other
- * hand sending large batches makes walsender less responsive to signals
- * because signals are checked only between messages.  128kB (with
- * default 8k blocks) seems like a reasonable guess for now.
[...]
+    gettext_noop("Walsender procedure consists of a loop, reading wal_sender_max_send_size "
+         "bytes of WALs from disk and sending them to the receiver. Sending large "
+         "batches makes walsender less responsive to signals."),

This is removing some information about why too low a value may be a bad
idea (message overhead) and why too large a value may be a bad idea
(responsiveness).  I would suggest removing the second gettext_noop() in
guc_tables.c and moving all this information to config.sgml with the
description of the new GUC.  Perhaps just put it after wal_sender_timeout
in the sample file and the docs?

Three comments in walsender.c still mention MAX_SEND_SIZE.  These
should be switched to mention the GUC instead.

I am switching the patch as waiting on author for now.
--
Michael

Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
Thank you very much for your review.

I generally agree with your suggestions, so just applied them.
You can find the new patch in the attached file.

Best
Majid

On Tue, 6 Feb 2024 at 09:26, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Jan 19, 2024 at 11:04:50PM +0330, Majid Garoosi wrote:
>> However, this value does not need to be larger than wal_segment_size, so
>> its check function returns false if a larger value is set.
>>
>> This is my first patch. So, I hope I haven't done something wrong. :'D
>
> You've done nothing wrong.  Thanks for the patch!
>
> +   if (*newval > wal_segment_size)
> +       return false;
> +   return true;
>
> I was not sure at first that we need a dynamic check, but I can see why
> somebody may want to make it higher than 1MB these days.
>
> The patch is missing a couple of things:
> - Documentation in doc/src/sgml/config.sgml, which has a section for
> "Sending Servers".
> - It is not listed in postgresql.conf.sample.  I would suggest putting
> it under REPLICATION -> Sending Servers.
> The default value of 128kB should be mentioned in both cases.
>
> - * We don't have a good idea of what a good value would be; there's some
> - * overhead per message in both walsender and walreceiver, but on the other
> - * hand sending large batches makes walsender less responsive to signals
> - * because signals are checked only between messages.  128kB (with
> - * default 8k blocks) seems like a reasonable guess for now.
> [...]
> +       gettext_noop("Walsender procedure consists of a loop, reading wal_sender_max_send_size "
> +         "bytes of WALs from disk and sending them to the receiver. Sending large "
> +         "batches makes walsender less responsive to signals."),
>
> This is removing some information about why too low a value may be a bad
> idea (message overhead) and why too large a value may be a bad idea
> (responsiveness).  I would suggest removing the second gettext_noop() in
> guc_tables.c and moving all this information to config.sgml with the
> description of the new GUC.  Perhaps just put it after wal_sender_timeout
> in the sample file and the docs?
>
> Three comments in walsender.c still mention MAX_SEND_SIZE.  These
> should be switched to mention the GUC instead.
>
> I am switching the patch as waiting on author for now.
> --
> Michael
Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Michael Paquier
Date:
On Thu, Feb 08, 2024 at 02:42:00PM +0330, Majid Garoosi wrote:
> Thank you very much for your review.

Something to be aware of, but the community lists use bottom-posting
for replies because it is easier to follow the logic of a thread this
way.  See here:
https://en.wikipedia.org/wiki/Posting_style#Bottom-posting

> I generally agree with your suggestions, so just applied them.
> You can find the new patch in the attached file.

Thanks for the patch, that looks rather fine.  I have spent some time
polishing the docs, adding a mention that increasing the value can
show benefits depending on what you do.  How does the attached look to
you?
--
Michael

Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
On Fri, 9 Feb 2024 at 08:50, Michael Paquier <michael@paquier.xyz> wrote:
> Something to be aware of, but the community lists use bottom-posting
> for replies because it is easier to follow the logic of a thread this
> way.  See here:
> https://en.wikipedia.org/wiki/Posting_style#Bottom-posting

Oh, sorry for not using the convention here. I just noticed that after pressing the send button. =)

> Thanks for the patch, that looks rather fine.  I have spent some time
> polishing the docs, adding a mention that increasing the value can
> show benefits depending on what you do.  How does the attached look to
> you?

I took a look at it and it seems good to me.
Maybe I should work on my writing skills more. :D

Best
Majid

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Andres Freund
Date:
Hi,

On 2024-01-19 23:04:50 +0330, Majid Garoosi wrote:
> Following is the description, which is also in the commit message:
> The MAX_SEND_SIZE constant is used in the WAL sender to limit the maximum
> size of a WAL data packet sent to a WAL receiver during replication.
> Although its default value (128kB) was reasonable when it was chosen in
> 2010, hardware has changed a lot since then, so a PostgreSQL user might
> need to customize this value.
> For example, if a database's disk has both high bandwidth and high
> latency, it makes more sense to read bigger chunks of data from disk in
> each iteration. One example of such a disk is a remote disk (e.g. an RBD
> volume).

The way we read the WAL data is perfectly prefetchable by the OS, so I
wouldn't really expect gains here.  Have you actually been able to see a
performance benefit by increasing MAX_SEND_SIZE?

I don't think we should add configuration parameters without at least some
demonstration of practical gains, otherwise we'll end up with hundreds of
never-useful config options.

Greetings,

Andres Freund



Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
Hi Andres,

On Fri, 9 Feb 2024 at 22:33, Andres Freund <andres@anarazel.de> wrote:
> On 2024-01-19 23:04:50 +0330, Majid Garoosi wrote:
>> Following is the description, which is also in the commit message:
>> The MAX_SEND_SIZE constant is used in the WAL sender to limit the maximum
>> size of a WAL data packet sent to a WAL receiver during replication.
>> Although its default value (128kB) was reasonable when it was chosen in
>> 2010, hardware has changed a lot since then, so a PostgreSQL user might
>> need to customize this value.
>> For example, if a database's disk has both high bandwidth and high
>> latency, it makes more sense to read bigger chunks of data from disk in
>> each iteration. One example of such a disk is a remote disk (e.g. an RBD
>> volume).
>
> The way we read the WAL data is perfectly prefetchable by the OS, so I
> wouldn't really expect gains here.  Have you actually been able to see a
> performance benefit by increasing MAX_SEND_SIZE?

Yes, I have seen a huge performance jump. Take a look here for more info.

Best
Majid

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Michael Paquier
Date:
On Sun, Feb 11, 2024 at 04:32:20PM +0330, Majid Garoosi wrote:
> On Fri, 9 Feb 2024 at 22:33, Andres Freund <andres@anarazel.de> wrote:
>> The way we read the WAL data is perfectly prefetchable by the OS, so I
>> wouldn't really expect gains here.  Have you actually been able to see a
>> performance benefit by increasing MAX_SEND_SIZE?
>
> Yes, I have seen a huge performance jump. Take a look here
> <https://www.postgresql.org/message-id/CAFWczPvi_5FWH%2BJTqkWbi%2Bw83hy%3DMYg%3D2hKK0%3DJZBe9%3DhTpE4w%40mail.gmail.com>
> for more info.

Yes, I can get the idea that grouping more replication messages in
one shot can be beneficial in some cases while being
environment-dependent, though I also get the point that we cannot
simply GUC-ify everything either.  I'm OK with this one in the end,
because it is not performance-critical.

Note that it got lowered to the current value in ea5516081dcb to make
it more responsive, while being half a WAL segment in 40f908bdcdc7
when WAL senders were introduced in 2010.  I cannot pinpoint the
exact thread that led to this change, but I'm adding Fujii-san and
Heikki in CC for comments.
--
Michael

Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
Hey folks,

Any news, comments, etc. about this thread?

Best regards
Majid Garoosi

On Mon, 12 Feb 2024 at 01:10, Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Feb 11, 2024 at 04:32:20PM +0330, Majid Garoosi wrote:
>> On Fri, 9 Feb 2024 at 22:33, Andres Freund <andres@anarazel.de> wrote:
>>> The way we read the WAL data is perfectly prefetchable by the OS, so I
>>> wouldn't really expect gains here.  Have you actually been able to see a
>>> performance benefit by increasing MAX_SEND_SIZE?
>>
>> Yes, I have seen a huge performance jump. Take a look here
>> <https://www.postgresql.org/message-id/CAFWczPvi_5FWH%2BJTqkWbi%2Bw83hy%3DMYg%3D2hKK0%3DJZBe9%3DhTpE4w%40mail.gmail.com>
>> for more info.
>
> Yes, I can get the idea that grouping more replication messages in
> one shot can be beneficial in some cases while being
> environment-dependent, though I also get the point that we cannot
> simply GUC-ify everything either.  I'm OK with this one in the end,
> because it is not performance-critical.
>
> Note that it got lowered to the current value in ea5516081dcb to make
> it more responsive, while being half a WAL segment in 40f908bdcdc7
> when WAL senders were introduced in 2010.  I cannot pinpoint the
> exact thread that led to this change, but I'm adding Fujii-san and
> Heikki in CC for comments.
> --
> Michael

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Michael Paquier
Date:
On Mon, Apr 22, 2024 at 03:40:01PM +0200, Majid Garoosi wrote:
> Any news, comments, etc. about this thread?

FWIW, I'd still be in favor of doing a GUC-ification of this part, but
at this stage I'd need more time to do a proper study of a case where
this shows benefits to prove your point, or somebody else could come
in and show it.

Andres has objected to this change, on the ground that this was not
worth it, though you are telling the contrary.  I would be curious to
hear from others, first, so as we gather more opinions to reach a
consensus.
--
Michael

Attachments

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Jakub Wartak
Date:
On Tue, Apr 23, 2024 at 2:24 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Apr 22, 2024 at 03:40:01PM +0200, Majid Garoosi wrote:
> > Any news, comments, etc. about this thread?
>
> FWIW, I'd still be in favor of doing a GUC-ification of this part, but
> at this stage I'd need more time to do a proper study of a case where
> this shows benefits to prove your point, or somebody else could come
> in and show it.
>
> Andres has objected to this change, on the ground that this was not
> worth it, though you are telling the contrary.  I would be curious to
> hear from others, first, so as we gather more opinions to reach a
> consensus.

I'm more with Andres on this one. I vaguely remember researching the
impact of MAX_SEND_SIZE on two independent occasions (an earlier async
case and a more recent sync case, where I investigated a variety of
ways to keep latency down), and my summary would be:

First: it's very hard to get a *reliable* replication setup for
benchmarking, where one could demonstrate a correlation between e.g.
increasing MAX_SEND_SIZE and observing benefits (in sync rep it is
easier, as you are simply stalled in pgbench). Part of the problem is
the following:

a) the workload can be tricky; for this purpose it needs to be trivial
but bulky
b) it needs to be on an isolated network with guaranteed bandwidth
c) the impact of wal_init_zero should be ruled out
d) the OS should be properly tuned: TCP autotuning max (the 3rd value)
plus rmem_max/wmem_max set properly
e) a more serious TCP congestion control algorithm than the OS default
should be used
f) one should prevent any I/O stalls on walreceiver writeback during
huge WAL activity and restart points on the standby (dirty_bytes and
so on, isolated pg_xlog, BDI max_ratio)
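To make items d) through f) concrete, a sketch of that host tuning as a sysctl fragment might look like this (all values are illustrative assumptions, not recommendations from this thread; they must be sized for the actual network and storage):

```
# d) TCP buffer maximums and autotuning ceilings (min / default / max)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 16384 67108864
# e) a congestion control algorithm other than the distribution default
net.ipv4.tcp_congestion_control = bbr
# f) bound dirty memory so walreceiver writeback cannot stall for seconds
vm.dirty_bytes = 1073741824
vm.dirty_background_bytes = 268435456
```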

Second: once you perform the above and ensure that there are no
network or I/O stalls, I *think* I couldn't see any impact of playing
with MAX_SEND_SIZE from what I remember, as probably something else is
saturated first.

I can offer help with trying to test this with artificial tests and
even injecting proper latency (WAN-like), but I think the OP (Majid)
first needs to describe his environment much better (exact latency,
bandwidth, workload, TCP sysctl values, duration of the tests, number
of attempts tried, exact commands used, iperf3 TCP results
demonstrating the hardware used, and so on). So in short, the patch is
easy, but demonstrating the effect and persuading others here would be
hard.

-J.



Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Andres Freund
Date:
Hi,

On 2024-04-23 14:47:31 +0200, Jakub Wartak wrote:
> On Tue, Apr 23, 2024 at 2:24 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > > Any news, comments, etc. about this thread?
> >
> > FWIW, I'd still be in favor of doing a GUC-ification of this part, but
> > at this stage I'd need more time to do a proper study of a case where
> > this shows benefits to prove your point, or somebody else could come
> > in and show it.
> >
> > Andres has objected to this change, on the ground that this was not
> > worth it, though you are telling the contrary.  I would be curious to
> > hear from others, first, so as we gather more opinions to reach a
> > consensus.

I think it's a bad idea to make it configurable. It's just one more guc that
nobody has a chance of realistically tuning.  I'm not saying we shouldn't
improve the code - just that making MAX_SEND_SIZE configurable doesn't really
seem like a good answer.

FWIW, I have a hard time believing that MAX_SEND_SIZE is going to be
the only or even primary issue with high-latency, high-bandwidth
storage devices.



> First: it's very hard to get a *reliable* replication setup for
> benchmarking, where one could demonstrate a correlation between e.g.
> increasing MAX_SEND_SIZE and observing benefits (in sync rep it is
> easier, as you are simply stalled in pgbench). Part of the problem is
> the following:

Depending on the workload, it's possible to measure streaming-out performance
without actually regenerating WAL. E.g. by using pg_receivewal to stream the
data out multiple times.


Another way to get fairly reproducible WAL workloads is to drive
pg_logical_emit_message() from pgbench; that tends to have a lot less
variability than running tpcb-like or such.


> Second: once you perform the above and ensure that there are no
> network or I/O stalls, I *think* I couldn't see any impact of playing
> with MAX_SEND_SIZE from what I remember, as probably something else is
> saturated first.

My understanding of Majid's use-case for tuning MAX_SEND_SIZE is that the
bottleneck is storage, not network. The reason MAX_SEND_SIZE affects that is
that it determines the max size passed to WALRead(), which in turn determines
how much we read from the OS at once.  If the storage has high latency but
also high throughput, and readahead is disabled or just not aggressive enough
after crossing segment boundaries, larger reads reduce the number of times
you're likely to be blocked waiting for read IO.

Which is also why I think that making MAX_SEND_SIZE configurable is a really
poor proxy for improving the situation.

We're imo much better off working on read_stream.[ch] support for reading WAL.

Greetings,

Andres Freund



Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Jakub Wartak
Date:
Hi,

> My understanding of Majid's use-case for tuning MAX_SEND_SIZE is that the
> bottleneck is storage, not network. The reason MAX_SEND_SIZE affects that is
> that it determines the max size passed to WALRead(), which in turn determines
> how much we read from the OS at once.  If the storage has high latency but
> also high throughput, and readahead is disabled or just not aggressive enough
> after crossing segment boundaries, larger reads reduce the number of times
> you're likely to be blocked waiting for read IO.
>
> Which is also why I think that making MAX_SEND_SIZE configurable is a really
> poor proxy for improving the situation.
>
> We're imo much better off working on read_stream.[ch] support for reading WAL.

Well, then that would be a consistent message at least, because
earlier in [1] it was rejected to prefetch the WAL segment on the
standby side, where the patch only helped in configurations that had
readahead *disabled* for some reason.

Now, Majid stated that he uses "RBD" - Majid, any chance to specify
what that RBD really is? What's the tech? What fs? Any ioping or fio
results? What's the blockdev --report /dev/XXX output? (You stated
"high" latency and "high" bandwidth, but that is relative; for me,
15ms+ is high latency and >1000MB/s is high sequential bandwidth. It
would help others in the future if you could give the exact numbers.)
Maybe it's just a matter of enabling readahead (like in [1]) there
and/or using a larger WAL segment size during initdb.

[1] -
https://www.postgresql.org/message-id/flat/CADVKa1WsQMBYK_02_Ji%3DpbOFnms%2BCT7TV6qJxDdHsFCiC9V_hw%40mail.gmail.com



Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Majid Garoosi
Date:
Hi,

> Now, Majid stated that he uses "RBD" - Majid, any chance to specify
> what that RBD really is? What's the tech? What fs? Any ioping or fio
> results? What's the blockdev --report /dev/XXX output? (You stated
> "high" latency and "high" bandwidth, but that is relative; for me,
> 15ms+ is high latency and >1000MB/s is high sequential bandwidth. It
> would help others in the future if you could give the exact numbers.)
> Maybe it's just a matter of enabling readahead (like in [1]) there
> and/or using a larger WAL segment size during initdb.

Unfortunately, I quit that company a month ago (I wish we could
discuss this earlier) and don't have access to the environment
anymore.
I'll try to ask my teammates and see if they can find anything about
the exact values of latency, bw, etc.

Anyway, here is a description of the environment. Sadly, there are no
numbers in this description, but I'll try to provide as much detail
as possible.
There is a k8s cluster running over some VMs. Each postgres
instance runs as a pod inside the k8s cluster. So, both the
primary and standby servers are in the same DC, but might be on
different baremetal nodes. There is an overlay network for the pods to
see each other, and there's also another overlay network for the VMs
to see each other.
The storage servers are in the same DC, but we're sure they're on some
racks other than the postgres pods. They run the Ceph [1] project and
provide the Rados Block Device (RBD) [2] interface. In order for k8s
to use Ceph, a Ceph-CSI [3] controller is running inside the k8s cluster.
BTW, the FS type is ext4.

Re: GUC-ify walsender MAX_SEND_SIZE constant

From: Michael Paquier
Date:
On Thu, Apr 25, 2024 at 02:53:33PM +0200, Majid Garoosi wrote:
> Unfortunately, I quit that company a month ago (I wish we could
> discuss this earlier) and don't have access to the environment
> anymore.
> I'll try to ask my teammates and see if they can find anything about
> the exact values of latency, bw, etc.
>
> Anyway, here is a description of the environment. Sadly, there are no
> numbers in this description, but I'll try to provide as much detail
> as possible.
> There is a k8s cluster running over some VMs. Each postgres
> instance runs as a pod inside the k8s cluster. So, both the
> primary and standby servers are in the same DC, but might be on
> different baremetal nodes. There is an overlay network for the pods to
> see each other, and there's also another overlay network for the VMs
> to see each other.
> The storage servers are in the same DC, but we're sure they're on some
> racks other than the postgres pods. They run the Ceph [1] project and
> provide the Rados Block Device (RBD) [2] interface. In order for k8s
> to use Ceph, a Ceph-CSI [3] controller is running inside the k8s cluster.
> BTW, the FS type is ext4.

Okay, given the feedback on this patch and Andres disagreeing that it
is a good idea, I have marked the patch as rejected, as it was still
in the CF app.
--
Michael

Attachments