Discussion: Postgres-R: internal messaging

Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi,

As you certainly know by now, Postgres-R introduces an additional
manager process. That one is forked from the postmaster, as are all
backends, no matter whether they are processing local or remote
transactions. That led to a communication problem, which was originally
(i.e. around Postgres-R for 6.4) solved using Unix pipes. I didn't like
that approach for various reasons: first, AFAIK there are portability
issues; second, it eats file descriptors; and third, it involves copying
the messages around several times. As the replication manager needs to
talk to the backends, but both need to be forked from the postmaster,
pipes would also have to go through the postmaster process.

Trying to be as portable as Postgres itself while still wanting an
efficient messaging system, I came up with the imessages stuff, which
I've already posted to -patches before [1]. It uses shared memory to
store and 'transfer' the messages, and signals to notify the receiving
processes (the so far unused SIGUSR2, IIRC). Of course this implies a
hard limit on the total size of messages waiting to be delivered, due
to the fixed size of the shared memory area.
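
For illustration, a message header along these lines is what I have in
mind; the names and sizes below are made up for this sketch, not the
actual patch code:

    #include <sys/types.h>
    #include <stdint.h>

    typedef enum IMessageType
    {
        IMSGT_CHANGESET,         /* change set of a local transaction */
        IMSGT_TERM_NOTICE        /* e.g. a backend termination notice */
    } IMessageType;

    typedef struct IMessage
    {
        pid_t        sender;     /* pid of the sending process */
        pid_t        recipient;  /* pid of the process to be signaled */
        IMessageType type;
        uint32_t     size;       /* payload size; the payload follows
                                  * this header in shared memory */
    } IMessage;

    /* one fixed-size area allocated at postmaster startup: the total
     * size of undelivered messages is hard-limited by this constant */
    #define IMESSAGE_AREA_SIZE  (1024 * 1024)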

Besides the communication between the replication manager and the
backends, which is currently done using these imessages, the replication
manager also needs to communicate with the postmaster: it needs to be
able to request new helper backends, and it wants to be notified upon
termination (or crash) of such a helper backend (and other backends as
well...). I'm currently doing this with imessages as well, which
violates the rule that the postmaster may not touch shared memory. I
haven't looked into ripping that out yet. I'm not sure it can be done
with the existing signaling of the postmaster.

Let's take a simple example: consider a local transaction which changes
some tuples. Those are collected into a change set, which gets written
to the shared memory area as an imessage for the replication manager.
The backend then also signals the manager, which wakes up from its
select(), checks its imessage queue and processes the message,
delivering it to the GCS. It then removes the imessage from the shared
memory area again.
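
In a rough sketch (IMessageLock and imsg_queue_append() are stand-ins,
not the real names in the patch; LWLockAcquire/Release are the standard
PostgreSQL calls), the send path looks about like this:

    #include <sys/types.h>
    #include <signal.h>

    #include "postgres.h"
    #include "storage/lwlock.h"

    /* hypothetical: the queue function of the imessage module */
    extern void imsg_queue_append(IMessage *msg);

    void
    imessage_activate(IMessage *msg, pid_t recipient)
    {
        msg->recipient = recipient;

        /* one global lock serializes all writers (see below for why
         * this is not ideal) */
        LWLockAcquire(IMessageLock, LW_EXCLUSIVE);
        imsg_queue_append(msg);          /* link into the shared queue */
        LWLockRelease(IMessageLock);

        /* wake the recipient from its select() */
        kill(recipient, SIGUSR2);
    }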

My initial design features only a single doubly linked list as the
message queue, holding all messages for all processes. An imessages lock
blocks concurrent write access. That's still what's in there, but I
realize that's not enough. Each process should rather have its own
queue, and the single lock needs to vanish to avoid contention on it.
However, that would require dynamically allocatable shared memory...

As another side note: I've had to write methods similar to those in
libpq, which serialize and deserialize integers or strings. The libpq
functions were not appropriate because they cannot write to shared
memory; instead, they are designed to flush to a socket, if I understand
correctly. Maybe these could be extended or modified to be usable there
as well? I've been hesitant, and instead implemented separate methods in
src/backend/storage/ipc/buffer.c.
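
Roughly, the idea is the following; the struct and function names are
illustrative, not necessarily what buffer.c actually calls them:

    #include <string.h>
    #include <stdint.h>
    #include <arpa/inet.h>           /* htonl/ntohl */

    typedef struct buffer
    {
        char   *data;                /* start of the (shared) memory area */
        size_t  size;                /* total size of the area */
        size_t  fill;                /* current read/write position */
    } buffer;

    /* append an int32 in network byte order, much like libpq does,
     * but into a plain buffer instead of the libpq send queue */
    void
    put_int32(buffer *b, int32_t i)
    {
        uint32_t n = htonl((uint32_t) i);

        memcpy(b->data + b->fill, &n, sizeof(n));
        b->fill += sizeof(n);
    }

    int32_t
    get_int32(buffer *b)
    {
        uint32_t n;

        memcpy(&n, b->data + b->fill, sizeof(n));
        b->fill += sizeof(n);
        return (int32_t) ntohl(n);
    }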

Comments?

Regards

Markus Wanner

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php



Re: Postgres-R: internal messaging

From:
Alexey Klyukin
Date:
Markus Wanner wrote:
> Besides the communication between the replication manager and the
> backends, which is currently done using these imessages, the replication
> manager also needs to communicate with the postmaster: it needs to be
> able to request new helper backends, and it wants to be notified upon
> termination (or crash) of such a helper backend (and other backends as
> well...). I'm currently doing this with imessages as well, which
> violates the rule that the postmaster may not touch shared memory. I
> haven't looked into ripping that out yet. I'm not sure it can be done
> with the existing signaling of the postmaster.

In Replicator we avoided the need for the postmaster to read/write the
backends' shmem data by using it as a signal forwarder. When a backend
wants to inform a special process (e.g. the queue monitor) about a
replication-related event (such as a commit), it sends SIGUSR1 to the
postmaster with a related "reason" flag, and the postmaster, upon
receiving this signal, forwards it to the destination process.
Termination of backends and special processes is handled by the
postmaster itself.
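
Schematically, the pattern looks about like this (the names are made up
for this sketch, not our actual Replicator code, and it assumes the
reason flags live in a small dedicated shmem slot the postmaster can
safely read):

    #include <sys/types.h>
    #include <signal.h>

    typedef enum ForwardReason
    {
        FWD_COMMIT_EVENT,        /* backend committed a replicated xact */
        FWD_NUM_REASONS
    } ForwardReason;

    /* small dedicated slot: written by backends, read by postmaster */
    extern volatile sig_atomic_t forward_flags[FWD_NUM_REASONS];
    extern pid_t PostmasterPid;
    extern pid_t QueueMonitorPid;

    /* backend side: set the reason flag, then poke the postmaster */
    void
    notify_via_postmaster(ForwardReason reason)
    {
        forward_flags[reason] = 1;
        kill(PostmasterPid, SIGUSR1);
    }

    /* postmaster's SIGUSR1 handler: forward to the destination
     * process (other reasons would be handled similarly) */
    void
    postmaster_sigusr1_handler(int signo)
    {
        if (forward_flags[FWD_COMMIT_EVENT])
        {
            forward_flags[FWD_COMMIT_EVENT] = 0;
            kill(QueueMonitorPid, SIGUSR1);
        }
    }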

>
> Let's take a simple example: consider a local transaction which changes
> some tuples. Those are collected into a change set, which gets written
> to the shared memory area as an imessage for the replication manager.
> The backend then also signals the manager, which wakes up from its
> select(), checks its imessage queue and processes the message,
> delivering it to the GCS. It then removes the imessage from the shared
> memory area again.

Hm... what would happen to new data under heavy load, when the queue
eventually fills up with messages: would the relevant transactions be
aborted, or would they wait for the manager to release the queue space
occupied by already-processed messages? ISTM that having a fixed-size
buffer limits the maximum transaction rate.

>
> My initial design features only a single doubly linked list as the
> message queue, holding all messages for all processes. An imessages lock
> blocks concurrent write access. That's still what's in there, but I
> realize that's not enough. Each process should rather have its own
> queue, and the single lock needs to vanish to avoid contention on it.
> However, that would require dynamically allocatable shared memory...

What about keeping the per-process message queue in the local memory of
the process, and exporting only the queue head to the shmem, thus having
only one message per process there? When the queue manager gets a
message from a process, it can signal that process to copy the next
message from its local memory into the shmem. To keep the ordering of
queue messages correct, an additional shared memory queue of pid_t
values can be maintained, containing one pid per message.
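
A rough sketch of the layout (all names and sizes invented for
illustration):

    #include <sys/types.h>
    #include <stddef.h>

    #define MAX_PROCS      128       /* max processes with a queue */
    #define MAX_MSG_SIZE  8192       /* per-process head slot size */
    #define PID_RING_SIZE 1024       /* max pending messages overall */

    /* each process exports at most one message (its queue head) here;
     * the rest of its queue stays in process-local memory */
    typedef struct ExportedHead
    {
        pid_t   sender;
        size_t  size;
        char    payload[MAX_MSG_SIZE];
    } ExportedHead;

    typedef struct SharedQueueArea
    {
        ExportedHead heads[MAX_PROCS];

        /* one pid per pending message, in global send order, so the
         * manager knows whose head slot to consume next */
        pid_t   pid_ring[PID_RING_SIZE];
        int     ring_head;
        int     ring_tail;
    } SharedQueueArea;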

-- 
Alexey Klyukin                         http://www.commandprompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi Alexey,

thanks for your feedback; these are interesting points.

Alexey Klyukin wrote:
> In Replicator we avoided the need for the postmaster to read/write the
> backends' shmem data by using it as a signal forwarder. When a backend
> wants to inform a special process (e.g. the queue monitor) about a
> replication-related event (such as a commit), it sends SIGUSR1 to the
> postmaster with a related "reason" flag, and the postmaster, upon
> receiving this signal, forwards it to the destination process.
> Termination of backends and special processes is handled by the
> postmaster itself.

Hm... how about larger data chunks, like change sets? In Postgres-R,
those need to travel between the backends and the replication manager,
which then sends them to the GCS.

> Hm... what would happen to new data under heavy load, when the queue
> eventually fills up with messages: would the relevant transactions be
> aborted, or would they wait for the manager to release the queue space
> occupied by already-processed messages? ISTM that having a fixed-size
> buffer limits the maximum transaction rate.

That's why the replication manager is a very simple forwarder, which
does not block messages, but consumes them from shared memory
immediately. It already features a message cache, which holds messages
it cannot currently forward to a backend because all backends are busy.

And it takes care to only send change sets to helper backends which are
not busy and can process the remote transaction immediately. That way, I
don't think the limit on shared memory is the bottleneck. However, I
haven't measured it.

WRT waiting vs. aborting: I think at the moment I don't handle this
situation gracefully. I've never encountered it. ;-)  But I think the
simpler option is letting the sender wait until there is enough room in
the queue for its message. To avoid deadlocks, each process should
consume its messages before trying to send one. (Which is currently done
correctly only for the replication manager, not for the backends, IIRC.)
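
As a sketch of that policy (none of these helpers exist in the patch
yet; the names are invented to illustrate the rule):

    #include <sys/types.h>
    #include <stdbool.h>

    typedef struct IMessage IMessage;

    extern void imessage_consume_all(void);   /* drain own queue */
    extern bool imessage_try_send(IMessage *msg, pid_t recipient);
    extern void wait_for_queue_space(void);   /* sleep until signaled */

    void
    imessage_send_blocking(IMessage *msg, pid_t recipient)
    {
        for (;;)
        {
            /* consume first: a peer blocked on a full queue may be
             * waiting for us to free space, so this breaks the cycle */
            imessage_consume_all();

            if (imessage_try_send(msg, recipient))
                return;               /* placed in shared memory */

            wait_for_queue_space();   /* queue full, wait and retry */
        }
    }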

> What about keeping the per-process message queue in the local memory of
> the process, and exporting only the queue head to the shmem, thus having
> only one message per process there?

The replication manager already does that with its cache. No other
process needs to send (large enough) messages which cannot be consumed
immediately, so such a local cache does not make much sense for any
other process.

Even for the replication manager, I find it dubious to require such a
cache, because it introduces unnecessary copying of data within memory.

> When the queue manager gets a
> message from a process, it can signal that process to copy the next
> message from its local memory into the shmem. To keep the ordering of
> queue messages correct, an additional shared memory queue of pid_t
> values can be maintained, containing one pid per message.

The replication manager takes care of the ordering for cached messages.

Regards

Markus Wanner



Re: Postgres-R: internal messaging

From:
Tom Lane
Date:
Alexey Klyukin <alexk@commandprompt.com> writes:
> Markus Wanner wrote:
>> I'm currently doing this with imessages as well, which violates the
>> rule that the postmaster may not touch shared memory. I haven't looked
>> into ripping that out yet. I'm not sure it can be done with the
>> existing signaling of the postmaster.

> In Replicator we avoided the need for postmaster to read/write backend's
> shmem data by using it as a signal forwarder.

You should also look at the current code for communication between
autovac launcher and autovac workers.  That seems to be largely a
similar problem, and it's been solved in a way that seems to be
safe enough with respect to the postmaster vs shared memory issue.
        regards, tom lane


Re: Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi,

Tom Lane wrote:
> You should also look at the current code for communication between
> autovac launcher and autovac workers.  That seems to be largely a
> similar problem, and it's been solved in a way that seems to be
> safe enough with respect to the postmaster vs shared memory issue.

Oh yeah, thanks for reminding me. Back when it was added, I thought I
might find some helpful insights in there. But I never took the time to
read through it...

Regards

Markus Wanner



Re: Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi,

What follows are some comments from trying to understand how the
autovacuum launcher works, and thoughts on how to apply this to the
replication manager in Postgres-R.

The initial comments in autovacuum.c say:

> If the fork() call fails in the postmaster, it sets a flag in the shared
> memory area, and sends a signal to the launcher.  

I note that the shmem area the postmaster is writing to is pretty
static and not dependent on any other state stored in shmem. That
certainly makes a difference compared to my imessages approach, where
corruption in the imessages shmem could also confuse the postmaster.
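
Schematically, the pattern is something like this (names invented for
the sketch; the actual autovacuum.c code differs):

    #include <sys/types.h>
    #include <signal.h>

    /* a single static slot, written only by the postmaster; it contains
     * no pointers into other (possibly corrupted) shmem structures */
    typedef struct LauncherSignalArea
    {
        volatile sig_atomic_t fork_failed;
    } LauncherSignalArea;

    extern LauncherSignalArea *launcher_sig;  /* in shared memory */
    extern pid_t LauncherPid;

    /* postmaster side, when fork() of a requested worker fails */
    void
    report_fork_failure(void)
    {
        launcher_sig->fork_failed = 1;
        kill(LauncherPid, SIGUSR1);           /* wake the launcher */
    }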

Reading on, the 'can_launch' flag in the launcher's main loop makes sure
that only one worker is requested at a time, so that the launcher
doesn't miss a failure or success notice from either the postmaster or
the newly started worker. The replication manager currently shamelessly
requests as many helper backends as it wants. I think I can change that
without much trouble. It would certainly make sense.

That leaves the notification of the replication manager after
termination or crashes of a helper backend. Upon normal errors (i.e.
elog(ERROR)), the backend processes themselves should take care of
notifying the replication manager. But crashes are more difficult. IMO
the replication manager needs to stay alive during this
reinitialization, to keep the GCS connection. It can easily detach from
shared memory temporarily (the imessages stuff is the only place in
shmem it touches, IIRC). However, a more difficult aspect is that it
must be able to tell whether a backend applied its transaction *before*
it died. Thus, after all backends have been killed, the postmaster needs
to hold off reinitializing shared memory until the replication manager
has consumed all its messages. (Otherwise we would risk "losing" local
transactions, and probably remote ones as well.)

So, yes, after thinking about it, detaching the postmaster from shared
memory seems doable for Postgres-R (in the sense of "the postmaster does
not rely on possibly corrupted data in shared memory"). Reinitialization
needs some more thought, but in general that seems like the way to go.

Regards

Markus Wanner



Re: Postgres-R: internal messaging

From:
Tom Lane
Date:
Markus Wanner <markus@bluegap.ch> writes:
> ... crashes are more difficult. IMO the replication manager needs to
> stay alive during this reinitialization, to keep the GCS connection. It
> can easily detach from shared memory temporarily (the imessages stuff
> is the only place in shmem it touches, IIRC). However, a more difficult
> aspect is that it must be able to tell whether a backend applied its
> transaction *before* it died. Thus, after all backends have been
> killed, the postmaster needs to hold off reinitializing shared memory
> until the replication manager has consumed all its messages. (Otherwise
> we would risk "losing" local transactions, and probably remote ones as
> well.)

I hope you're not expecting the contents of shared memory to still be
trustworthy after a backend crash.  If the manager is working strictly
from its own local memory, then it would be reasonable to operate
as above.
        regards, tom lane


Re: Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi,

Tom Lane wrote:
> I hope you're not expecting the contents of shared memory to still be
> trustworthy after a backend crash.

Hm.. that's a good point.

So I would either need to bullet-proof the imessages with checksums or
some such. I'm not sure that's doable reliably, not to speak of
performance.

Thus it might be better to just restart the replication manager as
well. Note that this means temporarily leaving the replication group and
going through node recovery to apply the remote transactions missed in
between. This sounds expensive, but it's certainly the safer way to do
it. And since such backend crashes are Expected Not To Happen(tm) on
production systems, that's probably good enough.

> If the manager is working strictly
> from its own local memory, then it would be reasonable to operate
> as above.

That's not the case... :-(

Thanks for your excellent guidance.

Regards

Markus



Re: Postgres-R: internal messaging

From:
Markus Wanner
Date:
Hi,

That's now changed in today's snapshot of Postgres-R: the postmaster no
longer uses imessages (and thus shared memory) to communicate with the
replication manager. Instead, the manager signals the postmaster using a
newish PMSIGNAL reason to request new helper backends. It now only
requests one helper at a time and keeps track of pending requests. The
helper backends now read the name of the database to which they must
connect from shared memory themselves. That should now adhere to the
standard Postgres rules for shared memory safety.
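
Schematically, the request path now looks about like this;
SendPostmasterSignal() is the existing pmsignal interface, while the
signal reason's name and the shared state fields here merely approximate
what the snapshot uses:

    #include "postgres.h"
    #include "storage/pmsignal.h"

    /* replication manager side: request exactly one helper backend;
     * RepSharedState and PMSIGNAL_START_REPLICATION_HELPER approximate
     * the names used in the snapshot */
    static void
    request_helper_backend(const char *dbname)
    {
        /* publish the database name; the helper reads it from shared
         * memory itself once it has been forked and attached */
        strlcpy(RepSharedState->helper_dbname, dbname, NAMEDATALEN);

        /* sets a flag in the small pmsignal shmem array and sends
         * SIGUSR1; the postmaster reads nothing else from shmem */
        SendPostmasterSignal(PMSIGNAL_START_REPLICATION_HELPER);
    }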

Additionally, the replication manager is also restarted after a backend 
crash, to make sure it never tries to work on corrupted shared memory. 
However, that part isn't complete, as the replication manager cannot 
really handle that situation just yet. There are other outstanding 
issues having to do with that change. Those are documented in the TODO 
file in src/backend/replication/.

Regards

Markus Wanner