Re: "could not reattach to shared memory" on buildfarm member dory
От | Noah Misch |
---|---|
Тема | Re: "could not reattach to shared memory" on buildfarm member dory |
Дата | |
Msg-id | 20180819020007.GD3795674@rfd.leadboat.com обсуждение исходный текст |
Ответ на | Re: "could not reattach to shared memory" on buildfarm member dory (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: "could not reattach to shared memory" on buildfarm member dory
|
Список | pgsql-hackers |
On Tue, May 01, 2018 at 11:31:50AM -0400, Tom Lane wrote: > Well, at this point the only thing that's entirely clear is that none > of the ideas I had work. I think we are going to be forced to pursue > Noah's idea of doing an end-to-end retry. Somebody else will need to > take point on that; I lack a Windows environment and have already done > a lot more blind patch-pushing than I like in this effort. Having tried this, I find a choice between performance and complexity. Both of my designs use proc_exit(4) to indicate failure to reattach. The simpler, slower design has WIN32 internal_forkexec() block until the child reports (via SetEvent()) that it reattached to shared memory. This caused a fivefold reduction in process creation performance[1]. The less-simple, faster design stashes the Port structure and retry count in the BackendList entry, which reaper() uses to retry the fork upon seeing status 4. Notably, this requires new code for regular backends, for bgworkers, and for others. It's currently showing a 30% performance _increase_ on the same benchmark; I can't explain that increase and doubt it will last, but I think it's plausible for the less-simple design to be performance-neutral. I see these options: 1. Use the simpler design with a GUC, disabled by default, to control whether the new code is active. Mention the GUC in a new errhint() for the "could not reattach to shared memory" error. 2. Like (1), but enable the GUC by default. 3. Like (1), but follow up with a patch to enable the GUC by default in v12 only. 4. In addition to (1), enable retries if the GUC is set _or_ this postmaster has seen at least one child fail to reattach. 5. Use the less-simple design, with retries enabled unconditionally. I think I prefer (3), with (1) being a close second. My hesitation on (3) is that parallel query has made startup time count even if you use a connection pool, and all the Windows users not needing these retries will see parallel query become that much slower. I dislike (5) for its impact on platform-independent postmaster code. Other opinions? I'm attaching a mostly-finished patch for the slower design. I tested correctness with -DREATTACH_FORCE_FAIL_PERCENT=99. I'm also attaching a proof-of-concept patch for the faster design. In this proof of concept, the postmaster does not close its copy of a backend socket until the backend exits. Also, bgworkers can change from BGWH_STARTED back to BGWH_NOT_YET_STARTED; core code tolerates this, but external code may not. Those would justify paying some performance to fix. The proof of concept handles bgworkers and regular backends, but it does not handle the startup process, checkpointer, etc. That doesn't affect benchmarking, of course. nm [1] This (2 forks per transaction) dropped from 139tps to 27tps: echo 'select 1' >script env PGOPTIONS="--default_transaction_isolation=repeatable\\ read --force_parallel_mode=on" pgbench -T15 -j30 -c30 --connect-n -fscript
Вложения
В списке pgsql-hackers по дате отправления: