Discussion: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
Hi,

I saw a one-off failure like this:

                              QUERY PLAN
  --------------------------------------------------------------------------
   Aggregate (actual rows=1 loops=1)
!    ->  Nested Loop (actual rows=98000 loops=1)
           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
                 Filter: (thousand = 0)
                 Rows Removed by Filter: 9990
!          ->  Gather (actual rows=9800 loops=10)
                 Workers Planned: 4
                 Workers Launched: 4
                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
--- 485,495 ----
                              QUERY PLAN
  --------------------------------------------------------------------------
   Aggregate (actual rows=1 loops=1)
!    ->  Nested Loop (actual rows=97984 loops=1)
           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
                 Filter: (thousand = 0)
                 Rows Removed by Filter: 9990
!          ->  Gather (actual rows=9798 loops=10)
                 Workers Planned: 4
                 Workers Launched: 4
                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)

Two tuples apparently went missing.

Similar failures on the build farm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11

Could this be related to commit 34db06ef9a1d7f36391c64293bf1e0ce44a33915
or commit 497171d3e2aaeea3b30d710b4e368645ad07ae43?

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Tomas Vondra
Date:
On 03/04/2018 03:20 AM, Thomas Munro wrote:
> Hi,
>
> I saw a one-off failure like this:
>
> [plan diff and build farm links trimmed, see above]
>
> Could this be related to commit
> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
> 497171d3e2aaeea3b30d710b4e368645ad07ae43?

I think the same failure (or at least a very similar plan diff) was
already mentioned here:

https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us

So I guess someone else already noticed, but I don't see the cause
identified in that thread.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Andres Freund
Date:
On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>> [...]
>
> I think the same failure (or at least very similar plan diff) was
> already mentioned here:
>
> https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>
> So I guess someone else already noticed, but I don't see the cause
> identified in that thread.

Robert and I started discussing it a bit over IM. No conclusion.
Robert tried to reproduce locally, including disabling atomics,
without luck.

Can anybody reproduce locally?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <andres@anarazel.de> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.

Oh. Sorry, I didn't recognise that as the same thing from the title.

Doesn't seem to be related to the number of workers launched at all...
it looks more like the tuple queue is misbehaving. Though I haven't
got any proof of anything yet.

> Robert and I started discussing it a bit over IM. No conclusion.
> Robert tried to reproduce locally, including disabling atomics,
> without luck.
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI. (So I would normally have
been able to tell you about this problem before it was committed,
except that the email thread was too long and the mail archive app
cuts long threads off!)

Will try on some different kinds of computers that I have local
control of... I suspect (knowing how we run it on Travis CI) that
being way overloaded might be helpful...

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I've seen it several times on Travis CI. (So I would normally have
> been able to tell you about this problem before it was committed,
> except that the email thread was too long and the mail archive app
> cuts long threads off!)

(Correction. It wasn't too long. That was something else. In this
case the Commitfest entry had two threads registered and my bot is too
stupid to find the interesting one.)

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Tomas Vondra
Date:
On 03/04/2018 03:40 AM, Andres Freund wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>
> Robert and I started discussing it a bit over IM. No conclusion.
> Robert tried to reproduce locally, including disabling atomics,
> without luck.
>
> Can anybody reproduce locally?

I've started "make check" with parallel_schedule tweaked to contain
many select_parallel runs, and so far I've seen a couple of failures
like this (about 10 failures out of 1500 runs):

  select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
  tenk2.thousand=0;
  ! ERROR:  lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
the postgres log), or if it's related to the issue discussed here at all.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
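[Editor's sketch: the stress setup Tomas describes can be approximated as below. The loop count and the use of a temp file are illustrative assumptions; in a real source tree one would append to src/test/regress/parallel_schedule and run "make check".]

```shell
# Pad a schedule file with repeated select_parallel runs to shake out
# the rare race.  We build the padded schedule in a temp file here;
# point it at src/test/regress/parallel_schedule in a real tree.
schedule=$(mktemp)
for i in $(seq 1 50); do
    echo "test: select_parallel" >> "$schedule"
done
echo "padded schedule has $(wc -l < "$schedule" | tr -d ' ') entries"
```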
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I've started "make check" with parallel_schedule tweaked to contain many
> select_parallel runs, and so far I've seen a couple of failures like
> this (about 10 failures out of 1500 runs):
>
> select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
> tenk2.thousand=0;
> ! ERROR:  lost connection to parallel worker
>
> I have no idea why the worker fails (no segfaults in dmesg, nothing in
> the postgres log), or if it's related to the issue discussed here at all.

That sounds like the new defences from
2badb5afb89cd569500ef7c3b23c7a9d11718f2f.

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Tomas Vondra
Date:
On 03/04/2018 04:11 AM, Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I've started "make check" with parallel_schedule tweaked to contain many
>> select_parallel runs, and so far I've seen a couple of failures like
>> this (about 10 failures out of 1500 runs):
>>
>> select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
>> tenk2.thousand=0;
>> ! ERROR:  lost connection to parallel worker
>>
>> I have no idea why the worker fails (no segfaults in dmesg, nothing in
>> the postgres log), or if it's related to the issue discussed here at all.
>
> That sounds like the new defences from 2badb5afb89cd569500ef7c3b23c7a9d11718f2f.

Yeah. But I wonder why the worker fails at all, or how to find that.

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:17 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 04:11 AM, Thomas Munro wrote:
>> On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> ! ERROR:  lost connection to parallel worker
>>
>> That sounds like the new defences from 2badb5afb89cd569500ef7c3b23c7a9d11718f2f.
>
> Yeah. But I wonder why the worker fails at all, or how to find that.

Could it be that a concurrency bug causes tuples to be lost on the
tuple queue, and also sometimes causes X (terminate) messages to be
lost from the error queue, so that the worker appears to go away
unexpectedly?

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:37 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Could it be that a concurrency bug causes tuples to be lost on the
> tuple queue, and also sometimes causes X (terminate) messages to be
> lost from the error queue, so that the worker appears to go away
> unexpectedly?

Could shm_mq_detach_internal() need a pg_write_barrier() before it
writes mq_detached = true, to make sure that anyone who observes that
can also see the most recent increase of mq_bytes_written?

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Could shm_mq_detach_internal() need a pg_write_barrier() before it
> writes mq_detached = true, to make sure that anyone who observes that
> can also see the most recent increase of mq_bytes_written?

I can reproduce both failure modes (missing tuples and "lost contact")
in the regression database with the attached Python script on my Mac.
It takes a few minutes and seems to happen sooner when my machine
is also doing other stuff (playing debugging music...).

I can reproduce it at 34db06ef9a1d7f36391c64293bf1e0ce44a33915
"shm_mq: Reduce spinlock usage." but (at least so far) not at the
preceding commit.

I can fix it with the following patch, which writes XXX out to the log
where it would otherwise miss a final message sent just before
detaching with sufficiently bad timing/memory ordering. This patch
isn't my proposed fix, it's just a demonstration of what's busted.
There could be a better way to structure things than this.

--
Thomas Munro
http://www.enterprisedb.com
Attachments
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Magnus Hagander
Date:
On Sun, Mar 4, 2018 at 3:51 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I've seen it several times on Travis CI. (So I would normally have
> been able to tell you about this problem before it was committed,
> except that the email thread was too long and the mail archive app
> cuts long threads off!)
(Correction. It wasn't too long. That was something else. In this
case the Commitfest entry had two threads registered and my bot is too
stupid to find the interesting one.)
Um. Have you actually seen the "mail archive app" cut long threads off in other cases? Because it's certainly not supposed to do that...
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Um. Have you actually seen the "mail archive app" cut long threads off in
> other cases? Because it's certainly not supposed to do that...

Hi Magnus,

I mean the "flat" thread view:

https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com

The final message on that page is not the final message that appears
in my mail client for the thread. I guessed that it might have been
cut off due to some hard-coded limit, but perhaps there is some other
reason (different heuristics for thread following?).

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Tomas Vondra
Date:
On 03/04/2018 10:27 AM, Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Could shm_mq_detach_internal() need a pg_write_barrier() before it
>> writes mq_detached = true, to make sure that anyone who observes that
>> can also see the most recent increase of mq_bytes_written?
>
> I can reproduce both failure modes (missing tuples and "lost contact")
> in the regression database with the attached Python script on my Mac.
> [...]
>
> I can fix it with the following patch, which writes XXX out to the log
> where it would otherwise miss a final message sent just before
> detaching with sufficiently bad timing/memory ordering. This patch
> isn't my proposed fix, it's just a demonstration of what's busted.
> There could be a better way to structure things than this.

I can confirm this resolves the issue for me. Before the patch, I'd
seen 112 failures in ~11500 runs. With the patch I saw 0 failures, but
about 100 XXX messages in the log.

So my conclusion is that your analysis is likely correct.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Mon, Mar 5, 2018 at 4:05 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 10:27 AM, Thomas Munro wrote:
>> I can fix it with the following patch, which writes XXX out to the log
>> where it would otherwise miss a final message sent just before
>> detaching with sufficiently bad timing/memory ordering. This patch
>> isn't my proposed fix, it's just a demonstration of what's busted.
>> There could be a better way to structure things than this.
>
> I can confirm this resolves the issue for me. Before the patch, I've
> seen 112 failures in ~11500 runs. With the patch I saw 0 failures, but
> about 100 XXX messages in the log.
>
> So my conclusion is that your analysis is likely correct.

Thanks! Here are a couple of patches. I'm not sure which I prefer.
The "pessimistic" one looks simpler and is probably the way to go, but
the "optimistic" one avoids doing an extra read until it has actually
run out of data and seen mq_detached == true.

I realised that the pg_write_barrier() added to
shm_mq_detach_internal() from the earlier demonstration/hack patch was
not needed... I had a notion that SpinLockAcquire() might not include
a strong enough barrier (unlike SpinLockRelease()), but after reading
s_lock.h I think it's not needed (since you get either TAS() or a
syscall-based slow path, both expected to include a full fence). I
haven't personally tested this on a weak memory order system.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachments
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Alvaro Herrera
Date:
Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> Um. Have you actually seen the "mail archive app" cut long threads off in
>> other cases? Because it's certainly not supposed to do that...
>
> I mean the "flat" thread view:
>
> https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
>
> The final message on that page is not the final message that appears
> in my mail client for the thread. I guessed that might have been cut
> off due to some hard-coded limit, but perhaps there is some other
> reason (different heuristics for thread following?)

You're thinking of message
https://www.postgresql.org/message-id/CAFjFpRfa6_n10cn3vXjN9hdTqneH6A1rfnLXy0PnCP63T2putw@mail.gmail.com
but that is not the same thread -- it doesn't have the References or
In-Reply-To headers (see "raw"; user/pwd is archives/antispam). Don't
know why though -- maybe Gmail trimmed References because it no longer
fit in the DKIM signature? Yours had a long one:
https://www.postgresql.org/message-id/raw/CAEepm%3D0VCrC-WfzZkq3YSvJXf225rDnp1ypjv%2BrjKO5d0%3DXqFg%40mail.gmail.com

--
Álvaro Herrera
https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Tue, Mar 6, 2018 at 5:04 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thomas Munro wrote:
>> [...]
>
> You're thinking of message
> https://www.postgresql.org/message-id/CAFjFpRfa6_n10cn3vXjN9hdTqneH6A1rfnLXy0PnCP63T2putw@mail.gmail.com
> but that is not the same thread -- it doesn't have the References or
> In-Reply-To headers (see "raw"; user/pwd is archives/antispam). Don't
> know why though -- maybe Gmail trimmed References because it no longer
> fit in the DKIM signature? Yours had a long one:
> https://www.postgresql.org/message-id/raw/CAEepm%3D0VCrC-WfzZkq3YSvJXf225rDnp1ypjv%2BrjKO5d0%3DXqFg%40mail.gmail.com

Huh. Interesting. It seems that Gmail uses fuzzier heuristics, not
just "In-Reply-To", which explains why I considered that to be the
same thread but our archive didn't:

http://www.sensefulsolutions.com/2010/08/how-does-email-threading-work-in-gmail.html

I wonder why it dropped the In-Reply-To header when Ashutosh replied...

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Robert Haas
Date:
On Sun, Mar 4, 2018 at 4:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Thanks! Here are a couple of patches. I'm not sure which I prefer.
> The "pessimistic" one looks simpler and is probably the way to go, but
> the "optimistic" one avoids doing an extra read until it has actually
> run out of data and seen mq_detached == true.
>
> I realised that the pg_write_barrier() added to
> shm_mq_detach_internal() from the earlier demonstration/hack patch was
> not needed... I had a notion that SpinLockAcquire() might not include
> a strong enough barrier (unlike SpinLockRelease()), but after reading
> s_lock.h I think it's not needed (since you get either TAS() or a
> syscall-based slow path, both expected to include a full fence). I
> haven't personally tested this on a weak memory order system.

The optimistic approach seems a little bit less likely to slow this
down on systems where barriers are expensive, so I committed that one.
Thanks for debugging this; I hope this fixes it, but I guess we'll
see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Thomas Munro
Date:
On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The optimistic approach seems a little bit less likely to slow this
> down on systems where barriers are expensive, so I committed that one.
> Thanks for debugging this; I hope this fixes it, but I guess we'll
> see.

Thanks.

For the record, the commit message (written by me) should have
acknowledged Tomas's help as reviewer and tester. Sorry about that.

--
Thomas Munro
http://www.enterprisedb.com
Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?
From
Tomas Vondra
Date:
On 03/05/2018 09:37 PM, Thomas Munro wrote:
> On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The optimistic approach seems a little bit less likely to slow this
>> down on systems where barriers are expensive, so I committed that one.
>> Thanks for debugging this; I hope this fixes it, but I guess we'll
>> see.
>
> Thanks.
>
> For the record, the commit message (written by me) should have
> acknowledged Tomas's help as reviewer and tester. Sorry about that.

Meh. You've done the hard work of figuring out what's wrong. The
commit message is perfectly fine.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services