Discussion: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?


select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
Hi,

I saw a one-off failure like this:

                                  QUERY PLAN
  --------------------------------------------------------------------------
   Aggregate (actual rows=1 loops=1)
!    ->  Nested Loop (actual rows=98000 loops=1)
           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
                 Filter: (thousand = 0)
                 Rows Removed by Filter: 9990
!          ->  Gather (actual rows=9800 loops=10)
                 Workers Planned: 4
                 Workers Launched: 4
                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
--- 485,495 ----
                                  QUERY PLAN
  --------------------------------------------------------------------------
   Aggregate (actual rows=1 loops=1)
!    ->  Nested Loop (actual rows=97984 loops=1)
           ->  Seq Scan on tenk2 (actual rows=10 loops=1)
                 Filter: (thousand = 0)
                 Rows Removed by Filter: 9990
!          ->  Gather (actual rows=9798 loops=10)
                 Workers Planned: 4
                 Workers Launched: 4
                 ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)


Two tuples apparently went missing.

Similar failures on the build farm:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11

Could this be related to commit
34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
497171d3e2aaeea3b30d710b4e368645ad07ae43?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Tomas Vondra
Date:
On 03/04/2018 03:20 AM, Thomas Munro wrote:
> Hi,
> 
> I saw a one-off failure like this:
> 
>                                   QUERY PLAN
>   --------------------------------------------------------------------------
>    Aggregate (actual rows=1 loops=1)
> !    ->  Nested Loop (actual rows=98000 loops=1)
>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>                  Filter: (thousand = 0)
>                  Rows Removed by Filter: 9990
> !          ->  Gather (actual rows=9800 loops=10)
>                  Workers Planned: 4
>                  Workers Launched: 4
>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
> --- 485,495 ----
>                                   QUERY PLAN
>   --------------------------------------------------------------------------
>    Aggregate (actual rows=1 loops=1)
> !    ->  Nested Loop (actual rows=97984 loops=1)
>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>                  Filter: (thousand = 0)
>                  Rows Removed by Filter: 9990
> !          ->  Gather (actual rows=9798 loops=10)
>                  Workers Planned: 4
>                  Workers Launched: 4
>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960 loops=50)
> 
> 
> Two tuples apparently went missing.
> 
> Similar failures on the build farm:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
> 
> Could this be related to commit
> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
> 

I think the same failure (or at least very similar plan diff) was
already mentioned here:

https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us

So I guess someone else already noticed, but I don't see the cause
identified in that thread.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Andres Freund
Date:

On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>> Hi,
>>
>> I saw a one-off failure like this:
>>
>>                                   QUERY PLAN
>>
>--------------------------------------------------------------------------
>>    Aggregate (actual rows=1 loops=1)
>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>                  Filter: (thousand = 0)
>>                  Rows Removed by Filter: 9990
>> !          ->  Gather (actual rows=9800 loops=10)
>>                  Workers Planned: 4
>>                  Workers Launched: 4
>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>loops=50)
>> --- 485,495 ----
>>                                   QUERY PLAN
>>
>--------------------------------------------------------------------------
>>    Aggregate (actual rows=1 loops=1)
>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>                  Filter: (thousand = 0)
>>                  Rows Removed by Filter: 9990
>> !          ->  Gather (actual rows=9798 loops=10)
>>                  Workers Planned: 4
>>                  Workers Launched: 4
>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>loops=50)
>>
>>
>> Two tuples apparently went missing.
>>
>> Similar failures on the build farm:
>>
>>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>
>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>
>> Could this be related to commit
>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>
>
>I think the same failure (or at least very similar plan diff) was
>already mentioned here:
>
>https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>
>So I guess someone else already noticed, but I don't see the cause
>identified in that thread.

Robert and I started discussing it a bit over IM. No conclusion. Robert
tried to reproduce locally, including disabling atomics, without luck.

Can anybody reproduce locally?


Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 3:40 PM, Andres Freund <andres@anarazel.de> wrote:
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9800 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>>loops=50)
>>> --- 485,495 ----
>>>                                   QUERY PLAN
>>>
>>--------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9798 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>>loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>>
>>https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>>I think the same failure (or at least very similar plan diff) was
>>already mentioned here:
>>
>>https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>>So I guess someone else already noticed, but I don't see the cause
>>identified in that thread.

Oh.  Sorry, I didn't recognise that as the same thing, from the title.
Doesn't seem to be related to the number of workers launched at all... it
looks more like the tuple queue is misbehaving.  Though I haven't got
any proof of anything yet.

> Robert and I started discussing it a bit over IM. No conclusion. Robert
> tried to reproduce locally, including disabling atomics, without luck.
>
> Can anybody reproduce locally?

I've seen it several times on Travis CI.  (So I would normally have
been able to tell you about this problem before this was committed,
except that the email thread was too long and the mail archive app
cuts long threads off!)  Will try on some different kinds of computers
that I have local control of...  I suspect (knowing how we run it on
Travis CI) that being way overloaded might be helpful...

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I've seen it several times on Travis CI.  (So I would normally have
> been able to tell you about this problem before this was committed,
> except that the email thread was too long and the mail archive app
> cuts long threads off!)

(Correction.  It wasn't too long.  That was something else.  In this
case the Commitfest entry had two threads registered and my bot is too
stupid to find the interesting one.)

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Tomas Vondra
Date:

On 03/04/2018 03:40 AM, Andres Freund wrote:
> 
> 
> On March 3, 2018 6:36:51 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 03/04/2018 03:20 AM, Thomas Munro wrote:
>>> Hi,
>>>
>>> I saw a one-off failure like this:
>>>
>>>                                   QUERY PLAN
>>>  
>> --------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=98000 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9800 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>> loops=50)
>>> --- 485,495 ----
>>>                                   QUERY PLAN
>>>  
>> --------------------------------------------------------------------------
>>>    Aggregate (actual rows=1 loops=1)
>>> !    ->  Nested Loop (actual rows=97984 loops=1)
>>>            ->  Seq Scan on tenk2 (actual rows=10 loops=1)
>>>                  Filter: (thousand = 0)
>>>                  Rows Removed by Filter: 9990
>>> !          ->  Gather (actual rows=9798 loops=10)
>>>                  Workers Planned: 4
>>>                  Workers Launched: 4
>>>                  ->  Parallel Seq Scan on tenk1 (actual rows=1960
>> loops=50)
>>>
>>>
>>> Two tuples apparently went missing.
>>>
>>> Similar failures on the build farm:
>>>
>>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=okapi&dt=2018-03-03%2020%3A15%3A01
>>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-03-03%2018%3A13%3A32
>>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-03-03%2017%3A55%3A11
>>>
>>> Could this be related to commit
>>> 34db06ef9a1d7f36391c64293bf1e0ce44a33915 or commit
>>> 497171d3e2aaeea3b30d710b4e368645ad07ae43?
>>>
>>
>> I think the same failure (or at least very similar plan diff) was
>> already mentioned here:
>>
>> https://www.postgresql.org/message-id/17385.1520018934@sss.pgh.pa.us
>>
>> So I guess someone else already noticed, but I don't see the cause
>> identified in that thread.
> 
> Robert and I started discussing it a bit over IM. No conclusion. Robert
> tried to reproduce locally, including disabling atomics, without luck.
> 
> Can anybody reproduce locally?
> 

I've started "make check" with parallel_schedule tweaked to contain many
select_parallel runs, and so far I've seen a couple of failures like
this (about 10 failures out of 1500 runs):

  select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
tenk2.thousand=0;
! ERROR:  lost connection to parallel worker

I have no idea why the worker fails (no segfaults in dmesg, nothing in
postgres log), or if it's related to the issue discussed here at all.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I've started "make check" with parallel_schedule tweaked to contain many
> select_parallel runs, and so far I've seen a couple of failures like
> this (about 10 failures out of 1500 runs):
>
>   select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
> tenk2.thousand=0;
> ! ERROR:  lost connection to parallel worker
>
> I have no idea why the worker fails (no segfaults in dmesg, nothing in
> postgres log), or if it's related to the issue discussed here at all.

That sounds like the new defences from 2badb5afb89cd569500ef7c3b23c7a9d11718f2f.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Tomas Vondra
Date:

On 03/04/2018 04:11 AM, Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I've started "make check" with parallel_schedule tweaked to contain many
>> select_parallel runs, and so far I've seen a couple of failures like
>> this (about 10 failures out of 1500 runs):
>>
>>   select count(*) from tenk1, tenk2 where tenk1.hundred > 1 and
>> tenk2.thousand=0;
>> ! ERROR:  lost connection to parallel worker
>>
>> I have no idea why the worker fails (no segfaults in dmesg, nothing in
>> postgres log), or if it's related to the issue discussed here at all.
> 
> That sounds like the new defences from 2badb5afb89cd569500ef7c3b23c7a9d11718f2f.
> 

Yeah. But I wonder why the worker fails at all, or how to find that.


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:17 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 04:11 AM, Thomas Munro wrote:
>> On Sun, Mar 4, 2018 at 4:07 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> ! ERROR:  lost connection to parallel worker
>>
>> That sounds like the new defences from 2badb5afb89cd569500ef7c3b23c7a9d11718f2f.
>
> Yeah. But I wonder why the worker fails at all, or how to find that.

Could it be that a concurrency bug causes tuples to be lost on the
tuple queue, and also sometimes causes X (terminate) messages to be
lost from the error queue, so that the worker appears to go away
unexpectedly?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 4:37 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Could it be that a concurrency bug causes tuples to be lost on the
> tuple queue, and also sometimes causes X (terminate) messages to be
> lost from the error queue, so that the worker appears to go away
> unexpectedly?

Could shm_mq_detach_internal() need a pg_write_barrier() before it
writes mq_detached = true, to make sure that anyone who observes that
can also see the most recent increase of mq_bytes_written?
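
A minimal standalone C sketch of the hazard being asked about (all toy_* names
are invented; C11 fences stand in for PostgreSQL's pg_write_barrier() and
pg_read_barrier(); this is not the real shm_mq code): if the sender's store to
the byte counter can become visible after its store to the detach flag, or the
receiver reads the two in the wrong order, the receiver can observe the detach
while still holding a stale byte count and conclude the final message never
existed.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for the shared queue state; not the real shm_mq struct. */
struct toy_mq
{
    _Atomic uint64_t bytes_written;   /* plays the role of mq_bytes_written */
    _Atomic bool     detached;        /* plays the role of mq_detached */
};

/* Sender: publish the final message's bytes, then announce the detach. */
static void
toy_sender_detach(struct toy_mq *mq, uint64_t new_bytes_written)
{
    atomic_store_explicit(&mq->bytes_written, new_bytes_written,
                          memory_order_relaxed);
    /* Release fence, playing the role of pg_write_barrier(): the store
     * above must become visible no later than the store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&mq->detached, true, memory_order_relaxed);
}

/* Receiver: is the queue really finished, or is data still pending? */
static bool
toy_receiver_done(struct toy_mq *mq, uint64_t bytes_read)
{
    if (!atomic_load_explicit(&mq->detached, memory_order_relaxed))
        return false;                     /* sender still attached: not done */

    /* Acquire fence, playing the role of pg_read_barrier(): having seen
     * the detach flag, don't rely on an older view of the byte count. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&mq->bytes_written,
                                memory_order_relaxed) <= bytes_read;
}

This only illustrates the ordering question; as it turns out later in the
thread, the sender-side barrier is not actually needed because
SpinLockAcquire() already implies a full fence.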

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Could shm_mq_detach_internal() need a pg_write_barrier() before it
> writes mq_detached = true, to make sure that anyone who observes that
> can also see the most recent increase of mq_bytes_written?

I can reproduce both failure modes (missing tuples and "lost contact")
in the regression database with the attached Python script on my Mac.
It takes a few minutes and seems to happen sooner when my machine
is also doing other stuff (playing debugging music...).

I can reproduce it at 34db06ef9a1d7f36391c64293bf1e0ce44a33915
"shm_mq: Reduce spinlock usage." but (at least so far) not at the
preceding commit.

I can fix it with the following patch, which writes XXX out to the log
where it would otherwise miss a final message sent just before
detaching with sufficiently bad timing/memory ordering.  This patch
isn't my proposed fix, it's just a demonstration of what's busted.
There could be a better way to structure things than this.
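
The demonstration patch itself is attached below; purely as a hypothetical
sketch of the idea it describes (invented names, C11 atomics in place of
PostgreSQL's primitives, fprintf() standing in for elog()), the instrumented
receiver takes one extra, properly fenced look at the write counter after it
has already concluded "sender detached, nothing left", and logs the XXX marker
whenever that second look turns up a message the first look would have dropped:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical instrumentation sketch: call this once the receiver has
 * decided that the sender is detached and no unread data remains.
 * Returns true (and logs) if a fenced re-read shows a late final message
 * that the earlier, unfenced read missed.
 */
static bool
toy_recheck_after_detach(_Atomic uint64_t *bytes_written,
                         uint64_t bytes_read_so_far,
                         uint64_t bytes_written_first_seen)
{
    uint64_t fresh;

    /* Acquire fence, standing in for pg_read_barrier(): force a view of
     * the counter at least as new as the detach flag already observed. */
    atomic_thread_fence(memory_order_acquire);
    fresh = atomic_load_explicit(bytes_written, memory_order_relaxed);

    if (fresh > bytes_written_first_seen && fresh > bytes_read_so_far)
    {
        /* The attached patch writes "XXX" to the server log at this point. */
        fprintf(stderr, "XXX: a final message would have been lost\n");
        return true;            /* go back and consume it */
    }
    return false;               /* genuinely nothing left to read */
}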

-- 
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Magnus Hagander
Date:
On Sun, Mar 4, 2018 at 3:51 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sun, Mar 4, 2018 at 3:48 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I've seen it several times on Travis CI.  (So I would normally have
>> been able to tell you about this problem before this was committed,
>> except that the email thread was too long and the mail archive app
>> cuts long threads off!)
>
> (Correction.  It wasn't too long.  That was something else.  In this
> case the Commitfest entry had two threads registered and my bot is too
> stupid to find the interesting one.)


Um. Have you actually seen the "mail archive app" cut long threads off in other cases? Because it's certainly not supposed to do that... 


--

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Um. Have you actually seen the "mail archive app" cut long threads off in
> other cases? Because it's certainly not supposed to do that...

Hi Magnus,

I mean the "flat" thread view:

https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com

The final message on that page is not the final message that appears
in my mail client for the thread.  I guessed that might have been cut
off due to some hard-coded limit, but perhaps there is some other
reason (different heuristics for thread following?)

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Tomas Vondra
Date:

On 03/04/2018 10:27 AM, Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 5:40 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Could shm_mq_detach_internal() need a pg_write_barrier() before it
>> writes mq_detached = true, to make sure that anyone who observes that
>> can also see the most recent increase of mq_bytes_written?
> 
> I can reproduce both failure modes (missing tuples and "lost contact")
> in the regression database with the attached Python script on my Mac.
> It takes a few minutes and seems to happen sooner when my machine
> is also doing other stuff (playing debugging music...).
> 
> I can reproduce it at 34db06ef9a1d7f36391c64293bf1e0ce44a33915
> "shm_mq: Reduce spinlock usage." but (at least so far) not at the
> preceding commit.
> 
> I can fix it with the following patch, which writes XXX out to the log
> where it would otherwise miss a final message sent just before
> detaching with sufficiently bad timing/memory ordering.  This patch
> isn't my proposed fix, it's just a demonstration of what's busted.
> There could be a better way to structure things than this.
> 

I can confirm this resolves the issue for me. Before the patch, I saw
112 failures in ~11500 runs. With the patch I saw 0 failures, but
about 100 XXX messages in the log.

So my conclusion is that your analysis is likely correct.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Mon, Mar 5, 2018 at 4:05 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 03/04/2018 10:27 AM, Thomas Munro wrote:
>> I can fix it with the following patch, which writes XXX out to the log
>> where it would otherwise miss a final message sent just before
>> detaching with sufficiently bad timing/memory ordering.  This patch
>> isn't my proposed fix, it's just a demonstration of what's busted.
>> There could be a better way to structure things than this.
>
> I can confirm this resolves the issue for me. Before the patch, I saw
> 112 failures in ~11500 runs. With the patch I saw 0 failures, but
> about 100 XXX messages in the log.
>
> So my conclusion is that your analysis is likely correct.

Thanks!  Here are a couple of patches.  I'm not sure which I prefer.
The "pessimistic" one looks simpler and is probably the way to go, but
the "optimistic" one avoids doing an extra read until it has actually
run out of data and seen mq_detached == true.
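
To make the trade-off concrete, here is a rough standalone sketch of the
"optimistic" shape (invented names and C11 atomics again; this is not either
of the attached patches): the fast path trusts a cached copy of the write
counter, and the fenced re-read is paid for only once the cache is exhausted
and the detach flag has been observed.  A "pessimistic" variant would perform
that fenced read unconditionally before ever deciding the queue is empty.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_mq_result { TOY_MQ_DATA, TOY_MQ_WOULD_BLOCK, TOY_MQ_DETACHED };

/*
 * Optimistic receive check (sketch): trust the cached byte count first;
 * only when it says "nothing left" and the sender looks detached do we
 * pay for an acquire fence and a fresh read before giving up.
 */
static enum toy_mq_result
toy_receive_check(_Atomic uint64_t *bytes_written, _Atomic bool *detached,
                  uint64_t *cached_written, uint64_t bytes_read)
{
    if (*cached_written > bytes_read)
        return TOY_MQ_DATA;             /* fast path: no fence, no shared read */

    /* Cache exhausted: refresh it with an ordinary (unfenced) read. */
    *cached_written = atomic_load_explicit(bytes_written, memory_order_relaxed);
    if (*cached_written > bytes_read)
        return TOY_MQ_DATA;

    if (!atomic_load_explicit(detached, memory_order_relaxed))
        return TOY_MQ_WOULD_BLOCK;      /* sender alive: wait for more data */

    /*
     * Sender appears detached.  Re-read the counter behind an acquire
     * fence so a final message written just before the detach cannot be
     * missed.  (A pessimistic variant would do this fenced read on every
     * pass instead of only here.)
     */
    atomic_thread_fence(memory_order_acquire);
    *cached_written = atomic_load_explicit(bytes_written, memory_order_relaxed);
    if (*cached_written > bytes_read)
        return TOY_MQ_DATA;

    return TOY_MQ_DETACHED;             /* truly done */
}

The extra synchronized read happens at most once per "queue looks empty and
sender detached" decision, so the common case pays nothing even where
barriers are expensive.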

I realised that the pg_write_barrier() added to
shm_mq_detach_internal() from the earlier demonstration/hack patch was
not needed... I had a notion that SpinLockAcquire() might not include
a strong enough barrier (unlike SpinLockRelease()), but after reading
s_lock.h I think it's not needed (since you get either TAS() or a
syscall-based slow path, both expected to include a full fence).  I
haven't personally tested this on a weak memory order system.

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Alvaro Herrera
Date:
Thomas Munro wrote:
> On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander <magnus@hagander.net> wrote:
> > Um. Have you actually seen the "mail archive app" cut long threads off in
> > other cases? Because it's certainly not supposed to do that...
> 
> Hi Magnus,
> 
> I mean the "flat" thread view:
> 
> https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
> 
> The final message on that page is not the final message that appears
> in my mail client for the thread.  I guessed that might have been cut
> off due to some hard-coded limit, but perhaps there is some other
> reason (different heuristics for thread following?)

You're thinking of message
https://www.postgresql.org/message-id/CAFjFpRfa6_n10cn3vXjN9hdTqneH6A1rfnLXy0PnCP63T2putw@mail.gmail.com
but that is not the same thread -- it doesn't have the References or
In-Reply-To headers (see "raw"; user/pwd is archives/antispam).  Don't
know why though -- maybe Gmail trimmed References because it no longer
fit in the DKIM signature?  Yours had a long one:
https://www.postgresql.org/message-id/raw/CAEepm%3D0VCrC-WfzZkq3YSvJXf225rDnp1ypjv%2BrjKO5d0%3DXqFg%40mail.gmail.com

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Tue, Mar 6, 2018 at 5:04 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Thomas Munro wrote:
>> On Sun, Mar 4, 2018 at 10:46 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> > Um. Have you actually seen the "mail archive app" cut long threads off in
>> > other cases? Because it's certainly not supposed to do that...
>>
>> Hi Magnus,
>>
>> I mean the "flat" thread view:
>>
>> https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
>>
>> The final message on that page is not the final message that appears
>> in my mail client for the thread.  I guessed that might have been cut
>> off due to some hard-coded limit, but perhaps there is some other
>> reason (different heuristics for thread following?)
>
> You're thinking of message
> https://www.postgresql.org/message-id/CAFjFpRfa6_n10cn3vXjN9hdTqneH6A1rfnLXy0PnCP63T2putw@mail.gmail.com
> but that is not the same thread -- it doesn't have the References or
> In-Reply-To headers (see "raw"; user/pwd is archives/antispam).  Don't
> know why though -- maybe Gmail trimmed References because it no longer
> fit in the DKIM signature?  Yours had a long one:
> https://www.postgresql.org/message-id/raw/CAEepm%3D0VCrC-WfzZkq3YSvJXf225rDnp1ypjv%2BrjKO5d0%3DXqFg%40mail.gmail.com

Huh.  Interesting.  It seems that Gmail uses fuzzier heuristics, not
just "In-Reply-To", explaining why I considered that to be the same
thread but our archive didn't:

http://www.sensefulsolutions.com/2010/08/how-does-email-threading-work-in-gmail.html

I wonder why it dropped the In-Reply-To header when Ashutosh replied...

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Robert Haas
Date:
On Sun, Mar 4, 2018 at 4:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Thanks!  Here are a couple of patches.  I'm not sure which I prefer.
> The "pessimistic" one looks simpler and is probably the way to go, but
> the "optimistic" one avoids doing an extra read until it has actually
> run out of data and seen mq_detached == true.
>
> I realised that the pg_write_barrier() added to
> shm_mq_detach_internal() from the earlier demonstration/hack patch was
> not needed... I had a notion that SpinLockAcquire() might not include
> a strong enough barrier (unlike SpinLockRelease()), but after reading
> s_lock.h I think it's not needed (since you get either TAS() or a
> syscall-based slow path, both expected to include a full fence).  I
> haven't personally tested this on a weak memory order system.

The optimistic approach seems a little bit less likely to slow this
down on systems where barriers are expensive, so I committed that one.
Thanks for debugging this; I hope this fixes it, but I guess we'll
see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Thomas Munro
Date:
On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The optimistic approach seems a little bit less likely to slow this
> down on systems where barriers are expensive, so I committed that one.
> Thanks for debugging this; I hope this fixes it, but I guess we'll
> see.

Thanks.

For the record, the commit message (written by me) should have
acknowledged Tomas's help as reviewer and tester.  Sorry about that.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: select_parallel test failure: gather sometimes losing tuples (maybe during rescans)?

From
Tomas Vondra
Date:

On 03/05/2018 09:37 PM, Thomas Munro wrote:
> On Tue, Mar 6, 2018 at 9:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The optimistic approach seems a little bit less likely to slow this
>> down on systems where barriers are expensive, so I committed that one.
>> Thanks for debugging this; I hope this fixes it, but I guess we'll
>> see.
> 
> Thanks.
> 
> For the record, the commit message (written by me) should have
> acknowledged Tomas's help as reviewer and tester.  Sorry about that.
> 

Meh. You've done the hard work of figuring out what's wrong. The commit
message is perfectly fine.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services