RE: BUG #16031: Group by returns duplicate groups

Поиск

Список

Период

Сортировка

От	David Raymond
Тема	RE: BUG #16031: Group by returns duplicate groups
Дата	2 октября 2019 г. 18:36:33
Msg-id	AM6PR07MB6022C91A384768FAB540B4C0879C0@AM6PR07MB6022.eurprd07.prod.outlook.com обсуждение исходный текст
Ответ на	Re: BUG #16031: Group by returns duplicate groups (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Ответы	Re: BUG #16031: Group by returns duplicate groups Re: BUG #16031: Group by returns duplicate groups
Список	pgsql-bugs

Дерево обсуждения

As an update, I've currently got a dump that consistently shows weirdness when loaded. It's just the "name" field, has
1.3million records, is 15 MB zipped, and has things garbled enough that I don't mind sending it.
 

How small does it need to be before it's good to send to someone?


Output after re-loading the table:

testing=> \d+ weird_grouping
                                    Table "name_stuff.weird_grouping"
 Column |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
--------+------------------------+-----------+----------+---------+----------+--------------+-------------
 name   | character varying(254) |           | not null |         | extended |              |


testing=> select count(*), count(distinct name) from weird_grouping;
   count   |   count
-----------+-----------
 1,297,265 | 1,176,103
(1 row)

Time: 7616.186 ms (00:07.616)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*), count(distinct name) from
weird_grouping;
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=10187.10..10187.11 rows=1 width=16) (actual time=7805.463..7805.463 rows=1 loops=1)
   Output: count(*), count(DISTINCT name)
   Buffers: shared hit=2080 read=6236, temp read=5484 written=5506
   ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..9563.40 rows=124740 width=516) (actual time=0.031..128.203
rows=1297265loops=1)
 
         Output: name
         Buffers: shared hit=2080 read=6236
 Planning Time: 0.027 ms
 Execution Time: 7805.483 ms
(8 rows)

Time: 7805.822 ms (00:07.806)

testing=> with foo as (select name from weird_grouping group by name) select name from foo group by name having
count(*)> 1;
 
 name
------
(0 rows)

Time: 1950.401 ms (00:01.950)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) with foo as (select name from weird_grouping
groupby name) select name from foo group by name having count(*) > 1;
 
                                                                    QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=9882.25..9884.75 rows=67 width=516) (actual time=1466.738..1466.738 rows=0 loops=1)
   Output: foo.name
   Group Key: foo.name
   Filter: (count(*) > 1)
   Rows Removed by Filter: 1176101
   Buffers: shared hit=2144 read=6172, temp written=4533
   CTE foo
     ->  HashAggregate  (cost=9875.25..9877.25 rows=200 width=516) (actual time=494.343..734.005 rows=1176101 loops=1)
           Output: weird_grouping.name
           Group Key: weird_grouping.name
           Buffers: shared hit=2144 read=6172
           ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..9563.40 rows=124740 width=516) (actual
time=0.283..120.898rows=1297265 loops=1)
 
                 Output: weird_grouping.name
                 Buffers: shared hit=2144 read=6172
   ->  CTE Scan on foo  (cost=0.00..4.00 rows=200 width=516) (actual time=494.346..1035.185 rows=1176101 loops=1)
         Output: foo.name
         Buffers: shared hit=2144 read=6172, temp written=4533
 Planning Time: 0.464 ms
 Execution Time: 2230.238 ms
(19 rows)

Time: 2231.291 ms (00:02.231)

testing=> analyze verbose weird_grouping;
INFO:  analyzing "name_stuff.weird_grouping"
INFO:  "weird_grouping": scanned 8316 of 8316 pages, containing 1297265 live rows and 0 dead rows; 30000 rows in
sample,1297265 estimated total rows
 
ANALYZE
Time: 474.963 ms

testing=> select count(*), count(distinct name) from weird_grouping;
   count   |   count
-----------+-----------
 1,297,265 | 1,176,103
(1 row)

Time: 7449.983 ms (00:07.450)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) select count(*), count(distinct name) from
weird_grouping;
                                                                 QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=27774.98..27774.99 rows=1 width=16) (actual time=7754.578..7754.578 rows=1 loops=1)
   Output: count(*), count(DISTINCT name)
   Buffers: shared hit=2240 read=6076, temp read=5484 written=5506
   ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..21288.65 rows=1297265 width=20) (actual time=0.036..102.144
rows=1297265loops=1)
 
         Output: name
         Buffers: shared hit=2240 read=6076
 Planning Time: 0.031 ms
 Execution Time: 7754.598 ms
(8 rows)

Time: 7754.902 ms (00:07.755)

testing=> with foo as (select name from weird_grouping group by name) select name from foo group by name having
count(*)> 1;
 
 name
-------
 DCT
 DELTA
(2 rows)

Time: 9561.382 ms (00:09.561)

testing=> explain (analyze, verbose, costs, buffers, timing, summary) with foo as (select name from weird_grouping
groupby name) select name from foo group by name having count(*) > 1;
 
                                                                        QUERY PLAN
                             
 

----------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=209783.95..209786.45 rows=67 width=516) (actual time=8591.210..8686.132 rows=2 loops=1)
   Output: foo.name
   Group Key: foo.name
   Filter: (count(*) > 1)
   Rows Removed by Filter: 1176099
   Buffers: shared hit=2304 read=6012, temp read=7800 written=12363
   CTE foo
     ->  Group  (cost=179613.72..186100.05 rows=947356 width=20) (actual time=6416.900..7842.634 rows=1176103 loops=1)
           Output: weird_grouping.name
           Group Key: weird_grouping.name
           Buffers: shared hit=2304 read=6012, temp read=7800 written=7830
           ->  Sort  (cost=179613.72..182856.89 rows=1297265 width=20) (actual time=6416.899..7665.158 rows=1297265
loops=1)
                 Output: weird_grouping.name
                 Sort Key: weird_grouping.name
                 Sort Method: external merge  Disk: 39048kB
                 Buffers: shared hit=2304 read=6012, temp read=7800 written=7830
                 ->  Seq Scan on name_stuff.weird_grouping  (cost=0.00..21288.65 rows=1297265 width=20) (actual
time=0.294..137.486rows=1297265 loops=1)
 
                       Output: weird_grouping.name
                       Buffers: shared hit=2304 read=6012
   ->  CTE Scan on foo  (cost=0.00..18947.12 rows=947356 width=516) (actual time=6416.902..8105.771 rows=1176103
loops=1)
         Output: foo.name
         Buffers: shared hit=2304 read=6012, temp read=7800 written=12363
 Planning Time: 0.258 ms
 Execution Time: 10305.891 ms
(24 rows)

Time: 10306.990 ms (00:10.307)

testing=>



-----Original Message-----
From: Andrew Gierth <andrew@tao11.riddles.org.uk> 
Sent: Tuesday, October 1, 2019 3:27 PM
To: David Raymond <David.Raymond@tomtom.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #16031: Group by returns duplicate groups

>>>>> "David" == David Raymond <David.Raymond@tomtom.com> writes:

 David> I checked for other places where there was weirdness going on
 David> and there are more.

You could try something like,

select name, lname
  from (select name, lag(name) over (order by name) as lname
          from big_table) s
 where name < lname;

That should show all cases where the sort order was inconsistent with
the < operator (which obviously should never happen).

However, there's not much more I can do to help with this, since I don't
use Windows myself and have no useful access to any Windows system. You
might try and cut down the data to the smallest set that shows
inconsistencies using the above; particularly relevant is whether the
problem only shows up for external merge sorts or whether it happens for
in-memory sorts too.

-- 
Andrew (irc:RhodiumToad)

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

RE: BUG #16031: Group by returns duplicate groups