Обсуждение: joining two tables slow due to sequential scan
= clinicaldocuments.dssdocumentidentifier where documentversions.documentstatus = 'AC';
= documentversions .documentidentifier where clinicaldocuments.patientidentifier= 123;
What version of postgres are you using? Can you post the output from EXPLAIN ANALYZE?
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:07 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] joining two tables slow due to sequential scan
I am trying to join two tables and keep getting a sequential scan in the plan even though there is an index on the columns I am joining on. Basically this the deal ... I have two tables with docid in them which is what I am using for the join.
ClinicalDocs ... (no primary key) though it does not help if I make docid primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
...
Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)
It seems to do an index scan if I put the primary key as docid. This is what occurs when I link on the patid from ClinicalDocs to patient table. However I can not make the docid primary key because it gets repeated depending on how may versions of a document I have. I have tried using a foreign key on documentversions with no sucess.
In addition this query
select * from documentversions join clinicaldocuments on documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where documentversions.documentstatus = 'AC';
does index scan
but if I change the order e.g
select * from clinicaldocuments join documentversions on clinicaldocuments.dssdocumentidentifier
= documentversions .documentidentifier where clinicaldocuments.patientidentifier= 123;
does sequential scan what I need is bottom query
it is extremely slow ... Any ideas ?
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
On Fri, 2006-02-10 at 16:06, Tim Jones wrote: > > I am trying to join two tables and keep getting a sequential scan in > the plan even though there is an index on the columns I am joining > on. Basically this the deal ... I have two tables with docid in them > which is what I am using for the join. > SNIP > select * from documentversions join clinicaldocuments on > documentversions.documentidentifier > = clinicaldocuments.dssdocumentidentifier where > documentversions.documentstatus = 'AC'; > > does index scan > but if I change the order e.g > > select * from clinicaldocuments join documentversions on > clinicaldocuments.dssdocumentidentifier > = documentversions .documentidentifier where > clinicaldocuments.patientidentifier= 123; OK. I'm gonna make a couple of guesses here: 1: clinicaldocuments.patientidentifier is an int8 and you're running 7.4 or before. 2: There are more rows with clinicaldocuments.patientidentifier= 123 than with documentversions.documentstatus = 'AC'. 3: documentversions.documentidentifier and clinicaldocuments.dssdocumentidentifier are not the same type. Any of those things true?
OK. I'm gonna make a couple of guesses here: 1: clinicaldocuments.patientidentifier is an int8 and you're running 7.4 or before. -- nope int4 and 8.1 2: There are more rows with clinicaldocuments.patientidentifier= 123 than with documentversions.documentstatus = 'AC'. -- nope generally speaking all statuses are 'AC' 3: documentversions.documentidentifier and clinicaldocuments.dssdocumentidentifier are not the same type. -- nope both int4 Any of those things true?
On Fri, 2006-02-10 at 16:35, Tim Jones wrote: > OK. I'm gonna make a couple of guesses here: > > 1: clinicaldocuments.patientidentifier is an int8 and you're running > 7.4 or before. > > -- nope int4 and 8.1 > > 2: There are more rows with clinicaldocuments.patientidentifier= 123 > than with documentversions.documentstatus = 'AC'. > > -- nope generally speaking all statuses are 'AC' > > 3: documentversions.documentidentifier and > clinicaldocuments.dssdocumentidentifier are not the same type. > > -- nope both int4 OK then, I guess we'll need to see the explain analyze output of both of those queries.
for first query
QUERY PLAN
'Limit (cost=4.69..88.47 rows=10 width=1350) (actual
time=32.195..32.338 rows=10 loops=1)'
' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual
time=32.190..32.316 rows=10 loops=1)'
' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40
rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
' Recheck Cond: (documentstatus = ''AC''::bpchar)'
' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69
rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
' Index Cond: (documentstatus = ''AC''::bpchar)'
' -> Index Scan using ix_cdocdid on clinicaldocuments
(cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1
loops=10)'
' Index Cond: ("outer".documentidentifier =
clinicaldocuments.dssdocumentidentifier)'
for second query
QUERY PLAN
'Hash Join (cost=899.83..4384.17 rows=482 width=1350)'
' Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)'
' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368
width=996)'
' -> Hash (cost=898.62..898.62 rows=482 width=354)'
' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62
rows=482 width=354)'
' Recheck Cond: (patientidentifier = 123)'
' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69
rows=482 width=0)'
' Index Cond: (patientidentifier = 123)'
thnx
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
________________________________
From: Dave Dutcher [mailto:dave@tridecap.com]
Sent: Friday, February 10, 2006 5:15 PM
To: Tim Jones; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] joining two tables slow due to sequential scan
What version of postgres are you using? Can you post the output from
EXPLAIN ANALYZE?
-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:07 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] joining two tables slow due to sequential scan
I am trying to join two tables and keep getting a sequential scan in the
plan even though there is an index on the columns I am joining on.
Basically this the deal ... I have two tables with docid in them which
is what I am using for the join.
ClinicalDocs ... (no primary key) though it does not help if I make
docid primary key
docid integer (index)
patientid integer (index)
visitid integer (index)
...
Documentversions
docid integer (index)
docversionnumber (index)
docversionidentifier (primary key)
It seems to do an index scan if I put the primary key as docid. This is
what occurs when I link on the patid from ClinicalDocs to patient table.
However I can not make the docid primary key because it gets repeated
depending on how may versions of a document I have. I have tried using
a foreign key on documentversions with no sucess.
In addition this query
select * from documentversions join clinicaldocuments on
documentversions.documentidentifier
= clinicaldocuments.dssdocumentidentifier where
documentversions.documentstatus = 'AC';
does index scan
but if I change the order e.g
select * from clinicaldocuments join documentversions on
clinicaldocuments.dssdocumentidentifier
= documentversions .documentidentifier where
clinicaldocuments.patientidentifier= 123;
does sequential scan what I need is bottom query
it is extremely slow ... Any ideas ?
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
On Fri, 2006-02-10 at 16:37, Tim Jones wrote:
> for first query
>
> QUERY PLAN
> 'Limit (cost=4.69..88.47 rows=10 width=1350) (actual
> time=32.195..32.338 rows=10 loops=1)'
> ' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual
> time=32.190..32.316 rows=10 loops=1)'
> ' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40
> rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
> ' Recheck Cond: (documentstatus = ''AC''::bpchar)'
> ' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69
> rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
> ' Index Cond: (documentstatus = ''AC''::bpchar)'
> ' -> Index Scan using ix_cdocdid on clinicaldocuments
> (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1
> loops=10)'
> ' Index Cond: ("outer".documentidentifier =
> clinicaldocuments.dssdocumentidentifier)'
>
>
> for second query
>
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)'
> ' Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
> ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368
> width=996)'
> ' -> Hash (cost=898.62..898.62 rows=482 width=354)'
> ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62
> rows=482 width=354)'
> ' Recheck Cond: (patientidentifier = 123)'
> ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69
> rows=482 width=0)'
> ' Index Cond: (patientidentifier = 123)'
OK, the first one is explain analyze, but the second one is just plain
explain...
oops
QUERY PLAN
'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
time=0.203..0.203 rows=0 loops=1)'
' Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)'
' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368
width=996) (actual time=0.007..0.007 rows=1 loops=1)'
' -> Hash (cost=898.62..898.62 rows=482 width=354) (actual
time=0.161..0.161 rows=0 loops=1)'
' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62
rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)'
' Recheck Cond: (patientidentifier = 123)'
' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69
rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)'
' Index Cond: (patientidentifier = 123)'
'Total runtime: 0.392 ms'
note I have done these on a smaller db than what I am using but the
plans are the same
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
-----Original Message-----
From: Scott Marlowe [mailto:smarlowe@g2switchworks.com]
Sent: Friday, February 10, 2006 5:39 PM
To: Tim Jones
Cc: Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
On Fri, 2006-02-10 at 16:37, Tim Jones wrote:
> for first query
>
> QUERY PLAN
> 'Limit (cost=4.69..88.47 rows=10 width=1350) (actual
> time=32.195..32.338 rows=10 loops=1)'
> ' -> Nested Loop (cost=4.69..4043.09 rows=482 width=1350) (actual
> time=32.190..32.316 rows=10 loops=1)'
> ' -> Bitmap Heap Scan on documentversions (cost=4.69..1139.40
> rows=482 width=996) (actual time=32.161..32.171 rows=10 loops=1)'
> ' Recheck Cond: (documentstatus = ''AC''::bpchar)'
> ' -> Bitmap Index Scan on ix_docstatus (cost=0.00..4.69
> rows=482 width=0) (actual time=31.467..31.467 rows=96368 loops=1)'
> ' Index Cond: (documentstatus = ''AC''::bpchar)'
> ' -> Index Scan using ix_cdocdid on clinicaldocuments
> (cost=0.00..6.01 rows=1 width=354) (actual time=0.006..0.007 rows=1
> loops=10)'
> ' Index Cond: ("outer".documentidentifier =
> clinicaldocuments.dssdocumentidentifier)'
>
>
> for second query
>
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)'
> ' Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
> ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368
> width=996)'
> ' -> Hash (cost=898.62..898.62 rows=482 width=354)'
> ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62
> rows=482 width=354)'
> ' Recheck Cond: (patientidentifier = 123)'
> ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69
> rows=482 width=0)'
> ' Index Cond: (patientidentifier = 123)'
OK, the first one is explain analyze, but the second one is just plain
explain...
"Tim Jones" <TJones@optio.com> writes:
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350)'
> ' Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
This is not EXPLAIN ANALYZE output. Also, the rowcount estimates
seem far enough off in the other query to make me wonder how long
it's been since you ANALYZEd the tables...
More generally, though, I don't see anything particularly wrong
with this query plan. You're selecting enough of the table that
an indexscan isn't necessarily a good plan.
regards, tom lane
On Fri, 2006-02-10 at 16:43, Tim Jones wrote:
> oops
>
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ' Hash Cond: ("outer".documentidentifier =
> "inner".dssdocumentidentifier)'
> ' -> Seq Scan on documentversions (cost=0.00..2997.68 rows=96368
> width=996) (actual time=0.007..0.007 rows=1 loops=1)'
> ' -> Hash (cost=898.62..898.62 rows=482 width=354) (actual
> time=0.161..0.161 rows=0 loops=1)'
> ' -> Bitmap Heap Scan on clinicaldocuments (cost=4.69..898.62
> rows=482 width=354) (actual time=0.159..0.159 rows=0 loops=1)'
> ' Recheck Cond: (patientidentifier = 123)'
> ' -> Bitmap Index Scan on ix_cdocpid (cost=0.00..4.69
> rows=482 width=0) (actual time=0.153..0.153 rows=0 loops=1)'
> ' Index Cond: (patientidentifier = 123)'
> 'Total runtime: 0.392 ms'
>
> note I have done these on a smaller db than what I am using but the
> plans are the same
Hmmmm. We really need to see what's happening on the real database to
see what's going wrong. i.e. if the real database thinks it'll get 30
rows and it gets back 5,000,000 that's a problem.
The query planner in pgsql is cost based, so until you have real data
underneath it, and analyze it, you can't really say how it will behave
for you. I.e. small test sets don't work.
"Tim Jones" <TJones@optio.com> writes:
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'
Hardly seems like evidence of a performance problem ...
regards, tom lane
ok here is real db
the first query I had seems to make no sense because it is only fast if
I limit the rows since almost all rows have status = 'AC'
second query
tables both have about 10 million rows and it takes a long time as you
can see but this person only has approx 160 total documents
QUERY PLAN
------------------------------------------------------------------------
-------------------------------------------------------------------
Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual
time=83266.854..91166.315 rows=3 loops=1)
Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)
-> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398
width=415) (actual time=0.056..49812.459 rows=9677398 loops=1)
-> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual
time=10.833..10.833 rows=3 loops=1)
-> Bitmap Heap Scan on clinicaldocuments
(cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258
rows=3 loops=1)
Recheck Cond: (patientidentifier = 690193)
-> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13
rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1)
Index Cond: (patientidentifier = 690193)
Total runtime: 91166.540 ms
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, February 10, 2006 5:52 PM
To: Tim Jones
Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
"Tim Jones" <TJones@optio.com> writes:
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'
Hardly seems like evidence of a performance problem ...
regards, tom lane
OK, if I'm reading this correctly, it looks like the planner is choosing
a sequential scan because it expects 48,000 rows for that
patientidentifier, but its actually only getting 3. The planner has the
number of rows right for the sequential scan, so it seems like the stats
are up to date. I would try increasing the stats for the
patientindentifier column with 'alter table set statistics...' or
increasing the default_statistics_target for the whole DB. Once you
have changed the stats I believe you need to run analyze again.
-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:59 PM
To: Tom Lane
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
ok here is real db
the first query I had seems to make no sense because it is only fast if
I limit the rows since almost all rows have status = 'AC'
second query
tables both have about 10 million rows and it takes a long time as you
can see but this person only has approx 160 total documents
QUERY PLAN
------------------------------------------------------------------------
-------------------------------------------------------------------
Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual
time=83266.854..91166.315 rows=3 loops=1)
Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)
-> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398
width=415) (actual time=0.056..49812.459 rows=9677398 loops=1)
-> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual
time=10.833..10.833 rows=3 loops=1)
-> Bitmap Heap Scan on clinicaldocuments
(cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258
rows=3 loops=1)
Recheck Cond: (patientidentifier = 690193)
-> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13
rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1)
Index Cond: (patientidentifier = 690193)
Total runtime: 91166.540 ms
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, February 10, 2006 5:52 PM
To: Tim Jones
Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
"Tim Jones" <TJones@optio.com> writes:
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'
Hardly seems like evidence of a performance problem ...
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings
ok I am retarded :) Apparently I thought I had done analyze on these
tables but I actually had not and that was all that was needed. but
thanks for the help.
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
-----Original Message-----
From: Dave Dutcher [mailto:dave@tridecap.com]
Sent: Friday, February 10, 2006 6:25 PM
To: Tim Jones
Cc: pgsql-performance@postgresql.org
Subject: RE: [PERFORM] joining two tables slow due to sequential scan
OK, if I'm reading this correctly, it looks like the planner is choosing
a sequential scan because it expects 48,000 rows for that
patientidentifier, but its actually only getting 3. The planner has the
number of rows right for the sequential scan, so it seems like the stats
are up to date. I would try increasing the stats for the
patientindentifier column with 'alter table set statistics...' or
increasing the default_statistics_target for the whole DB. Once you
have changed the stats I believe you need to run analyze again.
-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tim Jones
Sent: Friday, February 10, 2006 4:59 PM
To: Tom Lane
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
ok here is real db
the first query I had seems to make no sense because it is only fast if
I limit the rows since almost all rows have status = 'AC'
second query
tables both have about 10 million rows and it takes a long time as you
can see but this person only has approx 160 total documents
QUERY PLAN
------------------------------------------------------------------------
-------------------------------------------------------------------
Hash Join (cost=84813.14..1510711.97 rows=48387 width=555) (actual
time=83266.854..91166.315 rows=3 loops=1)
Hash Cond: ("outer".documentidentifier =
"inner".dssdocumentidentifier)
-> Seq Scan on documentversions (cost=0.00..269141.98 rows=9677398
width=415) (actual time=0.056..49812.459 rows=9677398 loops=1)
-> Hash (cost=83660.05..83660.05 rows=48036 width=140) (actual
time=10.833..10.833 rows=3 loops=1)
-> Bitmap Heap Scan on clinicaldocuments
(cost=301.13..83660.05 rows=48036 width=140) (actual time=0.243..0.258
rows=3 loops=1)
Recheck Cond: (patientidentifier = 690193)
-> Bitmap Index Scan on ix_cdocpid (cost=0.00..301.13
rows=48036 width=0) (actual time=0.201..0.201 rows=3 loops=1)
Index Cond: (patientidentifier = 690193) Total
runtime: 91166.540 ms
Tim Jones
Healthcare Project Manager
Optio Software, Inc.
(770) 576-3555
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, February 10, 2006 5:52 PM
To: Tim Jones
Cc: Scott Marlowe; Dave Dutcher; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] joining two tables slow due to sequential scan
"Tim Jones" <TJones@optio.com> writes:
> QUERY PLAN
> 'Hash Join (cost=899.83..4384.17 rows=482 width=1350) (actual
> time=0.203..0.203 rows=0 loops=1)'
> ...
> 'Total runtime: 0.392 ms'
Hardly seems like evidence of a performance problem ...
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings