Обсуждение: COUNT and Performance ...

Поиск
Список
Период
Сортировка

COUNT and Performance ...

От
Hans-Jürgen Schönig
Дата:
This patch adds a note to the documentation describing why the
performance of min() and max() is slow when applied to the entire table,
and suggesting the simple workaround most experienced Pg users
eventually learn about (SELECT xyz ... ORDER BY xyz LIMIT 1).

Any suggestions on improving the wording of this section would be
welcome.

Cheers,


------

ORDER and LIMIT work pretty fast (no seq scan).
In special cases there can be another way to avoid seq scans:


action=# select tuple_count from pgstattuple('t_text');tuple_count
-------------      14203
(1 row)

action=# BEGIN;
BEGIN
action=# insert into t_text (suchid) VALUES ('100000');
INSERT 578606 1
action=# select tuple_count from pgstattuple('t_text');tuple_count
-------------      14204
(1 row)

action=# ROLLBACK;
ROLLBACK
action=# select tuple_count from pgstattuple('t_text');tuple_count
-------------      14203
(1 row)


If people want to count ALL rows of a table. The contrib stuff is pretty 
useful. It seems to be transaction safe.

The performance boost is great (PostgreSQL 7.3, RedHat, 166Mhz):


root@actionscouts:~# time psql action -c "select tuple_count from 
pgstattuple('t_text');"tuple_count
-------------      14203
(1 row)


real    0m0.266s
user    0m0.030s
sys     0m0.020s
root@actionscouts:~# time psql action -c "select count(*) from t_text"count
-------14203
(1 row)


real    0m0.701s
user    0m0.040s
sys     0m0.010s


I think that this could be a good workaround for huge counts (maybe 
millions of records) with no where clause and no joins.
   Hans

<http://kernel.cybertec.at>



Re: COUNT and Performance ...

От
Neil Conway
Дата:
On Sun, 2003-02-02 at 03:55, Hans-Jürgen Schönig wrote:
> If people want to count ALL rows of a table. The contrib stuff is pretty
> useful. It seems to be transaction safe.

Interesting -- I didn't know about the contrib stuff. I'll update the
docs patch.

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: COUNT and Performance ...

От
Hans-Jürgen Schönig
Дата:
Neil Conway wrote:

>On Sun, 2003-02-02 at 03:55, Hans-Jürgen Schönig wrote:
>
>
>>If people want to count ALL rows of a table. The contrib stuff is pretty
>>useful. It seems to be transaction safe.
>>
>>
>
>Interesting -- I didn't know about the contrib stuff. I'll update the
>docs patch.
>
>Cheers,
>
>Neil
>
>

Personall I think a system table containing information about all tables
in the database would be best. The contrib stuff provides a lot of
useful information which would be perfect for a system view (dead
tuples, number of records, ...). I guess many people would consider that
to be useful. This would not fix the COUNT problem but it would provide
the most essential information people need: The number of records in the
table.

Maybe this is worth discussing.
   Regards,
       Hans

--
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at <http://www.postgresql.at>, cluster.postgresql.at
<http://cluster.postgresql.at>, www.cybertec.at
<http://www.cybertec.at>, kernel.cybertec.at <http://kernel.cybertec.at>




Re: COUNT and Performance ...

От
Tom Lane
Дата:
Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> In special cases there can be another way to avoid seq scans:
> [ use pgstattuple() ]

But pgstattuple does do a sequential scan of the table.  You avoid a lot
of the executor's tuple-pushing and plan-node-traversing machinery that
way, but the I/O requirement is going to be exactly the same.

> If people want to count ALL rows of a table. The contrib stuff is pretty 
> useful. It seems to be transaction safe.

Not entirely.  pgstattuple uses HeapTupleSatisfiesNow(), which means you
get a count of tuples that are committed good in terms of the effects of
transactions committed up to the instant each tuple is examined.  This
is in general different from what count(*) would tell you, because it
ignores snapshotting.  It'd be quite unrepeatable too, in the face of
active concurrent changes --- it's very possible for pgstattuple to
count a single row twice or not at all, if it's being concurrently
updated and the other transaction commits between the times pgstattuple
sees the old and new versions of the row.

> The performance boost is great (PostgreSQL 7.3, RedHat, 166Mhz):

I think your test case is small enough that the whole table is resident
in memory, so this measurement only accounts for CPU time per tuple and
not any I/O.  Given the small size of pgstattuple's per-tuple loop, the
speed differential is not too surprising --- but it won't scale up to
larger tables.

Sometime it would be interesting to profile count(*) on large tables
and see exactly where the CPU time goes.  It might be possible to shave
off some of the executor overhead ...
        regards, tom lane


Re: COUNT and Performance ...

От
Neil Conway
Дата:
On Sun, 2003-02-02 at 13:04, Tom Lane wrote:
> I think your test case is small enough that the whole table is resident
> in memory, so this measurement only accounts for CPU time per tuple and
> not any I/O.  Given the small size of pgstattuple's per-tuple loop, the
> speed differential is not too surprising --- but it won't scale up to
> larger tables.

Good observation.

When the entire table is in cache, pgstattuple about 4 times faster than
count(*) on my machine. When the table is too large to fit into cache,
the performance difference drops to 8% in favour of pgstattuple:

nconway=# select count(*) from big_table; count  
---------8388612
(1 row)

Time: 26769.99 ms
nconway=# SELECT tuple_count FROM pgstattuple('big_table');tuple_count 
-------------    8388612
(1 row)

Time: 24658.87 ms

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC





Re: COUNT and Performance ...

От
Hans-Jürgen Schönig
Дата:
>
>
>But pgstattuple does do a sequential scan of the table.  You avoid a lot
>of the executor's tuple-pushing and plan-node-traversing machinery that
>way, but the I/O requirement is going to be exactly the same.
>  
>

I have tried it more often so that I can be sure that everything is in 
the cache.
I thought it did some sort of "stat" on tables. Too bad :(.


>>If people want to count ALL rows of a table. The contrib stuff is pretty 
>>useful. It seems to be transaction safe.
>>    
>>
>
>Not entirely.  pgstattuple uses HeapTupleSatisfiesNow(), which means you
>get a count of tuples that are committed good in terms of the effects of
>transactions committed up to the instant each tuple is examined.  This
>is in general different from what count(*) would tell you, because it
>ignores snapshotting.  It'd be quite unrepeatable too, in the face of
>active concurrent changes --- it's very possible for pgstattuple to
>count a single row twice or not at all, if it's being concurrently
>updated and the other transaction commits between the times pgstattuple
>sees the old and new versions of the row.
>  
>
Interesting. I have tried it with concurrent sessions and transactions - 
the results seemed to be right (I could not see the records inserted by 
open transactions). Too bad :(. It would have been a nice work around.


>>The performance boost is great (PostgreSQL 7.3, RedHat, 166Mhz
>>
>
>I think your test case is small enough that the whole table is resident
>in memory, so this measurement only accounts for CPU time per tuple and
>not any I/O.  Given the small size of pgstattuple's per-tuple loop, the
>speed differential is not too surprising --- but it won't scale up to
>larger tables.
>
>Sometime it would be interesting to profile count(*) on large tables
>and see exactly where the CPU time goes.  It might be possible to shave
>off some of the executor overhead ...
>
>            regards, tom lane
>  
>

I have tried it with the largest table on my testing system.
Reducing the overhead is great :).
   Thanks a lot,
       Hans

-- 
*Cybertec Geschwinde u Schoenig*
Ludo-Hartmannplatz 1/14, A-1160 Vienna, Austria
Tel: +43/1/913 68 09; +43/664/233 90 75
www.postgresql.at <http://www.postgresql.at>, cluster.postgresql.at 
<http://cluster.postgresql.at>, www.cybertec.at 
<http://www.cybertec.at>, kernel.cybertec.at <http://kernel.cybertec.at>




Re: COUNT and Performance ...

От
Arjen van der Meijden
Дата:
For a more accurate view of the time used, use the \timing switch in psql.
That leaves out the overhead for forking and loading psql, connecting to
the database and such things.

I think, that it would be even nicer if postgresql automatically choose
to replace the count(*)-with-no-where with something similar.

Regards,

Arjen

Hans-Jürgen Schönig wrote:
> This patch adds a note to the documentation describing why the
> performance of min() and max() is slow when applied to the entire table,
> and suggesting the simple workaround most experienced Pg users
> eventually learn about (SELECT xyz ... ORDER BY xyz LIMIT 1).
>
> Any suggestions on improving the wording of this section would be
> welcome.
>
> Cheers,
>
>
> ------
>
> ORDER and LIMIT work pretty fast (no seq scan).
> In special cases there can be another way to avoid seq scans:
>
>
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14203
> (1 row)
>
> action=# BEGIN;
> BEGIN
> action=# insert into t_text (suchid) VALUES ('100000');
> INSERT 578606 1
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14204
> (1 row)
>
> action=# ROLLBACK;
> ROLLBACK
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14203
> (1 row)
>
>
> If people want to count ALL rows of a table. The contrib stuff is pretty
> useful. It seems to be transaction safe.
>
> The performance boost is great (PostgreSQL 7.3, RedHat, 166Mhz):
>
>
> root@actionscouts:~# time psql action -c "select tuple_count from
> pgstattuple('t_text');"
>  tuple_count
> -------------
>        14203
> (1 row)
>
>
> real    0m0.266s
> user    0m0.030s
> sys     0m0.020s
> root@actionscouts:~# time psql action -c "select count(*) from t_text"
>  count
> -------
>  14203
> (1 row)
>
>
> real    0m0.701s
> user    0m0.040s
> sys     0m0.010s
>
>
> I think that this could be a good workaround for huge counts (maybe
> millions of records) with no where clause and no joins.
>
>     Hans
>
> <http://kernel.cybertec.at>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org



Re: COUNT and Performance ...

От
Bruce Momjian
Дата:
I didn't think pgstattuple had proper visibility checks.


---------------------------------------------------------------------------

Hans-J�rgen Sch�nig wrote:
> This patch adds a note to the documentation describing why the
> performance of min() and max() is slow when applied to the entire table,
> and suggesting the simple workaround most experienced Pg users
> eventually learn about (SELECT xyz ... ORDER BY xyz LIMIT 1).
>
> Any suggestions on improving the wording of this section would be
> welcome.
>
> Cheers,
>
>
> ------
>
> ORDER and LIMIT work pretty fast (no seq scan).
> In special cases there can be another way to avoid seq scans:
>
>
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14203
> (1 row)
>
> action=# BEGIN;
> BEGIN
> action=# insert into t_text (suchid) VALUES ('100000');
> INSERT 578606 1
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14204
> (1 row)
>
> action=# ROLLBACK;
> ROLLBACK
> action=# select tuple_count from pgstattuple('t_text');
>  tuple_count
> -------------
>        14203
> (1 row)
>
>
> If people want to count ALL rows of a table. The contrib stuff is pretty
> useful. It seems to be transaction safe.
>
> The performance boost is great (PostgreSQL 7.3, RedHat, 166Mhz):
>
>
> root@actionscouts:~# time psql action -c "select tuple_count from
> pgstattuple('t_text');"
>  tuple_count
> -------------
>        14203
> (1 row)
>
>
> real    0m0.266s
> user    0m0.030s
> sys     0m0.020s
> root@actionscouts:~# time psql action -c "select count(*) from t_text"
>  count
> -------
>  14203
> (1 row)
>
>
> real    0m0.701s
> user    0m0.040s
> sys     0m0.010s
>
>
> I think that this could be a good workaround for huge counts (maybe
> millions of records) with no where clause and no joins.
>
>     Hans
>
> <http://kernel.cybertec.at>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073 


Re: COUNT and Performance ...

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I didn't think pgstattuple had proper visibility checks.

It doesn't, see followup discussion.
        regards, tom lane