Re: Querying distinct values from a large table
From | Ron |
---|---|
Subject | Re: Querying distinct values from a large table |
Date | |
Msg-id | E1HCGjN-0003Gp-GQ@elasmtp-kukur.atl.sa.earthlink.net |
In reply to | Re: Querying distinct values from a large table (Gregory Stark <gsstark@mit.edu>) |
List | pgsql-performance |
I strongly encourage anyone who is interested in the general external sorting problem to peruse Jim Gray's site: http://research.microsoft.com/barc/SortBenchmark/

Ron Peacetree

At 08:24 AM 1/31/2007, Gregory Stark wrote:
>Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > Gregory Stark wrote:
> > >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependent on
> > >> your data distribution. It's not hard to come up with distributions where it's
> > >> 1000x as fast and others where there's no speed difference.)
> > >
> > > So the figure is really "1-1000x"? I bet this one is more impressive in
> > > PHB terms.
> >
> > Luke has a bad habit of quoting numbers that are obviously derived from
> > narrow benchmarking scenarios as Universal Truths, rather than providing
> > the context they were derived in. I wish he'd stop doing that...
>
>In fairness I may have exaggerated a bit there. There is a limit to how much
>of a speedup you can get in valid benchmarking situations. A single sequential
>scan is always going to be necessary, so you're only saving the cost of writing
>out the temporary file and subsequent merge passes.
>
>It's hard to generate lots of intermediate merge passes since there are only
>O(log(n)) of them. So to get a 1000x speedup on a large I/O-bound sort you would
>have to be sorting something on the order of 2^1000 records, which is ridiculous.
>Realistically you should only be able to save 2-5 intermediate merge passes.
>
>On the other hand, there are some common situations where you could see atypical
>increases. Consider joining a bunch of small tables to generate a large result
>set. The small tables are probably all in memory and the result set may only
>have a small number of distinct values. If you throw out the duplicates early
>you save *all* the I/O. If you have to do a disk sort it could be many orders
>slower.
>
>This is actually not an uncommon coding idiom for MySQL programmers accustomed
>to fast DISTINCT working around the lack of subqueries and poor performance of
>IN and EXISTS. They often just join together all the tables in a big cross
>join and then toss in a DISTINCT at the top to get rid of the duplicates.
>
>--
>  Gregory Stark
>  EnterpriseDB          http://www.enterprisedb.com
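The "throw out the duplicates early" point above can be sketched with a toy model (this is an illustrative Python sketch, not PostgreSQL's actual executor code): a hash-based DISTINCT discards duplicates as rows stream in, so its working set grows only with the number of distinct values, while a sort-based DISTINCT must materialize and order the entire input first.

```python
# Toy comparison of sort-based vs hash-based DISTINCT over a row stream.
# Hypothetical example; function names and data are illustrative only.

def distinct_via_sort(rows):
    # Sort-based DISTINCT: materialize and sort everything, then skip
    # adjacent duplicates. Cost scales with the *total* input size --
    # for a large input this is the on-disk sort case from the thread.
    out = []
    prev = object()  # sentinel unequal to any row
    for r in sorted(rows):
        if r != prev:
            out.append(r)
            prev = r
    return out

def distinct_via_hash(rows):
    # Hash-based DISTINCT: drop duplicates as they arrive. Memory and
    # output scale with the number of *distinct* values, so a huge
    # cross-join result with few distinct values stays cheap.
    seen = set()
    out = []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

# A large stream with only five distinct values, like the cross-join
# scenario described above:
stream = (i % 5 for i in range(100_000))
print(sorted(distinct_via_hash(stream)))  # five values, never 100k in memory
```

In the cross-join-plus-DISTINCT idiom described above, the hash approach never holds more than the handful of distinct values, which is why it can beat a disk sort by far more than the usual 2-5x merge-pass saving.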