Re: Using quicksort for every external sort run
От | Peter Geoghegan |
---|---|
Тема | Re: Using quicksort for every external sort run |
Дата | |
Msg-id | CAM3SWZR-0uURKesmr8+mWR=vieqaRKsM6_tbRFzn9rw-GO6pCw@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Using quicksort for every external sort run (Greg Stark <stark@mit.edu>) |
Список | pgsql-hackers |
On Mon, Feb 15, 2016 at 3:45 PM, Greg Stark <stark@mit.edu> wrote: > I was thinking about this over the past couple weeks. I'm starting to > think the quicksort runs gives at least the beginnings of a way > forward on this front. As I've already pointed out several times, I wrote a tool that makes it easy to load sortbenchmark.org data into a PostgreSQL table: https://github.com/petergeoghegan/gensort (You should use the Python script that invokes the "gensort" utility -- see its "--help" display for details). This seems useful as a standard benchmark, since it's perfectly deterministic, allowing the user to create arbitrarily large tables to use for sort benchmarks. Still, it doesn't produce data that is any way organic; sort data is uniformly distributed. Also, it produces a table that really only has one attribute to sort on, a text attribute. I suggest looking at real world data, too. I have downloaded UK land registry data, which is a freely available dataset about property sales in the UK since the 1990s, of which there have apparently been about 20 million (I started with a 20 million line CSV file). I've used COPY to load the data into one PostgreSQL table. I attach instructions on how to recreate this, and some suggested CREATE INDEX statements that seemed representative to me. There are a variety of Postgres data types in use, including UUID, numeric, and text. The final Postgres table is just under 3GB. I will privately make available a URL that those CC'd here can use to download a custom format dump of the table, which comes in at 1.1GB (ask me off-list if you'd like to get that URL, but weren't CC'd here). This URL is provided as a convenience for reviewers, who can skip my detailed instructions. An expensive rollup() query on the "land_registry_price_paid_uk" table is interesting. Example: select date_trunc('year', transfer_date), county, district, city, sum(price) from land_registry_price_paid_uk group by rollup (1, county, district, city); Performance is within ~5% of an *internal* sort with the patch series applied, even though ~80% of time is spent copying and sorting SortTuples overall in the internal sort case (the internal case cannot overlap sorting and aggregate processing, since it has no final merge step). This is a nice demonstration of how this work has significantly blurred the line between internal and external sorts. -- Peter Geoghegan
Вложения
В списке pgsql-hackers по дате отправления: