Re: Selecting K random rows - efficiently!
| From | Pavel Stehule |
|---|---|
| Subject | Re: Selecting K random rows - efficiently! |
| Date | |
| Msg-id | 162867790710300333m687f4615r15a786a607c5a863@mail.gmail.com |
| In reply to | Re: Selecting K random rows - efficiently! ("vincent" <vinny@xs4all.nl>) |
| List | pgsql-general |
2007/10/30, vincent <vinny@xs4all.nl>:
>
> > 2007/10/26, Patrick TJ McPhee <ptjm@interlog.com>:
> >> In article <ffnid8$1q2t$1@news.hub.org>, cluster <skrald@amossen.dk> wrote:
> >> % > How important is true randomness?
> >> %
> >> % The goal is an even distribution but currently I have not seen any way
> >> % to produce any kind of random sampling efficiently. Notice the word
> >>
> >> How about generating the ctid randomly? You can get the number of pages
> >> from pg_class and estimate the number of rows either using the number
> >> of tuples in pg_class or just based on what you know about the data.
> >> Then just generate two series of random numbers, one from 0 to the number
> >> of pages and the other from 1 to the number of rows per page, and keep
> >> picking rows until you have enough numbers. Assuming there aren't too
> >> many dead tuples and your estimates are good, this should retrieve n rows
> >> with roughly n look-ups.
> >>
> >> If your estimates are low, there will be tuples which can never be
> >> selected, and so far as I know, there's no way to construct a random ctid
> >> in a stock postgres database, but apart from that it seems like a good
> >> plan. If efficiency is important, you could create a C function which
> >> returns a series of random tids and join on that.
> >> --
> >
> > SELECT id, ...
> >   FROM data
> >  WHERE id = ANY(ARRAY(
> >            SELECT (random()*max_id)::int
> >              FROM generate_series(1,20)))
> >  LIMIT 1;
> >
> > -- max_id is an external constant
> >
> > Pavel Stehule
>
> That selects records where the id is one of twenty random numbers between
> zero and the maximum id. That's not truly random, nor is it completely
> safe if there are lots of gaps in the values of id. For example, if the
> lowest id is 50000 and the highest is 50040, this will be very likely to
> generate lots of numbers below 50000 and find no records at all.
>

There is only one completely safe way: ORDER BY random() LIMIT 1. If you know
that your ids are not distributed uniformly from zero (for example, they start
well above zero), you have to modify the random expression to:

SELECT random()*(max_id - min_id) + min_id

This solution is far from ideal, but it is fast and good enough for some
purposes (web shops, etc.).

Pavel
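For what it is worth, a minimal sketch of that adjusted query might look like
the following. The table name "data" and the integer key "id" are assumptions,
and min_id/max_id are computed inline here rather than passed in as external
constants as in the query quoted above:

-- Sketch only: combines the id = ANY(...) lookup with the
-- random()*(max_id - min_id) + min_id adjustment discussed above.
-- "data" and "id" are placeholder names; min/max could instead be
-- cached in the application and substituted as constants.
SELECT d.*
  FROM data d
 WHERE d.id = ANY(ARRAY(
           SELECT (random() * (b.max_id - b.min_id) + b.min_id)::int
             FROM (SELECT min(id) AS min_id, max(id) AS max_id
                     FROM data) b,
                  generate_series(1, 20)))
 LIMIT 1;

As with the original query, some of the 20 generated ids may fall into gaps
and match nothing, which is why several candidates are generated before
LIMIT 1 picks one row.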