Selecting K random rows - efficiently!
From | cluster
---|---
Subject | Selecting K random rows - efficiently!
Date |
Msg-id | ffn044$29uq$1@news.hub.org
Replies | Re: Selecting K random rows - efficiently! (3 replies)
List | pgsql-general
It has been suggested [1] that a good way to select K random rows efficiently from a table is to:

1) add a new column, random_number, to the table and initialize it with random()

2) perform the following query:

   SELECT * FROM mydata
   WHERE random_number >= (SELECT RANDOM() OFFSET 0)
   ORDER BY random_number ASC
   LIMIT <K>;

Here <K> is the number of random rows to pick, e.g. 100. (The OFFSET 0 acts as an optimization fence, so RANDOM() is evaluated only once rather than for every row.)

The benefit of this solution is that the random_number column can be indexed, allowing the query to be performed with a simple and fast index scan. (A sketch of the setup follows at the end of this post.)

However, there are a couple of major drawbacks to this method:

1) The smaller random_number is, the less likely it is to be picked by this method. For example, a row whose random_number is close to zero has only a very small probability of being selected. The remedy is to reassign random_number every now and then in order to even out the selection probabilities over time.

   PROBLEM: If the number of rows is large (e.g. 200,000, or even a million or more), the update query

      UPDATE mydata SET random_number = random();

   can be very resource-demanding, take a lot of time, and pretty much slow down all other transactions because it eats up all the resources. (A batched alternative is sketched below.)

2) The query above orders the rows by random_number and selects the first K rows whose random_number is larger than some other random number R. But what happens if there are fewer than K rows with random_number >= R? How can the rest of the random rows be picked efficiently? (A wrap-around query is sketched below.)

Any ideas on how to deal with these drawbacks?

References:
[1] http://tinyurl.com/tyg4m
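For concreteness, the setup in steps 1 and 2 might look like this (only a sketch; the index name is invented here, and double precision matches what random() returns):

   ALTER TABLE mydata ADD COLUMN random_number double precision;
   UPDATE mydata SET random_number = random();
   CREATE INDEX mydata_random_number_idx ON mydata (random_number);

With the index in place, the SELECT above becomes an index range scan that can stop after producing K rows.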
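For drawback 1, one mitigation (a sketch, assuming mydata has an integer primary key named id, which the original post does not state) is to re-randomize in small key-range batches from a periodic maintenance job rather than in one huge statement:

   -- Re-randomize one slice of the table per run; cycle the window
   -- through the whole key space over successive runs.
   UPDATE mydata
      SET random_number = random()
    WHERE id >= 0 AND id < 50000;   -- next run: 50000 <= id < 100000, ...

Each batch commits quickly and holds its row locks only briefly, so the probabilities still get evened out over time without starving other transactions.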
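For drawback 2, one idea (again only a sketch, not part of the suggestion in [1]) is to wrap around: take the rows above the threshold first, then top up from the bottom of the range. The literal 0.42 stands in for the random threshold R and 100 for K; the part column is introduced only so the above-threshold rows are preferred:

   SELECT *
   FROM (
       (SELECT m.*, 0 AS part
          FROM mydata m
         WHERE random_number >= 0.42
         ORDER BY random_number ASC
         LIMIT 100)
       UNION ALL
       (SELECT m.*, 1 AS part
          FROM mydata m
         WHERE random_number < 0.42
         ORDER BY random_number ASC
         LIMIT 100)
   ) AS u
   ORDER BY part
   LIMIT 100;

Both branches are cheap index scans on random_number, and the query always returns K rows as long as the table holds at least K.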