Re: add modulo (%) operator to pgbench
From | Heikki Linnakangas |
---|---|
Subject | Re: add modulo (%) operator to pgbench |
Date | |
Msg-id | 54229CB6.5010608@vmware.com |
In reply to | Re: add modulo (%) operator to pgbench (Fabien COELHO <coelho@cri.ensmp.fr>) |
Responses | Re: add modulo (%) operator to pgbench |
List | pgsql-hackers |
On 09/24/2014 10:45 AM, Fabien COELHO wrote:
> Currently these distributions are achieved by mapping a continuous
> function onto integers, so that neighboring integers get neighboring
> number of draws, say with size=7:
>
>   #draws     10  6  3  1  0  0  0   // some exponential distribution
>   int drawn   0  1  2  3  4  5  6
>
> Although having an exponential distribution of accesses on tuples is quite
> reasonable, the likelihood there would be so much correlation between
> neighboring values is not realistic at all. You need some additional
> shuffling to get there.
>
>> I don't understand what that pseudo-random stage you're talking about is. Can
>> you elaborate?
>
> The pseudo random stage is just a way to scatter the values. A basic
> approach to achieve this is "i' = (i * large-prime) % size", if you have a
> modulo. For instance with prime=5 you may get something like:
>
>   #draws     10  6  3  1  0  0  0
>   int drawn   0  1  2  3  4  5  6   (i)
>   scattered   0  5  3  1  6  4  2   (i' = 5 i % 7)
>
> So the distribution becomes:
>
>   #draws     10  1  0  3  0  6  0
>   scattered   0  1  2  3  4  5  6
>
> Which is more interesting from a testing perspective because it removes
> the neighboring value correlation.

Depends on what you're testing. Yeah, shuffling like that makes sense
for a primary key. Or not: very often, recently inserted rows are also
queried more often, so that there is indeed a strong correlation between
the integer key and the access frequency. Or imagine that you have a
table that stores the height of people in centimeters. To populate that,
you would want to use a gaussian distributed variable, without shuffling.

For shuffling, perhaps we should provide a pgbench function or operator
that does that directly, instead of having to implement it using * and
%. Something like hash(x, min, max), where x is the input variable
(gaussian distributed, or whatever you want), and min and max are the
range to map it to.
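To make the two approaches above concrete, here is a minimal Python sketch, not pgbench code. `scatter` reproduces the quoted "i' = (i * prime) % size" mapping with the numbers from the example (multiplier 5, size 7). `hash_to_range` is a hypothetical stand-in for the proposed hash(x, min, max); its name, its use of SHA-256, and its inclusive bounds are assumptions made only for illustration, since the message does not specify any of them.

```python
import hashlib
from math import gcd


def scatter(i: int, multiplier: int, size: int) -> int:
    """Scatter i in [0, size) using i' = (i * multiplier) % size.

    The mapping is a permutation of [0, size) only when multiplier and
    size are coprime, so that condition is checked explicitly.
    """
    if gcd(multiplier, size) != 1:
        raise ValueError("multiplier and size must be coprime")
    return (i * multiplier) % size


def hash_to_range(x: int, lo: int, hi: int) -> int:
    """Hypothetical hash(x, min, max): map x into [lo, hi] inclusive.

    Assumption: SHA-256 is used here only because it is in the standard
    library. Unlike scatter(), a hash is not a permutation, so two
    inputs may land on the same output value.
    """
    digest = hashlib.sha256(str(x).encode("ascii")).digest()
    return lo + int.from_bytes(digest[:8], "big") % (hi - lo + 1)


# Numbers taken from the quoted example: size=7, multiplier=5.
draws = [10, 6, 3, 1, 0, 0, 0]
size = len(draws)

scattered = [scatter(i, 5, size) for i in range(size)]

# Move each draw count to the scattered position of its original key.
redistributed = [0] * size
for i, count in enumerate(draws):
    redistributed[scatter(i, 5, size)] += count

print(scattered)      # [0, 5, 3, 1, 6, 4, 2]
print(redistributed)  # [10, 1, 0, 3, 0, 6, 0]
```

The multiplicative scatter keeps every key reachable exactly once, while a hash-based mapping trades that guarantee for not having to pick a multiplier coprime with the table size.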

> I must say that I'm appalled by a decision process which leads to such
> results, with significant patches passed, and the tiny complement to make
> it really useful (I mean not on the paper or on the feature list, but in
> real life) is rejected...

The idea of a modulo operator was not rejected, we'd just like to have
the infrastructure in place first.

- Heikki