Обсуждение: Functions to return random numbers in a given range
Attached is a patch that adds 3 SQL-callable functions to return random integer/numeric values chosen uniformly from a given range: random(min int, max int) returns int random(min bigint, max bigint) returns bigint random(min numeric, max numeric) returns numeric The return value is in the range [min, max], and in the numeric case, the result scale equals Max(scale(min), scale(max)), so it can be used to generate large random integers, as well as decimals. The goal is to provide simple, easy-to-use functions that operate correctly over arbitrary ranges, which is trickier than it might seem using the existing random() function. The main advantages are: 1. Support for arbitrary bounds (provided that max >= min). A SQL or PL/pgSQL implementation based on the existing random() function can suffer from integer overflow if the difference max-min is too large. 2. Uniform results over the full range. It's easy to overlook the fact that in a naive implementation doing something like "((max-min)*random()+min)::int", the endpoint values will be half as likely as any other value, since casting to integer rounds to nearest. 3. Makes better use of the underlying PRNG, not limited to the 52-bits of double precision values. 4. Simpler and more efficient generation of random numeric values. This is something I have commonly wanted in the past, and have usually resorted to hacks involving multiple calls to random() to build strings of digits, which is horribly slow, and messy. The implementation moves the existing random functions to a new source file, so the new functions all share a common PRNG state with the existing random functions, and that state is kept private to that file. Regards, Dean
Вложения
Hi
čt 21. 12. 2023 v 18:06 odesílatel Dean Rasheed <dean.a.rasheed@gmail.com> napsal:
Attached is a patch that adds 3 SQL-callable functions to return
random integer/numeric values chosen uniformly from a given range:
random(min int, max int) returns int
random(min bigint, max bigint) returns bigint
random(min numeric, max numeric) returns numeric
The return value is in the range [min, max], and in the numeric case,
the result scale equals Max(scale(min), scale(max)), so it can be used
to generate large random integers, as well as decimals.
The goal is to provide simple, easy-to-use functions that operate
correctly over arbitrary ranges, which is trickier than it might seem
using the existing random() function. The main advantages are:
1. Support for arbitrary bounds (provided that max >= min). A SQL or
PL/pgSQL implementation based on the existing random() function can
suffer from integer overflow if the difference max-min is too large.
2. Uniform results over the full range. It's easy to overlook the fact
that in a naive implementation doing something like
"((max-min)*random()+min)::int", the endpoint values will be half as
likely as any other value, since casting to integer rounds to nearest.
3. Makes better use of the underlying PRNG, not limited to the 52-bits
of double precision values.
4. Simpler and more efficient generation of random numeric values.
This is something I have commonly wanted in the past, and have usually
resorted to hacks involving multiple calls to random() to build
strings of digits, which is horribly slow, and messy.
The implementation moves the existing random functions to a new source
file, so the new functions all share a common PRNG state with the
existing random functions, and that state is kept private to that
file.
+1
Regards
Pavel
Regards,
Dean
On Fri, Dec 22, 2023 at 1:07 AM Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > Attached is a patch that adds 3 SQL-callable functions to return > random integer/numeric values chosen uniformly from a given range: > > random(min int, max int) returns int > random(min bigint, max bigint) returns bigint > random(min numeric, max numeric) returns numeric > > The return value is in the range [min, max], and in the numeric case, > the result scale equals Max(scale(min), scale(max)), so it can be used > to generate large random integers, as well as decimals. > > The goal is to provide simple, easy-to-use functions that operate > correctly over arbitrary ranges, which is trickier than it might seem > using the existing random() function. The main advantages are: > > 1. Support for arbitrary bounds (provided that max >= min). A SQL or > PL/pgSQL implementation based on the existing random() function can > suffer from integer overflow if the difference max-min is too large. > Your patch works. performance is the best amount for other options in [0]. I don't have deep knowledge about which one is more random. Currently we have to explicitly mention the lower and upper bound. but can we do this: just give me an int, int means the int data type can be represented. or just give me a random bigint. but for numeric, the full numeric values that can be represented are very big. Maybe we can use the special value null to achieve this like use select random(NULL::int,null) to represent a random int in the full range of integers values can be represented. Do you think it makes sense? [0] https://www.postgresql.org/message-id/CAEG8a3LcYXjNU1f2bxMm9c6ThQsPoTcvYO_kOnifx3aGXkbgPw%40mail.gmail.com
Thank you for the patch. I applied this patch manually to the master branch, resolving a conflict in `numeric.h`. It successfully passed both `make check` and `make check-world`. Best regards, David
On Thu, 28 Dec 2023 at 07:34, jian he <jian.universality@gmail.com> wrote: > > Your patch works. > performance is the best amount for other options in [0]. > I don't have deep knowledge about which one is more random. > Thanks for testing. > Currently we have to explicitly mention the lower and upper bound. > but can we do this: > just give me an int, int means the int data type can be represented. > or just give me a random bigint. > but for numeric, the full numeric values that can be represented are very big. > > Maybe we can use the special value null to achieve this > like use > select random(NULL::int,null) > to represent a random int in the full range of integers values can be > represented. > Hmm, I don't particularly like that idea. It seems pretty ugly. Now that we support literal integers in hex, with underscores, it's relatively easy to pass INT_MIN/MAX as arguments to these functions, if that's what you need. I think if we were going to have a shorthand for getting full-range random integers, it would probably be better to introduce separate no-arg functions for that. I'm not really sure if that's a sufficiently common use case to justify the effort though. Regards, Dean
On Fri, 26 Jan 2024 at 20:44, David Zhang <david.zhang@highgo.ca> wrote: > > Thank you for the patch. > > I applied this patch manually to the master branch, resolving a conflict > in `numeric.h`. It successfully passed both `make check` and `make > check-world`. > Thanks for testing. Interestingly, the cfbot didn't pick up on the fact that it needed rebasing. Anyway, the copyright years in the new file's header comment needed updating, so here is a rebase doing that. Regards, Dean
Вложения
Hi, > Interestingly, the cfbot didn't pick up on the fact that it needed > rebasing. Anyway, the copyright years in the new file's header comment > needed updating, so here is a rebase doing that. Maybe I'm missing something but I'm not sure if I understand what this test tests particularly: ``` -- There should be no triple duplicates in 1000 full-range 32-bit random() -- values. (Each of the C(1000, 3) choices of triplets from the 1000 values -- has a probability of 1/(2^32)^2 of being a triple duplicate, so the -- average number of triple duplicates is 1000 * 999 * 998 / 6 / 2^64, which -- is roughly 9e-12.) SELECT r, count(*) FROM (SELECT random(-2147483648, 2147483647) r FROM generate_series(1, 1000)) ss GROUP BY r HAVING count(*) > 2; ``` The intent seems to be to check the fact that random numbers are distributed evenly. If this is the case I think the test is wrong. The sequence of numbers 100, 100, 100, 100, 100 is as random as 99, 8, 4, 12, 45 and every particular sequence has low probability. All in all personally I would argue that this is a meaningless test that just fails with a low probability. Same for the tests that follow below. The proper way of testing PRNG would be to call setseed() and compare return values with expected ones. I don't mind testing the proposed invariants but they should do this after calling setseed(). Currently the patch places the tests right before the call. -- Best regards, Aleksander Alekseev
On Tue, 30 Jan 2024 at 12:47, Aleksander Alekseev <aleksander@timescale.com> wrote: > > Maybe I'm missing something but I'm not sure if I understand what this > test tests particularly: > > ``` > -- There should be no triple duplicates in 1000 full-range 32-bit random() > -- values. (Each of the C(1000, 3) choices of triplets from the 1000 values > -- has a probability of 1/(2^32)^2 of being a triple duplicate, so the > -- average number of triple duplicates is 1000 * 999 * 998 / 6 / 2^64, which > -- is roughly 9e-12.) > SELECT r, count(*) > FROM (SELECT random(-2147483648, 2147483647) r > FROM generate_series(1, 1000)) ss > GROUP BY r HAVING count(*) > 2; > ``` > > The intent seems to be to check the fact that random numbers are > distributed evenly. If this is the case I think the test is wrong. The > sequence of numbers 100, 100, 100, 100, 100 is as random as 99, 8, 4, > 12, 45 and every particular sequence has low probability. All in all > personally I would argue that this is a meaningless test that just > fails with a low probability. Same for the tests that follow below. > I'm following the same approach used to test the existing random functions, and the idea is the same. For example, this existing test: -- There should be no duplicates in 1000 random() values. -- (Assuming 52 random bits in the float8 results, we could -- take as many as 3000 values and still have less than 1e-9 chance -- of failure, per https://en.wikipedia.org/wiki/Birthday_problem) SELECT r, count(*) FROM (SELECT random() r FROM generate_series(1, 1000)) ss GROUP BY r HAVING count(*) > 1; If the underlying PRNG were non-uniform, or the method of reduction to the required range was flawed in some way that reduced the number of actual possible return values, then the probability of duplicates would be increased. A non-uniform distribution would probably be caught by the KS tests, but uniform gaps in the possible outputs might not be, so I think this test still has value. > The proper way of testing PRNG would be to call setseed() and compare > return values with expected ones. I don't mind testing the proposed > invariants but they should do this after calling setseed(). Currently > the patch places the tests right before the call. > There are also new tests of that nature, following the call to setseed(0.5). They're useful for a quick visual check of the results, and confirming the expected number of digits after the decimal point in the numeric case. However, I think those tests are insufficient on their own. Regards, Dean
Hi Dean, I did a quick review and a little bit of testing on the patch today. I think it's a good/useful idea, and I think the code is ready to go (the code is certainly much cleaner than anything I'd written ...). I do have one minor comments regarding the docs - it refers to "random functions" in a couple places, which sounds to me as if it was talking about some functions arbitrarily taken from some list, although it clearly means "functions generating random numbers". (I realize this might be just due to me not being native speaker.) Did you think about adding more functions generating either other types of data distributions (now we have uniform and normal), or random data for other data types (I often need random strings, for example)? Of course, I'm not saying this patch needs to do that. But perhaps it might affect how we name stuff to make it "extensible". regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, 24 Feb 2024 at 17:10, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi Dean, > > I did a quick review and a little bit of testing on the patch today. I > think it's a good/useful idea, and I think the code is ready to go (the > code is certainly much cleaner than anything I'd written ...). > Thanks for reviewing! > I do have one minor comments regarding the docs - it refers to "random > functions" in a couple places, which sounds to me as if it was talking > about some functions arbitrarily taken from some list, although it > clearly means "functions generating random numbers". (I realize this > might be just due to me not being native speaker.) > Yes, I think you're right, that wording was a bit clumsy. Attached is an update that's hopefully a bit better. > Did you think about adding more functions generating either other types > of data distributions (now we have uniform and normal), or random data > for other data types (I often need random strings, for example)? > > Of course, I'm not saying this patch needs to do that. But perhaps it > might affect how we name stuff to make it "extensible". > I don't have any plans to add more random functions, but I did think about it from that perspective. Currently we have "random" and "random_normal", so the natural extension would be "random_${distribution}" for other data distributions, with "uniform" as the default distribution, if omitted. For different result datatypes, it ought to be mostly possible to determine the result type from the arguments. There might be some exceptions, like maybe "random_bytes(length)" to generate a byte array, but I think that would be OK. Regards, Dean
Вложения
On Tue, 27 Feb 2024 at 17:33, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > On Sat, 24 Feb 2024 at 17:10, Tomas Vondra > > > > I did a quick review and a little bit of testing on the patch today. I > > think it's a good/useful idea, and I think the code is ready to go (the > > code is certainly much cleaner than anything I'd written ...). > Based on the reviews so far, I think this is ready for commit, so unless anyone objects, I will do so in a day or so. As a quick summary, this adds a new file: src/backend/utils/adt/pseudorandomfuncs.c which contains SQL-callable functions that access a single shared pseudorandom number generator, whose state is private to that file. Currently the functions are: random() returns double precision [moved from float.c] random(min integer, max integer) returns integer [new] random(min bigint, max bigint) returns bigint [new] random(min numeric, max numeric) returns numeric [new] random_normal() returns double precision [moved from float.c] setseed(seed double precision) returns void [moved from float.c] It's possible that functions to return other random distributions or other datatypes might get added in the future, but I have no plans to do so at the moment. Regards, Dean
On Tue, 26 Mar 2024 at 06:57, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > Based on the reviews so far, I think this is ready for commit, so > unless anyone objects, I will do so in a day or so. > Committed. Thanks for the reviews. Regards, Dean