Обсуждение: Functions to return random numbers in a given range

Поиск
Список
Период
Сортировка

Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
Attached is a patch that adds 3 SQL-callable functions to return
random integer/numeric values chosen uniformly from a given range:

  random(min int, max int) returns int
  random(min bigint, max bigint) returns bigint
  random(min numeric, max numeric) returns numeric

The return value is in the range [min, max], and in the numeric case,
the result scale equals Max(scale(min), scale(max)), so it can be used
to generate large random integers, as well as decimals.

The goal is to provide simple, easy-to-use functions that operate
correctly over arbitrary ranges, which is trickier than it might seem
using the existing random() function. The main advantages are:

1. Support for arbitrary bounds (provided that max >= min). A SQL or
PL/pgSQL implementation based on the existing random() function can
suffer from integer overflow if the difference max-min is too large.

2. Uniform results over the full range. It's easy to overlook the fact
that in a naive implementation doing something like
"((max-min)*random()+min)::int", the endpoint values will be half as
likely as any other value, since casting to integer rounds to nearest.

3. Makes better use of the underlying PRNG, not limited to the 52-bits
of double precision values.

4. Simpler and more efficient generation of random numeric values.
This is something I have commonly wanted in the past, and have usually
resorted to hacks involving multiple calls to random() to build
strings of digits, which is horribly slow, and messy.

The implementation moves the existing random functions to a new source
file, so the new functions all share a common PRNG state with the
existing random functions, and that state is kept private to that
file.

Regards,
Dean

Вложения

Re: Functions to return random numbers in a given range

От
Pavel Stehule
Дата:
Hi

čt 21. 12. 2023 v 18:06 odesílatel Dean Rasheed <dean.a.rasheed@gmail.com> napsal:
Attached is a patch that adds 3 SQL-callable functions to return
random integer/numeric values chosen uniformly from a given range:

  random(min int, max int) returns int
  random(min bigint, max bigint) returns bigint
  random(min numeric, max numeric) returns numeric
The return value is in the range [min, max], and in the numeric case,
the result scale equals Max(scale(min), scale(max)), so it can be used
to generate large random integers, as well as decimals.

The goal is to provide simple, easy-to-use functions that operate
correctly over arbitrary ranges, which is trickier than it might seem
using the existing random() function. The main advantages are:

1. Support for arbitrary bounds (provided that max >= min). A SQL or
PL/pgSQL implementation based on the existing random() function can
suffer from integer overflow if the difference max-min is too large.

2. Uniform results over the full range. It's easy to overlook the fact
that in a naive implementation doing something like
"((max-min)*random()+min)::int", the endpoint values will be half as
likely as any other value, since casting to integer rounds to nearest.

3. Makes better use of the underlying PRNG, not limited to the 52-bits
of double precision values.

4. Simpler and more efficient generation of random numeric values.
This is something I have commonly wanted in the past, and have usually
resorted to hacks involving multiple calls to random() to build
strings of digits, which is horribly slow, and messy.

The implementation moves the existing random functions to a new source
file, so the new functions all share a common PRNG state with the
existing random functions, and that state is kept private to that
file.

+1

Regards

Pavel


Regards,
Dean

Re: Functions to return random numbers in a given range

От
jian he
Дата:
On Fri, Dec 22, 2023 at 1:07 AM Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Attached is a patch that adds 3 SQL-callable functions to return
> random integer/numeric values chosen uniformly from a given range:
>
>   random(min int, max int) returns int
>   random(min bigint, max bigint) returns bigint
>   random(min numeric, max numeric) returns numeric
>
> The return value is in the range [min, max], and in the numeric case,
> the result scale equals Max(scale(min), scale(max)), so it can be used
> to generate large random integers, as well as decimals.
>
> The goal is to provide simple, easy-to-use functions that operate
> correctly over arbitrary ranges, which is trickier than it might seem
> using the existing random() function. The main advantages are:
>
> 1. Support for arbitrary bounds (provided that max >= min). A SQL or
> PL/pgSQL implementation based on the existing random() function can
> suffer from integer overflow if the difference max-min is too large.
>

Your patch works.
performance is the best amount for other options in [0].
I don't have deep knowledge about which one is more random.

Currently we have to explicitly mention the lower and upper bound.
but can we do this:
just give me an int, int means the int data type can be represented.
or just give me a random bigint.
but for numeric, the full numeric values that can be represented are very big.

Maybe we can use the special value null to achieve this
like use
select random(NULL::int,null)
to represent a random int in the full range of integers values can be
represented.

Do you think it makes sense?

[0] https://www.postgresql.org/message-id/CAEG8a3LcYXjNU1f2bxMm9c6ThQsPoTcvYO_kOnifx3aGXkbgPw%40mail.gmail.com



Re: Functions to return random numbers in a given range

От
David Zhang
Дата:
Thank you for the patch.

I applied this patch manually to the master branch, resolving a conflict 
in `numeric.h`. It successfully passed both `make check` and `make 
check-world`.


Best regards,

David




Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Thu, 28 Dec 2023 at 07:34, jian he <jian.universality@gmail.com> wrote:
>
> Your patch works.
> performance is the best amount for other options in [0].
> I don't have deep knowledge about which one is more random.
>

Thanks for testing.

> Currently we have to explicitly mention the lower and upper bound.
> but can we do this:
> just give me an int, int means the int data type can be represented.
> or just give me a random bigint.
> but for numeric, the full numeric values that can be represented are very big.
>
> Maybe we can use the special value null to achieve this
> like use
> select random(NULL::int,null)
> to represent a random int in the full range of integers values can be
> represented.
>

Hmm, I don't particularly like that idea. It seems pretty ugly. Now
that we support literal integers in hex, with underscores, it's
relatively easy to pass INT_MIN/MAX as arguments to these functions,
if that's what you need. I think if we were going to have a shorthand
for getting full-range random integers, it would probably be better to
introduce separate no-arg functions for that. I'm not really sure if
that's a sufficiently common use case to justify the effort though.

Regards,
Dean



Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Fri, 26 Jan 2024 at 20:44, David Zhang <david.zhang@highgo.ca> wrote:
>
> Thank you for the patch.
>
> I applied this patch manually to the master branch, resolving a conflict
> in `numeric.h`. It successfully passed both `make check` and `make
> check-world`.
>

Thanks for testing.

Interestingly, the cfbot didn't pick up on the fact that it needed
rebasing. Anyway, the copyright years in the new file's header comment
needed updating, so here is a rebase doing that.

Regards,
Dean

Вложения

Re: Functions to return random numbers in a given range

От
Aleksander Alekseev
Дата:
Hi,

> Interestingly, the cfbot didn't pick up on the fact that it needed
> rebasing. Anyway, the copyright years in the new file's header comment
> needed updating, so here is a rebase doing that.

Maybe I'm missing something but I'm not sure if I understand what this
test tests particularly:

```
-- There should be no triple duplicates in 1000 full-range 32-bit random()
-- values.  (Each of the C(1000, 3) choices of triplets from the 1000 values
-- has a probability of 1/(2^32)^2 of being a triple duplicate, so the
-- average number of triple duplicates is 1000 * 999 * 998 / 6 / 2^64, which
-- is roughly 9e-12.)
SELECT r, count(*)
FROM (SELECT random(-2147483648, 2147483647) r
      FROM generate_series(1, 1000)) ss
GROUP BY r HAVING count(*) > 2;
```

The intent seems to be to check the fact that random numbers are
distributed evenly. If this is the case I think the test is wrong. The
sequence of numbers 100, 100, 100, 100, 100 is as random as 99, 8, 4,
12, 45 and every particular sequence has low probability. All in all
personally I would argue that this is a meaningless test that just
fails with a low probability. Same for the tests that follow below.

The proper way of testing PRNG would be to call setseed() and compare
return values with expected ones. I don't mind testing the proposed
invariants but they should do this after calling setseed(). Currently
the patch places the tests right before the call.

-- 
Best regards,
Aleksander Alekseev



Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Tue, 30 Jan 2024 at 12:47, Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Maybe I'm missing something but I'm not sure if I understand what this
> test tests particularly:
>
> ```
> -- There should be no triple duplicates in 1000 full-range 32-bit random()
> -- values.  (Each of the C(1000, 3) choices of triplets from the 1000 values
> -- has a probability of 1/(2^32)^2 of being a triple duplicate, so the
> -- average number of triple duplicates is 1000 * 999 * 998 / 6 / 2^64, which
> -- is roughly 9e-12.)
> SELECT r, count(*)
> FROM (SELECT random(-2147483648, 2147483647) r
>       FROM generate_series(1, 1000)) ss
> GROUP BY r HAVING count(*) > 2;
> ```
>
> The intent seems to be to check the fact that random numbers are
> distributed evenly. If this is the case I think the test is wrong. The
> sequence of numbers 100, 100, 100, 100, 100 is as random as 99, 8, 4,
> 12, 45 and every particular sequence has low probability. All in all
> personally I would argue that this is a meaningless test that just
> fails with a low probability. Same for the tests that follow below.
>

I'm following the same approach used to test the existing random
functions, and the idea is the same. For example, this existing test:

-- There should be no duplicates in 1000 random() values.
-- (Assuming 52 random bits in the float8 results, we could
-- take as many as 3000 values and still have less than 1e-9 chance
-- of failure, per https://en.wikipedia.org/wiki/Birthday_problem)
SELECT r, count(*)
FROM (SELECT random() r FROM generate_series(1, 1000)) ss
GROUP BY r HAVING count(*) > 1;

If the underlying PRNG were non-uniform, or the method of reduction to
the required range was flawed in some way that reduced the number of
actual possible return values, then the probability of duplicates
would be increased. A non-uniform distribution would probably be
caught by the KS tests, but uniform gaps in the possible outputs might
not be, so I think this test still has value.

> The proper way of testing PRNG would be to call setseed() and compare
> return values with expected ones. I don't mind testing the proposed
> invariants but they should do this after calling setseed(). Currently
> the patch places the tests right before the call.
>

There are also new tests of that nature, following the call to
setseed(0.5). They're useful for a quick visual check of the results,
and confirming the expected number of digits after the decimal point
in the numeric case. However, I think those tests are insufficient on
their own.

Regards,
Dean



Re: Functions to return random numbers in a given range

От
Tomas Vondra
Дата:
Hi Dean,

I did a quick review and a little bit of testing on the patch today. I
think it's a good/useful idea, and I think the code is ready to go (the
code is certainly much cleaner than anything I'd written ...).

I do have one minor comments regarding the docs - it refers to "random
functions" in a couple places, which sounds to me as if it was talking
about some functions arbitrarily taken from some list, although it
clearly means "functions generating random numbers". (I realize this
might be just due to me not being native speaker.)


Did you think about adding more functions generating either other types
of data distributions (now we have uniform and normal), or random data
for other data types (I often need random strings, for example)?

Of course, I'm not saying this patch needs to do that. But perhaps it
might affect how we name stuff to make it "extensible".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Sat, 24 Feb 2024 at 17:10, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hi Dean,
>
> I did a quick review and a little bit of testing on the patch today. I
> think it's a good/useful idea, and I think the code is ready to go (the
> code is certainly much cleaner than anything I'd written ...).
>

Thanks for reviewing!

> I do have one minor comments regarding the docs - it refers to "random
> functions" in a couple places, which sounds to me as if it was talking
> about some functions arbitrarily taken from some list, although it
> clearly means "functions generating random numbers". (I realize this
> might be just due to me not being native speaker.)
>

Yes, I think you're right, that wording was a bit clumsy. Attached is
an update that's hopefully a bit better.

> Did you think about adding more functions generating either other types
> of data distributions (now we have uniform and normal), or random data
> for other data types (I often need random strings, for example)?
>
> Of course, I'm not saying this patch needs to do that. But perhaps it
> might affect how we name stuff to make it "extensible".
>

I don't have any plans to add more random functions, but I did think
about it from that perspective. Currently we have "random" and
"random_normal", so the natural extension would be
"random_${distribution}" for other data distributions, with "uniform"
as the default distribution, if omitted.

For different result datatypes, it ought to be mostly possible to
determine the result type from the arguments. There might be some
exceptions, like maybe "random_bytes(length)" to generate a byte
array, but I think that would be OK.

Regards,
Dean

Вложения

Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Tue, 27 Feb 2024 at 17:33, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> On Sat, 24 Feb 2024 at 17:10, Tomas Vondra
> >
> > I did a quick review and a little bit of testing on the patch today. I
> > think it's a good/useful idea, and I think the code is ready to go (the
> > code is certainly much cleaner than anything I'd written ...).
>

Based on the reviews so far, I think this is ready for commit, so
unless anyone objects, I will do so in a day or so.

As a quick summary, this adds a new file:

src/backend/utils/adt/pseudorandomfuncs.c

which contains SQL-callable functions that access a single shared
pseudorandom number generator, whose state is private to that file.
Currently the functions are:

  random() returns double precision                 [moved from float.c]
  random(min integer, max integer) returns integer  [new]
  random(min bigint, max bigint) returns bigint     [new]
  random(min numeric, max numeric) returns numeric  [new]
  random_normal() returns double precision          [moved from float.c]
  setseed(seed double precision) returns void       [moved from float.c]

It's possible that functions to return other random distributions or
other datatypes might get added in the future, but I have no plans to
do so at the moment.

Regards,
Dean



Re: Functions to return random numbers in a given range

От
Dean Rasheed
Дата:
On Tue, 26 Mar 2024 at 06:57, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
>
> Based on the reviews so far, I think this is ready for commit, so
> unless anyone objects, I will do so in a day or so.
>

Committed. Thanks for the reviews.

Regards,
Dean