Discussion: Thoughts on NBASE=100000000

Thoughts on NBASE=100000000

From
"Joel Jacobson"
Date:
Hello hackers,

I'm not hopeful this idea will be fruitful, but maybe we can find solutions
to the problems together.

The idea is to increase the numeric NBASE from 1e4 to 1e8, which could possibly
give a significant performance boost of all operations across the board,
on 64-bit architectures, for many inputs.

Last time numeric's base was changed was back in 2003, when d72f6c75038 changed
it from 10 to 10000. Back then, 32-bit architectures were still dominant,
so base-10000 was clearly the best choice at that time.

Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would
have been the best choice, since the square of that still fits in
a 64-bit signed int.
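
The headroom claim above can be checked with a few lines of C. This is only a minimal sketch, not code from numeric.c: the macro and helper names are hypothetical, and it assumes digits are non-negative and strictly below NBASE.

```c
#include <assert.h>
#include <stdint.h>

#define NBASE_1E8 INT64_C(100000000)    /* hypothetical name for the proposed base */

/* Largest possible product of two digits in the given base. */
static int64_t max_digit_product(int64_t nbase)
{
    assert(nbase > 1);
    return (nbase - 1) * (nbase - 1);
}

/*
 * Number of worst-case digit products that can be summed in a signed
 * 64-bit accumulator before it could overflow.
 */
static int64_t max_products_in_int64(int64_t nbase)
{
    return INT64_MAX / max_digit_product(nbase);
}
```

With NBASE=1e8 the largest digit product is 9999999800000001, so a signed 64-bit accumulator has room for 922 such products before a carry pass would be needed.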

Changing NBASE might seem impossible at first, due to the existing numeric data
on disk, and incompatibility issues when numeric data is transferred on the
wire.

Here are some ideas on how to work around some of these:

- Incrementally changing the data on disk, e.g. upon UPDATE/INSERT
and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit)
when reading data.

- Due to the lack of a version field in the NumericVar struct,
we need a way to detect if a Numeric value on disk uses
the existing NBASE=1e4, or NBASE=1e8.
One hack I've thought about is to exploit the fact that NUMERIC_NBYTES,
defined as:
    #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num))
will always be divisible by two, since a NumericDigit is an int16 (2 bytes).
The idea is then to let "NUMERIC_NBYTES divisible by three"
indicate NBASE=1e8, at the cost of one to three extra padding bytes.

Another important aspect is disk space utilization, which is of course better
for NBASE=1e4, since it packs the data more tightly.
I think this is the main disadvantage of NBASE=1e8, but perhaps users would be
willing to sacrifice some disk, if they would get better run-time performance.

As said initially, this might be completely unrealistic,
but I'm interested to hear if anyone else has had similar dreams.

Regards,
Joel



Re: Thoughts on NBASE=100000000

From
Matthias van de Meent
Date:
On Sun, 7 Jul 2024, 22:40 Joel Jacobson, <joel@compiler.org> wrote:
>
> Hello hackers,
>
> I'm not hopeful this idea will be fruitful, but maybe we can find solutions
> to the problems together.
>
> The idea is to increase the numeric NBASE from 1e4 to 1e8, which could possibly
> give a significant performance boost of all operations across the board,
> on 64-bit architectures, for many inputs.
>
> Last time numeric's base was changed was back in 2003, when d72f6c75038 changed
> it from 10 to 10000. Back then, 32-bit architectures were still dominant,
> so base-10000 was clearly the best choice at that time.
>
> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would
> have been the best choice, since the square of that still fits in
> a 64-bit signed int.

Back then, 64-bit was nowhere near as dominant (server and consumer chips
with the AMD64 ISA were only released later that year, after the commit), so I
don't think 1e8 would have been the best choice at that point in time.
It would be better now, yes, but not back then.

> Changing NBASE might seem impossible at first, due to the existing numeric data
> on disk, and incompatibility issues when numeric data is transferred on the
> wire.
>
> Here are some ideas on how to work around some of these:
>
> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT
> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit)
> when reading data.

I think that a dynamic decision would make more sense here. At low
precision, the overhead of 4+1 bytes vs 2 bytes is quite significant.
This sounds important for overall storage concerns, especially if the
padding bytes (mentioned below) are added to indicate types.
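
To put rough numbers on the "4+1 bytes vs 2 bytes" remark, here is a small sketch of the digit payload size under both bases. It rests on assumptions that are not spelled out in the thread: the value is an integer with `ndec` decimal digits packed without alignment waste, the varlena and numeric headers are ignored, and a base-1e8 value carries exactly one extra marker byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Base 1e4: one 2-byte digit per 4 decimal digits. */
static size_t payload_bytes_1e4(size_t ndec)
{
    return ((ndec + 3) / 4) * 2;
}

/* Base 1e8: one 4-byte digit per 8 decimal digits, plus 1 marker byte. */
static size_t payload_bytes_1e8(size_t ndec)
{
    return ((ndec + 7) / 8) * 4 + 1;
}

/* Extra payload bytes the base-1e8 layout costs for the same value. */
static size_t extra_bytes_1e8(size_t ndec)
{
    size_t b4 = payload_bytes_1e4(ndec);
    size_t b8 = payload_bytes_1e8(ndec);

    assert(b8 > b4);            /* holds for every ndec under these assumptions */
    return b8 - b4;
}
```

Under these assumptions a value of 1 to 4 decimal digits needs 5 bytes instead of 2, while a 64-digit value needs 33 instead of 32, which is why the relative overhead shrinks as values grow.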

> - Due to the lack of a version field in the NumericVar struct,
> we need a way to detect if a Numeric value on disk uses
> the existing NBASE=1e4, or NBASE=1e8.
> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES,
> defined as:
>     #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num))
> will always be divisible by two, since a NumericDigit is an int16 (2 bytes).
> The idea is then to let "NUMERIC_NBYTES divisible by three"
> indicate NBASE=1e8, at the cost of one to three extra padding bytes.

Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an
odd NUMERIC_NBYTES as the indicator for NBASE=1e8, rather than only
multiples of 3? I'm asking because there are many integers divisible
by both 2 and 3 (all integer multiples of 6; that's 50% of the
multiples of 3), so with the multiple-of-3 scheme we might need up to
5 pad bytes to get to the next multiple of 3 that isn't also a
multiple of 2. Additionally, if the last digit would've fit in
NBASE_1e4, then the 1e8-based numeric value could even be 7 bytes
larger than the equivalent 1e4-based numeric (2 bytes for the wider
digit plus 5 pad bytes).
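
The padding arithmetic for the two marker schemes can be sketched as follows. This is an illustration only, with hypothetical function names; it assumes the marker is carried purely by the payload length and that a base-1e8 payload is a multiple of 4 bytes before padding.

```c
#include <assert.h>
#include <stddef.h>

/* Scheme A: any odd length marks NBASE=1e8. */
static size_t pad_to_odd(size_t nbytes)
{
    return (nbytes % 2 == 0) ? 1 : 0;
}

/*
 * Scheme B: a length that is a multiple of 3 but not of 2 marks
 * NBASE=1e8, i.e. the padded length must be congruent to 3 modulo 6.
 */
static size_t pad_to_odd_multiple_of_3(size_t nbytes)
{
    size_t pad = (9 - nbytes % 6) % 6;

    assert((nbytes + pad) % 6 == 3);
    return pad;
}
```

For payloads of 4, 8 and 12 bytes, scheme B needs 5, 1 and 3 pad bytes respectively, whereas scheme A always needs exactly 1.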

While I don't think this is worth implementing for general usage, it
could be worth exploring for the larger numeric values, where the
relative overhead of the larger representation is lower.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Thoughts on NBASE=100000000

From
"Joel Jacobson"
Date:
On Mon, Jul 8, 2024, at 12:45, Matthias van de Meent wrote:
> On Sun, 7 Jul 2024, 22:40 Joel Jacobson, <joel@compiler.org> wrote:
>> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would
>> have been the best choice, since the square of that still fits in
>> a 64-bit signed int.
>
> Back then, 64-bit was nowhere near as dominant (server and consumer chips
> with the AMD64 ISA were only released later that year, after the commit), so I
> don't think 1e8 would have been the best choice at that point in time.
> It would be better now, yes, but not back then.

Oh, grammar mistake by me!
I meant to say it "would be the best choice", in line with what I wrote above:

>> Last time numeric's base was changed was back in 2003, when d72f6c75038 changed
>> it from 10 to 10000. Back then, 32-bit architectures were still dominant,
>> so base-10000 was clearly the best choice at that time.

>> Changing NBASE might seem impossible at first, due to the existing numeric data
>> on disk, and incompatibility issues when numeric data is transferred on the
>> wire.
>>
>> Here are some ideas on how to work around some of these:
>>
>> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT
>> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit)
>> when reading data.
>
> I think that a dynamic decision would make more sense here. At low
> precision, the overhead of 4+1 bytes vs 2 bytes is quite significant.
> This sounds important for overall storage concerns, especially if the
> padding bytes (mentioned below) are added to indicate types.

Right, I agree.

Another idea: It seems possible to reduce the disk space for numerics
that fit into one byte, i.e. 0 <= val <= 255, which could be communicated
via NUMERIC_NBYTES=1.
At least the value 0 should be quite common.

>> - Due to the lack of a version field in the NumericVar struct,
>> we need a way to detect if a Numeric value on disk uses
>> the existing NBASE=1e4, or NBASE=1e8.
>> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES,
>> defined as:
>>     #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num))
>> will always be divisible by two, since a NumericDigit is an int16 (2 bytes).
>> The idea is then to let "NUMERIC_NBYTES divisible by three"
>> indicate NBASE=1e8, at the cost of one to three extra padding bytes.
>
> Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an
> odd NUMERIC_NBYTES as the indicator for NBASE=1e8, rather than only
> multiples of 3?

Oh, yes of course! Thinko.
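
Taken together, the odd-length marker and the one-byte idea from this message suggest a length-based classifier along these lines. It is a sketch under stated assumptions, not an agreed design: the enum and function names are hypothetical, a base-1e8 payload is assumed to be 4k+1 bytes with k >= 1 (so a length of 1 is otherwise unused), and an empty payload stays in the existing format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum
{
    FMT_NBASE_1E4,              /* existing format: even payload length */
    FMT_NBASE_1E8,              /* proposed format: odd payload length >= 5 */
    FMT_ONE_BYTE                /* proposed shortcut: payload length 1 */
} DigitFormat;

/* Decide the digit format from the payload length alone. */
static DigitFormat classify_payload(size_t nbytes)
{
    if (nbytes == 1)
        return FMT_ONE_BYTE;
    if (nbytes % 2 == 0)
        return FMT_NBASE_1E4;
    return FMT_NBASE_1E8;
}

/* Encode a value in 0..255 as the single payload byte. */
static uint8_t encode_one_byte(unsigned value)
{
    assert(value <= 255);
    return (uint8_t) value;
}
```

A reader would then branch on `classify_payload(NUMERIC_NBYTES(num))` before interpreting the digits.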

> While I don't think this is worth implementing for general usage, it
> could be worth exploring for the larger numeric values, where the
> relative overhead of the larger representation is lower.

Yes, I agree it definitely seems like a win for larger numeric values.
I'm not sure about smaller numeric values; maybe that case can be
improved upon.

Regards,
Joel



Re: Thoughts on NBASE=100000000

From
"Joel Jacobson"
Date:
On Mon, Jul 8, 2024, at 13:42, Joel Jacobson wrote:
> On Mon, Jul 8, 2024, at 12:45, Matthias van de Meent wrote:
>> On Sun, 7 Jul 2024, 22:40 Joel Jacobson, <joel@compiler.org> wrote:
>>> Today, since 64-bit architectures are dominant, NBASE=1e8 seems like it would
>>> have been the best choice, since the square of that still fits in
>>> a 64-bit signed int.
>>
>> Back then, 64-bit was nowhere near as dominant (server and consumer chips
>> with the AMD64 ISA were only released later that year, after the commit), so I
>> don't think 1e8 would have been the best choice at that point in time.
>> It would be better now, yes, but not back then.
>
> Oh, grammar mistake by me!
> I meant to say it "would be the best choice", in line with what I wrote above:
>
>>> Last time numeric's base was changed was back in 2003, when d72f6c75038 changed
>>> it from 10 to 10000. Back then, 32-bit architectures were still dominant,
>>> so base-10000 was clearly the best choice at that time.
>
>>> Changing NBASE might seem impossible at first, due to the existing numeric data
>>> on disk, and incompatibility issues when numeric data is transferred on the
>>> wire.
>>>
>>> Here are some ideas on how to work around some of these:
>>>
>>> - Incrementally changing the data on disk, e.g. upon UPDATE/INSERT
>>> and supporting both NBASE=1e4 (16-bit) and NBASE=1e8 (32-bit)
>>> when reading data.
>>
>> I think that a dynamic decision would make more sense here. At low
>> precision, the overhead of 4+1 bytes vs 2 bytes is quite significant.
>> This sounds important for overall storage concerns, especially if the
>> padding bytes (mentioned below) are added to indicate types.
>
> Right, I agree.
>
> Another idea: It seems possible to reduce the disk space for numerics
> that fit into one byte, i.e. 0 <= val <= 255, which could be communicated
> via NUMERIC_NBYTES=1.
> At least the value 0 should be quite common.
>
>>> - Due to the lack of a version field in the NumericVar struct,
>>> we need a way to detect if a Numeric value on disk uses
>>> the existing NBASE=1e4, or NBASE=1e8.
>>> One hack I've thought about is to exploit the fact that NUMERIC_NBYTES,
>>> defined as:
>>>     #define NUMERIC_NBYTES(num) (VARSIZE(num) - NUMERIC_HEADER_SIZE(num))
>>> will always be divisible by two, since a NumericDigit is an int16 (2 bytes).
>>> The idea is then to let "NUMERIC_NBYTES divisible by three"
>>> indicate NBASE=1e8, at the cost of one to three extra padding bytes.
>>
>> Do you perhaps mean NUMERIC_NBYTES *not divisible by 2*, i.e. an
>> odd NUMERIC_NBYTES as the indicator for NBASE=1e8, rather than only
>> multiples of 3?
>
> Oh, yes of course! Thinko.
>
>> While I don't think this is worth implementing for general usage, it
>> could be worth exploring for the larger numeric values, where the
>> relative overhead of the larger representation is lower.
>
> Yes, I agree it definitely seems like a win for larger numeric values.
> I'm not sure about smaller numeric values; maybe that case can be
> improved upon.

When reading the thread "access numeric data in module" [1], I came to
think about $subject again.

In [1], the main argument for keeping numeric's internals private seems
to be the possibility that the representation may change in the future,
as it has in the past, the last such change being in 2003 (commit
d72f6c7503).

This reasoning suggests there is a greater-than-zero chance of
such a change happening, which, combined with the apparent
desire to somehow expose more of numeric's internals, renewed
my motivation to start a discussion on $subject.

If numeric's internals ever need to change, it seems better to do so
within the current time window, while they are still private,
hence this email.

Question:

Is there a greater-than-zero chance we might want to modernize numeric
for 64-bit hardware, even if it would require a change to its internals?

/Joel

[1]
https://www.postgresql.org/message-id/flat/CAJBL5DP2emt%2BWPeCo%2BYY_ogsGt90-_kRU3weS5YJLQTfNZr72Q%40mail.gmail.com



Re: Thoughts on NBASE=100000000

From
David Rowley
Date:
On Sun, 21 Sept 2025 at 17:09, Joel Jacobson <joel@compiler.org> wrote:
> Is there a greater-than-zero chance we might want to modernize numeric
> for 64-bit hardware, even if it would require a change to its internals?

I can't quite speak to the proposal for differentiating the two
different on-disk formats, but assuming that's all above board, I
think the answer has to depend on how much performance there is to be
had from this change vs the complexity of the change. We don't really
have information about either of those things here, so I think it's
going to be hard to answer this question until more information is
available.

If you put the on-disk format issue to the side for now, could this
change be mocked up quickly enough to demonstrate what sort of
performance gains are available from this? For me, I do suspect the
numbers will look nice for larger numerics, but since I've not studied
that code in detail, I can't really predict at what scale of numbers the
performance will start looking very good, where it won't make any
difference, or whether there will be a regression for smaller numbers.

Maybe some nice graphs showing some performance numbers (and perhaps
table size) of various scales of numerics would help drive some
discussion.

Also, FWIW, I suspect this is a good time to consider this proposal
because, as far as I see it, since 2a600a93c, we've basically made a
statement that we don't really care that much about 32-bit performance
anymore. So it would seem a bit off if someone objected to your
proposal on the grounds of not wanting to slow down 32-bit platforms.

I'm also keen to understand how you'd propose to handle functions
receiving one NumericVar in 1e8 base format and the other in 1e4 format.
Do you propose to normalise those to 1e8 so the functions don't need
to have any additional code to handle calculating between the two
different formats? If so, how much overhead is there to the
normalisation? I assume that there'd be plenty of read-only workloads
that might never rewrite the on-disk data, which would result in that
conversion having to take place all the time. I assume that could make
things slower rather than faster, especially for smaller numeric
values that wouldn't gain from having the 1e8 base.
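
To make the normalisation question concrete, here is a minimal sketch of what repacking base-1e4 digits into base-1e8 digits could look like. It is not taken from numeric.c and does not reflect any agreed design: the function names are hypothetical, digits are assumed to be stored most significant first, and the first digit is assumed to be scaled by 10000^w4, in the style of a digit array plus weight.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of w / 2, also for negative w (C's "/" truncates toward zero). */
static int floor_div2(int w)
{
    return (w >= 0) ? w / 2 : -((-w + 1) / 2);
}

/*
 * Repack n4 base-1e4 digits into base-1e8 digits.  "out" must have room
 * for (n4 + 2) / 2 entries.  Returns the number of base-1e8 digits and
 * stores the weight of the first one in *w8.
 */
static int repack_1e4_to_1e8(const int16_t *d4, int n4, int w4,
                             int32_t *out, int *w8)
{
    int weight8 = floor_div2(w4);
    /* Even w4: the first digit is the low half of a pair, so one
     * implicit zero precedes it.  Odd w4: it is the high half. */
    int lead = 1 - (w4 - 2 * weight8);
    int n8, j;

    assert(n4 >= 0);
    *w8 = weight8;
    if (n4 == 0)
        return 0;

    n8 = (lead + n4 + 1) / 2;
    for (j = 0; j < n8; j++)
    {
        int hi = 2 * j - lead;      /* index of the high half, may be -1 */
        int lo = hi + 1;            /* index of the low half, may be n4 */
        int32_t h = (hi >= 0 && hi < n4) ? d4[hi] : 0;
        int32_t l = (lo < n4) ? d4[lo] : 0;

        out[j] = h * 10000 + l;
    }
    return n8;
}
```

For example, the digits {1, 2345, 6789} with weight 2 (the value 123456789) become {1, 23456789} with weight 1. The conversion is a single linear pass over the digits, but whether that cost is noticeable for small values on read-only workloads would have to be measured.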

David