Discussion: A varint implementation for PG?


A varint implementation for PG?

From
Andres Freund
Date:
Hi,

I have several times, most recently for the record format in the undo
patchset, wished for a fast variable-width integer implementation for
postgres. Using very narrow integers, for space efficiency, solves the
space usage problem, but leads to extensibility / generality problems.

Other cases where I wanted variable-width integer encodings include,
non-exhaustively: variable-width TIDs, the on-disk representation of a
smaller SQL-level integer, the WAL format, the wire protocol, ...

I spent a few days experimenting with different variable length
encodings for unsigned and signed integers. My goal is/was to find an
encoding and a prototype that has reasonably low space overhead and that
is very quick to encode/decode.


Probably the most common encoding is what's often called LEB128. It
basically uses 7 bits in each byte for data, and one bit as a
"continuation" bit. While not bad space-wise, it's not great for
decoding (and also encoding) speed. The problem is a) that there's a
branch for each byte, which results in poorly predictable code, b) that
the 7 bit groups need to be converted to/from the native 8 bit
representation (not as expensive as the branching issue, but quite
noticeably so).
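
For reference, a LEB128-style codec looks roughly like this (an untested
sketch, not code from any patch); note the continuation-bit branch in every
decode iteration:

#include <stddef.h>
#include <stdint.h>

/* Unsigned LEB128: 7 data bits per byte, high bit set if more bytes
 * follow.  Returns the number of bytes written (at most 10 for 64 bits). */
static size_t
leb128_encode_uint64(uint64_t value, uint8_t *out)
{
    size_t  len = 0;

    do
    {
        uint8_t byte = value & 0x7F;

        value >>= 7;
        if (value != 0)
            byte |= 0x80;       /* continuation bit */
        out[len++] = byte;
    } while (value != 0);

    return len;
}

/* Decoding has to test the continuation bit of every byte - that's the
 * per-byte branch - and reassemble the 7 bit groups by shifting. */
static size_t
leb128_decode_uint64(const uint8_t *in, uint64_t *value)
{
    uint64_t    result = 0;
    size_t      len = 0;
    int         shift = 0;

    for (;;)
    {
        uint8_t byte = in[len++];

        result |= (uint64_t) (byte & 0x7F) << shift;
        if ((byte & 0x80) == 0)
            break;
        shift += 7;
    }

    *value = result;
    return len;
}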

I think a better encoding is something similar to UTF-8. That is,
encoding the byte length in unary in the leading byte, with the data
following that prefix (plus the separator ending the prefix). That
allows avoiding per-byte branching during decoding.

In contrast to UTF-8, in my prototype I decided to encode the prefix in
unary as the number of leading 0 bits - the main advantage is that this
makes it possible to use instructions like "count leading zeroes" to
determine the length (otherwise one needs to invert the data first).
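
To illustrate the decoding side (just a sketch of the idea - the actual bit
layout in the prototype may differ): the total encoded length can be derived
from the first byte alone, without any per-byte branching:

#include <stdint.h>

static inline int
varint_total_length(uint8_t first_byte)
{
    /* an all-zero first byte marks the maximum length: 8 data bytes follow */
    if (first_byte == 0)
        return 9;

    /* __builtin_clz() works on unsigned int, so adjust for the 8 bit input */
    return __builtin_clz((unsigned int) first_byte) - 24 + 1;
}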


To encode negative numbers, I chose to use what's often called "zigzag"
encoding. One cannot just naively use the above encoding for signed
integers, as otherwise two's complement negative numbers would always be
of the maximum length (due to the leading 1 bit(s)). Instead the
to-be-encoded number is multiplied by two (i.e. <<1), and the sign bit
is stored in the lowest bit; negative numbers are additionally stored
with all bits inverted.
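
I.e. the usual zigzag transform; as a sketch (the prototype's exact code may
differ slightly):

#include <stdint.h>

/* 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ..., so values of small
 * magnitude stay short regardless of sign. */
static inline uint64_t
zigzag_encode(int64_t value)
{
    /* relies on arithmetic right shift producing all ones for negative input */
    return ((uint64_t) value << 1) ^ (uint64_t) (value >> 63);
}

static inline int64_t
zigzag_decode(uint64_t value)
{
    return (int64_t) (value >> 1) ^ -(int64_t) (value & 1);
}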

The only disadvantage of that encoding that I am aware of is that
encoded signed varints cannot just be memcmp'd (which they could with
e.g. a single sign bit).

Alternatively one could store the sign directly after the "separator
bit" (the one ending the unary length indicator). But then there's the
problem that the number has leading data after the indicator. It's also
a bit harder to write the code for that encoding. I wrote the code for
that first, and I think it's very likely worth going for the simpler
approach.


In my benchmarks encoding + decoding a varint in this way costs
approximately 23 cycles on my Intel Skylake CPU. I think pipelining is
hiding a bit of the latency, but not too much. However, as the benchmark
currently is stupid and just encodes all 32bit integers in sequence,
it's overly friendly to the branch predictor.

Even with those caveats, I think that's a pretty good result. Other
encodings were more expensive. And I think there's definitely some room
for optimization left.



I've pushed a repo containing my initial implementation of the above to
https://github.com/anarazel/varint_experiment and attached a tarball
containing it. The header file contains a more detailed description of
the format.

So far I've only implemented uint64_t and int64_t encoding/decoding
functions. It'd definitely make sense to implement a 4 byte variant, and
potentially also one supporting larger lengths. I think it'd make sense
to have a variant supporting a unary length indicator of up to 64 bits
(i.e. 64 bytes of data, potentially minus one bit).

If data lengths longer than that are required for a use case, it
probably is better to either a) use the max-representable 8 byte integer
as an indicator that the length is stored or b) sacrifice another bit to
represent whether the integer is the data itself or the length. Not sure
if this is worthwhile.


Do others see use in this? If so, does the encoding outlined above sound
reasonable?  I'm using a variant of this for a proposal for a different
undo record format, but I really hope this would be interesting for
others too.



Greetings,

Andres Freund

Attachments

Re: A varint implementation for PG?

From
Craig Ringer
Date:
On Tue, 10 Dec 2019 at 09:51, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I have several times, most recently for the record format in the undo
> patchset, wished for a fast variable-width integer implementation for
> postgres. Using very narrow integers, for space efficiency, solves the
> space usage problem, but leads to extensibility / generality problems.

Yes. I've wanted flexible but efficiently packed integers quite a bit too, especially when working with wire protocols.

Am I stabbing completely in the dark when wondering if this might be a step towards a way to lift the size limit on VARLENA Datums like bytea?

There are obvious practical concerns with doing so, given that our protocol offers no handle based lazy fetching for big VARLENA values, but that too needs a way to represent sizes sensibly and flexibly.
 
> Even with those caveats, I think that's a pretty good result. Other
> encodings were more expensive. And I think there's definitely some room
> for optimization left.

I don't feel at all qualified to question your analysis of the appropriate representation. But your explanation certainly makes a lot of sense as someone approaching the topic mostly fresh - I've done a bit with BCD but not much else.
 
I assume we'd be paying a price in padding and alignment in most cases, and probably more memory copying, but these representations would likely be appearing mostly in places where other costs are overwhelmingly greater like network or disk I/O.

> If data lengths longer than that are required for a use case

If baking a new variant integer format now, I think limiting it to 64 bits is probably a mistake given how long-lived PostgreSQL is, and how hard it can be to change things in the protocol, on disk, etc.
 
> it
> probably is better to either a) use the max-representable 8 byte integer
> as an indicator that the length is stored or b) sacrifice another bit to
> represent whether the integer is the data itself or the length.

I'd be inclined to suspect that (b) is likely worth doing. If nothing else because not being able to represent the full range of a 64-bit integer in the variant type is potentially going to be a seriously annoying hassle at points where we're interacting with places that could use the full width. We'd then have the potential for variant integers of > 2^64 but at least that's wholly under our control.

I also routinely underestimate how truly huge a 64-bit integer really is. But even now 8 petabytes isn't as inconceivable as it used to be....

It mostly depends on how often you expect you'd be coming up on the boundaries where the extra bit would push you up a variant size.

> Do others see use in this?

Yes. Very, very much yes.

I'd be quick to want to expose it to SQL too.
 
--
 Craig Ringer                   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: A varint implementation for PG?

From
Andres Freund
Date:
Hi,

On 2019-12-13 13:31:55 +0800, Craig Ringer wrote:
> Am I stabbing completely in the dark when wondering if this might be a step
> towards a way to lift the size limit on VARLENA Datums like bytea ?

It could be - but I think it'd be a pretty small piece of it. But yes, I
have mused about that.



> > Even with those caveats, I think that's a pretty good result. Other
> > encodings were more expensive. And I think there's definitely some room
> > for optimization left.
> 
> 
> I don't feel at all qualified to question your analysis of the appropriate
> representation. But your explanation certainly makes a lot of sense as
> someone approaching the topic mostly fresh - I've done a bit with BCD but
> not much else.
> 
> I assume we'd be paying a price in padding and alignment in most cases, and
> probably more memory copying, but these representations would likely be
> appearing mostly in places where other costs are overwhelmingly greater
> like network or disk I/O.

I don't really see where padding/alignment costs come into play here?



> > If data lengths longer than that are required for a use case
> 
> 
> If baking a new variant integer format now, I think limiting it to 64 bits
> is probably a mistake given how long-lived PostgreSQL is, and how hard it
> can be to change things in the protocol, on disk, etc.

I don't think it's ever going to be sensible to transport 64bit quanta
of data. Also, uh, it'd be larger than the data a postgres instance
could really contain, given LSNs are 64 bit.



> > it
> > probably is better to either a) use the max-representable 8 byte integer
> > as an indicator that the length is stored or b) sacrifice another bit to
> > represent whether the integer is the data itself or the length.

> I'd be inclined to suspect that (b) is likely worth doing. If nothing else
> because not being able to represent the full range of a 64-bit integer in
> the variant type is potentially going to be a seriously annoying hassle at
> points where we're interacting with places that could use the full width.
> We'd then have the potential for variant integers of > 2^64 but at least
> that's wholly under our control.

I'm very very staunchly against doing either of these for the varints
used widely. Throwing away even a bit is quite painful, as it
e.g. reduces the range representable in a single byte from 0..127 / -64..63
to 0..63 / -32..31.  Without ever being useful, given what kind of
things varints are commonly going to describe. There's e.g. simply no
practical use of describing a single WAL record length that's bigger
than 63 bit can represent.

I *can* see a separate varint type, probably sharing some code, that
supports storing arbitrarily large numbers. But using that everywhere
would be pointless.


> I'd be quick to want to expose it to SQL too.

It'll be a bit problematic to deal with all the casting necessary, and
with the likely resulting overload resolution issues.  I'm wondering
whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
that encodes int2/4/8 as varints when inside a tuple, but otherwise just
lets it be a normal integer.

Greetings,

Andres Freund



Re: A varint implementation for PG?

From
Robert Haas
Date:
[ resurrecting this 2-year-old thread ]

On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > If baking a new variant integer format now, I think limiting it to 64 bits
> > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > can be to change things in the protocol, on disk, etc.
>
> I don't think it's ever going to be sensible to transport 64bit quanta
> of data. Also, uh, it'd be larger than the data a postgres instance
> could really contain, given LSNs are 64 bit.

We already use 128-bit integers in some places in the source code, and
it seems more likely than not that use of 128-bit integer data types
will continue to expand over time. If we want to represent 64 bit
integers in a variable number of bits today, it does seem pretty
likely that some day we will want to do the same thing with 128 bit
integers, and maybe eventually larger. And I think that all of that is
true even though exhausting a 64-bit LSN space looks difficult today
and will likely remain difficult for the foreseeable future.

So to me, the right thing to do here probably depends a bit on the use
case. If we are talking about using this for strictly internal
purposes - e.g. cramming stuff into an internal data structure into
which users have no direct visibility - then I think it's probably
best to pick a representation that corresponds exactly to the C data
type that we're trying to store. If we're starting out with a  int64
and we want to byte-squeeze it, the representation you've chosen here
seems like just the thing. And it's easy to imagine why we might also
want to have similar transformations for uint64, int32, and uint32,
and maybe eventually int128 and uint128. And if the details of the
format are a little different in each case that's fine; for these
sorts of use cases we're always going to unpack into the same data
type that we packed originally, so it's not important to have exact
compatibility across differences in signedness or bit count.

However, I suspect that the whole approach should be completely
revised for a user-visible data type. On the one hand, there's no
telling how large a value some user will want to represent, so
limiting ourselves to 64 bits does seem shortsighted. And on the other
hand, if we've got a varlena, we already know the length, so it seems
like we shouldn't also encode the length in the value. Maybe there's a
more efficient way, but the first thing that occurs to me is to just
discard high order bytes that are all zeroes or all ones until the
high order bit of the next byte doesn't match and plonk the remaining
bytes into the varlena. To decompress, just sign-extend out to the
target length. Really, this kind of representation can be extended to
represent arbitrarily large integers, even bigger than what we can
currently do with numeric, which is already crazy huge, and it seems
to have some advantage in that every payload byte contains exactly 8
data bits, so we don't need to shift or mask while encoding and
decoding.
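
Something like this rough, untested sketch for int64 (names made up; the
real thing would of course go through the varlena machinery rather than a
caller-supplied length):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Drop high-order bytes that merely repeat the sign, keep the rest
 * (most significant byte first).  The byte count would live in the
 * varlena header, not in the payload. */
static size_t
squeeze_int64(int64_t value, uint8_t *out)
{
    uint8_t bytes[8];
    size_t  len = 8;

    for (int i = 0; i < 8; i++)
        bytes[i] = (uint8_t) ((uint64_t) value >> (8 * (7 - i)));

    while (len > 1 &&
           ((bytes[8 - len] == 0x00 && (bytes[8 - len + 1] & 0x80) == 0) ||
            (bytes[8 - len] == 0xFF && (bytes[8 - len + 1] & 0x80) != 0)))
        len--;

    memcpy(out, bytes + (8 - len), len);
    return len;
}

/* Sign-extend from the first stored byte, then shift in the rest. */
static int64_t
unsqueeze_int64(const uint8_t *in, size_t len)
{
    int64_t value = (int8_t) in[0];

    for (size_t i = 1; i < len; i++)
        value = (int64_t) (((uint64_t) value << 8) | in[i]);

    return value;
}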

Now, we could think of introducing a new format for variable-length
datums, essentially making this a new typlen rather than a new kind of
varlena. That might be worth it, because if you are storing a lot of
values that are small enough that this format could represent them in
3 bytes or less, which I think would be everything up to +/- 2^20, you
could save a pretty significant amount of space even if your table was
laid out to avoid alignment padding. However, there would be some
distributed overhead to this, because all the places that have special
handling for typlen = -1 and typlen = -2 would need to grow new cases
for this. I'm not sure how much of a problem that is, really, but I
don't think we can go nuts with adding new typlen values.

> It'll be a bit problematic to deal with all the casting necessary, and
> with the likely resulting overload resolution issues.  I'm wondering
> whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
> that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> lets it be a normal integer.

I don't see how a STORAGE option could work. We tend to treat those as
hints, rather than critical data, which wouldn't work here. I think
there are a number of problems that crop up, but one of them is the
same thing we hit with the custom TOAST compression stuff. If you need
to make a value of some row type out of a HeapTuple, you're not going
to know which settings were used to create that heap tuple, and you're
certainly not going to know that when you go to deform that tuple. The
only thing you're going to have is the tuple descriptor, which AFAICS
means that the representation needs to be a property of the type, not
where the value is stored. Maybe you have a clever idea I'm not
seeing.

As for the casting and overloading issues, that's tripped up quite a
few people now when trying to add new numeric data types - unsigned
integers being a case that has come up a few times now, I think. I
don't think it has to be a blocker. I think the solution is probably
to accept that using unsigned or variable-width data types will
require inserting more casts than would be required for integer
datatypes and numeric. That may not thrill everybody, but it may still
be better than deciding we're never ever ever adding any more data
types for storing numbers than we have today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: A varint implementation for PG?

From
Andres Freund
Date:
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
> 
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

Yea, my answer here was intended to be about areas where we "bake in", to
quote Craig, the varint format, like varlena tags or protocol messages. For
those a 64bit length tag seems entirely sufficient.

I think e.g. adding an SQL level variable width integer type would potentially
come to a different outcome. But even for things like that I suspect that more
often than not you'd be better off with a variable length length value, rather
than encoding the whole "huge" value as a single variable width integer.


> So to me, the right thing to do here probably depends a bit on the use
> case. If we are talking about using this for strictly internal
> purposes - e.g. cramming stuff into an internal data structure into
> which users have no direct visibility - then I think it's probably
> best to pick a representation that corresponds exactly to the C data
> type that we're trying to store. If we're starting out with an int64
> and we want to byte-squeeze it, the representation you've chosen here
> seems like just the thing. And it's easy to imagine why we might also
> want to have similar transformations for uint64, int32, and uint32,
> and maybe eventually int128 and uint128. And if the details of the
> format are a little different in each case that's fine; for these
> sorts of use cases we're always going to unpack into the same data
> type that we packed originally, so it's not important to have exact
> compatibility across differences in signedness or bit count.

Agreed.


> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value.

If we're talking varlenas, then I think using embedded varints only really
makes sense if they're "sub-components" of that varlena, not the entire
value...


> Now, we could think of introducing a new format for variable-length
> datums, essentially making this a new typlen rather than a new kind of
> varlena. That might be worth it, because if you are storing a lot of
> values that are small enough that this format could represent them in
> 3 bytes or less, which I think would be everything up to +/- 2^20, you
> could save a pretty significant amount of space even if your table was
> laid out to avoid alignment padding. However, there would be some
> distributed overhead to this, because all the places that have special
> handling for typlen = -1 and typlen = -2 would need to grow new cases
> for this. I'm not sure how much of a problem that is, really, but I
> don't think we can go nuts with adding new typlen values.

Yes - the branches for the different typlens already are quite visible in
profiles... It might be that we could compensate for that without too much
difficulty (e.g. by having a field in a tupdesc indicating what kind of types
are in use in a tuple to be formed/deformed, and dispatching to different
form/deform routines based on that), but it's not obviously a win.



> > It'll be a bit problematic to deal with all the casting necessary, and
> > with the likely resulting overload resolution issues.  I'm wondering
> > whether it'd be worthwhile to have an ALTER TABLE ... STORAGE ... option
> > that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> > lets it be a normal integer.
> 
> I don't see how a STORAGE option could work. We tend to treat those as
> hints, rather than critical data, which wouldn't work here.

I wasn't thinking that this would be something that could be changed without a
table rewrite - so maybe STORAGE would be the wrong place to put it, given how
it's currently used. OTOH, I don't think it'd be that big a problem to have a
rewrite for some, but not all STORAGE options...


> I think there are a number of problems that crop up, but one of them is the
> same thing we hit with the custom TOAST compression stuff. If you need to
> make a value of some row type out of a HeapTuple, you're not going to know
> which settings were used to create that heap tuple, and you're certainly not
> going to know that when you go to deform that tuple. The only thing you're
> going to have is the tuple descriptor, which AFAICS means that the
> representation needs to be a property of the type, not where the value is
> stored. Maybe you have a clever idea I'm not seeing.

All the information needed for deforming fields is a property of pg_attribute,
not pg_type (because the type could have been dropped after the column was
dropped, but we still need to skip over the column). So if one accepts needing
rewrites for changing the data encoding, I don't think there'd be a huge
issue here.

I think the compression stuff is a bit different because you want to prevent
compressed values of one type wandering into another table with a different
storage type. That problem wouldn't really exist for an SQL level varint as I
was imagining them - they'd always be deformed into the "normal" int2/4/8
Datum representation outside of a HeapTuple. We obviously don't want to do
that for varlenas, because they need to be fetched from out-of-line, which is
several orders of magnitude more expensive than decoding a varint.


I am now wondering if what we're talking about here would best be thought of
not as a variable width integer type, but a variable-width encoding for all
pass-by-value types.

Leaving on-disk compatibility aside (:)), ISTM that we by default could use
the following heuristic to decide how to encode pass-by-value types: If it's a
leading fixed-width NOT NULL type, store it in fixed-length
encoding. Otherwise use a variable-length encoding.
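
As a very rough sketch of what I mean (made-up function, glossing over
plenty of details):

#include "postgres.h"
#include "access/tupdesc.h"

/*
 * Made-up sketch of the heuristic: leading fixed-width NOT NULL
 * pass-by-value columns keep the fixed-width encoding; everything from
 * the first column violating that onwards becomes a candidate for
 * variable-length encoding.
 */
static void
choose_column_encodings(TupleDesc desc, bool *use_varwidth)
{
    bool    fixed_prefix = true;

    for (int i = 0; i < desc->natts; i++)
    {
        Form_pg_attribute attr = TupleDescAttr(desc, i);

        if (fixed_prefix &&
            attr->attbyval && attr->attlen > 0 && attr->attnotnull)
            use_varwidth[i] = false;
        else
        {
            fixed_prefix = false;
            /* only pass-by-value columns get the varint-style encoding */
            use_varwidth[i] = attr->attbyval;
        }
    }
}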

Greetings,

Andres Freund



Re: A varint implementation for PG?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value. Maybe there's a
> more efficient way, but the first thing that occurs to me is to just
> discard high order bytes that are all zeroes or all ones until the
> high order bit of the next byte doesn't match and plonk the remaining
> bytes into the varlena. To decompress, just sign-extend out to the
> target length. Really, this kind of representation can be extended to
> represent arbitrarily large integers, even bigger than what we can
> currently do with numeric, which is already crazy huge, and it seems
> to have some advantage in that every payload byte contains exactly 8
> data bits, so we don't need to shift or mask while encoding and
> decoding.

+1.  I think this, together with our existing rules for varlena headers,
would address the issue quite nicely.  Any sanely-sized integer would
require only a one-byte header, so the minimum on-disk size is 2 bytes
(with no alignment padding required).  I don't buy that there's enough
need to justify inventing a new typlen code, since even if you did it
wouldn't improve things all that much compared to this design.

(Oh ... actually the minimum on-disk size is one byte, since value zero
would require no payload bytes.)

            regards, tom lane



Re: A varint implementation for PG?

From
Robert Haas
Date:
On Tue, Aug 3, 2021 at 3:32 PM Andres Freund <andres@anarazel.de> wrote:
> I am now wondering if what we're talking about here would best be thought of
> not as a variable width integer type, but a variable-width encoding for all
> pass-by-value types.
>
> Leaving on-disk compatibility aside (:)), ISTM that we by default could use
> the following heuristic to decide how to encode pass-by-value types: If it's a
> leading fixed-width NOT NULL type, store it in fixed-length
> encoding. Otherwise use a variable-length encoding.

This is pretty integer-centric, though. If your pass-by-value type is
storing timestamps, for example, they're not likely to be especially
close to zero. Since a 64-bit address is pretty big, perhaps they're
still close enough to zero that this will work out to a win, but I
don't know, that seems a bit cheesy. I grant that it could work out to
a win -- pass-by-value data types whose distribution is very different
from what's typical for integers, or for that matter columns full of
integers that all happen to be toward the extreme values the data type
can store, are probably not that common. I just don't really like
making such assumptions on a system-wide basis (as opposed to a
per-datatype basis where it's easier to reason about the
consequences).

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: A varint implementation for PG?

From
Andres Freund
Date:
Hi,

On 2021-08-04 09:31:25 -0400, Robert Haas wrote:
> This is pretty integer-centric, though. If your pass-by-value type is
> storing timestamps, for example, they're not likely to be especially
> close to zero. Since a 64-bit address is pretty big, perhaps they're
> still close enough to zero that this will work out to a win, but I
> don't know, that seems a bit cheesy.

Yea, that's fair. The really bad™ example probably is negative numbers - which
wouldn't be easy to do something about in a datatype agnostic way.


> I grant that it could work out to a win -- pass-by-value data types whose
> distribution is very different from what's typical for integers, or for that
> matter columns full of integers that all happen to be toward the extreme
> values the data type can store, are probably not that common.

It'd work out as a wash for common timestamps:

./varint_test -u 681413261095983
processing unsigned
unsigned:    681413261095983
  input bytes:     00 02  6b bd  e3 5f  74 2f
8 output bytes:     01 02  6b bd  e3 5f  74 2f
decoded:    681413261095983

I don't think there's many workloads where plain integers would skew extreme
enough for it to work out to a loss often enough to matter. But:

> I just don't really like making such assumptions on a system-wide basis (as
> opposed to a per-datatype basis where it's easier to reason about the
> consequences).

I'd not at all be opposed to datatypes having influence over the on-disk
encoding. I was just musing about a default heuristic that could make sense. I
do think you'd want something that chooses the encoding for one pg_attribute's
values based on preceding columns.

Greetings,

Andres Freund



Re: A varint implementation for PG?

From
Andres Freund
Date:
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
>
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
> > > If baking a new variant integer format now, I think limiting it to 64 bits
> > > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > > can be to change things in the protocol, on disk, etc.
> >
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
>
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

I was thinking a bit about how to encode arbitrary length values for the cases
where that's interesting.

Currently what I proposed for 8 byte unsigned integers is that we encode the
length in unary unset bits, followed by a set bit. As a special case, a prefix
of 8 unset bits indicates a length of 9, without needing the separator bit -
it's not required at that point.

Extending that to arbitrary lengths obviously at some point makes the encoding
in unary wasteful, and the benefit of few branches vanishes. So what I was
thinking is that for variable length pieces of data that are not limited to 8
bytes, we could replace the '8 0 bits' special case with a new special case:
The length in bytes follows as a max-8-byte varint.

That'd leave us with the following overheads:
- 0 - 127: 0 bytes (i.e. 0 to 7 bits)
- 128 - 2^56 - 1: 1 byte (i.e. 7 bits - 7 bytes)
- 7 bytes - 127 bytes: 2 bytes
- 128 bytes - 16383 bytes: 3 bytes
- 16384 bytes - 2097151 bytes: 4 bytes
- 2097152 bytes - 268435455 bytes: 5 bytes
- 268435456 bytes - 34359738367 bytes: 6 bytes
- ...

The obvious alternative would of course be to just always store the length
prefix separately:
- 0 - 127 bytes: 1 byte
- 128 - 16383 bytes: 2 bytes
- 16384 - 2097151 bytes: 3 bytes
- ...

I do suspect that for a fair number of cases the "0 byte overhead" for very
small values would be worth the increase in overhead later. Particularly
because the decoding for values up to 7 bytes would be cheaper cpu-wise as
well.

Greetings,

Andres Freund



Re: A varint implementation for PG?

From
Robert Haas
Date:
On Wed, Aug 4, 2021 at 3:01 PM Andres Freund <andres@anarazel.de> wrote:
> Extending that to arbitrary lengths obviously at some point makes the encoding
> in unary wasteful, and the benefit of few branches vanishes. So what I was
> thinking is that for variable length pieces of data that are not limited to 8
> bytes, we could replace the '8 0 bits' special case with a new special case:
> The length in bytes follows as a max-8-byte varint.

But what if I have a machine with more than 16 exabytes of RAM and I
want to use all of its memory to store one really big integer?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: A varint implementation for PG?

From
Andres Freund
Date:
On 2021-08-04 15:37:36 -0400, Robert Haas wrote:
> On Wed, Aug 4, 2021 at 3:01 PM Andres Freund <andres@anarazel.de> wrote:
> > Extending that to arbitrary lengths obviously at some point makes the encoding
> > in unary wasteful, and the benefit of few branches vanishes. So what I was
> > thinking is that for variable length pieces of data that are not limited to 8
> > bytes, we could replace the '8 0 bits' special case with a new special case:
> > The length in bytes follows as a max-8-byte varint.
> 
> But what if I have a machine with more than 16 exabytes of RAM and I
> want to use all of its memory to store one really big integer?

Then the embedded 8 byte length value would just have to do the same thing
recursively to store that huge length header :)



Re: A varint implementation for PG?

From
Robert Haas
Date:
On Wed, Aug 4, 2021 at 3:46 PM Andres Freund <andres@anarazel.de> wrote:
> > But what if I have a machine with more than 16 exabytes of RAM and I
> > want to use all of its memory to store one really big integer?
>
> Then the embedded 8 byte length value would just have to do the same thing
> recursively to store that huge length header :)

Well, yes. But more seriously, my point is that I can't imagine why we
would need an object with a length bounded by 2^64. I mean I suppose
there's no harm in looking to the future, but that's *really big*.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: A varint implementation for PG?

From
Tomas Vondra
Date:

On 8/4/21 9:01 PM, Andres Freund wrote:
> Hi,
> 
> On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
>> [ resurrecting this 2-year-old thread ]
>>
>> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund <andres@anarazel.de> wrote:
>>>> If baking a new variant integer format now, I think limiting it to 64 bits
>>>> is probably a mistake given how long-lived PostgreSQL is, and how hard it
>>>> can be to change things in the protocol, on disk, etc.
>>>
>>> I don't think it's ever going to be sensible to transport 64bit quanta
>>> of data. Also, uh, it'd be larger than the data a postgres instance
>>> could really contain, given LSNs are 64 bit.
>>
>> We already use 128-bit integers in some places in the source code, and
>> it seems more likely than not that use of 128-bit integer data types
>> will continue to expand over time. If we want to represent 64 bit
>> integers in a variable number of bits today, it does seem pretty
>> likely that some day we will want to do the same thing with 128 bit
>> integers, and maybe eventually larger. And I think that all of that is
>> true even though exhausting a 64-bit LSN space looks difficult today
>> and will likely remain difficult for the foreseeable future.
> 
> I was thinking a bit about how to encode arbitrary length values for the cases
> where that's interesting.
> 
> Currently what I proposed for 8 byte unsigned integers is that we encode the
> length in unary unset bits, followed by a set bit. As a special case, a prefix
> of 8 unset bits indicates a length of 9, without needing the separator bit -
> it's not required at that point.
> 

How is that better than the two varint flavors that are already out 
there, i.e. the bitcoin [1] and protocol buffers [2]?

The first one seems quite efficient in how it encodes the length into 
very few bits (which matters especially for small values). It's designed 
for integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary 
lengths fairly easily, I think:

Look at the first byte, and

0 - 243 - encoded as is
244 - 1 byte
245 - 2 bytes
246 - 3 bytes
247 - 4 bytes
248 - 5 bytes
249 - 6 bytes
250 - 7 bytes
251 - 8 bytes
252 - next 1 byte is length
253 - next 2 bytes are length
254 - next 3 bytes are length
255 - next 4 bytes are length

If we want to support longer lengths, we'd have to reserve an extra 
value (which reduces the number of values that require a single byte).

The [2] is a bit more problematic, as it's tailored for very short 
values (essentially wasting 1/8 of bits) and you have to parse the whole 
value to determine the length.

[1] https://wiki.bitcoinsv.io/index.php/VarInt
[2] https://developers.google.com/protocol-buffers/docs/encoding

> Extending that to arbitrary lengths obviously at some point makes the encoding
> in unary wasteful, and the benefit of few branches vanishes. So what I was
> thinking is that for variable length pieces of data that are not limited to 8
> bytes, we could replace the '8 0 bits' special case with a new special case:
> The length in bytes follows as a max-8-byte varint.
> 
> That'd leave us with the following overheads:
> - 0 - 127: 0 bytes (i.e. 0 to 7 bits)
> - 128 - 2^56 - 1: 1 byte (i.e. 7 bits - 7 bytes)
> - 7 bytes - 127 bytes: 2 bytes
> - 128 bytes - 16383 bytes: 3 bytes
> - 16384 bytes - 2097151 bytes: 4 bytes
> - 2097152 bytes - 268435455 bytes: 5 bytes
> - 268435456 bytes - 34359738367 bytes: 6 bytes
> - ...
> 
> The obvious alternative would of course be to just always store the length
> prefix separately:
> - 0 - 127 bytes: 1 byte
> - 128 - 16383 bytes: 2 bytes
> - 16384 - 2097151 bytes: 3 bytes
> - ...
> 
> I do suspect that for a fair number of cases the "0 byte overhead" for very
> small values would be worth the increase in overhead later. Particularly
> because the decoding for values up to 7 bytes would be cheaper cpu-wise as
> well.
> 

Yeah. Especially if the long values can be compressed, which probably 
applies to most real-world data sets. IMHO the efficiency for short 
values is the more important thing.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: A varint implementation for PG?

From
Andres Freund
Date:
Hi,

On 2021-08-04 23:44:10 +0200, Tomas Vondra wrote:
> How is that better than the two varint flavors that are already out there,
> i.e. the bitcoin [1] and protocol buffers [2]?

The protobuf one is *terrible* for CPU efficiency. You need to go through each
byte, do masking and shifting for each byte and then have a conditional
branch. That's bad for the number of instructions you need to execute,
and *really* bad for the branch predictor.


> The first one seems quite efficient in how it encodes the length into very
> few bits (which matters especially for small values). It's designed for
> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
> fairly easily, I think:

> Look at the first byte, and
> 
> 0 - 243 - encoded as is
> 244 - 1 byte
> 245 - 2 bytes
> 246 - 3 bytes
> 247 - 4 bytes
> 248 - 5 bytes
> 249 - 6 bytes
> 250 - 7 bytes
> 251 - 8 bytes
> 252 - next 1 byte is length
> 253 - next 2 bytes are length
> 254 - next 3 bytes are length
> 255 - next 4 bytes are length

> If we want to support longer lengths, we'd have to reserve an extra value
> (which reduces the number of values that require a single byte).

I think that's not a bad scheme. I think it may end up being a bit more
expensive to decode because you need more branches instead of using
find-first-set type instructions.

Greetings,

Andres Freund



Re: A varint implementation for PG?

From
Tomas Vondra
Date:
On 8/5/21 1:05 AM, Andres Freund wrote:
> Hi,
> 
> On 2021-08-04 23:44:10 +0200, Tomas Vondra wrote:
>> How is that better than the two varint flavors that are already out there,
>> i.e. the bitcoin [1] and protocol buffers [2]?
> 
> The protobuf one is *terrible* for CPU efficiency. You need to go through each
> byte, do masking and shifting for each byte and then have a conditional
> branch. That's bad for the number of instructions you need to execute,
> and *really* bad for the branch predictor.
> 

Yeah, probably true - particularly for longer values. No argument here.

> 
>> The first one seems quite efficient in how it encodes the length into very
>> few bits (which matters especially for small values). It's designed for
>> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary lengths
>> fairly easily, I think:
> 
>> Look at the first byte, and
>>
>> 0 - 243 - encoded as is
>> 244 - 1 byte
>> 245 - 2 bytes
>> 246 - 3 bytes
>> 247 - 4 bytes
>> 248 - 5 bytes
>> 249 - 6 bytes
>> 250 - 7 bytes
>> 251 - 8 bytes
>> 252 - next 1 byte is length
>> 253 - next 2 bytes are length
>> 254 - next 3 bytes are length
>> 255 - next 4 bytes are length
> 
>> If we want to support longer lengths, we'd have to reserve an extra value
>> (which reduces the number of values that require a single byte).
> 
> I think that's not a bad scheme. I think it may end up being a bit more
> expensive to decode because you need more branches instead of using
> find-first-set type instructions.
> 

I don't think it requires many branches, because you can essentially do

   if (byte[0] <= 243)
     length = 0
   else if (byte[0] <= 251)
     length = byte[0] - 243
   else
   {
     length_bytes = byte[0] - 251;
     ... read length_bytes, decode length
   }

but I haven't tried implementing it and maybe my intuition is wrong.

Or maybe it'd be a good scheme for on-disk format, but poor for memory.
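
FWIW, a slightly more fleshed out (still untested) version of that sketch,
decoding an unsigned integer and assuming a little-endian payload:

#include <stdint.h>
#include <string.h>

static const uint8_t *
decode_uint(const uint8_t *in, uint64_t *value)
{
    uint8_t tag = in[0];

    if (tag <= 243)
    {
        *value = tag;               /* value stored directly in the tag byte */
        return in + 1;
    }
    else if (tag <= 251)
    {
        int      len = tag - 243;   /* 1 .. 8 payload bytes */
        uint64_t v = 0;

        memcpy(&v, in + 1, len);    /* assumes a little-endian machine */
        *value = v;
        return in + 1 + len;
    }
    else
    {
        /*
         * tags 252..255: the next 1..4 bytes are a length - only needed
         * for data longer than 8 bytes, so not handled in this sketch
         */
        return NULL;
    }
}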


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: A varint implementation for PG?

From
Andrew Dunstan
Date:
On 8/4/21 7:21 PM, Tomas Vondra wrote:
> On 8/5/21 1:05 AM, Andres Freund wrote:
>
>>
>>> The first one seems quite efficient in how it encodes the length
>>> into very
>>> few bits (which matters especially for small values). It's designed for
>>> integers with 1B, 2B, 4B or 8B, but it can be extended to arbitrary
>>> lengths
>>> fairly easily, I think:
>>
>>> Look at the first byte, and
>>>
>>> 0 - 243 - encoded as is
>>> 244 - 1 byte
>>> 245 - 2 bytes
>>> 246 - 3 bytes
>>> 247 - 4 bytes
>>> 248 - 5 bytes
>>> 249 - 6 bytes
>>> 250 - 7 bytes
>>> 251 - 8 bytes
>>> 252 - next 1 byte is length
>>> 253 - next 2 bytes are length
>>> 254 - next 3 bytes are length
>>> 255 - next 4 bytes are length
>>
>>> If we want to support longer lengths, we'd have to reserve an extra
>>> value
>>> (which reduces the number of values that require a single byte).
>>
>> I think that's not a bad scheme. I think it may end up being a bit more
>> expensive to decode because you need more branches instead of using
>> find-first-set type instructions.
>>
>
> I don't think it requires many branches, because you can essentially do
>
>   if (byte[0] <= 243)
>     length = 0
>   else if (byte[0] <= 251)
>     length = byte[0] - 243
>   else
>   {
>     length_bytes = byte[0] - 251;
>     ... read length_bytes, decode length
>   }
>
> but I haven't tried implementing it and maybe my intuition is wrong.
>
> Or maybe it'd be a good scheme for on-disk format, but poor for memory.
>
>
>

This seems like quite an elegant scheme. Certainly worth trying out. I
find it hard to believe that more than 4 length bytes would be needed
(although that reminds me of the famous and possibly apocryphal quote
"640K ought to be enough for anybody.")


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: A varint implementation for PG?

From
Robert Haas
Date:
Andres asked me off-list if I could take another look at this.

So here's a bit of review:

- The header comment at the top of the file gives some examples of how
the encoding works, and then basically says, oh wait, there's also a
sign bit at the end, so all those examples are actually wrong. It's
really this other thing. Perhaps it's possible to reword things to
avoid that.

- The XXX comment in the file header says "variants" where it probably
means "varints". A later comment mentions "lenght" instead of
"length".

- In pg_varint_encode_uint64, where it says "XXX: I'm sure there's a
neater way to do this," I'm not sure exactly what your parameters for
neater are, but you could avoid the conditional if you just did:

bits_of_output_data = bits_of_input_data + bytes_of_input_data;
bytes_of_output_data = (bits_of_output_data + BITS_PER_BYTE - 1) /
BITS_PER_BYTE;

I think the comment implies that you thought of that and discarded it
on performance grounds, but I'm not quite sure that I'm right about
that, so I mention it here.

I also thought of another approach that doesn't require computing
bytes_of_input_data:

bytes_of_output_data = bits_of_input_data / 7 + 1;

The intuition here is that every byte you add to the output gives you
room for 7 additional data bits, because you gain 8 for the new byte
but also have to consume 1 bit from the first byte, for a net gain of
7. This formula gives the wrong answer when bits_of_input_data is 63
or 64, but you don't need to use it for values greater than
PG_VARINT_UINT64_MAX_8BYTE_VAL, so that doesn't matter.

- It's a bit surprising that the length argument to memcpy() is a
constant in the 8-bytes-or-less case. It should be fine, because the
output buffer must be big enough to hold at least 9 more bytes, so all
that can happen is we write unnecessary bytes that the caller can
later overwrite, or not. But it would be worth a comment, I think. In
particular, we should note that you should always have enough
allocated space in the output buffer for the maximum width of 9 bytes
even if you know the number you're actually encoding is small. You do
mention this in the decoding function, but not on the encoding side.

- The FIXME comment for pg_varint_decode_uint64 no verb.

- ret <<= BITS_PER_BYTE * (BITS_PER_BYTE - bytes) is awfully hard to
reason about. My first thought was that it had to be wrong: if we are
shifting left during encoding, how can we also be shifting left during
decoding? Actually I think I see, now, how it works on a little-Endian
machine. Logically, the encoded bytes are the least-significant bytes.
We copy them to the least-significant bytes of "ret," but they're in
the wrong order, with the most significant byte of the encoded
representation in the least significant byte of "ret". By shifting
left, we move all the bytes to the other "end" of "ret", and then the
byte swap puts them back where they were, but now in the right order.
Along the way, the garbage bits we're supposed to be ignoring get
thrown away and replaced with zeroes. But I still don't see how it
works on a big-Endian machine. On such a machine, we copy the encoded
bytes, which are still the least significant bytes of the original
value, to the most significant bytes of "ret". The byte swap isn't
going to do anything here, so in this case it feels like the shift
needs to be in the other direction -- a shift right. But maybe not,
because originally I thought it should be a shift right in both cases,
and now I don't think that's right.

- Re "mask out length indicator bits & separator bit", only the
separator bit actually is being or needs to be masked out.

- I think the overall interface is fine. I think it might be useful to
add a function that returns the length to which something would encode
without actually encoding it, for cases where you want to estimate how
much space you're going to need for something so that you can allocate
space for it, and then only afterwards do the encoding for real.
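
For instance something like this (name made up, and assuming the format works
as I described above - an n-byte encoding with n <= 8 carries 7 * n data bits,
and anything larger uses the 9 byte form):

#include <stdint.h>

static inline int
pg_varint_uint64_encoded_size(uint64_t value)
{
    int     bits;

    if (value == 0)
        return 1;

    bits = 64 - __builtin_clzll(value);

    if (bits > 56)
        return 9;               /* needs the 9 byte special case */

    return (bits + 6) / 7;      /* ceil(bits / 7) */
}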

That's all I've got.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: A varint implementation for PG?

From
Nathan Bossart
Date:
On Thu, Jan 05, 2023 at 06:36:15PM -0500, Robert Haas wrote:
> Andres asked me off-list if I could take another look at this.

I'm curious whether there are plans to pick this up again.  IMHO it seems
like a generally good idea.  AFAICT the newest version of the patch is in a
separate thread [0], which I just wanted to link here for posterity.  If
there are no plans, I might give it a try, but otherwise I'm happy to help
review.

[0] https://postgr.es/m/20221004234952.anrguppx5owewb6n%40awork3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com