Обсуждение: SQL-level pg_datum_image_equal

Поиск
Список
Период
Сортировка

SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
Hi,

One of our customers has this workload where every so often they
update the whole table to make sure it's up-to-date. In general, you'd
probably want to use MERGE for such a workload and ignore all rows
that already have only matching data, but there's a catch: PostgreSQL
doesn't have an efficient way to check if the provided data is
actually equal in all senses of the word, so we can't easily and
cheaply determine whether an update is needed; which is one reason why
the full table was updated every time.

A naive approach to determining whether each value needs to be updated
would use `old IS NOT DISTINCT FROM new`, but a.) this relies on `=`
operators to exist for that type, and b.) the = operator of some types
don't always distinguish between values that are different for human
readers; with as famous example '1.0' and '1.00' in numeric; they have
an equal value but are clearly distinct to readers (and certain
functions).

One could get around this in this case by 'simply' casting to text and
comparing the outputs (using the C collation for performance and
determinism), or by wrapping it in a row (which then uses
record_image_eq, which does use binary compare functions internally),
but both imply additional parsing, wrapping, and overhead compared to
a direct datum_image_eq call.

So, attached is a simple and to-the-point patch that adds the function
mentioned in $subject, which will tell the user whether two values of
the same type have an exactly equal binary representation, using
datum_image_eq.


Kind regards,

Matthias van de Meent

Вложения

Re: SQL-level pg_datum_image_equal

От
jian he
Дата:
On Thu, Dec 11, 2025 at 1:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> Hi,
>

>

> So, attached is a simple and to-the-point patch that adds the function
> mentioned in $subject, which will tell the user whether two values of
> the same type have an exactly equal binary representation, using
> datum_image_eq.
>

hi.

maybe Table 9.76
(https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-CATALOG)
is the right place for this function.

corner case confused me, I think this is related to null handling,
maybe not related to this.
create type t1 as (a int, b text);
select pg_datum_image_equal('(,)'::t1, $$(,)$$::t1);
select pg_datum_image_equal('(,)'::t1, NULL::t1);
select '(,)'::t1 is null, NULL::t1 is null;

enforce_generic_type_consistency already resolved generic type.
see
select pg_datum_image_equal('1','1');
ERROR:  could not determine polymorphic type because input has type unknown

so
+ if (!OidIsValid(typ))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("could not determine type")));
+ }
this part should be elog(ERROR.....) ?


--
jian
https://www.enterprisedb.com/



Re: SQL-level pg_datum_image_equal

От
Corey Huinker
Дата:
On Wed, Dec 10, 2025 at 12:46 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
Hi,

One of our customers has this workload where every so often they
update the whole table to make sure it's up-to-date. In general, you'd
probably want to use MERGE for such a workload and ignore all rows
that already have only matching data, but there's a catch: PostgreSQL
doesn't have an efficient way to check if the provided data is
actually equal in all senses of the word, so we can't easily and
cheaply determine whether an update is needed; which is one reason why
the full table was updated every time.

Have you ruled out the suppress_redundant_updates_trigger?

Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Sat, 20 Dec 2025, 17:07 Corey Huinker, <corey.huinker@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 12:46 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> Hi,
>>
>> One of our customers has this workload where every so often they
>> update the whole table to make sure it's up-to-date. In general, you'd
>> probably want to use MERGE for such a workload and ignore all rows
>> that already have only matching data, but there's a catch: PostgreSQL
>> doesn't have an efficient way to check if the provided data is
>> actually equal in all senses of the word, so we can't easily and
>> cheaply determine whether an update is needed; which is one reason why
>> the full table was updated every time.
>
> Have you ruled out the suppress_redundant_updates_trigger?

Thank you for the reference, I wasn't aware of this trigger.
Sadly, it does not work for our use case, as that only suppresses an
update if the heap-formatted rows are binary identical, which is not
guaranteed even if when all values are equivalent; as it doesn't take
detoasting into account. It also doesn't minimize the pressure on the
TOAST table, which is something else we're trying to do with the new
function.

The issue is that when you SET a column with a user-provided value,
during trigger handling, HOT checking, and TOASTing, the binary
representation of that user-provided value is the untoasted version
(as it has not yet been inserted into any toast table and isn't
represented as varatt_external), while the original row's value may be
a toast pointer (represented as varatt_external). The checks in
trigger handling, TOASTing, and HOT checking, the old tuple's value
for that column (in its varatt_external representation) is compared
against the new value (as normal varattrib_4b.va_4byte or
varattrib_1b), and those will never be binary equal - their first byte
is guaranteed to be different. Only if the value is pulled directly
from the original column will the original column's TOAST pointer be
used, and can a new toast table insertion be skipped (after which
suppress_redundant_updates_trigger with its in-heap-row compare option
might become useful).

But, lacking a system that checks checks whether toasted values
actually changed (and thus whether HOT applies, and whether an update
has to happen), that trigger isn't up to the task at hand.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)



Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Sat, 20 Dec 2025 at 14:15, jian he <jian.universality@gmail.com> wrote:
>
> On Thu, Dec 11, 2025 at 1:46 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > Hi,
> >
>
> >
>
> > So, attached is a simple and to-the-point patch that adds the function
> > mentioned in $subject, which will tell the user whether two values of
> > the same type have an exactly equal binary representation, using
> > datum_image_eq.
> >
>
> hi.
>
> maybe Table 9.76
> (https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-CATALOG)
> is the right place for this function.

I think table 9.3
(https://www.postgresql.org/docs/18/functions-comparison.html#FUNCTIONS-COMPARISON-FUNC-TABLE)
makes more sense, as this is more a compare function than one that
exposes catalog information about the input.

> corner case confused me, I think this is related to null handling,
> maybe not related to this.
> create type t1 as (a int, b text);
> select pg_datum_image_equal('(,)'::t1, $$(,)$$::t1);
> select pg_datum_image_equal('(,)'::t1, NULL::t1);
> select '(,)'::t1 is null, NULL::t1 is null;

Yes, that's row-type NULL handling for you. '(,)' is a composite value
with only NULL values in the attributes, and SQL defines that rows
with only NULL columns must return True when `IS NULL` evaluates their
NULL-ness. On disk, however, it is still stored as a "composite type;
attributes 'a' and 'b' are NULL"; so that a user that casts the value
to text will get a different result between (NULL::t1::text) and
('(,)'::t1::text), allowing safe round-trip conversions. Also note
that `('(,)'::t1 IS DISTINCT FROM NULL::t1) = TRUE, another curious
consequence of this SQL rule.

So, that output is expected; some methods already expose these
differences between the values, so pg_datum_image_equal() *must* also
indicate they are different. And now we also have one more reason to
have a function that can notice distinctions that go deeper than
surface-level SQL.

Aside: This new function doesn't actually fully cover the spectrum of
possible inequalities detectable through SQL, as there are some very
low level datum introspection tools like pg_column_size() whose output
depends on the type of toasting applied. My function cover that,
because that data should be completely irrelevant to normal data
usage, and the user can combine this manually if they really need it.

> enforce_generic_type_consistency already resolved generic type.

While you are correct to point out that the type system would prevent
this from getting called from SQL without a proper type, I'd like to
keep the check to make sure that callers from outside the type system
don't accidentally fail to provide the function with a correct type.

> so
> + if (!OidIsValid(typ))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("could not determine type")));
> + }
> this part should be elog(ERROR.....) ?

Is there a policy on what should _not_ use ereport? I know we don't
require ereport for internal errors, but is considered forbidden?


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)



Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Mon, 22 Dec 2025 at 16:25, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
>
> On Sat, 20 Dec 2025 at 14:15, jian he <jian.universality@gmail.com> wrote:
> > maybe Table 9.76
> > (https://www.postgresql.org/docs/current/functions-info.html#FUNCTIONS-INFO-CATALOG)
> > is the right place for this function.
>
> I think table 9.3
> (https://www.postgresql.org/docs/18/functions-comparison.html#FUNCTIONS-COMPARISON-FUNC-TABLE)
> makes more sense, as this is more a compare function than one that
> exposes catalog information about the input.

Attached is v2, which adds the new function to the docs, in addition
to rebasing the patch onto master.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

Вложения

Re: SQL-level pg_datum_image_equal

От
David Rowley
Дата:
On Thu, 26 Mar 2026 at 03:46, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Attached is v2, which adds the new function to the docs, in addition
> to rebasing the patch onto master.

In [1] I was looking into reported bugs with the datum_image code not
behaving correctly. It turns out this is due to hash_numeric() using
the wrong PG_RETURN_* macro, resulting in it not correctly
sign-extending negative 32-bit ints. I did some analysis to see if
there were any user-visible changes from fixing that in the back
branches. I didn't find any.  What you're adding here would add one. I
think we might need to become more strict about using the correct
return macro before we expose this stuff, else it's going to be a
rather scary process to fix bugs when it could affect the
datum_image_equal result.

The thing about Memoize slowly finding all the bugs in the
datum_image_* code is that we're free to fix these bugs, as the use
case (the Memoize hash table) only lasts as long as the query.  If we
expose all of this to users, then they could persist things in tables.
6911f8037 fixed another problem that resulted in the datum_image code
not doing the right thing.

I also proposed a hack in [2] to fix the sign-extension problem. I had
hoped that might be temporary, but if we were to expose this function
at the SQL level, then I doubt we'd be brave enough to remove it.

Consider the results of the following 2 queries, which only differ in
that one uses a materialized CTE and the other does not.

with cte(hash) as materialized (select hash_numeric(n) from
(values('1234.124'::numeric),('1234.124'::numeric)) n(n))
select * from cte where pg_datum_image_equal(hash,
hash_numeric('1234.124'::numeric)); -- wrong results

with cte(hash) as (select hash_numeric(n) from
(values('1234.124'::numeric),('1234.124'::numeric)) n(n))
select * from cte where pg_datum_image_equal(hash,
hash_numeric('1234.124'::numeric));

 hash
------
(0 rows)

    hash
------------
 -612512148
 -612512148
(2 rows)

This would work correctly if I pushed the patch in [2]. But as of yet,
I'm not sure about it. I at least think we should delay doing this
until we've come up with a fix that we're confident in.

David

[1] https://postgr.es/m/CAApHDvrRR1VOk4i4CpNWeL48veFshfRAvDTuWxsiUhUqo0akwQ%40mail.gmail.com
[2] https://postgr.es/m/CAApHDvreF-UiqBaHtRTQWQ6z1X9snstJW%2Bdfb2DU5GOb-uPEBg%40mail.gmail.com



Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Wed, 25 Mar 2026 at 21:40, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 26 Mar 2026 at 03:46, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Attached is v2, which adds the new function to the docs, in addition
> > to rebasing the patch onto master.
>
> In [1] I was looking into reported bugs with the datum_image code not
> behaving correctly. It turns out this is due to hash_numeric() using
> the wrong PG_RETURN_* macro, resulting in it not correctly
> sign-extending negative 32-bit ints.

Thank you for refering me to this.

> I did some analysis to see if
> there were any user-visible changes from fixing that in the back
> branches. I didn't find any.  What you're adding here would add one. I
> think we might need to become more strict about using the correct
> return macro before we expose this stuff, else it's going to be a
> rather scary process to fix bugs when it could affect the
> datum_image_equal result.

I'm not sure we can reliably enforce that sign extension is always
going to be consistent for the same type; specifically looking at
unsigned integers as example. PG's tuple deserialization would see 4
byte alignment and sign-extend it as if it were a signed integer, but
returned values would prefer non- sign-extended values for more
efficient casts from larger unsigned types.

> The thing about Memoize slowly finding all the bugs in the
> datum_image_* code is that we're free to fix these bugs, as the use
> case (the Memoize hash table) only lasts as long as the query.  If we
> expose all of this to users, then they could persist things in tables.
> 6911f8037 fixed another problem that resulted in the datum_image code
> not doing the right thing.

I'm happy to mark this function as STABLE for now (to prevent its
inclusion in permanent storage), and/or to adjust the code to adjust
for subnormal inputs (values with incorrect/inconsistent/unexpected
sign extensions).

> I also proposed a hack in [2] to fix the sign-extension problem. I had
> hoped that might be temporary, but if we were to expose this function
> at the SQL level, then I doubt we'd be brave enough to remove it.

pg_datum_image_equal is supposed to only care about the bytes of the
datum that contain the actual value, so I can surely update it to do
that. I'll be happy to only apply that part to the code inside
pg_datum_image_equal; that should remove the part that exposes the
issue directly to users.

> This would work correctly if I pushed the patch in [2]. But as of yet,
> I'm not sure about it. I at least think we should delay doing this
> until we've come up with a fix that we're confident in.

I'm happy with any of the options: Having your patch at [2] get
committed, me applying [2]'s changes in pg_datum_image_equal, or
marking the function STABLE (and so, disallowing using its output in
permanent storage).

Note that false negatives are unlikely to be a problem, whilst false
positives might have issues. Sign extension issues here would create
false negatives, so likely wouldn't be a huge problem.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

PS. isn't this an issue with OIDs getting read from disk vs copied
around in memory, too? Oids are unsigned, whilst the deserialization
path assumes signed, so I'd expect that to have different outputs when
the from-disk OID you're checking has its not-sign bit 31 set?

> [2] https://postgr.es/m/CAApHDvreF-UiqBaHtRTQWQ6z1X9snstJW%2Bdfb2DU5GOb-uPEBg%40mail.gmail.com



Re: SQL-level pg_datum_image_equal

От
David Rowley
Дата:
On Thu, 26 Mar 2026 at 10:25, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I'm happy to mark this function as STABLE for now (to prevent its
> inclusion in permanent storage), and/or to adjust the code to adjust
> for subnormal inputs (values with incorrect/inconsistent/unexpected
> sign extensions).

You lost me at this part. How does marking the function as STABLE
prevent users from persisting things on disk based on the return value
of the function? I expected the primary use case for this would be in
trigger functions that make decisions about data that goes into
tables.

David



Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Wed, 25 Mar 2026 at 22:51, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 26 Mar 2026 at 10:25, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > I'm happy to mark this function as STABLE for now (to prevent its
> > inclusion in permanent storage), and/or to adjust the code to adjust
> > for subnormal inputs (values with incorrect/inconsistent/unexpected
> > sign extensions).
>
> You lost me at this part. How does marking the function as STABLE
> prevent users from persisting things on disk based on the return value
> of the function? I expected the primary use case for this would be in
> trigger functions that make decisions about data that goes into
> tables.

Indexes and stored generated columns' expression may only contain
IMMUTABLE functions, so that they don't change output when the inputs
values are unchanged. As the current datum_image_equal depends on the
volatile contents/definition of sign-extended bytes (which we clearly
don't have a defined/expected value for) that makes the output of this
function not immutable for the "same" input values.  Marking it as
STABLE would therefore prevent its values from being stored inside
PostgreSQL's definitions and storing the wrong value (or making a
decision based on the wrong value, in case of partial indexes).

That wouldn't solve the issue when used in a trigger function, indeed;
which is why the second part shows that pg_datum_image_equal could
itself check and adjust byval types to have consistent sign-extended
bytes before passing them on to datum_image_equal, so that it won't be
sensitive to the issue discussed in the memoization thread.


Kind regards,

Matthias van de Meent



Re: SQL-level pg_datum_image_equal

От
Tom Lane
Дата:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> On Wed, 25 Mar 2026 at 22:51, David Rowley <dgrowleyml@gmail.com> wrote:
>> You lost me at this part. How does marking the function as STABLE
>> prevent users from persisting things on disk based on the return value
>> of the function? I expected the primary use case for this would be in
>> trigger functions that make decisions about data that goes into
>> tables.

> Indexes and stored generated columns' expression may only contain
> IMMUTABLE functions, so that they don't change output when the inputs
> values are unchanged. As the current datum_image_equal depends on the
> volatile contents/definition of sign-extended bytes (which we clearly
> don't have a defined/expected value for) that makes the output of this
> function not immutable for the "same" input values.

This seems to me to be a rather creative misinterpretation of what
STABLE and IMMUTABLE mean.  If you want to claim that IMMUTABLE means
that, then the function isn't STABLE either, since it could give
different results for the "same" input values within one query.
Moreover, switching from IMMUTABLE to STABLE wouldn't fix the
problem of users assuming more than they should.

The actual problem here is that datum_image_eq is assuming more
than it should about the contents of a pass-by-value Datum.
That was okay for its original use-cases because a false not-equal
report would just end in not applying some optimization.  But
Memoize thinks that the answers are exact, and users would too
if we expose the function at SQL level.

I think what David proposed at
<CAApHDvreF-UiqBaHtRTQWQ6z1X9snstJW+dfb2DU5GOb-uPEBg@mail.gmail.com>
is not a hack, but in fact correcting datum_image_eq/datum_image_hash
to not assume that unspecified bits are reliably the same.

            regards, tom lane



Re: SQL-level pg_datum_image_equal

От
Matthias van de Meent
Дата:
On Thu, 26 Mar 2026 at 18:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> > On Wed, 25 Mar 2026 at 22:51, David Rowley <dgrowleyml@gmail.com> wrote:
> >> You lost me at this part. How does marking the function as STABLE
> >> prevent users from persisting things on disk based on the return value
> >> of the function? I expected the primary use case for this would be in
> >> trigger functions that make decisions about data that goes into
> >> tables.
>
> > Indexes and stored generated columns' expression may only contain
> > IMMUTABLE functions, so that they don't change output when the inputs
> > values are unchanged. As the current datum_image_equal depends on the
> > volatile contents/definition of sign-extended bytes (which we clearly
> > don't have a defined/expected value for) that makes the output of this
> > function not immutable for the "same" input values.
>
> This seems to me to be a rather creative misinterpretation of what
> STABLE and IMMUTABLE mean.  If you want to claim that IMMUTABLE means
> that, then the function isn't STABLE either, since it could give
> different results for the "same" input values within one query.
> Moreover, switching from IMMUTABLE to STABLE wouldn't fix the
> problem of users assuming more than they should.

Yeah, that's fair.

> The actual problem here is that datum_image_eq is assuming more
> than it should about the contents of a pass-by-value Datum.
> That was okay for its original use-cases because a false not-equal
> report would just end in not applying some optimization.  But
> Memoize thinks that the answers are exact, and users would too
> if we expose the function at SQL level.
>
> I think what David proposed at
> <CAApHDvreF-UiqBaHtRTQWQ6z1X9snstJW+dfb2DU5GOb-uPEBg@mail.gmail.com>
> is not a hack, but in fact correcting datum_image_eq/datum_image_hash
> to not assume that unspecified bits are reliably the same.

Agreed.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)