Обсуждение: [PATCH] Refactor bytea_sortsupport(), take two
Hi, This is a follow-up to b45242fd30ff [1]. Previously we separated varlena.c into varlena.c and bytea.c. This patch makes bytea_sortsupport() independent from varlena.c code as it was proposed before [2][3]. The benefits of this change are summarized in the commit message that I included to the patch. As always, your feedback is most appreciated. [1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b45242fd30ff [2]: https://postgr.es/m/1502394.1725398354@sss.pgh.pa.us [3]: https://postgr.es/m/CAJ7c6TO3X88dGd8C4Tb-Eq2ZDPz%2B9mP%2BKOwdzK_82BEz_cMPZg%40mail.gmail.com -- Best regards, Aleksander Alekseev
Вложения
Hi, > This is a follow-up to b45242fd30ff [1]. Previously we separated > varlena.c into varlena.c and bytea.c. This patch makes > bytea_sortsupport() independent from varlena.c code as it was proposed > before [2][3]. The benefits of this change are summarized in the > commit message that I included to the patch. > > As always, your feedback is most appreciated. cfbot indicates that v1 needs a rebase. Here is v2.
Вложения
On Thu, Aug 7, 2025 at 6:44 PM Aleksander Alekseev <aleksander@tigerdata.com> wrote: > > Hi, > > > This is a follow-up to b45242fd30ff [1]. Previously we separated > > varlena.c into varlena.c and bytea.c. This patch makes > > bytea_sortsupport() independent from varlena.c code as it was proposed > > before [2][3]. The benefits of this change are summarized in the > > commit message that I included to the patch. > > > > As always, your feedback is most appreciated. > > cfbot indicates that v1 needs a rebase. Here is v2. - * Relies on the assumption that text, VarChar, BpChar, and bytea all have the - * same representation. Callers that always use the C collation (e.g. - * non-collatable type callers like bytea) may have NUL bytes in their strings; - * this will not work with any other collation, though. + * Relies on the assumption that text, VarChar, and BpChar all have the + * same representation. Callers that use the C collation may have NUL bytes + * in their strings; this will not work with any other collation, though. - * More generally, it's okay that bytea callers can have NUL bytes in - * strings because abbreviated cmp need not make a distinction between + * Generally speaking, it's okay that C locale callers can have NUL bytes + * in strings because abbreviated cmp need not make a distinction between Don't these types disallow NUL bytes regardless of locale / character set? -- John Naylor Amazon Web Services
Hi John, Thanks for the review. > Don't these types disallow NUL bytes regardless of locale / character set? You are right, they do. Here is the patch v3 with corrected comments. -- Best regards, Aleksander Alekseev
Вложения
On Mon, Sep 15, 2025 at 3:40 PM Aleksander Alekseev <aleksander@tigerdata.com> wrote: > > > Don't these types disallow NUL bytes regardless of locale / character set? > > You are right, they do. Here is the patch v3 with corrected comments. Hi Aleksander, - * Relies on the assumption that text, VarChar, BpChar, and bytea all have the - * same representation. Callers that always use the C collation (e.g. - * non-collatable type callers like bytea) may have NUL bytes in their strings; - * this will not work with any other collation, though. + * Relies on the assumption that text, VarChar, and BpChar all have the + * same representation. These text types cannot contain NUL bytes. AFAICS, the only reaon to mention NUL bytes here before was because of bytea -- sirnce bytea is being removed from this path, I don't see a need to mention mention NUL bytes here. It'd be relevant if any simplification is possible in functions that previously may have had to worry about NUL bytes. I don't immediately see any such opportunities, though. Are there? + * Note: text types (text, varchar, bpchar) cannot contain NUL bytes, + * so we don't need to worry about NUL byte handling here. Same here. Also, "Note" is stronger than a normal comment, maybe to call attention to complex edge cases or weird behaviors. +#if SIZEOF_DATUM == 8 We recently made all datums 8 bytes. Some comments are randomly different than the equivalents in varlena.c. It's probably better if same things remain the same, but there's nothing wrong either. "Lastly, the performance and memory consumption could be optimized a little for the bytea case." Is this a side effect of the patch, or a possibility for future work? It's not clear. The rest seems fine at a glance. -- John Naylor Amazon Web Services
Hi Aleksander,
Overall LGTM. Just a few small comments:
On Sep 15, 2025, at 16:40, Aleksander Alekseev <aleksander@tigerdata.com> wrote:
--
Best regards,
Aleksander Alekseev
<v3-0001-Refactor-bytea_sortsupport.patch>
1.
····
+ /* We can't afford to leak memory here. */
+ if (PointerGetDatum(arg1) != x)
+ pfree(arg1);
+ if (PointerGetDatum(arg2) != y)
+ pfree(arg2);
···
Should we do:
```
PG_FREE_IF_COPY(arg1, 0);
PG_FREE_IF_COPY(arg2, 1)
```
Similar to other places.
2.
···
+/*
+ * Conversion routine for sortsupport.
+ */
+static Datum
+bytea_abbrev_convert(Datum original, SortSupport ssup)
···
The function comment is less descriptive. I would suggest something like:
```
/*
* Abbreviated key conversion for bytea sortsupport.
*
* Returns the abbreviated key as a Datum. If a detoasted copy was made,
* it is freed before returning.
*/
```
3.
```
+ if (abbrev_distinct <= 1.0)
+ abbrev_distinct = 1.0;
+
+ if (key_distinct <= 1.0)
+ key_distinct = 1.0;
```
Why <= 1.0 then set to 1.0? Shouldn't “if" takes just <1.0?
4.
```
Datum
bytea_sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
MemoryContext oldcontext;
+ ByteaSortSupport *bss;
```
“Bss” can be defined in the “if” clause where it is used.
5.
```
Datum
bytea_sortsupport(PG_FUNCTION_ARGS)
{
SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);
MemoryContext oldcontext;
+ ByteaSortSupport *bss;
+ bool abbreviate = ssup->abbreviate;
```
The local variable “abbreviate” is only used once, do we really need to cache ssup->abbreviate into a local variable?
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi Chao, > 1. > ···· > + /* We can't afford to leak memory here. */ > + if (PointerGetDatum(arg1) != x) > + pfree(arg1); > + if (PointerGetDatum(arg2) != y) > + pfree(arg2); > ··· > > Should we do: > ``` > PG_FREE_IF_COPY(arg1, 0); > PG_FREE_IF_COPY(arg2, 1) > ``` > > Similar to other places. Hmmm... to me it doesn't look more readable to be honest. I choose the same pattern used in varlena.c and suggest keeping it for consistency. IMO this refactoring can be discussed later in a separate thread. Good observation though. > 2. > ··· > +/* > + * Conversion routine for sortsupport. > + */ > +static Datum > +bytea_abbrev_convert(Datum original, SortSupport ssup) > ··· > > The function comment is less descriptive. I would suggest something like: > ``` > /* > * Abbreviated key conversion for bytea sortsupport. > * > * Returns the abbreviated key as a Datum. If a detoasted copy was made, > * it is freed before returning. > */ > ``` Sorry, but I don't think I agree with this either. The comment you are proposing exposes the implementation details. I don't think that the caller needs to know them. This is basically a simplified varstr_abbrev_convert(). I was not certain if I should mention this in a comment and/or how much text I should copy from the comment for varstr_abbrev_convert(). I'm open to suggestions. > 3. > ``` > + if (abbrev_distinct <= 1.0) > + abbrev_distinct = 1.0; > + > + if (key_distinct <= 1.0) > + key_distinct = 1.0; > ``` > > Why <= 1.0 then set to 1.0? Shouldn't “if" takes just <1.0? Good point. I'll fix this in v4. Also I'll modify varstr_abbrev_abort() for consistency. One could argue that the change of varstr_abbrev_abort() is irrelevant in the context of this patch. If there are objections we can easily change '<' back to '<='. It's not that important but '<' looks cleaner IMO. > 4. > ``` > Datum > bytea_sortsupport(PG_FUNCTION_ARGS) > { > SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); > MemoryContext oldcontext; > + ByteaSortSupport *bss; > ``` > > “Bss” can be defined in the “if” clause where it is used. Agree. I'll fix this in v4. > 5. > ``` > Datum > bytea_sortsupport(PG_FUNCTION_ARGS) > { > SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); > MemoryContext oldcontext; > + ByteaSortSupport *bss; > + bool abbreviate = ssup->abbreviate; > ``` > > The local variable “abbreviate” is only used once, do we really need to cache ssup->abbreviate into a local variable? Agree, thanks. -- Best regards, Aleksander Alekseev
Hi John, > - * Relies on the assumption that text, VarChar, BpChar, and bytea all have the > - * same representation. Callers that always use the C collation (e.g. > - * non-collatable type callers like bytea) may have NUL bytes in their strings; > - * this will not work with any other collation, though. > + * Relies on the assumption that text, VarChar, and BpChar all have the > + * same representation. These text types cannot contain NUL bytes. > > AFAICS, the only reaon to mention NUL bytes here before was because of > bytea -- sirnce bytea is being removed from this path, I don't see a > need to mention mention NUL bytes here. It'd be relevant if any > simplification is possible in functions that previously may have had > to worry about NUL bytes. I don't immediately see any such > opportunities, though. Are there? Doesn't seem so. I removed the mention of NUL bytes. > + * Note: text types (text, varchar, bpchar) cannot contain NUL bytes, > + * so we don't need to worry about NUL byte handling here. > > Same here. Also, "Note" is stronger than a normal comment, maybe to > call attention to complex edge cases or weird behaviors. Agree. Fixed. > +#if SIZEOF_DATUM == 8 > > We recently made all datums 8 bytes. Right, I forgot to change the patch accordingly. Fixed. > Some comments are randomly different than the equivalents in > varlena.c. It's probably better if same things remain the same, but > there's nothing wrong either. I did my best to keep the comments consistent between varlena.c and bytea.c. I don't think they are going to be that consistent in the long term anyway, so not sure how much effort we should invest into this. > "Lastly, the performance and memory consumption could be optimized a little for > the bytea case." > > Is this a side effect of the patch, or a possibility for future work? > It's not clear. The idea was to reflect the fact that bytea-specific code becomes simpler (less if's, etc) and uses structures of smaller size. This is probably not that important for the commit message. I removed it in order to avoid confusion. PFA patch v4. -- Best regards, Aleksander Alekseev