Обсуждение: [PATCH] avoid double scanning in function byteain

Поиск
Список
Период
Сортировка

[PATCH] avoid double scanning in function byteain

От
Steven Niu
Дата:
Hi,

The byteain function converts a string input into a bytea type.
The original implementation processes two input formats:
a hex format (starting with \x) and a traditional escaped format.
For the escaped format, the function scans the input string twice
— once to calculate the exact size of the output and allocate memory,
and again to fill the allocated memory with the parsed data.

This double scanning can be inefficient, especially for large inputs.
So I optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Please help review it and share your valuable comments.

Thanks,
Steven Niu
https://www.highgo.com/
Вложения

Re: [PATCH] avoid double scanning in function byteain

От
Kirill Reshke
Дата:
On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>
> Hi,

Hi!

> This double scanning can be inefficient, especially for large inputs.
> So I optimized the function to eliminate the need for two scans,
> while preserving correctness and efficiency.

While the argument that processing input once not twice is fast is
generally true, may we have some simple bench here just to have an
idea how valuable this patch is?


Patch:


>+ /* Handle traditional escaped style in a single pass */
>+ input_len = strlen(inputText);
>+ result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>  rp = VARDATA(result);
>+ tp = inputText;
>+
>  while (*tp != '\0')


Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?



[0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45

-- 
Best regards,
Kirill Reshke



Re: [PATCH] avoid double scanning in function byteain

От
Steven Niu
Дата:
在 2025/3/26 16:37, Kirill Reshke 写道:
> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>>
>> Hi,
> 
> Hi!
> 
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
> 
> While the argument that processing input once not twice is fast is
> generally true, may we have some simple bench here just to have an
> idea how valuable this patch is?
> 
> 
> Patch:
> 
> 
>> + /* Handle traditional escaped style in a single pass */
>> + input_len = strlen(inputText);
>> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>>   rp = VARDATA(result);
>> + tp = inputText;
>> +
>>   while (*tp != '\0')
> 
> 
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
> 
> 
> 
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
> 



Hi, Kirill,

Your deep insight suprised me!

Yes, you are correct that strlen() actually performed a loop operation.
So maybe the performance difference is not so obvious.

However, there are some other reasons that drive me to make this change.

1. The author of original code left comment: "BUGS: The input is scanned 
twice." .
You can find this line of code in my patch. This indicates a left work 
to be done.

2. If I were the author of this function, I would not be satisfied with 
myself that I used two loops to do something which actually can be done 
with one loop.
I prefer to choose a way that would not add more burden to readers.

3. The while (*tp != '\0') loop has some unnecessary codes and I made 
some change.

Thanks,
Steven





Re: [PATCH] avoid double scanning in function byteain

От
Stepan Neretin
Дата:


On Wed, Mar 26, 2025 at 9:39 PM Steven Niu <niushiji@gmail.com> wrote:

在 2025/3/26 16:37, Kirill Reshke 写道:
> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>>
>> Hi,
>
> Hi!
>
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
>
> While the argument that processing input once not twice is fast is
> generally true, may we have some simple bench here just to have an
> idea how valuable this patch is?
>
>
> Patch:
>
>
>> + /* Handle traditional escaped style in a single pass */
>> + input_len = strlen(inputText);
>> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>>   rp = VARDATA(result);
>> + tp = inputText;
>> +
>>   while (*tp != '\0')
>
>
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
>
>
>
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
>



Hi, Kirill,

Your deep insight suprised me!

Yes, you are correct that strlen() actually performed a loop operation.
So maybe the performance difference is not so obvious.

However, there are some other reasons that drive me to make this change.

1. The author of original code left comment: "BUGS: The input is scanned
twice." .
You can find this line of code in my patch. This indicates a left work
to be done.

2. If I were the author of this function, I would not be satisfied with
myself that I used two loops to do something which actually can be done
with one loop.
I prefer to choose a way that would not add more burden to readers.

3. The while (*tp != '\0') loop has some unnecessary codes and I made
some change.

Thanks,
Steven





Hi hackers,

This is a revised version (v2) of the patch that optimizes the `byteain()` function.

The original implementation handled escaped input by scanning the string twice — first to determine the output size and again to fill in the bytea. This patch eliminates the double scan by using a single-pass approach with `StringInfo`, simplifying the logic and improving maintainability.

Changes since v1 (originally by Steven Niu):
- Use `StringInfo` instead of manual memory allocation.
- Remove redundant code and improve readability.
- Add regression tests for both hex and escaped formats.

This version addresses performance and clarity while ensuring compatibility with existing behavior. The patch also reflects discussion on the original version, including feedback from Kirill Reshke.

Looking forward to your review and comments.

Best regards,  
Stepan Neretin
Вложения

Re: [PATCH] avoid double scanning in function byteain

От
Stepan Neretin
Дата:


On Fri, May 9, 2025 at 5:37 PM Stepan Neretin <slpmcf@gmail.com> wrote:


On Fri, May 9, 2025 at 5:24 PM Stepan Neretin <slpmcf@gmail.com> wrote:


On Wed, Mar 26, 2025 at 9:39 PM Steven Niu <niushiji@gmail.com> wrote:

在 2025/3/26 16:37, Kirill Reshke 写道:
> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>>
>> Hi,
>
> Hi!
>
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
>
> While the argument that processing input once not twice is fast is
> generally true, may we have some simple bench here just to have an
> idea how valuable this patch is?
>
>
> Patch:
>
>
>> + /* Handle traditional escaped style in a single pass */
>> + input_len = strlen(inputText);
>> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>>   rp = VARDATA(result);
>> + tp = inputText;
>> +
>>   while (*tp != '\0')
>
>
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
>
>
>
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
>



Hi, Kirill,

Your deep insight suprised me!

Yes, you are correct that strlen() actually performed a loop operation.
So maybe the performance difference is not so obvious.

However, there are some other reasons that drive me to make this change.

1. The author of original code left comment: "BUGS: The input is scanned
twice." .
You can find this line of code in my patch. This indicates a left work
to be done.

2. If I were the author of this function, I would not be satisfied with
myself that I used two loops to do something which actually can be done
with one loop.
I prefer to choose a way that would not add more burden to readers.

3. The while (*tp != '\0') loop has some unnecessary codes and I made
some change.

Thanks,
Steven





Hi hackers,

This is a revised version (v2) of the patch that optimizes the `byteain()` function.

The original implementation handled escaped input by scanning the string twice — first to determine the output size and again to fill in the bytea. This patch eliminates the double scan by using a single-pass approach with `StringInfo`, simplifying the logic and improving maintainability.

Changes since v1 (originally by Steven Niu):
- Use `StringInfo` instead of manual memory allocation.
- Remove redundant code and improve readability.
- Add regression tests for both hex and escaped formats.

This version addresses performance and clarity while ensuring compatibility with existing behavior. The patch also reflects discussion on the original version, including feedback from Kirill Reshke.

Looking forward to your review and comments.

Best regards,  
Stepan Neretin


Hi,

I noticed that the previous version of the patch was authored with an incorrect email address due to a misconfigured git config.

I've corrected the author information in this v2 and made sure it's consistent with my usual contributor identity. No other changes were introduced apart from that and the updates discussed earlier.

Sorry for the confusion, and thanks for your understanding.

Best regards,

Stepan Neretin



Hi,

Sorry for the noise — I'm resending the patch because I noticed a compiler warning related to mixed declarations and code, which I’ve now fixed.

Apologies for the oversight in the previous submission.

Regards,

Stepan Neretin

Вложения

Re: [PATCH] avoid double scanning in function byteain

От
Aleksander Alekseev
Дата:
Hi Stepan,

> Sorry for the noise — I'm resending the patch because I noticed a compiler warning related to mixed declarations and
code,which I’ve now fixed. 
>
> Apologies for the oversight in the previous submission.

Thanks for the patch.

As Kirill pointed out above, it would be nice if you could prove that
your implementation is actually faster. I think something like a
simple micro-benchmark will do.

--
Best regards,
Aleksander Alekseev



Re: [PATCH] avoid double scanning in function byteain

От
Stepan Neretin
Дата:


On Fri, May 9, 2025 at 7:43 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
Hi Stepan,

> Sorry for the noise — I'm resending the patch because I noticed a compiler warning related to mixed declarations and code, which I’ve now fixed.
>
> Apologies for the oversight in the previous submission.

Thanks for the patch.

As Kirill pointed out above, it would be nice if you could prove that
your implementation is actually faster. I think something like a
simple micro-benchmark will do.

--
Best regards,
Aleksander Alekseev


Hi,

Thanks for the feedback.

I’ve done a simple micro-benchmark using PL/pgSQL with a large escaped input string (\\123 repeated 100,000 times), converted to bytea in a loop:

DO $$
DECLARE    start_time TIMESTAMP;    end_time TIMESTAMP;    i INTEGER;    dummy BYTEA;    input TEXT := repeat(E'\\123', 100000);    elapsed_ms DOUBLE PRECISION;
BEGIN    start_time := clock_timestamp();
    FOR i IN 1..1000 LOOP        dummy := input::bytea;    END LOOP;
    end_time := clock_timestamp();    elapsed_ms := EXTRACT(EPOCH FROM end_time - start_time) * 1000;    RAISE NOTICE 'Average time per call: % ms', elapsed_ms / 1000;
END
$$;

Here are the results from NOTICE output:

Without patch:

NOTICE:  Average time per call: 0.49176600000000004 ms
NOTICE:  Average time per call: 0.47658999999999996 ms

With patch:

NOTICE:  Average time per call: 0.468231 ms
NOTICE:  Average time per call: 0.463909 ms

The gain is small but consistent. Let me know if a more rigorous benchmark would be useful.

Best regards,
Stepan Neretin

Re: [PATCH] avoid double scanning in function byteain

От
Peter Eisentraut
Дата:
The relationship between patch 0001 and 0002 is unclear to me.  Are 
these incremental or alternatives?  The description doesn't make this clear.

Some of the changes in patch 0002 just appear to move code and comments 
around without changing anything substantial.  It's not clear why that 
is done, as it's not related to what the patch claims it does.

The main tests for the bytea type input formats are in 
src/test/regress/sql/strings.sql, so you should add any new tests there. 
  Maybe there are already enough tests there that you don't need any new 
ones.

Overall, I would consider the bytea "escaped" format kind of 
obsolescent.  But if you want to make it a bit faster with little other 
impact, why not.



Re: [PATCH] avoid double scanning in function byteain

От
Tom Lane
Дата:
Peter Eisentraut <peter@eisentraut.org> writes:
> The relationship between patch 0001 and 0002 is unclear to me.  Are 
> these incremental or alternatives?  The description doesn't make this clear.

It appears to me that 0002 is actually counterproductive.  I cannot
see a reason to get a StringInfo involved here: it adds overhead
and removes no complexity worth noticing.  If it were hard to get
a close-enough upper bound for the output length, then yeah a
StringInfo could be a good solution.  But the "strlen(inputText)"
proposed in 0001 seems plenty good enough, especially since as you
say this is a somewhat obsolescent format.  The fact that it would
often overallocate somewhat doesn't bother me --- and a StringInfo
would in most cases overallocate by a lot more.

I'm inclined to accept 0001, reject 0002, and move on.  This doesn't
seem like an area that's worth a huge amount of discussion.

> The main tests for the bytea type input formats are in 
> src/test/regress/sql/strings.sql, so you should add any new tests there. 
>   Maybe there are already enough tests there that you don't need any new 
> ones.

The code coverage report shows that byteain is covered except for the
path handling "\\".  I'd be content to add one test query, or extend
some existing query, to make that branch get hit.

BTW, the patch needs rebasing because this code just got moved
to bytea.c.

            regards, tom lane



Re: [PATCH] avoid double scanning in function byteain

От
Tom Lane
Дата:
I wrote:
> I'm inclined to accept 0001, reject 0002, and move on.  This doesn't
> seem like an area that's worth a huge amount of discussion.

Done that way.  I made a couple more cosmetic changes and added
test cases for the double-backslash code path (which hadn't been
covered in byteaout either, I see now).

BTW, in my hands the microbenchmark that Stepan suggested shows the
committed version to be about 40% faster than before for long input.
So apparently the StringInfo-ification suggested in 0002 gave back
just about all the performance gain from 0001.

            regards, tom lane



Re: [PATCH] avoid double scanning in function byteain

От
Stepan Neretin
Дата:


On Sat, Jul 19, 2025 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> I'm inclined to accept 0001, reject 0002, and move on.  This doesn't
> seem like an area that's worth a huge amount of discussion.

Done that way.  I made a couple more cosmetic changes and added
test cases for the double-backslash code path (which hadn't been
covered in byteaout either, I see now).

BTW, in my hands the microbenchmark that Stepan suggested shows the
committed version to be about 40% faster than before for long input.
So apparently the StringInfo-ification suggested in 0002 gave back
just about all the performance gain from 0001.

                        regards, tom lane

Hi Tom,

Thanks a lot for reviewing and committing the change — much appreciated!

I agree with your rationale regarding patch 0001 vs 0002. It makes sense to avoid the overhead of StringInfo in this context, especially given the measurable performance benefit from the simpler approach.

One small thing: it seems the commit or diff with the final adjustments and test additions wasn't attached or linked in the thread. Could you please point me to the commit hash or reference? I’d love to take a look at the final version.

Best regards,
Stepan Neretin 

Re: [PATCH] avoid double scanning in function byteain

От
"David G. Johnston"
Дата:
On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:

One small thing: it seems the commit or diff with the final adjustments and test additions wasn't attached or linked in the thread. Could you please point me to the commit hash or reference? I’d love to take a look at the final version.



 The pgsql-committers list is searchable.  All commits get sent there.


David J.

Re: [PATCH] avoid double scanning in function byteain

От
Stepan Neretin
Дата:


On Mon, Jul 28, 2025 at 1:41 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:

One small thing: it seems the commit or diff with the final adjustments and test additions wasn't attached or linked in the thread. Could you please point me to the commit hash or reference? I’d love to take a look at the final version.



 The pgsql-committers list is searchable.  All commits get sent there.


David J.


Hi David,

Yes, I'm aware of the pgsql-committers archive and I did check there — thank you for the reminder!

However, I couldn’t find the patch or final diff in either the pgsql-committers message you linked or as an attachment in the original thread.

Best regards,
Stepan 

[PATCH] avoid double scanning in function byteain

От
"David G. Johnston"
Дата:
On Sunday, July 27, 2025, Stepan Neretin <slpmcf@gmail.com> wrote:
However, I couldn’t find the patch or final diff in either the pgsql-committers message you linked or as an attachment in the original thread.

There is a gitweb link included in the Details section.  Click that.  Or just read off the first 8 characters of the commit hash in that link and plug it into your git client.

David J.

Re: [PATCH] avoid double scanning in function byteain

От
Tom Lane
Дата:
Stepan Neretin <slpmcf@gmail.com> writes:
> However, I couldn’t find the patch or final diff in either the
> pgsql-committers message you linked or as an attachment in the original
> thread.

Commit is here:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3683af617

The "patch" link on such pages is good if you want a locally-applyable
patch rather than a colorized version.

The "details" link in the email that David pointed you to would also
have gotten you there.

            regards, tom lane