Обсуждение: Small patch to improve safety of utf8_to_unicode().

Поиск
Список
Период
Сортировка

Small patch to improve safety of utf8_to_unicode().

От
Jeff Davis
Дата:
Attached.



Вложения

Re: Small patch to improve safety of utf8_to_unicode().

От
Chao Li
Дата:

> On Dec 13, 2025, at 07:24, Jeff Davis <pgsql@j-davis.com> wrote:
>
> Attached.
>
>
> <v1-0001-Make-utf8_to_unicode-safer.patch>


This patch adds a length check to utf8_to_unicode(), I think which is where “safety” comes from. Can you please add an
alittle bit more to the commit message instead of only saying “improve safety”. It also deleted two redundant function
declarationsfrom pg_wchar.h, which may also worth a quick note in the commit message. 

The code changes all look good to me. Only nitpicks are:

1
```
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
index 07f895ae2bf..47bd2814460 100644
--- a/contrib/fuzzystrmatch/daitch_mokotoff.c
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -401,7 +401,8 @@ read_char(const unsigned char *str, int *ix)

     /* Decode UTF-8 character to ISO 10646 code point. */
     str += *ix;
-    c = utf8_to_unicode(str);
+    /* Assume byte sequence has not been broken. */
+    c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
```

Here we need an empty line above the new comment.

2
```
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a4bc29921de..c113cadf815 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -661,7 +661,8 @@ ucs_wcwidth(pg_wchar ucs)
 static int
 pg_utf_dsplen(const unsigned char *s)
 {
-    return ucs_wcwidth(utf8_to_unicode(s));
+    /* trust that input is not a truncated byte sequence */
+    return ucs_wcwidth(utf8_to_unicode(s, MAX_MULTIBYTE_CHAR_LEN));
 }
```

For the new comment, as a code reader, I wonder why we “trust” that? To me, it more feels like because of lacking
lengthinformation, we have to trust. I would like this comment to be enhanced a little bit with more information. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Small patch to improve safety of utf8_to_unicode().

От
Jeff Davis
Дата:
On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
> This patch adds a length check to utf8_to_unicode(), I think which is
> where “safety” comes from. Can you please add an a little bit more to
> the commit message instead of only saying “improve safety”.

Right: it does not read past pg_mblen(), nor past the supplied length.

We could go further and check for valid continuation bytes, but the
other routines don't do that.

> It also deleted two redundant function declarations from pg_wchar.h,
> which may also worth a quick note in the commit message.

I committed that as an independent change and removed it from this
patch.

> +    /* Assume byte sequence has not been broken. */
> +    c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
> ```
>
> Here we need an empty line above the new comment.

Done, and I expanded the comment to explain why it's safe.

>  pg_utf_dsplen(const unsigned char *s)
>  {
> -    return ucs_wcwidth(utf8_to_unicode(s));
> +    /* trust that input is not a truncated byte sequence */
> +    return ucs_wcwidth(utf8_to_unicode(s,
> MAX_MULTIBYTE_CHAR_LEN));
>  }
> ```
>
> For the new comment, as a code reader, I wonder why we “trust” that?

We could use strlen(), but I was concerned that it might be used for
string fragments that aren't NUL-terminated, because it's intended for
a single character. A caller might reasonably assume that it wouldn't
read past pg_mblen().

So I changed the comment slightly to just say that it requires the
input is a valid UTF-8 sequence. Let me know if you have another
suggestion.

Regards,
    Jeff Davis



Вложения

Re: Small patch to improve safety of utf8_to_unicode().

От
Chao Li
Дата:

> On Dec 16, 2025, at 04:23, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
>> This patch adds a length check to utf8_to_unicode(), I think which is
>> where “safety” comes from. Can you please add an a little bit more to
>> the commit message instead of only saying “improve safety”.
>
> Right: it does not read past pg_mblen(), nor past the supplied length.
>
> We could go further and check for valid continuation bytes, but the
> other routines don't do that.
>
>> It also deleted two redundant function declarations from pg_wchar.h,
>> which may also worth a quick note in the commit message.
>
> I committed that as an independent change and removed it from this
> patch.
>
>> + /* Assume byte sequence has not been broken. */
>> + c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
>> ```
>>
>> Here we need an empty line above the new comment.
>
> Done, and I expanded the comment to explain why it's safe.
>
>>  pg_utf_dsplen(const unsigned char *s)
>>  {
>> - return ucs_wcwidth(utf8_to_unicode(s));
>> + /* trust that input is not a truncated byte sequence */
>> + return ucs_wcwidth(utf8_to_unicode(s,
>> MAX_MULTIBYTE_CHAR_LEN));
>>  }
>> ```
>>
>> For the new comment, as a code reader, I wonder why we “trust” that?
>
> We could use strlen(), but I was concerned that it might be used for
> string fragments that aren't NUL-terminated, because it's intended for
> a single character. A caller might reasonably assume that it wouldn't
> read past pg_mblen().
>
> So I changed the comment slightly to just say that it requires the
> input is a valid UTF-8 sequence. Let me know if you have another
> suggestion.
>
> Regards,
> Jeff Davis
>
>
> <v2-0001-Make-utf8_to_unicode-safer.patch>

V2 LGTM.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/