Обсуждение: Small patch to improve safety of utf8_to_unicode().

Поиск

Список

Период

Сортировка

Small patch to improve safety of utf8_to_unicode().

От

Jeff Davis

Дата:

13 декабря 2025 г., 02:24:02

Attached.

Вложения

v1-0001-Make-utf8_to_unicode-safer.patch

Re: Small patch to improve safety of utf8_to_unicode().

От

Chao Li

Дата:

14 декабря 2025 г., 02:22:28

> On Dec 13, 2025, at 07:24, Jeff Davis <pgsql@j-davis.com> wrote:
>
> Attached.
>
>
> <v1-0001-Make-utf8_to_unicode-safer.patch>

This patch adds a length check to utf8_to_unicode(), I think which is where “safety” comes from. Can you please add an
alittle bit more to the commit message instead of only saying “improve safety”. It also deleted two redundant function
declarationsfrom pg_wchar.h, which may also worth a quick note in the commit message. 

The code changes all look good to me. Only nitpicks are:

1
```
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
index 07f895ae2bf..47bd2814460 100644
--- a/contrib/fuzzystrmatch/daitch_mokotoff.c
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -401,7 +401,8 @@ read_char(const unsigned char *str, int *ix)

     /* Decode UTF-8 character to ISO 10646 code point. */
     str += *ix;
-    c = utf8_to_unicode(str);
+    /* Assume byte sequence has not been broken. */
+    c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
```

Here we need an empty line above the new comment.

2
```
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a4bc29921de..c113cadf815 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -661,7 +661,8 @@ ucs_wcwidth(pg_wchar ucs)
 static int
 pg_utf_dsplen(const unsigned char *s)
 {
-    return ucs_wcwidth(utf8_to_unicode(s));
+    /* trust that input is not a truncated byte sequence */
+    return ucs_wcwidth(utf8_to_unicode(s, MAX_MULTIBYTE_CHAR_LEN));
 }
```

For the new comment, as a code reader, I wonder why we “trust” that? To me, it more feels like because of lacking
lengthinformation, we have to trust. I would like this comment to be enhanced a little bit with more information. 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Small patch to improve safety of utf8_to_unicode().

От

Jeff Davis

Дата:

15 декабря 2025 г., 23:23:44

On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
> This patch adds a length check to utf8_to_unicode(), I think which is
> where “safety” comes from. Can you please add an a little bit more to
> the commit message instead of only saying “improve safety”.

Right: it does not read past pg_mblen(), nor past the supplied length.

We could go further and check for valid continuation bytes, but the
other routines don't do that.

> It also deleted two redundant function declarations from pg_wchar.h,
> which may also worth a quick note in the commit message.

I committed that as an independent change and removed it from this
patch.

> +    /* Assume byte sequence has not been broken. */
> +    c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
> ```
>
> Here we need an empty line above the new comment.

Done, and I expanded the comment to explain why it's safe.

>  pg_utf_dsplen(const unsigned char *s)
>  {
> -    return ucs_wcwidth(utf8_to_unicode(s));
> +    /* trust that input is not a truncated byte sequence */
> +    return ucs_wcwidth(utf8_to_unicode(s,
> MAX_MULTIBYTE_CHAR_LEN));
>  }
> ```
>
> For the new comment, as a code reader, I wonder why we “trust” that?

We could use strlen(), but I was concerned that it might be used for
string fragments that aren't NUL-terminated, because it's intended for
a single character. A caller might reasonably assume that it wouldn't
read past pg_mblen().

So I changed the comment slightly to just say that it requires the
input is a valid UTF-8 sequence. Let me know if you have another
suggestion.

Regards,
    Jeff Davis

Вложения

v2-0001-Make-utf8_to_unicode-safer.patch

Re: Small patch to improve safety of utf8_to_unicode().

От

Chao Li

Дата:

16 декабря 2025 г., 02:34:41


> On Dec 16, 2025, at 04:23, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Sun, 2025-12-14 at 07:22 +0800, Chao Li wrote:
>> This patch adds a length check to utf8_to_unicode(), I think which is
>> where “safety” comes from. Can you please add an a little bit more to
>> the commit message instead of only saying “improve safety”.
>
> Right: it does not read past pg_mblen(), nor past the supplied length.
>
> We could go further and check for valid continuation bytes, but the
> other routines don't do that.
>
>> It also deleted two redundant function declarations from pg_wchar.h,
>> which may also worth a quick note in the commit message.
>
> I committed that as an independent change and removed it from this
> patch.
>
>> + /* Assume byte sequence has not been broken. */
>> + c = utf8_to_unicode(str, MAX_MULTIBYTE_CHAR_LEN);
>> ```
>>
>> Here we need an empty line above the new comment.
>
> Done, and I expanded the comment to explain why it's safe.
>
>>  pg_utf_dsplen(const unsigned char *s)
>>  {
>> - return ucs_wcwidth(utf8_to_unicode(s));
>> + /* trust that input is not a truncated byte sequence */
>> + return ucs_wcwidth(utf8_to_unicode(s,
>> MAX_MULTIBYTE_CHAR_LEN));
>>  }
>> ```
>>
>> For the new comment, as a code reader, I wonder why we “trust” that?
>
> We could use strlen(), but I was concerned that it might be used for
> string fragments that aren't NUL-terminated, because it's intended for
> a single character. A caller might reasonably assume that it wouldn't
> read past pg_mblen().
>
> So I changed the comment slightly to just say that it requires the
> input is a valid UTF-8 sequence. Let me know if you have another
> suggestion.
>
> Regards,
> Jeff Davis
>
>
> <v2-0001-Make-utf8_to_unicode-safer.patch>

V2 LGTM.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Small patch to improve safety of utf8_to_unicode().

От

Jeff Davis

Дата:

17 декабря 2025 г., 22:37:59

On Tue, 2025-12-16 at 07:34 +0800, Chao Li wrote:
> > <v2-0001-Make-utf8_to_unicode-safer.patch>
>
> V2 LGTM.

On second thought, if we're going to change something here, we should
probably have a more flexible API for both utf8_to_unicode() and
unicode_to_utf8().

Looking at the callers, I think we want to have signatures something
like:

/* returns number of bytes consumed, or -1 */
static inline ssize_t
utf8_to_unicode(char32_t *cp, const unsigned char *src, size_t srclen)
{
    ...
}

/* returns number of bytes written, or -1 */
static inline ssize_t
unicode_to_utf8(unsigned char *dst, size_t dstsize, char32_t cp)
{
    ...
}

That would make both APIs safer, and the caller wouldn't need to call
unicode_utf8len() or pg_utf8_mblen() separately.

We could also do more validation, but of course then the callers would
need to do something if they encounter a failure. We could also try to
catch NUL terminators in the middle of a sequence, which might be
useful.

Regards,
    Jeff Davis

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Small patch to improve safety of utf8_to_unicode().

Small patch to improve safety of utf8_to_unicode().

Вложения

Re: Small patch to improve safety of utf8_to_unicode().

Re: Small patch to improve safety of utf8_to_unicode().

Вложения

Re: Small patch to improve safety of utf8_to_unicode().

Re: Small patch to improve safety of utf8_to_unicode().