Re: What is the maximum encoding-conversion growth rate, anyway?
From | Tatsuo Ishii |
---|---|
Subject | Re: What is the maximum encoding-conversion growth rate, anyway? |
Date | |
Msg-id | 20070529.091918.59670400.t-ishii@sraoss.co.jp |
In response to | What is the maximum encoding-conversion growth rate, anyway? (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: What is the maximum encoding-conversion growth rate, anyway?
|
List | pgsql-hackers |
> I just rearranged the code in mbutils.c a little bit to make it more
> robust if conversion of an over-length string is attempted, and noted
> this comment:
>
> /*
>  * When converting strings between different encodings, we assume that space
>  * for converted result is 4-to-1 growth in the worst case. The rate for
>  * currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
>  * kanna -> UTF8 is the worst case). So "4" should be enough for the moment.
>  *
>  * Note that this is not the same as the maximum character width in any
>  * particular encoding.
>  */
> #define MAX_CONVERSION_GROWTH 4
>
> It strikes me that this is overly pessimistic, since we do not support
> 5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
> in any supported encoding that require 4 bytes in another. Could we
> reduce the multiplier to 3? Or even 2? This has a direct impact on the
> longest COPY lines we can support, so I'd like it not to be larger than
> necessary.

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF-8 characters (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details. So
the worst case is now 6, rather than 3.

Can we add a column to pg_conversion which represents the "growth
rate"? This would keep the rate for most encodings much smaller than 6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
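
To make the arithmetic in the SHIFT_JIS_2004 example concrete, here is a minimal sketch in C. The helpers `growth_factor` and `conversion_buf_size` are hypothetical names invented for illustration; they are not PostgreSQL functions, and the per-conversion growth value passed to `conversion_buf_size` stands in for the proposed pg_conversion column, which does not exist. The byte values are the ones quoted in the message. The sketch assumes the multiplier is applied per input byte: the example character yields 6 output bytes, and because 0x82f5 itself occupies 2 bytes, the per-byte ratio comes out as 3.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of PostgreSQL): growth factor, rounded up,
 * for a conversion that turns src_len input bytes into dst_len output bytes.
 */
static size_t
growth_factor(size_t src_len, size_t dst_len)
{
    return (dst_len + src_len - 1) / src_len;
}

/* SHIFT_JIS_2004 0x82f5: a single 2-byte character. */
static const unsigned char sjis2004_char[] = {0x82, 0xf5};

/*
 * Its UTF-8 form as given in the message: two 3-byte sequences,
 * e3 81 8b followed by e3 82 9a.
 */
static const unsigned char utf8_pair[] = {0xe3, 0x81, 0x8b, 0xe3, 0x82, 0x9a};

/*
 * Hypothetical buffer sizing with a per-conversion growth rate, in the
 * spirit of the proposed "growth rate" column. The extra byte is for a
 * trailing NUL.
 */
static size_t
conversion_buf_size(size_t src_len, size_t growth)
{
    return src_len * growth + 1;
}
```

Calling `growth_factor(sizeof(sjis2004_char), sizeof(utf8_pair))` gives the per-byte ratio for this one mapping. A conversion known to need a smaller factor could pass that smaller value to `conversion_buf_size` instead of a global worst case, which is the saving the per-conversion column is meant to provide.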