Thread: EOL characters and multibyte encodings


EOL characters and multibyte encodings

From
Joe Conway
Date:
I was finally able to get PL/R to compile and run on Windows recently. This has 
led to people using a Windows-based client (typically PgAdmin III) to 
create PL/R functions. Immediately I started to receive reports of 
failures that turned out to be due to the carriage return (\r) used in 
standard Win32 EOLs (\r\n). It seems that the R parser only accepts 
newlines (\n), even on Win32 (confirmed on r-devel list with a core 
developer).

My first thought on fixing this issue was to simply replace all 
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the 
R parser. As far as I know, any instances of '\r' embedded in a 
syntactically valid R statement must be escaped (i.e. literally the 
characters "\" and "r"), so that should not be a problem. But I am 
concerned about how this potentially plays against multibyte characters. 
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

Thanks,

Joe




Re: EOL characters and multibyte encodings

From
Tom Lane
Date:
Joe Conway <mail@joeconway.com> writes:
> I was finally able to get PL/R to compile and run on Windows recently. This has 
> led to people using a Windows-based client (typically PgAdmin III) to 
> create PL/R functions. Immediately I started to receive reports of 
> failures that turned out to be due to the carriage return (\r) used in 
> standard Win32 EOLs (\r\n). It seems that the R parser only accepts 
> newlines (\n), even on Win32 (confirmed on r-devel list with a core 
> developer).

> My first thought on fixing this issue was to simply replace all 
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the 
> R parser. As far as I know, any instances of '\r' embedded in a 
> syntactically valid R statement must be escaped (i.e. literally the 
> characters "\" and "r"), so that should not be a problem. But I am 
> concerned about how this potentially plays against multibyte characters. 
> Is it safe to do this, or do I need to use a mb-aware replace algorithm?

It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

However I dislike doing it exactly that way because line numbers in the
R script will all get doubled.  Unless R never reports errors in terms
of line numbers, you'd be better off to either delete the \r characters
or replace them with spaces.
        regards, tom lane
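
To make the line-number concern concrete, here is a minimal sketch in Python (not PL/R's actual C code). It assumes a parser that treats only LF as a line terminator; the helper `line_of` and the sample R source are hypothetical, introduced purely for illustration.

```python
def line_of(src: bytes, needle: bytes) -> int:
    """Return the 1-based line on which needle first appears, splitting on LF only."""
    for lineno, line in enumerate(src.split(b"\n"), start=1):
        if needle in line:
            return lineno
    raise ValueError("needle not found")


# Hypothetical function body as a Windows client would send it (CRLF line ends).
crlf_src = b"x <- 1\r\ny <- 2\r\nz <- x + y\r\n"

as_newlines = crlf_src.replace(b"\r", b"\n")  # the originally proposed fix
deleted = crlf_src.replace(b"\r", b"")        # alternative: delete each CR
as_spaces = crlf_src.replace(b"\r", b" ")     # alternative: CR becomes a space

# The third statement moves to line 5 when every CR becomes an LF,
# but stays on line 3 with either alternative.
print(line_of(as_newlines, b"z <-"), line_of(deleted, b"z <-"), line_of(as_spaces, b"z <-"))
```

Replacing with a space keeps the byte length unchanged, which allows an in-place edit; deleting leaves no trailing whitespace. Neither changes the line count.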


Re: EOL characters and multibyte encodings

From
Andrew Dunstan
Date:

Joe Conway wrote:
> I was finally able to get PL/R to compile and run on Windows recently. This 
> has led to people using a Windows-based client (typically PgAdmin 
> III) to create PL/R functions. Immediately I started to receive 
> reports of failures that turned out to be due to the carriage return 
> (\r) used in standard Win32 EOLs (\r\n). It seems that the R parser 
> only accepts newlines (\n), even on Win32 (confirmed on r-devel list 
> with a core developer).
>
> My first thought on fixing this issue was to simply replace all 
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to 
> the R parser. As far as I know, any instances of '\r' embedded in a 
> syntactically valid R statement must be escaped (i.e. literally the 
> characters "\" and "r"), so that should not be a problem. But I am 
> concerned about how this potentially plays against multibyte 
> characters. Is it safe to do this, or do I need to use a mb-aware 
> replace algorithm?
>
>

Didn't we just settle that all the server-side encodings have to be 
ASCII supersets? In which case, just removing the CRs should be quite safe.

cheers

andrew
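
As a rough check of the "ASCII superset" argument, the Python sketch below strips CR at the byte level from text encoded as UTF-8 and EUC-JP (both usable as server encodings) and compares the result with stripping at the character level. The helper names and the sample text are assumptions made for this illustration, and Python's codecs stand in for the backend's own encoding routines.

```python
def strip_cr(src: bytes) -> bytes:
    """Delete every 0x0D byte without decoding the input."""
    return src.replace(b"\r", b"")


def multibyte_bytes_all_high(text: str, encoding: str) -> bool:
    """True if every byte of every non-ASCII character has the high bit set."""
    return all(
        byte >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for byte in ch.encode(encoding)
    )


def byte_strip_matches_char_strip(text: str, encoding: str) -> bool:
    """True if byte-level CR removal gives the same result as character-level removal."""
    return strip_cr(text.encode(encoding)).decode(encoding) == text.replace("\r", "")


# Hypothetical R source containing multibyte characters and CRLF line ends.
sample = 'msg <- "\u65e5\u672c\u8a9e"\r\nprint(msg)\r\n'

for enc in ("utf-8", "euc_jp"):
    print(enc, multibyte_bytes_all_high(sample, enc), byte_strip_matches_char_strip(sample, enc))
```

Because no byte of a multibyte character can equal 0x0D in these encodings, the byte-level replace cannot split or corrupt a character.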


Re: EOL characters and multibyte encodings

From
Joe Conway
Date:
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> My first thought on fixing this issue was to simply replace all 
>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the 
>> R parser. As far as I know, any instances of '\r' embedded in a 
>> syntactically valid R statement must be escaped (i.e. literally the 
>> characters "\" and "r"), so that should not be a problem. But I am 
>> concerned about how this potentially plays against multibyte characters. 
>> Is it safe to do this, or do I need to use a mb-aware replace algorithm?
> 
> It's safe, because you'll be dealing with prosrc inside the backend,
> therefore using a backend-legal encoding, and those don't have any ASCII
> aliasing problems (all bytes of an MB character must have high bit set).

Great -- I wasn't sure about that.

> However I dislike doing it exactly that way because line numbers in the
> R script will all get doubled.  Unless R never reports errors in terms
> of line numbers, you'd be better off to either delete the \r characters
> or replace them with spaces.

Good point. But I need to be able to deal with Apple EOLs too -- IIRC 
those can be *only* '\r'. So I guess I need to do a look-ahead whenever 
I run into '\r', see if it is followed by '\n', and then munge the 
string accordingly.

Joe
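
The look-ahead described above could be sketched as follows. This is an illustrative Python version, not the code that went into PL/R; the function name `normalize_eol` is hypothetical. A CR followed by LF is dropped (so CRLF becomes LF), while a lone CR is turned into LF, which keeps the line count intact for all three conventions.

```python
CR = 0x0D
LF = 0x0A


def normalize_eol(src: bytes) -> bytes:
    """Convert CRLF and lone-CR line endings to LF using one byte of look-ahead."""
    out = bytearray()
    n = len(src)
    for i, byte in enumerate(src):
        if byte == CR:
            if i + 1 < n and src[i + 1] == LF:
                # CRLF pair: skip the CR; the LF is copied on the next iteration.
                continue
            # Lone CR (old Mac style): emit an LF in its place.
            out.append(LF)
        else:
            out.append(byte)
    return bytes(out)


# Mixed input: Windows, old Mac, and Unix line endings in one string.
print(normalize_eol(b"a <- 1\r\nb <- 2\rc <- 3\n"))
```

In Python the same result can be had with `src.replace(b"\r\n", b"\n").replace(b"\r", b"\n")`; the explicit loop is shown because it maps more directly onto a single pass over a C string.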


Re: EOL characters and multibyte encodings

From
"William ZHANG"
Date:
"Joe Conway" <mail@joeconway.com>
> Tom Lane wrote:
>> Joe Conway <mail@joeconway.com> writes:
>>> My first thought on fixing this issue was to simply replace all 
>>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the 
>>> R parser. As far as I know, any instances of '\r' embedded in a 
>>> syntactically valid R statement must be escaped (i.e. literally the 
>>> characters "\" and "r"), so that should not be a problem. But I am 
>>> concerned about how this potentially plays against multibyte characters. 
>>> Is it safe to do this, or do I need to use a mb-aware replace algorithm?
>>
>> It's safe, because you'll be dealing with prosrc inside the backend,
>> therefore using a backend-legal encoding, and those don't have any ASCII
>> aliasing problems (all bytes of an MB character must have high bit set).

The trailing byte of some characters in BIG5, GBK, and GB18030 may be less than
0x7F, so it does not have the high bit set. Fortunately, none of these encodings
uses 0x0D or 0x0A (CR and LF) as part of a multibyte character.

Regards,
William ZHANG
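
Both halves of this observation can be checked with a short Python sketch. It relies on Python's built-in `big5`, `gbk`, and `gb18030` codecs as a stand-in for the real conversion tables, and it scans only the Basic Multilingual Plane; the helper names are hypothetical.

```python
def trail_bytes_in_ascii_range(ch: str, encoding: str) -> list:
    """Return the non-leading bytes of ch's encoding that fall below 0x80."""
    encoded = ch.encode(encoding)
    return [b for b in encoded[1:] if b < 0x80]


def multibyte_chars_containing_eol(encoding: str) -> list:
    """Scan the BMP for characters whose multibyte encoding contains CR or LF."""
    hits = []
    for code_point in range(0x80, 0x10000):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding (or a surrogate)
        if len(encoded) > 1 and (0x0D in encoded or 0x0A in encoded):
            hits.append(chr(code_point))
    return hits


# U+529F is a well-known Big5 case: its second byte is 0x5C, the ASCII backslash.
print(trail_bytes_in_ascii_range("\u529f", "big5"))

# An empty list means no multibyte character in the scanned range contains CR or LF.
for name in ("big5", "gbk", "gb18030"):
    print(name, multibyte_chars_containing_eol(name))
```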

> Great -- I wasn't sure about that.
>




Re: EOL characters and multibyte encodings

From
Andrew Dunstan
Date:

William ZHANG wrote:
>>>
>>> It's safe, because you'll be dealing with prosrc inside the backend,
>>> therefore using a backend-legal encoding, and those don't have any ASCII
>>> aliasing problems (all bytes of an MB character must have high bit set).
>>>       
>
> The lower byte of some characters in BIG5, GBK, GB18030 may be less than
> 0x7F and don't have the high bit set. Fortunately, they don't use 0x0D and
> 0x0A (CR and LF).
>
>   
>   

Those are client-only encodings, precisely for this sort of reason, and 
thus not relevant to the present discussion. As Tom points out above, 
when the language handler gets the code it will be encoded in the 
relevant backend encoding which can't be any of these.

(Side note: the restriction by the R parser to unix-only line endings is 
a dreadful piece of design. As Jon Postel rightly said, the best rule is 
"Be liberal in what you accept and conservative in what you send." Just 
about every parser for every language has been able to handle this, so 
why must R be different?)

cheers

andrew