Discussion: EOL characters and multibyte encodings
I was finally able to get PL/R to compile and run on Windows recently. This has led to people using a Windows-based client (typically PgAdmin III) to create PL/R functions. Immediately I started to receive reports of failures that turned out to be due to the carriage return (\r) used in standard Win32 EOLs (\r\n). It seems that the R parser only accepts newlines (\n), even on Win32 (confirmed on the r-devel list with a core developer).

My first thought on fixing this issue was to simply replace all instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the R parser. As far as I know, any instances of '\r' embedded in a syntactically valid R statement must be escaped (i.e. literally the characters "\" and "r"), so that should not be a problem. But I am concerned about how this potentially plays against multibyte characters. Is it safe to do this, or do I need to use an mb-aware replace algorithm?

Thanks,
Joe
Joe Conway <mail@joeconway.com> writes:
> I finally was able PL/R to compile and run on Windows recently. This has
> lead to people using a Windows based client (typically PgAdmin III) to
> create PL/R functions. Immediately I started to receive reports of
> failures that turned out to be due to the carriage return (\r) used in
> standard Win32 EOLs (\r\n). It seems that the R parser only accepts
> newlines (\n), even on Win32 (confirmed on r-devel list with a core
> developer).

> My first thought on fixing this issue was to simply replace all
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
> R parser. As far as I know, any instances of '\r' embedded in a
> syntactically valid R statement must be escaped (i.e. literally the
> characters "\" and "r"), so that should not be a problem. But I am
> concerned about how this potentially plays against multibyte characters.
> Is it safe to do this, or do I need to use a mb-aware replace algorithm?

It's safe, because you'll be dealing with prosrc inside the backend, therefore using a backend-legal encoding, and those don't have any ASCII aliasing problems (all bytes of an MB character must have high bit set).

However I dislike doing it exactly that way because line numbers in the R script will all get doubled. Unless R never reports errors in terms of line numbers, you'd be better off to either delete the \r characters or replace them with spaces.

regards, tom lane
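A minimal sketch of the "replace them with spaces" option, assuming the function source is available as a NUL-terminated, writable C string in a server-side encoding. The name `blank_out_cr` is hypothetical and is not taken from PL/R. Because the string length does not change, R's line numbers stay aligned with the original source. Note that this variant does not turn a bare '\r' into a line break.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not PL/R's actual code): overwrite every
 * carriage return with a space, in place.
 *
 * Scanning byte by byte relies on the premise stated above: in a
 * server-side encoding every byte of a multibyte character has its
 * high bit set, so a 0x0D byte can only be a real CR.
 */
void
blank_out_cr(char *src)
{
    char *p;

    if (src == NULL)
        return;

    for (p = src; *p != '\0'; p++)
    {
        if (*p == '\r')
            *p = ' ';   /* same length, so line numbering is unchanged */
    }
}
```

Deleting the '\r' bytes instead would need a second write pointer, since the string shrinks.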
Joe Conway wrote:
> I finally was able PL/R to compile and run on Windows recently. This
> has lead to people using a Windows based client (typically PgAdmin
> III) to create PL/R functions. Immediately I started to receive
> reports of failures that turned out to be due to the carriage return
> (\r) used in standard Win32 EOLs (\r\n). It seems that the R parser
> only accepts newlines (\n), even on Win32 (confirmed on r-devel list
> with a core developer).
>
> My first thought on fixing this issue was to simply replace all
> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to
> the R parser. As far as I know, any instances of '\r' embedded in a
> syntactically valid R statement must be escaped (i.e. literally the
> characters "\" and "r"), so that should not be a problem. But I am
> concerned about how this potentially plays against multibyte
> characters. Is it safe to do this, or do I need to use a mb-aware
> replace algorithm?
>
>

Didn't we just settle that all the server-side encodings have to be ASCII supersets? In which case, just removing the CRs should be quite safe.

cheers

andrew
Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> My first thought on fixing this issue was to simply replace all
>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
>> R parser. As far as I know, any instances of '\r' embedded in a
>> syntactically valid R statement must be escaped (i.e. literally the
>> characters "\" and "r"), so that should not be a problem. But I am
>> concerned about how this potentially plays against multibyte characters.
>> Is it safe to do this, or do I need to use a mb-aware replace algorithm?
>
> It's safe, because you'll be dealing with prosrc inside the backend,
> therefore using a backend-legal encoding, and those don't have any ASCII
> aliasing problems (all bytes of an MB character must have high bit set).

Great -- I wasn't sure about that.

> However I dislike doing it exactly that way because line numbers in the
> R script will all get doubled. Unless R never reports errors in terms
> of line numbers, you'd be better off to either delete the \r characters
> or replace them with spaces.

Good point. But I need to be able to deal with Apple EOLs too -- IIRC those can be *only* '\r'. So I guess I need to do a look-ahead whenever I run into '\r', see if it is followed by '\n', and then munge the string accordingly.

Joe
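The look-ahead Joe describes could be sketched roughly as follows. This is only an illustration under stated assumptions: the source is a NUL-terminated, writable C string in a server-side encoding, and the name `normalize_eol` is hypothetical rather than PL/R's actual function. A "\r\n" pair collapses to a single '\n', and a bare '\r' becomes '\n', so every input line maps to exactly one output line.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not PL/R's actual code): normalize Win32 (\r\n)
 * and bare-CR (\r) line endings to Unix (\n), in place.
 *
 * The output is never longer than the input, so one read pointer and
 * one write pointer over the same buffer are enough.
 */
void
normalize_eol(char *src)
{
    const char *in;
    char       *out;

    if (src == NULL)
        return;

    in = src;
    out = src;

    while (*in != '\0')
    {
        if (*in == '\r')
        {
            if (in[1] == '\n')
            {
                /* CRLF: drop the CR; the LF is copied below */
                in++;
            }
            else
            {
                /* bare CR: treat it as a line break of its own */
                *out++ = '\n';
                in++;
                continue;
            }
        }
        *out++ = *in++;
    }
    *out = '\0';
}
```

Since the number of line breaks is preserved, line numbers reported by the R parser should still correspond to the lines the user typed.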
"Joe Conway" <mail@joeconway.com> > Tom Lane wrote: >> Joe Conway <mail@joeconway.com> writes: >>> My first thought on fixing this issue was to simply replace all >>> instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the >>> R parser. As far as I know, any instances of '\r' embedded in a >>> syntactically valid R statement must be escaped (i.e. literally the >>> characters "\" and "r"), so that should not be a problem. But I am >>> concerned about how this potentially plays against multibyte characters. >>> Is it safe to do this, or do I need to use a mb-aware replace algorithm? >> >> It's safe, because you'll be dealing with prosrc inside the backend, >> therefore using a backend-legal encoding, and those don't have any ASCII >> aliasing problems (all bytes of an MB character must have high bit set). The lower byte of some characters in BIG5, GBK, GB18030 may be less than 0x7F and don't have the high bit set. Fortunately, they don't use 0x0D and 0x0A (CR and LF). Regards, William ZHANG > Great -- I wasn't sure about that. >
William ZHANG wrote:
>>>
>>> It's safe, because you'll be dealing with prosrc inside the backend,
>>> therefore using a backend-legal encoding, and those don't have any ASCII
>>> aliasing problems (all bytes of an MB character must have high bit set).
>>>
>
> The lower byte of some characters in BIG5, GBK, GB18030 may be less than
> 0x7F and don't have the high bit set. Fortunately, they don't use 0x0D and
> 0x0A (CR and LF).
>
>

Those are client-only encodings, precisely for this sort of reason, and thus not relevant to the present discussion. As Tom points out above, when the language handler gets the code it will be encoded in the relevant backend encoding, which can't be any of these.

(Side note: the restriction by the R parser to Unix-only line endings is a dreadful piece of design. As Jon Postel rightly said, the best rule is "Be liberal in what you accept and conservative in what you send." Just about every parser for every language has been able to handle this, so why must R be different?)

cheers

andrew