Обсуждение: substring start position behavior

Поиск
Список
Период
Сортировка

substring start position behavior

От
PG Doc comments form
Дата:
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/16/functions-string.html
Description:

Hey,

I was confused by substring behavior today, when giving 0 as a start
position. I understand now that string indices are 1-based, have a certain
flexibility about where to start (allowing negative start positions), and
that this is defined in the standard SQL spec.

I'm comfy with all this, but I think it'd be nice to have a hint in the pg
substring docs for nonpositive start positions, so that users don't have to
have paid for the standard SQL spec to get past this. To me, substring seems
like a relatively common function with relatively surprising behavior.

I'd be happy to write up a patch as a next step -- I'm not sure this would
count as a "trivial patch that serves an obvious need"
(https://wiki.postgresql.org/wiki/Submitting_a_Patch). Only thing is, I
can't seem to find the source repo for the docs.
https://www.postgresql.org/docs/current/docguide-toolsets.html mentions a
configure script, but I don't know which one.

Best,
Dan

Re: substring start position behavior

От
Jelte Fennema-Nio
Дата:
On Tue, 27 Feb 2024 at 19:01, PG Doc comments form
<noreply@postgresql.org> wrote:
> I'd be happy to write up a patch as a next step -- I'm not sure this would
> count as a "trivial patch that serves an obvious need"
> (https://wiki.postgresql.org/wiki/Submitting_a_Patch).

A patch that adds an example of a negative substring usage, seems useful to me.

> Only thing is, I
> can't seem to find the source repo for the docs.

The sources of the docs are simply in the regular postgres docs:
https://github.com/postgres/postgres/blob/e1724af42c7b0bddad2785f8f7a15c4c123862b2/doc/src/sgml/func.sgml#L2786-L2796

> https://www.postgresql.org/docs/current/docguide-toolsets.html mentions a
> configure script, but I don't know which one.

Also meson+ninja is these days faster and easier to use than the
configure script:

https://www.postgresql.org/docs/current/install-meson.html

After setting up the build directory with meson you can build the docs using
ninja -C build docs



Re: substring start position behavior

От
Bruce Momjian
Дата:
On Tue, Feb 27, 2024 at 05:20:23PM +0000, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/16/functions-string.html
> Description:
> 
> Hey,
> 
> I was confused by substring behavior today, when giving 0 as a start
> position. I understand now that string indices are 1-based, have a certain
> flexibility about where to start (allowing negative start positions), and
> that this is defined in the standard SQL spec.
> 
> I'm comfy with all this, but I think it'd be nice to have a hint in the pg
> substring docs for nonpositive start positions, so that users don't have to
> have paid for the standard SQL spec to get past this. To me, substring seems
> like a relatively common function with relatively surprising behavior.

I dug into this and quickly became as confused as you were.  The best
explanation I found of the current behavior is here (with diagram):

    https://www.mssqltips.com/sqlservertutorial/9374/sql-substring-function/
    SELECT SUBSTRING('Hello world',-2,5) as msg

The last Postgres community discussion of this behavior I could find was
from 2007:

    https://www.postgresql.org/message-id/flat/12803.1168804636%40sss.pgh.pa.us#8316fb2298c9e49f77867a1ae2ead447

This web page explains the feature:

    https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value

but also asks:

    now the only question that remains is, "why would anyone need it
    to behave this way?"

and the answer given is:

    @mao47 Well, it depends. I am not an author of implementation of
    SUBSTR but I guess with negative index it behaves like LEFT(string,
    LEN(string) - 1 - index). It works the same way in PostgreSQL so maybe
    it is SQL standard.

Informix has substring() which matches the SQL standard, and substr()
which uses negative start from the end of the string:

    https://www.ibm.com/docs/en/informix-servers/14.10?topic=smf-substr-function

Oracle doesn't have substring(), just substr(), and matches Informix
behavior, I think.

I have developed the attached doc patch to document this.  The only
question is whether this substring behavior is so odd that we should not
document it.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Вложения

Re: substring start position behavior

От
dansonlinepresence@gmail.com
Дата:
Thanks for all that context and diff! The odd behavior is indeed in
the SQL spec. At least I was convinced of that on postgres IRC by
someone who seemed to have quoted from the spec. I don't think the
feature ought to be hidden because it's odd.

If I may suggest some specificity to the wording here, I think "start
values less than 1" would avoid confusion about whether 0 is
nonpositive or not, and bring attention to the function being
1-indexed rather than 0-indexed.

Sorry I haven't had time to grab the pg docs repo and make a diff
myself. I appreciate the slack being picked up (:

On Tue, Mar 5, 2024 at 7:17 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Feb 27, 2024 at 05:20:23PM +0000, PG Doc comments form wrote:
> > The following documentation comment has been logged on the website:
> >
> > Page: https://www.postgresql.org/docs/16/functions-string.html
> > Description:
> >
> > Hey,
> >
> > I was confused by substring behavior today, when giving 0 as a start
> > position. I understand now that string indices are 1-based, have a certain
> > flexibility about where to start (allowing negative start positions), and
> > that this is defined in the standard SQL spec.
> >
> > I'm comfy with all this, but I think it'd be nice to have a hint in the pg
> > substring docs for nonpositive start positions, so that users don't have to
> > have paid for the standard SQL spec to get past this. To me, substring seems
> > like a relatively common function with relatively surprising behavior.
>
> I dug into this and quickly became as confused as you were.  The best
> explanation I found of the current behavior is here (with diagram):
>
>         https://www.mssqltips.com/sqlservertutorial/9374/sql-substring-function/
>         SELECT SUBSTRING('Hello world',-2,5) as msg
>
> The last Postgres community discussion of this behavior I could find was
> from 2007:
>
>         https://www.postgresql.org/message-id/flat/12803.1168804636%40sss.pgh.pa.us#8316fb2298c9e49f77867a1ae2ead447
>
> This web page explains the feature:
>
>         https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value
>
> but also asks:
>
>         now the only question that remains is, "why would anyone need it
>         to behave this way?"
>
> and the answer given is:
>
>         @mao47 Well, it depends. I am not an author of implementation of
>         SUBSTR but I guess with negative index it behaves like LEFT(string,
>         LEN(string) - 1 - index). It works the same way in PostgreSQL so maybe
>         it is SQL standard.
>
> Informix has substring() which matches the SQL standard, and substr()
> which uses negative start from the end of the string:
>
>         https://www.ibm.com/docs/en/informix-servers/14.10?topic=smf-substr-function
>
> Oracle doesn't have substring(), just substr(), and matches Informix
> behavior, I think.
>
> I have developed the attached doc patch to document this.  The only
> question is whether this substring behavior is so odd that we should not
> document it.
>
> --
>   Bruce Momjian  <bruce@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
>
>   Only you can decide what is important to you.



Re: substring start position behavior

От
Bruce Momjian
Дата:
On Tue, Mar  5, 2024 at 09:47:54PM -0500, dansonlinepresence@gmail.com wrote:
> Thanks for all that context and diff! The odd behavior is indeed in
> the SQL spec. At least I was convinced of that on postgres IRC by
> someone who seemed to have quoted from the spec. I don't think the
> feature ought to be hidden because it's odd.
> 
> If I may suggest some specificity to the wording here, I think "start
> values less than 1" would avoid confusion about whether 0 is
> nonpositive or not, and bring attention to the function being
> 1-indexed rather than 0-indexed.

Sure, updated patch attached.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Вложения

Re: substring start position behavior

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> This web page explains the feature:
>     https://stackoverflow.com/questions/33462061/sql-server-substring-position-negative-value
> but also asks:
>     now the only question that remains is, "why would anyone need it
>     to behave this way?"

Yeah.  I believe our implementation adheres to the SQL spec, which
says this for <character substring function> (in SQL:2021 6.3.2):

    a) If the character encoding form of <character value expression>
    is UTF8, UTF16, or UTF32, then, in the remainder of this General
    Rule, the term “character” shall be taken to mean “unit specified
    by <char length units>”.

    b) Let C be the value of the <character value expression>, let LC
    be the length in characters of C, and let S be the value of the
    <start position>.

    c) If <string length> is specified, then let L be the value of
    <string length> and let E be S+L. Otherwise, let E be the larger
    of LC+1 and S.

    d) If at least one of C, S, and L is the null value, then the
    result of the <character substring function> is the null value.

    e) If E is less than S, then an exception condition is raised:
    data exception — substring error (22011).  [tgl note: given c),
    this happens if and only if a negative <string length> is provided.]

    f) Case:

      i) If S is greater than LC or if E is less than 1 (one), then
      the result of the <character substring function> is the
      zero-length character string.

      ii) Otherwise,

        1) Let S1 be the larger of S and 1 (one). Let E1 be the
        smaller of E and LC+1. Let L1 be E1–S1.

        2) The result of the <character substring function> is a
        character string containing the L1 characters of C starting at
        character number S1 in the same order that the characters
        appear in C.

That's a pretty sterling example of standards-ese that is both
unreadable and devoid of any justification.  But if you trace through
the possible effects of a negative S value, it looks like

  (1) if L >= 0 is specified and S+L (E) is less than one, the result
  is an empty string per rule f)i).

  (2) if L >= 0 is specified and S+L (E) is at least one but less than
  LC+1, then E is the substring end+1 position.

  (3) otherwise, a negative S is disregarded and replaced by 1 so
  far as the substring end calculation is concerned.

  (4) in any case, a negative S is disregarded and replaced by 1 so
  far as the substring start calculation is concerned.

I'm kind of inclined to not document this weirdness.  I especially
don't think it's worth giving an example that neither explains the
"disregarded" bit nor highlights the dependency on L being given.

            regards, tom lane



Re: substring start position behavior

От
Bruce Momjian
Дата:
On Wed, Mar  6, 2024 at 12:42:09AM -0500, Tom Lane wrote:
> I'm kind of inclined to not document this weirdness.  I especially
> don't think it's worth giving an example that neither explains the
> "disregarded" bit nor highlights the dependency on L being given.

I retract my doc patch for this.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.