Обсуждение: PATCH: CITEXT 2.0 v4

Поиск
Список
Период
Сортировка

PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
Howdy,

I've attached a new patch with the latest revisions of for the citext
contrib module patch. The changes include:

* Using strlen() to pass string lengths to the comparison function,
   since lowercasing the value can change the length. Per Tom Lane.
* Made citextcmp consistently return int32, per Tom Lane.
* Made the hash index function return the proper value, per Tom Lane.
* Removed the COMMENTs and GRANTs from citext.sql.in.
* Added a cast function from bpchar to citext, as suggested by Tom Lane.
* Set the storage type for CITEXT to "extended", to ensure that it will
   be toastable. Per Tom Lane.
* Fixed the COMMUTATOR of <=.
* Changed the cast from citext to bpchar from implicit to assignment.
   This eliminates ambiguous function resolutions.
* Eliminated superflous functions, per Tom Lane.
* Removed unnecessary `OPERATOR()` calls in NEGATORs and the like.
* Added binary in/out functions. Per Tom Lane
* Added an explicit shell type to make the output a bit quieter.
* Converted tests to pure SQL and omitted multibyte tests (though a
   few remain commented-out).
* Reorganized and expanded the documentation a bit.

This version is far better than I started with, and I'm very grateful
for the feedback.

Now, I have a few remaining questions to ask, mostly just to get your
opinions:

* The README for citext 1.0 on pgFoundry says:

> I had to make a decision on casting between types for regular
> expressions and
> decided that if any parameter is of citext type then case
> insensitive applies.
> For example applying regular expressions with a varchar and a citext
> will
> produce a case-insensitive result.
>
> Having thought about this afterwards I realised that since we have
> the option
> to use case-insensitive results with regular expressions I should
> have left the
> behaviour exactly as text and then you have the best of both
> worlds... oh well
> not hard to change for any of you perfectionists!

I followed the original and made all the regex and LIKE comparisons
case-insensitive. But maybe I should not have? Especially since the
regular expression functions (e.g., regexp_replace()) and a few non-
regex functions (e.g., replace()) still don't behave case-insensitively?

* If the answer is "no", how can I make those functions behave case-
insensitively? (See the "TODO" tests.)

* Should there be any other casts? To and from name, perhaps?

Thanks!

David


Вложения

Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 15, 2008, at 22:23, David E. Wheeler wrote:

> * The README for citext 1.0 on pgFoundry says:
>
>> I had to make a decision on casting between types for regular  
>> expressions and
>> decided that if any parameter is of citext type then case  
>> insensitive applies.
>> For example applying regular expressions with a varchar and a  
>> citext will
>> produce a case-insensitive result.
>>
>> Having thought about this afterwards I realised that since we have  
>> the option
>> to use case-insensitive results with regular expressions I should  
>> have left the
>> behaviour exactly as text and then you have the best of both  
>> worlds... oh well
>> not hard to change for any of you perfectionists!
>
> I followed the original and made all the regex and LIKE comparisons  
> case-insensitive. But maybe I should not have? Especially since the  
> regular expression functions (e.g., regexp_replace()) and a few non- 
> regex functions (e.g., replace()) still don't behave case- 
> insensitively?

I was thinking about this a bit last night and wanted to fill things  
out a bit.

As a programmer, I find Donald Fraser's hindsight to be more  
appealing, because at least that way I have the option to do matching  
against CITEXT strings case-sensitively when I want to.

OTOH, if what we want is to have CITEXT work more like a case- 
insensitive collation, then the expectation from the matching  
operators and functions might be different. Does anyone have any idea  
whether regex and LIKE matching against a string in a case-insensitive  
collation would match case-insensitively or not? If so, then maybe the  
regex and LIKE ops and funcs *should* match case-insensitively? If  
not, or if only for some collations, then I would think not.

Either way, I know of no way, currently, to allow functions like  
replace(), split_part(), strpos(), and translate() to match case- 
insensitiely, even if we wanted to. Anyone have any ideas?

> * If the answer is "no", how can I make those functions behave case- 
> insensitively? (See the "TODO" tests.)
>
> * Should there be any other casts? To and from name, perhaps?

Thanks,

David



Re: PATCH: CITEXT 2.0 v4

От
Robert Treat
Дата:
On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote:
> On Jul 15, 2008, at 22:23, David E. Wheeler wrote:
> > * The README for citext 1.0 on pgFoundry says:
> >> I had to make a decision on casting between types for regular
> >> expressions and
> >> decided that if any parameter is of citext type then case
> >> insensitive applies.
> >> For example applying regular expressions with a varchar and a
> >> citext will
> >> produce a case-insensitive result.
> >>
> >> Having thought about this afterwards I realised that since we have
> >> the option
> >> to use case-insensitive results with regular expressions I should
> >> have left the
> >> behaviour exactly as text and then you have the best of both
> >> worlds... oh well
> >> not hard to change for any of you perfectionists!
> >
> > I followed the original and made all the regex and LIKE comparisons
> > case-insensitive. But maybe I should not have? Especially since the
> > regular expression functions (e.g., regexp_replace()) and a few non-
> > regex functions (e.g., replace()) still don't behave case-
> > insensitively?
>
> I was thinking about this a bit last night and wanted to fill things
> out a bit.
>
> As a programmer, I find Donald Fraser's hindsight to be more
> appealing, because at least that way I have the option to do matching
> against CITEXT strings case-sensitively when I want to.
>
> OTOH, if what we want is to have CITEXT work more like a case-
> insensitive collation, then the expectation from the matching
> operators and functions might be different. Does anyone have any idea
> whether regex and LIKE matching against a string in a case-insensitive
> collation would match case-insensitively or not? If so, then maybe the
> regex and LIKE ops and funcs *should* match case-insensitively? If
> not, or if only for some collations, then I would think not.
>
> Either way, I know of no way, currently, to allow functions like
> replace(), split_part(), strpos(), and translate() to match case-
> insensitiely, even if we wanted to. Anyone have any ideas?
>
> > * If the answer is "no", how can I make those functions behave case-
> > insensitively? (See the "TODO" tests.)
> >
> > * Should there be any other casts? To and from name, perhaps?
>

AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;?column?
----------t
(1 row)

select 'x'::citext ~ 'X'::citext;?column?
----------f
(1 row)

I understand the desire for flexibility, but the above seems wierd to me. 

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 16, 2008, at 11:20, Robert Treat wrote:

>> I was thinking about this a bit last night and wanted to fill things
>> out a bit.
>>
>> As a programmer, I find Donald Fraser's hindsight to be more
>> appealing, because at least that way I have the option to do matching
>> against CITEXT strings case-sensitively when I want to.
>>
>> OTOH, if what we want is to have CITEXT work more like a case-
>> insensitive collation, then the expectation from the matching
>> operators and functions might be different. Does anyone have any idea
>> whether regex and LIKE matching against a string in a case- 
>> insensitive
>> collation would match case-insensitively or not? If so, then maybe  
>> the
>> regex and LIKE ops and funcs *should* match case-insensitively? If
>> not, or if only for some collations, then I would think not.
>>
>> Either way, I know of no way, currently, to allow functions like
>> replace(), split_part(), strpos(), and translate() to match case-
>> insensitiely, even if we wanted to. Anyone have any ideas?
>>
>>> * If the answer is "no", how can I make those functions behave case-
>>> insensitively? (See the "TODO" tests.)
>>>
>>> * Should there be any other casts? To and from name, perhaps?
>>
>
> AIUI, your propsing the following:
>
> select 'x'::citext = 'X'::citext;
> ?column?
> ----------
> t
> (1 row)
>
> select 'x'::citext ~ 'X'::citext;
> ?column?
> ----------
> f
> (1 row)
>
> I understand the desire for flexibility, but the above seems wierd  
> to me.

That's what Donald Fraser suggested, and I see some value in that, but  
wanted to get some other opinions. And you're right, that does seem a  
bit weird.

The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o'); regexp_replace
---------------- fxx
(1 row)

So there's an inconsistency there. I don't know how to make that work  
case-insensitively.

Best,

David


Re: PATCH: CITEXT 2.0 v4

От
Michael Paesold
Дата:
Am 16.07.2008 um 20:38 schrieb David E. Wheeler:
>
> The trouble is that, right now:
>
> template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
> regexp_replace
> ----------------
> fxx
> (1 row)
>
> So there's an inconsistency there. I don't know how to make that  
> work case-insensitively.

Wouldn't it be possible to create a variant of regexp_replace, i.e.  
regexp_replace(citext,citext,text), which would again lower-case the  
first two arguments before passing the input to  
regexp_replace(text,text,text)?

Best Regards
Michael Paesold


Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 17, 2008, at 03:45, Michael Paesold wrote:

> Wouldn't it be possible to create a variant of regexp_replace, i.e.  
> regexp_replace(citext,citext,text), which would again lower-case the  
> first two arguments before passing the input to  
> regexp_replace(text,text,text)?

Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace
----------------
foo
(1 row)

Which is just wrong. I'm going to look at the regex C functions today  
and see if there's an easy way to just always pass them the 'i' flag,  
which would do the trick. That still won't help replace(),  
split_part(), or translate(), however.

Best,

David


Re: PATCH: CITEXT 2.0 v4

От
Michael Paesold
Дата:
David E. Wheeler writes:

> On Jul 17, 2008, at 03:45, Michael Paesold wrote:
>
>> Wouldn't it be possible to create a variant of regexp_replace, i.e.  
>> regexp_replace(citext,citext,text), which would again lower-case  
>> the first two arguments before passing the input to  
>> regexp_replace(text,text,text)?
>
> Sure, but then you end up with this:
>
> template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
> regexp_replace
> ----------------
> foo
> (1 row)

Yeah, you are right, I see. :-)


> Which is just wrong. I'm going to look at the regex C functions  
> today and see if there's an easy way to just always pass them the  
> 'i' flag, which would do the trick. That still won't help replace(),  
> split_part(), or translate(), however.

Calling regex functions with the case-insensitivity option would be  
great. It should also be possible to rewrite replace() into  
regexp_replace() by first escaping the regex meta characters.

Actually re-implementing those functions in a case insensitive way  
would still be an option, but of course some amount of work. The  
question is, how much use case there is.

Best Regards
Michael Paesold


Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 18, 2008, at 01:39, Michael Paesold wrote:

> Calling regex functions with the case-insensitivity option would be  
> great. It should also be possible to rewrite replace() into  
> regexp_replace() by first escaping the regex meta characters.
>
> Actually re-implementing those functions in a case insensitive way  
> would still be an option, but of course some amount of work. The  
> question is, how much use case there is.

Not much for me. I might use the regex functions, but would be happy  
to manually pass the "i" flag.

However, if someone with a lot more C and Pg core knowledge wanted to  
sit down with me for a couple hours next week and help me bang out  
these functions, that would be great. I'd love to have the  
implementation be that much more complete.

I do believe that, as it stands now in the v4 patch, citext is pretty  
close to ready, and certainly commit-able.

Thanks,

David


Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 18, 2008, at 09:53, David E. Wheeler wrote:

> However, if someone with a lot more C and Pg core knowledge wanted  
> to sit down with me for a couple hours next week and help me bang  
> out these functions, that would be great. I'd love to have the  
> implementation be that much more complete.

I've implemented fixes for the regexp_* functions and strpos() in pure  
SQL, like so:

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS  
TEXT[] AS '    SELECT regexp_matches( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text )  
RETURNS TEXT[] AS '    SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3,  
''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text )  
returns TEXT AS '    SELECT regexp_replace( $1::text, $2::text, $3, ''i'');
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text,  
text ) returns TEXT AS '    SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN  
strpos($4, ''c'') = 0 THEN  $4 || ''i'' ELSE $4 END);
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext )  
RETURNS TEXT[] AS '    SELECT regexp_split_to_array( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext,  
text ) RETURNS TEXT[] AS '    SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext )  
RETURNS SETOF TEXT AS '    SELECT regexp_split_to_table( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext,  
text ) RETURNS SETOF TEXT AS '    SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN  
strpos($3, ''c'') = 0 THEN  $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS '    SELECT strpos( LOWER( $1::text ), LOWER(
$2::text) );
 
' LANGUAGE SQL IMMUTABLE STRICT;

Not so bad, though it'd be nice to have C functions that just did  
these things. Still not case-insensitive are:

-- replace()
-- split_part()
-- translate()

So, anyone at OSCON this week want to help me with these? Or to  
convert the above functions to C? Greg? Bruce?

Thanks,

David


Re: PATCH: CITEXT 2.0 v4

От
"David E. Wheeler"
Дата:
On Jul 18, 2008, at 01:39, Michael Paesold wrote:

> Calling regex functions with the case-insensitivity option would be  
> great. It should also be possible to rewrite replace() into  
> regexp_replace() by first escaping the regex meta characters.
>
> Actually re-implementing those functions in a case insensitive way  
> would still be an option, but of course some amount of work. The  
> question is, how much use case there is.

I've figured out how to make all the functions work using SQL function  
workarounds, converting things and re-dispatching to the text versions  
as appropriate. They work quite well, and can be converted to C later  
if that becomes a requirement.

Meanwhile, on the question of whether or not regular expression and  
LIKE comparisons *should* match case-insensitively, I have a couple  
more observations:

* Thinking about how a true case-insensitive collation would work, I'm  
quite certain that it would match case-insensitively. Anything else  
would just be unexpected, because in a case-insensitive collation,  
lowercase characters are, in practice, identical to uppercase  
characters. As far as matching is concerned, there is no difference  
between them. So the matching operators and functions against CITEXT  
should follow that assumption.

* I tried a few matches on MySQL, where the collation is case- 
insensitive by default, and it confirms my impression:

mysql> select 'Foo' regexp 'o$';
+-------------------+
| 'Foo' regexp 'o$' |
+-------------------+
|                 1 |
+-------------------+
1 row in set (0.00 sec)

mysql> select 'Foo' regexp 'O$';
+-------------------+
| 'Foo' regexp 'O$' |
+-------------------+
|                 1 |
+-------------------+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%o';
+-----------------+
| 'Foo' like '%o' |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%O';
+-----------------+
| 'Foo' like '%O' |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

I'll grant that MySQL may not be the best model for how things should  
work, but it's something, at least. Anyone else got access to another  
database with case-insensitive collations to see how LIKE and regular  
expressions work?

Thanks,

David