Обсуждение: BUG #8821: pg_trgm segfault with Turkish locale database

Поиск
Список
Период
Сортировка

BUG #8821: pg_trgm segfault with Turkish locale database

От
ufuk@paralaus.com
Дата:
The following bug has been logged on the website:

Bug reference:      8821
Logged by:          Ufuk Kayserilioglu
Email address:      ufuk@paralaus.com
PostgreSQL version: 9.3.2
Operating system:   Linux - Ubuntu x86_64
Description:

PLATFORM:
--------

PostgreSQL 9.3.2 installed from precise-pgdg apt repository on Ubuntu
12.04.3 LTS and 13.10 x86_64

SETUP:
-----

Given a database with encoding UTF-8, locale tr_TR.UTF-8, 'pg_trgm'
extension enabled, and the following setup:

CREATE TABLE car_makers (maker TEXT);
INSERT INTO car_makers VALUES ('AUDI'), ('MINI');

Run any one of the following commands:

- SELECT maker <-> 'MAZDA' FROM car_makers;
- SELECT similarity(maker, 'MAZDA') FROM car_makers;
- SELECT show_trgm('III');

ACTUAL OUTCOME:
--------------

The database will crash and the log will output something along the lines
of:

LOG:  server process (PID 24941) was terminated by signal 11: Segmentation
fault
DETAIL:  Failed process was running: SELECT maker <-> 'MAZDA' FROM
car_makers;
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing

EXPECTED OUTCOME:
----------------

The commands return a result and the database process stays up.

ANALYSIS:
--------

I initially ran into this issue on a table with more records and have
narrowed the case down to rows that have 4 letter words that contain the
uppercase I character (that's why only 'AUDI' and 'MINI' in the example). I
am not sure why comparison with 'MAZDA' triggers the crash (for example, a
comparison with 'MAZDA ' does not), but it seems other 5 char strings also
trigger it (eg. 'ABCDE' or '12345'). Also, there are other, longer
comparison strings like 'RENAULT-OYAK ' that also trigger it.

As I was trying to understand what exactly causes the problem, I tried to
understand the trigrams generated by the values in the table and various
test strings. I happened on the other crash during that experimentation.

As a result, it seems there is a problem with trigrams generation and/or
comparison when the Turkish locale is being used.

The issue ultimately seems to be linked to the fact that the lowercase
equivalent of "I" is "ı" (i without the dot), and the trigram codes does the
proper conversion to lowercase which turns the string into a multibyte
string.

The same problem does NOT exist on Mac OSX 10.9.1 with PostgreSQL installed
via Postgres.app. However, OSX has broken tr_TR locale support to begin with
(for example, sorting does not work properly.)

WORKAROUND:
----------

I was not able to find a reliable workaround since the code that does the
lowercase conversion seems to only take into consideration the locale of the
working database (which in my case needs to be tr_TR). Specifying an 'en_US'
collation on the 'maker' column or an explicit collation specifier in the
query do not work.

Re: BUG #8821: pg_trgm segfault with Turkish locale database

От
Tom Lane
Дата:
ufuk@paralaus.com writes:
> Given a database with encoding UTF-8, locale tr_TR.UTF-8, 'pg_trgm'
> extension enabled, and the following setup:

> CREATE TABLE car_makers (maker TEXT);
> INSERT INTO car_makers VALUES ('AUDI'), ('MINI');

> Run any one of the following commands:

> - SELECT maker <-> 'MAZDA' FROM car_makers;
> - SELECT similarity(maker, 'MAZDA') FROM car_makers;
> - SELECT show_trgm('III');

> As a result, it seems there is a problem with trigrams generation and/or
> comparison when the Turkish locale is being used.

It looks like generate_trgm() is not considering the possibility that
case-folding will make the string physically longer, so you get a buffer
overrun when any of these I-containing strings are converted to trigrams.
Will fix, thanks for the report!

            regards, tom lane

Re: BUG #8821: pg_trgm segfault with Turkish locale database

От
Ufuk Kayserilioglu
Дата:
Hi Tom,

Thanks for the quick response. Just a quick followup note though; running=
:

SELECT show=5Ftrgm(maker) =46ROM car=5Fmakers;

behaves properly, but trying to make a similarity comparison triggers the=
 segfault (at least in my case). That's why I suspect there may be 2 rela=
ted bugs, one related trigram generation and, maybe, one related to trigr=
am comparison. Hope this info will help you as well.

Regards,

--
Ufuk Kayserilioglu

=46rom:=C2=A0Tom Lane Tom Lane
Reply:=C2=A0Tom Lane tgl=40sss.pgh.pa.us
Date:=C2=A013 January 2014 at 18:09:46
To:=C2=A0ufuk=40paralaus.com ufuk=40paralaus.com
Subject:=C2=A0 Re: =5BBUGS=5D BUG =238821: pg=5Ftrgm segfault with Turkis=
h locale database =20
ufuk=40paralaus.com writes: =20
> Given a database with encoding UT=46-8, locale tr=5FTR.UT=46-8, 'pg=5Ft=
rgm' =20
> extension enabled, and the following setup: =20

> CREATE TABLE car=5Fmakers (maker TEXT); =20
> INSERT INTO car=5Fmakers VALUES ('AUDI'), ('MINI'); =20

> Run any one of the following commands: =20

> - SELECT maker <-> 'MAZDA' =46ROM car=5Fmakers; =20
> - SELECT similarity(maker, 'MAZDA') =46ROM car=5Fmakers; =20
> - SELECT show=5Ftrgm('III'); =20

> As a result, it seems there is a problem with trigrams generation and/o=
r =20
> comparison when the Turkish locale is being used. =20

It looks like generate=5Ftrgm() is not considering the possibility that =20
case-folding will make the string physically longer, so you get a buffer =
=20
overrun when any of these I-containing strings are converted to trigrams.=
 =20
Will fix, thanks for the report=21 =20

regards, tom lane =20

Re: BUG #8821: pg_trgm segfault with Turkish locale database

От
Tom Lane
Дата:
Ufuk Kayserilioglu <ufuk@paralaus.com> writes:
> Thanks for the quick response. Just a quick followup note though; running:

> SELECT show_trgm(maker) FROM car_makers;

> behaves properly, but trying to make a similarity comparison triggers the segfault (at least in my case). That's why
Isuspect there may be 2 related bugs, one related trigram generation and, maybe, one related to trigram comparison.
Hopethis info will help you as well. 

The given cases all work for me with the committed patch to generate_trgm(),
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c3ccc9ee584b9b015dd9c1931e261e21f3961e5f

I'm not going to sit here and claim that there are now zero bugs in
pg_trgm, but I'm not seeing evidence of a second issue.  Since the bug is
a buffer overrun of just a few bytes, whether it results in a crash in any
particular trigram operation is hard to predict; it might just stomp on
noncritical memory.

            regards, tom lane

Re: BUG #8821: pg_trgm segfault with Turkish locale database

От
Ufuk Kayserilioglu
Дата:
Hi Tom,

Thank you so much for the quick turnaround. Indeed, I just built the late=
st version from Git and tested it on one of the original test machines an=
d all cases are working perfectly.

Regards,

--
Ufuk Kayserilioglu

=46rom:=C2=A0Tom Lane Tom Lane
Reply:=C2=A0Tom Lane tgl=40sss.pgh.pa.us
Date:=C2=A013 January 2014 at 21:24:31
To:=C2=A0Ufuk Kayserilioglu ufuk=40paralaus.com
Subject:=C2=A0 Re: =5BBUGS=5D BUG =238821: pg=5Ftrgm segfault with Turkis=
h locale database =20
Ufuk Kayserilioglu <ufuk=40paralaus.com> writes: =20
> Thanks for the quick response. Just a quick followup note though; runni=
ng: =20

> SELECT show=5Ftrgm(maker) =46ROM car=5Fmakers; =20

> behaves properly, but trying to make a similarity comparison triggers t=
he segfault (at least in my case). That's why I suspect there may be 2 re=
lated bugs, one related trigram generation and, maybe, one related to tri=
gram comparison. Hope this info will help you as well. =20

The given cases all work for me with the committed patch to generate=5Ftr=
gm(), =20
http://git.postgresql.org/gitweb/=3Fp=3Dpostgresql.git;a=3Dcommitdiff;h=3D=
c3ccc9ee584b9b015dd9c1931e261e21f3961e5f =20

I'm not going to sit here and claim that there are now zero bugs in =20
pg=5Ftrgm, but I'm not seeing evidence of a second issue. Since the bug i=
s =20
a buffer overrun of just a few bytes, whether it results in a crash in an=
y =20
particular trigram operation is hard to predict; it might just stomp on =20
noncritical memory. =20

regards, tom lane =20