On Thu, 28 Dec 2023 at 17:46, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec 28, 2023 at 10:15:07AM -0500, aa wrote:
> > Hello Postgres Team!
> >
> > First of all, a big THANK YOU for the great work you folks are doing!
> >
> > The reason I am writing to you is to suggest a feature in future Postgres
> > versions, a feature that is partially there but is not quite where it should be
> > in my opinion: the full text search functionality. This functionality in my
> > opinion, should be available out of the box, for any possible language
> > available, including east Asia character based languages. You would probably
> > say that this will require a huge amount of work, and I would say, a postgres
> > extension which does exactly this, already exists, and it is called : pgroonga
> > (https://pgroonga.github.io/)
>
> Please explain how this is different from what we already have:
>
> https://www.postgresql.org/docs/current/textsearch.html

I'm not familiar with pgroonga, but the main issue with the built-in text
search is that it cannot properly tokenize Asian and many other
languages.

For example, the default parser returns a whole Japanese sentence as a
single token:

=# select * from ts_parse('default', 'これはペンです');
 tokid |     token
-------+----------------
     2 | これはペンです

Latin text, by contrast, is split into individual words:

=# select * from ts_parse('default', 'this is a pen');
 tokid | token
-------+-------
     1 | this
    12 |
     1 | is
    12 |
     1 | a
    12 |
     1 | pen
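
The tokid values can be decoded with ts_token_type(); for the default
parser, 1 is asciiword, 2 is word and 12 is blank. As a rough sketch
(output omitted), a query like this shows the type name next to each
token:

=# select t.tokid, tt.alias, t.token
   from ts_parse('default', 'これはペンです') as t
   join ts_token_type('default') as tt using (tokid);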

To add support for Japanese (and other languages), it is necessary to
either write a new parser or fix the existing default parser.
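
For reference, a parser is plugged in with CREATE TEXT SEARCH PARSER, and
its support functions have to be written in C. A rough sketch that only
reuses the default parser's own functions to show where a new parser would
hook in (parser_sketch and config_sketch are made-up names, and creating a
parser requires superuser):

=# create text search parser parser_sketch (
     start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end,
     lextypes = prsd_lextype, headline = prsd_headline);
=# create text search configuration config_sketch (parser = parser_sketch);

A Japanese parser would supply its own start/gettoken/end/lextypes
functions in place of the prsd_* ones.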

On the other hand, pgroonga's source code looks complex, and I doubt
that there are pgsql-hackers who know both it and the target languages
well enough to port it to Postgres core.
--
Artur