Обсуждение: Text search language field

Поиск
Список
Период
Сортировка

Text search language field

От
Daniel Staal
Дата:
I'm working on project with some large text areas that will need to be
searched, and I'm trying to set up an indexed text search field to make
things a bit smoother.  Only one of the top requirements is that this has
to be multi-lingual, so I should be using the correct language's text
search.  The table stores the language the text is in, so in theory this
should be easy...  But it seems to be a bit more subtle than it first
appears.  Any ideas on the best way to set this up?

Tables:

Resource:
    "text_element"    text,
    "text_search"    tsvector,
    "language"        char varying(3) references "languages"

languages  (This is 'static' list of ISO 639-3 codes, to make sure
everything uses the same codes):
    "code" character varying(3) primary key,
    "description" character varying(100)

Trigger (This is my first draft):
create trigger "textsearch_trig" before update or insert
   on "resource" for each row execute procedure
   tsvector_update_trigger_column("text_search", "language",
"text_element");

The trigger currently throws an error 'column "language" is not of
regconfig type' whenever I try to use it, and I'm looking for the best way
to solve that.  The 'simple' solution of 'create text search configuration
eng ( copy = pg_catalog.english );' didn't work.  Before I start into a
long exploration into why and what should be done about it, I thought I'd
seek wisdom.  ;)

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: Text search language field

От
Daniel Staal
Дата:
--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said:

> I can't comment on how to implement the functionality you want, but I
> have a few comments you may (or may not!) find useful.

This of course was not the *complete* specs of the tables.  I didn't feel
the need to post pages of SQL for a simple question.  ;)  I just put in the
relevant fields.

>   • Field names need not be in double quotes.

I know.  But I prefer to quote everything always, both to help them stand
out in the text, and to make sure I don't have any case issues.

>   • if a character field is Always 3 characters, then say so

It currently is always 3 characters, but I don't want to be to dogmatic
about it, in case that needs to change in the future.  (ISO codes have done
that in the past, and I may decide a different/additional set of codes is
needed at some point in the future.)  Besides, there is no performance
benefit in Postgres.  (The opposite, actually...)

>   • add NOT NULL where appropriate (you may well decide more fields
> aught to be NOT NULL)

Generally I have, though not always I'll admit.  'Primary key' implies it,
and that was the only one that actually has that restriction in the
selection of fields I showed.

>   • keeping primary and foreign key fields separate from user visible
> data.

I disagree strongly here.  ;)  Primary keys should be whatever is suitable
for the data; creating an artificial primary key (when not needed) is just
storing more data and doing more work.  It also gives a feeling of 'safety'
which doesn't actually exist, as you can then create data that is invalid,
but that fits because your table design allows it.  In particular, the
language table keeping the the 'code' unique and distinct is the *entire
point* of the table, so there is no good reason to use anything else as the
primary key.  (I do have a generated ID in the resource table, though it's
a much more complex generation than a simple serial.  Again, I didn't feel
the need to show it.)

This does mean thinking through your future use-cases a bit more at the
initial design stage, but if I wasn't willing to do that I wouldn't be
asking this question in the first place.

(Actually, the languages table is nearly exclusively used inside the
database, so you could claim it *is* separate from user visible data.  I
could almost use an enum there instead, but having a linked description
available might be useful on occasion.  I also thought that knowing it
exists might be useful in solving my problem: One possible issue is that
text is not the correct data type, and having a mapping table available
might be useful.)

>   • identifying primary and foreign key fields clearly

Agreed, although I only showed one of each.  ;)

>   • suggest table names be singular (my convention, not universally
> adopted!)

I tend to use plural or singular depending on how they will be used:
'resource' will tend to be used one at a time, while 'languages' will tend
to be used as a reference list, and therefore as an aggregate.  A more
defined naming scheme might be useful I'll admit.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: Text search language field

От
Gavin Flower
Дата:
Hi Daniel,

I can't comment on how to implement the functionality you want, but I have a few comments you may (or may not!) find useful.

  1. Field names need not be in double quotes.
  2. if a character field is Always 3 characters, then say so
  3. add NOT NULL where appropriate (you may well decide more fields aught to be NOT NULL)
  4. keeping primary and foreign key fields separate from user visible data.
  5. identifying primary and foreign key fields clearly
  6. suggest table names be singular (my convention, not universally adaopted!)
The following SQL is syntactically correct (at least psql did not object to it...)

CREATE TABLE language
(
    id              SERIAL PRIMARY KEY,
    code            char(3) NOT NULL,
    description     text
);
  
CREATE TABLE resource
(
    id              SERIAL PRIMARY KEY,
    text_element    text
NOT NULL,
    text_search     tsvector,
    language_id     int REFERENCES language(id)
NOT NULL
); 



On 13/05/12 12:08, Daniel Staal wrote:

I'm working on project with some large text areas that will need to be searched, and I'm trying to set up an indexed text search field to make things a bit smoother.  Only one of the top requirements is that this has to be multi-lingual, so I should be using the correct language's text search.  The table stores the language the text is in, so in theory this should be easy...  But it seems to be a bit more subtle than it first appears.  Any ideas on the best way to set this up?

Tables:

Resource:
   "text_element"    text,
   "text_search"    tsvector,
   "language"        char varying(3) references "languages"

languages  (This is 'static' list of ISO 639-3 codes, to make sure everything uses the same codes):
   "code" character varying(3) primary key,
   "description" character varying(100)

Trigger (This is my first draft):
create trigger "textsearch_trig" before update or insert
  on "resource" for each row execute procedure
  tsvector_update_trigger_column("text_search", "language", "text_element");

The trigger currently throws an error 'column "language" is not of regconfig type' whenever I try to use it, and I'm looking for the best way to solve that.  The 'simple' solution of 'create text search configuration eng ( copy = pg_catalog.english );' didn't work.  Before I start into a long exploration into why and what should be done about it, I thought I'd seek wisdom.  ;)

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------


Re: Text search language field

От
Gavin Flower
Дата:
On 13/05/12 14:56, Daniel Staal wrote:
--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said:

I can't comment on how to implement the functionality you want, but I
have a few comments you may (or may not!) find useful.

This of course was not the *complete* specs of the tables.  I didn't feel the need to post pages of SQL for a simple question.  ;)  I just put in the relevant fields.

  • Field names need not be in double quotes.

I know.  But I prefer to quote everything always, both to help them stand out in the text, and to make sure I don't have any case issues.

  • if a character field is Always 3 characters, then say so

It currently is always 3 characters, but I don't want to be to dogmatic about it, in case that needs to change in the future.  (ISO codes have done that in the past, and I may decide a different/additional set of codes is needed at some point in the future.)  Besides, there is no performance benefit in Postgres.  (The opposite, actually...)

  • add NOT NULL where appropriate (you may well decide more fields
aught to be NOT NULL)

Generally I have, though not always I'll admit.  'Primary key' implies it, and that was the only one that actually has that restriction in the selection of fields I showed.
I am amazed at the number of times I see people specifying 'NOT NULL' and PRIMARY KEY' for the same field!  Mind you, these same people could probably justifiable laugh at the daft things I do!  :-)


  • keeping primary and foreign key fields separate from user visible
data.

I disagree strongly here.  ;)  Primary keys should be whatever is suitable for the data; creating an artificial primary key (when not needed) is just storing more data and doing more work.  It also gives a feeling of 'safety' which doesn't actually exist, as you can then create data that is invalid, but that fits because your table design allows it.  In particular, the language table keeping the the 'code' unique and distinct is the *entire point* of the table, so there is no good reason to use anything else as the primary key.  (I do have a generated ID in the resource table, though it's a much more complex generation than a simple serial.  Again, I didn't feel the need to show it.)

Actually, If I had thought about it a bit more, and something I would (or should!) do if I was designing the table for real, would be to use an UNIQUE qualifier for the code field.

Unless there are performance and/or data storage, or some such constraints - I prefer to linking tables with user visible things  One production database I worked on had 5 tables in a chain of parent/child relationships, and each child primary key was a concatenation of a character field with the fields comprising the primary key of its parent – could be over 45 bytes in characters in length!

The current database I am designing is very complicated, but likely never to have more than a few thousand records in any table, and is likely to have many more reads (with some quite complicated queries) than writes. So I focus on trying to work in a very standardizing way, without having to worry over much about performance. Knowing my luck, my next project will be the exact opposite!


This does mean thinking through your future use-cases a bit more at the initial design stage, but if I wasn't willing to do that I wouldn't be asking this question in the first place.

(Actually, the languages table is nearly exclusively used inside the database, so you could claim it *is* separate from user visible data.  I could almost use an enum there instead, but having a linked description available might be useful on occasion.  I also thought that knowing it exists might be useful in solving my problem: One possible issue is that text is not the correct data type, and having a mapping table available might be useful.)
I would like to use ENUMs in postgres, but there appear to be problems f they need to be updated.


  • identifying primary and foreign key fields clearly

Agreed, although I only showed one of each.  ;)

  • suggest table names be singular (my convention, not universally
adopted!)

I tend to use plural or singular depending on how they will be used: 'resource' will tend to be used one at a time, while 'languages' will tend to be used as a reference list, and therefore as an aggregate.  A more defined naming scheme might be useful I'll admit.
Horses for courses.

I remember many years ago, that there was a big argument about systematic as distinct from meaningful names. I was programming in COBOL (names could be up to 30 characters long), I thought the argument was silly, as it depends... In fact in one COBOL program I adopted both approaches, as short systematic names are better for use in a set complicated numerical expressions and only in a self contained stretch of code, and meaningful names for variables used throughout a program.

For one stored procedure (TransactSQL) of over 3000 lines, accessing 15 tables using 7 temporary tables – I was very carefully in the systematic use of suffixes and consistent use of names.


Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Never too sure what other people know, being helpful can run the risk of seemingly be patronising!  I remember in one job I was given a task whee for part of it I was effectivly at the level of a trainess, and for other parts i had greater experience - a little unsettling!


Cheers,

P.S. about top posting before, I got a bit distracted by a work related call.

Re: Text search language field

От
Gavin Flower
Дата:
On 13/05/12 15:37, Gavin Flower wrote:
On 13/05/12 14:56, Daniel Staal wrote:
--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said:

I can't comment on how to implement the functionality you want, but I
have a few comments you may (or may not!) find useful.

This of course was not the *complete* specs of the tables.  I didn't feel the need to post pages of SQL for a simple question.  ;)  I just put in the relevant fields.

  • Field names need not be in double quotes.

I know.  But I prefer to quote everything always, both to help them stand out in the text, and to make sure I don't have any case issues.

  • if a character field is Always 3 characters, then say so

It currently is always 3 characters, but I don't want to be to dogmatic about it, in case that needs to change in the future.  (ISO codes have done that in the past, and I may decide a different/additional set of codes is needed at some point in the future.)  Besides, there is no performance benefit in Postgres.  (The opposite, actually...)

  • add NOT NULL where appropriate (you may well decide more fields
aught to be NOT NULL)

Generally I have, though not always I'll admit.  'Primary key' implies it, and that was the only one that actually has that restriction in the selection of fields I showed.
I am amazed at the number of times I see people specifying 'NOT NULL' and PRIMARY KEY' for the same field!  Mind you, these same people could probably justifiable laugh at the daft things I do!  :-)


  • keeping primary and foreign key fields separate from user visible
data.

I disagree strongly here.  ;)  Primary keys should be whatever is suitable for the data; creating an artificial primary key (when not needed) is just storing more data and doing more work.  It also gives a feeling of 'safety' which doesn't actually exist, as you can then create data that is invalid, but that fits because your table design allows it.  In particular, the language table keeping the the 'code' unique and distinct is the *entire point* of the table, so there is no good reason to use anything else as the primary key.  (I do have a generated ID in the resource table, though it's a much more complex generation than a simple serial.  Again, I didn't feel the need to show it.)

Actually, If I had thought about it a bit more, and something I would (or should!) do if I was designing the table for real, would be to use an UNIQUE qualifier for the code field.

Unless there are performance and/or data storage, or some such constraints - I prefer to linking tables with user visible things  One production database I worked on had 5 tables in a chain of parent/child relationships, and each child primary key was a concatenation of a character field with the fields comprising the primary key of its parent – could be over 45 bytes in characters in length!

hmm... I meant >>> linking tables with _NON_ user visible things <<<
(probably obvious, but just in case...)
)

The current database I am designing is very complicated, but likely never to have more than a few thousand records in any table, and is likely to have many more reads (with some quite complicated queries) than writes. So I focus on trying to work in a very standardizing way, without having to worry over much about performance. Knowing my luck, my next project will be the exact opposite!


This does mean thinking through your future use-cases a bit more at the initial design stage, but if I wasn't willing to do that I wouldn't be asking this question in the first place.

(Actually, the languages table is nearly exclusively used inside the database, so you could claim it *is* separate from user visible data.  I could almost use an enum there instead, but having a linked description available might be useful on occasion.  I also thought that knowing it exists might be useful in solving my problem: One possible issue is that text is not the correct data type, and having a mapping table available might be useful.)
I would like to use ENUMs in postgres, but there appear to be problems f they need to be updated.


  • identifying primary and foreign key fields clearly

Agreed, although I only showed one of each.  ;)

  • suggest table names be singular (my convention, not universally
adopted!)

I tend to use plural or singular depending on how they will be used: 'resource' will tend to be used one at a time, while 'languages' will tend to be used as a reference list, and therefore as an aggregate.  A more defined naming scheme might be useful I'll admit.
Horses for courses.

I remember many years ago, that there was a big argument about systematic as distinct from meaningful names. I was programming in COBOL (names could be up to 30 characters long), I thought the argument was silly, as it depends... In fact in one COBOL program I adopted both approaches, as short systematic names are better for use in a set complicated numerical expressions and only in a self contained stretch of code, and meaningful names for variables used throughout a program.

For one stored procedure (TransactSQL) of over 3000 lines, accessing 15 tables using 7 temporary tables – I was very carefully in the systematic use of suffixes and consistent use of names.


Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Never too sure what other people know, being helpful can run the risk of seemingly be patronising!  I remember in one job I was given a task whee for part of it I was effectivly at the level of a trainess, and for other parts i had greater experience - a little unsettling!


Cheers,

P.S. about top posting before, I got a bit distracted by a work related call.


Re: Text search language field

От
Daniel Staal
Дата:
--As of May 13, 2012 3:37:22 PM +1200, Gavin Flower is alleged to have said:

>> Generally I have, though not always I'll admit.  'Primary key' implies
>> it, and that was the only one that actually has that restriction in the
>> selection of fields I showed.
>
> I am amazed at the number of times I see people specifying 'NOT NULL' and
> PRIMARY KEY' for the same field!  Mind you, these same people could
> probably justifiable laugh at the daft things I do!  :-)

I don't mind the belt-and-suspenders approach on this, if you want to use
it.  Sure, it's redundant, but it's also clearer and a decent reminder to
the human reading the SQL.

>>>   • keeping primary and foreign key fields separate from user
>>> visible  data.
>>
>> I disagree strongly here.  ;)  Primary keys should be whatever is
>> suitable for the data; creating an artificial primary key (when not
>> needed) is just storing more data and doing more work.  It also gives a
>> feeling of 'safety' which doesn't actually exist, as you can then create
>> data that is invalid, but that fits because your table design allows
>> it.  In particular, the language table keeping the the 'code' unique and
>> distinct is the *entire point* of the table, so there is no good reason
>> to use anything else as the primary key.  (I do have a generated ID in
>> the resource table, though it's a much more complex generation than a
>> simple serial.  Again, I didn't feel the need to show it.)
>
> Actually, If I had thought about it a bit more, and something I would (or
> should!) do if I was designing the table for real, would be to use an
> UNIQUE qualifier for the code field.
>
> Unless there are performance and/or data storage, or some such
> constraints - I prefer to linking tables with (non)user visible things 
One
> production database I worked on had 5 tables in a chain of parent/child
> relationships, and each child primary key was a concatenation of a
> character field with the fields comprising the primary key of its parent
> – could be over 45 bytes in characters in length!
>
> The current database I am designing is very complicated, but likely never
> to have more than a few thousand records in any table, and is likely to
> have many more reads (with some quite complicated queries) than writes.
> So I focus on trying to work in a very standardizing way, without having
> to worry over much about performance. Knowing my luck, my next project
> will be the exact opposite!

This project has the potential to be very performance-critical, so I'm
trying to operate on that assumption.  ;)

And again, I don't see the point of creating extra fields and data just to
enforce some artificial separation between 'user-visible' and 'database'
fields.  There's no performance benefit, there's a maintenance *penalty*
(in that your data is more complicated), and a programming penalty.
(Again: your data is more complicated.)

In this case, for example, doing it with a separate 'id' field would either
require a hash lookup in the application this database is being created to
support, or a separate lookup on nearly every write to the resources table
(AKA: The second-most common operation I'm expecting), just to get the
language code id.  The first has obvious maintenance problems, and they
both have a performance penalty.  And none of this has any benefit to
anyone, that I can see.  So why?

> I remember many years ago, that there was a big argument about systematic
> as distinct from meaningful names. I was programming in COBOL (names
> could be up to 30 characters long), I thought the argument was silly, as
> it depends... In fact in one COBOL program I adopted both approaches, as
> short systematic names are better for use in a set complicated numerical
> expressions and only in a self contained stretch of code, and meaningful
> names for variables used throughout a program.

The only time it's not silly is when it's being decided upon as a
'company-wide standard'.  Then it's tragic.  ;)  (Good guidelines are
invaluable.  Enforced standards, especially given by those who aren't doing
any of the actual work, are not.)

> Never too sure what other people know, being helpful can run the risk of
> seemingly be patronising!  I remember in one job I was given a task whee
> for part of it I was effectivly at the level of a trainess, and for other
> parts i had greater experience - a little unsettling!

That's been most of my jobs.  ;)

> P.S. about top posting before, I got a bit distracted by a work related
> call.

No problem.  Now, if I could only get you to respect the 'Reply-To:'
header...  (I'm also not a fan of HTML email, but I can live with that.)

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Re: Text search language field

От
Daniel Staal
Дата:
--As of May 12, 2012 8:08:42 PM -0400, PGSQL-Novice is alleged to have said:
> Tables:
>
> Resource:
>     "text_element"    text,
>     "text_search"    tsvector,
>     "language"        char varying(3) references "languages"
>
> languages  (This is 'static' list of ISO 639-3 codes, to make sure
> everything uses the same codes):
>     "code" character varying(3) primary key,
>     "description" character varying(100)

> The trigger currently throws an error 'column "language" is not of
> regconfig type' whenever I try to use it, and I'm looking for the best
> way to solve that.  The 'simple' solution of 'create text search
> configuration eng ( copy = pg_catalog.english );' didn't work.  Before I
> start into a long exploration into why and what should be done about it,
> I thought I'd seek wisdom.  ;)

--As for the rest, it is mine.

Ok, after a bit of exploration, there are two problems with my current
approach:

The 'config_column_name' passed to 'tsvector_update_trigger_column' cannot
be a text column.

The config in the config column must be fully qualified.  (So I can't use
'english', I'd have to use 'pg_catalog.english'.)

Combined, these mean I it's probably not worth trying to use the column to
tell both my app and Postgresql what language is being used.  So, anyone
have any ideas on how to write a trigger to pull the field from the
language table based on the code?

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------