Re: extracting location info from string
От | Craig Ringer |
---|---|
Тема | Re: extracting location info from string |
Дата | |
Msg-id | 4DDB02E9.9040509@postnewspapers.com.au обсуждение исходный текст |
Ответ на | Re: extracting location info from string (Tarlika Elisabeth Schmitz <postgresql3@numerixtechnology.de>) |
Список | pgsql-sql |
On 24/05/2011 6:39 AM, Tarlika Elisabeth Schmitz wrote: > Indeed. However, the situation is not quite as bleak as it appears: > - I am only dealing with 50 countries (Luxemburg and Vatican are not > amongst them) > - Only for two countries will city/region be displayed instead of > country. > - Ultimately, where the only important bit of imformation is the > country. > - The only really important persons are those from the two countries. Ah, see that's critical. You've just been able to restrict the problem domain to a much simpler task with a smaller and well-defined range of possibilities. Most of the complexity is in the nasty corner cases and weirdness, and you've (probably) just cut most of that away. > Of 17000 historical records, 4400 don't match this simple pattern. > Of the 4400, 1300 are "USA" or "Usa" instead of "United States", 900 > "North America" whatever that is! There are plenty of common + > valid region abbreviations. > > I get about 1000 new records of this type per year. I'd do this kind of analysis in a PL/Perl or PL/python function myself. It's easier to write "If <x> then <y> else <x>" logic in a readable form, and such chained tests are usually better for this sort of work. That also makes it easier to do a cleanup pass first, where you substitute common spelling errors and canonicalize country names. > However, the import process has to be as automatic as possible in such > a way that inconsistencies are flagged up for later manual > intervention. I say later because, for instance, a person's data will > have to be imported with or without location info because other new > data will link to it. That's another good reason to use a PL function for this cleanup work. It's easy to INSERT a record into a side table that flags it for later examination if necessary, and to RAISE NOTICE or to issue a NOTIFY if you need to do closer-to-realtime checking. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
В списке pgsql-sql по дате отправления: