Re: Removing duplicates
От | Josh Berkus |
---|---|
Тема | Re: Removing duplicates |
Дата | |
Msg-id | web-810507@davinci.ethosmedia.com обсуждение исходный текст |
Ответ на | Removing duplicates (Matthew Hagerty <matthew@brwholesale.com>) |
Ответы |
Re: Removing duplicates
|
Список | pgsql-sql |
Matt, > I have a customer database (name, address1, address2, city, state, > zip) and I need a query (or two) that will give me a mailing list > with the least amount of duplicates possible. I know that precise > matching is not possible, i.e. "P.O. Box 123" will never match "PO > Box 123" without some data massaging, but if I can isolate even 50% > of any duplicates, that would help greatly. From the sound of things, you are trying to get out a mailing with theleast number of duplicates you can in a limited time,rather thantrying to clean up the list for permanent storage. Chances are, youbought or traded this list from an outsidesource, yes? In that case, here's some quick de-duplication tricks from myfundraising days: 1. Compare text columns with the punctuation stripped out. It'samazing how many typographical differences come down to punctuation.FYI, in Roberto Mello's function catalog (accessable fromhttp://techdocs.postgresql.org/ ) I believe that I postedsomepunctuation-stripping PL/pgSQL procedures. For furtherde-duplication, compare only the left 15 characters of atext field(e.g. SUBSTR(strip_string(address1), 1, 15)), but beware ... this cancause you to weed out some non-duplicates,such as multiple residentsof large apartment buildings. 2. For box office lists, you can use phonetic formulas to comparepersonal names (NOT addresses). I believe that Soundexand Metaphoneare included in /contrib these days. This does not work well onChinese or Southeast Asian names. 3. If you got phone numbers along with the addresses, these are anexcellend guage of uniqueness. > Also, any suggestions on which parameters to check the duplicates > for? My first thoughts were to make sure there were no two > addresses the same in the same zip code. Any insight (or examples) > would be greatly appreciated. Just don't forget that some of the zip codes will probably beerroneous. -Josh
В списке pgsql-sql по дате отправления: