Re: identifying duplicates in table with redundancies
От | Andreas Schmitz |
---|---|
Тема | Re: identifying duplicates in table with redundancies |
Дата | |
Msg-id | 4CA2FB63.2030102@longimanus.net обсуждение исходный текст |
Ответ на | Re: identifying duplicates in table with redundancies ("Tarlika Elisabeth Schmitz" <postgresql2@numerixtechnology.de>) |
Ответы |
Re: identifying duplicates in table with redundancies
|
Список | pgsql-sql |
On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote: > On Tue, 28 Sep 2010 11:34:31 +0100 > "Oliveiros d'Azevedo Cristina"<oliveiros.cristina@marktest.pt> wrote: > >> ----- Original Message ----- >> From: "Tarlika Elisabeth Schmitz"<postgresql@numerixtechnology.de> >> To:<pgsql-sql@postgresql.org> >> Sent: Monday, September 27, 2010 5:54 PM >> Subject: Re: [SQL] identifying duplicates in table with redundancies >> >> >>> On Fri, 24 Sep 2010 18:12:18 +0100 >>> Oliver d'Azevedo Christina<oliveiros.cristina@gmail.com> wrote: >>> >>>>>> SELECT DISTINCT trainer_id,trainer_name >>>>>> FROM ( >>>>>> SELECT trainer_name -- The field you want to test for duplicates >>>>>> FROM ( >>>>>> SELECT DISTINCT "trainer_id","trainer_name" >>>>>> FROM student >>>>>> ) x >>>>>> GROUP BY "trainer_name" -- the field you want to test for >>>>>> duplicates >>>>>> HAVING (COUNT(*)> 1) >>>>>> ) z >>>>>> NATURAL JOIN student y >>> >>> >>> What indices would you recommend for this operation? >> But, on this query in particular I would recomend an indice on >> trainer_name, as this field will be used on the join and on the group >> by. For the other query, the one you get by substituting trainer_name >> by trainer_id, place an index on trainer_id. >> Also, these indexes may help speed up the order by clause, if you use >> one. >> >> If you have a table with lots of data you can try them around and see >> how performance varies (and don't forget there's also EXPLAIN ANALYZE) > > Strangely, these indices did not do anything. > Without, the query took about 8500ms. Same with index. > > The table has 250000 records. 11000 have trainer_name = null. Only > 13000 unique trainer_names. > > It is not hugely important as these queries are not time-critical. > This is only a helper table, which I use to analyze the date prior to > populating the destination tables with the data. > > Regards, > Tarlika > I guess explain analyze shows up a seq scan. try avoiding to use distinct. use group by instead. regards Andreas
В списке pgsql-sql по дате отправления: