DISTINCT ON, when you want it, you're glad it's there
От | Robert Forsman |
---|---|
Тема | DISTINCT ON, when you want it, you're glad it's there |
Дата | |
Msg-id | 199910310138.VAA01343@nile.purplefrog.com обсуждение исходный текст |
Список | pgsql-sql |
I was recently reading the DISTINCT ON conversation logged at http://www.postgresql.org/docs/pgsql/doc/TODO.detail/distinct and felt I had to kick in. Here at Incanta I'm writing a query to choose a list of video clips for a user. One of the features of this problem is that we may have multiple encodings for a single clip. If the user can decode two formats, I'd like to offer them the shortest version of the movie (let's assume the quality is identical). I'm using the following for a jdbc PreparedStatement (using a PreparedStatement doesn't buy me anything in this case, but ignore that): "SELECT DISTINCT ON clipid c.clipid, f.fileid, c.producerid, c.creationdate, c.duration, rankVideoClip(c.clipid, c.producerid) as rank, f.length FROM videoclipregistry c, videofileregistry f WHERE c.clipid = f.clipid AND f.encrate < ? AND c.producerid = ? AND f.format IN ("+formatStr+") AND c.creationdate > ? ORDER BY rank DESC, clipid, length ASC, creationdate" I discovered rather quickly that DISTINCT ON seems to eliminate SUCCESSIVE duplicates, not global duplicates, so I order by clipid, and then length to eliminate the longer files representing a clip. I still consider the DISTINCT ON behavior a bug. I feel that DISTINCT ON is handy, otherwise I'd have to eliminate duplicates in the calling code (which might be faster, but would cost more transmission overhead and make me write code, and we all know "more code = more bugs") The other alternative would be a temporary table, then a SELECT MAX correlated subquery in the WHERE clause which I suspect postgresql can't do so you'd write an SQL function for the subquery, except how do you pass a list for IN into an SQL function?. Still, you could end up with two files with identical length and you'd have to write code to eliminate that duplicate in the calling language anyway. ARGH! P.S. If I could use a ? for the formatStr for the IN clause I could actually get some benefit from the PreparedStatement, but I don't think JDBC or ODBC has any way to do that. P.P.S. prepare to vomit at the following function definitions. Note carefully the use of COALESCE: "CREATE FUNCTION userRankVideoClip(INT, INT) RETURNS INT AS 'select COALESCE(sum(INT4( p.value) * t.tagvalue), 0) FROM "+TAGREGISTRY+" t, userpreferences p WHERE $1 = t.clipid AND (''InMotion.Producer.''||($2)||''.''||t.tagname) = p.key' LANGUAGE 'SQL'" "CREATE FUNCTION marqueeRankVideoClip(INT) RETURNS INT AS 'SELECT COALESCE(SUM(t.tagvalue), 0) FROM "+TAGREGISTRY+" t WHERE t.clipid = $1 AND t.tagname = ''marquee'' ' language 'SQL'" "CREATE FUNCTION rankVideoClip(INT, INT) RETURNS INT AS 'select userRankVideoClip($1, $2)+marqueeRankVideoClip($1)' LANGUAGE 'SQL'"
В списке pgsql-sql по дате отправления: