Re: "large" IN/NOT IN subqueries result in query returning
От | John McCawley |
---|---|
Тема | Re: "large" IN/NOT IN subqueries result in query returning |
Дата | |
Msg-id | 43B19BBA.8050403@hardgeus.com обсуждение исходный текст |
Ответ на | "large" IN/NOT IN subqueries result in query returning wrong data ("George Pavlov" <gpavlov@mynewplace.com>) |
Список | pgsql-sql |
At a glance I would guess that NULL values in one or both of your tables is the culprit. NULL values always return false. Example: A quick test on my database: select count(*) FROM tbl_employee; count ------- 2689 select count(*) FROM tbl_employee WHERE username IS NULL;count ------- 35 So I have 35 null values. create table tbl_foo (username varchar(32)); insert into tbl_foo (username) values ('poop'); select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN (select tbl_foo.username FROM tbl_foo);count ------- 2654 So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL values are not included when I use the "NOT IN" my query. Nulls can be confusing. Hope this helps. George Pavlov wrote: >The following looks like a bug to me, but please someone review and tell >me what I might be missing. Seems that past a certain result set size a >"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches >even when there should be matches. No errors are returned, just faulty >data. The exact threshholds seem to depend on the data, subquery and >possibly the indexes in place. Nothing in the documentation mentions a >limit to what "IN subquery" can take so I am quite dismayed (if there is >a limit I would expect an ERROR/WARNING/something, but not wrong data). >Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on >WinXP, and 8.0.4 and 8.0.5 on Linux. > >I have two tables: > t1 (id, name, a_type) > t2 (id, name) > >The t1 table is "big" (483211 rows), the name column contains >duplicates, the a_type has only two values. The t2 table is "small" (40 >rows), the name values are unique. Some, but not all, t2.name values are >present in t1.name. To be precise, there are 10 t2.name values that do >not occur in t1.name (this is based on extraneous knowledge). I want to >find out which of the t2.name values are not ever used in t1.name: > >select count(*) from t2 where t2.name not in ( > select t1.name from t1); >--> 0 > >This should return 10, instead it returns 0!!! Adding a LIMIT to the >subquery and doing some trial and error produces very interesting >results: > >select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261683) >--> 13 >select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261684) >--> 0 > >What is so magical about 261683? The JOIN alternative produces the >correct results no matter what: > >select count(*) > from t2 left join t1 using (name) > where t1.name is null >--> 10 > >This pretty much summarizes the issue. Any thoughts greatly appreciated. >Follow a few variations to show how the threshhold varies. > >-- restrict to only one a_type > >
В списке pgsql-sql по дате отправления: