Re: Intermittent "cache lookup failed for type" buildfarm failures
From | Tom Lane |
---|---|
Subject | Re: Intermittent "cache lookup failed for type" buildfarm failures |
Date | |
Msg-id | 21004.1472131937@sss.pgh.pa.us |
In reply to | Intermittent "cache lookup failed for type" buildfarm failures (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
I wrote:
> There is something rotten in the state of Denmark. Here are four recent
> runs that failed with unexpected "cache lookup failed for type nnnn"
> errors:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grouse&dt=2016-08-16%2008%3A39%3A03
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nudibranch&dt=2016-08-13%2009%3A55%3A09
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-08-09%2001%3A46%3A17
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2016-08-09%2000%3A44%3A18

I believe I've figured this out. I realized that all the possible instances of "cache lookup failed for type" are reporting failures of SearchSysCache1(TYPEOID, ...) or related calls, and therefore I could narrow this down by setting a breakpoint there on the combination of cacheId = TYPEOID and key1 > 16384 (since the OIDs reported for the failures are clearly for some non-builtin type). After a bit of logging it became clear that the only such calls occurring in the statements that are failing in the buildfarm are coming from the parser's attempts to resolve an operator name.

And then it was blindingly obvious what changed recently: commits f0c7b789a et al added a test case in case.sql that creates and then drops both an '=' operator and the type it's for. And that runs in parallel with the failing tests, which all need to resolve operators named '='. So in the other sessions, the parser is seeing that transient '=' operator as a possible candidate, but then when it goes to test whether that operator could match the actual inputs, the type is already gone (causing a failure in getBaseType or get_element_type or possibly other places).

The best short-term fix, and the only one I'd consider back-patching, is to band-aid the test to prevent this problem, probably by wrapping that whole test case in BEGIN ... ROLLBACK so that concurrent tests never see the transient '=' operator.
In the long run, it'd be nice if we were more robust about such situations, but I have to admit I have no idea how to go about making that so. Certainly, just letting the parser ignore catalog lookup failures doesn't sound attractive.

regards, tom lane