Re: VACUUM FULL versus relcache init files
From | Robert Haas |
---|---|
Subject | Re: VACUUM FULL versus relcache init files |
Date | |
Msg-id | CA+TgmoZxvcb9S6ojhfptJxakLwBud3kpAAg=4-UFPdwHRzR73Q@mail.gmail.com |
In reply to | VACUUM FULL versus relcache init files (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Mon, Aug 15, 2011 at 9:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This might be the last bug from my concurrent-vacuum-full testing ---
> at least, I have no remaining unexplained events from about two full
> days of running the tests. The ones that are left involve backends
> randomly failing like this:
>
> psql: FATAL: could not read block 0 in file "base/130532080/130545668": read only 0 of 8192 bytes
>
> usually during startup, though I have one example of a backend being
> repeatedly unable to access pg_proc due to similar errors.
>
> I believe what this traces to is stale relfilenode information taken
> from relcache init files, which contain precomputed relcache data
> intended to speed up backend startup. There is a curious little dance
> between write_relcache_init_file and RelationCacheInitFileInvalidate
> that is intended to ensure that when a process creates a new relcache
> init file that is already stale (because someone else invalidated the
> information concurrently), the bad file will get unlinked and not used.
> I think I invented that logic, so it's my fault that it doesn't work.
>
> It works fine as long as you consider only the two processes directly
> involved; but a third process can get fooled into using stale data.
> The scenario requires two successive invalidations, as for example from
> vacuum full on two system catalogs in a row, plus a stream of incoming
> new backends. It goes like this:
>
> 1. Process A vacuums a system catalog, unlinks init file, sends sinval
> messages, does second unlink (which does nothing).
>
> 2. Process B starts, observes lack of init file, begins to construct
> a new one. It gets to the end of AcceptInvalidationMessages in
> write_relcache_init_file. Since it started after A sent the first
> sinvals, it sees no incoming sinval messages and has no reason to think
> its new init file isn't good.
>
> 3. Meanwhile, Process A vacuums another system catalog, unlinks init
> file (doing nothing), and finally sends its sinval messages just after
> B looked for them. Now it will block trying to get RelCacheInitLock,
> which B holds.
>
> 4. Process B renames its already-stale init file into place, then
> releases RelCacheInitLock.
>
> 5. Process A gets the lock and removes the stale init file.
>
> Now process B is okay, because it will see A's second sinvals before it
> tries to make any use of the relcache data it has. And the stale init
> file is definitely gone after step 5.
>
> However, between steps 4 and 5 there is a window for Process C to start,
> read the stale init file, and attempt to use it. Since C started after
> A's second set of sinval messages, it doesn't see them and doesn't know
> it has stale data.
>
> As far as I can see at the moment, the only way to make this bulletproof
> is to turn both creation and deletion of the init file into atomic
> operations that include sinval messaging. What I have in mind is
>
> Creator: must take RelCacheInitLock, check for incoming invals, rename
> the new file into place if none, release RelCacheInitLock. (This is
> the same as what it does now.)
>
> Destroyer: must take RelCacheInitLock, unlink the init file, send its
> sinvals, release RelCacheInitLock.
>
> This guarantees that we serialize the sending of the sinval messages
> so that anyone who sees a bad init file in place *must* see the sinval
> messages afterwards, so long as they join the sinval messaging ring
> before looking for the init file (which they do). I don't think it's
> any worse than the current scheme from a parallelism point of view: the
> destroyer is holding RelCacheInitLock a bit longer than before, but that
> should not be a performance critical situation.
>
> Anybody see a hole in that?

I don't. Seems more robust than the old way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company