Re: [PATCH] Lazy xid assingment V2
From | Florian G. Pflug |
---|---|
Subject | Re: [PATCH] Lazy xid assingment V2 |
Date | |
Msg-id | 46D9C317.70807@phlo.org |
In response to | Re: [PATCH] Lazy xid assingment V2 ("Heikki Linnakangas" <heikki@enterprisedb.com>) |
List | pgsql-hackers |
Heikki Linnakangas wrote:
> Tom Lane wrote:
>> I had an idea this morning that might be useful: back off the strength
>> of what we try to guarantee. Specifically, does it matter if we leak a
>> file on crash, as long as it isn't occupying a lot of disk space?
>> (I suppose if you had enough crashes to accumulate many thousands of
>> leaked files, the directory entries would start to be a performance drag,
>> but if your DB crashes that much you have other problems.) This leads
>> to the idea that we don't really need to protect the open(O_CREAT) per
>> se. Rather, we can emit a WAL entry *after* successful creation of a
>> file, while it's still empty. This eliminates all the issues about
>> logging an action that might fail. The WAL entry would need to include
>> the relfilenode and the creating XID. Crash recovery would track these
>> until it saw the commit or abort or prepare record for the XID, and if
>> it didn't find any, would remove the file.
>
> That idea, like all other approaches based on tracking WAL records, fail
> if there's a checkpoint after the WAL record (and that's quite likely to
> happen if the file is large). WAL replay wouldn't see the file creation
> WAL entry, and wouldn't know to track the xid. We'd need a way to carry
> the information over checkpoints.

Yes, checkpoints would need to include a list of
created-but-not-yet-committed files. I think the hardest part is figuring
out a way to get that information to the backend doing the checkpoint - my
idea was to track them in shared memory, but that would impose a hard limit
on the number of concurrent file creations. Not nice :-(

But wait... I just had an idea. We already have such a central list of
created-but-uncommitted files - pg_class itself. There is a small window
between creating the file and inserting its name into pg_class - but as Tom
says, if we leak the file then, it won't use up much space anyway.
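To make the checkpoint problem in the quoted exchange concrete, here is a toy model of the recovery-side tracking. It is only a sketch: the record classes, `files_to_remove`, `redo_start` and `pending_at_checkpoint` are names invented for this illustration, not PostgreSQL's WAL format or recovery code.

```python
from dataclasses import dataclass


# Hypothetical record types for this sketch only; these are NOT
# PostgreSQL's real WAL record formats.
@dataclass(frozen=True)
class FileCreate:
    xid: int
    relfilenode: int


@dataclass(frozen=True)
class XactEnd:
    xid: int
    kind: str  # "commit", "abort" or "prepare"


def files_to_remove(wal, redo_start=0, pending_at_checkpoint=None):
    """Replay wal[redo_start:] and return the relfilenodes whose creating
    XID never reached a commit, abort or prepare record.

    pending_at_checkpoint maps xid -> set of relfilenodes. It stands in
    for the list of created-but-uncommitted files that a checkpoint would
    have to carry (an assumption of this sketch).
    """
    pending = {
        xid: set(nodes)
        for xid, nodes in (pending_at_checkpoint or {}).items()
    }
    for rec in wal[redo_start:]:
        if isinstance(rec, FileCreate):
            pending.setdefault(rec.xid, set()).add(rec.relfilenode)
        elif isinstance(rec, XactEnd):
            # Any of the three record kinds ends the tracking for this XID.
            pending.pop(rec.xid, None)
    return sorted(node for nodes in pending.values() for node in nodes)


wal = [
    FileCreate(xid=100, relfilenode=16384),  # xid 100 never finishes
    FileCreate(xid=101, relfilenode=16385),
    XactEnd(xid=101, kind="commit"),
]

print(files_to_remove(wal))                # whole log replayed
print(files_to_remove(wal, redo_start=1))  # checkpoint after first record
print(files_to_remove(wal, redo_start=1,
                      pending_at_checkpoint={100: {16384}}))
```

Replaying the whole log finds the orphaned file of xid 100. Starting replay after a checkpoint that follows the creation record misses it, and passing the pending list along with the checkpoint finds it again.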
So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely
delete, no?

If we *do* want a strict no-leakage guarantee, then we'd have to update
pg_class before creating the file, and flush the WAL. If we take Alvaro's
idea of storing temporary relations in a separate directory, we could skip
the flush for those, because we can just clean out that directory after
recovery. Having to flush the WAL when creating non-temporary relations
doesn't sound too bad - those operations won't occur very often, I'd say.

greetings, Florian Pflug
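The pg_class scan proposed in the message can be modelled roughly as follows. This is a sketch under stated assumptions: `deletable_files` is a hypothetical helper, each pg_class tuple version is reduced to a `(relfilenode, is_dead)` pair, and `is_dead` is assumed to mean that VACUUM already considers the tuple DEAD. It does not reflect how VACUUM is actually implemented.

```python
def deletable_files(pg_class_tuples, files_on_disk):
    """Return the on-disk relfilenodes referenced only from DEAD tuples.

    pg_class_tuples: iterable of (relfilenode, is_dead) pairs, one per
        tuple version in pg_class (simplified model, see lead-in).
    files_on_disk: iterable of relfilenodes that exist as files.
    """
    dead_refs = set()
    other_refs = set()
    for relfilenode, is_dead in pg_class_tuples:
        if is_dead:
            dead_refs.add(relfilenode)
        else:
            other_refs.add(relfilenode)
    # A file is a candidate only if no non-dead tuple still points at it.
    return sorted((dead_refs - other_refs) & set(files_on_disk))


tuples = [
    (16384, True),   # created by a transaction that never committed
    (16385, True),   # old version of an updated row ...
    (16385, False),  # ... whose new version still references the file
    (16386, False),  # ordinary live relation
]
files = {16384, 16385, 16386, 16999}

print(deletable_files(tuples, files))
```

In this model only 16384 is reported. File 16999 has no pg_class entry at all, which corresponds to the small window between file creation and the pg_class insert, so the scan leaves it alone and it stays leaked.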