pg_serial bloat
From:    Thomas Munro
Subject: pg_serial bloat
Date:
Msg-id:  CA+hUKG+HQhPqZMOYpKJ18BD0ERqO7XDovqFzu293fB1ePQ3tzA@mail.gmail.com
Replies: Re: pg_serial bloat (Thomas Munro <thomas.munro@gmail.com>)
List:    pgsql-hackers
Hi,

Our pg_serial truncation logic is a bit broken, as described by the comments in CheckPointPredicate() (a sort of race between xid cycles and checkpointing). We've seen a system with ~30GB of files in there (note: full/untruncated would be 2³² xids × sizeof(uint64_t) = 32GB). It's not just a gradual disk space leak: according to disk space monitoring, this system suddenly wrote ~half of that data, which I think must be the while loop in SerialAdd() zeroing out pages. Ouch.

I see a few questions:

1. How should we fix this fundamentally in future releases? One answer is to key SSI's xid lookup with FullTransactionId (conceptually cleaner IMHO, but I'm not sure how far fxids would need to 'spread' through the system to do it right). Another, already mentioned in the comments, is to move some logic into vacuum so it can stay in sync with the xid cycle (maybe harder to think about and prove correct).

2. Could there be worse consequences than wasted disk and I/O?

3. Once a system reaches a bloated state like this, what can an administrator do?

I looked into question 3. I convinced myself that it must be safe to unlink all the files under pg_serial while the cluster is down, because:

 * we don't need the data across restarts; it's just for spilling
 * we don't need the 'head' file, because slru.c opens with O_CREAT
 * open(O_CREAT) followed by pwrite(..., offset) will create a harmless hole
 * we never read files outside the tailXid/headXid range we've written
 * we zero out pages as we add them in SerialAdd(), without reading

If I have that right, perhaps we should not merely advise that it is safe to do that manually, but proactively do it in SerialInit(). That is where we establish in shared memory that we don't expect there to be any files on disk, so it must be a good spot to make that true if it is not:

 	if (!found)
 	{
 		/*
 		 * Set control information to reflect empty SLRU.
 		 */
 		serialControl->headPage = -1;
 		serialControl->headXid = InvalidTransactionId;
 		serialControl->tailXid = InvalidTransactionId;
+
+		/* Also delete any files on disk. */
+		SlruScanDirectory(SerialSlruCtl, SlruScanDirCbDeleteAll, NULL);
 	}

In common cases that would just readdir() an empty directory.

For testing, it is quite hard to convince predicate.c to write any files there: normally you have to overflow its transaction tracking, which requires more than (max backends + max prepared xacts) × 10 SERIALIZABLE transactions in just the right sort of overlapping pattern, so that the committed ones need to be spilled to disk. I might try to write a test for that, but it gets easier if you define TEST_SUMMARIZE_SERIAL. Then you don't need many transactions, but you still need a slightly finicky schedule: start with a couple of overlapping SSI transactions, then commit them, to get a non-empty FinishedSerializableTransaction list. Then create some more SSI transactions, which will call SerialAdd() due to the TEST_ macro. Then run a checkpoint, and you should see e.g. "0000" being created on demand during SLRU writeback, demonstrating that starting from an empty pg_serial directory is always OK. I wanted to try that to remind myself of how it all works, but I suppose it should be obvious that it's OK: initdb's initial state is an empty directory.

To create a bunch of junk files that are really just thin links for the above change to unlink, or to test the truncate code when it sees a 'full' directory, you can do:

    cd pg_serial
    dd if=/dev/zero of=0000 bs=256k count=1
    awk 'BEGIN { for (i = 1; i <= 131071; i++) { printf("%04X\n", i); } }' |
        xargs -r -I {} ln 0000 {}
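As a quick sanity check of the worst-case figure quoted above (a sketch; the 8 bytes per slot stand in for sizeof(uint64_t)):

```python
# Worst-case (fully untruncated) pg_serial size: one 8-byte commit
# sequence number slot for every possible xid in the 32-bit xid space.
xid_space = 2**32        # number of distinct xids
entry_size = 8           # sizeof(uint64_t)
total_bytes = xid_space * entry_size
print(total_bytes // 2**30)  # → 32 (GiB)
```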
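Given the reasoning above, the manual cleanup an administrator could do today amounts to unlinking everything under pg_serial. A sketch of that, simulated on a scratch directory rather than a live data directory (assumption: against a real cluster this is only safe while the server is stopped, since the spilled data is never needed across restarts):

```shell
# Simulate the manual pg_serial cleanup on a scratch directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/pg_serial"
# Stand-ins for bloated SLRU segment files:
touch "$scratch/pg_serial/0000" "$scratch/pg_serial/0001"
# The cleanup itself: unlink every segment file.
rm -f "$scratch"/pg_serial/*
# Directory is empty again, matching initdb's initial state:
ls -A "$scratch/pg_serial" | wc -l
rm -rf "$scratch"
```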