Re: [HACKERS] mdnblocks is an amazing time sink in huge relations
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] mdnblocks is an amazing time sink in huge relations |
Date | |
Msg-id | 1689.940302654@sss.pgh.pa.us |
In reply to | RE: [HACKERS] mdnblocks is an amazing time sink in huge relations ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
Responses | RE: [HACKERS] mdnblocks is an amazing time sink in huge relations |
List | pgsql-hackers |
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> a shared cache for system catalog tuples, which might be a win but I'm
>> not sure (I'm worried about contention for the cache, especially if it's
>> protected by just one or a few spinlocks). Anyway, if we did have one
>> then keeping an accurate block count in the relation's pg_class row
>> would be a practical alternative.

> But there would be a problem if we use shared catalog cache.
> Being updated system tuples are only visible to an updating backend
> and other backends should see committed tuples.
> On the other hand, an accurate block count should be visible to all
> backends.
> Which tuple of a row should we load to catalog cache and update ?

Good point --- rolling back a transaction would cancel changes to the
pg_class row, but it mustn't cause the relation's file to get truncated
(since there could be tuples of other uncommitted transactions in the
newly added block(s)).

This says that having a block count column in pg_class is the Wrong
Thing; we should get rid of relpages entirely. The Right Thing is a
separate data structure in shared memory that stores the current
physical block count for each active relation. The first backend to
touch a given relation would insert an entry, and then subsequent
extensions/truncations/deletions would need to update it. We already
obtain a special lock when extending a relation, so seems like there'd
be no extra locking cost to have a table like this.

Anyone up for actually implementing this ;-) ? I have other things
I want to work on...

>> Well, it seems to me that the first misbehavior (incomplete delete becomes
>> a partial truncate, and you can try again) is a lot better than the
>> second (incomplete delete leaves an undeletable, unrecreatable table).
>> Should I go ahead and make delete/truncate work back-to-front, or do you
>> see a reason why that'd be a bad thing to do?

> I also think back-to-front is better.

OK, I have a couple other little things I want to do in md.c, so I'll
see what I can do about that. Even with a shared-memory relation
length table, back-to-front truncation would be the safest way to
proceed, so we'll want to make this change in any case.

> Deletion is necessary only not to consume disk space.
>
> For example vacuum could remove not deleted files.

Hmm ... interesting idea ... but I can hear the complaints from users
already...

			regards, tom lane
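
For illustration only, the relation length table proposed above might look
roughly like the sketch below. This is not code from the PostgreSQL tree:
every name here (RelLengthTable, rel_length_lookup_or_insert, and so on) is
hypothetical, the table is a plain fixed-size array rather than real shared
memory, and the caller is assumed to already hold whatever lock protects
relation extension, so the sketch does no locking of its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ACTIVE_RELS 64          /* hypothetical fixed capacity */

typedef uint32_t RelId;             /* stand-in for a relation identifier */
typedef uint32_t BlockCount;        /* physical length in blocks */

typedef struct RelLengthEntry
{
    bool        in_use;
    RelId       rel;
    BlockCount  nblocks;
} RelLengthEntry;

/* Assumed to live in shared memory; here it is just a struct. */
typedef struct RelLengthTable
{
    RelLengthEntry entries[MAX_ACTIVE_RELS];
} RelLengthTable;

void
rel_length_init(RelLengthTable *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));       /* marks every slot as unused */
}

static RelLengthEntry *
rel_length_find(RelLengthTable *t, RelId rel)
{
    for (size_t i = 0; i < MAX_ACTIVE_RELS; i++)
        if (t->entries[i].in_use && t->entries[i].rel == rel)
            return &t->entries[i];
    return NULL;
}

/*
 * The first backend to touch a relation inserts an entry with the block
 * count it measured itself; later callers get the stored count instead.
 * Returns false only if the table is full.
 */
bool
rel_length_lookup_or_insert(RelLengthTable *t, RelId rel,
                            BlockCount measured, BlockCount *out)
{
    RelLengthEntry *e = rel_length_find(t, rel);

    if (e == NULL)
    {
        for (size_t i = 0; i < MAX_ACTIVE_RELS && e == NULL; i++)
            if (!t->entries[i].in_use)
                e = &t->entries[i];
        if (e == NULL)
            return false;
        e->in_use = true;
        e->rel = rel;
        e->nblocks = measured;
    }
    *out = e->nblocks;
    return true;
}

/* Record the new length after an extension or truncation. */
bool
rel_length_set(RelLengthTable *t, RelId rel, BlockCount nblocks)
{
    RelLengthEntry *e = rel_length_find(t, rel);

    if (e == NULL)
        return false;
    e->nblocks = nblocks;
    return true;
}

/* Drop the entry when the relation is deleted. */
bool
rel_length_remove(RelLengthTable *t, RelId rel)
{
    RelLengthEntry *e = rel_length_find(t, rel);

    if (e == NULL)
        return false;
    e->in_use = false;
    return true;
}
```

Because the count lives outside pg_class, a transaction rollback never
touches it; only physical extension, truncation, or deletion changes it.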