Обсуждение: B-tree descend for insertion locking
When inserting into a B-tree index, all the pages are read-locked when descending the tree. When we reach the leaf page, the read-lock is exchanged for a write-lock. There's nothing wrong with that, but why don't we just directly grab a write-lock on the leaf page? When descending, we know the level we're on, and what level the child page is. The only downside I can see is that we would unnecessarily hold a write-lock when a read-lock would suffice, if the page was just split and we have to move right. But that seems like a really bad bet - hitting the page when it was just split is highly unlikely. Grabbing the write-lock directly would obviously save one buffer unlock+lock sequence, for what it's worth, but I think it would also slightly simplify the code. Am I missing something? See attached patch. The new contract of _bt_getroot is a bit weird: it locks the returned page in the requested lock-mode, shared or exclusive, if the root page was also the leaf page. Otherwise it's locked in shared mode regardless off the requested lock mode. But OTOH, the new contract for _bt_search() is more clear now: it actually locks the returned page in the requested mode, where it used to only use the access parameter to decide whether to create a root page if the index was empty. - Heikki
Вложения
On Tue, Mar 18, 2014 at 4:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > See attached patch. The new contract of _bt_getroot is a bit weird: it locks > the returned page in the requested lock-mode, shared or exclusive, if the > root page was also the leaf page. Otherwise it's locked in shared mode > regardless off the requested lock mode. But OTOH, the new contract for > _bt_search() is more clear now: it actually locks the returned page in the > requested mode, where it used to only use the access parameter to decide > whether to create a root page if the index was empty. I had considered this myself, and am under the impression that the only reason things work this way is because it makes the interface of _bt_getroot() clearer. I think what you propose is logical, and you should pursue it, but fwiw I'm not convinced that you've really simplified _bt_search(). However, you have kind of made _bt_search() into something that succinctly represents what is useful about Lehman and Yao as compared to earlier concurrent locking techniques, and that's a good thing. I would probably have remarked on that in the comments if I were you. I certainly would not have left out some reference to L&Y. You've removed an existing one where the lock escalation occurs, emphasizing that it's safe, and I'd like to see you at least compensate for that. -- Peter Geoghegan
On Tue, Mar 18, 2014 at 4:42 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > When inserting into a B-tree index, all the pages are read-locked when > descending the tree. When we reach the leaf page, the read-lock is exchanged > for a write-lock. > > There's nothing wrong with that, but why don't we just directly grab a > write-lock on the leaf page? When descending, we know the level we're on, > and what level the child page is. The only downside I can see is that we > would unnecessarily hold a write-lock when a read-lock would suffice, if the > page was just split and we have to move right. But that seems like a really > bad bet - hitting the page when it was just split is highly unlikely. Another case could be when the page is half dead or deleted, but again chances of same are relatively less. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 18 March 2014 11:12, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > When inserting into a B-tree index, all the pages are read-locked when > descending the tree. When we reach the leaf page, the read-lock is exchanged > for a write-lock. > > There's nothing wrong with that, but why don't we just directly grab a > write-lock on the leaf page? When descending, we know the level we're on, > and what level the child page is. The only downside I can see is that we > would unnecessarily hold a write-lock when a read-lock would suffice, if the > page was just split and we have to move right. But that seems like a really > bad bet - hitting the page when it was just split is highly unlikely. Sounds good. Grabbing write lock directly will reduce contention on the buffer, not just reduce the code path. If we have a considerable number of duplicates we would normally step thru until we found a place to insert. Presumably that will happen with write lock enabled, rather than read lock. Would this slow down the insertion of highly duplicate keys under concurrent load? i.e. is this a benefit for nearly-unique but not for other cases? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 18, 2014 at 4:12 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > When inserting into a B-tree index, all the pages are read-locked when > descending the tree. When we reach the leaf page, the read-lock is exchanged > for a write-lock. > > There's nothing wrong with that, but why don't we just directly grab a > write-lock on the leaf page? Whatever happened to this idea? -- Peter Geoghegan