Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
From: Peter Geoghegan
Subject: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE
Msg-id: CAM3SWZR5PQTDyJBMSod0_wzuPdRdj1OAcMpSi6xtiScfYeE64w@mail.gmail.com
In reply to: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: INSERT...ON DUPLICATE KEY LOCK FOR UPDATE (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Well, I think we can rule out value locks that are held for the
>> duration of a transaction right away. That's just not going to fly.
>
> I think I agree with that. I don't think I remember hearing that
> proposed.

I think I might have been unclear - I mean locks that are held for the
duration of *another* transaction, not our own, as we wait for that
other transaction to commit/abort. I think that earlier remarks from
yourself and Andres implied that this would be necessary. Perhaps I'm
mistaken. Your most recent design proposal doesn't do this, but I
think that's only because it restricts the user to a single unique
index - it would otherwise be necessary to sit on the earlier value
locks (index tuples belonging to an unfinished transaction) pending
the completion of some other conflicting transaction, which has
numerous disadvantages (as described in my "it suits my purposes to
have the value locks be held for only an instant" mail to you [1]).

>> If we're really lucky, maybe the value locking stuff can be
>> generalized or re-used as part of a btree index insertion buffer
>> feature.
>
> Well, that would be nifty.

Yes, it would. I think, based on a conversation with Rob Wultsch, that
it's another area where MySQL still does quite a bit better.

> I see. Well, we could try to mimic their semantics, I suppose. Those
> semantics seem like a POLA violation to me; who would have thought
> that a REPLACE could delete multiple tuples? But what do I know?

I think that it's fairly widely acknowledged to not be very good.
Every MySQL user uses INSERT...ON DUPLICATE KEY UPDATE instead.

>> The only way that's going to work is if you say "use this unique
>> index", which will look pretty gross in DML.
>
> Yeah, it's kind of awful.

It is.

>> What definition of equality or inequality?
>
> Binary equality, same as we'd use to decide whether an update can be
> done HOT.
I guess that's acceptable in theory, because binary equality is
necessarily a *stricter* condition than equality according to some
operator that is an equivalence relation. But the fact remains that
you're just ameliorating the problem by making it happen less often
(both through this kind of trick, and by restricting us to one unique
index), not actually fixing it.

> Well, there are two separate issues here: what to do about MVCC, and
> how to do the locking.

Totally agreed. Fortunately, unlike the different aspects of value and
row locking, I think that these two questions can be reasonably
considered independently.

> From an MVCC perspective, I can think of only
> two behaviors when the conflicting tuple is committed but invisible:
> roll back, or update it despite it being invisible. If you're saying
> you don't like either of those choices, I couldn't agree more, but I
> don't have a third idea. If you do, I'm all ears.

I don't have another idea either. In fact, I'd go so far as to say
that it's obviously impossible to do some third thing that any
reasonable person would consider better than those two. But I'd add
that we simply cannot roll back at READ COMMITTED, so we're just
going to have to hold our collective noses and do strange things with
visibility. FWIW, I'm tentatively looking at doing something like
this:

*************** HeapTupleSatisfiesMVCC(HeapTuple htup, S
*** 958,963 ****
--- 959,975 ----
  	 * By here, the inserting transaction has committed - have to check
  	 * when...
  	 */
+
+ 	/*
+ 	 * Not necessarily visible to snapshot under conventional MVCC rules, but
+ 	 * still locked by our xact and not updated -- importantly, normal MVCC
+ 	 * semantics apply when we update the row, so only one version will be
+ 	 * visible at once
+ 	 */
+ 	if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask) &&
+ 		TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
+ 		return true;
+
  	if (XidInMVCCSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
  		return false;		/* treat as still in progress */

This is something that I haven't given remotely enough thought yet, so
please take it with a big grain of salt.

> In terms of how to do the locking, what I'm mostly saying is that we
> could try to implement this in a way that invents as few new concepts
> as possible. No promise tuples, no new SLRU, no new page-level bits,
> just index tuples and heap tuples and so on. Ideally, we don't even
> change the WAL format, although step 2 might require a new record
> type. To the extent that what I actually described was at variance
> with that goal, consider it a defect in my explanation rather than an
> intent to vary. I think there's value in considering such an
> implementation because each new thing that we have to introduce in
> order to get this feature is a possible reason for it to be rejected -
> for modularity reasons, or because it hurts performance elsewhere, or
> because it's more code we have to maintain, or whatever.

There is certainly value in considering that, and you're right to take
that tack - it is generally valuable to have a patch be minimally
invasive. However, ultimately that's just one aspect of any given
design, an aspect that needs to be weighed against others where there
is a tension. Obviously in this instance I believe, rightly or
wrongly, that doing more - adding more infrastructure than might be
considered strictly necessary - is the least worst thing.
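To make the intended effect of that hunk concrete, here is a toy model of the visibility rule in Python - purely illustrative and mine, not the patch; the dictionary fields are invented stand-ins for the infomask bits and raw xmax that the real C code examines:

```python
def satisfies_mvcc(tup, snapshot_xids, current_xid):
    """Toy model of the proposed special case in HeapTupleSatisfiesMVCC.

    New rule first: a tuple whose xmax is a lock (not an update) taken
    by our own transaction is visible, even if its inserter's xid is
    still "in progress" according to our snapshot.
    """
    if tup["xmax_is_locked_only"] and tup["xmax"] == current_xid:
        return True
    # Conventional snapshot check: inserter still in progress per our
    # snapshot means the tuple is not visible to us.
    if tup["xmin"] in snapshot_xids:
        return False
    return True

# A row committed by xid 900 (which our snapshot still regards as in
# progress), but locked - not updated - by us, xid 1000:
locked_row = {"xmin": 900, "xmax": 1000, "xmax_is_locked_only": True}
print(satisfies_mvcc(locked_row, snapshot_xids={900}, current_xid=1000))  # True
```

The point being modeled: the row our xact has locked becomes visible to us despite the snapshot, and because ordinary MVCC rules govern the subsequent UPDATE, only one version of the row is ever visible at once.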
Also, sometimes the apparent similarity of a design to what we have
today is illusory - certainly, I think you'd at least agree that the
bloat created in the interim between value locking and row locking
presents problems qualitatively different from those that bloat
presents in all existing scenarios.

FWIW I'm not doing things this way because I'm ambitious, and am
willing to risk not having my work accepted if that means I might get
something that performs better, or has more features (like not
requiring the user to specify a unique index in DML). Rather, I'm
doing things this way because I sincerely believe that on balance mine
is the best, most forward-thinking design proposed to date, and
therefore the design most likely to ultimately be accepted (even
though I do of course accept that there are numerous aspects that
still need to be worked out). If the whole design is ultimately not
accepted, that's something that I'll have to deal with, but understand
that I don't see any way to play it safe here (except, I suppose, to
give up now).

> Now, what I hear you saying is, gee, the performance of that might be
> terrible. I'm not sure that I believe that, but it's possible that
> you're right.

I think that the average case will be okay, but not great. I think
that the worst-case performance may well be unforgivably bad, and it's
a fairly plausible worst case. Even if someone disputes its likelihood
and demonstrates that it isn't actually that likely, that isn't
necessarily very reassuring - getting all the details right is pretty
subtle, especially compared to just not bloating, and just deferring
to the btree code, whose responsibilities include enforcing
uniqueness.

> Much seems to depend on what you think the frequency of
> conflicts will be, and perhaps I'm assuming it will be low while
> you're assuming a higher value.
> Regardless, if the
> sort of implementation I'm talking about would be terrible (under some
> agreed-upon definition of what terrible means in this context), then
> that's a good argument for not doing it that way. I'm just not
> convinced that's the case.

All fair points. Forgive me for repeating myself, but the word
"conflict" needs to be used carefully here, because there are two
basic ways of interpreting it - something that happens due to
concurrent xact activity around the same values, and something that
happens due to there already being some row there with a conflicting
value from some time ago (or that our xact inserted, even). Indeed,
the former *is* generally much less likely than the latter, so the
distinction is important. You could also further differentiate between
value-level and row-level conflicts - or at least I think that you
should, and that we should allow for value-level conflicts.

Let me try to explain myself better, with reference to a concrete
example. Suppose we have a table with a primary key column, A, and a
unique constraint column, B, and we lock the pk value first and the
unique constraint value second. I'm assuming your design, but allowing
for multiple unique indexes, because I don't think doing anything less
will be accepted - promise tuples have some of the same problems, as
well as some other idiosyncratic ones (see my earlier remarks on
recovery/freezing [2] for examples of those). So there is a fairly
high probability that the pk value on A will be unique, and a fairly
low probability that the unique constraint value on B will be unique,
at least in this usage pattern of interest, where the user is mostly
going to end up updating. Mostly, we insert a speculative regular
index tuple (that points to a speculative heap tuple that we might
decide to kill) into the pk column, A, right away, and then maybe
block pending the resolution of a conflicting transaction on the
unique constraint column B.
I don't think we have any reasonable way of not blocking on A - if we
go clean it up for the wait, that's going to bloat quite dramatically,
*and* we have to WAL-log. In any case, you seemed to accept that
cleaning up bloat synchronously like that was just going to be too
expensive. So I suppose that rules that out. That just leaves sitting
on the "value lock" (which the pk index tuple already inserted
effectively is) indefinitely, pending the outcome of the first
transaction.

What are the consequences of sitting on that value lock indefinitely?
Well, xacts are going to block on the pk value much more frequently,
by simple virtue of the fact that the value locks there are held for a
long time - they just needed to hear a "no" answer, which the unique
constraint was in most cases happy to immediately give, so this is
totally unnecessary. Contention is now in a certain sense almost as
bad for every unique index as it is for the weakest link.

That's only where the problems begin, though, and it isn't necessary
for there to be bad contention on more than one unique index (the pk
could just be on a serial column, say) to see bad effects. So your
long-running xact that's blocking all the other sessions on its
proposed value for A (or maybe even B) - that finally gets to proceed.
Regardless of whether it commits or aborts, there will be a big bloat
race. This is because when the other sessions get the go-ahead to
proceed, they'll all run to get the row lock (one guy might insert
instead). Only one will be successful, but they'll all kill their heap
tuple on the assumption that they'll probably lock the row, which is
only true in the average case. Now, maybe you can teach them not to
bother killing the heap tuple when there are no index tuples actually
inserted to ruin things, but then maybe not, and maybe it wouldn't
help in this instance if you did teach them (because there's a third,
otherwise irrelevant constraint, or whatever).
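To put rough numbers on this race, here is a crude back-of-the-envelope model (Python; the counts, and the assumption that exactly one session resolves per wake-up round, are my invention for illustration, not anything measured):

```python
def retry_storm_bloat(n_blockers, n_unique_indexes=1):
    """Model the wake-up race: every still-unresolved session kills its
    speculative heap tuple betting it will win the row lock, but only
    one can win per round; the losers re-insert and race again."""
    dead_heap_tuples = 0
    remaining = n_blockers
    while remaining > 0:
        dead_heap_tuples += remaining   # everyone bets, everyone bloats
        remaining -= 1                  # one session locks the row (or inserts)
    # each dead heap tuple also left a dead entry in every unique index
    return dead_heap_tuples, dead_heap_tuples * n_unique_indexes

heap_dead, index_dead = retry_storm_bloat(10, n_unique_indexes=2)
print(heap_dead, index_dead)  # 55 dead heap tuples, 110 dead index entries
```

Under this (admittedly pessimal) model the waste is quadratic - roughly n(n+1)/2 dead heap tuples for n blocked sessions, multiplied again by the number of unique indexes.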
Realize you can generally only kill the heap tuple *before* you have
the row lock, because otherwise a totally innocent non-HOT update
(which may not update any unique-indexed columns at all) will deadlock
with your session, which I don't think is defensible, and which will
probably happen often if allowed to (after all, this is upsert - users
are going to want to update their locked rows!).

So in this scenario, each of the original blockers will simultaneously
try again and again to get the row lock as one transaction proceeds
with locking and then probably commits. For every blocker's iteration
(there will be n_blockers - 1 iterations, with each iteration
resolving things for one blocker only), each blocker bloats. We're
talking about creating duplicates in unique indexes for each and every
iteration, for each and every blocker, and we all know duplicates in
btree indexes are, in a word, bad. I can imagine one or two
ridiculously bloated indexes in this scenario. It's even degenerative
in another direction - the more aggregate bloat we have, the slower
the jump from value to row locking becomes, the more likely conflicts
are, and the more likely bloat is.

Contrast this with my design, where re-ordering of would-be
conflicters across unique indexes (or serialization failures) can
totally nip this in the bud *if* the contention can be re-ordered
around; but if not, at least there is no need to worry about
aggregating bloat at all, because it creates no bloat.

Now, you're probably thinking "but I said I'll reverify the row for
conflicts across versions, and it'll be fine - there's generally no
need to iterate and bloat again provided no unique-indexed column
changed, even if that is more likely to occur due to the clean-up pre
row locking".
Maybe I'm being unfair, but apart from requiring a considerable amount
of additional infrastructure of its own (a new EvalPlanQual()-like
thing that cares about binary equality in respect of some columns only
across row versions), I think that this is likely to turn out to be
subtly flawed in some way, simply because of the modularity violation,
so I haven't given you the benefit of the doubt about your ability to
frequently avoid repeatedly asking the index + btree code what to do.
For example, consider partial unique indexes - maybe something that
looked okay before, because you simply didn't have cause to insert
into that unique index, has to be reconsidered in light of the fact
that it changed across row versions. Are you going to stash that
knowledge too, and is it likely to affect someone who might otherwise
not have these issues really badly, because we have to assume the
worst there? Do you want to do a value-verification thing for that
too, as we do when deciding to insert into partial indexes in the
first place? Even if this new nothing-changed-across-versions
infrastructure works, will it work often enough in practice to be
worth it - have you ever tracked the proportion of updates that were
HOT updates in a production DB? It isn't uncommon for it to not be
great, and I think that we can take that as a proxy for how well this
will work. It could be totally legitimate for the UPDATE portion to
alter a unique-indexed column all the time.

> Basically, if there's a way we can do this without changing the
> on-disk format (even in a backward-compatible way), I'd be strongly
> inclined to go that route unless we have a really compelling reason to
> believe it's going to suck (or be outright impossible).

I don't believe that anything that I have proposed needs to break our
on-disk format - I hadn't considered what the implications might be in
this area for other proposals, but it's possible that that's an
additional advantage of doing value locking all in-memory.
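As an aside on the HOT-update proportion mentioned above: PostgreSQL does expose per-table counters n_tup_upd and n_tup_hot_upd in the pg_stat_user_tables view, so the proxy is easy to compute. A trivial sketch (Python; the figures plugged in are made up for illustration):

```python
def hot_update_ratio(n_tup_upd, n_tup_hot_upd):
    """Fraction of a table's updates that were HOT.

    In a live system these two counters come from pg_stat_user_tables
    (columns of the same names); the numbers used below are invented.
    """
    if n_tup_upd == 0:
        return None
    return n_tup_hot_upd / n_tup_upd

# A table where most updates happen to touch an indexed column:
print(hot_update_ratio(100000, 23000))  # -> 0.23
```

A low ratio here would suggest that the nothing-changed-across-versions optimization discussed above would also fire infrequently.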
[1] http://www.postgresql.org/message-id/CAM3SWZRV0F-DjgpXu-WxGoG9eEcLawNrEiO5+3UKRp2e5s=TSg@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAM3SWZQUUuYYcGksVytmcGqACVMkf1ui1uvfJekM15YkWZpzhw@mail.gmail.com

--
Peter Geoghegan