Re: Expanding HOT updates for expression and partial indexes
| From | Greg Burd |
|---|---|
| Subject | Re: Expanding HOT updates for expression and partial indexes |
| Date | |
| Msg-id | BCF1BF66-7A1B-4E01-87DC-0BE45EDF2F98@greg.burd.me |
| Reply to | Re: Expanding HOT updates for expression and partial indexes (Jeff Davis <pgsql@j-davis.com>) |
| List | pgsql-hackers |
Hello again. This idea started over a year ago for me while working on a project that used JSONB and had horrible "bloat" and update times that were not fantastic. The root cause: expression indexes prevent HOT updates, and all indexes on JSONB are by definition expression indexes. That's the backstory.

The idea for the solution came from a patch [1] that was applied [2] and later reverted [3]. It evaluated the before/after tuple expressions in heap_update(), at the point where newbuf == buffer, just before deciding to use_hot_update or not. When the evaluated expressions produced equal results using a binary comparison, the index didn't need to be updated. This approach sort of worked, but it was reverted for a few reasons; here's Tom's summary:

"The problem here is that [the code that checks for equality] thinks that the type of the index column is identical to the type of the source datum for it, which is not true for any opclass making use of the opckeytype property. [4]"

Still, expanding the domain of what might go HOT seems like a good goal to me, and it was at the heart of the issues I was facing on the project using JSONB, so I kept at it. The patches I've sent on this thread have evolved from that first idea. First they addressed the specific issue raised by Tom. Then they expanded to include partial indexes as well. This worked, and I helped ship a fork of Postgres that used this approach and solved customer issues with the JSONB use case. The customer experience was better: no unnecessary index updates when indexed data wasn't modified meant no more heap/index bloat and faster update times. Vacuum could finally keep up, and performance and storage overhead were much better.

For this patch set, v1 through v17 essentially refined/reworked that original approach, but there was a serious flaw that I didn't fully appreciate until around v16/17.
I was evaluating the expressions while holding a lock on the buffer page, which a) extends the time the lock is held, and b) opens the door to self-deadlock. No bueno.

Then there was v18, a quick work-around for that. I moved the call that invokes the executor to the beginning of heap_update(), before taking the lock on the page. To do this I had to find the set of updated attributes, which I discovered was available in the executor as the updated tuple is created. This was a viable fix, but it didn't really go far enough and was a bit hackish IMO. Jeff Davis and others challenged me to move the work of identifying what changed into the executor and clean it up. I'm a sucker for a challenge. More generally, this idea made sense. While Postgres has many places where logic is tightly coupled to the way the heap works, this didn't have to be one.

To make this work I needed the update path in the executor to be interested in:

a) knowing what columns were specified in the UPDATE statement and those impacted by before/after triggers,
b) reducing that set to those attributes known to be both indexed and to have changed value,
c) finding which of those (and possibly other) attributes force new index updates.

Why? Well, that code already exists in a few places and in some cases is replicated: for (a) there is ExecGetAllUpdatedCols(), for (b) HeapDetermineColumnsInfo() and index_unchanged_by_update().

An interesting thing to note is that HeapDetermineColumnsInfo() might return a set that includes columns not returned by ExecGetAllUpdatedCols(), because HeapDetermineColumnsInfo() iterates over all indexed attributes looking for changes, and that can find an indexed attribute that was changed by heap_modify_tuple() but isn't knowable by ExecGetAllUpdatedCols(). This happens in the tsvector code; see tsvector_update_trigger() in tsvector_op.c, where if (update_needed) it calls heap_modify_tuple_by_cols(). That column isn't known to ExecGetAllUpdatedCols().
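Step (b) above can be sketched as a standalone toy in plain C. None of this is PostgreSQL code: a uint64 bitmask stands in for a Bitmapset, strings stand in for Datums, and the function name is invented for illustration. The point is only the shape of the work: walk every indexed attribute (not just the ones the UPDATE named, since triggers can modify others) and keep those whose value actually changed.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ATTRS 8

/* Toy tuple: attribute values as strings, purely for illustration. */
typedef struct ToyTuple
{
    const char *values[MAX_ATTRS];
} ToyTuple;

/*
 * Return the subset of indexed_cols whose values differ between the old
 * and new tuples.  Mirrors the idea of HeapDetermineColumnsInfo()'s scan
 * over indexed attributes, not its implementation.
 */
uint64_t
modified_indexed_attrs(uint64_t indexed_cols,
                       const ToyTuple *oldtup, const ToyTuple *newtup)
{
    uint64_t    changed = 0;

    for (int attno = 0; attno < MAX_ATTRS; attno++)
    {
        if ((indexed_cols & (UINT64_C(1) << attno)) == 0)
            continue;           /* not referenced by any index */
        if (strcmp(oldtup->values[attno], newtup->values[attno]) != 0)
            changed |= UINT64_C(1) << attno;    /* value really changed */
    }
    return changed;
}
```

The result is the "modified indexed attributes" set that the rest of the series threads through to the table AM.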
HeapDetermineColumnsInfo() is also critical when modifying catalog tuples. Catalog tuples are modified using either Form/GETSTRUCT or values/nulls/replaces plus heap_modify_tuple(), then by calling into CatalogTupleUpdate(), which calls simple_heap_update(), which calls heap_update(), where we find HeapDetermineColumnsInfo(). The interesting thing here is that when modifying catalog tuples there is knowledge of what attributes are changed, but that knowledge isn't preserved and passed into CatalogTupleUpdate(); rather, it is re-discovered in HeapDetermineColumnsInfo(). That's how catalog tuples are able to take the HOT path: they re-use that same logic. There is a fix for that [5] too (and I really hope it lands in master ASAP), but that's not the subject of this thread.

HeapDetermineColumnsInfo() also helps inform a few other decisions in heap_update(), but these have to happen after taking the buffer lock and are very heap-specific, namely:

1. Do the replica identity key attributes overlap with the modified index attributes, or do they need to be stored externally? This is passed on to ExtractReplicaIdentity() to find out whether we must augment the WAL log.
2. Do any modified indexed attributes intersect with the primary keys of the relation? If not, lower the lock mode to enable multixact to work.

HeapDetermineColumnsInfo() also takes a pragmatic approach to testing for equality when looking for modified indexed attributes: it uses datumIsEqual(), which boils down to a simple memcmp() of the before/after HeapTuple datums. This is fine in most cases, but it limits the scope of what can be HOT. Interestingly, this requirement of binary equality has leaked into other parts of the code, namely nbtree's deduplication of TIDs on page split. That code uses binary equality as well. An nbtree index with a case-insensitive collation must store both "A" and "a": despite being type-equal, they are not binary equivalent. More on this later.
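The binary-versus-logical equality distinction is easy to see outside of Postgres. A minimal standalone C sketch (not Postgres code; ci_equal stands in for what a case-insensitive collation would consider equal):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* datumIsEqual()-style check: the raw bytes must match exactly. */
bool
binary_equal(const char *a, const char *b)
{
    size_t      la = strlen(a);

    return la == strlen(b) && memcmp(a, b, la) == 0;
}

/*
 * Stand-in for a case-insensitive collation's equality: "A" and "a" are
 * equal here even though their bytes differ.
 */
bool
ci_equal(const char *a, const char *b)
{
    for (; *a != '\0' && *b != '\0'; a++, b++)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
    }
    return *a == *b;            /* equal only if both ended together */
}
```

Under memcmp() semantics "A" and "a" are distinct, so any code relying on binary equality (datumIsEqual(), nbtree deduplication) has to treat them as different values even when the collation says they are the same.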
At this point, I had my sights set on HeapDetermineColumnsInfo(). I felt that what it was doing should move into the executor, or as much of that work as possible, and outside of the buffer lock. This would also open the door to removing redundant code. My thought was that the table AM update API should have an additional argument, the "modified indexed attributes" or "mix_attrs", passed in. So, here we are at the door of v19... let's begin.

0001 - Reorganize heap update logic

This is preparatory work for the larger goal. heap_update() serves two masters, CatalogTupleUpdate()/simple_heap_update() and heap_tuple_update(), and in reality they are different but need most of the same logic that happens at the start of heap_update(). This patch splits that logic out and moves it into heap_tuple_update() and simple_heap_update(). Functionally nothing changes. That's the meat of this patch.

0002 - Track changed indexed columns in the executor during UPDATEs

This is the first core set of changes. It doesn't expand HOT updates, but it does restructure where HeapDetermineColumnsInfo()'s core work happens. A new function, ExecCheckIndexedAttrsForChanges() in nodeModifyTable.c, is now responsible for checking for changes between the Datums in the old/new TupleTableSlots. This is different from before in that we're no longer checking the new HeapTuple's Datums versus the HeapTuple we read from the buffer page while holding the lock on that page. An update starts off by reading the existing tuple using the table AM. Then a new updated tuple is created as the set of changes to the old one. The new TupleTableSlot is the combination of the existing one we just read and the changes we just recorded. So, in the executor, before calling into the table AM's update function, we have a pin on the buffer and the before/after TupleTableSlots for this update.
So, I've put the call to my new ExecCheckIndexedAttrsForChanges() function just before the call to table_tuple_update(), and I've added the "mix_attrs" into that call, which gets passed on to the heap in heap_tuple_update() and then heap_update(), and all is well.

Why is this safe? The way I read heap_update(), it has always had code to deal with cases where the tuple is concurrently updated and to react accordingly, thanks to HeapTupleSatisfiesUpdate(), which remains where it was in heap_update(). Visibility checks happened when we first read the tuple to form the updated tuple, and later in heap_update() when we call HeapTupleSatisfiesVisibility() to check for transaction-snapshot mode RI updates. So, this new update path from the executor into the table AM seems to me to be okay and almost functionally equivalent.

But there is one big change to discuss before moving to the simple_heap_update() path. In nodeModifyTable.c, tts_attr_equal() replaces heap_attr_equal(), changing the test for equality when calling into heap_tuple_update(). In the past we used datumIsEqual(), essentially a binary comparison using memcmp(); now the comparison code in tts_attr_equal() uses a type-specific equality function when available and falls back to datumIsEqual() when not.

The other parts of HeapDetermineColumnsInfo() remain, but they still happen within the simple_heap_update() and heap_tuple_update() code. That's where you'll find that, after the buffer is locked, we do (1) and (2) from above. This keeps the heap-specific work in the heap, but we've moved some work up into the executor and outside the buffer lock. This accomplishes the goal of removing HeapDetermineColumnsInfo() from heap_update() on the path that uses the table AM API heap_tuple_update(), but not on the simple_heap_update() path. That remains the same as it was in the previous patch.
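The fallback behavior described above can be sketched in a few lines. To be clear, this is not the patch's code: the function name, the callback type, and the signature here are all invented for the example; only the policy (prefer type-specific equality, fall back to bytes) comes from the patch.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-type equality callback; NULL means the type offers no
 * usable equality function. */
typedef bool (*type_eq_fn) (const void *a, const void *b, size_t len);

/*
 * Comparison policy sketch: use the type-specific equality function when
 * one exists, otherwise fall back to a byte comparison, which is what
 * datumIsEqual() amounts to.
 */
bool
attr_values_equal(type_eq_fn type_eq,
                  const void *a, const void *b, size_t len)
{
    if (type_eq != NULL)
        return type_eq(a, b, len);
    return memcmp(a, b, len) == 0;  /* binary fallback */
}

/* Example type-specific function: case-insensitive byte-wise compare,
 * standing in for a collation-aware equality operator. */
bool
ci_byte_eq(const void *a, const void *b, size_t len)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    for (size_t i = 0; i < len; i++)
    {
        if (tolower(pa[i]) != tolower(pb[i]))
            return false;
    }
    return true;
}
```

With the callback in place, "A" and "a" compare equal and the update can stay HOT; under the binary fallback they do not.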
Ideally, my patch to restructure how catalog tuples are updated [5] gets committed, and we can fully remove HeapDetermineColumnsInfo() and likely speed up all catalog updates in the process. That's what motivated [5]; please take a look. It required a huge number of changes, so I thought it deserved a life/thread of its own.

Finally, there is the ExecSimpleRelationUpdate() path and slot_modify_data(). On this path we know what attributes are being updated, so we just check whether they changed, intersect that with the set of indexed attributes, and we have our modified indexed attributes set to pass into simple_heap_update().

0003 - Replace index_unchanged_by_update with ri_ChangedIndexedCols

This patch removes the function index_unchanged_by_update() in execIndexing.c and simply re-uses the modified indexed attributes that we've stashed away in ResultRelInfo as ri_ChangedIndexedCols. This provides a hint when calling into the index AM's index_insert() function indicating whether the UPDATE was without logical change to the data. We've done that check; we don't need to do it again.

0004 - Enable HOT updates for expression and partial indexes

This finally gets us back to where this project started, but on much firmer ground than before because we're not going to self-deadlock. The idea has grown from a small function into something larger, but only out of necessity. In this patch I add ExecWhichIndexesRequireUpdates() in execIndexing.c, which implements (c), finding the set of attributes that force new index updates. This set can be very different from the modified indexed attributes. We know that some attributes are not equal to their previous versions, but does that mean the index referencing such an attribute needs a new index tuple? It may, or it may not. Here's the comment on that function that explains:

/*
 * ExecWhichIndexesRequireUpdates
 *
 * Determine which indexes need updating given modified indexed attributes.
 * This function is a companion to ExecCheckIndexedAttrsForChanges(). On the
 * surface, they appear similar but they are doing two very different things.
 *
 * For a standard index on a set of attributes this is the intersection of
 * the mix_attrs and the index attrs (key, expression, but not predicate).
 *
 * For expression indexes and indexes which implement the amcomparedatums()
 * index AM API we'll need to form index datum and compare each attribute to
 * see if any actually changed.
 *
 * For expression indexes the result of the expression might not change at
 * all; this is common with JSONB columns, which require expression indexes
 * and where it is commonplace to index a field within a document and have
 * updates that generally don't update that field.
 *
 * Partial indexes won't trigger index tuples when the old/new tuples are
 * both outside of the predicate range.
 *
 * For nbtree the amcomparedatums() API is critical as it requires that key
 * attributes are equal when they memcmp(), which might not be the case when
 * using type-specific comparison or factoring in collation which might make
 * an index case insensitive.
 *
 * All of this is to say that the goal is for the executor to know, ahead of
 * calling into the table AM for the update and before calling into the index
 * AM for inserting new index tuples, which attributes at a minimum will
 * necessitate a new index tuple.
 * ...
 */

Whereas before we were comparing Datums in the table relation, now we're comparing Datums in the index relation. Index AMs are free to store what they want; we need to know whether what's changed and referenced by the index means that the index needs a new tuple or not. In the case of a JSONB expression index (or any expression index) the expression is evaluated when calling FormIndexDatum(). The result of the expression is the Datum in the values/isnull arrays. Then we need to compare them to see if they changed.
This can be done using the same tts_attr_equal() function, but with the attribute desc of the index, not the table relation. In some cases that's not enough, for instance nbtree and its more stringent equality requirements. For that reason (and another coming up in a second) we need a new optional index AM API, amcomparedatums(). Indexes implementing this function have the ability to compare two Datums for equality in whatever way they want. For nbtree, that's a binary comparison.

The new case this also supports is indexes like GIN/RUM, where an opclass can extract zero or more pieces from the attribute and form multiple index entries when needed. Those extracted pieces might have odd equality rules as well. This opens the door for index implementations to provide information that helps inform the heap when making HOT decisions, which didn't exist before.

Take for example the Linux Foundation's DocumentDB [6] project [7], which aims to be an open source alternative to MongoDB built on top of PostgreSQL. One of the pieces of that project is its "extended RUM" index AM implementation [8]. This index extracts portions of the BSON documents stored and forms index keys from that. Here is an example:

CREATE INDEX documents_rum_index_14002
    ON documentdb_data.documents_14001
    USING documentdb_rum (document bson_rum_single_path_ops
                          (path=a, iswildcard='true', tl='2699'))

This is an index on the "document" column which uses the "bson_rum_single_path_ops" opclass to extract the portion of the BSON document that matches "path=a". For this to potentially be stored by heap as a HOT update, we need to know that what changed within that document didn't intersect with "path=a", and if it did, that the new value(s) were all equal to the old values. Equality in DocumentDB isn't what you think; it's quite odd and specific to BSON and rules defined by MongoDB, so it's important to allow the index AM to execute what's necessary for its use case.
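As a toy illustration of the expression-index case: nothing below is Postgres or DocumentDB code. A flat "key=value;..." string stands in for a JSONB/BSON document, extract_field() stands in for the indexed expression (and for FormIndexDatum() producing the index-side Datum), and both function names are invented for this sketch.

```c
#include <stdbool.h>
#include <string.h>

/*
 * "Evaluate the expression": pull the value of one field out of a flat
 * "key=value;key=value" document into buf.  An absent field yields "".
 */
const char *
extract_field(const char *doc, const char *key, char *buf, size_t buflen)
{
    size_t      klen = strlen(key);
    const char *p = doc;

    while (p && *p)
    {
        if (strncmp(p, key, klen) == 0 && p[klen] == '=')
        {
            const char *v = p + klen + 1;
            const char *end = strchr(v, ';');
            size_t      vlen = end ? (size_t) (end - v) : strlen(v);

            if (vlen >= buflen)
                vlen = buflen - 1;
            memcpy(buf, v, vlen);
            buf[vlen] = '\0';
            return buf;
        }
        p = strchr(p, ';');     /* advance to the next field, if any */
        if (p)
            p++;
    }
    buf[0] = '\0';
    return buf;
}

/*
 * The HOT-relevant question: evaluate the "expression" on the old and new
 * documents and compare the results.  If they match, the document changed
 * but the indexed expression did not, so no new index tuple is needed.
 */
bool
index_needs_new_tuple(const char *olddoc, const char *newdoc,
                      const char *key)
{
    char        a[64];
    char        b[64];

    extract_field(olddoc, key, a, sizeof a);
    extract_field(newdoc, key, b, sizeof b);
    return strcmp(a, b) != 0;
}
```

An update that rewrites the whole document while leaving the indexed field untouched evaluates to "no new index tuple", which is exactly the case that can now stay HOT.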
For the more common JSONB use case we can now have heap make an informed decision about HOT updates after evaluating the expression. For GIN/RUM/etc. implementations there is a path to HOT. For nbtree there is a way to maintain its requirements.

Is there a cost to all this? Yes, of course. There is net new work being done on some paths. FormIndexDatum() will be called more frequently, and sometimes twice for the same thing. This could be improved; there might be a way to cache that information. But to test old against new we'll have to do that work at least twice.

Is there a benefit? Yes, of course. Some redundant code paths are gone, and in the end we've increased the number of cases where HOT updates are a possibility. This especially helps out users of JSONB, but not only them.

What's left undone?

* check code coverage, then create tests covering all the new cases
* update the README.HOT documentation, wiki, etc.
* performance...

For performance I'd like to examine some worst cases, as in lots of indexes exercising a lot of the new code, where all but the last index would allow for a HOT update. That should represent the maximum amount of new overhead for this code. Then, the other side of the equation: how much does this help JSONB? I think that is something to measure in terms of TPS as well as "bloat" avoided and time spent vacuuming.

I also don't like the TU_Updating enum; I think it's a leaky abstraction and really pointless now. I'd like to remove it in favor of the bitmap of attributes known to force index tuples to be inserted. Maybe I'll layer that into the next set.

In the end, this is a lot of work, and I believe that it moves the ball forward. I'll have more metrics on that soon, I hope, but I wanted to get the conversation re-started ASAP as we're late in the v19 cycle.

Finally, the "elephant" in the room (ha!) is PHOT [9]/WARM [10][11].
Yes, some of this work does help bring a solution that allows HOT updates when only a subset of indexes is affected closer to reality. I'd be lying not to mention that here; it is a big part of my overall plan for my next year and (I hope) v20, and I need this work to get there. Specifically, PHOT is in part based on HeapDetermineColumnsInfo(), and I needed to make that more truthful for it to work. I hope you see the value in this work and will partner with me to finalize it and get it into master.

best.

-greg

[1] https://www.postgresql.org/message-id/flat/4d9928ee-a9e6-15f9-9c82-5981f13ffca6%40postgrespro.ru
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=c203d6cf8
[3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=05f84605dbeb9cf8279a157234b24bbb706c5256
[4] https://www.postgresql.org/message-id/2877.1541538838%40sss.pgh.pa.us
[5] https://www.postgresql.org/message-id/flat/2C5C8B8D-8B36-4547-88EB-BDCF9A7C8D94@greg.burd.me
[6] https://www.linuxfoundation.org/press/linux-foundation-welcomes-documentdb-to-advance-open-developer-first-nosql-innovation
[7] https://github.com/documentdb/documentdb
[8] https://github.com/documentdb/documentdb/tree/main/pg_documentdb_extended_rum
[9] https://www.postgresql.org/message-id/flat/2ECBBCA0-4D8D-4841-8872-4A5BBDC063D2%40amazon.com
[10] https://www.postgresql.org/message-id/flat/CABOikdMop5Rb_RnS2xFdAXMZGSqcJ-P-BY2ruMd%2BbuUkJ4iDPw%40mail.gmail.com
[11] https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com