Re: Hash Indexes
From | Amit Kapila
---|---
Subject | Re: Hash Indexes
Date | 
Msg-id | CAA4eK1+pnr8jbpf=u64tSboVbhA0cJrzzcoTsF+Eh5_R64SQcg@mail.gmail.com
In reply to | Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> [ new patches ]
>
> Attached is yet another incremental patch with some suggested changes.
>
> + * This expects that the caller has acquired a cleanup lock on the target
> + * bucket (primary page of a bucket) and it is reponsibility of caller to
> + * release that lock.
>
> This is confusing, because it makes it sound like we retain the lock
> through the entire execution of the function, which isn't always true.
> I would say that the caller must acquire a cleanup lock on the target
> primary bucket page before calling this function, and that on return
> that page will again be write-locked.  However, the lock might be
> temporarily released in the meantime, while visiting overflow pages.
> (Attached patch has a suggested rewrite.)
>
> + * During scan of overflow pages, first we need to lock the next bucket and
> + * then release the lock on current bucket.  This ensures that any concurrent
> + * scan started after we start cleaning the bucket will always be behind the
> + * cleanup.  Allowing scans to cross vacuum will allow it to remove tuples
> + * required for sanctity of scan.
>
> This comment says that it's bad if other scans can pass our cleanup
> scan, but it doesn't explain why.  I think it's because we don't have
> page-at-a-time mode yet, and cleanup might decrease the TIDs for
> existing index entries.  (Attached patch has a suggested rewrite, but
> it might need further adjustment if my understanding of the reasons is
> incomplete.)
>

Okay, I have included your changes with a minor typo fix and updated the
README to use similar language.

> + if (delay)
> +     vacuum_delay_point();
>
> You don't really need "delay".  If we're not in a cost-accounted
> VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(),
> which should be safe (and a good idea) regardless.  (Fixed in
> attached.)
New patch contains this fix.

> + if (callback && callback(htup, callback_state))
> + {
> +     /* mark the item for deletion */
> +     deletable[ndeletable++] = offno;
> +     if (tuples_removed)
> +         *tuples_removed += 1;
> + }
> + else if (bucket_has_garbage)
> + {
> +     /* delete the tuples that are moved by split. */
> +     bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
> +                                   maxbucket,
> +                                   highmask,
> +                                   lowmask);
> +     /* mark the item for deletion */
> +     if (bucket != cur_bucket)
> +     {
> +         /*
> +          * We expect tuples to either belong to curent bucket or
> +          * new_bucket.  This is ensured because we don't allow
> +          * further splits from bucket that contains garbage.  See
> +          * comments in _hash_expandtable.
> +          */
> +         Assert(bucket == new_bucket);
> +         deletable[ndeletable++] = offno;
> +     }
> +     else if (num_index_tuples)
> +         *num_index_tuples += 1;
> + }
> + else if (num_index_tuples)
> +     *num_index_tuples += 1;
> + }
>
> OK, a couple things here.  First, it seems like we could also delete
> any tuples where ItemIdIsDead, and that seems worth doing.  In fact, I
> think we should check it prior to invoking the callback, because it's
> probably quite substantially cheaper than the callback.  Second,
> repeating deletable[ndeletable++] = offno and *num_index_tuples += 1
> doesn't seem very clean to me; I think we should introduce a new bool
> to track whether we're keeping the tuple or killing it, and then use
> that to drive which of those things we do.  (Fixed in attached.)
>

As discussed up thread, I have included your changes apart from the change
related to ItemIdIsDead.

> + if (H_HAS_GARBAGE(bucket_opaque) &&
> +     !H_INCOMPLETE_SPLIT(bucket_opaque))
>
> This is the only place in the entire patch that uses
> H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even
> here.  Don't you really want H_OLD_INCOMPLETE_SPLIT() here?  (And
> couldn't we then remove H_INCOMPLETE_SPLIT() itself?)
> There's no garbage to be removed from the "new" bucket until the next
> split, when it will take on the role of the "old" bucket.
>

Fixed.

> I think it would be a good idea to change this so that
> LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until
> LH_BUCKET_OLD_PAGE_SPLIT is cleared.  The current way is confusing,
> because those tuples are NOT garbage until the split is completed!
> Moreover, both of the places that care about
> LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that
> LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about
> LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would
> actually simplify the code very slightly.
>

Yeah, I have changed it as per the above suggestion.  However, I think
with this change we can only check for the garbage flag during vacuum.
For now, I am checking both the incomplete-split and garbage flags in
vacuum just to be extra sure, but if you also feel that we can remove the
incomplete-split check, then I will do so.

> +#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT)
> +#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT)
>
> The code isn't consistent about the use of these macros, or at least
> not in a good way.  When you care about LH_BUCKET_OLD_PAGE_SPLIT, you
> test it using the macro; when you care about H_NEW_INCOMPLETE_SPLIT,
> you ignore the macro and test it directly.  Either get rid of both
> macros and always test directly, or keep both macros and use both of
> them.  Using a macro for one but not the other is strange.
>

Used the macro for both.

> I wonder if we should rename these flags and macros.  Maybe
> LH_BUCKET_OLD_PAGE_SPLIT -> LH_BEING_SPLIT and
> LH_BUCKET_NEW_PAGE_SPLIT -> LH_BEING_POPULATED.  I think that might be
> clearer.  When LH_BEING_POPULATED is set, the bucket is being filled -
> that is, populated - from the old bucket.  And maybe
> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too.
Changed the names as per discussion up thread.

> + * Copy bucket mapping info now;  The comment in _hash_expandtable
> + * where we copy this information and calls _hash_splitbucket explains
> + * why this is OK.
>
> After a semicolon, the next word should not be capitalized.  There
> shouldn't be two spaces after a semicolon, either.  Also,
> _hash_splitbucket appears to have a verb before it (calls) and a verb
> after it (explains) and I have no idea what that means.
>

Fixed.

> + for (;;)
> + {
> +     mask = lowmask + 1;
> +     new_bucket = old_bucket | mask;
> +     if (new_bucket > metap->hashm_maxbucket)
> +     {
> +         lowmask = lowmask >> 1;
> +         continue;
> +     }
> +     blkno = BUCKET_TO_BLKNO(metap, new_bucket);
> +     break;
> + }
>
> I can't help feeling that it should be possible to do this without
> looping.  Can we ever loop more than once?  How?  Can we just use an
> if-then instead of a for-loop?
>
> Can't _hash_get_oldbucket_newblock call _hash_get_oldbucket_newbucket
> instead of duplicating the logic?
>

Changed as per discussion up thread.

> I still don't like the names of these functions very much.  If you
> said "get X from Y", it would be clear that you put in Y and you get
> out X.  If you say "X 2 Y", it would be clear that you put in X and
> you get out Y.  As it is, it's not very clear which is the input and
> which is the output.
>

Changed as per discussion up thread.

> +                            bool primary_buc_page)
>
> I think we could just go with "primary_page" here.  (Fixed in attached.)
>

Included the change in the attached version of the patch.

> + /*
> +  * Acquiring cleanup lock to clear the split-in-progress flag ensures that
> +  * there is no pending scan that has seen the flag after it is cleared.
> +  */
> + _hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE);
> + opage = BufferGetPage(bucket_obuf);
> + oopaque = (HashPageOpaque) PageGetSpecialPointer(opage);
>
> I don't understand the comment, because the code *isn't* acquiring a
> cleanup lock.
>

Removed this comment.
>> Some more review:
>>
>> The API contract of _hash_finish_split seems a bit unfortunate.  The
>> caller is supposed to have obtained a cleanup lock on both the old and
>> new buffers, but the first thing it does is walk the entire new bucket
>> chain, completely ignoring the old one.  That means holding a cleanup
>> lock on the old buffer across an unbounded number of I/O operations --
>> which also means that you can't interrupt the query by pressing ^C,
>> because an LWLock (on the old buffer) is held.
>>

Fixed in the attached patch as per the algorithm proposed a few lines
down in this mail.

> I see the problem you are talking about, but it was done to ensure
> locking order, old bucket first and then new bucket, else there could
> be a deadlock risk.  However, I think we can avoid holding the cleanup
> lock on the old bucket till we scan the new bucket to form a hash
> table of TIDs.
>
>> Moreover, the
>> requirement to hold a lock on the new buffer isn't convenient for
>> either caller; they both have to go do it, so why not move it into the
>> function?
>>
>
> Yeah, we can move the locking of the new bucket entirely into the new
> function.
>

Done.

>> Perhaps the function should be changed so that it
>> guarantees that a pin is held on the primary page of the existing
>> bucket, but no locks are held.
>>
>
> Okay, so we can change the locking order as follows:
> a. ensure a cleanup lock on the old bucket and check if the (old)
> bucket has a pending split.
> b. if there is a pending split, release the lock on the old bucket,
> but not the pin.
>
> The steps below will be performed by _hash_finish_split():
>
> c. acquire the read content lock on the new bucket and form the hash
> table of TIDs; in the process of forming the hash table, we need to
> traverse the whole bucket chain.  While traversing the bucket chain,
> release the lock on the previous bucket (both lock and pin if not a
> primary bucket page).
> d. After the hash table is formed, acquire the cleanup lock on the old
> and new buckets conditionally; if we are not able to get the cleanup
> lock on either, then just return from there.
> e. Perform the split operation.
> f. release the lock and pin on the new bucket.
> g. release the lock on the old bucket.  We don't want to release the
> pin on the old bucket as the caller has ensured it before passing it to
> _hash_finish_split(), so releasing the pin should be the responsibility
> of the caller.
>
> Now, both the callers need to ensure that they restart the operation
> from the beginning, as after we release the lock on the old bucket,
> somebody might have split the bucket.
>
> Does the above change in locking strategy sound okay?
>

I have changed the locking strategy as per the above description and
accordingly changed the prototype of _hash_finish_split.

>> Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf),
>> should it instead be trying to get the lock conditionally and
>> returning immediately if it doesn't get the lock?  Seems like a good
>> idea...
>>
>
> Yeah, we can take a cleanup lock conditionally, but it would waste the
> effort of forming the hash table if we don't get the cleanup lock
> immediately.  Considering incomplete splits to be a rare operation,
> maybe the wasted effort is okay, but I am not sure.  Don't you think
> we should avoid that effort?
>

Changed it to a conditional lock.

>> * We're at the end of the old bucket chain, so we're done partitioning
>> * the tuples.  Mark the old and new buckets to indicate split is
>> * finished.
>> *
>> * To avoid deadlocks due to locking order of buckets, first lock the old
>> * bucket and then the new bucket.
>>
>> These comments have drifted too far from the code to which they refer.
>> The first part is basically making the same point as the
>> slightly-later comment /* indicate that split is finished */.
>>
>
> I think we can remove the second comment /* indicate that split is
> finished */.

Removed this comment.
> itupsize = new_itup->t_info & INDEX_SIZE_MASK;
> new_itup->t_info &= ~INDEX_SIZE_MASK;
> new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK;
> new_itup->t_info |= itupsize;
>
> If I'm not mistaken, you could omit the first, second, and fourth
> lines here and keep only the third one, and it would do exactly the
> same thing.  The first line saves the bits in INDEX_SIZE_MASK.  The
> second line clears the bits in INDEX_SIZE_MASK.  The fourth line
> re-sets the bits that were originally saved.
>

You are right, and I have changed the code as per your suggestion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments