Discussion: tbm_lossify causes "unbalanced" hashtables / bitmaps
Hi,

Playing around with my hashtable implementation [1] and using it for
bitmap scans / tidbitmap.c, I noticed something curious. When using a
small work_mem (4MB in my case) for large-ish tables (~5GB), performance
tanks. That's primarily caused by some issues in the hashtable code
(since fixed), but it made me look more closely at the relevant
tidbitmap code, namely tbm_lossify():

static void
tbm_lossify(TIDBitmap *tbm)
{
	...
	pagetable_start_iterate_random(tbm->pagetable, &i);
	while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
	{
		if (page->ischunk)
			continue;			/* already a chunk header */

		/*
		 * If the page would become a chunk header, we won't save anything by
		 * converting it to lossy, so skip it.
		 */
		if ((page->blockno % PAGES_PER_CHUNK) == 0)
			continue;

		/* This does the dirty work ... */
		tbm_mark_page_lossy(tbm, page->blockno);

		if (tbm->nentries <= tbm->maxentries / 2)
		{
			/* we have done enough */
			break;
		}
	}
}

tbm_mark_page_lossy() in turn deletes the individual page entry and adds
one for the chunk (spanning several pages). The relevant part is that
the loop stops once enough page entries have been lossified. That leads
to two problems:

First, because iteration (both in my implementation and dynahash) is
essentially in bucket order, the front of the hashtable ends up mostly
empty, whereas later parts of the hashtable stay full (i.e. with a lot
of collisions). The reason is that we lossify all the uncompressed
entries that sit "early" in the hash space, while the chunk entries we
insert may hash to positions later in the table. Since the number of
entries in the hashtable doesn't increase, we never grow the hashtable
to reduce the number of collisions. I've seen this cause a quite
noticeable number of conflicts (which of course are worse in an
open-addressing table than in a separately chained one).

Secondly, it leads to fairly random parts of the bitmap being
compressed. For performance it'd be better to compress "heavily used"
areas of the bitmap, instead of whatever happens to be early in the
hash space.

I'm not entirely sure how best to resolve this, but it does seem worth
addressing. One thing that might be nice is to record the last N
insertions once the table is ~90% full, and then lossify in a more
targeted manner using those, e.g. lossify block ranges (currently 256
pages long) that contain a lot of individual entries.

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20160727004333.r3e2k2y6fvk2ntup%40alap3.anarazel.de
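To make that last idea concrete, here is a rough, untested sketch of a
density-targeted lossification pass. MAX_TRACKED, DENSE_THRESHOLD and
pagetable_start_iterate() (a plain from-the-start iteration) are
invented for illustration, the fixed-size tracking array is a
stand-in for a properly bounded data structure, and mutating the table
while iterating is hand-waved exactly as in the existing loop:

#define MAX_TRACKED		64	/* invented bound for this sketch */
#define DENSE_THRESHOLD	8	/* invented density cutoff */

typedef struct RangeCount
{
	BlockNumber chunkno;		/* blockno / PAGES_PER_CHUNK */
	int			nexact;			/* exact page entries in that range */
} RangeCount;

static void
tbm_lossify_targeted(TIDBitmap *tbm)
{
	RangeCount	ranges[MAX_TRACKED];
	int			nranges = 0;
	int			j;
	pagetable_iterator i;
	PagetableEntry *page;

	/* Pass 1: count the exact page entries in each chunk-sized range. */
	pagetable_start_iterate(tbm->pagetable, &i);
	while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
	{
		BlockNumber chunkno;

		if (page->ischunk)
			continue;			/* chunk headers aren't candidates */
		chunkno = page->blockno / PAGES_PER_CHUNK;
		for (j = 0; j < nranges; j++)
			if (ranges[j].chunkno == chunkno)
				break;
		if (j == nranges)
		{
			if (nranges == MAX_TRACKED)
				continue;		/* tracking array full, skip this range */
			ranges[nranges].chunkno = chunkno;
			ranges[nranges].nexact = 0;
			nranges++;
		}
		ranges[j].nexact++;
	}

	/* Pass 2: only lossify pages sitting in densely populated ranges. */
	pagetable_start_iterate(tbm->pagetable, &i);
	while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
	{
		if (page->ischunk)
			continue;
		if ((page->blockno % PAGES_PER_CHUNK) == 0)
			continue;			/* would just become a chunk header */

		for (j = 0; j < nranges; j++)
			if (ranges[j].chunkno == page->blockno / PAGES_PER_CHUNK)
				break;
		if (j == nranges || ranges[j].nexact < DENSE_THRESHOLD)
			continue;			/* keep sparse ranges exact */

		tbm_mark_page_lossy(tbm, page->blockno);

		if (tbm->nentries <= tbm->maxentries / 2)
			break;				/* we have done enough */
	}
}

That way a range only gets compressed if doing so actually folds a good
number of exact entries into one chunk entry, at the price of an extra
full scan of the table.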
On 2016-09-23 13:58:43 -0700, Andres Freund wrote:
> static void
> tbm_lossify(TIDBitmap *tbm)
> {
> 	...
> 	pagetable_start_iterate_random(tbm->pagetable, &i);

Uh, obviously that's from one of my attempts to address the problem.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> First, because iteration (both in my implementation and dynahash) is
> essentially in bucket order, the front of the hashtable ends up mostly
> empty, whereas later parts of the hashtable stay full (i.e. with a lot
> of collisions). The reason is that we lossify all the uncompressed
> entries that sit "early" in the hash space, while the chunk entries we
> insert may hash to positions later in the table.

Hm, yeah, I had supposed that this would hit a random part of the key
space, but you're right that over successive calls it's not good.

My idea of an appropriate fix would be to resume the scan at the same
point where the last scan stopped, and work circularly around the table
when necessary. But I'm not sure there is any really good way to do
that in the dynahash infrastructure. Maybe it'd work to keep the
iteration state around, but I don't remember how well that interacts
with other insertions/deletions. What about in your implementation?

There's also the point mentioned in the existing comment, that it'd be
better to go after pages with more bits set first. Not sure of an
inexpensive way to do that (ie, one that doesn't involve multiple scans
of the hashtable). But your results suggest that maybe it'd be worth
making tbm_lossify slower in order to get better results.

			regards, tom lane
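A minimal sketch of what "resume where the last scan stopped" could look
like, assuming the page table exposes a way to start iterating at an
arbitrary bucket and to read the current bucket back out; the names
pagetable_start_iterate_at(), i.cur and tbm->lossify_start are
assumptions for this sketch, not existing API:

static void
tbm_lossify(TIDBitmap *tbm)
{
	pagetable_iterator i;
	PagetableEntry *page;

	/* resume at the bucket where the previous call stopped */
	pagetable_start_iterate_at(tbm->pagetable, &i, tbm->lossify_start);
	while ((page = pagetable_iterate(tbm->pagetable, &i)) != NULL)
	{
		if (page->ischunk)
			continue;			/* already a chunk header */
		if ((page->blockno % PAGES_PER_CHUNK) == 0)
			continue;			/* would only become a chunk header */

		tbm_mark_page_lossy(tbm, page->blockno);

		if (tbm->nentries <= tbm->maxentries / 2)
		{
			/* enough freed; remember where to pick up next time */
			tbm->lossify_start = i.cur;
			break;
		}
	}

	/*
	 * Iteration is assumed to wrap around circularly, so over repeated
	 * calls every part of the table gets its turn at being lossified.
	 */
}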
On 2016-09-23 17:40:13 -0400, Tom Lane wrote:
> My idea of an appropriate fix would be to resume the scan at the same
> point where the last scan stopped, and work circularly around the table
> when necessary.

I've played with that idea, and it does help a great deal. Not that
surprisingly, it's better than starting at a random point (which in turn
is better than always starting at one end).

> But I'm not sure there is any really good way to do that in the
> dynahash infrastructure. Maybe it'd work to keep the iteration state
> around, but I don't remember how well that interacts with other
> insertions/deletions. What about in your implementation?

It's easy enough to specify a start point. It requires exposing some
things that I don't necessarily want to expose - that's why I played
around with a random start point - but other than that it's easy to
implement. Internally growing the hashtable would invalidate that
point, but a) we're mostly done growing by the time lossification
starts, and b) it'd be quite harmless to start at the wrong point.

> There's also the point mentioned in the existing comment, that it'd be
> better to go after pages with more bits set first. Not sure of an
> inexpensive way to do that (ie, one that doesn't involve multiple scans
> of the hashtable). But your results suggest that maybe it'd be worth
> making tbm_lossify slower in order to get better results.

It's not easy, I agree.

Greetings,

Andres Freund
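For illustration, here is roughly what exposing such a start point could
look like; the type, function and field names are made up for this
sketch and the real iterator setup would be more involved, but it shows
why a stale start point is harmless - it just gets wrapped to some
valid bucket:

static void
pagetable_start_iterate_at(pagetable_hash *tb, pagetable_iterator *iter,
						   uint32 at)
{
	/*
	 * If the table has grown since 'at' was recorded, the remembered
	 * bucket number may be out of range for the current size. Wrapping
	 * it just means the circular scan starts at a different (but valid)
	 * bucket, which is fine for lossification purposes.
	 */
	iter->cur = at % tb->size;
	iter->end = iter->cur;		/* stop after one full circular pass */
	iter->done = false;
}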