Re: Batching in executor
From: Tomas Vondra
Subject: Re: Batching in executor
Date:
Msg-id: 68f3771f-91f5-4cb7-b1de-74d9abbf0b96@vondra.me
In reply to: Batching in executor (Amit Langote <amitlangote09@gmail.com>)
Responses: Re: Batching in executor
List: pgsql-hackers
Hi Amit,

Thanks for the patch. I took a look over the weekend and did a couple of experiments / benchmarks, so let me share some initial feedback (or rather a bunch of questions I came up with). I'll start with some general thoughts before going into some nitpicky comments about the patches / code and the perf results.

I think the general goal of the patch - reducing the per-tuple overhead and making the executor more efficient for OLAP workloads - is very desirable. I believe the limitations of the per-row executor are one of the reasons why attempts to implement a columnar TAM have mostly failed. The compression is nice, but it's hard to be competitive without an executor that leverages that too. So starting with the executor, in a way that helps even heap, seems like a good plan. So +1 to this.

While looking at the patch, I couldn't help but think about the index prefetching stuff that I work on. It also introduces the concept of a "batch" for passing data between an index AM and the executor. It's interesting how different the two designs are in some respects. I'm not saying one of them is wrong; it's more due to the different goals.

For example, the index prefetching patch establishes a "shared" batch struct, and the index AM is expected to fill it with data. After that, the batch is managed entirely by indexam.c, with no AM calls. The only AM-specific bit in the batch is "position", but that's used only when advancing to the next page, etc.

This patch does things differently. IIUC, each TAM may produce its own "batch", which is then wrapped in a generic one. For example, heap produces a HeapBatch, and it gets wrapped in a TupleBatch. But I think this is fine. In the prefetching we chose to move all this code (walking the batch items) from the AMs into the layer above, and make it AM-agnostic. But for the batching, we want to retain the custom format as long as possible. Presumably, the various advantages of the TAMs are tied to the custom/columnar storage format: memory efficiency thanks to compression, execution on compressed data, etc. Keeping the custom format as long as possible is the whole point of "late materialization" (and materializing as late as possible is one of the important details in column stores).

How far ahead have you thought about these capabilities? I was wondering about two things in particular. First, at which point do we have to "materialize" the TupleBatch into some generic format (e.g. TupleSlots)? I get that you want to enable passing batches between nodes, but would those use the same "format" as the underlying scan node, or some generic one? Second, will it be possible to execute expressions on the custom batches (i.e. on "compressed data"), or is it necessary to "materialize" the batch into regular tuple slots first? I realize those capabilities may not be there "now", but maybe it'd be nice to plan for the future.

It might be worth exploring some columnar formats and seeing if this design would be a good fit. Let's say we want to process data read from a parquet file. Would we be able to leverage the format, or would we need to "materialize" into slots too early? Or maybe it'd be good to look at the VCI extension [1], discussed in a nearby thread. AFAICS that's still based on an index AM, but there were suggestions to use a TAM instead (and maybe that'd be a better choice).
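To make that a bit more concrete, here's roughly the shape I have in mind. This is just a sketch - the names (TupleBatchOps, the materialize callback, etc.) are made up by me, not taken from the patch:

    #include "postgres.h"
    #include "executor/tuptable.h"

    /*
     * Sketch only, not the patch's actual API.  The executor passes around
     * a generic TupleBatch, while the TAM-specific payload (HeapBatch, a
     * parquet row group, a compressed column chunk, ...) stays opaque for
     * as long as possible - i.e. "late materialization".
     */
    typedef struct TupleBatchOps
    {
        /* number of tuples currently in the batch */
        int         (*ntuples) (void *tam_batch);

        /*
         * Materialize a single tuple into an ordinary slot.  Ideally this
         * is needed only at the point where some node can no longer work
         * on the custom representation.
         */
        void        (*materialize) (void *tam_batch, int n,
                                    TupleTableSlot *slot);

        /* release TAM resources held by the batch (pins, memory, ...) */
        void        (*release) (void *tam_batch);
    } TupleBatchOps;

    typedef struct TupleBatch
    {
        const TupleBatchOps *ops;   /* callbacks supplied by the TAM */
        void       *tam_batch;      /* e.g. heap's HeapBatch */
    } TupleBatch;

The parquet question above then boils down to whether quals/aggregates can run against whatever sits behind tam_batch directly, or whether the first consumer has to call materialize() for every tuple.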
The other option would be to "create batches" during execution, say by having a new node that accumulates tuples, builds a batch, and sends it to the node above. This would help both in cases when the lower node does not produce batches at all, and when the batches are too small (due to filtering, aggregation, ...). Of course, it'd only win if this increases the efficiency of the upper part of the plan enough to pay for building the batches. That can be a hard decision.

You also mentioned we could make batches larger by letting them span multiple pages, etc. I'm not sure that's worth it - wouldn't that substantially complicate the TAM code, which would need to pin+track multiple buffers for each batch, etc.? Possible, but is it worth it? I'm not sure allowing multi-page batches would actually solve the issue. It'd help with batches at the "scan level", but presumably the batch size in the upper nodes matters just as much. Large scan batches may help, but it's hard to predict. In the index prefetching patch we chose to keep batches 1:1 with leaf pages, at least for now. Instead we allowed having multiple batches at once. I'm not sure that'd be necessary for TAMs, though.

This also reminds me of LIMIT queries. The way I imagine a "batchified" executor to work is that batches are essentially "units of work". For example, a nested loop would grab a batch of tuples from the outer relation, look up inner tuples for the whole batch, and only then pass the result batch up. (I'm ignoring the cases when the batch explodes due to duplicates.) But what if there's a LIMIT 1 on top? Maybe it'd be enough to process just the first tuple, and the rest of the batch is wasted work? Plenty of (very expensive) OLAP queries have that, and many would likely benefit from batching, so just disabling batching if there's a LIMIT seems way too heavy-handed.

Perhaps it'd be good to gradually ramp up the batch size? Start with small batches, and then make them larger. The index prefetching does that too, indirectly - it reads the whole leaf page as a batch, but then gradually ramps up the prefetch distance (well, read_stream does that). Maybe the batching should have a similar thing ...

In fact, how shall the optimizer decide whether to use batching? It's one thing to decide whether a node can produce/consume batches, but another thing is "should it"? With a node that "builds" a batch, this decision would apply to even more plans, I guess. I don't have a great answer to this; it seems like an incredibly tricky costing issue. I'm a bit worried we might end up with something too coarse, like "jit=on", which we know is causing problems (admittedly, mostly due to a lot of the LLVM work being unpredictable/external). But having some "adaptive" heuristics (like the gradual ramp-up) might make it less risky.

FWIW the current batch size limit (64 tuples) seems rather low, but it's hard to say. It'd be good to be able to experiment with different values, so I suggest we make this a GUC and not a hard-coded constant.
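Something along these lines would probably be enough for experimenting - the name (executor_batch_size), range and group here are just a strawman of mine, not something from the patch:

    /*
     * Strawman sketch: expose the batch size limit as a GUC.  The variable
     * itself would live next to the executor batching code; this is the
     * corresponding entry for the ConfigureNamesInt[] array in
     * src/backend/utils/misc/guc_tables.c.
     */
    static int  executor_batch_size = 64;

    {
        {"executor_batch_size", PGC_USERSET, QUERY_TUNING_OTHER,
            gettext_noop("Sets the maximum number of tuples per executor batch."),
            NULL,
            GUC_EXPLAIN
        },
        &executor_batch_size,
        64, 1, 8192,
        NULL, NULL, NULL
    },

That would also let the benchmarks simply sweep a few values (16, 64, 256, ...) without rebuilding.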
As for what to add to explain, I'd start by adding info about which nodes are "batched" (consuming/producing batches), and some info about the batch sizes. An average size, maybe a histogram if you want to be a bit fancy.

I have no thoughts about the expression patches, at least not beyond what I already wrote above. I don't know enough about that part.

[1] https://www.postgresql.org/message-id/OS7PR01MB119648CA4E8502FE89056E56EEA7D2%40OS7PR01MB11964.jpnprd01.prod.outlook.com

Now, numbers from some microbenchmarks:

On 9/26/25 15:28, Amit Langote wrote:
>
> To evaluate the overheads and benefits, I ran microbenchmarks with
> single and multi-aggregate queries on a single table, with and without
> WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> and IO costs are minimal. shared_buffers was large enough to fit the
> whole table (up to 10M rows, ~43 on each page), and all pages were
> prewarmed into cache before tests. Table schema/script is at [2].
>
> Observations from benchmarking (Detailed benchmark tables are at [3];
> below is just a high-level summary of the main patterns):
>
> * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> sum(a) FROM bar_N): batching scan output alone improved latency by
> ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> especially once fmgr overhead was paid per batch instead of per row.
>
> * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> qual interpreter gave a big step up, with latencies dropping by
> ~30-40% compared to batching=off.
>
> * Five aggregates, no WHERE: batching input from the child scan cut
> ~15% off runtime. Adding batched transition evaluation increased
> improvements to ~30%.
>
> * Five aggregates, with WHERE: modest gains from scan/input batching,
> but per-batch transition evaluation and batched quals brought ~20-30%
> improvement.
>
> * Across all cases, executor overheads became visible only after IO
> was minimized. Once executor cost dominated, batching consistently
> reduced CPU time, with the largest benefits coming from avoiding
> per-row fmgr calls and evaluating quals across batches.
>
> I would appreciate if others could try these patches with their own
> microbenchmarks or workloads and see if they can reproduce numbers
> similar to mine. Feedback on both the general direction and the
> details of the patches would be very helpful. In particular, patches
> 0001-0003, which add the basic batch APIs and integrate them into
> SeqScan, are intended to be the first candidates for review and
> eventual commit. Comments on the later, more experimental patches
> (aggregate input batching and expression evaluation (qual, aggregate
> transition) batching) are also welcome.
>

I tried to replicate the results, but the numbers I see are not this good. In fact, I see a fair number of regressions (and some are not negligible).

I'm attaching the scripts I used to build the tables / run the test. I used the same table structure, and tried to follow the same query pattern with 1 or 5 aggregates (I used "avg") and [0, 1, 5] WHERE conditions (with 100% selectivity). I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off). I did that on my (relatively) new ryzen machine and an old xeon. The behavior is quite different for the two machines, but neither of them shows such improvements. I used clang 19.0, and --with-llvm.

See the attached PDFs for a summary, comparing the results for master and the two batching branches.

The ryzen is much "smoother" - it shows almost no difference with batching "off" (as expected). The "scan" branch (with 0001-0003) shows an improvement of 5-10% - it's consistent, but much less than the 10-20% you report. For the "agg" branch the benefits are much larger, but there's also a significant regression for the largest table with 100M rows (which is ~18GB on disk).
For the xeon, the results are a bit more variable, but that affects runs both with batching "on" and "off". The machine is just noisier. There seems to be a small benefit from "scan" batching (in most cases much less than 10-20%). The "agg" branch is a clear win, with up to 30-40% speedup, and no regression like the one on the ryzen.

Perhaps I did something wrong. It does not surprise me that this is somewhat CPU-dependent. It's a bit sad the improvements are smaller for the newer CPU, though.

I also tried running TPC-H. I don't have useful numbers yet, but I ran into a segfault - see the attached backtrace. It only happens with the batching, and only on Q22 for some reason. I initially thought it was a bug in clang, because I saw it with clang-22 built from git, and not with clang-14 or gcc. But since then I've reproduced it with clang-19 (on debian 13). It could still be a clang bug, of course. I've seen ~20 of those segfaults so far, and the backtraces look exactly the same.

regards

-- 
Tomas Vondra