Обсуждение: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hi All,
Suggestions
==========
When analyzing real-time data collected by PostgreSQL,
it can be difficult to tune the current PostgreSQL server for satisfactory performance.
Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory column store function that holds data in a state suitable for business analysis and is also expected to improve analysis performance.
With VCI, you can also expect to run analysis 7.8 times faster. This is achieved by the analytics engine, which optimizes parallel processing of column-oriented data, in addition to the fact that VCI stores data in a columnar format, enabling efficient retrieval of the columns needed for analysis.
Similar Features
============
One column store feature available with postgres is Citus Columnar Table.
If you introduces the citus extension, which allows columnar tables to be used using the columnar access method.
This function is intended to analyze the accumulated data. Therefore, you cannot update or delete data.
VCI supports data updates and deletions. This enables you to analyze not only the accumulated data but also the data that occurs in real time.
Implementing VCI
============
To introduce an updatable column store, we explain how updates to row-oriented data are propagated to column-oriented data.
VCI has two storage areas.
- Write Optimized Storage (WOS)
- Read Optimized Storage (ROS)
Describes WOS.
The WOS stores data for all columns in the VCI in a row-oriented format.
All newly added data is stored in the WOS relation along with the transaction information.
Using WOS to delete and update newly added data has no significant performance impact compared to deleting from columnar storage.
ROS is the storage area where all column data is stored.
When inserting/updating/deleting, data is written synchronously to WOS. It does not compress or index the data.
This avoids the overhead of converting to a columnar while updating the data.
After a certain amount of data accumulates in the WOS, the ROS control daemon converts it to column data asynchronously with updates.
Column data transformation compresses and indexes the data and writes it to ROS.
Describes searching for data.
Since there are two storage formats, the SELECT process needs to convert the WOS data to local ROS to determine whether it is visible or invisible. This conversion cost depends on the number of tuples present in the WOS file. This may introduce some performance overhead.
Obtain search results by referencing the local ROS and referencing the ROS in parallel.
These implementation ideas are also posted on Fujitsu's blog for your reference. [1]
Past discussions
===========
We've proposed features before. [2]
This thread also explains the details, so please check it.
In a previous thread, we suggested implementing a modification to the PostgreSQL backend code.
Based on the FB we received at that time, we think we need to re-implement this feature in pluggable storage using the table access method API.
I also got a FB of the features I needed from a PostgreSQLPro member. We believe it is necessary to deal with these issues in stages.
- Need to provide vector processing for nodes (filter, grand aggregate, aggregation with group by...) to speed up computation
- Requires parallel processing support such as scanning
It is assumed that the re-implementation will also require additional functionality to the current Table Access Method API.
It is useful not only for VCI but also for other access methods.
Therefore, we decided to propose the VCI feature to the community and proceed with development.
Request matter
===========
Are members of the PostgreSQL hackers interested in VCI features?
We welcome your comments and suggestions on this feature.
In particular, if you have any questions, required features, or implementations, please let me know.
Regards,
Aya Iwata
FUJITSU LIMITED
Hello, I came across this email by chance while looking for something else. On 2024-Oct-07, Aya Iwata (Fujitsu) wrote: > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory > column store function that holds data in a state suitable for business > analysis and is also expected to improve analysis performance. > With VCI, you can also expect to run analysis 7.8 times faster. This > is achieved by the analytics engine, which optimizes parallel > processing of column-oriented data, in addition to the fact that VCI > stores data in a columnar format, enabling efficient retrieval of the > columns needed for analysis. Wow. > Request matter > =========== > > Are members of the PostgreSQL hackers interested in VCI features? > We welcome your comments and suggestions on this feature. > In particular, if you have any questions, required features, or implementations, please let me know. I think this is definitely an important and welcome development. I look forward to your patches in this area. Thank you, -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane)
07.10.2024 17:53, Aya Iwata (Fujitsu) wrote: > Hi All, > > Suggestions > > ========== > > When analyzing real-time data collected by PostgreSQL, > > it can be difficult to tune the current PostgreSQL server for > satisfactory performance. > > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory > column store function that holds data in a state suitable for business > analysis and is also expected to improve analysis performance. I just don't get, why it should be "in-memory"? All the same things you describe further, but storing in paged index on-disk with caching through shared_buffers - why this way it wouldn't work?
Hi Alvaro san, I am sorry for my late reply. I continue to work on proposing VCI feature to the community. > I think this is definitely an important and welcome development. > I'm looking forward to patches in this area. Thank you! I am currently preparing to share VCI designs with PGConf.dev. I look forward to sharing more about VCI with you. Best regards, Aya Iwata FUJITSU LIMITED
Hi Yura san, > I just don't get, why it should be "in-memory"? All the same things you > describe further, but storing in paged index on-disk with caching > through shared_buffers - why this way it wouldn't work? We make the columnar store resident in memory for maximum search performance. But I'm not very particular about this. Comments are welcome. Best regards, Aya Iwata FUJITSU LIMITED > -----Original Message----- > From: Yura Sokolov <y.sokolov@postgrespro.ru> > Sent: Wednesday, January 15, 2025 11:44 PM > To: pgsql-hackers@lists.postgresql.org > Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 > > 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote: > > Hi All, > > > > Suggestions > > > > ========== > > > > When analyzing real-time data collected by PostgreSQL, > > > > it can be difficult to tune the current PostgreSQL server for > > satisfactory performance. > > > > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory > > column store function that holds data in a state suitable for business > > analysis and is also expected to improve analysis performance. > > I just don't get, why it should be "in-memory"? All the same things you > describe further, but storing in paged index on-disk with caching > through shared_buffers - why this way it wouldn't work? > >
08.04.2025 13:29, Aya Iwata (Fujitsu) wrote: > Hi Yura san, > > >> I just don't get, why it should be "in-memory"? All the same things you >> describe further, but storing in paged index on-disk with caching >> through shared_buffers - why this way it wouldn't work? > > We make the columnar store resident in memory for maximum search performance. > But I'm not very particular about this. Comments are welcome. I just wanted to say: there is no need to be super fast. There is the need to be remarkably faster than it is now. ClickHouse, DuckDB, Vertica - they are not in-memory, they are disk based. But they are very fast. If PostgreSQL will be just as twice slower as ClickHouse, it will be very great! Most of users will not setup ClickHouse at all then, because twice slower is still very fast. Databases could be very huge. Even when they are in "columnar" format, which usually consumes less space. And memory is still costs more than disk space. Certainly there are users who think they need "in-memory". But the truth is very few of them really need "in-memory". All of this is just my opinion. I could be wrong. >> -----Original Message----- >> From: Yura Sokolov <y.sokolov@postgrespro.ru> >> Sent: Wednesday, January 15, 2025 11:44 PM >> To: pgsql-hackers@lists.postgresql.org >> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 >> >> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote: >>> Hi All, >>> >>> Suggestions >>> >>> ========== >>> >>> When analyzing real-time data collected by PostgreSQL, >>> >>> it can be difficult to tune the current PostgreSQL server for >>> satisfactory performance. >>> >>> Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory >>> column store function that holds data in a state suitable for business >>> analysis and is also expected to improve analysis performance. >> >> I just don't get, why it should be "in-memory"? All the same things you >> describe further, but storing in paged index on-disk with caching >> through shared_buffers - why this way it wouldn't work? -- regards Yura Sokolov aka funny-falcon
Hello, I found some dead codes in 0001 patches. I removed. Here are new patches. Regards, Aya Iwata Fujitsu Limited
Вложения
Hi hackers, I will share the notes on the discussions in PGConf.dev. Thanks all for participation. Feedback on the Advanced Patch Feedback Session; - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK - Buffering mechanism has already been used by existing code GIN, IIUC - IAM seems a more proper approach than TAM - One reason is that VCI can only optimize the data lookup stuff - TAM needs all possible queries; it's too much - Divide the patch more and more - to allow committers to consider pros and cons and push one by one - 15 patches is the maximum amount - Separate features to some committable group Sawada san suggested several points; 1. Find parts that are useful not only for the VCI and dispatching new threads 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may be able to give up on the first version Also, I have questions and advice below on May 16th. - How to handle visibility? - What if transactions that insert tuples are aborted? - VCI vs Index Only Scan - VCI seems the same as Index Only Scan - VCI has compression. So, is the amount of size better than the Index? - How about the efficiency for storage? How much data can we store? - using TAM (Alvalo's suggestion) - Most of the code can be ported from the heap. ROS and WOS can exist as forks of main files, like .vm and .fm. API can be like: ``` CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price); ``` Again, I really appreciate your efforts. Regards, Aya Iwata Fujitsu Limited
Hello Iwata-san, Thanks for presenting the patch at pgconf.dev last week, and for the discussions at the APFS session. As promised I did a quick review of the patch this week, trying to understand the design, the trade-offs etc. I'm still trying to understand the patch, so please correct me when I got something wrong. Before I get to the review comments (mostly about the first part) and questions, let me comment on a couple items from your earlier response: On 5/17/25 18:31, Aya Iwata (Fujitsu) wrote: > Hi hackers, > > I will share the notes on the discussions in PGConf.dev. Thanks all for participation. > > Feedback on the Advanced Patch Feedback Session; > > - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK > - Buffering mechanism has already been used by existing code GIN, IIUC Yes. GIN has a "fastupdate" feature, which allows accumulating entries in a buffer, and merging them into the index later (as a batch, which makes it more efficient). This is similar to the WOS/ROS architecture. AFAIK some other columnar databases use a similar WOS/ROS architecture, with similar goal (to move the index updates into a background job, and make it more efficient). I expect there to be a trade-off between the WOS size and query performance. Larger WOS allows larger updates / better efficiency, but it likely also means slower queries (that also have to scan the WOS). > - IAM seems a more proper approach than TAM > - One reason is that VCI can only optimize the data lookup stuff > - TAM needs all possible queries; it's too much I believe this (IAM vs. TAM) is very much an open question. At the conference the goal of the patch was explained as a way to improve performance with analytical queries, while maintaining OLTP performance and features. From this point of view, it makes sense to consider VCI to be an "accelerator" feature. Which is what indexes do, i.e. it's a way to speed-up queries, but we can do without them if that's faster. If VCI got implemented as TAM, i.e. serving as the primary storage, we wouldn't have this option - we'd have to do VCI even for OLTP workloads (e.g. random inserts / selects), and that sounds quite inefficient. Not to mention the VCI would need to support everything heap does, and it supports a lot. VCI doesn't even support arbitrary data types, etc. But this hinges on the assumption that the VCI can be implemented as an IAM, right? And that IAM is a good fit for the stuff VCI needs to do. Because if it can't be implemented as IAM, then the simplicity argument is quite moot. I'm not saying it can't be done as IAM, I don't have a clear opinion on that yet. Another reason to think about this is that after the talk, someone in the audience pointed out this patch originally started on Postgres 9.5, and that back then there was no TAM interface. So when implementing it there was no IAM vs. TAM choice, IAM was the only way to do something like this, and thus it was used. Maybe if TAM was available back then, it'd be done differently? Not sure. Also, Alvaro seemed to think TAM is the way to go, and in order to keep the OLTP performance he suggested to use both heap and VCI at the same time, in different "forks". I'm not sure how would that work, or if we can already do that - AFAIK we can't, because ForkNumber does not allow adding custom forks. We'd have to relax that, or invent some sort of federated TAM (that just multiplexes it to two TAMs). Maybe. But it's not like the IAM approach doesn't need to do this. The first patch had to add stuff to a lot of random places to make this work. And some of the places touch stuff that we don't expect indexes to worry about, like ALTER TABLE, etc. > - Divide the patch more and more > - to allow committers to consider pros and cons and push one by one > - 15 patches is the maximum amount > - Separate features to some committable group > Yes. Both parts need to be divided into small (but meaningful) pieces. I don't recall where the "15 patches" came from, I don't think there's some sort of maximum number of patches. All I care about is that the patches need to be small-ish, and still make sense. Ideally, each part should be committable on it's own. This makes it easier to test/review, but also easier to agree on individual parts - in which case we may commit the "preparatory" pieces and not have to rebase them forever. The 0001 part is ~83K, which is not entirely crazy, but I can easily imagine it split into ~10 parts, each addressing a different subset. For 0002 it's even more important, because no one can review a 1.3MB patch. > Sawada san suggested several points; > 1. Find parts that are useful not only for the VCI and dispatching new threads > 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may > be able to give up on the first version > Yes, I think this makes perfect sense. I'd probably keep it in a single thread (essentially as a single patch series), because it's very useful to see the "bigger picture" why the parts are important. But if you find something that can stand on it's own, then sure. But you'll still need to keep the part in the main patch series (it should not be necessary for people to collect patches from different threads, and it would also break the CI testing commitfests patches). Breaking 0002 into smaller parts makes perfect sense. Something like: - minimum working patch, with only serial sequential scan - add parallel sequential scan - add other custom scans, one by one ... - ... > > Also, I have questions and advice below on May 16th. > > - How to handle visibility? > - What if transactions that insert tuples are aborted? > - VCI vs Index Only Scan > - VCI seems the same as Index Only Scan > - VCI has compression. So, is the amount of size better than the Index? > - How about the efficiency for storage? How much data can we store? > - using TAM (Alvalo's suggestion) > - Most of the code can be ported from the heap. ROS and WOS can exist as forks > of main files, like .vm and .fm. API can be like: > > ``` > CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price); > ``` > Yes, I think those questions are important. Not only for understanding the current patch, but also to understand what trade-offs the solution can make (e.g. about the consistency/visibility model), etc. What I really miss in the current patch is some sort of READMEs with - high-level design of the VCI indexes - description of the consistency/visibility model (does it behave the same way as querying the heap, can it be out of sync for, ...) - WOS/ROS architecture (when are rows inserted into WOS, promoted into ROS, what triggers that, ...) - what's the in-memory / on-disk format - some places in the patch mention "internal VCI tables" but I have no idea what that is - how does the execution work? compression? crucial points to consider for optimal performance, etc. - limitations (temporary - can be relaxed in the future, permanent - inherent to the columnar design) and trade-offs - what are the various custom scan executor nodes - what "background" processes happen (custom workers, ...) - anything else substantial for understanding the design Maybe there's some of this in the 0002 patch, but I haven't stumbled over it so far. ------------------------------------------------------------------------ Finally, some review comments about the 0001 patch - it's mostly in the order as appearing in git diff, file by file. Some of the points repeat, because multiple places made me think about the same thing. And some of it was mentioned above. - It seems the VCI is implemented as an index AM, but it seems to be trying to do some TAM stuff, e.g. with visibility info etc. For example, I don't think indexes generally care about XIDs that deleted rows, etc. Similarly, I can't think of another index AM that'd try to hook into tablecmds.c or execExpr. - Not sure what's the right choice (makes sense to have it as index on top of heap, so that we do regular stuff for OLTP queries, and columnar stuff using the index for analytics). But it seems it does stuff expected from table AM and not index AM. - But making a table AM would make it much more demanding (it'd have to support storing all data types, ...) and would hurt OLTP performance. - I very much support the idea of implementing this as an extension (so minimal infrastructure changes to core + extension in contrib or entirely outside). - But the patch is not there, not even with 0001, because there's a number of places mentioning "vci". There should be no reference to VCI in core, it should be generic extensibility infrastructure. It's a good indication of places that may need some extensibility. relation_open ------------------ - Why is this change (disabling assert enforcing proper locking) needed? There's a comment in lock.c which mentions VCI parallel workers, but it's suspicious. reloptions.c ------------------ - Changes that should be moved to the contrib module Why should in-core reloptions know about this? See for example how "contrib/bloom" defined reloptions. heapam.c ------------------ - add_index_delete_hook, not sure I understand the purpose of this, likely related to the consistency model, needs explanation. Why is it prefixed with "add_"? - If we need such DELETE hook, heapam.c is not where to call it, it should be called from the same "layer" adding tuples to indexes (which certainly does not happen in heap_insert). - Furthermore, if we need this, why should it be a hook and not a new callback in IndexAmRoutine, just like aminsert()? initscan ------------------ - Isn't it using keep_startblock in a wrong way? it certainly is not meant to indicate "keep the old number of blocks" - Why is this even needed? This is in heapam, so why would this even matter for VCI, which is an index? Shouldn't that be mostly independent from which TAM is used by the table? heapam_handler.c ------------------ - Why does this need to preserve the IndexHeapTuple at all? And why should it be done in heapam_handler? Again, shouldn't IAM be mostly independent of this? - Using a global variable does not seem great. We probably can't have concurrent calls (only a single heapam_index_build_range_scan call at a time), but globals are dubious. - If VCI really needs to remember the tuple, there should be a "proper" way to explicitly pass it (adjust the IndexAmRoutine callback to have a new argument or something like that). heapam_visibility.c ------------------- - It's probably sensible to make visibility snapshots extensible, i.e. to allow defining custom snapshot types etc. But the patch still does that by hard-coding stuff in core, even if the actual "visibility" code is in extension. - But this doesn't really allow custom snapshot types, those are still hardcoded SNAPSHOT_VCI_WOS2ROS / SNAPSHOT_VCI_LOCALROS. - The add_snapshot_satisfies_hook does not seem like the way to make this extensible - what if two extensions want to do this. First HeapTupleSatisfiesVisibility would have to know about all the extensions, and it could easily lead to infinite loop. Because we call the first hook, and it calls HeapTupleSatisfiesVisibility again. - This is probably a good example of VCI trying to do TAM stuff. TAM interface has tuple_satisfies_snapshot, which seems to have the same purpose as the hook. Maybe not (it still assumes regular snapshots). - But I still don't understand why should a new index AM require changes to how heap AM does visibility checks (or any visibility checks in general, as indexes don't have visibility info in PG). And why would it make sense for heap to know about this. IAM is the layer *above* TAM, in a way. xlogfuncs.c ------------------ - RECOVERY_VCI_PAUSE_REQUESTED - what's that for? why do we need a separate state from RECOVERY_PAUSE_REQUESTED? xlogrecovery.c ------------------ - I don't see why we should not log that the recovery is paused because of VCI. I sure users would still like to know why it's paused: /* If pause requested by VCI, the log is not output. */ else if (GetRecoveryPauseState() != RECOVERY_VCI_PAUSE_REQUESTED) dependency.c ------------------ - add_drop_relation_hook raises mostly the same questions as add_index_delete_hook - Don't we already have event triggers to do this kind of stuff in an extensible way? - Why should an index even care about the table getting dropped? I mean, it'll get dropped too, before the table. So why this? index.c ------------------ - Why do we need add_reindex_index_hook at all? we already have ambuild callback, right? - It seems really suspicious ConstructTupleDescriptor gains the ability to index system attributes, why is that needed? surely the VCI does not need to index system attributes, right? Or why is that needed? - It seems add_reindex_index_hook is used to "skip" rebuilding the index, but why would it be useful/necessary, and why should it be done here? - Why couldn't ambuild() just return? How do I rebuild the index if something goes wrong (e.g. it gets corrupted). Or to remove bloat, if that applies to VCI. explain.c ------------------ - T_CustomPlanMarkPos - so this introduces a new type of custom plan node? How is that different from CustomScan? We don't even have MarkPos plan in core, right? So what does it do? - It even appears to be exactly the same as CustomScan in most places, but it also introduces some new callbacks (ExplainCustomPlanTargetRel and SetBoundCustomScan). - If useful, this should be a separate patch (custom scan improvements). And it definitely needs to update the CustomScan docs in SGML. - I don't understand why we need ExplainPropertySortGroupKeys, does not align with the rest of the API (ExplainProperty is for different data types - float, text, ...). Maybe expose show_sort_group_keys? Or just copy show_sort_group_keys into the extension ... indexcmds.c ------------------ - Yet another place referencing "vci" explicitly (not extensible, should not be in core) - Not sure I understand the limitations mentioned there. tablecmds.c ------------------ - add_alter_tablespace_hook, add_alter_table_change_owner_hook, add_alter_table_change_schema_hook - Why does VCI this even need to handle these actions? - And we have event triggers, so why have these new hooks? - Not sure what the code in ATExecSetRelOptions does (another example that handles "vci" AM explicitly, in a non-extensible way). - Is it maybe rebuilding some reloptions? Maybe it's about the attnums? vacuum.c ------------------ - add_skip_vacuum_hook is used to "disable" vacuum for internal VCI tables. - Why. And what is "internal table"? execAmi.c ------------------ - what's T_CustomPlanMarkPos? execExpr.c ------------------ - Why does this expose ExecReadyExpr, ExecCreateExprSetupSteps? - Isn't this a bit too fragile? - Probably needed because VciExecInitExpr needs a new argument? But why does it need that? execExprInterp.c ------------------ - Why does this need ExprEvalVar_hook and ExprEvalParam_hook? - Doesn't even check if the hooks are != NULL - CASE_EEOP_VCI_VAR, CASE_EEOP_VCI_PARAM_EXEC - not extensible, should not reference VCI explicitly - VciExprEvalVarHook (for llvmjit?) execGrouping.c ------------------ - LookupTupleHashEntry_internal removes comment, why? execProcnode.c, execScan.c ------------------ - New MarkPos node, apparetly, but why do we need that? Why can't we just build the Materialize when constructing the plan? execUtils.c ------------------ - again, disables locking check, very suspicious nodeCustom.c ------------------ - Isn't it strange that ExecInitCustomScan short-circuits the init, skipping most of the initialization for selected scans? - If anything, it should not hard-code the scan names like this. Why is this necessary? is the rest of the initialization irrelevant, or what? Maybe this is sign it should be table AM rather than index AM? - But IOS scans would also do this initialization, so ... nodeModifytable.c ------------------ - add_should_index_insert_hook another strange hook - Doesn't the hook apply to all relations all the time, not just those with VCI indexes etc.? gen_node_support.c ------------------ - So is CustomPlanMarkPos a proper node, or why do we need it? why is CustomPlan not enough? - What's SmcAllocSetContext for? Doesn't seem to be mentioned anywhere. params.c ------------------ - ExecInitParam_hook seems unused, no? allpaths.c ------------------ - set_plain_rel_pathlist - I don't understand why this checks the VCI vs. parallelism like this, what it's trying to prevent. Maybe if you don't want certain plan shape, don't generate the partial paths in that case from the CustomScan planner callback? - It seems this tries to allow parallel seqscan, but not other more complex parallel scans (like parallel agg/join ...), but that could be done in callback, no? - Again, we should not reference the index AM this explicitly, it's not extensible. createplan.c ------------------ - I don't think we want to make copy_plan_costsize part of API. Just a detail, though. autovacuum.c ------------------ - Why does this need a custom autovacuum launcher, or what's the purpose of this code? is there a special autovacuum launcher for VCI indexes? - Again, explicit references to "VCI", should not be done like this. bgworker.c ------------------ - I have no idea what's the point of these changes. seems to be enabling canceling bgworkers, but that seems like a generic improvement. In which case it should be a separate patch. - No idea if it's correct etc. - Seems to be done to allow CountOtherDBBackends() to cancel backends, so that it can be dropped etc. - There apparently are bgworkers doing cleanup for VCI, and we want to treat them like autovacuum workers (which get interrupted in various cases instead of waiting forever). - But this will cancel all bgworkers, no? Not just the VCI ones ... lock.c ------------------ - this seems a bit strange, surely this should use the same locking approach as other parallel queries /* * Don't lock for VCI parallel workers and other type of workers should go * in normal flow, In case if there is any change in background worker * name for VCI parallel workers, the following code also needs an update. * FIXME: Try to use the community parallelism code, so that we don't need * our own VCI parallel infrastructure. */ if (AmBackgroundWorkerProcess() && strstr(MyBgworkerEntry->bgw_name, "backend=")) - Presumably this is why some of the Asserts on locking were commented out earlier. lwlock.c ------------------ - Vci_lwlock_kind seems unused. timestamp.c ------------------ - I believe the IntervalAggState is internal on purpose, why should it be made external? It also means timestamp.h has to worry about FRONTEND. relcache.c ------------------ - add_skip_vci_index_hook - yet another hook, same questions as for the other hooks earlier - It seems suspicious we have to skip VCI indexes like this during planning in RelationGetIndexAttrBitmap, not sure why that's needed? Why should't we cost the index and maybe reject it based on that? - Again, references VCI explicitly, so not extensible. mctx.c ------------------ - why MemoryContextMethods shouldn't be "const"? common.c pg_dump.c pg_dump.h ------------------ - isvciview breaks extensibility, tied to a particular IAM / extension. Needs to be abstracted / made extensible. - What is "vci view table" for, actually? reloptions.h ------------------ - Why does this need to add an entry into the relopt_kind enum? It should be extensible, see the "contrib/bloom". xlogrecovery.h ------------------ - Why is RECOVERY_VCI_PAUSE_REQUESTED separate from RECOVERY_PAUSE_REQUESTED? timestamp.h ------------------ - The FRONTEND stuff is a sign making IntervalAggState public may not be a good idea. executor.h ------------------ - How come it's suddenly OK to return entry->additional without the if conditions in TupleHashEntryGetAdditional? - Ah, it's now stored as a separate palloc chunk, not right after the tuple? Well, maybe that's a separately useful improvement? should be a separate patch, then. No opinion on whether it's a good idea. - What's with the struct LimitState; ? Why is it needed? extensible.h ------------------ - Adds SetBoundCustomScan / ExplainCustomPlanTargetRel, but then it should also add them to the SGML docs. memnodes.h ------------------ - What's SmcAllocSetContext? It's not defined anywhere. params.h ------------------ - Adds noNeedLoadFromMain, loadedFromMain to ParamExecData, but it doesn't seem to be used anywhere (in either patch). plannodes.h ------------------ - Adds plan_no to Plan, but it's not set anywhere in core, only in the VCI extension - which seems wrong (e.g. what if there are two extensions that want to set it, and they disagree?) - Either if we want/need this, it should be initialized by core during planning, in a non-ambiguous way. - But also how is this different from the existing plan_node_id? autovacuum.h ------------------ - It seems a bit strage to add "built-in" bgworkers for an extension. If the extension really needs some sort of workers, it should be defined in the extension. bgworker.h ------------------ - So what's the problem with canceling bgworkers? Should this be done as a separate patch? itemptr.h ------------------ - Why do we need this SizeOfIptrData definition? How is it different from sizeof(ItemPointerData)? lwlock.h ------------------ - LWLockKind / Vci_lwlock_kind seems unused numeric.h ------------------ - Why does this expose NumericVar, when the second patch redefines the struct anyway? also, isn't that fragile? although, we're unlikely to tweak the NumericVar definition. rel.h ------------------ - Again, not extensible way to do stuff in an extension. - We should not have pieces specific to AM in StdRdOptions. - Doesn't AM have custom reloptions now? Like contrib/bloom? I also tried to build 0002, but unfortunately that fails with: executor/vci_executor.c: In function ‘vci_setup_executor_hook’: executor/vci_executor.c:115:28: error: assignment to ‘ExecutorStart_hook_type’ {aka ‘void (*)(QueryDesc *, int)’} from incompatible pointer type ‘_Bool (*)(QueryDesc *, int)’ [-Wincompatible-pointer-types] 115 | ExecutorStart_hook = vci_executor_start_routine; | ^ executor/vci_executor.c: In function ‘vci_executor_start_routine’: executor/vci_executor.c:161:28: error: void value not ignored as it ought to be 161 | plan_valid = executor_start_prev(queryDesc, eflags); | ^ executor/vci_executor.c:163:28: error: void value not ignored as it ought to be 163 | plan_valid = standard_ExecutorStart(queryDesc, eflags); | ^ make: *** [../../src/Makefile.global:973: executor/vci_executor.o] Error 1 The extension is not added to contrib/Makefile, so "make world" does not trigger this failure. regards -- Tomas Vondra
Also, Alvaro seemed to think TAM is the way to go, and in order to keep
the OLTP performance he suggested to use both heap and VCI at the same
time, in different "forks". I'm not sure how would that work, or if we
can already do that - AFAIK we can't, because ForkNumber does not allow
adding custom forks. We'd have to relax that, or invent some sort of
federated TAM (that just multiplexes it to two TAMs). Maybe.
But it's not like the IAM approach doesn't need to do this. The first
patch had to add stuff to a lot of random places to make this work. And
some of the places touch stuff that we don't expect indexes to worry
about, like ALTER TABLE, etc.
On 6/4/25 19:59, Jim Nasby wrote: > > > On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me > <mailto:tomas@vondra.me>> wrote: > > Also, Alvaro seemed to think TAM is the way to go, and in order to keep > the OLTP performance he suggested to use both heap and VCI at the same > time, in different "forks". I'm not sure how would that work, or if we > can already do that - AFAIK we can't, because ForkNumber does not allow > adding custom forks. We'd have to relax that, or invent some sort of > federated TAM (that just multiplexes it to two TAMs). Maybe. > > But it's not like the IAM approach doesn't need to do this. The first > patch had to add stuff to a lot of random places to make this work. And > some of the places touch stuff that we don't expect indexes to worry > about, like ALTER TABLE, etc. > > > I suspect another option would be to handle this with table inheritance: > have one child that is heap-based, a second that's VCI, and a background > job to move data from heap to VCI (and vice-versa for updates and maybe > deletes). > > Note that you could actually implement all that in user-space. > Personally I'd much rather have a way to do pure VCI / column-store > sooner and manage it myself than have to wait another release (or more) > to get a complete solution... I don't see how could this ever work with the optimizer, which assumes scanning an inheritance hierarchy means scanning all parts. But this would require making planner "smarter" to know it should scan only one of the child relations. And I believe it's not possible to do that while constructing scans for the heap/VCI parts, those places are not aware of what other parts are being scanned etc. Sure, you could do this in "user-space" by constructing queries that reference either the heap or VCI part. But then why put that into inheritance tree at all? It certainly does not help with moving data between the parts. I may be missing something, of course. But it's not clear to me how is this supposed to work ... What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the columnar format in a separate fork. And using either that from custom scans, or the heap as a fallback for cases not supported by VCI. regars -- Tomas Vondra
On 6/4/25 19:59, Jim Nasby wrote:
>
>
> On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me
> <mailto:tomas@vondra.me>> wrote:
>
> Also, Alvaro seemed to think TAM is the way to go, and in order to keep
> the OLTP performance he suggested to use both heap and VCI at the same
> time, in different "forks". I'm not sure how would that work, or if we
> can already do that - AFAIK we can't, because ForkNumber does not allow
> adding custom forks. We'd have to relax that, or invent some sort of
> federated TAM (that just multiplexes it to two TAMs). Maybe.
>
> But it's not like the IAM approach doesn't need to do this. The first
> patch had to add stuff to a lot of random places to make this work. And
> some of the places touch stuff that we don't expect indexes to worry
> about, like ALTER TABLE, etc.
>
>
> I suspect another option would be to handle this with table inheritance:
> have one child that is heap-based, a second that's VCI, and a background
> job to move data from heap to VCI (and vice-versa for updates and maybe
> deletes).
>
> Note that you could actually implement all that in user-space.
> Personally I'd much rather have a way to do pure VCI / column-store
> sooner and manage it myself than have to wait another release (or more)
> to get a complete solution...
I don't see how could this ever work with the optimizer, which assumes
scanning an inheritance hierarchy means scanning all parts. But this
would require making planner "smarter" to know it should scan only one
of the child relations. And I believe it's not possible to do that while
constructing scans for the heap/VCI parts, those places are not aware of
what other parts are being scanned etc.
Sure, you could do this in "user-space" by constructing queries that
reference either the heap or VCI part. But then why put that into
inheritance tree at all? It certainly does not help with moving data
between the parts.
What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.
What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.Yeah, there'd definitely need to be some kind of proxy... I'm just suggesting that we don't *have* to do that as a separate fork...
Here are the v8 patches. The main changes are as follows: v8-0001 VCI - changes to postgres core - same v8-0002 VCI module - main - extracted "compression" related code from this main patch - applied Timur's top-up patch [1] re "session_preload_libraries" - removed some dead code v8-0003 VCI module - documentation - removed mentions about compression ~~~ To avoid too much noise, other extracted patches (below) will be maintained off-list. 0004 VCI module - compression 0005 VCI module - hothash 0006 VCI module - defer WOS2ROS ====== [1] https://www.postgresql.org/message-id/c4e314ef0a58050c8b7847ac1852555876674a69.camel%40postgrespro.ru Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Hi. Here are the latest v9* patches. These have following changes: 0001 - fixes some of the minor quirks reported by Tomas [1]. 0002 - fixes to added address some of Timur's feedback/patches [2] - test code updated to remove unsupported type - test code updated to remove dynamic costings ====== [1] Tomas - https://www.postgresql.org/message-id/a748aa6b-c7e6-4d02-a590-ab404d590448%40vondra.me [2] Timur - https://www.postgresql.org/message-id/6a6058fc089f89561b2545f024953e4daa0b8561.camel%40postgrespro.ru Kind Regards, Peter Smith Fujitsu Australia
Вложения
On Fri, Jun 20, 2025 at 4:09 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: > > Hello Peter! > > Thank you for working on VCI updates. > Here are some proposals for small improvements: Hi Timur, Thanks for your feedback! The newly posted v9-0002 addresses some of your comments. Details are below. > > Since hothash feature lives in separate patch now, vci_hothash.o should > be removed from vci/executor/Makefile of VCI-module-main patch. Fixed. Thanks. I also fixed a similar problem and removed vci_mp_rle.o from storage/Makefile. > > 0001-Avoid-magic-numbers-in-vci_is_supported_type.patch > I've looked at vci_supported_types.c and tried to rewrite it without > using magic constants. Removed call to vci_check_supported_types() from > SQL code since there is no need now to check that OID constants still > match same types. > Using defined constants for OIDs seems more robust than plain numbers. > There were 23 type OIDs supported by VCI, all of them are there in a > new version of code. I've replaced binary search by simple switch-case > since compilers optimize it into nice binary decision tree. Previous > version was binary searching among 87 structs. New version only > searches among 23 integers. Hmm. IIUC the “supported types” code was written this way for the following reason. Out of an over-abundance (?) of caution, the original patch authors wanted to call the function 'vci_check_supported_types' as a compatibility check, to discover up-front (during CREATE EXTENSION) if any of the current PostgreSQL OID definitions differ in any way since when the VCI extension was built. e.g. - if vci build supports the type oid, but the oid value in current PostgreSQL no longer exists then raise an error - if vci build supports the type oid but the oid now has a different meaning (typname mismatch) then raise an error In other words, even though these type OIDs are in a range that is supposed to be stable/unlikely to change, I think original VCI developers were deliberately cautious about any chance of OID values changing in later PostgreSQL versions. So, for the compatibility checking we still need to keep that function 'vci_check_supported_types', and therefore we also still need to keep the struct because that is where the known oid/name mapping (at time of VCI extension build) is held which is used for the checking. The current code is written so that when building a new VCI you only need to execute: "SELECT oid, typname FROM pg_type WHERE typnamespace = 11 AND typrelid = 0 AND typelem = 0 ORDER BY oid;" for the current PostgreSQL then substitute in the necessary true/false for support. I agree this means the resulting vci_is_supported_type() is not as efficient as your implementation, but OTOH it is perhaps easier to maintain and check against new PostgreSQL releases? Also, you'll encounter all the same problems with the supported functions logic of vci_supported_funcs.c -- those have similar logic, so maybe it is better to keep them similar despite inefficiencies? FYI, I have re-executed the SQL SELECT oid, typname FROM pg_type WHERE typnamespace = 11 AND typrelid = 0 AND typelem = 0 ORDER BY oid; and this exposed a few changes since whenever this vci_supported_type_table[] was last generated many years ago. ~~ AFAICT we could implement the vci_supported_type_table[] to *only* include the types where supported is “true”. That would be more efficient than now because then the entire array would only have 20ish elements, but comes at a cost of perhaps being less easy to compare with the executed SQL result. Thoughts? Also, while the vci_supported_type_table[] lookup is needed for the compatibility check logic, I guess we could implement the vci_is_supported_type() function exactly the same as your patch switch code, to be called for the runtime checking (during CREATE INDEX). That would be more efficient at runtime, but comes at a cost of then having 2 functions to maintain. Thoughts? ~~~ So, for now, I have only done the following: - Updated vci_supported_type_table[] according to the current SQL result. - Notice that “name” (NAMEOID) no longer qualifies as a VCI supported type (because typelem is not 0 anymore) so the test code was modified to remove the name column “c04”. - In passing, I removed the costings from the EXPLAIN test output > > 0002-Updated-supported-funcs-SQL-for-recent-PostgreSQL.patch > I wonder if it is possible to rewrite vci_supported_funcs.c in a > similar way. There is a vci_supported_funcs.sql which I updated so it > can be executed at master version of PostgreSQL. OK. I included your changes in v9 and also added some IF EXISTS so the .sql file can be run repeatedly without error. I also changed the matching comment in the supported_funcs.c code where the SQL of this file was described. > I don't know if it is intentional, but safe_types list from > vci_supported_funcs.sql have some differences to the types list from > vci_supported_types.c. Specifically, it doesn't include varchar, varbit > and uuid. Thanks for reporting this! TBH, I also expected these lists should be the same. I found multiple inconsistencies: - Some items are in safe_types table but NOT in the vci_supported_type_table[] - Some Items are in the vci_supported_type_table[] but are NOT in the safe_types table Furthermore (and maybe partly caused by content of safe_types), the results of running vci_support_func.sql also differs quite a lot also from the vci_supported_func_table [] (in vci_supported_funcs.c), which also does not seem right to me. - Some items are in the SQL script results but are NOT in the vci_supported_func_table [] - Some Items are in the vci_supported_func_table [] but are NOT in the SQL script results So, this remains an open problem. I am investigating if there is any history/explanation of these differences. IIUC, then I hope later to fix the safe_types, and then also fix the vci_supported_func_table[] based on the result of that SQL script. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, 2025-06-25 at 13:13 +1000, Peter Smith wrote: > Hi Timur, Thanks for your feedback! Hello Peter! > Hmm. IIUC the “supported types” code was written this way for the > following reason. Out of an over-abundance (?) of caution, the > original patch authors wanted to call the function > 'vci_check_supported_types' as a compatibility check, to discover > up-front (during CREATE EXTENSION) if any of the current PostgreSQL > OID definitions differ in any way since when the VCI extension was > built. > e.g. > - if vci build supports the type oid, but the oid value in current > PostgreSQL no longer exists then raise an error > - if vci build supports the type oid but the oid now has a different > meaning (typname mismatch) then raise an error > > In other words, even though these type OIDs are in a range that is > supposed to be stable/unlikely to change, I think original VCI > developers were deliberately cautious about any chance of OID values > changing in later PostgreSQL versions. This check is really necessary if types are maintained as plain OID numbers since they theoretically can change. Using preprocessor constants, for example, BOOLOID instead of "16", make it simpler to survive actual number change. > So, for the compatibility checking we still need to keep that > function > 'vci_check_supported_types', and therefore we also still need to keep > the struct because that is where the known oid/name mapping (at time > of VCI extension build) is held which is used for the checking. > > The current code is written so that when building a new VCI you only > need to execute: > "SELECT oid, typname FROM pg_type WHERE typnamespace = 11 AND > typrelid > = 0 AND typelem = 0 ORDER BY oid;" > for the current PostgreSQL then substitute in the necessary > true/false > for support. Using query to get numbers and storing those numbers in struct instead of using defines like <TYPE_NAME>OID needs some runtime checks indeed. > I agree this means the resulting vci_is_supported_type() is not as > efficient as your implementation, but OTOH it is perhaps easier to > maintain and check against new PostgreSQL releases? > > Also, you'll encounter all the same problems with the supported > functions logic of vci_supported_funcs.c -- those have similar logic, > so maybe it is better to keep them similar despite inefficiencies? I agree, having similar logic in both vci_supported_funcs.c and vci_supported_types.c is good for code readability. > FYI, I have re-executed the SQL > SELECT oid, typname FROM pg_type WHERE typnamespace = 11 AND typrelid > = 0 AND typelem = 0 ORDER BY oid; > and this exposed a few changes since whenever this > vci_supported_type_table[] was last generated many years ago. > > ~~ > > AFAICT we could implement the vci_supported_type_table[] to *only* > include the types where supported is “true”. That would be more > efficient than now because then the entire array would only have > 20ish > elements, but comes at a cost of perhaps being less easy to compare > with the executed SQL result. > > Thoughts? Looks like having both "false" and "true" cases inside structs is done for better maintainability. I.e. in case we only have "true"-valued types we can't say if some type is absent because it is decided to be unsupported or because it was not examined yet. > Also, while the vci_supported_type_table[] lookup is needed for the > compatibility check logic, I guess we could implement the > vci_is_supported_type() function exactly the same as your patch > switch code, to be called for the runtime checking (during CREATE > INDEX). That would be more efficient at runtime, but comes at a cost > of then having 2 functions to maintain. > > Thoughts? I think having two functions could make maintainability worse. Maybe one day those checks will be rewritten using some smart runtime logic in C, not compile-time list of types that were previously queried using SQL script. So I don't insist on my patch, it was just a proposal. It has less code but probably we can do even better or at least stick to current code running SQL query to get actual list of types from time to time. > ~~~ > > So, for now, I have only done the following: > - Updated vci_supported_type_table[] according to the current SQL > result. > - Notice that “name” (NAMEOID) no longer qualifies as a VCI supported > type (because typelem is not 0 anymore) so the test code was modified > to remove the name column “c04”. > - In passing, I removed the costings from the EXPLAIN test output > > > > > 0002-Updated-supported-funcs-SQL-for-recent-PostgreSQL.patch > > I wonder if it is possible to rewrite vci_supported_funcs.c in a > > similar way. There is a vci_supported_funcs.sql which I updated so > > it > > can be executed at master version of PostgreSQL. > > OK. I included your changes in v9 and also added some IF EXISTS so > the > .sql file can be run repeatedly without error. I also changed the > matching comment in the supported_funcs.c code where the SQL of this > file was described. Thanks! > > > I don't know if it is intentional, but safe_types list from > > vci_supported_funcs.sql have some differences to the types list > > from > > vci_supported_types.c. Specifically, it doesn't include varchar, > > varbit > > and uuid. > > Thanks for reporting this! TBH, I also expected these lists should be > the same. > > I found multiple inconsistencies: > - Some items are in safe_types table but NOT in the > vci_supported_type_table[] > - Some Items are in the vci_supported_type_table[] but are > NOT in the safe_types table > > Furthermore (and maybe partly caused by content of safe_types), the > results of running vci_support_func.sql also differs quite a lot also > from the vci_supported_func_table [] (in vci_supported_funcs.c), > which > also does not seem right to me. > - Some items are in the SQL script results but are NOT in > the > vci_supported_func_table [] > - Some Items are in the vci_supported_func_table [] but are > NOT in the SQL script results > > So, this remains an open problem. I am investigating if there is any > history/explanation of these differences. IIUC, then I hope later to > fix the safe_types, and then also fix the vci_supported_func_table[] > based on the result of that SQL script. Thanks! Evolutionary updating supported types and supported functions lists looks good to me for now. -- Regards, Timur Magomedov
On Wed, 2025-06-25 at 12:22 +1000, Peter Smith wrote: > Hi. > > Here are the latest v9* patches. These have following changes: > > 0001 > - fixes some of the minor quirks reported by Tomas [1]. > > 0002 > - fixes to added address some of Timur's feedback/patches [2] > - test code updated to remove unsupported type > - test code updated to remove dynamic costings > Hello Peter! Thanks for updates. I was trying to move some code from Postgres core patch to contrib/vci. Started with moving VCI options from StdRdOptions struct and reloptions.c, reloptions.h files into contrib/vci like Tomas suggested earlier. Getting new relopt kind using add_reloption_kind() instead of hardcoded RELOPT_KIND_VCI and so on, just like in contrib/bloom. During this work I've found that those VCI-specific options, "vci_column_ids" and "vci_dropped_column_ids", are only read in case vci_IsExtendedToMoreThan32Columns() is true. And vci_IsExtendedToMoreThan32Columns() itself checks if vci_column_ids option is non-empty string to return true. AFAIK no one sets those options and vci_IsExtendedToMoreThan32Columns() always returns false. There are comments in code mentioning that one can create VCI index with more than 32 columns using vci_create(). However, there is no such a function anywhere. I guess it is now impossible to create VCI index with more than 32 (INDEX_MAX_KEYS) columns. Maybe the options were used to store column information for indexes created with vci_create() but this function has gone. So, giving the ideas above, can we remove those vci_column_ids, vci_dropped_column_ids options and all the code that assumes vci_IsExtendedToMoreThan32Columns() to be true? This would make core patch a bit shorter, specifically there will be no changes in reloptions.{c,h} files and no changes in StdRdOptions struct in rel.h. -- Regards, Timur Magomedov
On Sat, Jun 28, 2025 at 12:24 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: ... > Thanks for updates. > I was trying to move some code from Postgres core patch to contrib/vci. > Started with moving VCI options from StdRdOptions struct and > reloptions.c, reloptions.h files into contrib/vci like Tomas suggested > earlier. Getting new relopt kind using add_reloption_kind() instead of > hardcoded RELOPT_KIND_VCI and so on, just like in contrib/bloom. > > During this work I've found that those VCI-specific options, > "vci_column_ids" and "vci_dropped_column_ids", are only read in case > vci_IsExtendedToMoreThan32Columns() is true. And > vci_IsExtendedToMoreThan32Columns() itself checks if vci_column_ids > option is non-empty string to return true. AFAIK no one sets those > options and vci_IsExtendedToMoreThan32Columns() always returns false. > > There are comments in code mentioning that one can create VCI index > with more than 32 columns using vci_create(). However, there is no such > a function anywhere. I guess it is now impossible to create VCI index > with more than 32 (INDEX_MAX_KEYS) columns. Maybe the options were used > to store column information for indexes created with vci_create() but > this function has gone. > > So, giving the ideas above, can we remove those vci_column_ids, > vci_dropped_column_ids options and all the code that assumes > vci_IsExtendedToMoreThan32Columns() to be true? This would make core > patch a bit shorter, specifically there will be no changes in > reloptions.{c,h} files and no changes in StdRdOptions struct in rel.h. > Hi Timur. Thanks for reporting this! Your idea seems correct to me. AFAICT, those relops were intended for supporting >32 vci indexes, which is only possible when the VCI index is created using “vci_create()” function, but that function does not exist in the VCI implementation posted to OSS. So, I agree with you – since those relops are not currently needed they can just be removed. I will confirm this understanding with the original patch devs and, if correct, I will address this in a future patch version. ====== Kind Regards, Peter Smith. Fujitsu Australia.
Hi Tomas-san, > > - Divide the patch more and more > > - to allow committers to consider pros and cons and push one by one > > - 15 patches is the maximum amount > > - Separate features to some committable group > > > > Yes. Both parts need to be divided into small (but meaningful) pieces. I > don't recall where the "15 patches" came from, I don't think there's > some sort of maximum number of patches. All I care about is that the > patches need to be small-ish, and still make sense. Ideally, each part > should be committable on it's own. This makes it easier to test/review, > but also easier to agree on individual parts - in which case we may > commit the "preparatory" pieces and not have to rebase them forever. Yes, we are aware the patch 0002 is far too big and are taking steps to try to address that, firstly by removing any "dead" code, and secondly by extracting out any "optional" components. This is an ongoing task. If you have any other ideas how to split the patch please let us know. > relation_open > ------------------ > - Why is this change (disabling assert enforcing proper locking) needed? > There's a comment in lock.c which mentions VCI parallel workers, but > it's suspicious. This comment is removed in the current posted patches. > reloptions.c > ------------------ > - Changes that should be moved to the contrib module Why should in-core > reloptions know about this? See for example how "contrib/bloom" defined > reloptions. Thanks. We are studying how to replace this with an extensible solution using "contrib/bloom" as guidance. Also, as Timur (28/Jun) suggests that this might be unused code anyway. Investigating... > heapam.c > ------------------ > - add_index_delete_hook, not sure I understand the purpose of this, > likely related to the consistency model, needs explanation. Why is it > prefixed with "add_"? Correct, this is related to maintaining consistency for the WOS when a delete occurs. The "add_" prefix has no special meaning. All of the VCI ad-hoc hooks have a prefix of "add_" > - Furthermore, if we need this, why should it be a hook and not a new > callback in IndexAmRoutine, just like aminsert()? Agree. We intend later to implement this ad-hoc hook as a callback in IndexAmRoutine as suggested. > vacuum.c > ------------------ > - Why. And what is "internal table"? Please refer to the contrib/vci/README section 2.4 for a description of VCI "Internal Relations" > execGrouping.c > ------------------ > - LookupTupleHashEntry_internal removes comment, why? The latest post patches do not have this code. > execUtils.c > ------------------ > - again, disables locking check, very suspicious Fixed in latest posted patches. > nodeModifytable.c > ------------------ > - add_should_index_insert_hook another strange hook Agree. We intend later to implement this ad-hoc hook as a callback in IndexAmRoutine as suggested. > mctx.c > ------------------ > - why MemoryContextMethods shouldn't be "const"? Fixed in latest posted patches. > mctx.c > ------------------ > - why MemoryContextMethods shouldn't be "const"? > > > common.c > pg_dump.c > pg_dump.h > ------------------ > > - What is "vci view table" for, actually? Fixed in latest posted patches. > reloptions.h > ------------------ > - Why does this need to add an entry into the relopt_kind enum? It > should be extensible, see the "contrib/bloom". Thanks. We are studying how to replace this with an extensible solution using "contrib/bloom" as guidance. (Timur suggests that this also unused code anyway. Investigating...) > executor.h > ------------------ > - How come it's suddenly OK to return entry->additional without the if > conditions in TupleHashEntryGetAdditional? This was an accidental revert of a master commit also reported by Timur. The latest posted patches no longer have this change. > > - What's with the struct LimitState; ? Why is it needed? Fixed in latest posted patches. > numeric.h > ------------------ > - Why does this expose NumericVar, when the second patch redefines the > struct anyway? also, isn't that fragile? although, we're unlikely to > tweak the NumericVar definition. Fixed in latest posted patches. > rel.h > ------------------ > - Doesn't AM have custom reloptions now? Like contrib/bloom? Thanks. We are studying how to replace this StdRdOptions change with an extensible solution using "contrib/bloom" as guidance. (Timur suggests that this also unused code anyway. Investigating...) Regards, Aya Iwata Fujitsu Limited
On Tue, Jul 1, 2025 at 1:17 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sat, Jun 28, 2025 at 12:24 AM Timur Magomedov > <t.magomedov@postgrespro.ru> wrote: > ... > > Thanks for updates. > > I was trying to move some code from Postgres core patch to contrib/vci. > > Started with moving VCI options from StdRdOptions struct and > > reloptions.c, reloptions.h files into contrib/vci like Tomas suggested > > earlier. Getting new relopt kind using add_reloption_kind() instead of > > hardcoded RELOPT_KIND_VCI and so on, just like in contrib/bloom. > > > > During this work I've found that those VCI-specific options, > > "vci_column_ids" and "vci_dropped_column_ids", are only read in case > > vci_IsExtendedToMoreThan32Columns() is true. And > > vci_IsExtendedToMoreThan32Columns() itself checks if vci_column_ids > > option is non-empty string to return true. AFAIK no one sets those > > options and vci_IsExtendedToMoreThan32Columns() always returns false. > > > > There are comments in code mentioning that one can create VCI index > > with more than 32 columns using vci_create(). However, there is no such > > a function anywhere. I guess it is now impossible to create VCI index > > with more than 32 (INDEX_MAX_KEYS) columns. Maybe the options were used > > to store column information for indexes created with vci_create() but > > this function has gone. > > > > So, giving the ideas above, can we remove those vci_column_ids, > > vci_dropped_column_ids options and all the code that assumes > > vci_IsExtendedToMoreThan32Columns() to be true? This would make core > > patch a bit shorter, specifically there will be no changes in > > reloptions.{c,h} files and no changes in StdRdOptions struct in rel.h. > > > > Hi Timur. > > Thanks for reporting this! Your idea seems correct to me. > > AFAICT, those relops were intended for supporting >32 vci indexes, > which is only possible when the VCI index is created using > “vci_create()” function, but that function does not exist in the VCI > implementation posted to OSS. So, I agree with you – since those > relops are not currently needed they can just be removed. I will > confirm this understanding with the original patch devs and, if > correct, I will address this in a future patch version. > Fixed in v10. ====== Kind Regards, Peter Smith. Fujitsu Australia.
On Wed, Jul 2, 2025 at 10:22 AM Aya Iwata (Fujitsu) <iwata.aya@fujitsu.com> wrote: > ... > > > reloptions.c > > ------------------ > > - Changes that should be moved to the contrib module Why should in-core > > reloptions know about this? See for example how "contrib/bloom" defined > > reloptions. > > Thanks. We are studying how to replace this with an extensible solution > using "contrib/bloom" as guidance. Also, as Timur (28/Jun) suggests that > this might be unused code anyway. Investigating... Fixed (removed) in v10. > > > reloptions.h > > ------------------ > > - Why does this need to add an entry into the relopt_kind enum? It > > should be extensible, see the "contrib/bloom". > > Thanks. We are studying how to replace this with an extensible > solution using "contrib/bloom" as guidance. > (Timur suggests that this also unused code anyway. Investigating...) Fixed (removed) in v10. > > rel.h > > ------------------ > > > - Doesn't AM have custom reloptions now? Like contrib/bloom? > > Thanks. We are studying how to replace this StdRdOptions change > with an extensible solution using "contrib/bloom" as guidance. > (Timur suggests that this also unused code anyway. Investigating...) > Fixed (removed) in v10. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Jul 4, 2025 at 4:03 PM Japin Li <japinli@hotmail.com> wrote: ... > When trying vic, I received the following error: > > TRAP: failed Assert("TransactionIdIsNormal(xmax)"), File: "/data/Codes/pg/master/build/../src/backend/access/heap/heapam.c",Line: 7373, PID: 1719347 > postgres: japin postgres [local] VACUUM(ExceptionalCondition+0xbb)[0x562674fbc44a] > postgres: japin postgres [local] VACUUM(heap_pre_freeze_checks+0x18f)[0x5626747eba07] > /data/Codes/pg/master/build/pg/lib/postgresql/vci.so(+0x61fca)[0xd355ccbbfca] > /data/Codes/pg/master/build/pg/lib/postgresql/vci.so(+0x6288f)[0xd355ccbc88f] > /data/Codes/pg/master/build/pg/lib/postgresql/vci.so(+0x51673)[0xd355ccab673] > /data/Codes/pg/master/build/pg/lib/postgresql/vci.so(+0x4ec41)[0xd355cca8c41] > postgres: japin postgres [local] VACUUM(index_vacuum_cleanup+0x181)[0x562674810008] > postgres: japin postgres [local] VACUUM(vac_cleanup_one_index+0x27)[0x562674a78dbf] > postgres: japin postgres [local] VACUUM(+0x1bb73b)[0x56267480973b] > postgres: japin postgres [local] VACUUM(+0x1bb489)[0x562674809489] > postgres: japin postgres [local] VACUUM(+0x1b8e64)[0x562674806e64] > postgres: japin postgres [local] VACUUM(heap_vacuum_rel+0x8de)[0x56267480578b] > postgres: japin postgres [local] VACUUM(+0x426ac9)[0x562674a74ac9] > postgres: japin postgres [local] VACUUM(+0x42a5cf)[0x562674a785cf] > postgres: japin postgres [local] VACUUM(vacuum+0x49c)[0x562674a75ec1] > postgres: japin postgres [local] VACUUM(ExecVacuum+0xdf3)[0x562674a759f7] > postgres: japin postgres [local] VACUUM(standard_ProcessUtility+0xa62)[0x562674dac27e] > /data/Codes/pg/master/build/pg/lib/postgresql/vci.so(+0x52c7e)[0xd355ccacc7e] > postgres: japin postgres [local] VACUUM(ProcessUtility+0x109)[0x562674dab7e4] > postgres: japin postgres [local] VACUUM(+0x75c0b7)[0x562674daa0b7] > postgres: japin postgres [local] VACUUM(+0x75c32f)[0x562674daa32f] > postgres: japin postgres [local] VACUUM(PortalRun+0x31d)[0x562674da978f] > postgres: japin postgres [local] VACUUM(+0x753dbe)[0x562674da1dbe] > postgres: japin postgres [local] VACUUM(PostgresMain+0xb43)[0x562674da7603] > postgres: japin postgres [local] VACUUM(+0x74f1da)[0x562674d9d1da] > postgres: japin postgres [local] VACUUM(postmaster_child_launch+0x1b1)[0x562674c97b05] > postgres: japin postgres [local] VACUUM(+0x650859)[0x562674c9e859] > postgres: japin postgres [local] VACUUM(+0x64dd28)[0x562674c9bd28] > postgres: japin postgres [local] VACUUM(PostmasterMain+0x1546)[0x562674c9b5b0] > postgres: japin postgres [local] VACUUM(main+0x38c)[0x562674b359e6] > /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0xd355b82a1ca] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0xd355b82a28b] > postgres: japin postgres [local] VACUUM(_start+0x25)[0x562674769e15] > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > connection to server was lost > > Here's how to reproduce the issue: > > initdb -D demo > cat <<EOF >demo/postgresql.auto.conf > shared_preload_libraries = 'vci' > max_worker_processes = '20' > EOF > > pg_ctl -D demo start > > cat <<EOF | psql postgres > CREATE EXTENSION vci; > CREATE TABLE t (id int, info text); > CREATE INDEX ON t USING vci (id); > INSERT INTO t SELECT id, md5(random()::text) FROM generate_series(1, 1000) id; > VACUUM t; > EOF > Hi Japin, Thank you for your interest in VCI and for reporting the problem. We are working to get this patch into a better shape; this vaccum issue has been added to our list of things to fix. Thanks also for your script - using it, I reproduced the same TRAP that you reported. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Sat, May 24, 2025 at 7:29 AM Tomas Vondra <tomas@vondra.me> wrote: > > Hello Iwata-san, > ... > autovacuum.c > ------------------ > - Why does this need a custom autovacuum launcher, or what's the purpose > of this code? is there a special autovacuum launcher for VCI indexes? > > - Again, explicit references to "VCI", should not be done like this. Fixed in v11-0001 > autovacuum.h > ------------------ > - It seems a bit strage to add "built-in" bgworkers for an extension. If > the extension really needs some sort of workers, it should be defined in > the extension. > Fixed in v11-0001 ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi. Here is the latest patch set v12* Main differences are: Patch 0001 (core) - removed SizeOfIptrData macro, as reported by Tomas [1] and Japin [2] Patch 0002 (vci module) - Made fixes so the "ROS Control Worker" (for background WOS->ROS transfer) can now launch ok. ====== [1] Tomas - https://www.postgresql.org/message-id/a748aa6b-c7e6-4d02-a590-ab404d590448%40vondra.me [2] Japin - https://www.postgresql.org/message-id/ME0P300MB04457E24CA8965F008FB2CDBB648A%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia.
Вложения
Hi. Here are the latest v13 patches. Changes include: PATCH 0002. - README improvements -- as previously sent separately [1] - Refactor InitPageCoreWithoutLock -- per proposal from Japin [2-#2] - Changes to eliminate warnings from "headerscheck" and "cpluspluscheck" - Add missing assignments in vci_handler() ====== [1] https://www.postgresql.org/message-id/CAHut%2BPvYQZAHcD-tK5XaobUpWoTf0Gkjx7nAA9eJq_HbPCSxCQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/ME0P300MB0445FD473D75F65E8B0A6F5DB64BA%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Hi. Here are the latest v14 patches. Changes include: PATCH 0002. - Fixes the enable_seqscan PANIC bug reported by Japin [1] - Adds a new regression test case for this ====== [1] https://www.postgresql.org/message-id/ME0P300MB04457E24CA8965F008FB2CDBB648A%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v15 patches. Changes include: PATCH 0002. - README now says user should not tamper with VCI internal relations - fixes/test the VACUUM bug -- fix provided by Japin [1] - fixes/tests the reported segv for attempted REFRESH of VCI internal relation -- see [2 comment#1] - fixes/tests VCI internal relation dependency on the indexed table - simplifies code for PG_TEMP_FILES_DIR -- see [2 comment#2] ====== [1] https://www.postgresql.org/message-id/ME0P300MB0445891C69BD82561055F218B65FA%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM [2] https://www.postgresql.org/message-id/ME0P300MB0445EBA04D6947DD717074DFB65CA%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia.
Вложения
Here are the latest v16 patches. Changes include: PATCH 0001. - REINDEX bugfix -- fix provided by Japin [1] PATCH 0002. - REINDEX bugfix test cases - per [1] - README now list all the internal relations -- per [2] - Query of VCI internal relation no longer causes confusing HINT -- per [3] - Renamed the VCI internal relations to have a "pg_" prefix ====== [1] https://www.postgresql.org/message-id/ME0P300MB04453BEE52F84048683460E4B627A%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM [2] https://www.postgresql.org/message-id/CAHut%2BPt8naGc7pH0YG_0G8Wu5aqJiHoT6xP%2BY81_eJWapg9%3DDA%40mail.gmail.com [3] https://www.postgresql.org/message-id/ME0P300MB0445307958A2DC0831CEF56DB624A%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v17 patches. Changes include: PATCH 0002. - Rebase to fix compile error, result of recent master change - Bugfix for an unreported EXPLAIN ANALYZE error - Bugfix for an unreported double pfree - Code cleanup (ran pgindent; corrected ~100 typos; removed empty coverage comments) ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
On Fri, Aug 08, 2025 at 06:08:27PM +1000, Peter Smith wrote: > Here are the latest v17 patches. > > Changes include: > > PATCH 0002. > - Rebase to fix compile error, result of recent master change > - Bugfix for an unreported EXPLAIN ANALYZE error > - Bugfix for an unreported double pfree > - Code cleanup (ran pgindent; corrected ~100 typos; removed empty > coverage comments) > 1. +static struct +{ + int transfn_oid; /* Transition function's funcoid. Arrays are + * sorted in ascending order */ + Oid transtype; /* Transition data type */ + PGFunction merge_trans; /* Function pointer set required for parallel + * aggregation for each transfn_oid */ + vci_aggtranstype_kind kind; /* If transtype is INTERNALOID, its details */ +} trans_funcs_table[] = { + {F_FLOAT4_ACCUM, 1022, merge_floatX_accum, VCI_AGG_NOT_INTERNAL}, /* 208 */ + {F_FLOAT8_ACCUM, 1022, merge_floatX_accum, VCI_AGG_NOT_INTERNAL}, /* 222 */ + {F_INT8INC, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1833 */ + {F_NUMERIC_ACCUM, 2281, numeric_combine, VCI_AGG_NUMERIC_AGG_STATE}, /* 1834 */ + {F_INT2_ACCUM, 2281, numeric_poly_combine, VCI_AGG_POLY_NUM_AGG_STATE}, /* 1836 */ + {F_INT4_ACCUM, 2281, numeric_poly_combine, VCI_AGG_POLY_NUM_AGG_STATE}, /* 1835 */ + {F_INT8_ACCUM, 2281, numeric_combine, VCI_AGG_NUMERIC_AGG_STATE}, /* 1836 */ + {F_INT2_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1840 */ + {F_INT4_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1841 */ + {F_INTERVAL_AVG_COMBINE, 2281, merge_interval_avg_accum, VCI_AGG_NOT_INTERNAL}, /* 3325 */ + {F_INT2_AVG_ACCUM, 1016, merge_intX_accum, VCI_AGG_NOT_INTERNAL}, /* 1962 */ + {F_INT4_AVG_ACCUM, 1016, merge_intX_accum, VCI_AGG_NOT_INTERNAL}, /* 1963 */ + {F_INT8INC_ANY, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 2804 */ + {F_INT8_AVG_ACCUM, 2281, int8_avg_combine, VCI_AGG_POLY_AVG_NUM_AGG_STATE}, /* 2746 */ + {F_NUMERIC_AVG_ACCUM, 2281, numeric_avg_combine, VCI_AGG_AVG_NUMERIC_AGG_STATE}, /* 2858 */ +}; The comments state that this is sorted in ascending order, but the code doesn't follow that rule. While the current linear search works, a future change to binary search could cause problems. 2. +static void +CheckIndexedRelationKind(Relation rel) +{ + char relKind = get_rel_relkind(RelationGetRelid(rel)); + + if (relKind == RELKIND_MATVIEW) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("access method \"%s\" does not support index on materialized view", VCI_STRING))); + + if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("access method \"%s\" does not support index on temporary table", VCI_STRING))); +} Would it be possible to use rel->rd_rel->relkind directly? This might avoid the overhead of a function call. 3. The discussion on add_index_delete_hook [1] makes me wonder if an Index Access Method callback could serve the same purpose. This also raises the question: would we then need an index update callback as well? 3. Here are some typos. a) @@ -475,7 +477,7 @@ vci_scan_EndCustomPlan(CustomScanState *node) default: /* LCOV_EXCL_START */ - elog(PANIC, "Should not reach hare"); + elog(PANIC, "Should not reach here"); /* LCOV_EXCL_STOP */ break; } b) @@ -543,7 +545,7 @@ vci_create_relation(const char *rel_identifier, Relation indexRel, IndexInfo *in TupleDescInitEntry(new_tupdesc, (AttrNumber) 1, "bindata", BYTEAOID, -1, 0); /* */ break; - /* TIC-CRID */ + /* TID-CRID */ case VCI_RELTYPE_TIDCRID: natts = 1; new_tupdesc = CreateTemplateTupleDesc(natts); /* no Oid */ c) @@ -1065,7 +1065,7 @@ vci_inner_build(Relation heapRel, Relation indexRel, IndexInfo *indexInfo) /* * Put or Copy page into INIT_FORK. * If valid page is given, that page will be putted into INIT_FORK. - * If Invalid page (NULL pointer) is given, MAIN_FORK page well be copied. + * If invalid page (NULL pointer) is given, MAIN_FORK page well be copied. */ static void vci_putInitPage(Oid oid, Page page, BlockNumber blkno) [1] https://www.postgresql.org/message-id/OS7PR01MB119642862AA1CE536549D08CFEA40A%40OS7PR01MB11964.jpnprd01.prod.outlook.com -- Best regards, Japin Li ChengDu WenWu Information Technology Co., LTD.
Вложения
On Mon, Aug 11, 2025 at 05:39:01PM +0800, Japin Li wrote: > On Fri, Aug 08, 2025 at 06:08:27PM +1000, Peter Smith wrote: > > Here are the latest v17 patches. > > > > Changes include: > > > > PATCH 0002. > > - Rebase to fix compile error, result of recent master change > > - Bugfix for an unreported EXPLAIN ANALYZE error > > - Bugfix for an unreported double pfree > > - Code cleanup (ran pgindent; corrected ~100 typos; removed empty > > coverage comments) > > > 3. > Here are some typos. > > a) > @@ -475,7 +477,7 @@ vci_scan_EndCustomPlan(CustomScanState *node) > > default: > /* LCOV_EXCL_START */ > - elog(PANIC, "Should not reach hare"); > + elog(PANIC, "Should not reach here"); > /* LCOV_EXCL_STOP */ > break; > } > b) > @@ -543,7 +545,7 @@ vci_create_relation(const char *rel_identifier, Relation indexRel, IndexInfo *in > TupleDescInitEntry(new_tupdesc, (AttrNumber) 1, "bindata", BYTEAOID, -1, 0); /* */ > break; > > - /* TIC-CRID */ > + /* TID-CRID */ > case VCI_RELTYPE_TIDCRID: > natts = 1; > new_tupdesc = CreateTemplateTupleDesc(natts); /* no Oid */ > > c) > @@ -1065,7 +1065,7 @@ vci_inner_build(Relation heapRel, Relation indexRel, IndexInfo *indexInfo) > /* > * Put or Copy page into INIT_FORK. > * If valid page is given, that page will be putted into INIT_FORK. > - * If Invalid page (NULL pointer) is given, MAIN_FORK page well be copied. > + * If invalid page (NULL pointer) is given, MAIN_FORK page well be copied. > */ > static void > vci_putInitPage(Oid oid, Page page, BlockNumber blkno) > > 1. PFA the other typos. 2. I found it skip vci query context initialization in vci_intialize_query_context() if full page writes is disabled, Could you explain why we need full page write enabled for VCI? 3. Both vci_ros.h and vci_ros.c have a comment about accessing the VCI main relation header, but they are slightly different. Could we sync them and keep only one? It seems the comment is outdated, as the functions vci_KeepReadingMainRelHeader() and vci_KeepWritingMainRelHeader() do not exist in the current VCI implementation. 4. +/** + * This function is assumed when the VCI index is newly built, and + * it converts all the data in the relation of PostgreSQL into ROS. + */ +uint64 +vci_ConvertWos2RosForBuild(Relation mainRel, + Size workareaSize, + IndexInfo *indexInfo) ... + result = (uint64) table_index_build_scan(comContext.heapRel, + mainRel, + indexInfo, + true, /* allow syncscan */ + true, + vci_build_callback, + (void *) &comContext, NULL) Perhaps we can use a double return type to avoid type casting since other places also use double. -- Best regards, Japin Li ChengDu WenWu Information Technology Co., LTD.
Вложения
On Mon, 2025-08-11 at 17:39 +0800, Japin Li wrote: > > 1. > +static struct > +{ > + int transfn_oid; /* Transition function's funcoid. > Arrays are > + * sorted in ascending order */ > + Oid transtype; /* Transition data type */ > + PGFunction merge_trans; /* Function pointer set required for > parallel > + * aggregation for each transfn_oid > */ > + vci_aggtranstype_kind kind; /* If transtype is INTERNALOID, its > details */ > +} trans_funcs_table[] = { > + {F_FLOAT4_ACCUM, 1022, merge_floatX_accum, > VCI_AGG_NOT_INTERNAL}, /* 208 */ > + {F_FLOAT8_ACCUM, 1022, merge_floatX_accum, > VCI_AGG_NOT_INTERNAL}, /* 222 */ > + {F_INT8INC, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1833 */ > + {F_NUMERIC_ACCUM, 2281, numeric_combine, > VCI_AGG_NUMERIC_AGG_STATE}, /* 1834 */ > + {F_INT2_ACCUM, 2281, numeric_poly_combine, > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1836 */ > + {F_INT4_ACCUM, 2281, numeric_poly_combine, > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1835 */ > + {F_INT8_ACCUM, 2281, numeric_combine, > VCI_AGG_NUMERIC_AGG_STATE}, /* 1836 */ > + {F_INT2_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1840 */ > + {F_INT4_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1841 */ > + {F_INTERVAL_AVG_COMBINE, 2281, merge_interval_avg_accum, > VCI_AGG_NOT_INTERNAL}, /* 3325 */ > + {F_INT2_AVG_ACCUM, 1016, merge_intX_accum, > VCI_AGG_NOT_INTERNAL}, /* 1962 */ > + {F_INT4_AVG_ACCUM, 1016, merge_intX_accum, > VCI_AGG_NOT_INTERNAL}, /* 1963 */ > + {F_INT8INC_ANY, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 2804 */ > + {F_INT8_AVG_ACCUM, 2281, int8_avg_combine, > VCI_AGG_POLY_AVG_NUM_AGG_STATE}, /* 2746 */ > + {F_NUMERIC_AVG_ACCUM, 2281, numeric_avg_combine, > VCI_AGG_AVG_NUMERIC_AGG_STATE}, /* 2858 */ > +}; > > The comments state that this is sorted in ascending order, but the > code doesn't > follow that rule. While the current linear search works, a future > change to > binary search could cause problems. > Hi Japin! I've looked at the code, vci_set_merge_and_copy_trans_funcs() function is unused and almost all code of vci_aggmergetranstype.c file including trans_funcs_table[] can be either removed either replaced with simple switch-case. Only transfn_oid fields of trans_funcs_table[] were actually used. Here is my patch made on top of v17. Peter, what do you think? Is it OK to remove those code? -- Regards, Timur Magomedov
Вложения
Here are the latest v18* patches. Changes include: PATCH 0002. Cleaning: - Fix typos (per Japin patches [1] and [2]) - Remove all #if 0 code - Make all header comments more consistent - Cleanup Doxygen annotation formatting - Remove all double blank lines Fixes: - trans_funcs_table[] ordering (per Japin patch [1]) - Access relkind via member instead of function calls (per Japin patch [1]) - Change vci_ConvertWos2RosForBuild return type to reduce casts (per Japin [2]) ====== [1] Japin 11/8. https://www.postgresql.org/message-id/ME0P300MB04457AAC931AD1E3D0CE32FBB628A%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM [2] Japin 12/8. https://www.postgresql.org/message-id/ME0P300MB04450DF54484C145B77BD0D8B62BA%40ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM Kind Regards, Peter Smith. Fujitsu Australia
Вложения
On Mon, Aug 11, 2025 at 7:39 PM Japin Li <japinli@hotmail.com> wrote: > > On Fri, Aug 08, 2025 at 06:08:27PM +1000, Peter Smith wrote: > > Here are the latest v17 patches. > > > > Changes include: > > > > PATCH 0002. > > - Rebase to fix compile error, result of recent master change > > - Bugfix for an unreported EXPLAIN ANALYZE error > > - Bugfix for an unreported double pfree > > - Code cleanup (ran pgindent; corrected ~100 typos; removed empty > > coverage comments) > > > > 1. > +static struct > +{ > + int transfn_oid; /* Transition function's funcoid. Arrays are > + * sorted in ascending order */ > + Oid transtype; /* Transition data type */ > + PGFunction merge_trans; /* Function pointer set required for parallel > + * aggregation for each transfn_oid */ > + vci_aggtranstype_kind kind; /* If transtype is INTERNALOID, its details */ > +} trans_funcs_table[] = { > + {F_FLOAT4_ACCUM, 1022, merge_floatX_accum, VCI_AGG_NOT_INTERNAL}, /* 208 */ > + {F_FLOAT8_ACCUM, 1022, merge_floatX_accum, VCI_AGG_NOT_INTERNAL}, /* 222 */ > + {F_INT8INC, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1833 */ > + {F_NUMERIC_ACCUM, 2281, numeric_combine, VCI_AGG_NUMERIC_AGG_STATE}, /* 1834 */ > + {F_INT2_ACCUM, 2281, numeric_poly_combine, VCI_AGG_POLY_NUM_AGG_STATE}, /* 1836 */ > + {F_INT4_ACCUM, 2281, numeric_poly_combine, VCI_AGG_POLY_NUM_AGG_STATE}, /* 1835 */ > + {F_INT8_ACCUM, 2281, numeric_combine, VCI_AGG_NUMERIC_AGG_STATE}, /* 1836 */ > + {F_INT2_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1840 */ > + {F_INT4_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1841 */ > + {F_INTERVAL_AVG_COMBINE, 2281, merge_interval_avg_accum, VCI_AGG_NOT_INTERNAL}, /* 3325 */ > + {F_INT2_AVG_ACCUM, 1016, merge_intX_accum, VCI_AGG_NOT_INTERNAL}, /* 1962 */ > + {F_INT4_AVG_ACCUM, 1016, merge_intX_accum, VCI_AGG_NOT_INTERNAL}, /* 1963 */ > + {F_INT8INC_ANY, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 2804 */ > + {F_INT8_AVG_ACCUM, 2281, int8_avg_combine, VCI_AGG_POLY_AVG_NUM_AGG_STATE}, /* 2746 */ > + {F_NUMERIC_AVG_ACCUM, 2281, numeric_avg_combine, VCI_AGG_AVG_NUMERIC_AGG_STATE}, /* 2858 */ > +}; > > The comments state that this is sorted in ascending order, but the code doesn't > follow that rule. While the current linear search works, a future change to > binary search could cause problems. Fixed in v18. > > 2. > +static void > +CheckIndexedRelationKind(Relation rel) > +{ > + char relKind = get_rel_relkind(RelationGetRelid(rel)); > + > + if (relKind == RELKIND_MATVIEW) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("access method \"%s\" does not support index on materialized view", VCI_STRING))); > + > + if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("access method \"%s\" does not support index on temporary table", VCI_STRING))); > +} > > Would it be possible to use rel->rd_rel->relkind directly? This might avoid > the overhead of a function call. > Fixed in v18. > 3. > The discussion on add_index_delete_hook [1] makes me wonder if an Index Access > Method callback could serve the same purpose. This also raises the question: > would we then need an index update callback as well? > Yeah, the README "3.3.1 Ad-hoc hooks" already says that the plan is to try to see if we can convert these ad-hoc VCI hooks to instead use IAM callback methods wherever possible. > 3. > Here are some typos. > > a) > @@ -475,7 +477,7 @@ vci_scan_EndCustomPlan(CustomScanState *node) > > default: > /* LCOV_EXCL_START */ > - elog(PANIC, "Should not reach hare"); > + elog(PANIC, "Should not reach here"); > /* LCOV_EXCL_STOP */ > break; > } > b) > @@ -543,7 +545,7 @@ vci_create_relation(const char *rel_identifier, Relation indexRel, IndexInfo *in > TupleDescInitEntry(new_tupdesc, (AttrNumber) 1, "bindata", BYTEAOID, -1, 0); /* */ > break; > > - /* TIC-CRID */ > + /* TID-CRID */ > case VCI_RELTYPE_TIDCRID: > natts = 1; > new_tupdesc = CreateTemplateTupleDesc(natts); /* no Oid */ > > c) > @@ -1065,7 +1065,7 @@ vci_inner_build(Relation heapRel, Relation indexRel, IndexInfo *indexInfo) > /* > * Put or Copy page into INIT_FORK. > * If valid page is given, that page will be putted into INIT_FORK. > - * If Invalid page (NULL pointer) is given, MAIN_FORK page well be copied. > + * If invalid page (NULL pointer) is given, MAIN_FORK page well be copied. > */ > static void > vci_putInitPage(Oid oid, Page page, BlockNumber blkno) > Fixed in v18. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Aug 12, 2025 at 1:48 PM Japin Li <japinli@hotmail.com> wrote: ... > 1. > PFA the other typos. > Fixed in v18. > 2. > I found it skip vci query context initialization in vci_intialize_query_context() > if full page writes is disabled, Could you explain why we need full page write > enabled for VCI? > TODO > 3. > Both vci_ros.h and vci_ros.c have a comment about accessing the VCI main relation > header, but they are slightly different. Could we sync them and keep only one? > > It seems the comment is outdated, as the functions vci_KeepReadingMainRelHeader() > and vci_KeepWritingMainRelHeader() do not exist in the current VCI implementation. > TODO. I agree that there should be only one comment, and outdated function references need to be fixed. > 4. > +/** > + * This function is assumed when the VCI index is newly built, and > + * it converts all the data in the relation of PostgreSQL into ROS. > + */ > +uint64 > +vci_ConvertWos2RosForBuild(Relation mainRel, > + Size workareaSize, > + IndexInfo *indexInfo) > ... > + result = (uint64) table_index_build_scan(comContext.heapRel, > + mainRel, > + indexInfo, > + true, /* allow syncscan */ > + true, > + vci_build_callback, > + (void *) &comContext, NULL) > > Perhaps we can use a double return type to avoid type casting since > other places also use double. Fixed in v18. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Aug 14, 2025 at 2:28 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: > > On Mon, 2025-08-11 at 17:39 +0800, Japin Li wrote: > > > > 1. > > +static struct > > +{ > > + int transfn_oid; /* Transition function's funcoid. > > Arrays are > > + * sorted in ascending order */ > > + Oid transtype; /* Transition data type */ > > + PGFunction merge_trans; /* Function pointer set required for > > parallel > > + * aggregation for each transfn_oid > > */ > > + vci_aggtranstype_kind kind; /* If transtype is INTERNALOID, its > > details */ > > +} trans_funcs_table[] = { > > + {F_FLOAT4_ACCUM, 1022, merge_floatX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 208 */ > > + {F_FLOAT8_ACCUM, 1022, merge_floatX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 222 */ > > + {F_INT8INC, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1833 */ > > + {F_NUMERIC_ACCUM, 2281, numeric_combine, > > VCI_AGG_NUMERIC_AGG_STATE}, /* 1834 */ > > + {F_INT2_ACCUM, 2281, numeric_poly_combine, > > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1836 */ > > + {F_INT4_ACCUM, 2281, numeric_poly_combine, > > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1835 */ > > + {F_INT8_ACCUM, 2281, numeric_combine, > > VCI_AGG_NUMERIC_AGG_STATE}, /* 1836 */ > > + {F_INT2_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1840 */ > > + {F_INT4_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1841 */ > > + {F_INTERVAL_AVG_COMBINE, 2281, merge_interval_avg_accum, > > VCI_AGG_NOT_INTERNAL}, /* 3325 */ > > + {F_INT2_AVG_ACCUM, 1016, merge_intX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 1962 */ > > + {F_INT4_AVG_ACCUM, 1016, merge_intX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 1963 */ > > + {F_INT8INC_ANY, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 2804 */ > > + {F_INT8_AVG_ACCUM, 2281, int8_avg_combine, > > VCI_AGG_POLY_AVG_NUM_AGG_STATE}, /* 2746 */ > > + {F_NUMERIC_AVG_ACCUM, 2281, numeric_avg_combine, > > VCI_AGG_AVG_NUMERIC_AGG_STATE}, /* 2858 */ > > +}; > > > > The comments state that this is sorted in ascending order, but the > > code doesn't > > follow that rule. While the current linear search works, a future > > change to > > binary search could cause problems. > > > > Hi Japin! > I've looked at the code, vci_set_merge_and_copy_trans_funcs() function > is unused and almost all code of vci_aggmergetranstype.c file including > trans_funcs_table[] can be either removed either replaced with simple > switch-case. Only transfn_oid fields of trans_funcs_table[] were > actually used. Here is my patch made on top of v17. > > Peter, what do you think? Is it OK to remove those code? > Hi Timur. Thanks for the suggestion/patch. I had already prepared the v18 patches using Japin's reordering of that array; I'll study your switch/removal patch for possible inclusion in v19 (next week?). ====== Kind Regards, Peter Smith. Fujitsu Australia.
On Wed, Aug 13, 2025 at 07:28:34PM +0300, Timur Magomedov wrote: > On Mon, 2025-08-11 at 17:39 +0800, Japin Li wrote: > > > > 1. > > +static struct > > +{ > > + int transfn_oid; /* Transition function's funcoid. > > Arrays are > > + * sorted in ascending order */ > > + Oid transtype; /* Transition data type */ > > + PGFunction merge_trans; /* Function pointer set required for > > parallel > > + * aggregation for each transfn_oid > > */ > > + vci_aggtranstype_kind kind; /* If transtype is INTERNALOID, its > > details */ > > +} trans_funcs_table[] = { > > + {F_FLOAT4_ACCUM, 1022, merge_floatX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 208 */ > > + {F_FLOAT8_ACCUM, 1022, merge_floatX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 222 */ > > + {F_INT8INC, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1833 */ > > + {F_NUMERIC_ACCUM, 2281, numeric_combine, > > VCI_AGG_NUMERIC_AGG_STATE}, /* 1834 */ > > + {F_INT2_ACCUM, 2281, numeric_poly_combine, > > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1836 */ > > + {F_INT4_ACCUM, 2281, numeric_poly_combine, > > VCI_AGG_POLY_NUM_AGG_STATE}, /* 1835 */ > > + {F_INT8_ACCUM, 2281, numeric_combine, > > VCI_AGG_NUMERIC_AGG_STATE}, /* 1836 */ > > + {F_INT2_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1840 */ > > + {F_INT4_SUM, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 1841 */ > > + {F_INTERVAL_AVG_COMBINE, 2281, merge_interval_avg_accum, > > VCI_AGG_NOT_INTERNAL}, /* 3325 */ > > + {F_INT2_AVG_ACCUM, 1016, merge_intX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 1962 */ > > + {F_INT4_AVG_ACCUM, 1016, merge_intX_accum, > > VCI_AGG_NOT_INTERNAL}, /* 1963 */ > > + {F_INT8INC_ANY, 20, int8pl, VCI_AGG_NOT_INTERNAL}, /* 2804 */ > > + {F_INT8_AVG_ACCUM, 2281, int8_avg_combine, > > VCI_AGG_POLY_AVG_NUM_AGG_STATE}, /* 2746 */ > > + {F_NUMERIC_AVG_ACCUM, 2281, numeric_avg_combine, > > VCI_AGG_AVG_NUMERIC_AGG_STATE}, /* 2858 */ > > +}; > > > > The comments state that this is sorted in ascending order, but the > > code doesn't > > follow that rule. While the current linear search works, a future > > change to > > binary search could cause problems. > > > > Hi Japin! > I've looked at the code, vci_set_merge_and_copy_trans_funcs() function > is unused and almost all code of vci_aggmergetranstype.c file including > trans_funcs_table[] can be either removed either replaced with simple > switch-case. Only transfn_oid fields of trans_funcs_table[] were > actually used. Here is my patch made on top of v17. > Yeah. The code isn't used in these patches, so it's a good idea to remove it. -- Best regards, Japin Li ChengDu WenWu Information Technology Co., LTD.
On Thu, 2025-08-14 at 11:23 +1000, Peter Smith wrote: > Here are the latest v18* patches. > Hi Peter! I've reworked my recent patch [1] so it is now based on v18 and is divided into several simpler patches. Here they are plus one additional patch. 0001-Fixed-comment-and-guard-name-in-vci_pg_copy.h.patch Looks like vci_pg_copy.h was renamed from vci_numeric.h but file name comment and define guard name were not updated. Fixed it. 0002-Removed-vci_set_merge_and_copy_trans_funcs.patch Found that vci_set_merge_and_copy_trans_funcs() is not used anywhere, removed it alogn with the code that was only called inside it. trans_funcs_table[] now only contains single transfn_oid field, others (unused) are removed. 0003-Replaced-linear-search-by-switch-case.patch Replaced linear search inside trans_funcs_table array to more optimal switch-case. 0004-Removed-worker-name-check-in-lock.c.patch This is one I'm not sure about. Found that changes in Postgres core lock.c file check for "backend=" substring in background worker name. There is also a comment in vci_ros_daemon.c mentioning bgw_name checks of LockAquire(). Names don't match however. So as far as I understand the check for "backend=" in name is always false since no code in VCI sets bgw_name to something similar. This is either forgotten feature that can be easily fixed by removing bgw_name checks, either some bug, either my misunderstanding. For the first case, here is a patch that removes bgw_name checks in lock.c. It makes core patch a bit smaller and not touching lock.c at all (Yay!). [1] https://www.postgresql.org/message-id/8beac6e8a01971b22ccf0f2e2a8eb12a78e5a7ac.camel%40postgrespro.ru -- Regards, Timur Magomedov
Вложения
Hi Timur. Thanks for the patches you provided. My replies are inline below. On Sat, Aug 16, 2025 at 3:45 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: > > On Thu, 2025-08-14 at 11:23 +1000, Peter Smith wrote: > > Here are the latest v18* patches. > > > > Hi Peter! > I've reworked my recent patch [1] so it is now based on v18 and is > divided into several simpler patches. Here they are plus one additional > patch. > > 0001-Fixed-comment-and-guard-name-in-vci_pg_copy.h.patch > Looks like vci_pg_copy.h was renamed from vci_numeric.h but file name > comment and define guard name were not updated. Fixed it. In v19 I intend to merge vci_pg_copy.h into postgresql_copy.h, so this is moot. > > 0002-Removed-vci_set_merge_and_copy_trans_funcs.patch > Found that vci_set_merge_and_copy_trans_funcs() is not used anywhere, > removed it alogn with the code that was only called inside it. > trans_funcs_table[] now only contains single transfn_oid field, others > (unused) are removed. > > 0003-Replaced-linear-search-by-switch-case.patch > Replaced linear search inside trans_funcs_table array to more optimal > switch-case. Thanks, these 0002 and 0003 changes will be in v19 patches which I'm hoping to post tomorrow or the day after. > > 0004-Removed-worker-name-check-in-lock.c.patch > This is one I'm not sure about. > Found that changes in Postgres core lock.c file check for "backend=" > substring in background worker name. There is also a comment in > vci_ros_daemon.c mentioning bgw_name checks of LockAquire(). Names > don't match however. So as far as I understand the check for "backend=" > in name is always false since no code in VCI sets bgw_name to something > similar. > This is either forgotten feature that can be easily fixed by removing > bgw_name checks, either some bug, either my misunderstanding. > For the first case, here is a patch that removes bgw_name checks in > lock.c. It makes core patch a bit smaller and not touching lock.c at > all (Yay!). > There is code in the function vci_LaunchROSControlWorker() which does a sprintf to assign the worker.bgw_name, but I also do not see how LockAcquire can have a bgw_name containing string “backend=”. Frankly, I expected the patch 0001 code should say more like strstr(…, “vci:”) because otherwise it does not seem VCI specific. Indeed, if it was checking “vci:” then the comment in storage/vci_ros_daemon.c would make sense to me. So, I agree that the Acquire/ReleaseLock code seems like it might be unreachable, OTOH, I am reluctant to remove it without understanding more about what was the intended purpose in the first place. Checking… ====== Kind Regards, Peter Smith. Fujitsu Australia
Here are the latest v19* patches. Changes include: PATCH 0002. Cleaning - Removed unused code (per Timur's patch 0002 [1]) - Removed vci_pg_copy.h; Combined this into postrgesql_copy.h Code changes - Changed linear search to switch (per Timur's patch 0003 [1]) - Removed UpperUniquePath (fix build issue due to master commit 24225ad) ====== [1] Timur's patches. https://www.postgresql.org/message-id/07c53a696afb8089d724214dbaeded6fcaa8fc0d.camel%40postgrespro.ru Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v20* patches. Changes include: PATCH 0002. - Addressed all compiler warnings. Hopefully, cfbot CI will now report green for these. ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v21* patches. Changes include: PATCH 0002. - Removed all the unused "port" subfolder files plus associated include files. ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v22* patches. Changes include: PATCH 0002. - Code cleanups -- remove some redundant code related to "devload_t", because the OSS patch only implements the "unmonitored" device. ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Here are the latest v23* patches. Changes include: PATCH 0002. - Some code/config changes so that cfbot can work now for MacOS and FreeBSD. - Modify test code to make the ANALYZE expected results more stable (don't display buffers). ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Hello Peter! Thanks for updates. Here is a small fix for clang "variable set but not used" warnings. -- Regards, Timur Magomedov
Вложения
On Sat, Sep 13, 2025 at 1:58 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: > > Hello Peter! > Thanks for updates. Here is a small fix for clang "variable set but not > used" warnings. > Thanks for your fixes. I will include these whenever I post the next version. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hello Peter! I've found (using valgrind) some cases of reading random garbage after allocated memory. Investigation showed this was caused by converting some nodes to VciScanState* even if they have smaller size allocated originally. So accessing VciScanState fields was accessing memory after palloc'ed memory which could be used by any other allocation. I think converting to VciScanState* is only valid for nodes with tag T_CustomScanState so here is a patch that adds assertions for this: 0001-Assert-corrrect-node-tags-when-casting-to-VciScanSta.patch VCI v23 passes the tests with this patch applied. There are queries that fail unfortunately. I've added one of them to bugs.sql: 0002-Reproducer-of-invalid-cast-to-VciScanState.patch Node with tag 420 (T_NestLoopState) is cast to VciScanState* that fails newly added assertion. I'm not sure where to look next to fix this. Looking forward for you comments and ideas. -- Regards, Timur Magomedov
Вложения
Here are the latest v24* patches. Changes include: PATCH 0002. - Some compiler warning fixes, from Timur [1] - Pre-emptive removal of PointerIsValid() macro to prevent a rebase - Added sanity Asserts for T_CustomScanState, from Timur [2] ====== [1] https://www.postgresql.org/message-id/2af90dfaf6004e17782bd6c8cb8444670ab4d82c.camel%40postgrespro.ru [2] https://www.postgresql.org/message-id/149d6694c0c5a789b0ee865e80109022002bade5.camel%40postgrespro.ru Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Hi Timur, Thanks for your ongoing work for this patch. On Thu, Sep 18, 2025 at 1:15 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: ... > I've found (using valgrind) some cases of reading random garbage after > allocated memory. Investigation showed this was caused by converting > some nodes to VciScanState* even if they have smaller size allocated > originally. So accessing VciScanState fields was accessing memory after > palloc'ed memory which could be used by any other allocation. > > I think converting to VciScanState* is only valid for nodes with tag > T_CustomScanState so here is a patch that adds assertions for this: > 0001-Assert-corrrect-node-tags-when-casting-to-VciScanSta.patch What exactly did Valgrind report? For example, you said the VciScanState points beyond allocated memory. Do you have any more clues, like where that happened? Did you discover where that (smaller than it should be) memory was allocated in the first place? > > VCI v23 passes the tests with this patch applied. OK. I am not 100% certain about the asserts, but since the existing VCI tests are passing, I have merged your patch as-is into v24-0002. I guess we will find out later if the bug below is due to an old code cast problem or a new code assert problem. > > There are queries that fail unfortunately. I've added one of them to > bugs.sql: > 0002-Reproducer-of-invalid-cast-to-VciScanState.patch > Node with tag 420 (T_NestLoopState) is cast to VciScanState* that fails > newly added assertion. > > I'm not sure where to look next to fix this. Looking forward for you > comments and ideas. OK. I ran with your 2nd patch applied and reproduced the core-dump below, where it tripped over one of your new Asserts at executor/vci_sort.c:89. I can see it is an unexpected value T_NestLoopState. (gdb) bt 15 #0 0x00007ff948aa62c7 in raise () from /lib64/libc.so.6 #1 0x00007ff948aa79b8 in abort () from /lib64/libc.so.6 #2 0x0000000000c07977 in ExceptionalCondition (conditionName=0x7ff940839fa8 "scanstate->vci.css.ss.ps.type == T_CustomScanState", fileName=0x7ff940839f90 "executor/vci_sort.c", lineNumber=89) at assert.c:66 #3 0x00007ff9408084e6 in vci_sort_ExecCustomPlan (node=0x2a862f0) at executor/vci_sort.c:89 #4 0x000000000079d5bd in ExecCustomScan (pstate=0x2a862f0) at nodeCustom.c:137 #5 0x000000000077f693 in ExecProcNodeInstr (node=0x2a862f0) at execProcnode.c:486 #6 0x000000000077f664 in ExecProcNodeFirst (node=0x2a862f0) at execProcnode.c:470 #7 0x0000000000772b72 in ExecProcNode (node=0x2a862f0) at ../../../src/include/executor/executor.h:316 #8 0x0000000000775774 in ExecutePlan (queryDesc=0x2a89100, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0xe5b1a0 <donothingDR>) at execMain.c:1697 #9 0x000000000077317b in standard_ExecutorRun (queryDesc=0x2a89100, direction=ForwardScanDirection, count=0) at execMain.c:366 #10 0x00007ff9407f9efd in vci_executor_run_routine (queryDesc=0x2a89100, direction=ForwardScanDirection, count=0) at executor/vci_executor.c:178 #11 0x0000000000772ff5 in ExecutorRun (queryDesc=0x2a89100, direction=ForwardScanDirection, count=0) at execMain.c:301 #12 0x00000000006b7f66 in ExplainOnePlan (plannedstmt=0x2a8a628, into=0x0, es=0x2a81388, queryString=0x28b0fd0 "EXPLAIN (ANALYZE, COSTS FALSE, BUFFERS FALSE, TIMING FALSE, SUMMARY FALSE)\nSELECT *\n FROM main m\n JOIN secondary s\n\tON m.id = s.main_id\n WHERE s.val in (\n\t\tSELECT MAX(val)\n\t\t FROM secondary s2\n\t\t W"..., params=0x0, queryEnv=0x0, planduration=0x7ffe51311320, bufusage=0x0, mem_counters=0x0) at explain.c:579 #13 0x00000000006b799a in standard_ExplainOneQuery (query=0x2ac21f8, cursorOptions=2048, into=0x0, es=0x2a81388, queryString=0x28b0fd0 "EXPLAIN (ANALYZE, COSTS FALSE, BUFFERS FALSE, TIMING FALSE, SUMMARY FALSE)\nSELECT *\n FROM main m\n JOIN secondary s\n\tON m.id = s.main_id\n WHERE s.val in (\n\t\tSELECT MAX(val)\n\t\t FROM secondary s2\n\t\t W"..., params=0x0, queryEnv=0x0) at explain.c:372 #14 0x00007ff9407f9ff3 in vci_explain_one_query_routine (queryDesc=0x2ac21f8, cursorOptions=2048, into=0x0, es=0x2a81388, queryString=0x28b0fd0 "EXPLAIN (ANALYZE, COSTS FALSE, BUFFERS FALSE, TIMING FALSE, SUMMARY FALSE)\nSELECT *\n FROM main m\n JOIN secondary s\n\tON m.id = s.main_id\n WHERE s.val in (\n\t\tSELECT MAX(val)\n\t\t FROM secondary s2\n\t\t W"..., params=0x0, queryEnv=0x0) at executor/vci_executor.c:224 (More stack frames follow...) (gdb) I will keep investigating it... I have not included your test case in the v24* patches because I didn't want this known test failure to mask out any other unknown test problems that might occur. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Peter! > What exactly did Valgrind report? For example, you said the > VciScanState points beyond allocated memory. Do you have any more > clues, like where that happened? Did you discover where that (smaller > than it should be) memory was allocated in the first place? Doing some experiments I've faced a segfault on a query joining tables filled with some amount of data. It was flaky so I used Valgrind. There is a line in vci_scan.c, exec_custom_plan_enabling_vp(): if (!scanstate->first_fetch || (scanstate->pos.num_fetched_rows <= scanstate->pos.current_row)) Valgrind reported that line as Invalid read of size 1, 4 and 4. So all three of the values checked in this line are read from some random memory, possibly allocated and used by other objects already. When the expression in exec_custom_plan_enabling_vp() randomly evaluated to true, the following ExecClearTuple() dereferences NULL in slot->tts_ops. The memory was originally allocated in nodeHashJoin.c, in hjstate = makeNode(HashJoinState) line so it is really smaller than VciScanState. I did not use any table data for reproducer since asserts helps to catch the original problem. I also simplified the original query for a reproducer. > OK. I am not 100% certain about the asserts, but since the existing > VCI tests are passing, I have merged your patch as-is into v24-0002. > I > guess we will find out later if the bug below is due to an old code > cast problem or a new code assert problem. > Thanks for merging asserts. And looks like the problem is related to VCI join nodes. There is no VCI hash join or VCI nested loop. There is a code in VCI planner that still puts VCI Sort or VCI Aggregate nodes on top of regular join nodes which makes no sense to me. This is the cause of the problem. VCI Sort and VCI Aggregate then convert outer nodes to VCI Scan since they know there can't be anything another. This can be fixed either by implementing VCI joins either by disabling them in a deeper way. Since we already have developer GUCs for them I would rather set them to disabled by default instead of removing all useful VCI joins related code. Made a patch with a test and a simplest fix (disabling joins in GUCs). -- Regards, Timur Magomedov
Вложения
Here are the latest v25* patches. Changes include: PATCH 0001. - A rebase was needed due to a recent commit d4d1fc5. ====== Kind Regards, Peter Smith. Fujitsu Australia
Вложения
On Sat, Sep 20, 2025 at 12:13 AM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: > > Hi Peter! > > > What exactly did Valgrind report? For example, you said the > > VciScanState points beyond allocated memory. Do you have any more > > clues, like where that happened? Did you discover where that (smaller > > than it should be) memory was allocated in the first place? > > Doing some experiments I've faced a segfault on a query joining tables > filled with some amount of data. It was flaky so I used Valgrind. > There is a line in vci_scan.c, exec_custom_plan_enabling_vp(): > if (!scanstate->first_fetch || (scanstate->pos.num_fetched_rows <= > scanstate->pos.current_row)) > Valgrind reported that line as Invalid read of size 1, 4 and 4. So all > three of the values checked in this line are read from some random > memory, possibly allocated and used by other objects already. > > When the expression in exec_custom_plan_enabling_vp() randomly > evaluated to true, the following ExecClearTuple() dereferences NULL in > slot->tts_ops. > The memory was originally allocated in nodeHashJoin.c, in hjstate = > makeNode(HashJoinState) line so it is really smaller than VciScanState. > > I did not use any table data for reproducer since asserts helps to > catch the original problem. I also simplified the original query for a > reproducer. > > > OK. I am not 100% certain about the asserts, but since the existing > > VCI tests are passing, I have merged your patch as-is into v24-0002. > > I > > guess we will find out later if the bug below is due to an old code > > cast problem or a new code assert problem. > > > > Thanks for merging asserts. And looks like the problem is related to > VCI join nodes. > There is no VCI hash join or VCI nested loop. There is a code in VCI > planner that still puts VCI Sort or VCI Aggregate nodes on top of > regular join nodes which makes no sense to me. This is the cause of the > problem. VCI Sort and VCI Aggregate then convert outer nodes to VCI > Scan since they know there can't be anything another. This can be fixed > either by implementing VCI joins either by disabling them in a deeper > way. Since we already have developer GUCs for them I would rather set > them to disabled by default instead of removing all useful VCI joins > related code. > > Made a patch with a test and a simplest fix (disabling joins in GUCs). > Hi Timur, Thanks for the patch! Unfortunately, this is straying into areas with which I'm not familiar, so I'm taking it on faith that these are good changes. For now, I'm happy to merge your patch into the next VCI version, posted unless someone else objects. ~ But, I still have a couple of questions for clarification: 1. What about the original Valgrind issue? Is that still a problem that needs to be addressed? E.g., is the bad allocation still lurking, and your sort avoidance patch is simply preventing the bad allocation from being exposed until some next thing randomly fails? Or is there no allocation problem anymore to worry about? 2. What about your added Assert that was previously failing at executor/vci_sort.c:89? That Assert is still present in vci_sort.c, but AFAICT the current tests are not executing that code. Do those patched GUC changes simply make that code unreachable now? In other words, should that previously failing Assert be left where it is or not? Should there be another test case added to execute this Assert? ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi Peter! On Wed, 2025-09-24 at 12:46 +1000, Peter Smith wrote: > On Sat, Sep 20, 2025 at 12:13 AM Timur Magomedov > <t.magomedov@postgrespro.ru> wrote: > > > > Hi Peter! > > > > > What exactly did Valgrind report? For example, you said the > > > VciScanState points beyond allocated memory. Do you have any more > > > clues, like where that happened? Did you discover where that > > > (smaller > > > than it should be) memory was allocated in the first place? > > > > Doing some experiments I've faced a segfault on a query joining > > tables > > filled with some amount of data. It was flaky so I used Valgrind. > > There is a line in vci_scan.c, exec_custom_plan_enabling_vp(): > > if (!scanstate->first_fetch || (scanstate->pos.num_fetched_rows <= > > scanstate->pos.current_row)) > > Valgrind reported that line as Invalid read of size 1, 4 and 4. So > > all > > three of the values checked in this line are read from some random > > memory, possibly allocated and used by other objects already. > > > > When the expression in exec_custom_plan_enabling_vp() randomly > > evaluated to true, the following ExecClearTuple() dereferences NULL > > in > > slot->tts_ops. > > The memory was originally allocated in nodeHashJoin.c, in hjstate = > > makeNode(HashJoinState) line so it is really smaller than > > VciScanState. > > > > I did not use any table data for reproducer since asserts helps to > > catch the original problem. I also simplified the original query > > for a > > reproducer. > > > > > OK. I am not 100% certain about the asserts, but since the > > > existing > > > VCI tests are passing, I have merged your patch as-is into v24- > > > 0002. > > > I > > > guess we will find out later if the bug below is due to an old > > > code > > > cast problem or a new code assert problem. > > > > > > > Thanks for merging asserts. And looks like the problem is related > > to > > VCI join nodes. > > There is no VCI hash join or VCI nested loop. There is a code in > > VCI > > planner that still puts VCI Sort or VCI Aggregate nodes on top of > > regular join nodes which makes no sense to me. This is the cause of > > the > > problem. VCI Sort and VCI Aggregate then convert outer nodes to VCI > > Scan since they know there can't be anything another. This can be > > fixed > > either by implementing VCI joins either by disabling them in a > > deeper > > way. Since we already have developer GUCs for them I would rather > > set > > them to disabled by default instead of removing all useful VCI > > joins > > related code. > > > > Made a patch with a test and a simplest fix (disabling joins in > > GUCs). > > > > Hi Timur, > > Thanks for the patch! Unfortunately, this is straying into areas with > which I'm not familiar, so I'm taking it on faith that these are good > changes. For now, I'm happy to merge your patch into the next VCI > version, posted unless someone else objects. > > ~ > > But, I still have a couple of questions for clarification: > > 1. What about the original Valgrind issue? > > Is that still a problem that needs to be addressed? E.g., is the bad > allocation still lurking, and your sort avoidance patch is simply > preventing the bad allocation from being exposed until some next > thing > randomly fails? Or is there no allocation problem anymore to worry > about? Allocations are fine, the problem was using some nodes as nodes of another type (and bigger size) which leads to crossing boundary of allocated memory. We are safe now and asserts guard us from repeating the original bug. > 2. What about your added Assert that was previously failing at > executor/vci_sort.c:89? > > That Assert is still present in vci_sort.c, but AFAICT the current > tests are not executing that code. Do those patched GUC changes > simply > make that code unreachable now? In other words, should that > previously > failing Assert be left where it is or not? Should there be another > test case added to execute this Assert? Added simple test for running VCI sort node, it executes the assertion code in vci_sort.c. No assertion fails, so VCI Sort itself is OK. Here are both two commits on top of v25 version. -- Regards, Timur Magomedov
Вложения
Hi Timur. On Wed, Sep 24, 2025 at 11:47 PM Timur Magomedov <t.magomedov@postgrespro.ru> wrote: ... > > > > Thanks for the patch! Unfortunately, this is straying into areas with > > which I'm not familiar, so I'm taking it on faith that these are good > > changes. For now, I'm happy to merge your patch into the next VCI > > version, posted unless someone else objects. > > > > ~ > > > > But, I still have a couple of questions for clarification: > > > > 1. What about the original Valgrind issue? > > > > Is that still a problem that needs to be addressed? E.g., is the bad > > allocation still lurking, and your sort avoidance patch is simply > > preventing the bad allocation from being exposed until some next > > thing > > randomly fails? Or is there no allocation problem anymore to worry > > about? > > Allocations are fine, the problem was using some nodes as nodes of > another type (and bigger size) which leads to crossing boundary of > allocated memory. We are safe now and asserts guard us from repeating > the original bug. > Thanks for the clarification. > > > 2. What about your added Assert that was previously failing at > > executor/vci_sort.c:89? > > > > That Assert is still present in vci_sort.c, but AFAICT the current > > tests are not executing that code. Do those patched GUC changes > > simply > > make that code unreachable now? In other words, should that > > previously > > failing Assert be left where it is or not? Should there be another > > test case added to execute this Assert? > > Added simple test for running VCI sort node, it executes the assertion > code in vci_sort.c. No assertion fails, so VCI Sort itself is OK. Here > are both two commits on top of v25 version. > These have been included in v26. Thanks! ====== Kind Regards, Peter Smith. Fujitsu Australia
Here are the latest v26* patches. Changes include: PATCH 0002. - Rebase was needed due to a5b35fc - Merged the sort/join patches provided by Timur [1] ====== [1] https://www.postgresql.org/message-id/fd67d6c54cccaf0d98ec8a19182635067392b928.camel%40postgrespro.ru Kind Regards, Peter Smith. Fujitsu Australia
Вложения
Hello Peter! There is a code in vci_ros.c that initializes xl_heap_inplace xlrec. Comment says this code was taken from src/backend/access/heap/heapam.c. It was fine for Postgres 17 and earlier however struct xl_heap_inplace has 6 fields, not one since commit 243e9b40f1b2. So nmsgs field of xlrec has some random uninitialized value from stack. It goes to WAL and in case of big nmsgs it can cause segfault during server startup. Here are backtrace of a segfault while applying WAL on server startup and a patch that initializes all necessary fields of xlrec to avoid bad WAL records. -- Regards, Timur Magomedov
Вложения
Here are the latest v27* patches. Changes include: PATCH 0002. - Rebase vci_ros.c was needed due to 243e9b4 (patch provided by Timur [1]) ====== [1] https://www.postgresql.org/message-id/b0183172fbe8fbf4260d10df50a57127753eba68.camel%40postgrespro.ru Kind Regards, Peter Smith. Fujitsu Australia