Discussion: Improve LWLock tranche name visibility across backends
Hi,

This is a follow-up to a discussion started in [0]. LWLocks in PostgreSQL are categorized into tranches, and the tranche name appears as the wait_event in pg_stat_activity. There are both built-in tranche names and tranche names that can be registered by extensions using RequestNamedLWLockTranche() or LWLockRegisterTranche(). Tranche names are stored in process-local memory when registered. If a tranche is registered during postmaster startup, such as with built-in tranches or those registered via RequestNamedLWLockTranche(), its name is inherited by backend processes via fork(). However, if a tranche is registered dynamically by a backend using LWLockRegisterTranche(), other backends will not be aware of it unless they explicitly register it as well.

Consider a case in which an extension allows a backend to attach a new dshash via the GetNamedDSHash API and supplies a tranche name like "MyUsefulExtension". The first backend to call GetNamedDSHash will initialize an LWLock using the extension-defined tranche name and associate it with a tranche ID in local memory. Other backends that later attach to the same dshash will also learn about the tranche name and ID. Backends that do not attach to the dshash will not know this tranche name. This results in differences in how wait events are reported in pg_stat_activity.

When querying pg_stat_activity, the function pgstat_get_wait_event is called, which internally uses GetLWLockIdentifier and GetLWTrancheName to map the LWLock to its tranche name. If the backend does not recognize the tranche ID, a fallback name "extension" is used. Therefore, backends that have registered the tranche will report the correct extension-defined tranche name, while others will report the generic fallback of "extension", i.e.

```
postgres=# select wait_event, wait_event_type from pg_stat_activity;
-[ RECORD 1 ]---+--------------------
wait_event      | extension
wait_event_type | LWLock
```

instead of

```
postgres=# select wait_event, wait_event_type from pg_stat_activity;
-[ RECORD 1 ]---+--------------------
wait_event      | MyUsefulExtension
wait_event_type | LWLock
```

This is the current design, but I think we can do better to avoid the inconsistencies this may cause for monitoring tools and diagnostics.

To improve this, we could store tranche names registered by a normal backend in shared memory, for example in a dshash, allowing tranche names to be resolved even by backends that have not explicitly registered them. This would lead to more consistent behavior, particularly as more extensions adopt APIs like GetNamedDSHash, where tranche names are registered by the backend rather than the postmaster.

Attached is a proof of concept that does not alter the LWLockRegisterTranche API. Instead, it detects when a registration is performed by a normal backend and stores the tranche name in shared memory, using a dshash keyed by tranche ID. Tranche name lookup now proceeds in the order of built-in names, the local list, and finally the shared memory. The fallback name "extension" can still be returned if an extension does not register a tranche. An exclusive lock is taken when adding a new tranche, which should be a rare occurrence. A shared lock is taken when looking up a tranche name via GetLWTrancheName.

There are still some open questions I have:

1/ There is currently no mechanism for deleting entries. I am not sure whether this is a concern, since the size of the table would grow only with the number of extensions and the number of LWLocks they initialize, which is typically small. That said, others may have different thoughts on this.

2/ What is the appropriate size limit for a tranche name? The work done in [0] caps the tranche name to 128 bytes for the dshash tranche, and 128 bytes + length of " DSA" suffix for the dsa tranche. Also, the existing RequestNamedLWLockTranche caps the name to NAMEDATALEN. Currently, LWLockRegisterTranche does not have a limit on the tranche name. I wonder if we also need to take care of this and implement some common limit that applies to tranche names regardless of how they're created?

[0] https://www.postgresql.org/message-id/aEiTzmndOVPmA6Mm%40nathan

--
Sami Imseih
Amazon Web Services (AWS)
Attachments
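[Editor's note: to make the proof-of-concept shape easier to picture, here is a minimal sketch of the shared lookup described above, assuming a dshash keyed by tranche ID. The entry layout, the NAMEDATALEN cap, and the function name are illustrative assumptions and not the attached patch; only the lookup order and the "extension" fallback follow the description.]

```
#include "postgres.h"
#include "lib/dshash.h"

/* Illustrative entry layout; the fixed-size name relates to open question 2/. */
typedef struct SharedTrancheNameEntry
{
	int			trancheId;			/* dshash key */
	char		name[NAMEDATALEN];	/* assumed cap, not settled */
} SharedTrancheNameEntry;

/*
 * Final step of the lookup order: the caller has already tried the built-in
 * names and the process-local list.
 */
static const char *
shared_tranche_name_lookup(dshash_table *tranche_names, int trancheId)
{
	SharedTrancheNameEntry *entry;
	const char *name = "extension";	/* unchanged fallback */

	/* dshash_find takes a shared (non-exclusive) partition lock for us */
	entry = dshash_find(tranche_names, &trancheId, false);
	if (entry != NULL)
	{
		/* entries are never deleted (question 1/), so the DSA memory stays valid */
		name = entry->name;
		dshash_release_lock(tranche_names, entry);
	}

	return name;
}
```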
Hi, On Wed, Jul 09, 2025 at 04:39:48PM -0500, Sami Imseih wrote: > Hi, > > When querying pg_stat_activity, the function pgstat_get_wait_event is > called, which internally uses GetLWLockIdentifier and GetLWTrancheName > to map the LWLock to its tranche name. If the backend does not recognize > the tranche ID, a fallback name "extension" is used. Therefore, backends > that have registered the tranche will report the correct extension-defined > tranche name, while others will report the generic fallback of "extension". > > i.e. > ```` > postgres=# select wait_event, wait_event_type from pg_stat_activity; > -[ RECORD 1 ]---+-------------------- > wait_event | extension > wait_event_type | LWLock > ``` > instead of > ``` > postgres=# select wait_event, wait_event_type from pg_stat_activity; > -[ RECORD 1 ]---+-------------------- > wait_event | MyUsefulExtension > wait_event_type | LWLock > ``` > > This is the current design, but I think we can do better to avoid inconsitencies > this my lead for monitoring tools and diagnostics. +1 on finding a way to improve this, thanks for looking at it. > Attached is a proof of concept that does not alter the > LWLockRegisterTranche API. Instead, it detects when a registration is > performed by a normal backend and stores the tranche name in shared memory, > using a dshash keyed by tranche ID. Tranche name lookup now proceeds in > the order of built-in names, the local list, and finally the shared memory. > The fallback name "extension" can still be returned if an extension does > not register a tranche. I did not look in details, but do you think we could make use of WaitEventCustomNew()? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Thanks for the feedback!

> > Attached is a proof of concept that does not alter the
> > LWLockRegisterTranche API. Instead, it detects when a registration is
> > performed by a normal backend and stores the tranche name in shared memory,
> > using a dshash keyed by tranche ID. Tranche name lookup now proceeds in
> > the order of built-in names, the local list, and finally the shared memory.
> > The fallback name "extension" can still be returned if an extension does
> > not register a tranche.
>
> I did not look in details, but do you think we could make use of
> WaitEventCustomNew()?

It looks like I overlooked the custom wait event, so I didn’t take it into
account initially. That said, I do think it’s reasonable to consider
piggybacking on this infrastructure.

After all, LWLockRegisterTranche is already creating a custom wait event
defined by the extension. The advantage here is that we can avoid creating
new shared memory and instead reuse the existing static hash table, which is
capped at 128 custom wait events:

```
#define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE 128
```

However, WaitEventCustomNew as it currently stands won’t work for our use
case, since it assigns an eventId automatically. The API currently takes a
classId and wait_event_name, but in our case, we’d actually want to pass in a
trancheId.

So, we might need a new API, something like:

```
WaitEventCustomNewWithEventId(uint32 classId, uint16 eventId,
                              const char *wait_event_name);
```

eventId in the LWLock case will be a trancheId that was generated by the user
in some earlier step, like LWLockInitialize. This would behave the same as
the existing WaitEventCustomNew API, except that it uses the provided eventId.

Or maybe we can just allow WaitEventCustomNew to take in the eventId, and if
it's > 0, then use the passed-in value, otherwise generate the next eventId.
I do like the latter approach more, what do you think?

With this API, we can then teach LWLockRegisterTranche to register the custom
wait event.

--
Sami
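[Editor's note: here is a rough sketch of what the registration could look like under the first option above. WaitEventCustomNewWithEventId does not exist today, and the body of LWLockRegisterTranche is elided, so every name and signature here is an assumption for illustration, not the eventual patch.]

```
#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/wait_classes.h"	/* assumed location of PG_WAIT_LWLOCK */

/* Proposed API, does not exist yet; signature copied from the text above. */
extern uint32 WaitEventCustomNewWithEventId(uint32 classId, uint16 eventId,
											const char *wait_event_name);

void
LWLockRegisterTranche(int tranche_id, const char *tranche_name)
{
	/* ... existing process-local bookkeeping stays as it is today ... */

	/*
	 * Publish the name so that every backend can resolve it: the tranche ID,
	 * handed out earlier by LWLockNewTrancheId(), doubles as the eventId.
	 */
	(void) WaitEventCustomNewWithEventId(PG_WAIT_LWLOCK, (uint16) tranche_id,
										 tranche_name);
}
```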
Hi, On Thu, Jul 10, 2025 at 04:34:34PM -0500, Sami Imseih wrote: > Thanks for the feedback! > > > > Attached is a proof of concept that does not alter the > > > LWLockRegisterTranche API. Instead, it detects when a registration is > > > performed by a normal backend and stores the tranche name in shared memory, > > > using a dshash keyed by tranche ID. Tranche name lookup now proceeds in > > > the order of built-in names, the local list, and finally the shared memory. > > > The fallback name "extension" can still be returned if an extension does > > > not register a tranche. > > > > I did not look in details, but do you think we could make use of > > WaitEventCustomNew()? > > It looks like I overlooked the custom wait event, so I didn’t take it into > account initially. That said, I do think it’s reasonable to consider > piggybacking on this infrastructure. > > After all, LWLockRegisterTranche is already creating a custom wait event > defined by the extension. Right, the tranche is nothing but an eventID (from a wait_event_info point of view). > The advantage here is that we can avoid creating > new shared memory Right, I think it's good to rely on this existing machinery. > and instead reuse the existing static hash table, which is > capped at 128 custom wait events: > > ``` > #define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE 128 > ``` That's probably still high enough, thoughts? > However, WaitEventCustomNew as it currently stands won’t work for our use > case, since it assigns an eventId automatically. The API currently takes a > classId and wait_event_name, but in our case, we’d actually want to pass in a > trancheId. > > So, we might need a new API, something like: > ``` > WaitEventCustomNewWithEventId(uint32 classId, uint16 eventId, > const char *wait_event_name); > ``` > eventId in the LWLock case will be a tracheId that was generated > by the user in some earlier step, like LWLockInitialize > > This would behave the same as the existing WaitEventCustomNew API, > except that it uses the provided eventId. > > or maybe we can just allow WaitEventCustomNew to take in the eventId, and > if it's > 0, then use the passed in value, otherwise generate the next eventId. > > I do like the latter approach more, what do you think? I think I do prefer it too, but in both cases we'll have to make sure there is no collision on the eventID (LWTRANCHE_FIRST_USER_DEFINED is currently 95). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Jul 09, 2025 at 04:39:48PM -0500, Sami Imseih wrote: > Attached is a proof of concept that does not alter the > LWLockRegisterTranche API. IMHO we should consider modifying the API, because right now you have to call LWLockRegisterTranche() in each backend. Why not accept the name as an argument for LWLockNewTrancheId() and only require it to be called once during your shared memory initialization? In any case, a lot of existing code will continue to call it in each backend unless the API changes. > Instead, it detects when a registration is > performed by a normal backend and stores the tranche name in shared memory, > using a dshash keyed by tranche ID. Tranche name lookup now proceeds in > the order of built-in names, the local list, and finally the shared memory. > The fallback name "extension" can still be returned if an extension does > not register a tranche. Why do we need three different places for the lock names? Is there a reason we can't put it all in shared memory? > 1/ There is currently no mechanism for deleting entries. I am not sure whether > this is a concern, since the size of the table would grow only with the > number of extensions and the number of LWLocks they initialize, which is > typically small. That said, others may have different thoughts on this. I don't see any strong reason to allow deletion unless we started to reclaim tranche IDs, and I don't see any strong reason for that, either. > 2/ What is the appropriate size limit for a tranche name. The work done > in [0] caps the tranche name to 128 bytes for the dshash tranche, and > 128 bytes + length of " DSA" suffix for the dsa tranche. Also, the > existing RequestNamedLWLockTranche caps the name to NAMEDATALEN. Currently, > LWLockRegisterTranche does not have a limit on the tranche name. I wonder > if we also need to take care of this and implement some common limit that > applies to tranch names regardless of how they're created? Do we need to set a limit? If we're using a DSA and dshash, we could let folks use arbitrary long tranche names, right? The reason for the limit in the DSM registry is because the name is used as the key for the dshash table. -- nathan
> > and instead reuse the existing static hash table, which is
> > capped at 128 custom wait events:
> >
> > ```
> > #define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE 128
> > ```
>
> That's probably still high enough, thoughts?

I have no reason to believe that this number could be too low. I am not aware
of an extension that will initialize more than a couple of LWLocks.

> > or maybe we can just allow WaitEventCustomNew to take in the eventId, and
> > if it's > 0, then use the passed-in value, otherwise generate the next eventId.
> >
> > I do like the latter approach more, what do you think?
>
> I think I do prefer it too, but in both cases we'll have to make sure there
> is no collision on the eventID (LWTRANCHE_FIRST_USER_DEFINED is currently
> 95).

As far as collisions are concerned, the key of the hash is the
wait_event_info, which is a bitwise OR of classId and eventId:

```
wait_event_info = classId | eventId;
```

Do you think a collision can still be possible?

Now, what I think will be a good API is to provide an alternative to
LWLockRegisterTranche, which currently takes in both a tranche ID and a
tranche_name. The new API I propose is LWLockRegisterTrancheWaitEventCustom,
which takes only a tranche_name and internally calls WaitEventCustomNew to
add a new wait_event_info to the hash table. The wait_event_info is made up
of classId = PG_WAIT_LWLOCK and LWLockNewTrancheId().

I prefer we implement a new API for this to make it explicit that this API
will both register a tranche and create a custom wait event. I do not think
we should get rid of LWLockRegisterTranche because it is used by
CreateLWLocks during startup and I don't see a reason to change that. See the
attached v1.

What is missing from the attached v1 is:

1/ the documentation changes required in [0].

2/ in dsm_registry.c, we are going to have to modify GetNamedDSHash and
GetNamedDSA to use the new API. Here is an example of what that looks like:

```
@@ -389,14 +385,12 @@ GetNamedDSHash(const char *name, const dshash_parameters *params, bool *found)
 		entry->type = DSMR_ENTRY_TYPE_DSH;

 		/* Initialize the LWLock tranche for the DSA. */
-		dsa_state->tranche = LWLockNewTrancheId();
 		sprintf(dsa_state->tranche_name, "%s%s", name, DSMR_DSA_TRANCHE_SUFFIX);
-		LWLockRegisterTranche(dsa_state->tranche, dsa_state->tranche_name);
+		dsa_state->tranche = LWLockRegisterTrancheWaitEventCustom(dsa_state->tranche_name);
```

Is there a concern with a custom wait event being created implicitly via the
GetNamed* APIs?

3/ I still did not work out the max length requirement of the tranche name,
but I think it would have to be as large as the GetNamed* APIs from the dsm
registry support.

I hope with v1, we can agree on the general direction of this change.

[0] https://www.postgresql.org/docs/current/xfunc-c.html

--
Sami
Attachments
On Fri, Jul 11, 2025 at 04:32:13PM -0500, Sami Imseih wrote: > Now, what I think will be a good API is to provide an alternative to > LWLockRegisterTranche, > which now takes in both a tranche ID and tranche_name. The new API I propose is > LWLockRegisterTrancheWaitEventCustom which takes only a tranche_name > and internally > calls WaitEventCustomNew to add a new wait_event_info to the hash > table. The wait_event_info > is made up of classId = PG_WAIT_LWLOCK and LWLockNewTrancheId(). > > I prefer we implement a new API for this to make it explicit that this > API will both register > a tranche and create a custom wait event. I do not think we should get > rid of LWLockRegisterTranche > because it is used by CreateLWLocks during startup and I don't see a > reason to change that. > See the attached v1. Hm. I was thinking we could have LWLockNewTrancheId() take care of registering the name. The CreateLWLocks() case strikes me as a special path. IMHO LWLockRegisterTranche() should go away. > Is there a concern with a custom wait event to be created implicitly > via the GetNamed* APIs? I'm not sure I see any particular advantage to using custom wait events versus a dedicated LWLock tranche name table. If anything, the limits on the number of tranches and the lengths of the names gives me pause. -- nathan
>> Attached is a proof of concept that does not alter the
>> LWLockRegisterTranche API.

> IMHO we should consider modifying the API, because right now you have to
> call LWLockRegisterTranche() in each backend. Why not accept the name as
> an argument for LWLockNewTrancheId() and only require it to be called once
> during your shared memory initialization?

Yes, we could do that, and this will simplify the tranche registration from
the current two-step process of LWLockNewTrancheId() followed by
LWLockRegisterTranche() to simply LWLockNewTrancheId("my tranche"). I agree.

>> Instead, it detects when a registration is
>> performed by a normal backend and stores the tranche name in shared memory,
>> using a dshash keyed by tranche ID. Tranche name lookup now proceeds in
>> the order of built-in names, the local list, and finally the shared memory.
>> The fallback name "extension" can still be returned if an extension does
>> not register a tranche.

> Why do we need three different places for the lock names? Is there a
> reason we can't put it all in shared memory?

The real reason I felt it was better to keep three separate locations is that
it allows for a clear separation between user-defined tranches registered
during postmaster startup and those registered by a normal backend. The
tranches registered during postmaster are inherited by the backend via
fork() (or EXEC_BACKEND), and therefore, the dshash table will only be used
by a normal backend.

Since DSM is not available during postmaster, if we were to create a DSA
segment in place, similar to what's done in StatsShmemInit(), we would also
need to ensure that the initial shared memory is sized appropriately. This is
because it would need to be large enough to accommodate all user-defined
tranches registered during postmaster, without having to rely on new dsm
segments. From my experimentation, this sizing is not as straightforward as
simply calculating # of tranches * size of a tranche entry.

I still think we should create the dsa during postmaster, as we do with
StatsShmemInit, but it would be better if postmaster keeps its hands off this
dshash and only normal backends use it. Thoughts?

>> 2/ What is the appropriate size limit for a tranche name? The work done
>> in [0] caps the tranche name to 128 bytes for the dshash tranche, and
>> 128 bytes + length of " DSA" suffix for the dsa tranche. Also, the
>> existing RequestNamedLWLockTranche caps the name to NAMEDATALEN. Currently,
>> LWLockRegisterTranche does not have a limit on the tranche name. I wonder
>> if we also need to take care of this and implement some common limit that
>> applies to tranche names regardless of how they're created?

> Do we need to set a limit? If we're using a DSA and dshash, we could let
> folks use arbitrarily long tranche names, right? The reason for the limit in
> the DSM registry is because the name is used as the key for the dshash
> table.

Sure, that is a good point. The dshash entry could be like below, without a
limit on the tranche_name.

```
typedef struct LWLockTrancheNamesEntry
{
	int			trancheId;
	const char *tranche_name;
} LWLockTrancheNamesEntry;
```

>> Is there a concern with a custom wait event being created implicitly
>> via the GetNamed* APIs?

> I'm not sure I see any particular advantage to using custom wait events
> versus a dedicated LWLock tranche name table. If anything, the limits on
> the number of tranches and the lengths of the names give me pause.

Sure, after contemplating this a bit, I prefer a separate shared memory as
well. Custom wait events, while they could work, would also be a bit of a
confusing user experience.

--
Sami
On Mon, Jul 14, 2025 at 02:34:00PM -0500, Sami Imseih wrote: >> Why do we need three different places for the lock names? Is there a >> reason we can't put it all in shared memory? > > The real reason I felt it was better to keep three separate locations is that > it allows for a clear separation between user-defined tranches registered > during postmaster startup and those registered during a normal backend. The > tranches registered during postmaster are inherited by the backend via > fork() (or EXEC_BACKEND), and therefore, the dshash table will only be used > by a normal backend. > > Since DSM is not available during postmaster, if we were to create a DSA > segment in place, similar to what's done in StatsShmemInit(), we would also > need to ensure that the initial shared memory is sized appropriately. This is > because it would need to be large enough to accommodate all user-defined > tranches registered during postmaster, without having to rely on new > dsm segments. > From my experimentation, this sizing is not as straightforward as simply > calculating # of tranches * size of a tranche entry. > > I still think we should create the dsa during postmaster, as we do with > StatsShmemInit, but it would be better if postmaster keeps its hands off this > dshash and only normal backends can use them. Ah, I missed the problem with postmaster. Could we have the first backend that needs to access the table be responsible for creating it and populating it with the built-in/requested-at-startup entries? Also, is there any chance that postmaster might need to access the tranche names? -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > Ah, I missed the problem with postmaster. Could we have the first backend > that needs to access the table be responsible for creating it and > populating it with the built-in/requested-at-startup entries? Also, is > there any chance that postmaster might need to access the tranche names? Seems quite hazardous to let the postmaster get involved with such a data structure. If it seems to need to, we'd better rethink where to put the functionality that needs the access. regards, tom lane
> Ah, I missed the problem with postmaster. Could we have the first backend > that needs to access the table be responsible for creating it and > populating it with the built-in/requested-at-startup entries? We can certainly maintain a flag in the shared state that is set once the first backend loads all the tranches in shared memory. That did not cross my mind, but it feels wrong to offload such responsibility to a normal backend. > Also, is there any chance that postmaster might need to access the > tranche names? A postmaster does not currently have a reason to lookup a tranche name, afaict. This only occurs when looking up wait events or if lwlock tracing is enabled. -- Sami
On Mon, Jul 14, 2025 at 03:45:02PM -0500, Sami Imseih wrote: >> Ah, I missed the problem with postmaster. Could we have the first backend >> that needs to access the table be responsible for creating it and >> populating it with the built-in/requested-at-startup entries? > > We can certainly maintain a flag in the shared state that is set once > the first backend loads all the tranches in shared memory. That did not > cross my mind, but it feels wrong to offload such responsibility to a > normal backend. Well, we already need each backend to either initialize or attach to the dshash table, and the initialization would only ever happen once on a running server. Adding a new initialization step to bootstrap the built-in and registered-at-startup tranche names doesn't seem like that much of a leap to me. Another random thought: I worry that the dshash approach might be quite a bit slower, and IIUC we just need to map an integer to a string. Maybe we should just use a DSA for LWLockTrancheNames. IOW we'd leave it as a char ** but put it in shared memory. -- nathan
Hi,
> Since DSM is not available during postmaster, if we were to create a DSA
> segment in place, similar to what's done in StatsShmemInit(), we would also
> need to ensure that the initial shared memory is sized appropriately. This is
> because it would need to be large enough to accommodate all user-defined
> tranches registered during postmaster, without having to rely on new
> dsm segments.
> From my experimentation, this sizing is not as straightforward as simply
> calculating # of tranches * size of a tranche entry.
> I still think we should create the dsa during postmaster, as we do with
> StatsShmemInit, but it would be better if postmaster keeps its hands off this
> dshash and only normal backends can use them.
> Thoughts?
Since creating a DSA segment in place during Postmaster startup still requires calculating the size of
the tranche names table including the user-defined tranches, why not use static shared memory to
pre-allocate a fixed-size table and arrays of size NAMEDATALEN to store the names?
If a dshash table is used to store tranche names and IDs, where would the tranche name for this table
be registered?
Thank you,
Rahila Syed
Hi, On Fri, Jul 11, 2025 at 04:32:13PM -0500, Sami Imseih wrote: > > > and instead reuse the existing static hash table, which is > > > capped at 128 custom wait events: > > > > > > ``` > > > #define WAIT_EVENT_CUSTOM_HASH_MAX_SIZE 128 > > > ``` > > > > That's probably still high enough, thoughts? > > I have no reason to believe that this number could be too low. > I am not aware of an extension that will initialize more than a > couple of LWLocks. > > > > or maybe we can just allow WaitEventCustomNew to take in the eventId, and > > > if it's > 0, then use the passed in value, otherwise generate the next eventId. > > > > > > I do like the latter approach more, what do you think? > > > > I think I do prefer it too, but in both cases we'll have to make sure there > > is no collision on the eventID (LWTRANCHE_FIRST_USER_DEFINED is currently > > 95). > > As far as collisions are concerned, the key of the hash is the wait_event_info, > which is a bitwise OR of classId and eventId > ``` > wait_event_info = classId | eventId; > ``` > Do you think collision can still be possible? I meant to say collision between the trancheID and WaitEventCustomCounter->nextId Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> Another random thought: I worry that the dshash approach might be quite a > bit slower, and IIUC we just need to map an integer to a string. Maybe we > should just use a DSA for LWLockTrancheNames. IOW we'd leave it as a char** > but put it in shared memory. To use DSA just for this purpose, we would need to maintain an array of dsa_pointers that reference the string(s), right? I am not clear what you mean by using dsa to put the char** -- Sami
On Tue, Jul 15, 2025 at 11:52:19AM -0500, Sami Imseih wrote: >> Another random thought: I worry that the dshash approach might be quite a >> bit slower, and IIUC we just need to map an integer to a string. Maybe we >> should just use a DSA for LWLockTrancheNames. IOW we'd leave it as a char** >> but put it in shared memory. > > To use DSA just for this purpose, we would need to maintain an array of > dsa_pointers that reference the string(s), right? I am not clear what you > mean by using dsa to put the char** I was imagining putting the array in one big DSA allocation instead of carting around a pointer for each tranche name. (Sorry, I realize I am hand-waving over some of the details.) -- nathan
On Tue, Jul 15, 2025 at 11:57 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jul 15, 2025 at 11:52:19AM -0500, Sami Imseih wrote:
> >> Another random thought: I worry that the dshash approach might be quite a
> >> bit slower, and IIUC we just need to map an integer to a string. Maybe we
> >> should just use a DSA for LWLockTrancheNames. IOW we'd leave it as a char**
> >> but put it in shared memory.
> >
> > To use DSA just for this purpose, we would need to maintain an array of
> > dsa_pointers that reference the string(s), right? I am not clear what you
> > mean by using dsa to put the char**
>
> I was imagining putting the array in one big DSA allocation instead of
> carting around a pointer for each tranche name. (Sorry, I realize I am
> hand-waving over some of the details.)

I understood it like this. Here is a sketch:

```
dsa_pointer p;

dsa = dsa_create(....)

p = dsa_allocate(dsa, LWLockTranchesInitialSize());
tranche_names = (char **) dsa_get_address(dsa, p);
tranche_names[0] = "my tranche";
tranche_names[1] = "my tranche";
```

We will need to track the size and resize if needed.

Is this what you mean, from a high level?

--
Sami
On Tue, Jul 15, 2025 at 12:06:00PM -0500, Sami Imseih wrote: > On Tue, Jul 15, 2025 at 11:57 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: >> I was imagining putting the array in one big DSA allocation instead of >> carting around a pointer for each tranche name. (Sorry, I realize I am >> hand-waving over some of the details.) > > I understood it like this. Here is a sketch: > > ``` > dsa_pointer p; > > dsa = dsa_create(....) > > p = dsa_allocate(dsa, LWLockTranchesInitialSize()); > tranche_names = (char **) dsa_get_address(dsa, p); > tranche_names[0] = "my tranche"; > tranche_names[1] = "my tranche"; > ``` > > We will need to track the size and resize if needed. > > Is this what you mean, from a high level? Yes, that's roughly what I had in mind. We might need to employ some tricks to avoid a limit on tranche name length, but maybe that's not worth the energy. -- nathan
Hi, On Tue, Jul 15, 2025 at 12:59:04PM +0530, Rahila Syed wrote: > Hi, > > If a dshash table is used to store tranche names and IDs, where would the > tranche name for this table > be registered? I guess it could be a new BuiltinTrancheId for this dsa but not sure what Nathan and Sami have in mind. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> Hi,
>
> If a dshash table is used to store tranche names and IDs, where would the
> tranche name for this table
> be registered?
> I guess it could be a new BuiltinTrancheId for this dsa but not sure what Nathan
> and Sami have in mind.
Yes, it will be a BuiltinTrancheId for a shared memory that is allocated
during postmaster for tracking tranches. The shared memory will then
only be used by normal backends to register tranches. Any tranche
registered during postmaster is inherited by the backends.
Regards,
Sami
> >> I was imagining putting the array in one big DSA allocation instead of
> >> carting around a pointer for each tranche name. (Sorry, I realize I am
> >> hand-waving over some of the details.)
> >
> > I understood it like this. Here is a sketch:
> >
> > ```
> > dsa_pointer p;
> >
> > dsa = dsa_create(....)
> >
> > p = dsa_allocate(dsa, LWLockTranchesInitialSize());
> > tranche_names = (char **) dsa_get_address(dsa, p);
> > tranche_names[0] = "my tranche";
> > tranche_names[1] = "my tranche";
> > ```
> >
> > We will need to track the size and resize if needed.
> >
> > Is this what you mean, from a high level?
>
> Yes, that's roughly what I had in mind. We might need to employ some
> tricks to avoid a limit on tranche name length, but maybe that's not worth
> the energy.

After spending some time looking at this, I don't think it's a good idea to
do as mentioned above, since it's unclear when to resize the dsa pointer.
It's better to have a single dsa pointer that holds a list of dsa pointers,
each tracking a tranche name. We can also track the size of this list and
grow it when needed. Essentially we should only be manipulating a DSA area
via dsa_pointers.

Attached is what I have in mind, if we want to avoid using dshash. This patch
initializes a DSA with create_in_place at startup (similar to what we do for
pgstat, see pgstat_shmem.c), and provides a routine to append a tranche name
to the list, as well as a routine to look up a tranche name by index. The
index is the tranche ID offset by LWTRANCHE_FIRST_USER_DEFINED.

Additionally, this patch removes the step in [0] to call
LWLockRegisterTranche. Instead, LWLockNewTrancheId now takes a tranche_name
and returns the tranche ID. This new API is what we discussed earlier, and I
agree with it, as it simplifies the workflow. LWLockRegisterTranche is still
kept, but it's a private routine.

To support the above, LWLockTrancheNames is now a struct that tracks both a
local list of tranche names (populated during postmaster startup) and
pointers to shared memory, for tranche names registered after postmaster.
GetLWTrancheName now first checks BuiltinTrancheNames, then the local list in
LWLockTrancheNames, and finally the shared memory. This allows tranche names
registered after postmaster startup to be visible to all backends.

Unlike the local list of tranche names, appending to and searching the shared
memory list requires an LWLock; in exclusive mode to append, and shared mode
to search.

--
Sami

[0] https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-ADDIN-LWLOCKS-AFTER-STARTUP
Attachments
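[Editor's note: roughly, the shared side described above could be pictured as below. The struct and function names are illustrative (the attached patch may differ), but the layout follows the description: one control struct in the main shared memory segment, a dsa_pointer to a list of dsa_pointers, exclusive locking to append, and shared locking to search.]

```
#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

/* Control struct living in the main shared memory segment (illustrative). */
typedef struct LWLockTrancheNamesShared
{
	LWLock		lock;			/* exclusive to append, shared to search */
	int			count;			/* number of names stored so far */
	int			allocated;		/* length of the array behind list_ptr */
	dsa_pointer list_ptr;		/* array of dsa_pointers, one per name */
} LWLockTrancheNamesShared;

/*
 * Append one name; the slot index is the tranche ID offset by
 * LWTRANCHE_FIRST_USER_DEFINED.  Growing the list is skipped here and is
 * sketched further down the thread.
 */
static void
append_tranche_name(dsa_area *dsa, LWLockTrancheNamesShared *shmem,
					const char *name)
{
	dsa_pointer name_ptr;
	dsa_pointer *list;

	LWLockAcquire(&shmem->lock, LW_EXCLUSIVE);

	/* copy the name string into DSA memory */
	name_ptr = dsa_allocate(dsa, strlen(name) + 1);
	strcpy(dsa_get_address(dsa, name_ptr), name);

	/* record its dsa_pointer in the shared list */
	list = dsa_get_address(dsa, shmem->list_ptr);
	list[shmem->count++] = name_ptr;

	LWLockRelease(&shmem->lock);
}
```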
I've attached a rebased version of the patch. On Mon, Jul 21, 2025 at 11:26:44PM -0500, Sami Imseih wrote: > Unlike the local list of tranche names, appending to and searching the > shared memory list requires an LWLock; in exclusive mode to append, and > shared mode to search. The thing that stood out the most to me is how much more expensive looking up the lock name is. At the risk of suggesting even more complexity, I'm wondering if we should add some sort of per-backend cache to avoid the need for locking and dsa_get_address() calls every time you need to look up a lock name. Also, I suspect that there might be some concerns about the API changes. While it should be a very easy fix, that seems likely to break a lot of extensions. I don't know if it's possible to make this stuff backward compatible, and I also don't know if we really want to, as that'll both introduce more complexity and keep folks using the old API. (That being said, I just looked on GitHub and Debian Code Search and was surprised at how few uses of LWLockNewTrancheId() and LWLockRegisterTranche() there are...) -- nathan
Attachments
Hi, On Thu, Jul 31, 2025 at 04:24:38PM -0500, Nathan Bossart wrote: > I've attached a rebased version of the patch. > > On Mon, Jul 21, 2025 at 11:26:44PM -0500, Sami Imseih wrote: > > Unlike the local list of tranche names, appending to and searching the > > shared memory list requires an LWLock; in exclusive mode to append, and > > shared mode to search. > > The thing that stood out the most to me is how much more expensive looking > up the lock name is. At the risk of suggesting even more complexity, I'm > wondering if we should add some sort of per-backend cache to avoid the need > for locking and dsa_get_address() calls every time you need to look up a > lock name. Yeah, I 've the same concern that GetLWTrancheNameByID() might be too costly. OTOH it's not really used in a hot path, "only" when displaying wait events. But still, I agree that the extra overhead (and possible contention) is concerning and that, for example, a per-backend cache (with fallback to the dsa if needed) could help. > Also, I suspect that there might be some concerns about the API changes. > While it should be a very easy fix, that seems likely to break a lot of > extensions. > I don't know if it's possible to make this stuff backward > compatible, and I also don't know if we really want to, as that'll both > introduce more complexity and keep folks using the old API. Yeah, I'm not sure that would be worth the extra complexity and using the old API would "keep" the issue we're trying to solve here. I don't think we should be worried that much by the number of extensions impacted but more about the change complexity and it looks pretty simple. So, I don't think we should worry that much in that regard. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> > > Unlike the local list of tranche names, appending to and searching the
> > > shared memory list requires an LWLock; in exclusive mode to append, and
> > > shared mode to search.
> >
> > The thing that stood out the most to me is how much more expensive looking
> > up the lock name is. At the risk of suggesting even more complexity, I'm
> > wondering if we should add some sort of per-backend cache to avoid the need
> > for locking and dsa_get_address() calls every time you need to look up a
> > lock name.
>
> Yeah, I 've the same concern that GetLWTrancheNameByID() might be too costly.
> OTOH it's not really used in a hot path, "only" when displaying wait events.

GetLWTrancheName also gets called, via T_NAME, when trace_lwlock is enabled.
This is a developer option that requires LOCK_DEBUG to be set at compile
time. So, that path may become even slower than it currently is under high
concurrency. The same is true if dtrace is enabled and the specific LWLock
probe is used:

```
	if (TRACE_POSTGRESQL_LWLOCK_WAIT_START_ENABLED())
		TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);
```

But neither of the above cases is used in normal production activity. For the
case of dtrace, b94409a02f612 added the _ENABLED macros to ensure that
GetLWTrancheName (via T_NAME) is only called when necessary:

"
If dtrace is compiled in but disabled, the lwlock dtrace probes still
evaluate their arguments. Since PostgreSQL 13, T_NAME(lock) does nontrivial
work, so it should be avoided if not needed. To fix, make these calls
conditional on the *_ENABLED() macro corresponding to each probe.
"

The normal case is when pg_stat_get_activity calls pgstat_get_wait_event to
retrieve the wait event via GetLWLockIdentifier. However, as mentioned, this
is not on the hot path.

I think we could add a local backend copy that stays up to date with the DSA.
One idea would be to use an atomic counter to track the number of entries in
the DSA and compare it with a local backend counter whenever the tranche name
lookup occurs. If the atomic counter is higher (since we don't have
deletions), we can update the local copy. Updating the local table should be
a rare occurrence, but it would require an additional atomic fetch every time
the name lookup occurs, in all the above code paths.

Perhaps there's a better approach?

--
Sami
> Also, I suspect that there might be some concerns about the API changes.
> While it should be a very easy fix, that seems likely to break a lot of
> extensions.
> I don't know if it's possible to make this stuff backward
> compatible, and I also don't know if we really want to, as that'll both
> introduce more complexity and keep folks using the old API.
> Yeah, I'm not sure that would be worth the extra complexity and using the old API
> would "keep" the issue we're trying to solve here.
> I don't think we should be worried that much by the number of extensions impacted
> but more about the change complexity and it looks pretty simple.
> So, I don't think we should worry that much in that regard.
I agree. I don’t think the API changes are a big
deal, especially when they reduce the number of
steps.
--
Sami

Hi,

On Fri, Aug 01, 2025 at 01:15:24PM -0500, Sami Imseih wrote:
> I think we could add a local backend copy that stays up to date with the
> DSA. One idea would be to use an atomic counter to track the number of
> entries in the DSA and compare it with a local backend counter whenever the
> tranche name lookup occurs. If the atomic counter is higher (since we
> don't have deletions), we can update the local copy. Updating the local
> table should be a rare occurrence, but it would require an additional
> atomic fetch every time the name lookup occurs, in all the above code paths.
>
> Perhaps there's a better approach?

I was thinking to switch to the DSA (and update local copy) when a name is
not found in the local copy. That way there is no need to maintain a counter
and the DSA overhead should be rare enough.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
> > I think we could add a local backend copy that stays up to date with the > > DSA. One idea would be to use an atomic counter to track the number of > > entries in the DSA and compare it with a local backend counter whenever the > > tranche name lookup occurs. If the atomic counter is higher (since we > > don't have deletions), > > we can update the local copy. Updating the local table should be a > > rare occurrence, but it would > > require an additional atomic fetch every time the name lookup occurs, in all the > > above code paths. > > > > Perhaps there's a better approach? > > I was thinking to switch to the DSA (and update local copy) when a name is > not found in the local copy. That way there is no need to maintain a counter and > the DSA overhead should be rare enough. > > Regards, That should work as well. good idea. -- Sami
> > > I think we could add a local backend copy that stays up to date with the
> > > DSA. One idea would be to use an atomic counter to track the number of
> > > entries in the DSA and compare it with a local backend counter whenever the
> > > tranche name lookup occurs. If the atomic counter is higher (since we
> > > don't have deletions), we can update the local copy. Updating the local
> > > table should be a rare occurrence, but it would require an additional
> > > atomic fetch every time the name lookup occurs, in all the above code paths.
> > >
> > > Perhaps there's a better approach?
> >
> > I was thinking to switch to the DSA (and update local copy) when a name is
> > not found in the local copy. That way there is no need to maintain a counter and
> > the DSA overhead should be rare enough.
> >
> > Regards,
>
> That should work as well. good idea.

With a local hash table, I don't think it's necessary to introduce new code
for managing a DSA-based list of tranche names as is done in v3. We can go
back to storing the shared tranche names in dshash.

What do you think?

--
Sami
On Mon, Aug 04, 2025 at 11:32:19AM -0500, Sami Imseih wrote: > With a local hash table, I don't think it's necessary to introduce new > code for managing > a DSA based list of tranche names as is done in v3. We can go back to > storing the shared > trance names in dshash. > > What do you think? My first thought is that a per-backend hash table seems too expensive/complicated for this. Couldn't it just be an array like we have now? -- nathan
> On Mon, Aug 04, 2025 at 11:32:19AM -0500, Sami Imseih wrote: > > With a local hash table, I don't think it's necessary to introduce new > > code for managing > > a DSA based list of tranche names as is done in v3. We can go back to > > storing the shared > > trance names in dshash. > > > > What do you think? > > My first thought is that a per-backend hash table seems too > expensive/complicated for this. Couldn't it just be an array like we have > now? We can, but I was considering simplicity of implementation, and using a local hash table is slightly simpler. That said, since we're dealing with an append-only data structure, a hash table is probably more than we need. All we need is index-based lookup, so I’ll go with the local array to mirror the shared ( dsa ) array. -- Sami
> > > With a local hash table, I don't think it's necessary to introduce new > > > code for managing > > > a DSA based list of tranche names as is done in v3. We can go back to > > > storing the shared > > > trance names in dshash. > > > > > > What do you think? > > > > My first thought is that a per-backend hash table seems too > > expensive/complicated for this. Couldn't it just be an array like we have > > now? > > We can, but I was considering simplicity of implementation, and using a > local hash table is slightly simpler. > > That said, since we're dealing with an append-only data structure, a hash > table is probably more than we need. All we need is index-based lookup, > so I’ll go with the local array to mirror the shared ( dsa ) array. Here is v4. We can keep using the local array as the backend local array. It will already include the tranches registered during postmaster startup, and it will be updated during tranche name lookup from the dsa based array. I added a new routine UpdateLocalTrancheName which can be used in both LWLockRegisterTranche and GetLWTrancheName to append to the local tranche during both postmaster startup and afterwards. Also, initially I thought we can get rid of returning the "extension" wait event name, but it's still possible for an "extension" tranche name to be returned if LWLockInitialize is called with a tranche_id that was never registered with LWLockNewTrancheId. We don't prevent that, and I can't see how we can without changing the LWLockInitialize API, but it is something that should probably be improved. For now, I added a mention of this behavior in the documentation. This is existing behavior, so I could also separate this doc change into a separate patch if it makes more sense to do so. -- Sami
Attachments
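[Editor's note: for clarity, the lookup order described above amounts to something like the following. BuiltinTrancheNames and LWTRANCHE_FIRST_USER_DEFINED are the existing symbols in lwlock.c, while local_names, local_allocated, and SharedTrancheNameLookup are stand-ins for whatever the attached v4 actually calls them.]

```
/* backend-local mirror of the shared list (illustrative names) */
static const char **local_names = NULL;
static int	local_allocated = 0;

static const char *
GetLWTrancheName(uint16 trancheId)
{
	const char *name;

	/* 1. Individual LWLocks and built-in tranches */
	if (trancheId < LWTRANCHE_FIRST_USER_DEFINED)
		return BuiltinTrancheNames[trancheId];

	/* 2. Backend-local array, inherited via fork() or cached earlier */
	trancheId -= LWTRANCHE_FIRST_USER_DEFINED;
	if (trancheId < local_allocated && local_names[trancheId] != NULL)
		return local_names[trancheId];

	/* 3. Shared (DSA) list, taking the lock in shared mode */
	name = SharedTrancheNameLookup(trancheId);
	if (name != NULL)
	{
		UpdateLocalTrancheName(trancheId, name);	/* cache for next time */
		return name;
	}

	/* 4. Tranche was never registered anywhere */
	return "extension";
}
```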
Hi,

On Mon, Aug 04, 2025 at 10:47:45PM -0500, Sami Imseih wrote:
> > > > With a local hash table, I don't think it's necessary to introduce new
> > > > code for managing a DSA-based list of tranche names as is done in v3.
> > > > We can go back to storing the shared tranche names in dshash.
> > > >
> > > > What do you think?
> > >
> > > My first thought is that a per-backend hash table seems too
> > > expensive/complicated for this. Couldn't it just be an array like we have
> > > now?
> >
> > We can, but I was considering simplicity of implementation, and using a
> > local hash table is slightly simpler.
> >
> > That said, since we're dealing with an append-only data structure, a hash
> > table is probably more than we need. All we need is index-based lookup,
> > so I’ll go with the local array to mirror the shared (dsa) array.

+1 for the local array.

> Here is v4.

Thanks for the patch update!

> We can keep using the local array as the backend local array.
> It will already include the tranches registered during postmaster startup,
> and it will be updated during tranche name lookup from the dsa based
> array.
> I added a new routine UpdateLocalTrancheName which can be used
> in both LWLockRegisterTranche and GetLWTrancheName to append to the
> local tranche during both postmaster startup and afterwards.

I did not look at the code in detail, just played a bit with it and found
some issues.

Issue 1 --

If I register enough tranches to go to:

+	/* Resize if needed */
+	if (LWLockTrancheNames.shmem->count >= LWLockTrancheNames.shmem->allocated)
+	{
+		newalloc = pg_nextpower2_32(Max(LWLOCK_TRANCHE_NAMES_INIT_SIZE, LWLockTrancheNames.shmem->allocated + 1));
+		new_list_ptr = dsa_allocate(LWLockTrancheNames.dsa, newalloc * sizeof(dsa_pointer));
+		new_name_ptrs = dsa_get_address(LWLockTrancheNames.dsa, new_list_ptr);
+		memcpy(new_name_ptrs, current_ptrs, LWLockTrancheNames.shmem->allocated * sizeof(dsa_pointer));
+		dsa_free(LWLockTrancheNames.dsa, *current_ptrs);

then I get:

ERROR: dsa_area could not attach to a segment that has been freed

I think we should "dsa_free(LWLockTrancheNames.dsa, LWLockTrancheNames.shmem->list_ptr)"
instead.

Issue 2 --

If an extension calls RequestNamedLWLockTranche() it will register the same
tranche twice:

(gdb) p LWLockTrancheNames.local[0]
$1 = 0x7acf5a40c420 "pg_playmem"
(gdb) p LWLockTrancheNames.local[97]
$2 = 0x7acf5a40c420 "pg_playmem"

First (local[0]) during LWLockNewTrancheId() during InitializeLWLocks()
Second (local[97]) during LWLockRegisterTranche() during CreateLWLocks()

Maybe we should put this back?

-	/* This should only be called for user-defined tranches. */
-	if (tranche_id < LWTRANCHE_FIRST_USER_DEFINED)
-		return;
-
-	/* Convert to array index. */
-	tranche_id -= LWTRANCHE_FIRST_USER_DEFINED;

and remove:

+	tranche_index = tranche_id - LWTRANCHE_FIRST_USER_DEFINED;

from LWLockNewTrancheId()?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Thanks for reviewing! > Issue 1 -- > > If I register enough tranches to go to: > > + /* Resize if needed */ > + if (LWLockTrancheNames.shmem->count >= LWLockTrancheNames.shmem->allocated) > + { > + newalloc = pg_nextpower2_32(Max(LWLOCK_TRANCHE_NAMES_INIT_SIZE, LWLockTrancheNames.shmem->allocated+ 1)); > + new_list_ptr = dsa_allocate(LWLockTrancheNames.dsa, newalloc * sizeof(dsa_pointer)); > + new_name_ptrs = dsa_get_address(LWLockTrancheNames.dsa, new_list_ptr); > + memcpy(new_name_ptrs, current_ptrs, LWLockTrancheNames.shmem->allocated * sizeof(dsa_pointer)); > + dsa_free(LWLockTrancheNames.dsa, *current_ptrs); > > then I get: > > ERROR: dsa_area could not attach to a segment that has been freed Will investigate this one and correct in the next rev. > Issue 2 -- > > If an extension calls RequestNamedLWLockTranche() it will register the same > tranche twice: > > (gdb) p LWLockTrancheNames.local[0] > $1 = 0x7acf5a40c420 "pg_playmem" > (gdb) p LWLockTrancheNames.local[97] > $2 = 0x7acf5a40c420 "pg_playmem" Thanks for catching. This one is clear, there is an extra call to register inside RequestNamedLWLockTranche. I'll fix this in the next rev. -- Sami
> I'll fix this in the next rev.

v5 fixes the 2 issues found above.

For the issue that was throwing ".... segment that has been freed", indeed we
should be freeing LWLockTrancheNames.shmem->list_ptr. list_ptr is a
dsa_pointer that stores an array of dsa_pointers, which then made me realize
that I was not freeing the actual dsa_pointers holding the tranche names. I
fixed that as well.

I also made some improvements to the code that copies the old list to the new
list and fixed the lookup in GetLWTrancheName.

--
Sami
Attachments
Hi,

On Tue, Aug 05, 2025 at 11:24:04PM -0500, Sami Imseih wrote:
> > I'll fix this in the next rev.
>
> v5 fixes the 2 issues found above.

Thanks!

> For the issue that was throwing ".... segment that has been freed",
> indeed we should be freeing LWLockTrancheNames.shmem->list_ptr.

Yeah.

> list_ptr is a dsa_pointer that stores an array of dsa_pointers, which
> then made me realize that I was not freeing the actual dsa_pointers
> holding the tranche names. I fixed that as well.
>
> I also made some improvements to the code that copies the old
> list to the new list and fixed the lookup in
> GetLWTrancheName.

Still doing a bit of testing without looking closely at the code.

Issue 1 --

If I:

LWLockInitialize(&lck, LWLockNewTrancheId("BDT_Lck"));
LWLockAcquire(&lck, LW_EXCLUSIVE);
LWLockAcquire(&lck, LW_EXCLUSIVE);

so that this process is locked on the LWLock BDT_Lck, and if I query
pg_stat_activity in another backend, then I get:

postgres=# select wait_event from pg_stat_activity where wait_event_type = 'LWLock';
 wait_event
------------
 BDT_Lck
(1 row)

which is what we expect.

But if I look at LWLockTrancheNames.local (for the process that queried
pg_stat_activity):

(gdb) p LWLockTrancheNames.local[0]
$1 = 0x0
(gdb) p LWLockTrancheNames.local[1]
$2 = 0x7b759a81b038 "BDT_Lck"

It looks like there is an indexing issue, as we should start at index 0.

Issue 2 / question --

If I call LWLockNewTrancheId("BDT_play2") multiple times to ensure that
trancheId >= LWLockTrancheNames.allocated and then ensure that a backend is
locked doing:

"
LWLockInitialize(&lck, LWLockNewTrancheId("BDT_Lck"));
LWLockAcquire(&lck, LW_EXCLUSIVE);
LWLockAcquire(&lck, LW_EXCLUSIVE);
"

Then if, in another backend, I call GetLWTrancheName by querying
pg_stat_activity, then I see "BDT_Lck" being reported as the wait event name
(which is good).

The thing that worries me a bit is that the local cache is populated only for
"BDT_Lck", but not for all the other "BDT_play2".

(gdb) p LWLockTrancheNames.local[2]
$9 = 0x0
(gdb) p LWLockTrancheNames.local[3]
$10 = 0x0
(gdb) p LWLockTrancheNames.local[12]
$11 = 0x0
(gdb) p LWLockTrancheNames.local[13]
$12 = 0x780593a1b138 "BDT_Lck"

When we need to populate the local cache, would it be better to populate it
with the whole dsa content instead of just the current missing ID, as done in
GetLWTrancheName()?

+	if (tranche_name != NULL)
+		SetLocalTrancheName(trancheId, tranche_name);

I think that would make more sense, as that would avoid taking the
LWLockTrancheNames.shmem->lock in GetLWTrancheNameByID() multiple times
should the wait event change, thoughts?

Issue 3 --

If I call LWLockNewTrancheId("BDT_play2") only one time to ensure that
trancheId < LWLockTrancheNames.allocated and then ensure that a backend is
locked doing:

"
LWLockInitialize(&lck, LWLockNewTrancheId("BDT_Lck"));
LWLockAcquire(&lck, LW_EXCLUSIVE);
LWLockAcquire(&lck, LW_EXCLUSIVE);
"

Then if, in another backend, I call GetLWTrancheName by querying
pg_stat_activity, then I see "extension" being reported as the wait event
name (which is not good).

The issue is that in GetLWTrancheName(), the patch does:

+	if (trancheId >= LWLockTrancheNames.allocated)
+	{
+		tranche_name = GetLWTrancheNameByID(trancheId);
+		if (tranche_name != NULL)
+			SetLocalTrancheName(trancheId, tranche_name);
+	}
+	else
+		tranche_name = LWLockTrancheNames.local[trancheId];

It then does not find the name in the local cache and returns "extension".
I think that in that case we should fall back to querying the DSA (and
populate the local cache), thoughts?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
I've begun reviewing this patch and have a few questions listed below:
1. + if (i < LWLockTrancheNames.shmem->allocated && DsaPointerIsValid(old_ptrs[i]))
Should an assert be used for the second condition instead?
Since for i < LWLockTrancheNames.shmem->allocated, the dsa pointer is expected to be valid.
2. copied_ptr = dsa_allocate(LWLockTrancheNames.dsa, len);
+ copied_addr = dsa_get_address(LWLockTrancheNames.dsa, copied_ptr);
+ memcpy(copied_addr, old_name, len);
+
+ new_ptrs[i] = copied_ptr;
+
+ /* free old tranche names */
+ dsa_free(LWLockTrancheNames.dsa, old_ptrs[i]);
Why is it necessary to allocate a new dsa_pointer for tranche names that are the same size and then
free the old one?
Is there a reason we can't just assign new_ptrs[i] = old_ptrs[i]?
3.
>Additionally, while users should not pass arbitrary tranche IDs (that is,
>IDs not created via LWLockNewTrancheId) to LWLockInitialize, nothing
>technically prevents them from doing so. Therefore, we must continue to
>handle such cases gracefully by returning a default "extension" tranche name.
Would it be possible to update LWLockInitialize so that it checks if tranche_id is
already registered in the dsa, and if not, registers it during the LWLockInitialize() process?
Thank you,
Rahila Syed
Thanks for testing! > which is what we expect. > > But if I look at LWLockTrancheNames.local (for the process that queried pg_stat_activity): > > (gdb) p LWLockTrancheNames.local[0] > $1 = 0x0 > (gdb) p LWLockTrancheNames.local[1] > $2 = 0x7b759a81b038 "BDT_Lck" > > It looks like there is an indexing issue, as we should start at index 0. > > Issue 2 / question -- > > If I call LWLockNewTrancheId("BDT_play2") multiple times to ensure that > > > Then if, in another backend, I call GetLWTrancheName by querying pg_stat_activity > then I see "BDT_Lck" being reported as wait event name (which is good). > > The thing that worries me a bit is that the local cache is populated only for > "BDT_Lck", but not for all the other "BDT_play2". Issue #1 and issue #2 are related. In both cases you have a registered wait event that only lives in shared memory, but was never required to be accessed by pg_stat_activity, so it was never cached locally. > When we need to populate the local cache, would it be better to populate > it with the whole dsa content instead of just the current missing ID as done > in GetLWTrancheName(): I do agree that we should just sync the local cache totally whenever we can't find the tranche name in the local cache. Will fix in the next rev. -- Sami
Thanks for testing! > Why is it necessary to allocate a new dsa_pointer for tranche names that are the same size and then > free the old one? > Is there a reason we can't just assign new_ptrs[i] = old_ptrs[i]? Fair point. I will update this in the next rev. We don't need to free the existing tranche name pointers, only the list. > Would it be possible to update LWLockInitialize so that it checks if tranche_id is > already registered in the dsa, and if not, registers it during the LWLockInitialize() process? We could. I do think this will need a separate discussion as a follow-up to this thread. -- Sami
From this discussion, and the fact that the tranche name could come from either local or shared memory, I think we should have tests. So, I am working on adding tests using INJECTION_POINTS. Some of the tests I have in mind now are: 1/ Does the shared memory grow correctly? 2/ Is the tranche name being returned the correct one? 3/ Is the local cache being updated correctly? I will have an updated patch with the new test module early next week. -- Sami
Hi, Attached is v6, which addresses the feedback from the last review. 1/ Rahila raised a point about the necessity to allocate new dsa pointers for tranche names when copying (during resize). That is correct. We can simply memcpy those dsa pointers at the time of copy. All we need to do is free the dsa pointer that tracks the list. 2/ I also implemented a full sync of the local cache when necessary, that is, when the tranche name is not found locally and the tranche index is not higher than the max used index. To do that, we have to track the highest index used, since the allocated counter is not sufficient. Allocation will grow geometrically to avoid resize operations. 3/ I also added tests using INJECTION_POINTS. Although the same tests could be done with DEBUG, I felt INJECTION_POINTS are better to use for this purpose to avoid unnecessary logging. I created a test module called test_lwlock_tranches which has a perl test because we do need to set shared_preload_libraries. I did have to define the following in wait_classes.h: +#define WAIT_EVENT_CLASS_MASK 0xFF000000 +#define WAIT_EVENT_ID_MASK 0x0000FFFF This is because the extension calls GetLWLockIdentifier (to mimic pg_stat_activity) and needs to calculate a classId. -- Sami
Attachments
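To make the resize behavior described in point 2/ above concrete, here is a minimal sketch of geometric growth of a DSA-backed list of name pointers, built only from the standard dsa_* primitives; the function and variable names (grow_tranche_name_list, list_ptr, allocated) are illustrative and are not taken from the patch:

```
#include "postgres.h"

#include "port/pg_bitutils.h"
#include "utils/dsa.h"

/*
 * Illustrative sketch only: grow a DSA-backed array of dsa_pointer slots
 * geometrically so that index "needed" fits.  The existing name pointers
 * are copied as-is; only the old list allocation itself is freed.
 */
static void
grow_tranche_name_list(dsa_area *area, dsa_pointer *list_ptr,
					   int *allocated, int needed)
{
	int			new_alloc;
	dsa_pointer new_list;
	dsa_pointer *new_ptrs;

	if (needed < *allocated)
		return;					/* already large enough */

	/* grow geometrically so that resizes stay rare */
	new_alloc = pg_nextpower2_32(Max(8, needed + 1));

	new_list = dsa_allocate0(area, sizeof(dsa_pointer) * new_alloc);
	new_ptrs = dsa_get_address(area, new_list);

	if (DsaPointerIsValid(*list_ptr))
	{
		dsa_pointer *old_ptrs = dsa_get_address(area, *list_ptr);

		memcpy(new_ptrs, old_ptrs, sizeof(dsa_pointer) * (*allocated));
		dsa_free(area, *list_ptr);	/* free only the old list, not the names */
	}

	*list_ptr = new_list;
	*allocated = new_alloc;
}
```

Because the name pointers themselves are simply copied, no per-name dsa_free/dsa_allocate is needed during a resize, which matches the feedback above.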
I haven't followed the latest discussion, but I took a look at the patch. + It is possible to use a <literal>tranche_id</literal> that was not retrieved + using <function>LWLockNewTrancheId</function>, but this is not recommended. + The ID may clash with an already registered tranche name, or the specified + name may not be found. In such cases, looking up the name will return a generic + "extension" tranche name. Is there any reason to continue allowing this? For example, maybe we could ERROR if LWLockInitialize()/GetLWTrancheName() are given a tranche_id greater than the number allocated. I guess I'm not following why we should gracefully handle these kinds of coding errors, especially when they result in unhelpful behavior like an "extension" tranche. +#else +elog(ERROR, "injection points not supported"); +#endif This causes compilation to fail when injection points are not enabled. I haven't combed through the patch character-by-character, but upon a read-through, the general shape looks reasonable to me. As a general note, I'd suggest adding more commentary throughout and finding opportunities to simplify and/or clean up the code as much as possible. -- nathan
> I haven't followed the latest discussion, but I took a look at the patch. > > + It is possible to use a <literal>tranche_id</literal> that was not retrieved > + using <function>LWLockNewTrancheId</function>, but this is not recommended. > + The ID may clash with an already registered tranche name, or the specified > + name may not be found. In such cases, looking up the name will return a generic > + "extension" tranche name. > > Is there any reason to continue allowing this? For example, maybe we could > ERROR if LWLockInitialize()/GetLWTrancheName() are given a tranche_id > greater than the number allocated. I guess I'm not following why we should > gracefully handle these kinds of coding errors, especially when they result > in unhelpful behavior like an "extension" tranche. We could definitely make LWLockInitialize error out in this case. It may break things out there, so that is why I have been hesitant. I don't think we can use allocated in this case, since that value refers to pre-allocated slots. Instead, we will need to do a lookup, and if no tranche_name is found, then we should error out. Essentially, we can call GetLWTrancheName within LWLockInitialize, but GetLWTrancheName can no longer have a fallback value such as "extension". It should error out instead. I believe that is what you mean, correct? One thought I had was to change the signature of LWLockInitialize to take a tranche name, and then LWLockNewTrancheId would no longer need to be global. That will complicate things with built-in tranches a bit (but is doable). On the other hand, an error seems like a reasonable approach. > Attached v6 which addresses the feedback from the last review. I spoke offline with Bertrand and discussed the synchronization of the local memory from shared memory. After that discussion it is clear we don't need to track the highest used index in shared memory. Because the tranche_id comes from a shared counter, it is not possible that the same tranche_id can be used twice, so we will not have overlap. That means, when we sync, we just need to know the highest used index in the local memory (since that could be populated during postmaster startup and inherited via fork) and start syncing local memory from that point. We can also cut off the sync once we encounter an invalid dsa pointer, since there can't be any tranche names in shared memory after that point. I did change SyncLWLockTrancheNames to accept the tranche_id that triggered the sync, so we can make sure it gets added to local memory, either as part of the sync or added with the default "extension" tranche name. I also corrected the INJECTION_POINT tests. I was missing a skip if INJECTION_POINT is not enabled and removed +elog(ERROR, "injection points not supported"); mentioned by Nathan earlier. To simplify the review, I also split the test patch. It is not yet clear if there is consensus to include tests (I think we should), but having the tests will help the review. More commentary was added to the code for the point Nathan also raised. See v7, which could be simplified further depending on the handling of LWLockInitialize as mentioned at the top of this message, since we don't need to account for out-of-range tranche IDs. -- Sami
Attachments
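As a rough illustration of the "error out instead of falling back to extension" behavior being discussed above (this is not the patch itself; local_names and local_allocated stand in for whatever backend-local cache the patch keeps):

```
#include "postgres.h"

#include "storage/lwlock.h"

/* illustrative backend-local cache of user-defined tranche names */
static const char **local_names = NULL;
static int	local_allocated = 0;

/*
 * Hypothetical sketch: refuse to resolve user-defined tranche IDs that were
 * never registered, instead of returning the generic "extension" fallback.
 * Built-in tranches are resolved elsewhere and are not shown here; in the
 * patch under discussion, a cache miss would first trigger a sync from
 * shared memory before giving up.
 */
static const char *
lookup_tranche_name(int tranche_id)
{
	const char *name = NULL;

	if (tranche_id >= LWTRANCHE_FIRST_USER_DEFINED)
	{
		int			idx = tranche_id - LWTRANCHE_FIRST_USER_DEFINED;

		if (idx < local_allocated)
			name = local_names[idx];
	}

	if (name == NULL)
		elog(ERROR, "tranche ID %d is not registered", tranche_id);

	return name;
}
```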
Hi, On Tue, Aug 12, 2025 at 04:16:48PM -0500, Sami Imseih wrote: > I spoke offline with Bertrand and discussed the synchronization of > the local memory from shared memory. After that discussion it is clear > we don't need to track the highest used index in shared memory. Because > the tranche_id comes from a shared counter, it is not possible that > the same tranche_id can be used twice, so we will not have overlap. > That means, when we sync, we just need to know the highest used > index in the local memory ( since that could be populated during > postmaster startup and inherited via fork ) and start syncing local > memory from that point. Thanks for the updated version! I think that we can also get rid of max_used_index. Indeed, I can see that it's updated when SetLocalTrancheName() is called. But I don't think that's needed as that would be perfectly fine to restart always from the "initial max_used_index" because that should be rare enough and the number of entries that we scan is not large. That would mean: entry not found? resync the entire local cache (starting again from the non updated max_used_index). So, if we agree that we don't need to update max_used_index in SetLocalTrancheName(), then during SyncLWLockTrancheNames() we could rely only on DsaPointerIsValid(pointers[i]). Indeed, those DSA pointers are only valid for dynamically registered tranches. For example with max_used_index = 5, what you would find is DsaPointerIsValid(pointers[i]) invalid from 0 to 4 and starts to be valid at [5] (if a tranche has been dynamically added). 5 is exactly: " int next_index = LWLockTrancheNames.max_used_index + 1; " So we could do something like: int i = 0; while (i < LWLockTrancheNames.shmem->allocated && !DsaPointerIsValid(shared_ptrs[i])) { i++; } Now this "i" acts as the "next_index" in v7 and now we can start iterating from it and sync until it's invalid again. That way, we get rid of max_used_index and I think that this part is simpler and easier to understand. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> So we could do something like: > > int i = 0; > while (i < LWLockTrancheNames.shmem->allocated && > !DsaPointerIsValid(shared_ptrs[i])) > { > i++; > } I found a different approach that does not require tracking an additional counter. The sync stops when an invalid pointer is found after at least one valid entry in shared memory, so we don't stop if there was an LWLock registered in local memory during postmaster startup; such a tranche will never actually be added to shared memory at the same position. >> Is there any reason to continue allowing this? For example, maybe we could >> ERROR if LWLockInitialize()/GetLWTrancheName() are given a tranche_id >> greater than the number allocated. I guess I'm not following why we should >> gracefully handle these kinds of coding errors, especially when they result >> in unhelpful behavior like an "extension" tranche. > Essentially, we can call GetLWTrancheName within LWLockInitialize, but > GetLWTrancheName can no longer have a fallback value such as "extension". > It should error out instead. I believe that is what you mean. correct? I also added this, so now only correctly registered LWLocks can be looked up. I also reconsidered the use of INJECTION_POINT for this purpose. I felt it added unnecessary code where we only need a DEBUG (I used DEBUG3) statement at a few key places. See v7. -- Sami
Attachments
>> See v7. I fixed an earlier issue with Windows, which was due to not initializing the shared memory inside ``` #ifdef EXEC_BACKEND extern void AttachSharedMemoryStructs(void); #endif ``` But then I found another one after. LWLockTrancheNames gets forked on Linux, but of course that will not happen on Windows without extra work. So, the tests failed because the requested tranches (via RequestNamedLWLockTranche) were not being found when looked up. But of course, we already have provisions to copy these tranches for Windows ( see inside launch_backend.c ). ``` int NamedLWLockTrancheRequests; NamedLWLockTranche *NamedLWLockTrancheArray; LWLockPadded *MainLWLockArray; ``` So, this means that for the local memory sync, we can actually just copy the requested tranches (via RequestNamedLWLockTranche) and then the shared memory tranches. This is much better, as it syncs using both possible sources for tranche names. ``` int i = 0; while (i < NamedLWLockTrancheRequests) { NamedLWLockTranche *tranche; tranche = &NamedLWLockTrancheArray[i]; SetLocalTrancheName(i, tranche->trancheName); i++; } /* Acquire shared lock on tranche names shared memory */ LWLockAcquire(&LWLockTrancheNames.shmem->lock, LW_SHARED); while (i < LWLockTrancheNames.shmem->allocated) { ``` So, now these tests pass locally on Windows. Attached is v8. -- Sami
Attachments
Hi, On Sat, Aug 16, 2025 at 10:18:05PM -0500, Sami Imseih wrote: > Attached is v8. Thanks for the new version! A few random comments about 0001: 1 === +} LWLockTrancheNamesShmem; +} LWLockTrancheNamesStruct; Add LWLockTrancheNamesShmem and LWLockTrancheNamesStruct to src/tools/pgindent/typedefs.list? 2 === Maybe a comment before the above structs definitions to explain what they are used for? 3 === +static void +SetSharedTrancheName(int tranche_index, const char *tranche_name) { - /* This should only be called for user-defined tranches. */ - if (tranche_id < LWTRANCHE_FIRST_USER_DEFINED) - return; + dsa_pointer *name_ptrs; + dsa_pointer str_ptr; + char *str_addr; + int len; + int current_allocated; + int new_alloc = 0; - /* Convert to array index. */ - tranche_id -= LWTRANCHE_FIRST_USER_DEFINED; + LWLockAcquire(&LWLockTrancheNames.shmem->lock, LW_EXCLUSIVE); Add Assert(IsUnderPostmaster || !IsPostmasterEnvironment);? 4 === +static void +SetLocalTrancheName(int tranche_index, const char *tranche_name) +{ + int newalloc; + + Assert(tranche_name); should we add some assert on IsUnderPostmaster and/or IsPostmasterEnvironment too? 5 === + while (i < NamedLWLockTrancheRequests) + { + NamedLWLockTranche *tranche; + + tranche = &NamedLWLockTrancheArray[i]; + + SetLocalTrancheName(i, tranche->trancheName); + + i++; + } Maybe add a comment that those are the ones allocated by the postmaster during startup? Also, as this will be done during each sync and those tranches don't change, so one could think there is room for improvement. Maybe add a comment that's probably not worth optimizing (due to the fact that NamedLWLockTrancheRequests should be small enough and the sync rare)? 6 === There is this existing comment: /* * NamedLWLockTrancheRequests is both the valid length of the request array, * and the length of the shared-memory NamedLWLockTrancheArray later on. * This variable and NamedLWLockTrancheArray are non-static so that * postmaster.c can copy them to child processes in EXEC_BACKEND builds. */ int NamedLWLockTrancheRequests = 0; Maybe add something like? "Additional dynamic tranche names beyond this count are stored in a DSA". 7 === + old_ptrs = dsa_get_address(LWLockTrancheNames.dsa, + LWLockTrancheNames.shmem->list_ptr); + + name_ptrs = dsa_get_address(LWLockTrancheNames.dsa, new_list); + + memset(name_ptrs, InvalidDsaPointer, new_alloc); + memcpy(name_ptrs, old_ptrs, sizeof(dsa_pointer) * current_allocated); + + dsa_free(LWLockTrancheNames.dsa, LWLockTrancheNames.shmem->list_ptr); maybe use local variable for LWLockTrancheNames.shmem->list_ptr (and LWLockTrancheNames.dsa)? 8 === + needs_sync = (trancheId >= LWLockTrancheNames.allocated || + LWLockTrancheNames.local[trancheId] == NULL) + && (IsUnderPostmaster || !IsPostmasterEnvironment); formating does not look good. 9 === - if (trancheId >= LWLockTrancheNamesAllocated || - LWLockTrancheNames[trancheId] == NULL) - return "extension"; + if (trancheId < LWLockTrancheNames.allocated) + tranche_name = LWLockTrancheNames.local[trancheId]; - return LWLockTrancheNames[trancheId]; + if (!tranche_name) + elog(ERROR, "tranche ID %d is not registered", tranche_id_saved); We now error out instead of returning "extension". That looks ok given the up-thread discussion but then the commit message needs some updates as it states: " Additionally, while users should not pass arbitrary tranche IDs (that is, IDs not created via LWLockNewTrancheId) to LWLockInitialize, nothing technically prevents them from doing so. 
Therefore, we must continue to handle such cases gracefully by returning a default "extension" tranche name. " 10 === +LWLockTrancheNamesInitSize() +{ + Size sz; + + /* + * This value is used by other facilities, see pgstat_shmem.c, so it's + * good enough. + */ + sz = 256 * 1024; use DSA_MIN_SEGMENT_SIZE? 11 === + if (!IsUnderPostmaster) + { + dsa_area *dsa; + LWLockTrancheNamesShmem *ctl = LWLockTrancheNames.shmem; + char *p = (char *) ctl; + + /* Calculate layout within the shared memory region */ + p += MAXALIGN(sizeof(LWLockTrancheNamesShmem)); + ctl->raw_dsa_area = p; + p += MAXALIGN(LWLockTrancheNamesInitSize()); LWLockTrancheNamesInitSize() already does MAXALIGN() so the last one above is not needed. But the last p advance seems not necessary as not used after. I think the same is true in StatsShmemInit() (see [1]). 12 === +LWLockTrancheNamesBEInit(void) +{ + /* already attached, nothing to do */ + if (LWLockTrancheNames.dsa) + return; + + LWLockTrancheNamesAttach(); + + /* Set up a process-exit hook to clean up */ s/already/Already/? For 0002, a quick review: 13 === + if (log) + elog(DEBUG3, "current_allocated: %d in tranche names shared memory", new_alloc); Instead of this, could we elog distinct messages where the patch currently sets log = true? 14 === - xid_wraparound + xid_wraparound \ + test_lwlock_tranches breaks the ordering. 15 === subdir('worker_spi') subdir('xid_wraparound') +subdir('test_lwlock_tranches') Same, breaks the ordering. 16 === +my $ENABLE_LOG_WAIT = 1; + +my $isession; +my $log_location; + +# Helper function: wait for one or more logs if $ENABLE_LOG_WAIT is true +sub maybe_wait_for_log { + my ($node, $logs, $log_loc) = @_; + return $log_loc unless $ENABLE_LOG_WAIT; is ENABLE_LOG_WAIT needed as it does not change? [1]: https://www.postgresql.org/message-id/aKLsu2sdpnyeuSSc%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> 1 === > > +} LWLockTrancheNamesShmem; > > +} LWLockTrancheNamesStruct; > > Add LWLockTrancheNamesShmem and LWLockTrancheNamesStruct to > src/tools/pgindent/typedefs.list? done. > 2 === > > Maybe a comment before the above structs definitions to explain what they are > used for? done. > 3 === > > +static void > +SetSharedTrancheName(int tranche_index, const char *tranche_name) > { > - /* This should only be called for user-defined tranches. */ > - if (tranche_id < LWTRANCHE_FIRST_USER_DEFINED) > - return; > + dsa_pointer *name_ptrs; > + dsa_pointer str_ptr; > + char *str_addr; > + int len; > + int current_allocated; > + int new_alloc = 0; > > - /* Convert to array index. */ > - tranche_id -= LWTRANCHE_FIRST_USER_DEFINED; > + LWLockAcquire(&LWLockTrancheNames.shmem->lock, LW_EXCLUSIVE); > > Add Assert(IsUnderPostmaster || !IsPostmasterEnvironment);? done. This is required here since we should not be dealling with DSA during postmaster or single-user. > 4 === > > +static void > +SetLocalTrancheName(int tranche_index, const char *tranche_name) > +{ > + int newalloc; > + > + Assert(tranche_name); > > should we add some assert on IsUnderPostmaster and/or IsPostmasterEnvironment too? No, It's not needed here. SetLocalTrancheName is called during startup and by normal backends. > 5 === > > + while (i < NamedLWLockTrancheRequests) > + { > + NamedLWLockTranche *tranche; > + > + tranche = &NamedLWLockTrancheArray[i]; > + > + SetLocalTrancheName(i, tranche->trancheName); > + > + i++; > + } > > Maybe add a comment that those are the ones allocated by the postmaster during > startup? > > Also, as this will be done during each sync and those tranches don't change, > so one could think there is room for improvement. Maybe add a comment that's > probably not worth optimizing (due to the fact that NamedLWLockTrancheRequests > should be small enough and the sync rare)? done. > 6 === > > There is this existing comment: > > /* > * NamedLWLockTrancheRequests is both the valid length of the request array, > * and the length of the shared-memory NamedLWLockTrancheArray later on. > * This variable and NamedLWLockTrancheArray are non-static so that > * postmaster.c can copy them to child processes in EXEC_BACKEND builds. > */ > int NamedLWLockTrancheRequests = 0; > > Maybe add something like? "Additional dynamic tranche names beyond this count > are stored in a DSA". No. I don't like talking about that here. This is already described in: ``` * 3. Extensions can create new tranches using either RequestNamedLWLockTranche * or LWLockNewTrancheId. Tranche names are reg ``` > 7 === > > + old_ptrs = dsa_get_address(LWLockTrancheNames.dsa, > + LWLockTrancheNames.shmem->list_ptr); > + > + name_ptrs = dsa_get_address(LWLockTrancheNames.dsa, new_list); > + > + memset(name_ptrs, InvalidDsaPointer, new_alloc); > + memcpy(name_ptrs, old_ptrs, sizeof(dsa_pointer) * current_allocated); > + > + dsa_free(LWLockTrancheNames.dsa, LWLockTrancheNames.shmem->list_ptr); > > maybe use local variable for LWLockTrancheNames.shmem->list_ptr (and > LWLockTrancheNames.dsa)? hmm, I don't see the point to this. > > 8 === > > + needs_sync = (trancheId >= LWLockTrancheNames.allocated || > + LWLockTrancheNames.local[trancheId] == NULL) > + && (IsUnderPostmaster || !IsPostmasterEnvironment); > > formating does not look good. pgindent told me it's fine. But, I did add a parenthesis around the expression, so pgindent now aligned the conditions in a better way. 
> 9 === > > - if (trancheId >= LWLockTrancheNamesAllocated || > - LWLockTrancheNames[trancheId] == NULL) > - return "extension"; > + if (trancheId < LWLockTrancheNames.allocated) > + tranche_name = LWLockTrancheNames.local[trancheId]; > > - return LWLockTrancheNames[trancheId]; > + if (!tranche_name) > + elog(ERROR, "tranche ID %d is not registered", tranche_id_saved); > > We now error out instead of returning "extension". That looks ok given the > up-thread discussion but then the commit message needs some updates as it > states: > " > Additionally, while users should not pass arbitrary tranche IDs (that is, > IDs not created via LWLockNewTrancheId) to LWLockInitialize, nothing > technically prevents them from doing so. Therefore, we must continue to > handle such cases gracefully by returning a default "extension" tranche name. > " done. > 10 === > > +LWLockTrancheNamesInitSize() > +{ > + Size sz; > + > + /* > + * This value is used by other facilities, see pgstat_shmem.c, so it's > + * good enough. > + */ > + sz = 256 * 1024; > > use DSA_MIN_SEGMENT_SIZE? done. > 11 === > > + if (!IsUnderPostmaster) > + { > + dsa_area *dsa; > + LWLockTrancheNamesShmem *ctl = LWLockTrancheNames.shmem; > + char *p = (char *) ctl; > + > + /* Calculate layout within the shared memory region */ > + p += MAXALIGN(sizeof(LWLockTrancheNamesShmem)); > + ctl->raw_dsa_area = p; > + p += MAXALIGN(LWLockTrancheNamesInitSize()); > > LWLockTrancheNamesInitSize() already does MAXALIGN() so the last one > above is not needed. But the last p advance seems not necessary as not used > after. I think the same is true in StatsShmemInit() (see [1]). yeah, good point. done. > 12 === > > +LWLockTrancheNamesBEInit(void) > +{ > + /* already attached, nothing to do */ > + if (LWLockTrancheNames.dsa) > + return; > + > + LWLockTrancheNamesAttach(); > + > + /* Set up a process-exit hook to clean up */ > > s/already/Already/? done. > For 0002, a quick review: > > 13 === > > + if (log) > + elog(DEBUG3, "current_allocated: %d in tranche names shared memory", new_alloc); > > Instead of this, could we elog distinct messages where the patch currently sets > log = true? ok, that was my initial thought, but I did not like repeating the messages. I went back to distinct messages. I have no strong feelings either way. > > 14 === > > - xid_wraparound > + xid_wraparound \ > + test_lwlock_tranches > > breaks the ordering. > > 15 === > > subdir('worker_spi') > subdir('xid_wraparound') > +subdir('test_lwlock_tranches') > > Same, breaks the ordering. > > 16 === done and done. > +my $ENABLE_LOG_WAIT = 1; > + > +my $isession; > +my $log_location; > + > +# Helper function: wait for one or more logs if $ENABLE_LOG_WAIT is true > +sub maybe_wait_for_log { > + my ($node, $logs, $log_loc) = @_; > + return $log_loc unless $ENABLE_LOG_WAIT; > > is ENABLE_LOG_WAIT needed as it does not change? That was a remnant of my testing. removed. see v9 -- Sami
Attachments
> see v9 Sorry about the quick follow-up; I was not happy with one of the comment updates in SyncLWLockTrancheNames. I updated it. Attached is v10. -- Sami
Attachments
On Mon, Aug 18, 2025 at 01:06:42PM -0500, Sami Imseih wrote: > Attached is v10, I've been staring at the latest patch for a bit, and I'm a bit concerned at how much complexity it adds. I think it's a good idea to keep a local array of tranche names indexed by tranche ID, but the code for managing the list of DSA pointers scares me. I know we were trying to avoid using dshash earlier, if for no other reason than it's perhaps not the best data structure for the job, but ISTM we'd eliminate a lot of complexity if we offloaded the shmem pieces to a dshash table (or some other shmem-based data structure we have yet to introduce, like a dslist/dsarray). WDYT? -- nathan
> I've been staring at the latest patch for a bit, and I'm a bit concerned > at how much complexity it adds. The complexity comes from the fact that we have to track a dsa_pointer that points to an array of dsa_pointers that track the tranche names. We also have to copy the array of dsa_pointers into a new allocation when enlarging. That adds some complexity that dshash would hide away, but I don't think it's overly complex. > I know we were trying to avoid using > dshash earlier, if for no other reason than it's perhaps not the best data > structure for the job, yeah, we really just need an append-only structure, so dshash seemed unnecessary. But also, we moved away from dshash before it was decided to have a local cache. Perhaps it's OK now to go back to dshash considering that it will only be used in rare scenarios: creation of new tranches and syncs to local memory. Most of the time, tranche lookups will be satisfied directly from the local cache. > (or some other shmem-based > data structure we have yet to introduce, like a dslist/dsarray). This will be an interesting API to invest time in, if there are more use cases. I think it's a separate discussion at this point. -- Sami
Hi, On Mon, Aug 18, 2025 at 05:53:44PM -0500, Sami Imseih wrote: > > I've been staring at the latest patch for a bit, and I'm a bit concerned > > at how much complexity it adds. > > The complexity is the fact we have to track a dsa_pointer that points > to an array of dsa_pointers that track the tranche names. We also > do have to copy the array of dsa_pointers into a new pointer when > enlarging. That adds some complexity that dshash would hide away, > but I don't think it's overly complex. I also think that's not overly complex, but OTOH why add extra complexity if we already have machinery at our disposal that takes care of what we need, i.e., resizing and so on. > > I know we were trying to avoid using > > dshash earlier, if for no other reason than it's perhaps not the best data > > structure for the job, I can see that Nathan's concern about dshash was that it could be a bit slower ([1]). That's probably true but that does not worry me that much given that adding tranches should be rare enough and the lookups should be rare (to sync the local cache) and not be that sensitive to performance. > yeah, we really just need an append only structure, so dshash did not > seems unnecessary. But also, we moved away from dshash before it was > decided to have a local cache. Perhaps, it's ok now to go back to dshash > considering that it will only be used in rare scenarios: creations of > new tranches and > syncs to local memory. Most of the time the tranches will be satisfied > directly from local cache. Yes. > > (or some other shmem-based > > data structure we have yet to introduce, like a dslist/dsarray). > > This will be an interesting API to invest time in, if there could be more > use-cases. I did a quick check and I did not find current use cases: possible candidates could be ExecParallelHashTableAlloc(), PTIterationArray, and PTEntryArray, for example, but I think they all know the final size upfront so there is no real need for dsarray for those. > I think it's a separate discussion at this point. OTOH, that would be a valid use case to introduce this new API, but I'm not sure it's worth it given that dshash looks good enough for our case (even if not ideal). [1]: https://www.postgresql.org/message-id/aHV5BsKoSeOkIsNL%40nathan Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Aug 19, 2025 at 08:09:53AM +0000, Bertrand Drouvot wrote: > On Mon, Aug 18, 2025 at 05:53:44PM -0500, Sami Imseih wrote: >> > (or some other shmem-based >> > data structure we have yet to introduce, like a dslist/dsarray). >> >> This will be an interesting API to invest time in, if there could be more >> use-cases. > > I did a quick check and I did not find current use cases: possible candidates > could be in ExecParallelHashTableAlloc(), PTIterationArray, PTEntryArray for > examples but I think they all know the final size upfront so there is no real > need for dsarray for those). > >> I think it's a separate discussion at this point. > > OTOH, that would be a valid use case to introduce this new API but I'm not sure > it's worth it given that the dshash looks good enough for our case (even if not > ideal though). IMHO it'd be okay to proceed with dshash for now. It would be pretty easy to switch to something like a "dslist" in the future. -- nathan
Hi, On 2025-08-19 12:29:14 -0500, Nathan Bossart wrote: > On Tue, Aug 19, 2025 at 08:09:53AM +0000, Bertrand Drouvot wrote: > > On Mon, Aug 18, 2025 at 05:53:44PM -0500, Sami Imseih wrote: > >> > (or some other shmem-based > >> > data structure we have yet to introduce, like a dslist/dsarray). > >> > >> This will be an interesting API to invest time in, if there could be more > >> use-cases. > > > > I did a quick check and I did not find current use cases: possible candidates > > could be in ExecParallelHashTableAlloc(), PTIterationArray, PTEntryArray for > > examples but I think they all know the final size upfront so there is no real > > need for dsarray for those). > > > >> I think it's a separate discussion at this point. > > > > OTOH, that would be a valid use case to introduce this new API but I'm not sure > > it's worth it given that the dshash looks good enough for our case (even if not > > ideal though). > > IMHO it'd be okay to proceed with dshash for now. It would be pretty easy > to switch to something like a "dslist" in the future. Possibly stupid question - is it really worth having a dynamic structure here? The number of tranches is strictly bounded; it seems like it'd be simpler to have an array of tranche names in shared memory. Greetings, Andres Freund
On Tue, Aug 19, 2025 at 02:06:50PM -0400, Andres Freund wrote: > Possibly stupid question - is it really worth having a dynamic structure here? > The number of tranches is strictly bound, it seems like it'd be simpler to > have an array of tranch nmes in shared memory. Tranches can be allocated post-startup with LWLockNewTrancheId() (e.g., autoprewarm). -- nathan
Hi, On 2025-08-19 13:31:35 -0500, Nathan Bossart wrote: > On Tue, Aug 19, 2025 at 02:06:50PM -0400, Andres Freund wrote: > > Possibly stupid question - is it really worth having a dynamic structure here? > > The number of tranches is strictly bound, it seems like it'd be simpler to > > have an array of tranch nmes in shared memory. > > Tranches can be allocated post-startup with LWLockNewTrancheId() (e.g., > autoprewarm). Sure, but we don't need to support a large number of tranches. Just make it, idk, 128 entries long. Adding a dynamically allocated dsm to every server seems like a waste - every shared mapping makes fork / exit slower... Greetings, Andres Freund
On Tue, Aug 19, 2025 at 02:37:19PM -0400, Andres Freund wrote: > On 2025-08-19 13:31:35 -0500, Nathan Bossart wrote: >> On Tue, Aug 19, 2025 at 02:06:50PM -0400, Andres Freund wrote: >> > Possibly stupid question - is it really worth having a dynamic structure here? >> > The number of tranches is strictly bound, it seems like it'd be simpler to >> > have an array of tranch nmes in shared memory. >> >> Tranches can be allocated post-startup with LWLockNewTrancheId() (e.g., >> autoprewarm). > > Sure, but we don't need to support a large number of tranches. Just make it, > idk, 128 entries long. Adding a dynamically allocated dsm to every server > seems like a waste - ever shared mapping makes fork / exit slower... The other issue is that there's presently no limit on the length of a tranche name registered via LWLockRegisterTranche(). Life would become much simpler if we're willing to put a limit on both that and the number of tranches, but thus far we've been trying to avoid it. -- nathan
Nathan Bossart <nathandbossart@gmail.com> writes: > On Tue, Aug 19, 2025 at 02:37:19PM -0400, Andres Freund wrote: >> Sure, but we don't need to support a large number of tranches. Just make it, >> idk, 128 entries long. Adding a dynamically allocated dsm to every server >> seems like a waste - ever shared mapping makes fork / exit slower... > The other issue is that there's presently no limit on the length of a > tranche name registered via LWLockRegisterTranche(). Life would become > much simpler if we're willing to put a limit on both that and the number of > tranches, but thus far we've been trying to avoid it. I can hardly imagine a reason why it wouldn't be okay to limit the lengths of tranche names. But especially so if an unlimited length causes practical problems. regards, tom lane
> Nathan Bossart <nathandbossart@gmail.com> writes: > > On Tue, Aug 19, 2025 at 02:37:19PM -0400, Andres Freund wrote: > >> Sure, but we don't need to support a large number of tranches. Just make it, > >> idk, 128 entries long. Adding a dynamically allocated dsm to every server > >> seems like a waste - ever shared mapping makes fork / exit slower... > > > The other issue is that there's presently no limit on the length of a > > tranche name registered via LWLockRegisterTranche(). Life would become > > much simpler if we're willing to put a limit on both that and the number of > > tranches, but thus far we've been trying to avoid it. > > I can hardly imagine a reason why it wouldn't be okay to limit the > lengths of tranche names. But especially so if an unlimited length > causes practical problems. If we limit the tranche name to NAMEDATALEN and also limit the number of tranches an extension can register, we can put this all in static shared memory (We would still need to have a backend local cache to allow lookups to avoid going to shared memory). However, I am not sure what the limit would be for the number of tranches, and if we do impose something, will it break anything that is out there? -- Sami
On Tue, Aug 19, 2025 at 03:52:33PM -0500, Sami Imseih wrote: > If we limit the tranche name to NAMEDATALEN and also limit the > number of tranches an extension can register, we can put this > all in static shared memory (We would still need to have a backend local > cache to allow lookups to avoid going to shared memory). I bet we could avoid the local cache by keeping a backend-local copy of LWLockCounter that gets updated as needed. > However, I am not sure what the limit would be for the number of tranches, > and if we do impose something, will it break anything that is out there? I can think of contrived scenarios where these limits would present problems, but I'm not aware of anything that folks are actually doing that would be affected. In general, the examples I've seen involve allocating a small number of tranches for an extension, so my assumption is that the most likely cause of breakage would be someone installing many, many extensions. -- nathan
> On Tue, Aug 19, 2025 at 03:52:33PM -0500, Sami Imseih wrote: > > If we limit the tranche name to NAMEDATALEN and also limit the > > number of tranches an extension can register, we can put this > > all in static shared memory (We would still need to have a backend local > > cache to allow lookups to avoid going to shared memory). > > I bet we could avoid the local cache by keeping a backend-local copy of > LWLockCounter that gets updated as needed. maybe. If we agree to impose limits ( both name length and # of tranches ), that will allow us to do things a bit different. If there is agreement on setting limits, may I propose 1024 tranches and NAMEDATALEN. Both seem reasonably sufficient. -- Sami
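To make the fixed-size alternative concrete, here is a minimal sketch of what reserving a plain shared-memory array for tranche names could look like, using the limits proposed above (1024 entries of NAMEDATALEN bytes); the names are illustrative, and the exact limits were still being debated at this point in the thread:

```
#include "postgres.h"

#include "storage/shmem.h"

/* illustrative limits, per the proposal above */
#define SKETCH_MAX_NAMED_TRANCHES	1024
#define SKETCH_TRANCHE_NAME_LEN		NAMEDATALEN

/* shared-memory size to request for the tranche-name array */
static Size
TrancheNamesShmemSize(void)
{
	return mul_size(SKETCH_MAX_NAMED_TRANCHES, SKETCH_TRANCHE_NAME_LEN);
}

/* allocate (or attach to) the array in the main shared memory segment */
static char *
TrancheNamesShmemInit(void)
{
	bool		found;
	char	   *names;

	names = ShmemInitStruct("Named LWLock tranche names",
							TrancheNamesShmemSize(), &found);
	if (!found)
		memset(names, 0, TrancheNamesShmemSize());

	return names;
}
```

With a bound on both the name length and the number of tranches, nothing here needs a DSA segment, which addresses the fork/exit concern raised earlier in the thread.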
On Tue, Aug 19, 2025 at 04:16:26PM -0500, Sami Imseih wrote: > If there is agreement on setting limits, may I propose > 1024 tranches and NAMEDATALEN. Both seem reasonably sufficient. Let's proceed with that approach for now. We can worry about the exact limits once this is closer to commit. -- nathan
>> If there is agreement on setting limits, may I propose >> 1024 tranches and NAMEDATALEN. Both seem reasonably sufficient. > Let's proceed with that approach for now. We can worry about the exact > limits once this is closer to commit. v11 implements the fixed-size shared memory as discussed. So, as in previous versions, we got rid of LWLockRegisterTranche entirely, and users can register tranches with either RequestNamedLWLockTranche (during startup) or LWLockNewTrancheId (after startup). The NamedLWLockTrancheArray, previously allocated at the end of MainLWLockArray with a fixed size limited to the number of tranches requested via RequestNamedLWLockTranche, has been changed in v11. It can now grow, up to a maximum number of total tranches, currently 1024, indexed by LWLockCounter. We don't need to keep track of the number of tranches in a separate variable, and can rely on LWLockCounter only. I kept the local array to serve consecutive reads and to avoid having to take a shared lock on shared memory every time GetLWTrancheName is called. A new LWLock to protect this array is required. LWLockInitialize also errors if the tranche used has not been registered, as was the case in the previous patch versions. Several error messages to enforce the fixed-size limits have also been added. The next patch set has 0001, which is the core change, and 0002, which contains the tests (not sure if we need them, but they help in the review). The tests in 0002 fail on EXEC_BACKEND due to a newly discovered bug [0] that repros on HEAD and likely other branches. -- Sami [0] https://www.postgresql.org/message-id/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com
Attachments
On Fri, Aug 22, 2025 at 3:01 PM Sami Imseih <samimseih@gmail.com> wrote: > > >> If there is agreement on setting limits, may I propose > >> 1024 tranches and NAMEDATALEN. Both seem reasonably sufficient. > > > Let's proceed with that approach for now. We can worry about the exact > > limits once this is closer to commit. > > v11 implements the fixed size shared memory as discussed. One point I did not make earlier is that the tranche name lengths will need to be as long as we allow in dsm_registry.c. ``` #define DSMR_NAME_LEN 128 #define DSMR_DSA_TRANCHE_SUFFIX " DSA" #define DSMR_DSA_TRANCHE_SUFFIX_LEN (sizeof(DSMR_DSA_TRANCHE_SUFFIX) - 1) #define DSMR_DSA_TRANCHE_NAME_LEN (DSMR_NAME_LEN + DSMR_DSA_TRANCHE_SUFFIX_LEN) ``` We should make the limits external (in lwlock.h) so that dsm_registry.c can enforce them based on those values. -- Sami
Sami Imseih <samimseih@gmail.com> writes: > One point I did not make earlier is that the tranche name lengths will > need to be as long as we allow in dsm_registry.c. > #define DSMR_NAME_LEN 128 Huh. Why is that different from NAMEDATALEN in the first place? regards, tom lane
On Fri, Aug 22, 2025 at 05:33:55PM -0400, Tom Lane wrote: > Sami Imseih <samimseih@gmail.com> writes: >> One point I did not make earlier is that the tranche name lengths will >> need to be as long as we allow in dsm_registry.c. > >> #define DSMR_NAME_LEN 128 > > Huh. Why is that different from NAMEDATALEN in the first place? One of the reviewers of commit fe07100 requested it [0]. At the time, there was no reason to use NAMEDATALEN. I'm fine with lowering it if needed for the shared tranche name work. Presently, we also append a " DSA" suffix to the DSA tranche name in GetNamedDSHash(). I was originally trying to use separate tranches for the DSA and the dshash table for observability purposes, but since it seems to be more trouble than it's worth, perhaps we should only allocate one tranche. [0] https://postgr.es/m/58DB9EB7-4646-4508-84BA-F0F067A7E8BA%40gmail.com -- nathan
On Fri, Aug 22, 2025 at 03:01:53PM -0500, Sami Imseih wrote: > I kept the local array to serve consecutive reads and to avoid having to > take a shared lock on shared memory every time GetLWTrancheName is > called. A new LWLock to protect this array is required. I'm not seeing why we need this cache anymore. This is an append-only list, so we could instead keep a backend-local copy of LWLockCounter that gets updated as needed. As long as the ID is less than our backend-local counter, we can go straight to the shared array. If it is greater, we'll have to first update our counter, which should be rare and inexpensive. -- nathan
> On Fri, Aug 22, 2025 at 03:01:53PM -0500, Sami Imseih wrote: > > I kept the local array to serve consecutive reads and to avoid having to > > take a shared lock on shared memory every time GetLWTrancheName is > > called. A new LWLock to protect this array is required. > > I'm not seeing why we need this cache anymore. This is an append-only > list, so we could instead keep a backend-local copy of LWLockCounter that > gets updated as needed. As long as the ID is less than our backend-local > counter, we can go straight to the shared array. If it is greater, we'll > have to first update our counter, which should be rare and inexpensive. When we look up from the shared array only, we need to take a shared lock on every lookup. Acquiring that lock is what I am trying to avoid. You are saying it's not worth optimizing that part, correct? -- Sami
On Mon, Aug 25, 2025 at 04:37:21PM -0500, Sami Imseih wrote: > When we lookup from shared array only, we need to take a shared lock > every lookup. Acquiring that lock is what I am trying to avoid. You > are saying it's not worth optimizing that part, correct? Why do we need a shared lock here? IIUC there's no chance that existing entries will change. We'll only ever add new ones to the end. -- nathan
> On Mon, Aug 25, 2025 at 04:37:21PM -0500, Sami Imseih wrote: > > When we lookup from shared array only, we need to take a shared lock > > every lookup. Acquiring that lock is what I am trying to avoid. You > > are saying it's not worth optimizing that part, correct? > > Why do we need a shared lock here? IIUC there's no chance that existing > entries will change. We'll only ever add new ones to the end. hmm, can we really avoid a shared lock when reading from shared memory? considering access for both reads and writes can be concurrent to shared memory. We are also taking an exclusive lock when writing a new tranche. -- Sami
On Mon, Aug 25, 2025 at 04:59:41PM -0500, Sami Imseih wrote: > hmm, can we really avoid a shared lock when reading from shared memory? > considering access for both reads and writes can be concurrent to shared > memory. We are also taking an exclusive lock when writing a new tranche. We probably want to hold a lock 1) while we increment LWLockCounter and copy a new tranche name to memory, and 2) while we copy the current value of LWLockCounter to our backend-local variable. Otherwise, AFAICT we don't need one. We could probably use ShmemLock for this. -- nathan
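A minimal sketch of the read path being described, assuming an append-only shared array and a shared counter protected by ShmemLock for writers; LocalLWLockCounter, SharedLWLockCounter, and SharedTrancheNames are illustrative names rather than the patch's:

```
#include "postgres.h"

#include "storage/spin.h"

extern slock_t *ShmemLock;			/* existing spinlock defined in shmem.c */

/* illustrative state; the shared pieces would be set up at startup */
static int	LocalLWLockCounter = 0; /* backend-local copy of the counter */
static int *SharedLWLockCounter;	/* lives in shared memory */
static const char **SharedTrancheNames;	/* append-only shared array */

static const char *
tranche_name_lookup(int tranche_idx)
{
	/*
	 * Entries never change once written, so if the index is below our local
	 * copy of the counter, we can read the shared array without any lock.
	 */
	if (tranche_idx >= LocalLWLockCounter)
	{
		/* Rare path: refresh the local counter under ShmemLock. */
		SpinLockAcquire(ShmemLock);
		LocalLWLockCounter = *SharedLWLockCounter;
		SpinLockRelease(ShmemLock);

		if (tranche_idx >= LocalLWLockCounter)
			elog(ERROR, "tranche index %d is not registered", tranche_idx);
	}

	return SharedTrancheNames[tranche_idx];
}
```

The spinlock acquire/release around the counter refresh also provides the memory ordering needed for the subsequent unlocked reads of entries written before that refresh.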
> On Mon, Aug 25, 2025 at 04:59:41PM -0500, Sami Imseih wrote: > > hmm, can we really avoid a shared lock when reading from shared memory? > > considering access for both reads and writes can be concurrent to shared > > memory. We are also taking an exclusive lock when writing a new tranche. > > We probably want to hold a lock while we 1) increment LWLockCounter and > copy a new tranche name to memory and In the last rev, I removed the spinlock acquired on ShmemLock in favor of an LWLock. This is because I wanted a single LWLock acquisition while both incrementing LWLockCounter and writing to shared memory, and doing this much work, particularly writing to shared memory, with a spinlock seemed inappropriate. With that said, this is not high-concurrency or performance-sensitive activity at all, so perhaps I was being overly paranoid. -- Sami
> > On Mon, Aug 25, 2025 at 04:59:41PM -0500, Sami Imseih wrote: > > > hmm, can we really avoid a shared lock when reading from shared memory? > > > considering access for both reads and writes can be concurrent to shared > > > memory. We are also taking an exclusive lock when writing a new tranche. > > > > We probably want to hold a lock while we 1) increment LWLockCounter and > > copy a new tranche name to memory and > > In the last rev, I removed the spinlock acquired on ShmemLock in-lieu of > a LWLock. This is because I wanted a single LWLock acquisition while > both incrementing LWLockCounter and writing to shared memory, and > doing this much work, particularly writing to shared memory, > with a spinlock seemed inappropriate. With that said, this is not high > concurrency of performance sensitive activity at all, so perhaps I was > being overly paranoid. Here is v12 that replaces the LWLock to access the shared memory with a ShmemLock and implements a local counter. -- Sami
Attachments
On Tue, Aug 26, 2025 at 02:56:22PM -0500, Sami Imseih wrote: > Here is v12 that replaces the LWLock to access the shared memory with a > ShmemLock and implements a local counter. This looks much closer to what I was imagining. /* Initialize the LWLock tranche for the DSA. */ - dsa_state->tranche = LWLockNewTrancheId(); + dsa_state->tranche = LWLockNewTrancheId(dsa_state->tranche_name); sprintf(dsa_state->tranche_name, "%s%s", name, DSMR_DSA_TRANCHE_SUFFIX); - LWLockRegisterTranche(dsa_state->tranche, dsa_state->tranche_name); /* Initialize the LWLock tranche for the dshash table. */ - dsh_state->tranche = LWLockNewTrancheId(); + dsh_state->tranche = LWLockNewTrancheId(dsh_state->tranche_name); strcpy(dsh_state->tranche_name, name); - LWLockRegisterTranche(dsh_state->tranche, dsh_state->tranche_name); We probably need to do the sprintf/strcpy before LWLockNewTrancheId(). Also, I'm thinking we should just use the same tranche for both the DSA and the dshash table [0] to evade the DSMR_DSA_TRANCHE_SUFFIX problem, i.e., those tranche names potentially require more space. + /* Space for name of each tranche. */ + size = add_size(size, mul_size(MAX_NAMED_TRANCHES, MAX_NAMED_TRANCHES)); Presumably one of the mul_size() arguments should be MAX_NAMED_TRANCHES_NAME_LEN. +NamedLWLockTrancheArray "Waiting to access the named LWLock tranches array." +PG_LWLOCK(54, NamedLWLockTrancheArray) These can be removed now, right? [0] https://postgr.es/m/aKzIg1JryN1qhNuy%40nathan -- nathan
Fixed the issues mentioned above in v13. > We probably need to do the sprintf/strcpy before LWLockNewTrancheId(). > Also, I'm thinking we should just use the same tranche for both the DSA and > the dshash table [0] to evade the DSMR_DSA_TRANCHE_SUFFIX problem, i.e., > those tranche names potentially require more space. v13 also includes a separate patch for this. It removes the DSMR_DSA_TRANCHE_SUFFIX, and dsh and dsa now use the same user-defined tranche. The tranche name is also limited to the max length of a named tranche, MAX_NAMED_TRANCHES_NAME_LEN, coming from lwlock.h. -- Sami
Attachments
Hi, On Tue, Aug 26, 2025 at 05:50:34PM -0500, Sami Imseih wrote: > fixed the issues mentioned above in v13. > > > We probably need to do the sprintf/strcpy before LWLockNewTrancheId(). > > Also, I'm thinking we should just use the same tranche for both the DSA and > > the dshash table [0] to evade the DSMR_DSA_TRANCHE_SUFFIX problem, i.e., > > those tranche names potentially require more space. > > v13 also includes a separate patch for this. It removes the > DSMR_DSA_TRANCHE_SUFFIX, and dsh and dsa now use the same > user defined tranche. The tranche name is also limited to > the max length of a named tranche, MAX_NAMED_TRANCHES_NAME_LEN, > coming from lwlock.h Thanks for the new version. === 1 +LWLockNewTrancheId(const char *tranche_name) { - int result; - int *LWLockCounter; + int tranche_id, + index, + *LWLockCounter; + Size tranche_name_length = strlen(tranche_name) + 1; We need to check if tranche_name is NULL and report an error if that's the case. If not, strlen() would segfault. === 2 + if (tranche_name_length > MAX_NAMED_TRANCHES_NAME_LEN) + elog(ERROR, "tranche name too long"); I think that we should mention in the doc that the tranche name is limited to 63 bytes. === 3 - /* Convert to array index. */ - tranche_id -= LWTRANCHE_FIRST_USER_DEFINED; + tranche_id = (*LWLockCounter)++; + index = tranche_id - LWTRANCHE_FIRST_USER_DEFINED; - /* If necessary, create or enlarge array. */ - if (tranche_id >= LWLockTrancheNamesAllocated) + if (index >= MAX_NAMED_TRANCHES) { - int newalloc; + SpinLockRelease(ShmemLock); + elog(ERROR, "too many LWLock tranches registered"); + } - newalloc = pg_nextpower2_32(Max(8, tranche_id + 1)); + /* Directly copy into the NamedLWLockTrancheArray */ + strcpy((char *) NamedLWLockTrancheArray + (index * MAX_NAMED_TRANCHES_NAME_LEN), + tranche_name); I was skeptical about using strcpy() while we hold a spinlock. I do see some examples with strlcpy() though (walreceiver.c for example), so that looks OK-ish. Using strcpy() might be OK too, as we already have validated the length, but maybe it would be safer to switch to strlcpy(), instead? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Thanks for reviewing! > === 1 > > We need to check if tranche_name is NULL and report an error if that's the case. > If not, strlen() would segfault. Added an error. Good call. The error message follows previously used convention. ``` + if (!tranche_name) + elog(ERROR, "tranche name cannot be null"); ``` > === 2 > > + if (tranche_name_length > MAX_NAMED_TRANCHES_NAME_LEN) > + elog(ERROR, "tranche name too long"); > > I think that we should mention in the doc that the tranche name is limited to > 63 bytes. Done. I just mentioned NAMEDATALEN -1 in the docs. > === 3 > > I was skeptical about using strcpy() while we hold a spinlock. I do see some > examples with strlcpy() though (walreceiver.c for example), so that looks OK-ish. > > Using strcpy() might be OK too, as we already have validated the length, but maybe > it would be safer to switch to strlcpy(), instead? OK, since that is the pattern used, I changed to strlcpy. But since we are doing checks in advance, I think it will be safe either way. -- Sami
Attachments
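Putting the validation and locking points above together, here is a hedged sketch of what the registration path could look like with a fixed-size shared array; names such as SharedTrancheNames, SharedLWLockCounter, and the two limits are illustrative and do not necessarily match v14:

```
#include "postgres.h"

#include "storage/lwlock.h"
#include "storage/spin.h"

/* illustrative limits and shared state */
#define SKETCH_MAX_NAMED_TRANCHES	1024
#define SKETCH_TRANCHE_NAME_LEN		NAMEDATALEN

extern slock_t *ShmemLock;			/* existing spinlock defined in shmem.c */
static int *SharedLWLockCounter;	/* next tranche ID, in shared memory */
static char *SharedTrancheNames;	/* flat array of fixed-size name slots */

/*
 * Illustrative sketch: validate before taking the spinlock, keep the
 * critical section to a counter bump plus one strlcpy, and error out
 * cleanly when the limits are exceeded.
 */
static int
register_tranche_name(const char *tranche_name)
{
	int			tranche_id;
	int			idx;

	if (tranche_name == NULL)
		elog(ERROR, "tranche name cannot be NULL");
	if (strlen(tranche_name) >= SKETCH_TRANCHE_NAME_LEN)
		elog(ERROR, "tranche name too long");

	SpinLockAcquire(ShmemLock);

	idx = *SharedLWLockCounter - LWTRANCHE_FIRST_USER_DEFINED;
	if (idx >= SKETCH_MAX_NAMED_TRANCHES)
	{
		SpinLockRelease(ShmemLock);
		elog(ERROR, "too many LWLock tranches registered");
	}

	tranche_id = (*SharedLWLockCounter)++;
	strlcpy(SharedTrancheNames + idx * SKETCH_TRANCHE_NAME_LEN,
			tranche_name, SKETCH_TRANCHE_NAME_LEN);

	SpinLockRelease(ShmemLock);

	return tranche_id;
}
```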
Hi, On Wed, Aug 27, 2025 at 02:13:39PM -0500, Sami Imseih wrote: > Thanks for reviewing! > > > === 1 > > > > We need to check if tranche_name is NULL and report an error if that's the case. > > If not, strlen() would segfault. > > Added an error. Good call. The error message follows previously used > convention. > > ``` > + if (!tranche_name) > + elog(ERROR, "tranche name cannot be null"); > ``` +LWLockNewTrancheId(const char *tranche_name) { - int result; - int *LWLockCounter; + int tranche_id, + index, + *LWLockCounter; + Size tranche_name_length = strlen(tranche_name) + 1; - LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int)); - /* We use the ShmemLock spinlock to protect LWLockCounter */ - SpinLockAcquire(ShmemLock); - result = (*LWLockCounter)++; - SpinLockRelease(ShmemLock); + if (!tranche_name) + elog(ERROR, "tranche name cannot be null"); The check has to be done before the strlen() call, if not it segfault: Breakpoint 1, LWLockNewTrancheId (tranche_name=0x0) at lwlock.c:569 569 Size tranche_name_length = strlen(tranche_name) + 1; (gdb) n Program received signal SIGSEGV, Segmentation fault. > > === 2 > > > > + if (tranche_name_length > MAX_NAMED_TRANCHES_NAME_LEN) > > + elog(ERROR, "tranche name too long"); > > > > I think that we should mention in the doc that the tranche name is limited to > > 63 bytes. > > Done. I just mentioned NAMEDATALEN -1 in the docs. Most of the places where NAMEDATALEN is mentioned in sgml files also mention it's 64 bytes. Should we do the same here? > > === 3 > > > > I was skeptical about using strcpy() while we hold a spinlock. I do see some > > examples with strlcpy() though (walreceiver.c for example), so that looks OK-ish. > > > > Using strcpy() might be OK too, as we already have validated the length, but maybe > > it would be safer to switch to strlcpy(), instead? > > OK, since that is the pattern used, I changed to strlcpy. But since we are doing > checks in advance, I think it will be safe either way. Agree. - * stores the names of all dynamically-created tranches known to the current - * process. Any unused entries in the array will contain NULL. + * stores the names of all dynamically created tranches in shared memory. + * Any unused entries in the array will contain NULL. This variable is + * non-static so that postmaster.c can copy it to child processes in + * EXEC_BACKEND builds. "Any unused entries in the array will contain NULL": this is not true anymore. It now contains empty strings: (gdb) p *(char(*)[64])NamedLWLockTrancheArray@4 $5 = {"tutu", '\000' <repeats 59 times>, '\000' <repeats 63 times>, '\000' <repeats 63 times>, '\000' <repeats 63 times>} Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
> The check has to be done before the strlen() call, if not it segfault: I don't know what I was thinking there :( silly mistake. It was also missed in RequestNamedLWLockTranche. > Most of the places where NAMEDATALEN is mentioned in sgml files also mention > it's 64 bytes. Should we do the same here? I don't have an issue with that, although I did not like how the various docs are not consistent. Some claim 64 characters, others claim 63 bytes. They are not the same since multibyte characters are allowed. For example [0] is definitely wrong, and it should say "bytes". I think it would be nice if all references to NAMEDATALEN pointed to a single doc (a follow-up topic, perhaps). > "Any unused entries in the array will contain NULL": this is not true anymore. > It now contains empty strings: Yes. Fixed the comment. Also, it has not yet been discussed what the max number of named tranches should be. Thus far I have been using a likely extremely high value of 1024. In v15, I set it to 128 as that was a possibly more realistic number and one suggested earlier [1]. There may be some odd cases out there in which this value may not be enough; many extensions that init LWLocks, or maybe some extension out there that uses partitioned locks and assigns a unique tranche for every lock. I have not seen anything like this. Maybe it's not something to worry about and we can say 128 is sufficiently large. I would hate to reserve too much shared memory for this unnecessarily. Thoughts? [0] https://www.postgresql.org/docs/18/runtime-config-logging.html#GUC-APPLICATION-NAME [1] https://www.postgresql.org/message-id/5to6tftuml6nkas4jaaljfzecasvslxq3mumeslh74wsol4mzw%40rgxpxxlqqwtf -- Sami
Attachments
I've spent some time getting this prepared for commit, so apologies if I am steamrolling over some of the latest discussion points. The majority of what I've changed amounts to what I'd characterize as relatively minor editorialization, but I'd imagine reasonable people could disagree. The bigger things I've changed include: * I've moved the DSM registry adjustments to be a prerequisite patch. I noticed that we can also stop tracking the tranche names there since we always use the name of the DSM registry entry. * I modified the array of tranche names to be a char ** to simplify lookups. I also changed how it is first initialized in CreateLWLocks() a bit. * I've left out the tests for now. Those are great for the development phase, but I'm not completely sold on committing them. In any case, I'd probably treat that as a follow-up effort once this stuff is committed. I think this patch set will require reworking the "GetNamedLWLockTranche crashes on Windows in normal backend" patch [0], but AFAICT we can easily adjust it to scan through NamedLWLockTrancheNames instead. Overall, I'm pretty happy about these patches. They simplify the API and the code while also fixing the problem with tranche name visibility. [0] https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com -- nathan
Attachments
> * I modified the array of tranche names to be a char ** to simplify > lookups. I also changed how it is first initialized in CreateLWLocks() a > bit. That works also. > * I've left out the tests for now. Those are great for the development > phase, but I'm not completely sold on committing them. In any case, I'd > probably treat that as a follow-up effort once this stuff is committed. I was mainly using them to verify the functionality since we lack tests in this area. I think long-term, it will be good to have test coverage. We don't need to decide now. > I think this patch set will require reworking the "GetNamedLWLockTranche > crashes on Windows in normal backend" patch [0], but AFAICT we can easily > adjust it to scan through NamedLWLockTrancheNames instead. Except, we will need to turn the char **NamedLWLockTranche into a struct that will hold num_lwlocks as well. Something like the following, but that is doable. ``` typedef struct NamedLWLockTranche { char trancheName[NAMEDATALEN]; int num_lwlocks; } NamedLWLockTranche; ``` If there is no interest in backpatching [0], maybe we should just make this change as part of this patch set. What do you think? I can make this change in v18. > Overall, I'm pretty happy about these patches. They simplify the API and > the code while also fixing the problem with tranche name visibility. Just a few things that were discussed earlier, which I incorporated now: 1/ We should be checking that tranche_name is NOT NULL when LWLockNewTrancheId or RequestNamedLWLockTranche is called. Also check the length of the tranche name inside RequestNamedLWLockTranche, instead of relying on an Assert to check the length. I think that is much safer. 2/ Bertrand suggested using strlcpy instead of strcpy when copying while holding a spinlock [1]. 3/ There was a doc update to mention the 63-byte tranche length. We can skip that for now, since the ERROR message in v16 is clear about the limit. [0] https://postgr.es/m/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com [1] https://www.postgresql.org/message-id/CAA5RZ0ve-fHuNYW-ruMwg1y1v7-aCqMm_MiNq1KOdg2Y2-pKDw%40mail.gmail.com -- Sami
Hi,

On Thu, Aug 28, 2025 at 05:53:23PM -0500, Sami Imseih wrote:
> Just a few things that were discussed earlier, that I incorporated now.
>
> 1/ We should be checking that tranche_name is NOT NULL when
> LWLockNewTrancheId or RequestNamedLWLockTranche is called.

Right, otherwise strlen() segfaults.

In addition to checking for NULL, should we also check for an empty string? Currently, the patch does accept strlen(tranche_name) == 0.

> Also check for the length of the tranche name inside
> RequestNamedLWLockTranche, instead of relying on an Assert to check
> the length. I think that is much safer.

Same remark as above, but for RequestNamedLWLockTranche().

Regards,

-- Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
> On Thu, Aug 28, 2025 at 05:53:23PM -0500, Sami Imseih wrote:
> > Just a few things that were discussed earlier, that I incorporated now.
> >
> > 1/ We should be checking that tranche_name is NOT NULL when
> > LWLockNewTrancheId or RequestNamedLWLockTranche is called.
>
> Right, otherwise strlen() segfaults.
>
> In addition to checking for NULL, should we also check for an empty string? Currently,
> the patch does accept strlen(tranche_name) == 0.

I am not inclined to prevent an empty string. It's currently allowed, and I'd rather not change that.

> ```
> typedef struct NamedLWLockTranche
> {
>     char    trancheName[NAMEDATALEN];
>     int     num_lwlocks;
> } NamedLWLockTranche;
> ```
>
> If there is no interest in backpatching [0], maybe we should just make this
> change as part of this patch set. What do you think? I can make this change
> in v18.

Here is v18. It includes a third patch to fix the issue identified in [0], which can be applied to HEAD as part of this thread. If we want to backpatch to the stable branches, the version in [0] is suitable.

Note that I created LWLockNewTrancheIdInternal, which takes a tranche name and a number of LWLocks. The internal version is used during startup, when requested LWLocks are appended to shared memory, and the existing LWLockNewTrancheId calls the internal version with 0 LWLocks. This keeps all the logic for appending a new tranche (while holding the spinlock) in the same routine.

-- Sami
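Roughly the shape being described, as a sketch: the exact signature of LWLockNewTrancheIdInternal, the lock variable, and the error wording here are assumptions for illustration, not the v18 code.

```
static int LWLockNewTrancheIdInternal(const char *tranche_name, int num_lwlocks);

int
LWLockNewTrancheId(const char *tranche_name)
{
    /* Validate up front instead of relying on an Assert. */
    if (tranche_name == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("tranche name cannot be NULL")));

    if (strlen(tranche_name) >= NAMEDATALEN)
        ereport(ERROR,
                (errcode(ERRCODE_NAME_TOO_LONG),
                 errmsg("tranche name too long"),
                 errdetail("LWLock tranche names must be no longer than %d bytes.",
                           NAMEDATALEN - 1)));

    /* Dynamically assigned tranches request no preallocated LWLocks. */
    return LWLockNewTrancheIdInternal(tranche_name, 0);
}

static int
LWLockNewTrancheIdInternal(const char *tranche_name, int num_lwlocks)
{
    int         result;

    /* All appends happen here, while holding the spinlock. */
    SpinLockAcquire(&NamedLWLockTrancheLock);   /* placeholder lock name */

    if (*LWLockCounter >= MAX_NAMED_TRANCHES)
    {
        SpinLockRelease(&NamedLWLockTrancheLock);
        ereport(ERROR,
                (errmsg("maximum number of tranches already registered"),
                 errdetail("No more than %d tranches may be registered.",
                           MAX_NAMED_TRANCHES)));
    }

    result = *LWLockCounter;
    strlcpy(NamedLWLockTrancheNames[result], tranche_name, NAMEDATALEN);
    /* num_lwlocks would be recorded here too for startup-time requests. */
    (*LWLockCounter)++;
    LocalLWLockCounter = *LWLockCounter;

    SpinLockRelease(&NamedLWLockTrancheLock);

    return result + LWTRANCHE_FIRST_USER_DEFINED;
}
```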
On Thu, Aug 28, 2025 at 05:53:23PM -0500, Sami Imseih wrote:
>> I think this patch set will require reworking the "GetNamedLWLockTranche
>> crashes on Windows in normal backend" patch [0], but AFAICT we can easily
>> adjust it to scan through NamedLWLockTrancheNames instead.
>
> Except we will need to turn the char ** NamedLWLockTrancheNames into a struct
> that holds num_lwlocks as well. Something like the following, but that is
> doable.

Ah, right.

> If there is no interest in backpatching [0], maybe we should just make this
> change as part of this patch set. What do you think? I can make this change
> in v18.

I think we should keep that separate. I don't see any benefit to combining it, and it'd be good to keep the discussion in a single thread in the archives for future reference.

-- nathan
I've committed the DSM registry changes. I apologize for the lackluster commit message. My attempts at including all the details ended up being super wordy and hard to read, so I decided to keep it terse and leave out some of the context. I've also attached a rebased patch that addresses all the latest feedback. A reworked version of the test patch is also included, but that's mostly intended for CI purposes and is still not intended for commit (yet). -- nathan
On Fri, Aug 29, 2025 at 09:51:38PM -0500, Nathan Bossart wrote: > I've also attached a rebased patch that addresses all the latest feedback. > A reworked version of the test patch is also included, but that's mostly > intended for CI purposes and is still not intended for commit (yet). And here's an attempt at fixing the alignment problems revealed by cfbot. -- nathan
>> I've also attached a rebased patch that addresses all the latest feedback.
>> A reworked version of the test patch is also included, but that's mostly
>> intended for CI purposes and is still not intended for commit (yet).
>
> And here's an attempt at fixing the alignment problems revealed by cfbot.

This LGTM. Once we get this committed, I will follow up on [0].

[0] https://www.postgresql.org/message-id/CAA5RZ0v1_15QPg5Sqd2Qz5rh_qcsyCeHHmRDY89xVHcy2yt5BQ%40mail.gmail.com

-- Sami
Hi,

On Sat, Aug 30, 2025 at 09:14:46AM -0500, Nathan Bossart wrote:
> On Fri, Aug 29, 2025 at 09:51:38PM -0500, Nathan Bossart wrote:
> > I've also attached a rebased patch that addresses all the latest feedback.
> > A reworked version of the test patch is also included, but that's mostly
> > intended for CI purposes and is still not intended for commit (yet).
>
> And here's an attempt at fixing the alignment problems revealed by cfbot.

Indeed:

==24409==Using libbacktrace symbolizer.
../src/backend/storage/lmgr/lwlock.c:441:31: runtime error: store to misaligned address 0x7f7644188904 for type 'char *', which requires 8 byte alignment

Changes look good.

Not directly related, but I think that we can get rid of:

size = add_size(size, LWLOCK_PADDED_SIZE);

in LWLockShmemSize() and of:

ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;

in CreateLWLocks(), and just make use of CACHELINEALIGN(). Attached is a patch doing so. It applies on top of your v20.

Regards,

-- Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
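To make the suggestion easier to follow, roughly this is the before/after being proposed (a sketch of the idea, not the attached patch):

```
/* Currently, LWLockShmemSize() and CreateLWLocks() do the rounding by hand: */
size = add_size(size, LWLOCK_PADDED_SIZE);
ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE;

/* The suggestion is to lean on the existing cache-line macro instead: */
size = add_size(size, PG_CACHE_LINE_SIZE);
ptr = (char *) CACHELINEALIGN(ptr);
```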
Hi,
> I've also attached a rebased patch that addresses all the latest feedback.
> A reworked version of the test patch is also included, but that's mostly
> intended for CI purposes and is still not intended for commit (yet).
Please see below for some comments regarding v20 patch.
1. Following unused declaration is present in the lwlock.h file.
extern void LWLockRegisterTranche(int tranche_id, const char *tranche_name);
2. I am struggling to understand the use of LocalLWLockCounter in the patch.
If the intention is to delay acquiring the spin lock when trancheId is less than LocalLWLockCounter,
it seems possible that we might read NamedLWLockTrancheNames in the GetLWTrancheName function
after LocalLWLockCounter has been updated, but before NamedLWLockTrancheNames has been updated
in the LWLockNewTrancheId function.
To prevent reading an outdated tranche name, we should acquire the spin lock unconditionally in the
GetLWTrancheName function.
3.
+ ereport(ERROR,
+ (errmsg("maximum number of tranches already registered"),
+ errdetail("No more than %d tranches may be registered.",
+ MAX_NAMED_TRANCHES)));
+ }
Since this patch sets a maximum limit on the number of LW lock tranches that can be registered,
would it make sense to offer a configurable option rather than using a hard-coded MAX_NUM_TRANCHES?
This will allow users who have reached the maximum limit to register their LW Locks.
Thank you,
Rahila Syed
On Mon, Sep 01, 2025 at 10:18:46AM +0000, Bertrand Drouvot wrote: > Changes look good. Thanks for looking. > Not directly related, but I think that we can get rid of: > > size = add_size(size, LWLOCK_PADDED_SIZE); > > in LWLockShmemSize() and of: > > ptr += LWLOCK_PADDED_SIZE - ((uintptr_t) ptr) % LWLOCK_PADDED_SIZE; > > in CreateLWLocks(), and just make use of CACHELINEALIGN(). Let's take care of the tranche name stuff first. -- nathan
On Mon, Sep 01, 2025 at 06:35:55PM +0530, Rahila Syed wrote: > Please see below for some comments regarding v20 patch. Thanks for looking. > 1. Following unused declaration is present in the lwlock.h file. > > extern void LWLockRegisterTranche(int tranche_id, const char *tranche_name); Whoops, good catch. > If the intention is to delay acquiring the spin lock when trancheId is > less than LocalLWLockCounter, it seems possible that we might read > NamedLWLockTrancheNames in the GetLWTrancheName function after > LocalLWLockCounter has been updated, but before NamedLWLockTrancheNames > has been updated in the LWLockNewTrancheId function. To prevent reading > an outdated tranche name, we should acquire the spin lock unconditionally > in the GetLWTrancheName function We only ever add new tranche names to the end of the list, so we should be able to avoid the lock for the vast majority of tranche name lookups. We hold the lock while we increment LWLockCounter and add a name, and also whenever we update LocalLWLockCounter. This should prevent any unsafe accesses to NamedLWLockTrancheNames. I've added some additional commentary about this in v21. > + ereport(ERROR, > + (errmsg("maximum number of tranches already registered"), > + errdetail("No more than %d tranches may be registered.", > + MAX_NAMED_TRANCHES))); > + } > > Since this patch sets a maximum limit on the number of LW lock tranches > that can be registered, would it make sense to offer a configurable > option rather than using a hard-coded MAX_NUM_TRANCHES? This will allow > users who have reached the maximum limit to register their LW Locks. Hm. I'm not sure. I think it would be good to offer a way to accommodate more tranche names that didn't involve recompiling, but I don't know whether that situation is likely enough in practice to add yet another GUC. Thus far, we've been operating under the assumption that we'll be able to choose a number far beyond any realistic usage. If others have opinions about this, please share... -- nathan
Committed. -- nathan
Hi Nathan,
> We only ever add new tranche names to the end of the list, so we should be
> able to avoid the lock for the vast majority of tranche name lookups. We
> hold the lock while we increment LWLockCounter and add a name, and also
> whenever we update LocalLWLockCounter. This should prevent any unsafe
> accesses to NamedLWLockTrancheNames.
>
> I've added some additional commentary about this in v21.
Thanks for adding the comments.
I think there may be a brief period in the current code where unsafe access to
LWLockTrancheNames is possible.
Updates to LocalLWLockCounter and LWLockTrancheNames are performed while
holding the lock, but reading LocalLWLockCounter and then LWLockTrancheNames in
GetLWTrancheName can occur without acquiring that lock when trancheID < LocalLWLockCounter.
There is a small window between updating LocalLWLockCounter and adding the name
to LWLockTrancheNames. During this window, if GetLWTrancheName is called, it might attempt
to access a name in LWLockTrancheNames that hasn't been added yet.
I’ll see if I can demonstrate the situation with a test.
>> + ereport(ERROR,
>> + (errmsg("maximum number of tranches already registered"),
>> + errdetail("No more than %d tranches may be registered.",
>> + MAX_NAMED_TRANCHES)));
>> + }
>>
>> Since this patch sets a maximum limit on the number of LW lock tranches
>> that can be registered, would it make sense to offer a configurable
>> option rather than using a hard-coded MAX_NUM_TRANCHES? This will allow
>> users who have reached the maximum limit to register their LW Locks.
>
> Hm. I'm not sure. I think it would be good to offer a way to accommodate
> more tranche names that didn't involve recompiling, but I don't know
> whether that situation is likely enough in practice to add yet another GUC.
> Thus far, we've been operating under the assumption that we'll be able to
> choose a number far beyond any realistic usage. If others have opinions
> about this, please share...
Thank you for clarifying the rationale behind the decision.
Thank you,
Rahila Syed
On Thu, Sep 04, 2025 at 03:29:52PM +0530, Rahila Syed wrote:
> Updates to LocalLWLockCounter and LWLockTrancheNames are performed while
> holding the lock, but reading LocalLWLockCounter and then LWLockTrancheNames
> in GetLWTrancheName can occur without acquiring that lock when
> trancheID < LocalLWLockCounter. There is a small window between updating
> LocalLWLockCounter and adding the name to LWLockTrancheNames. During this
> window, if GetLWTrancheName is called, it might attempt to access a name in
> LWLockTrancheNames that hasn't been added yet.

We hold the lock when reading/writing the shared-memory LWLockCounter. There's no need to take a lock when reading the backend-local LocalLWLockCounter, as it can't be modified by any other process. Additionally, when adding a new tranche name, we hold the lock the entire time while we bump the shared-memory counter and copy the name, so there's no chance another backend will see the counter update but not the name or attempt to use the same LWLockTrancheNames slot.

-- nathan
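A sketch of the read path as described may help readers following along; the names track the thread, but the details (offset handling, the lock variable) are assumptions rather than the committed code:

```
static const char *
GetLWTrancheName(uint16 trancheId)
{
    /* Built-in tranches are resolved from a constant table. */
    if (trancheId < LWTRANCHE_FIRST_USER_DEFINED)
        return BuiltinTrancheNames[trancheId];

    trancheId -= LWTRANCHE_FIRST_USER_DEFINED;

    if (trancheId >= LocalLWLockCounter)
    {
        /* Our backend-local copy of the counter may be stale; refresh it. */
        SpinLockAcquire(&NamedLWLockTrancheLock);   /* placeholder lock name */
        LocalLWLockCounter = *LWLockCounter;
        SpinLockRelease(&NamedLWLockTrancheLock);

        /* Still unknown after the refresh: the tranche was never registered. */
        if (trancheId >= LocalLWLockCounter)
            return "extension";
    }

    /*
     * Lock-free read: entries are only ever appended, and every slot below
     * the counter value we copied was fully written before that counter was
     * bumped (both happen under the same lock on the writer side).
     */
    return NamedLWLockTrancheNames[trancheId];
}
```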
On Wed, Sep 03, 2025 at 02:01:14PM -0500, Nathan Bossart wrote: > Committed. I'm having some regrets about the changes to RequestNamedLWLockTranche(). Specifically, when it is first called, it immediately allocates an array big enough to hold 256 requests (~17 KB), whereas it used to only allocate space for 16 requests (~1 KB) and resize as needed. Furthermore, the MAX_NAMED_TRANCHES check isn't actually needed because InitializeLWLocks() will do the same check via its calls to LWLockNewTrancheId() for all the named tranche requests. -- nathan
> I'm having some regrets about the changes to RequestNamedLWLockTranche().
> Specifically, when it is first called, it immediately allocates an array
> big enough to hold 256 requests (~17 KB), whereas it used to only allocate
> space for 16 requests (~1 KB) and resize as needed.

I liked removing the repalloc calls inside this routine and did not think it was worth optimizing. I am OK with reverting it back. Although v1 is incorrect since it's still initializing NamedLWLockTrancheRequestArray to MAX_NAMED_TRANCHES:

```
if (NamedLWLockTrancheRequestArray == NULL)
{
+   NamedLWLockTrancheRequestsAllocated = 16;
    NamedLWLockTrancheRequestArray = (NamedLWLockTrancheRequest *)
        MemoryContextAlloc(TopMemoryContext,
                           MAX_NAMED_TRANCHES * sizeof(NamedLWLockTrancheRequest));
}
```

Instead of MAX_NAMED_TRANCHES, it should be NamedLWLockTrancheRequestsAllocated. Also, previously NamedLWLockTrancheRequestsAllocated was global, but I don't think it should ever be used outside of this function, so it's OK to declare it as you have.

> Furthermore, the
> MAX_NAMED_TRANCHES check isn't actually needed because InitializeLWLocks()
> will do the same check via its calls to LWLockNewTrancheId() for all the
> named tranche requests.

I thought about that one and decided to add the error message there, since requesting a tranche happens way before LWLockNewTrancheId is called during CreateLWLocks, so it was more about erroring out slightly earlier. But it may be OK to also just remove it.

-- Sami Imseih
Amazon Web Services (AWS)
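For reference, the grow-as-needed shape being reverted to looks roughly like this (a sketch reconstructed from memory of the pre-v20 logic; treat the details, e.g. the pg_nextpower2_32 doubling, as approximate):

```
if (NamedLWLockTrancheRequestArray == NULL)
{
    NamedLWLockTrancheRequestsAllocated = 16;
    NamedLWLockTrancheRequestArray = (NamedLWLockTrancheRequest *)
        MemoryContextAlloc(TopMemoryContext,
                           NamedLWLockTrancheRequestsAllocated
                           * sizeof(NamedLWLockTrancheRequest));
}

if (NamedLWLockTrancheRequests >= NamedLWLockTrancheRequestsAllocated)
{
    int         i = pg_nextpower2_32(NamedLWLockTrancheRequests + 1);

    NamedLWLockTrancheRequestArray = (NamedLWLockTrancheRequest *)
        repalloc(NamedLWLockTrancheRequestArray,
                 i * sizeof(NamedLWLockTrancheRequest));
    NamedLWLockTrancheRequestsAllocated = i;
}
```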
On Thu, Sep 04, 2025 at 12:30:27PM -0500, Sami Imseih wrote: > I liked removing the repalloc calls inside this routine and did not think > it was worth optimizing. I am OK with reverting it back. Although v1 > is incorrect since it's still initializing > NamedLWLockTrancheRequestArray to MAX_NAMED_TRANCHES Committed with that fix. >> Furthermore, the >> MAX_NAMED_TRANCHES check isn't actually needed because InitializeLWLocks() >> will do the same check via its calls to LWLockNewTrancheId() for all the >> named tranche requests. > > I thought about that one and decided to add the error message there, since > requesting a tranche happens way before LWLockNewTrancheId is called > during CreateLWLocks, so it was more about erroring out slightly earlier. > But it may be ok to also just remove it. We needed it before because the array could only ever hold MAX_NAMED_TRANCHES requests. -- nathan