Discussion: Configurable FP_LOCK_SLOTS_PER_BACKEND

Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Nikolay Samokhvalov
Date:
We're observing a few cases with lockmanager spikes in a few quite loaded systems.

These cases are different; queries are different, Postgres versions are 12, 13, and 14.

But in all cases, servers are quite beefy (96-128 vCPUs, ~600-800 GiB) receiving a lot of TPS (a few tens of thousands). Most queries that struggle with wait_event=lockmanager involve a substantial number of tables/indexes, often with partitioning.

FP_LOCK_SLOTS_PER_BACKEND is now hard-coded to 16 in storage/proc.h – and it is now very easy to hit this threshold in a loaded system, especially, for example, if a table with a dozen indexes was partitioned. It seems any system with good growth hits it sooner or later.
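
For reference, the relevant pieces of src/include/storage/proc.h look roughly like this (paraphrased from memory of the PG 14 sources; in older branches the LWLock is called backendLock rather than fpInfoLock):

    #define FP_LOCK_SLOTS_PER_BACKEND 16

    struct PGPROC
    {
        ...
        /* Per-backend fast-path lock state. */
        LWLock      fpInfoLock;     /* protects the fields below */
        uint64      fpLockBits;     /* lock modes held for each fast-path slot */
        Oid         fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* one relation OID per slot */
        bool        fpVXIDLock;     /* fast-path virtual-xid lock held? */
        LocalTransactionId fpLocalTransactionId;   /* lxid of that VXID lock */
        ...
    };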

I wonder, would it make sense to:
1) either consider increasing this hard-coded value, taking into account that 16 seems to be very low for modern workloads, schemas, and hardware – say, it could be 64,
2) or even make it configurable – a new GUC.

Thanks,
Nikolay Samokhvalov
Founder, Postgres.ai

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:

On 7/13/23 07:02, Nikolay Samokhvalov wrote:
> We're observing a few cases with lockmanager spikes in a few quite
> loaded systems.
> 
> These cases are different; queries are different, Postgres versions are
> 12, 13, and 14.
> 
> But in all cases, servers are quite beefy (96-128 vCPUs, ~600-800 GiB)
> receiving a lot of TPS (a few dozens of thousands). Most queries that
> struggle from wait_event=lockmanager involve a substantial number of
> tables/indexes, often with partitioning.
> 
> FP_LOCK_SLOTS_PER_BACKEND is now hard-coded 16 in storage/proc.h – and
> it is now very easy to hit this threshold in a loaded system,
> especially, for example, if a table with a dozen of indexes was
> partitioned. It seems any system with good growth hits it sooner or later.
> 
> I wonder, would it make sense to:
> 1) either consider increasing this hard-coded value, taking into account
> that 16 seems to be very low for modern workloads, schemas, and hardware
> – say, it could be 64,

Well, that has a cost too, as it makes PGPROC larger, right? At the
moment that struct is already ~880B / 14 cachelines, adding 48 OIDs
would make it +192B / +3 cachelines. I doubt that won't impact other
common workloads ...

However, the lmgr/README says this is meant to alleviate contention on
the lmgr partition locks. Wouldn't it be better to increase the number
of those locks, without touching the PGPROC stuff? Could this be tuned
using some heuristics based on number of connections?
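
For context, the partition count is itself a small compile-time constant, roughly this in src/include/storage/lwlock.h (paraphrased):

    #define LOG2_NUM_LOCK_PARTITIONS  4
    #define NUM_LOCK_PARTITIONS       (1 << LOG2_NUM_LOCK_PARTITIONS)   /* i.e. 16 */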

> 2) or even make it configurable – a new GUC.
> 

I'm rather strongly against adding yet another GUC for this low-level
stuff. We already have enough of such knobs that it's almost impossible
for regular users to actually tune the system without doing something wrong.
I'd even say it's actively harmful, especially if it's aimed at very
common setups/workloads (like here).

regards
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Matt Smiley
Date:
I thought it might be helpful to share some more details from one of the case studies behind Nik's suggestion.

Bursty contention on lock_manager lwlocks recently became a recurring cause of query throughput drops for GitLab.com, and we got to study the behavior via USDT and uprobe instrumentation along with more conventional observations (see https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).  This turned up some interesting finds, and I thought sharing some of that research might be helpful.

Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would have a much larger positive impact than any other mitigation strategy we have evaluated.  Rather than reducing hold duration or collision rate, adding fastpath slots reduces the frequency of even having to acquire those lock_manager lwlocks.  I suspect this would be helpful for many other workloads, particularly those having high-frequency queries whose tables and indexes collectively number more than about 16.

Lowering the lock_manager lwlock acquisition rate means lowering its contention rate (and probably also its contention duration, since exclusive mode forces concurrent lockers to queue).

I'm confident this would help our workload, and I strongly suspect it would be generally helpful by letting queries use fastpath locking more often.

> However, the lmgr/README says this is meant to alleviate contention on
> the lmgr partition locks. Wouldn't it be better to increase the number
> of those locks, without touching the PGPROC stuff?

That was my first thought too, but growing the lock_manager lwlock tranche isn't nearly as helpful.

On the slowpath, each relation's lock tag deterministically hashes onto a specific lock_manager lwlock, so growing the number of lock_manager lwlocks just makes it less likely for two or more frequently locked relations to hash onto the same lock_manager.
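
(For reference, the mapping from a lock tag's hash to its partition lwlock is roughly the following, paraphrased from src/include/storage/lock.h:)

    #define LockHashPartition(hashcode) \
        ((hashcode) % NUM_LOCK_PARTITIONS)
    #define LockHashPartitionLock(hashcode) \
        (&MainLWLockArray[LOCK_MANAGER_LWLOCK_OFFSET + \
                          LockHashPartition(hashcode)].lock)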

In contrast, growing the number of fastpath slots completely avoids calls to the slowpath (i.e. no need to acquire a lock_manager lwlock).

The saturation condition we'd like to solve is heavy contention on one or more of the lock_manager lwlocks.  Since that is driven by the slowpath acquisition rate of heavyweight locks, avoiding the slowpath is better than just moderately reducing the contention on the slowpath.

To be fair, increasing the number of lock_manager locks definitely can help to a certain extent, but it doesn't cover an important general case.  As a thought experiment, suppose we increase the lock_manager tranche to some arbitrarily large size, larger than the number of relations in the db.  This unrealistically large size means we have the best case for avoiding collisions -- each relation maps uniquely onto its own lock_manager lwlock.  That helps a lot in the case where the workload is spread among many non-overlapping sets of relations.  But it doesn't help a workload where any one table is accessed frequently via slowpath locking.

Continuing the thought experiment, if that frequently queried table has 16 or more indexes, or if it is joined to other tables that collectively add up to over 16 relations, then each of those queries is guaranteed to have to use the slowpath and acquire the deterministically associated lock_manager lwlocks.

So growing the tranche of lock_manager lwlocks would help for some workloads, while other workloads would not be helped much at all.  (As a concrete example, a workload at GitLab has several frequently queried tables with over 16 indexes that consequently always use at least some slowpath locks.)

For additional context:

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
Summarizes the pathology and its current mitigations.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
Documents the supporting research methodology.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
What code paths require an exclusive mode lwlock for lock_manager?

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
Comparison of fastpath vs. slowpath locking, including quantifying the rate difference.

https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
Confirms the acquisition rate of lock_manager locks is not uniform.  The sampled workload has a 3x difference in the most vs. least frequently acquired lock_manager lock, corresponding to the workload's most frequently accessed relations.

> Well, that has a cost too, as it makes PGPROC larger, right? At the
> moment that struct is already ~880B / 14 cachelines, adding 48 OIDs
> would make it +192B / +3 cachelines. I doubt that won't impact other
> common workloads ...

That's true; growing the data structure may affect L2/L3 cache hit rates when touching PGPROC.  Is that cost worth the benefit of using fastpath for a higher percentage of table locks?  The answer may be workload- and platform-specific.  Exposing this as a GUC gives the admin a way to make a different choice if our default (currently 16) is bad for them.

I share your reluctance to add another low-level tunable, but like many other GUCs, having a generally reasonable default that can be adjusted is better than forcing folks to fork postgres to adjust a compile-time constant.  And unfortunately I don't see a better way to solve this problem.  Growing the lock_manager lwlock tranche isn't as effective, because it doesn't help workloads where one or more relations are locked frequently enough to hit this saturation point.

Handling a larger percentage of heavyweight lock acquisitions via fastpath instead of slowpath seems likely to help many high-throughput workloads, since it avoids having to exclusively acquire an lwlock.  It seemed like the least intrusive general-purpose solution we've come up with so far.  That's why we wanted to solicit feedback or new ideas from the community.  Currently, the only options folks have to solve this class of saturation are through some combination of schema changes, application changes, vertical scaling, and spreading the query rate among more postgres instances.  Those are not always feasible or efficient options.  Lacking a better solution, exposing a GUC that rarely needs tuning seems reasonable to me.

Anyway, hopefully the extra context is helpful!  Please do share your thoughts.

--
Matt Smiley | Staff Site Reliability Engineer at GitLab

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:
On 8/3/23 01:51, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
> 
> Bursty contention on lock_manager lwlocks recently became a recurring
> cause of query throughput drops for GitLab.com, and we got to study the
> behavior via USDT and uprobe instrumentation along with more
> conventional observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).
> This turned up some interesting finds, and I thought sharing some of
> that research might be helpful.
> 

The analysis in the linked gitlab issue is pretty amazing. I wasn't
planning to argue against the findings anyway, but plenty of data
supporting the conclusions is good.

I'm not an expert on locking, so some of the stuff I say may be
trivially obvious - it's just me thinking about ...

I wonder what's the rough configuration of those systems, though. Both
the hardware and PostgreSQL side. How many cores / connections, etc.?


> Results so far suggest that increasing FP_LOCK_SLOTS_PER_BACKEND would
> have a much larger positive impact than any other mitigation strategy we
> have evaluated.  Rather than reducing hold duration or collision rate,
> adding fastpath slots reduces the frequency of even having to acquire
> those lock_manager lwlocks.  I suspect this would be helpful for many
> other workloads, particularly those having high-frequency queries whose
> tables and indexes collectively number more than about 16.
> 

Yes, I agree with that. Partitioning makes this issue worse, I guess.
Schemas with indexes on every column are disturbingly common these days,
which hits the issue too ...

> Lowering the lock_manager lwlock acquisition rate means lowering its
> contention rate (and probably also its contention duration, since
> exclusive mode forces concurrent lockers to queue).
> 
> I'm confident this would help our workload, and I strongly suspect it
> would be generally helpful by letting queries use fastpath locking more
> often.
> 

OK

>> However, the lmgr/README says this is meant to alleviate contention on
>> the lmgr partition locks. Wouldn't it be better to increase the number
>> of those locks, without touching the PGPROC stuff?
> 
> That was my first thought too, but growing the lock_manager lwlock
> tranche isn't nearly as helpful.
> 
> On the slowpath, each relation's lock tag deterministically hashes onto
> a specific lock_manager lwlock, so growing the number of lock_manager
> lwlocks just makes it less likely for two or more frequently locked
> relations to hash onto the same lock_manager.
> 

Hmmm, so if we have a query that joins 16 tables, or a couple tables
with indexes, all backends running this will acquire exactly the same
partition locks. And we're likely acquiring them in exactly the same
order (to lock objects in the same order because of deadlocks), making
the issue worse.

> In contrast, growing the number of fastpath slots completely avoids
> calls to the slowpath (i.e. no need to acquire a lock_manager lwlock).
> 
> The saturation condition we'd like to solve is heavy contention on one
> or more of the lock_manager lwlocks.  Since that is driven by the
> slowpath acquisition rate of heavyweight locks, avoiding the slowpath is
> better than just moderately reducing the contention on the slowpath.
> 
> To be fair, increasing the number of lock_manager locks definitely can
> help to a certain extent, but it doesn't cover an important general
> case.  As a thought experiment, suppose we increase the lock_manager
> tranche to some arbitrarily large size, larger than the number of
> relations in the db.  This unrealistically large size means we have the
> best case for avoiding collisions -- each relation maps uniquely onto
> its own lock_manager lwlock.  That helps a lot in the case where the
> workload is spread among many non-overlapping sets of relations.  But it
> doesn't help a workload where any one table is accessed frequently via
> slowpath locking.
> 

Understood.

> Continuing the thought experiment, if that frequently queried table has
> 16 or more indexes, or if it is joined to other tables that collectively
> add up to over 16 relations, then each of those queries is guaranteed to
> have to use the slowpath and acquire the deterministically associated
> lock_manager lwlocks.
> 
> So growing the tranche of lock_manager lwlocks would help for some
> workloads, while other workloads would not be helped much at all.  (As a
> concrete example, a workload at GitLab has several frequently queried
> tables with over 16 indexes that consequently always use at least some
> slowpath locks.)
> 
> For additional context:
> 
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#what-influences-lock_manager-lwlock-acquisition-rate
> Summarizes the pathology and its current mitigations.
> 
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678
> Documents the supporting research methodology.
> 
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365370510
> What code paths require an exclusive mode lwlock for lock_manager?
> 
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142
> Comparison of fastpath vs. slowpath locking, including quantifying the
> rate difference.
> 
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726
> Confirms the acquisition rate of lock_manager locks is not uniform.  The
> sampled workload has a 3x difference in the most vs. least frequently
> acquired lock_manager lock, corresponding to the workload's most
> frequently accessed relations.
> 

Those are pretty great pieces of information. I wonder if some of the
measurements may be affecting the observation (by consuming too much
CPU, making the contention worse), but overall it seems convincing.

Would it be difficult to sample just a small fraction of the calls? Say,
1%, to get good histograms/estimates with acceptable CPU usage.

In any case, it's a great source of information to reproduce the issue
and evaluate possible fixes.

>> Well, that has a cost too, as it makes PGPROC larger, right? At the
> moment that struct is already ~880B / 14 cachelines, adding 48 OIDs
>> would make it +192B / +3 cachelines. I doubt that won't impact other
>> common workloads ...
> 
> That's true; growing the data structure may affect L2/L3 cache hit rates
> when touching PGPROC.  Is that cost worth the benefit of using fastpath
> for a higher percentage of table locks?  The answer may be workload- and
> platform-specific.  Exposing this as a GUC gives the admin a way to make
> a different choice if our default (currently 16) is bad for them.
> 

After looking at the code etc. I think the main trade-off here is going
to be the cost of searching the fpRelId array. At the moment it's
searched linearly, which is cheap for 16 locks. But at some point it'll
become as expensive as updating the slowpath, and the question is when.
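
(From memory, the linear scan in lock.c's FastPathGrantRelationLock looks roughly like the following; paraphrased, not the exact source:)

    /* Try to record a lock in MyProc's fast-path array; false = no room. */
    static bool
    FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
    {
        uint32  f;
        uint32  unused_slot = FP_LOCK_SLOTS_PER_BACKEND;

        for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
        {
            if (FAST_PATH_GET_BITS(MyProc, f) == 0)
                unused_slot = f;            /* remember an empty slot */
            else if (MyProc->fpRelId[f] == relid)
            {
                FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
                return true;                /* already have a slot for this rel */
            }
        }

        if (unused_slot < FP_LOCK_SLOTS_PER_BACKEND)
        {
            MyProc->fpRelId[unused_slot] = relid;
            FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
            ++FastPathLocalUseCount;
            return true;
        }

        return false;                       /* caller falls back to the slowpath */
    }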

I wonder if we could switch to a more elaborate strategy if the number
of locks is high enough. Say, a hash table, or some hybrid approach.

> I share your reluctance to add another low-level tunable, but like many
> other GUCs, having a generally reasonable default that can be adjusted
> is better than forcing folks to fork postgres to adjust a compile-time
> constant.  And unfortunately I don't see a better way to solve this
> problem.  Growing the lock_manager lwlock tranche isn't as effective,
> because it doesn't help workloads where one or more relations are locked
> frequently enough to hit this saturation point.
> 

I understand. I have two concerns:

1) How would the users know they need to tune this / determine what's
the right value for their system.

2) Having to deal with misconfigured systems as people tend to blindly
tune everything to 100x the default, because more is better :-(


> Handling a larger percentage of heavyweight lock acquisitions via
> fastpath instead of slowpath seems likely to help many high-throughput
> workloads, since it avoids having to exclusively acquire an lwlock.  It
> seemed like the least intrusive general-purpose solution we've come up
> with so far.  That's why we wanted to solicit feedback or new ideas from
> the community.  Currently, the only options folks have to solve this
> class of saturation are through some combination of schema changes,
> application changes, vertical scaling, and spreading the query rate
> among more postgres instances.  Those are not always feasible or efficient
> options.  Lacking a better solution, exposing a GUC that rarely needs
> tuning seems reasonable to me.
> 
> Anyway, hopefully the extra context is helpful!  Please do share your
> thoughts.
> 

Absolutely! I think the next step for me is to go through the analysis
again, and try to construct a couple of different workloads hitting this
in some way.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:

On 8/3/23 22:39, Tomas Vondra wrote:
> ...
>
> Absolutely! I think the next step for me is to go through the analysis
> again, and try to construct a couple of different workloads hitting this
> in some way.
>

FWIW I made some progress on this - I think I managed to reproduce the
issue on a synthetic workload (with a partitioned table, using variable
number of partitions / indexes). It's hard to say for sure how serious
the reproduced cases are, but I do see spikes of lock_manager wait
events, and so on.

That however brings me to the second step - I was planning to increase
the FP_LOCK_SLOTS_PER_BACKEND value and see how much it helps, and also
measure some of the negative impact, to get a better idea of what the
trade-offs are.

But I quickly realized it's far more complicated than just increasing
the define. The thing is, it's not enough to make fpRelId larger,
there's also fpLockBits, tracking additional info about the locks (lock
mode etc.). But FP_LOCK_SLOTS_PER_BACKEND does not affect that, it's
just that int64 is enough to store the bits for 16 fastpath locks.
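
(Concretely, the packing in src/backend/storage/lmgr/lock.c looks roughly like this - paraphrased - which is why 16 slots times 3 bits just fit in the 64-bit fpLockBits word:)

    #define FAST_PATH_BITS_PER_SLOT      3    /* three eligible weak lock modes */
    #define FAST_PATH_LOCKNUMBER_OFFSET  1
    #define FAST_PATH_MASK               ((1 << FAST_PATH_BITS_PER_SLOT) - 1)

    #define FAST_PATH_GET_BITS(proc, n) \
        (((proc)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * (n))) & FAST_PATH_MASK)

    /* 16 slots * 3 bits = 48 bits used; more slots need a wider/array bitmap. */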

Note: This means one of the "mitigations" in the analysis (just rebuild
postgres with custom FP_LOCK_SLOTS_PER_BACKEND value) won't work.

I tried to fix that in a naive way (replacing it with an int64 array,
with one value for 16 locks), but I must be missing something as there
are locking failures.

I'm not sure I'll have time to hack on this soon, but if someone else
wants to take a stab at it and produce a minimal patch, I might be able
to run more tests on it.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Andres Freund
Date:
Hi,

On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
> 
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301).  This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.

Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.

I'm also wondering whether the throughput drops might be correlated with
heavyweight lock contention or higher-frequency access to the pg_locks
view. Deadlock checking and the locks view acquire locks on
all lock manager partitions... So if there's a bout of real lock contention
(for longer than deadlock_timeout)...


Given that most of your lock manager traffic comes from query planning - have
you evaluated using prepared statements more heavily?

Greetings,

Andres Freund



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:
On 8/6/23 16:44, Tomas Vondra wrote:
> ...
>
> I'm not sure I'll have time to hack on this soon, but if someone else
> wants to take a stab at it and produce a minimal patch, I might be able
> to run more tests on it.
> 

Nah, I gave it another try, handling the bitmap in a different way, and
that happened to work sufficiently well. So here are some results and
also the bench script.

Note: I haven't reproduced the terrible regressions described in this
thread. Either I don't have a system large enough, the workload may not
be exactly right, or maybe it's due to the commit missing in older
branches (mentioned by Andres). Still, the findings seem interesting.

The workload is very simple - create a table "t" with a certain number of
partitions and indexes, add a certain number of rows (100, 10k or 1M) and
do "select count(*) from t". And measure throughput. There's also a
script collecting wait-event/lock info, but I haven't looked at that.

I did this for current master (17dev), with two patch versions.

master - current master, with 16 fast-path slots

v1 increases the number of slots to 64, but switches to a single array
combining the bitmap and OIDs. I'm sure the original approach (separate
bitmap) can be made to work, and it's possible this is responsible for a
small regression in some runs (more about it in a minute).

v2 was an attempt to address the small regressions in v1, which may be
due to having to search larger arrays. The code always walks the whole
array, even if we know there have never been that many entries. So this
tries to track the last used slot, and stop the loops earlier.

The attached PDF visualizes the results, and differences between master
and the two patches. It plots throughput against number of rows / tables
and indexes, and also concurrent clients.

The last two columns show throughput vs. master, with a simple color
scale: green - speedup (good), red - regression (bad).

Let's talk about the smallest data set (100 rows). The 10k test has the
same behavior, with smaller differences (as the locking accounts for a
smaller part of the total duration). On the 1M data set the patches make
almost no difference.

There's a pretty clear flip once we reach 16 partitions - on master the
throughput drops from 310k tps to 210k tps (for 32 clients, the machine
has 32 cores). With both patches, the drop is to only about 240k tps, so
~20% improvement compared to master.

The other interesting thing is behavior with many clients:

              1     16     32     64     96    128
  master  17603 169132 204870 199390 199662 196774
  v1      17062 180475 234644 266372 267105 265885
  v2      18972 187294 242838 275064 275931 274332

So the master "stagnates" or maybe even drops off, while with both
patches the throughput continues to grow beyond 32 clients. This is even
more obvious for 32 or 64 partitions - for 32, the results are

              1     16     32     64     96    128
  master  11292  93783 111223 102580  95197  87800
  v1      12025 123503 168382 179952 180191 179846
  v2      12501 126438 174255 185435 185332 184938

That's a pretty massive improvement, IMO. Would help OLTP scalability.

For 10k rows the pattern is the same, although the differences are less
significant. For 1M rows there's no speedup.

The bad news is this seems to have negative impact on cases with few
partitions, that'd fit into 16 slots. Which is not surprising, as the
code has to walk longer arrays, it probably affects caching etc. So this
would hurt the systems that don't use that many relations - not much,
but still.

The regression appears to be consistently ~3%, and v2 aimed to improve
that - at least for the case with just 100 rows. It even gains ~5% in a
couple cases. It's however a bit strange v2 doesn't really help the two
larger cases.

Overall, I think this seems interesting - it's hard to not like doubling
the throughput in some cases. Yes, it's 100 rows only, and the real
improvements are bound to be smaller, but it would help short OLTP queries
that only process a couple rows.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Nathan Bossart
Date:
On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
> The bad news is this seems to have negative impact on cases with few
> partitions, that'd fit into 16 slots. Which is not surprising, as the
> code has to walk longer arrays, it probably affects caching etc. So this
> would hurt the systems that don't use that many relations - not much,
> but still.
> 
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.
> 
> Overall, I think this seems interesting - it's hard to not like doubling
> the throughput in some cases. Yes, it's 100 rows only, and the real
> improvements are bound to be smaller, it would help short OLTP queries
> that only process a couple rows.

Indeed.  I wonder whether we could mitigate the regressions by using SIMD
intrinsics in the loops.  Or auto-vectorization, if that is possible.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Robert Haas
Date:
On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> The regression appears to be consistently ~3%, and v2 aimed to improve
> that - at least for the case with just 100 rows. It even gains ~5% in a
> couple cases. It's however a bit strange v2 doesn't really help the two
> larger cases.

To me, that's an absolutely monumental regression for a change like
this. Sure, lots of people have partitioned tables. But also, lots of
people don't. Penalizing very simple queries by 2-3% seems completely
over the top to me. I can't help wondering whether there's actually
something wrong with the test, or the coding, because that seems huge
to me.

I would also argue that the results are actually not that great,
because once you get past 64 partitions you're right back where you
started, or maybe worse off. To me, there's nothing magical about
cases between 16 and 64 relations that makes them deserve special
treatment - plenty of people are going to want to use hundreds of
partitions, and even if you only use a few dozen, this isn't going to
help as soon as you join two or three partitioned tables, and I
suspect it hurts whenever it doesn't help.

I think we need a design that scales better. I don't really know what
that would look like, exactly, but your idea of a hybrid approach
seems like it might be worth further consideration. We don't have to
store an infinite number of fast-path locks in an array that we search
linearly, and it might be that if we moved to some other
approach we could avoid some of the regression. You mentioned a hash
table; a partially associative cache might also be worth considering,
like having an array of 1k slots but dividing it logically into 128
bins of 16 slots each and only allowing an OID to be recorded in the
bin whose low 7 bits match the low 7 bits of the OID.
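
To illustrate what I mean (a hypothetical, self-contained sketch, not a patch; all names below are made up):

    #include <stdint.h>

    typedef uint32_t Oid;                 /* stand-in for PostgreSQL's Oid */

    #define FP_BINS      128              /* 1k slots split into 128 bins... */
    #define FP_BIN_SIZE  16               /* ...of 16 slots each */

    static Oid fpRelId[FP_BINS * FP_BIN_SIZE];    /* per-backend slot array */

    /*
     * An OID may only live in the bin picked by its low 7 bits, so a probe
     * scans 16 slots instead of all 1024.
     */
    static int
    fp_find_slot(Oid relid)
    {
        int     base = (relid & (FP_BINS - 1)) * FP_BIN_SIZE;

        for (int i = 0; i < FP_BIN_SIZE; i++)
        {
            if (fpRelId[base + i] == relid)
                return base + i;
        }
        return -1;                        /* not held via fast-path */
    }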

But maybe first we need to understand where all the CPU cycles are
going, because maybe that's optimizing completely the wrong thing and,
again, it seems like an awfully large hit.

Of course, another thing we could do is try to improve the main lock
manager somehow. I confess that I don't have a great idea for that at
the moment, but the current locking scheme there is from a very, very
long time ago and clearly wasn't designed with modern hardware in
mind.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:
On 8/7/23 19:05, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:51 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> The regression appears to be consistently ~3%, and v2 aimed to improve
>> that - at least for the case with just 100 rows. It even gains ~5% in a
>> couple cases. It's however a bit strange v2 doesn't really help the two
>> larger cases.
> 
> To me, that's an absolutely monumental regression for a change like
> this. Sure, lots of people have partitioned tables. But also, lots of
> people don't. Penalizing very simple queries by 2-3% seems completely
> over the top to me. I can't help wondering whether there's actually
> something wrong with the test, or the coding, because that seems huge
> to me.
> 

I'm the first to admit the coding (in my patches) is far from perfect,
and this may easily be a consequence of that. My motivation was to get
some quick measurements for the "bad case".

> I would also argue that the results are actually not that great,
> because once you get past 64 partitions you're right back where you
> started, or maybe worse off. To me, there's nothing magical about
> cases between 16 and 64 relations that makes them deserve special
> treatment - plenty of people are going to want to use hundreds of
> partitions, and even if you only use a few dozen, this isn't going to
> help as soon as you join two or three partitioned tables, and I
> suspect it hurts whenever it doesn't help.
> 

That's true, but doesn't that apply to any cache that can overflow? You
could make the same argument about the default value of 16 slots - why
not have just 8?

FWIW I wasn't really suggesting we should increase the value to 64, I
was just trying to get a better idea of the costs at play (fast-path
cache maintenance and regular locking).

> I think we need a design that scales better. I don't really know what
> that would look like, exactly, but your idea of a hybrid approach
> seems like it might be worth further consideration. We don't have to
> store an infinite number of fast-path locks in an array that we search
> linearly, and it might be better that if we moved to some other
> approach we could avoid some of the regression. You mentioned a hash
> table; a partially associative cache might also be worth considering,
> like having an array of 1k slots but dividing it logically into 128
> bins of 16 slots each and only allowing an OID to be recorded in the
> bin whose low 7 bits match the low 7 bits of the OID.

Yes, I agree. I don't know if this particular design would be the right
one (1000 elements seems a bit too much for something included right in
PGPROC). But yeah, something that flips from linear search to something
else would be reasonable.

> 
> But maybe first we need to understand where all the CPU cycles are
> going, because maybe that's optimizing completely the wrong thing and,
> again, it seems like an awfully large hit.
> 

Right. We're mostly just guessing what the issue is.

> Of course, another thing we could do is try to improve the main lock
> manager somehow. I confess that I don't have a great idea for that at
> the moment, but the current locking scheme there is from a very, very
> long time ago and clearly wasn't designed with modern hardware in
> mind.
> 

No idea, but I'd bet some of the trade offs may need re-evaluation.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:
On 8/7/23 18:56, Nathan Bossart wrote:
> On Mon, Aug 07, 2023 at 12:51:24PM +0200, Tomas Vondra wrote:
>> The bad news is this seems to have negative impact on cases with few
>> partitions, that'd fit into 16 slots. Which is not surprising, as the
>> code has to walk longer arrays, it probably affects caching etc. So this
>> would hurt the systems that don't use that many relations - not much,
>> but still.
>>
>> The regression appears to be consistently ~3%, and v2 aimed to improve
>> that - at least for the case with just 100 rows. It even gains ~5% in a
>> couple cases. It's however a bit strange v2 doesn't really help the two
>> larger cases.
>>
>> Overall, I think this seems interesting - it's hard to not like doubling
>> the throughput in some cases. Yes, it's 100 rows only, and the real
>> improvements are bound to be smaller, it would help short OLTP queries
>> that only process a couple rows.
> 
> Indeed.  I wonder whether we could mitigate the regressions by using SIMD
> intrinsics in the loops.  Or auto-vectorization, if that is possible.
> 

Maybe, but from what I know about SIMD it would require a lot of changes
to the design, so that the loops don't mix accesses to different PGPROC
fields (fpLockBits, fpRelId) and so on. But I think it'd be better to
just stop walking the whole array regularly.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Robert Haas
Date:
On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> > I would also argue that the results are actually not that great,
> > because once you get past 64 partitions you're right back where you
> > started, or maybe worse off. To me, there's nothing magical about
> > cases between 16 and 64 relations that makes them deserve special
> > treatment - plenty of people are going to want to use hundreds of
> > partitions, and even if you only use a few dozen, this isn't going to
> > help as soon as you join two or three partitioned tables, and I
> > suspect it hurts whenever it doesn't help.
>
> That's true, but doesn't that apply to any cache that can overflow? You
> could make the same argument about the default value of 16 slots - why
> not to have just 8?

Yes and no. I mean, there are situations where when the cache
overflows, you still get a lot of benefit out of the entries that you
are able to cache, as when the frequency of access follows some kind
of non-uniform distribution, Zipfian or decreasing geometrically or
whatever. There are also situations where you can just make the cache
big enough that as a practical matter it's never going to overflow. I
can't think of a PostgreSQL-specific example right now, but if you
find that a 10-entry cache of other people living in your house isn't
good enough, a 200-entry cache should solve the problem for nearly
everyone alive. If that doesn't cause a resource crunch, crank up the
cache size and forget about it. But here we have neither of those
situations. The access frequency is basically uniform, and the cache
size needed to avoid overflows seems to be unrealistically large, at
least given the current design. So I think that in this case upping
the cache size figures to be much less effective than in some other
cases.

It's also a bit questionable whether "cache" is even the right word
here. I'd say it isn't, because it's not like the information in the
fast-path locking structures is a subset of the full information
stored elsewhere. Whatever information is stored there is canonical
for those entries.

> Yes, I agree. I don't know if this particular design would be the right
> one (1000 elements seems a bit too much for something included right in
> PGPROC). But yeah, something that flips from linear search to something
> else would be reasonable.

Yeah ... or there could be a few slots in the PGPROC and then a bit
indicating whether to jump to a larger shared memory structure located
in a separate array. Not sure exactly.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Matt Smiley
Date:
Thank you Tomas!  I really appreciate your willingness to dig in here and help us out!  The rest of my replies are inline below.

On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.

Thank you!  I totally agree, having supporting data is so helpful.

> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...

Absolutely makes sense to check assumptions, etc.  Thanks for being open!  For what it's worth, I've also been working with Postgres for many years, and I love that it keeps teaching me new things, this topic being just the latest.

> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?

Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80 concurrently active connections.

For purposes of reproducing the pathology, I think we can do so with a single postgres instance.  We will need a high enough query rate to push the bottleneck to lock_manager lwlock contention.  The simplest way to do so is probably to give it a small dataset that fits easily in cache and run several concurrent client connections doing cheap single-row queries, each in its own transaction, against a target table that has either many indexes or partitions or both.

For context, here's a brief summary of the production environment where we first observed this pathology:
The writable primary postgres instance has several streaming replicas, used for read-only portions of the workload.  All of them run on equivalent hardware.  Most of the research focuses on the streaming replica postgres instances, although the same pathology has been observed in the writable primary node as well. The general topology is thousands of client connections fanning down into several pgbouncer instances per Postgres instance.  From each Postgres instance's perspective, its workload generally has a daily peak of roughly 80 concurrently active backends supporting a throughput of 75K transactions per second, where most transactions run a single query.

> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days,
> which hits the issue too ...

Agreed.
 
> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.

Yes, definitely overhead is a concern, glad you asked!

Here are my notes on the overhead for each bpftrace script: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678


Here are more generic benchmark results for uprobe overhead: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383

Briefly, we generally  expect the instrumentation overhead to be roughly 1-2 microseconds per call to the instrumented instruction.  It partly depends on what we're doing in the instrumentation, but most of that overhead is just the interrupt-handling to transfer control flow to/from the BPF code.

> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimates with acceptable CPU usage.

That would be great, but since the overhead comes mostly from the control transfer, it wouldn't help to put sampling logic in the tracer itself.  The main way to mitigate that overhead is to choose instrumentation points where the call rate is tolerably low.  That's why the only instrumentation I used for more than a few seconds at a time were the "low overhead" scripts that instrument only the stalled call to LWLockAcquire.
 
> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.

Thanks, that's my hope!
 
> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.

Interesting idea!  I was hoping a linear search would stay cheap enough but you're right, it's going to become too inefficient at some point.  It might make sense to start with just blackbox timing or throughput measurements, because directly measuring that search duration may not be cheap.  To observe durations via BPF, we have to instrument 2 points (e.g. function entry and return, or more generally the instructions before and after the critical section we're observing).  For code called as frequently as LWLockAcquire, that overhead would be prohibitively expensive, so we might need to measure it natively with counters for each histogram bucket we care about.  Just thinking ahead; we don't need to deal with this yet, I guess.
 
> I understand. I have two concerns:
>
> 1) How would the users know they need to tune this / determine what's
> the right value for their system.
>
> 2) Having to deal with misconfigured systems as people tend to blindly
> tune everything to 100x the default, because more is better :-(

These are very valid concerns.  Thanks for articulating them!

For point 1:
The advice might be to only increase the number of slots for fastpath locks per xact if sampling pg_stat_activity frequently shows "lock_manager" wait_events affecting a significant percentage of your non-idle backends.  And increase it as little as possible, due to the extra overhead incurred when checking locks.  For calibration purposes, I polled pg_locks periodically, counting the number of slowpath locks where the lock mode was weak enough to qualify for fastpath if there had been enough fastpath slots available.  See the graph and SQL at the bottom of this note: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365595142.

For point 2:
Maybe we should put a ceiling on the allowed value.  Maybe 64 or 128 or 256?  Probably should depend on the cost of searching the fpRelId array, as you described earlier.

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From:
Tomas Vondra
Date:

On 8/7/23 21:21, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 3:02 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>> I would also argue that the results are actually not that great,
>>> because once you get past 64 partitions you're right back where you
>>> started, or maybe worse off. To me, there's nothing magical about
>>> cases between 16 and 64 relations that makes them deserve special
>>> treatment - plenty of people are going to want to use hundreds of
>>> partitions, and even if you only use a few dozen, this isn't going to
>>> help as soon as you join two or three partitioned tables, and I
>>> suspect it hurts whenever it doesn't help.
>>
>> That's true, but doesn't that apply to any cache that can overflow? You
>> could make the same argument about the default value of 16 slots - why
>> not to have just 8?
> 
> Yes and no. I mean, there are situations where when the cache
> overflows, you still get a lot of benefit out of the entries that you
> are able to cache, as when the frequency of access follows some kind
> of non-uniform distribution, Zipfian or decreasing geometrically or
> whatever. There are also situations where you can just make the cache
> big enough that as a practical matter it's never going to overflow. I
> can't think of a PostgreSQL-specific example right now, but if you
> find that a 10-entry cache of other people living in your house isn't
> good enough, a 200-entry cache should solve the problem for nearly
> everyone alive. If that doesn't cause a resource crunch, crank up the
> cache size and forget about it. But here we have neither of those
> situations. The access frequency is basically uniform, and the cache
> size needed to avoid overflows seems to be unrealistically large, at
> least given the current design. So I think that in this case upping
> the cache size figures to be much less effective than in some other
> cases.
> 

Why would the access frequency be uniform? In particular, there's a huge
variability in how long the locks need to exist - IIRC we may be keeping
locks for tables for a long time, but not for indexes. From this POV it
might be better to do fast-path locking for indexes, no?

> It's also a bit questionable whether "cache" is even the right word
> here. I'd say it isn't, because it's not like the information in the
> fast-path locking structures is a subset of the full information
> stored elsewhere. Whatever information is stored there is canonical
> for those entries.
> 

Right. Calling this a cache might be a bit misleading.

>> Yes, I agree. I don't know if this particular design would be the right
>> one (1000 elements seems a bit too much for something included right in
>> PGPROC). But yeah, something that flips from linear search to something
>> else would be reasonable.
> 
> Yeah ... or there could be a few slots in the PGPROC and then a bit
> indicating whether to jump to a larger shared memory structure located
> in a separate array. Not sure exactly.
> 

Maybe, but isn't that mostly what the regular non-fast-path locking
does? Wouldn't that defeat the whole purpose of fast-path locking?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Robert Haas
Date:
On Mon, Aug 7, 2023 at 3:48 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Why would the access frequency be uniform? In particular, there's a huge
> variability in how long the locks need to exist - IIRC we may be keeping
> locks for tables for a long time, but not for indexes. From this POV it
> might be better to do fast-path locking for indexes, no?

If you're not using explicit transactions, you take a bunch of locks
at the start of a statement and then release all of them at the end.
None of the locks stick around, so the fast-path locking structure goes
through cycles where it starts out empty, fills up to N items, and
then goes back to empty. If you visualize it as a cache, we're
flushing the entire cache at the end of every operation.

If you run multiple statements in a transaction, the locks will be
kept until the end of the transaction, once acquired. So then you
could start with a small number and gradually accumulate more. But
then you're going to release them all at once at the end.

The main thing that matters here seems to be whether or not all of the
locks can go through the fast-path mechanism, or how many have to go
through the regular mechanism. It shouldn't matter, AFAICS, *which
ones* go through the fast-path mechanism. If you think it does, I'd
like to hear why - it's possible I'm missing something here.

> Maybe, but isn't that mostly what the regular non-fast-path locking
> does? Wouldn't that defeat the whole purpose of fast-path locking?

I don't think so. The main lock manager has two flaws that hinder
performance in comparison with the fast-path mechanism. The first, and
less important, one is that the fast-path data structures are just a lot
simpler: for access to a small number of fixed-size elements, a C
array is hard to beat, and the main lock manager's data structures are a
lot more complex. The second one, which I think is more important, is
that we've essentially flipped the ordering of the primary key. In the
main lock manager, you start by hashing the locked object and that
gives you a partition number and you then take that partition lock.
Then, you iterate through a list of backends that have that object
locked. This means that if a lot of people are taking locks on the
same object, even if there's no actual conflict between the lock
modes, you still get a lot of contention. But in the fast-path
mechanism, it's reversed: first, you go to the shared memory *for your
backend* and then you search through it for the particular locked
object at issue. So basically the main lock manager treats the primary
key as (what, who) while the fast-path mechanism treats it as (who,
what). And that gets rid of a ton of contention because then different
backends locking the same object (in sufficiently weak lock modes)
never touch the same cache lines, so there's actually zero contention.
That is, I believe, the most important thing about the fast-path
locking system.

What I've just said is slightly complicated by the existence of
FastPathStrongRelationLockData, which is concurrently accessed by all
backends when using fast-path locking, but access to it is read-only as
long as nobody actually takes a strong lock (like an AccessExclusiveLock
on a table). So you probably do get some cache line effects there, but
because it's read-only, they don't cause too much of a headache.

We do have to be careful that the overhead of checking multiple
locking data structures doesn't add up to a problem, for sure. But
there can still, I believe, be a lot of benefit in dividing up access
first by "who" and then by "what" for weak relation locks even if the
per-backend data structures become more complex. Or at least I hope
so.
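To illustrate that key-order difference in a compact form (illustrative names only, not the real data structures), here is a sketch of why the shared lock manager concentrates contention while the fast path avoids it:

typedef unsigned int Oid;

typedef struct LockTag
{
    Oid     dbOid;
    Oid     relOid;
} LockTag;

#define NUM_LOCK_PARTITIONS 16      /* the partitioned lmgr locks */

/* Stand-in for the real lock tag hash function. */
static unsigned int
locktag_hash(const LockTag *tag)
{
    return tag->dbOid * 0x9E3779B9u ^ tag->relOid;
}

/*
 * Main lock manager, key order (what, who): every backend locking this
 * relation computes the same partition and must take that partition's
 * LWLock - which is exactly where "lock_manager" wait events pile up.
 */
static int
lock_partition_for(const LockTag *tag)
{
    return locktag_hash(tag) % NUM_LOCK_PARTITIONS;
}

/*
 * Fast path, key order (who, what): each backend records weak relation locks
 * in its own PGPROC (see the slot-scan sketch earlier in the thread), so two
 * backends taking weak locks on the same relation never write to shared
 * cache lines, apart from the strong-lock bookkeeping mentioned above.
 */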

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Matt Smiley
Date:
Hi Andres, thanks for helping!  Great questions, replies are inline below.

On Sun, Aug 6, 2023 at 1:00 PM Andres Freund <andres@anarazel.de> wrote:
> Hm, I'm curious whether you have a way to trigger the issue outside of your
> prod environment. Mainly because I'm wondering if you're potentially hitting
> the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
> you'd not see the benefit unless you reproduced the load in 16+.

Thanks for sharing this!

I have not yet written a reproducer since we see this daily in production.  I have a sketch of a few ways that I think will reproduce the behavior we're observing, but haven't had time to implement it.

I'm not sure if we're seeing this behavior in production, but it's definitely an interesting find.  Currently we are running postgres 12.11, with an upcoming upgrade to 15 planned.  Good to know there's a potential improvement waiting in 16.  I noticed that in LWLockAcquire the call to LWLockDequeueSelf occurs (https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218) directly between the unsuccessful attempt to immediately acquire the lock and reporting the backend's wait event.  The distinctive indicators we have been using for this pathology are the "lock_manager" wait_event and its associated USDT probe (https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1236-L1237), both of which occur after whatever overhead is incurred by LWLockDequeueSelf.  As you mentioned in your commit message, that overhead is hard to detect.  My first impression is that whatever overhead it incurs is in addition to what we are investigating.
 
> I'm also wondering if it's possible that the reason for the throughput drops
> are possibly correlated with heavyweight contention or higher frequency access
> to the pg_locks view. Deadlock checking and the locks view acquire locks on
> all lock manager partitions... So if there's a bout of real lock contention
> (for longer than deadlock_timeout)...

Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds, so frequently hitting that would massively violate SLO and would alert the on-call engineers.  The pg_locks view is scraped a couple times per minute for metrics collection, but the lock_manager lwlock contention can be observed thousands of times every second, typically with very short durations.  The following example (captured just now) shows the number of times per second over a 10-second window that any 1 of the 16 "lock_manager" lwlocks was contended:

msmiley@patroni-main-2004-103-db-gprd.c.gitlab-production.internal:~$ sudo ./bpftrace -e 'usdt:/usr/lib/postgresql/12/bin/postgres:lwlock__wait__start /str(arg0) == "lock_manager"/ { @[arg1] = count(); } interval:s:1 { print(@); clear(@); } interval:s:10 { exit(); }'
Attaching 5 probes...
@[0]: 12122
@[0]: 12888
@[0]: 13011
@[0]: 13348
@[0]: 11461
@[0]: 10637
@[0]: 10892
@[0]: 12334
@[0]: 11565
@[0]: 11596

Typically that contention only lasts a couple microseconds.  But the long tail can sometimes be much slower.  Details here: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365159507.

> Given that most of your lock manager traffic comes from query planning - have
> you evaluated using prepared statements more heavily?

Yes, there are unrelated obstacles to doing so -- that's a separate can of worms, unfortunately.  But in this pathology, even if we used prepared statements, the backend would still need to reacquire the same locks during each executing transaction.  So in terms of lock acquisition rate, whether it's via the planner or executor doing it, the same relations have to be locked.

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Andres Freund
Date:
Hi,

On 2023-08-07 13:59:26 -0700, Matt Smiley wrote:
> I have not yet written a reproducer since we see this daily in production.
> I have a sketch of a few ways that I think will reproduce the behavior
> we're observing, but haven't had time to implement it.
> 
> I'm not sure if we're seeing this behavior in production

It might be worth for you to backpatch

commit 92daeca45df
Author: Andres Freund <andres@anarazel.de>
Date:   2022-11-21 20:34:17 -0800

    Add wait event for pg_usleep() in perform_spin_delay()

into 12. That should be low risk and have only trivially resolvable
conflicts. Alternatively, you could use bpftrace et al to set a userspace
probe on perform_spin_delay().


> , but it's definitely an interesting find.  Currently we are running
> postgres 12.11, with an upcoming upgrade to 15 planned.  Good to know
> there's a potential improvement waiting in 16.  I noticed that in
> LWLockAcquire the call to LWLockDequeueSelf occurs (
> https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218)
> directly between the unsuccessful attempt to immediately acquire the lock
> and reporting the backend's wait event.

That's normal.



> > I'm also wondering if it's possible that the reason for the throughput
> > drops
> > are possibly correlated with heavyweight contention or higher frequency
> > access
> > to the pg_locks view. Deadlock checking and the locks view acquire locks on
> > all lock manager partitions... So if there's a bout of real lock contention
> > (for longer than deadlock_timeout)...
> >
> 
> Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds,
> so frequently hitting that would massively violate SLO and would alert the
> on-call engineers.  The pg_locks view is scraped a couple times per minute
> for metrics collection, but the lock_manager lwlock contention can be
> observed thousands of times every second, typically with very short
> durations.  The following example (captured just now) shows the number of
> times per second over a 10-second window that any 1 of the 16
> "lock_manager" lwlocks was contended:

Some short-lived contention is fine and expected - the question is how long
the waits are...

Unfortunately my experience is that the overhead of bpftrace means that
analyzing things like this with bpftrace is very hard... :(.


> > Given that most of your lock manager traffic comes from query planning -
> > have you evaluated using prepared statements more heavily?
> >
> 
> Yes, there are unrelated obstacles to doing so -- that's a separate can of
> worms, unfortunately.  But in this pathology, even if we used prepared
> statements, the backend would still need to reacquire the same locks during
> each executing transaction.  So in terms of lock acquisition rate, whether
> it's via the planner or executor doing it, the same relations have to be
> locked.

Planning will often lock more database objects than query execution does, so
skipping repeated planning can keep you within the fast-path slots for longer.

Greetings,

Andres Freund



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Matt Smiley
Date:
> Why would the access frequency be uniform? In particular, there's a huge
> variability in how long the locks need to exist

As a supporting data point, our example production workload shows a 3x difference between the most and least frequently contended lock_manager lock (graph not reproduced here; see the GitLab issue linked upthread).

Since we deterministically distribute relations among those 16 lock_manager lwlocks by hashing their lock tag, we can probably assume a roughly uniform number of relations are being managed by each lock_manager lock, but the demand (and contention) for them is non-uniform. This 3x spread corroborates the intuition that some relations are locked more frequently than others (that being both a schema- and workload-specific property).

Since we're contemplating a new hashing scheme, I wonder how we could accommodate that kind of asymmetry, where some relations are locked more frequently than others.

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Andres Freund
Date:
Hi,

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> I would also argue that the results are actually not that great,
> because once you get past 64 partitions you're right back where you
> started, or maybe worse off. To me, there's nothing magical about
> cases between 16 and 64 relations that makes them deserve special
> treatment - plenty of people are going to want to use hundreds of
> partitions, and even if you only use a few dozen, this isn't going to
> help as soon as you join two or three partitioned tables, and I
> suspect it hurts whenever it doesn't help.
> 
> I think we need a design that scales better. I don't really know what
> that would look like, exactly, but your idea of a hybrid approach
> seems like it might be worth further consideration. We don't have to
> store an infinite number of fast-path locks in an array that we search
> linearly, and it might be better that if we moved to some other
> approach we could avoid some of the regression.

My gut feeling is that the state for fast path locks doesn't live in quite the
right place.

What if fast path locks entered PROCLOCK into the shared hashtable, just like
with normal locks, the first time a lock is acquired by a backend. Except that
we'd set a flag indicating the lock is a fastpath lock. When the lock is
released, neither the LOCALLOCK nor the PROCLOCK entry would be
removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
lock is not held anymore.

That itself wouldn't buy us much - we'd still need to do a lookup in the
shared hashtable. But, by the time we decide whether to use fast path locks,
we've already done a hash lookup in the LOCALLOCK hashtable. Because the
PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
the PROCLOCK entry without a shared hash table lookup.

Acquiring a strong lock on a fastpath lock would basically entail modifying
all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
possible anymore.  Acquiring a fast path lock would just require atomically
modifying the PROCLOCK to indicate that the lock is held.

On a first blush, this sounds like it could end up being fairly clean and
generic?
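
As a very rough illustration of that shape (hypothetical flag names, nothing worked out), the weak-lock acquire could then be a single atomic update on the retained PROCLOCK, with strong lockers flipping a disallow bit first:

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical extra state word kept in each retained PROCLOCK entry. */
typedef struct FastPathProcLockState
{
    atomic_uint flags;
} FastPathProcLockState;

#define FP_HELD        0x01u    /* this backend currently holds the weak lock */
#define FP_DISALLOWED  0x02u    /* a strong locker disabled the fast path here */

/* Weak-lock acquire: one atomic RMW, no lock manager partition lock taken. */
static bool
fastpath_acquire(FastPathProcLockState *s)
{
    unsigned int old = atomic_fetch_or(&s->flags, FP_HELD);

    if (old & FP_DISALLOWED)
    {
        /* A strong locker got there first: undo, fall back to the slow path. */
        atomic_fetch_and(&s->flags, ~FP_HELD);
        return false;
    }
    return true;
}

/* Weak-lock release: just clear the held bit. */
static void
fastpath_release(FastPathProcLockState *s)
{
    atomic_fetch_and(&s->flags, ~FP_HELD);
}

/*
 * Strong locker: mark the relevant PROCLOCKs as fast-path-disallowed, then
 * treat any still-set FP_HELD bits as existing holders to wait behind.
 */
static void
disallow_fastpath(FastPathProcLockState *s)
{
    atomic_fetch_or(&s->flags, FP_DISALLOWED);
}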

Greetings,

Andres Freund



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Andres Freund
Date:
Hi,

On 2023-08-07 14:36:48 -0700, Andres Freund wrote:
> What if fast path locks entered PROCLOCK into the shared hashtable, just like
> with normal locks, the first time a lock is acquired by a backend. Except that
> we'd set a flag indicating the lock is a fastpath lock. When the lock is
> released, neither the LOCALLOCK nor the PROCLOCK entry would be
> removed. Instead, the LOCK/PROCLOCK would be modified to indicate that the
> lock is not held anymore.
> 
> That itself wouldn't buy us much - we'd still need to do a lookup in the
> shared hashtable. But, by the time we decide whether to use fast path locks,
> we've already done a hash lookup in the LOCALLOCK hashtable. Because the
> PROCLOCK entry would continue to exist, we can use LOCALLOCK->proclock to get
> the PROCLOCK entry without a shared hash table lookup.
> 
> Acquiring a strong lock on a fastpath lock would basically entail modifying
> all the relevant PROCLOCKs atomically to indicate that fast path locks aren't
> possible anymore.  Acquiring a fast path lock would just require atomically
> modifying the PROCLOCK to indicate that the lock is held.
> 
> On a first blush, this sounds like it could end up being fairly clean and
> generic?

On 2023-08-07 13:05:32 -0400, Robert Haas wrote:
> Of course, another thing we could do is try to improve the main lock
> manager somehow. I confess that I don't have a great idea for that at
> the moment, but the current locking scheme there is from a very, very
> long time ago and clearly wasn't designed with modern hardware in
> mind.

I think the biggest flaw of the locking scheme is that the LockHash locks
protect two, somewhat independent, things:
1) the set of currently lockable objects, i.e. the entries in the hash table [partition]
2) the state of all the locks [in a partition]

It'd not be that hard to avoid the shared hashtable lookup in a number of
cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
above.  But we can't, in general, avoid the lock on the partition anyway, as
each lock's state is also protected by the partition lock.

The amount of work to do a lookup in the shared hashtable and/or create a new
entry therein is quite bounded.  But the work for acquiring a lock is much less
so - we'll e.g. often have to iterate over the set of lock holders, etc.

I think we ought to investigate whether pushing down the locking for the "lock
state" into the individual locks is worth it. That way the partitioned lock
would just protect the hashtable.

The biggest issue I see is deadlock checking. Right now acquiring all lock
partition locks gives you a consistent view of all the non-fastpath locks - and
fastpath locks can't participate in deadlocks. Any scheme that makes "lock
state" locking in general more granular will make it next to impossible to
have a similarly consistent view of all locks.  I'm not sure the current
degree of consistency is required, however - the lockers participating in a
lock cycle, pretty much by definition, are blocked.


A secondary issue is that making the locks more granular could affect the
happy path measurably - we'd need two atomics for each heavyweight lock
acquisition, not one.  But if we cached the lookup in the shared hashtable,
we'd commonly be able to skip the hashtable lookup...

Greetings,

Andres Freund



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Robert Haas
Date:
On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:
> I think the biggest flaw of the locking scheme is that the LockHash locks
> protect two, somewhat independent, things:
> 1) the set of currently lockable objects, i.e. the entries in the hash table [partition]
> 2) the state of all the locks [in a partition]
>
> It'd not be that hard to avoid the shared hashtable lookup in a number of
> cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> above.  But we can't, in general, avoid the lock on the partition anyway, as
> each lock's state is also protected by the partition lock.

Yes, and that's a huge problem. The main selling point of the whole
fast-path mechanism is to ease the pressure on the lock manager
partition locks, and if we did something like what you described in
the previous email without changing the locking regimen, we'd bring
all of that contention back. I'm pretty sure that would suck.

> The amount of work to do a lookup in the shared hashtable and/or create a new
> entry therein is quite bounded.  But the work for acquiring a lock is much less
> so - we'll e.g. often have to iterate over the set of lock holders, etc.
>
> I think we ought to investigate whether pushing down the locking for the "lock
> state" into the individual locks is worth it. That way the partitioned lock
> would just protect the hashtable.

I think this would still suck. Suppose you put an LWLock or slock_t in
each LOCK. If you now run a lot of select queries against the same
table (e.g. pgbench -S -c 64 -j 64), everyone is going to fight over
the lock counts for that table. Here again, the value of the fast-path
system is that it spreads out the contention in ways that approaches
like this can't do.

Or, hmm, maybe what you're really suggesting is pushing the state down
into each PROCLOCK rather than each LOCK. That would be more promising
if we could do it, because that is per-lock *and also per-backend*.
But you can't decide from looking at a single PROCLOCK whether a new
lock at some given lock mode is grantable or not, at least not with
the current PROCLOCK representation.

I think any workable solution here has to allow a backend to take a
weak relation lock without contending with other backends trying to
take the same weak relation lock (provided there are no strong
lockers). Maybe backends should be able to allocate PROCLOCKs and
record weak relation locks there without actually linking them up to
LOCK objects, or something like that. Anyone who wants a strong lock
must first go and find all of those objects for the LOCK they want and
connect them up to that LOCK.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Andres Freund
Date:
Hi,

On 2023-08-08 16:44:37 -0400, Robert Haas wrote:
> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:
> > I think the biggest flaw of the locking scheme is that the LockHash locks
> > protect two, somewhat independent, things:
> > 1) the set of currently lockable objects, i.e. the entries in the hash table [partition]
> > 2) the state of all the locks [in a partition]
> >
> > It'd not be that hard to avoid the shared hashtable lookup in a number of
> > cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
> > above.  But we can't, in general, avoid the lock on the partition anyway, as
> > each lock's state is also protected by the partition lock.
> 
> Yes, and that's a huge problem. The main selling point of the whole
> fast-path mechanism is to ease the pressure on the lock manager
> partition locks, and if we did something like what you described in
> the previous email without changing the locking regimen, we'd bring
> all of that contention back. I'm pretty sure that would suck.

Yea - I tried to outline how I think we could implement the fastpath locking
scheme in a less limited way in the earlier email that I quoted above
this bit.  Here I was pontificating on what we possibly should do in addition
to that. I think even if we had "unlimited" fastpath locking, there's still
enough pressure on the lock manager locks that it's worth improving the
overall locking scheme.

Greetings,

Andres Freund



Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

From
Jeremy Schneider
Date:
On 8/8/23 3:04 PM, Andres Freund wrote:
> On 2023-08-08 16:44:37 -0400, Robert Haas wrote:
>> On Mon, Aug 7, 2023 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:
>>> I think the biggest flaw of the locking scheme is that the LockHash locks
>>> protect two, somewhat independent, things:
>>> 1) the set of currently lockable objects, i.e. the entries in the hash table [partition]
>>> 2) the state of all the locks [in a partition]
>>>
>>> It'd not be that hard to avoid the shared hashtable lookup in a number of
>>> cases, e.g. by keeping LOCALLOCK entries around for longer, as I suggest
>>> above.  But we can't, in general, avoid the lock on the partition anyway, as
>>> each lock's state is also protected by the partition lock.
>>
>> Yes, and that's a huge problem. The main selling point of the whole
>> fast-path mechanism is to ease the pressure on the lock manager
>> partition locks, and if we did something like what you described in
>> the previous email without changing the locking regimen, we'd bring
>> all of that contention back. I'm pretty sure that would suck.
> 
> Yea - I tried to outline how I think we could implement the fastpath locking
> scheme in a less limited way in the earlier email, that I had quoted above
> this bit.  Here I was pontificating on what we possibly should do in addition
> to that. I think even if we had "unlimited" fastpath locking, there's still
> enough pressure on the lock manager locks that it's worth improving the
> overall locking scheme.


Has anyone considered whether increasing NUM_LOCK_PARTITIONS to
something bigger than 16 might offer cheap/easy/small short-term
improvements while folks continue to think about the bigger long-term ideas?

cf.

https://www.postgresql.org/message-id/flat/VI1PR05MB620666631A41186ACC3FC91ACFC70%40VI1PR05MB6206.eurprd05.prod.outlook.com

I haven't looked deeply into it myself yet. Didn't see a mention in this
thread or in Matt's gitlab research ticket. Maybe it doesn't actually
help. Anyway Alexander Pyhalov's email about LWLock optimization and
NUM_LOCK_PARTITIONS is out there, and I wondered about this.
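
For reference, that partition count is a compile-time constant, so experimenting with a larger value means rebuilding; the definition is roughly the following (paraphrasing src/include/storage/lwlock.h - double-check against your tree):

/* Number of partitions of the shared lock tables */
#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS  (1 << LOG2_NUM_LOCK_PARTITIONS)

Bumping LOG2_NUM_LOCK_PARTITIONS to 5 or 6 would give 32 or 64 partition locks; whether that actually helps, or just spreads the same contention a little thinner, is the open question above.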

-Jeremy


-- 
Jeremy Schneider
Performance Engineer
Amazon Web Services