Re: Eager page freeze criteria clarification
From | Melanie Plageman
---|---
Subject | Re: Eager page freeze criteria clarification
Date |
Msg-id | CAAKRu_b3tpbdRPUPh1Q5h35gXhY=spH2ssNsEsJ9sDfw6=PEAg@mail.gmail.com
In reply to | Re: Eager page freeze criteria clarification (Melanie Plageman <melanieplageman@gmail.com>)
Responses | Re: Eager page freeze criteria clarification (Joe Conway <mail@joeconway.com>)
 | Re: Eager page freeze criteria clarification (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
On Wed, Nov 8, 2023 at 9:23 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> The next step is to devise different heuristics and measure their
> efficacy. IMO, the goal of the algorithm it is to freeze pages in a
> relation such that we drive early unfreezes/freezes -> 0 and pages
> frozen/number of pages of a certain age -> 1.

Attached is a patchset with an adaptive freezing algorithm that works well. It keeps track of the pages' unmodified duration after having been set all-visible and uses that to predict how likely a page is to be modified given its age.

Each time an all-visible page is modified by an update, delete, or tuple lock, if the page was all-visible for less than target_freeze_duration (a new guc that specifies the minimum amount of time you would like data to stay frozen), it is counted as an "early unset" and the duration that it was unmodified is added into an accumulator.

Before each relation is vacuumed, we calculate the mean and standard deviation of these durations using the accumulator. This is used to calculate a page LSN threshold demarcating pages with a < 5% likelihood of being modified before target_freeze_duration. We don't freeze pages younger than this threshold.

I've done away with the ring buffer of vacuums and the idea of attributing an "unset" to the vacuum that set it all visible. Instead, using an "accumulator", I keep a running sum of the page ages, along with the cardinality of the accumulated set and the sum squared of page ages. This data structure allows us to extract the mean and standard deviation, at any time, from an arbitrary number of values in constant space, and is used to model the pattern of these unsets as a normal distribution that we can use to try and predict whether or not a page will be modified.
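For readers following along, the accumulator idea can be sketched in a few lines. This is an illustrative Python sketch, not the patch's actual C code; the class and field names are hypothetical. It shows how tracking only the count, sum, and sum of squares yields the mean and standard deviation at any time in constant space:

```python
import math

class UnsetDurationAccumulator:
    """Online accumulator over "early unset" page ages. Stores only
    three numbers, regardless of how many values are added."""

    def __init__(self):
        self.n = 0          # cardinality of the accumulated set
        self.total = 0.0    # running sum of page ages
        self.total_sq = 0.0 # running sum of squared page ages

    def add(self, age):
        self.n += 1
        self.total += age
        self.total_sq += age * age

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # Var(X) = E[X^2] - E[X]^2; clamp at 0 against rounding error
        var = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(var, 0.0))
```

For example, after adding ages 10, 20, and 30, `mean()` returns 20 and `stddev()` returns the population standard deviation of those three values, without any of them having been stored individually.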
Values can be "removed" from the accumulator by simply decrementing its cardinality and decreasing the sum and sum squared by a value that will not change the mean and standard deviation of the overall distribution. To adapt to a table's changing access patterns, we'll need to remove values from this accumulator over time, but this patch doesn't yet decide when to do this. A simple solution may be to cap the cardinality of the accumulator to the greater of 1% of the table size or some fixed number of values (perhaps 200?). Even without such removal of values, the distribution recorded in the accumulator will eventually skew toward more recent data, albeit at a slower rate.

This algorithm is able to predict when pages will be modified before target_freeze_duration as long as their modification pattern fits a normal distribution -- this accommodates access patterns where, after some period of time, pages become less likely to be modified the older they are. As an example, however, access patterns where modification times are bimodal aren't handled well by this model (imagine that half your modifications are to very new data and the other half are to data that is much older but still younger than target_freeze_duration). If this is a common access pattern, the normal distribution could easily be swapped out for another distribution. The current accumulator is capable of tracking a distribution's first two moments (the mean and standard deviation). We could track more if we wanted to use a fancier distribution.

We also must consider what to do when we have few unsets, e.g. with an insert-only workload. When there are very few unsets (I chose 30, which the internet says is the approximate minimum number required for the central limit theorem to hold), we can choose to always freeze freezable pages; above this limit, the calculated page threshold is used. However, we may not want to make predictions based on 31 values either.
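Two of the mechanisms above can be made concrete with a short sketch. This is hypothetical illustration code, not the patch: it shows (a) one way to "remove" a value without shifting the distribution -- subtract the current mean from the sum and the current mean of squares from the sum of squares -- and (b) a cutoff at roughly the 95th percentile of a normal distribution (z ≈ 1.64), below which pages are considered too young to freeze, falling back to "freeze everything freezable" when there are fewer than 30 samples:

```python
import math

def forget_one(n, total, total_sq):
    """Drop one value from (count, sum, sum-of-squares) such that the
    mean and standard deviation of what remains are unchanged."""
    return n - 1, total - total / n, total_sq - total_sq / n

def freeze_age_threshold(n, total, total_sq, min_samples=30, z=1.64):
    """Age cutoff: pages unmodified for longer than this have roughly
    a < 5% chance of being modified soon, assuming early-unset ages
    are normally distributed. With fewer than min_samples
    observations, return 0 so that all freezable pages are frozen."""
    if n < min_samples:
        return 0.0
    mean = total / n
    var = max(total_sq / n - mean * mean, 0.0)  # clamp rounding error
    return mean + z * math.sqrt(var)
```

It is easy to verify algebraically that `forget_one` preserves both moments: the new sum over n-1 values is still the old mean, and the new sum of squares over n-1 values is still the old mean of squares.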
To avoid this "cliff", we could modify the algorithm a bit to skew the mean and standard deviation of the distribution using a confidence interval based on the number of values we've collected.

The goal is to keep pages frozen for at least target_freeze_duration. target_freeze_duration is in seconds, and pages only have a last modification LSN, so target_freeze_duration must be converted to LSNs. To accomplish this, I've added an LSNTimeline data structure, containing (XLogRecPtr, TimestampTz) pairs stored with decreasing precision as they age. When we need to translate the guc value to LSNs, we linearly interpolate it on this timeline. For the time being, the global LSNTimeline is in PgStat_WalStats and is only updated by vacuum. There is no reason it can't be updated with some other cadence and/or by some other process (nothing about it is inherently tied to vacuum).

The cached translated value of target_freeze_duration is stored in each table's stats. This is arbitrary, as it is not a table-level stat. However, it needs to be located somewhere that is accessible on update/delete. We may want to recalculate it more often than once per table vacuum, especially in the case of long-running vacuums.

To benchmark this new heuristic (I'm calling it algo 6, since it is the 6th I've proposed on this thread), I've used a modified subset of my original workloads:

Workloads

C. Shifting hot set: 32 clients inserting multiple rows and then updating an indexed column of a row on a different page containing data they formerly inserted. Only recent data is updated.

H. Append-only table: 32 clients, each inserting a single row at a time.

I.
Work queue: 32 clients, each inserting a row, then updating an indexed column in 2 different rows in nearby pages, then deleting 1 of the updated rows.

Workload C:

Algo | Table    | AVs | Page Freezes | Pages Frozen | % Frozen
-----|----------|-----|--------------|--------------|---------
6    | hot_tail | 14  | 2,111,824    | 1,643,217    | 53.4%
M    | hot_tail | 16  | 241,605      | 3,921        | 0.1%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w  | AV IO time | TPS
-----|--------|------------------|------------|------------|-------
6    | 193    | 5,473,949        | 12,793,574 | 14,870     | 28,397
M    | 207    | 5,213,763        | 20,362,315 | 46,190     | 28,461

Algorithm 6 freezes all of the cold data and doesn't freeze the current working set. The notable thing is how much this reduces overall system I/O. On master, autovacuum is doing more than 3x the I/O, and the rest of the system is doing more than 1.5x the I/O. I suspect freezing data when it is initially vacuumed is saving future vacuums from having to evict pages of the working set and read in cold data.

Workload H:

Algo | Table     | AVs | Page Freezes | Pages Frozen | % Frozen
-----|-----------|-----|--------------|--------------|---------
6    | hthistory | 22  | 668,448      | 668,003      | 87%
M    | hthistory | 22  | 0            | 0            | 0%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO time | TPS
-----|--------|------------------|-----------|------------|-------
6    | 14     | 725,103          | 725,575   | 1          | 43,535
M    | 13     | 693,945          | 694,417   | 1          | 43,522

The insert-only table is mostly frozen at the end. There is more I/O done, but not at the expense of TPS. This is exactly what we want.
Workload I:

Algo | Table     | AVs | Page Freezes | Pages Frozen | % Frozen
-----|-----------|-----|--------------|--------------|---------
6    | workqueue | 234 | 0            | 4,416        | 78%
M    | workqueue | 234 | 0            | 4,799        | 87%

Algo | WAL GB | Cptr Bgwr Writes | Other r/w | AV IO Time | TPS
-----|--------|------------------|-----------|------------|-------
6    | 74     | 64,345           | 64,813    | 1          | 36,366
M    | 73     | 63,468           | 63,936    | 1          | 36,145

What we want is for the work queue table to freeze as little as possible, because we will be constantly modifying the data. Both on master and with algorithm 6, we do not freeze tuples on any pages. You will notice, however, that much of the table is set frozen in the VM at the end. This is because we set pages all-frozen in the VM if they are technically all-frozen, even if we do not freeze tuples on the page. This is inexpensive and not under the control of the freeze heuristic.

Overall, the behavior of this new adaptive freezing method seems to be exactly what we want. The next step is to decide how many values to remove from the accumulator and benchmark cases where old data is deleted.

I'd be delighted to receive any feedback, ideas, questions, or review.

- Melanie
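[Editor's illustration] The LSNTimeline translation described earlier -- mapping target_freeze_duration, a number of seconds, onto an LSN cutoff by linear interpolation over recorded (timestamp, LSN) pairs -- can be sketched roughly as follows. This is hypothetical Python, not the patch's C code, and it omits the patch's coarsening of old entries:

```python
from bisect import bisect_left

def time_to_lsn(timeline, t):
    """Estimate the LSN in effect at time t by linear interpolation.

    timeline: list of (timestamp, lsn) pairs, sorted by timestamp.
    Times outside the recorded range clamp to the nearest endpoint."""
    times = [ts for ts, _ in timeline]
    i = bisect_left(times, t)
    if i == 0:
        return timeline[0][1]      # before the earliest entry
    if i == len(timeline):
        return timeline[-1][1]     # after the latest entry
    (t0, l0), (t1, l1) = timeline[i - 1], timeline[i]
    frac = (t - t0) / (t1 - t0)    # position between the two entries
    return l0 + frac * (l1 - l0)
```

The freeze cutoff would then be something like `time_to_lsn(timeline, now - target_freeze_duration)`: pages whose last-modified LSN is older than that value have been unmodified for at least the target duration.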
Attachments
- v2-0001-Record-LSN-at-postmaster-startup.patch
- v2-0005-Add-guc-target_freeze_duration.patch
- v2-0003-visibilitymap_set-clear-return-previous-vm-bits.patch
- v2-0002-WIP-Add-LSNTimeline-to-estimate-LSN-consumption-r.patch
- v2-0004-Add-accumulator-to-calculate-normal-dist-online.patch
- v2-0006-Count-table-modification-VM-clears.patch
- v2-0007-Opportunistically-freeze-pages-unlikely-to-be-mod.patch
- v2-0008-Track-VM-sets-by-vacuum.patch
- v2-0010-Add-pg_visibility_map_summary_extended.patch
- v2-0009-Add-VM-set-and-unset-stats-to-pg_stat_all_tables.patch