Thread: subtransaction performance regression [kind of] due to snapshot caching
Hi,
In a recent thread ([1]) I found a performance regression for the
following statement:
DO $do$
BEGIN
    FOR i IN 1 .. 10000 LOOP
        BEGIN
            EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
        EXCEPTION WHEN others THEN
        END;
    END LOOP;
END;$do$;
13:     1617.798 ms
14-dev: 34088.505 ms
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple
After a bit of debugging I figured out that the direct cause is commit
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already lock ProcArrayLock during XidCacheRemoveRunningXids).

What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". That then leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as surely
dead, and so their index entries are never killed.
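For anyone reproducing this, a rough way to watch the effect from plain SQL (a generic monitoring query, not part of the original report) is to compare the size of pg_depend and its indexes before and after running the DO block above - the profile is dominated by index scans over pg_depend, and without working index-entry killing those indexes bloat noticeably more:

SELECT oid::regclass AS rel, pg_relation_size(oid) AS bytes
FROM pg_class
WHERE relname IN ('pg_depend',
                  'pg_depend_depender_index',
                  'pg_depend_reference_index')
ORDER BY relname;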
I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?
Snapshot
GetSnapshotData(Snapshot snapshot)
...
        /*
         * We don't include our own XIDs (if any) in the snapshot. It
         * needs to be included in the xmin computation, but we did so
         * outside the loop.
         */
        if (pgxactoff == mypgxactoff)
            continue;
The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax from
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() advances, without also incrementing
ShmemVariableCache->xactCompletionCount.
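As an aside, a hypothetical way to watch that xmax movement from SQL (not from the report; assumes PostgreSQL 13+ for pg_current_snapshot(), an otherwise idle session, and a throwaway table name):

BEGIN;
SELECT pg_current_snapshot();        -- shows xmin:xmax:xip
SAVEPOINT s;
CREATE TABLE snapshot_demo (i int);  -- assigns an XID to the subtransaction
ROLLBACK TO SAVEPOINT s;
-- On a fixed build this fresh snapshot's xmax has advanced past the
-- aborted subtransaction's xid; a stale cached snapshot's xmax would not.
SELECT pg_current_snapshot();
COMMIT;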
Which then causes HeapTupleSatisfiesMVCC to go down
    if (!HeapTupleHeaderXminCommitted(tuple))
...
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        else
        {
            /* it must have aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }
the "return false" branch for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /*
     * Make a quick range check to eliminate most XIDs without looking at the
     * xip arrays. Note that this is OK even if we convert a subxact XID to
     * its parent below, because a subxact with XID < xmin has surely also got
     * a parent with XID < xmin, while one with XID >= xmax must belong to a
     * parent that was not yet committed at the time of this snapshot.
     */

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;
    /* Any xid >= xmax is in-progress */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;
I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
It is a bit disappointing that there are - as far as I could find - no
tests verifying that kill_prior_tuple actually works. I guess that gap,
together with the fact that query results don't change, explains why
nobody noticed the issue in the last ~9 months.
See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.
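For illustration, a rough sketch of the kind of check such a test can do (hypothetical table and index names; the attached test may look quite different): rows inserted by aborted subtransactions should get their index entries killed by later index scans, so the index should stay close to its initial size rather than grow with every iteration.

CREATE TABLE killtest (id int PRIMARY KEY);

DO $kt$
DECLARE
    size_before bigint := pg_relation_size('killtest_pkey');
BEGIN
    -- make sure the probe below really uses an index scan
    PERFORM set_config('enable_seqscan', 'off', true);
    FOR i IN 1 .. 10000 LOOP
        BEGIN
            INSERT INTO killtest VALUES (1);
            PERFORM 1/0;   -- force the subtransaction to abort
        EXCEPTION WHEN others THEN
        END;
        -- index scan over the dead rows; with working kill_prior_tuple
        -- their index entries are marked dead and can be reclaimed
        PERFORM 1 FROM killtest WHERE id = 1;
    END LOOP;
    RAISE NOTICE 'killtest_pkey grew from % to % bytes',
        size_before, pg_relation_size('killtest_pkey');
END;$kt$;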
I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...
Greetings,
Andres Freund
[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
Andres Freund <andres@anarazel.de> writes:
> The time in 14 is spent mostly below:
> - 94.58% 0.01% postgres postgres [.] CreateFunction
> - 94.57% CreateFunction
> - 94.49% ProcedureCreate
> - 90.95% record_object_address_dependencies
> - 90.93% recordMultipleDependencies
> - 89.65% isObjectPinned
> - 89.12% systable_getnext
> - 89.06% index_getnext_slot
> - 56.13% index_fetch_heap
> - 54.82% table_index_fetch_tuple
> + 53.79% heapam_index_fetch_tuple
> 0.07% heap_hot_search_buffer
> 0.01% ReleaseAndReadBuffer
> 0.01% LockBuffer
> 0.08% heapam_index_fetch_tuple
Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.
regards, tom lane
Hi,

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The time in 14 is spent mostly below:
> > - 94.58% 0.01% postgres postgres [.] CreateFunction
> > - 94.57% CreateFunction
> > - 94.49% ProcedureCreate
> > - 90.95% record_object_address_dependencies
> > - 90.93% recordMultipleDependencies
> > - 89.65% isObjectPinned
> > - 89.12% systable_getnext
> > - 89.06% index_getnext_slot
> > - 56.13% index_fetch_heap
> > - 54.82% table_index_fetch_tuple
> > + 53.79% heapam_index_fetch_tuple
> > 0.07% heap_hot_search_buffer
> > 0.01% ReleaseAndReadBuffer
> > 0.01% LockBuffer
> > 0.08% heapam_index_fetch_tuple
>
> Not wanting to distract from your point about xactCompletionCount,
> but ... I wonder if we could get away with defining "isObjectPinned"
> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> entries from pg_depend). I had not previously seen a case where the
> cost of looking into pg_depend for this info was this much of the
> total query runtime.

I had the same thought, and yes, I do think we should do that. I've seen
it show up in non-buggy workloads too (not to that degree though).

The <= 9999 pg_depend entries are also a substantial proportion of the
size of an empty database, making it worthwhile to remove them:

freshly initdb'd empty cluster:

VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_depend                       │           532480 │
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend_depender_index        │           294912 │
│ pg_depend_reference_index       │           294912 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

DELETE FROM pg_depend WHERE deptype = 'p' AND refobjid <> 0 AND refobjid < 10000;
VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend                       │           253952 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_depend_depender_index        │           147456 │
│ pg_depend_reference_index       │           147456 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

A reduction from 8407kB to 7863kB of the size of the "dropme" database
just by deleting the "implicitly pinned" entries seems like a good deal.

To save people the time to look it up: pg_toast.pg_toast_2618 is
pg_rewrite's toast table...

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...

Greetings,

Andres Freund
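As a quick way to see how the explicit pin entries distribute over the OID ranges being discussed (a generic query, not from the thread; assumes FirstGenbkiObjectId = 10000 and FirstBootstrapObjectId = 12000 as in this era's transam.h, with made-up column aliases):

SELECT count(*) FILTER (WHERE refobjid < 10000)  AS below_genbki,
       count(*) FILTER (WHERE refobjid >= 10000
                          AND refobjid < 12000)  AS genbki_to_bootstrap,
       count(*) FILTER (WHERE refobjid >= 12000) AS at_or_above_bootstrap
FROM pg_depend
WHERE deptype = 'p';

The first bucket is what the "OID <= 9999" shortcut would cover; the second is the range asked about above.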
Andres Freund <andres@anarazel.de> writes:
> On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
>> Not wanting to distract from your point about xactCompletionCount,
>> but ... I wonder if we could get away with defining "isObjectPinned"
>> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
>> entries from pg_depend). I had not previously seen a case where the
>> cost of looking into pg_depend for this info was this much of the
>> total query runtime.
> Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> pinned? That'd be another 400kB of database size...
Yeah, it'd require some close study of exactly what we want to pin
or not pin. Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs. On the other hand,
it's intentional that nothing in information_schema is pinned.
We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.
Anyway, just idle late-night speculation for now ...
regards, tom lane
On 2021-04-06 01:34:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
> >> Not wanting to distract from your point about xactCompletionCount,
> >> but ... I wonder if we could get away with defining "isObjectPinned"
> >> as "is the OID <= 9999" (and, in consequence, dropping explicit pin
> >> entries from pg_depend). I had not previously seen a case where the
> >> cost of looking into pg_depend for this info was this much of the
> >> total query runtime.
>
> > Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
> > pinned? That'd be another 400kB of database size...
>
> Yeah, it'd require some close study of exactly what we want to pin
> or not pin.

One interesting bit is:

postgres[947554][1]=# SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype, refobjversion FROM pg_depend WHERE refobjid < 13000 AND deptype <> 'p';
┌───────────────┬───────┬──────────────┬──────────┬─────────┬───────────────┐
│    classid    │ objid │  refclassid  │ refobjid │ deptype │ refobjversion │
├───────────────┼───────┼──────────────┼──────────┼─────────┼───────────────┤
│ pg_constraint │ 15062 │ pg_collation │      100 │ n       │ 2.31          │
└───────────────┴───────┴──────────────┴──────────┴─────────┴───────────────┘
(1 row)

> Certainly everything with hand-assigned OIDs should
> be pinned, but I think there's a lot of critical stuff like index
> opclasses that don't get hand-assigned OIDs. On the other hand,
> it's intentional that nothing in information_schema is pinned.

Isn't that pretty much the difference between FirstGenbkiObjectId and
FirstBootstrapObjectId? Genbki will have assigned things like opclasses,
but not things like information_schema?

> We might have to rejigger initdb so that there's a clearer
> distinction between the OID ranges we want to pin or not.
> Maybe we'd even get initdb to record the cutoff OID in
> pg_control or someplace.

The only non-pinned pg_depend entry below FirstBootstrapObjectId is the
collation versioning one above. The only pinned entries above
FirstBootstrapObjectId are the ones created via system_constraints.sql.

So it seems we "just" would need to resolve the constraint versioning
stuff? And that could probably just be handled as a hardcoded special
case for now...

Greetings,

Andres Freund
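To double-check that claim about entries above FirstBootstrapObjectId on a given cluster (again a generic query, not from the thread; assumes FirstBootstrapObjectId = 12000), the pinned entries in that range should all reference pg_constraint rows, i.e. the constraints installed by system_constraints.sql:

SELECT refclassid::regclass, count(*)
FROM pg_depend
WHERE deptype = 'p' AND refobjid >= 12000
GROUP BY refclassid
ORDER BY count(*) DESC;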
Hi,

On 2021-04-05 21:35:21 -0700, Andres Freund wrote:
> See the attached fix. I did include a test that verifies that the
> kill_prior_tuples optimization actually prevents the index from growing,
> when subtransactions are involved. I think it should be stable, even
> with concurrent activity. But I'd welcome a look.

Pushed that now, after trying and failing to make the test spuriously
fail due to concurrent activity.

Greetings,

Andres Freund