Discussion: Computing the conflict xid for index page-level-vacuum on primary

Computing the conflict xid for index page-level-vacuum on primary

From: Andres Freund
Date:
Hi,

Currently nbtree and hash indexes (and soon gist, where it's missing due
to oversight) compute an xid that is used to resolve recovery conflicts.
They do so by visiting, item by item, all the heap pages that
xl_btree_delete/xl_hash_vacuum_one_page point to.
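
Schematically, today's recovery-side computation looks roughly like the
sketch below.  This is a paraphrase, not the actual nbtxlog.c code;
read_heap_page(), tuple_xid_horizon() and index_item_heap_tid() are
hypothetical stand-ins for the real buffer-manager and heapam calls in
btree_xlog_delete_get_latestRemovedXid():

    TransactionId latestRemovedXid = InvalidTransactionId;

    for (int i = 0; i < xlrec->nitems; i++)
    {
        /* heap TID that the to-be-deleted index item points to */
        ItemPointer htid = index_item_heap_tid(i);

        /* one heap buffer read per index item, in index-page order */
        Page    hpage = read_heap_page(ItemPointerGetBlockNumber(htid));

        /* horizon derived from the pointed-to (dead) heap tuple */
        TransactionId xid = tuple_xid_horizon(hpage,
                                              ItemPointerGetOffsetNumber(htid));

        if (TransactionIdFollows(xid, latestRemovedXid))
            latestRemovedXid = xid;
    }

    /* latestRemovedXid then drives recovery conflict resolution */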

That's problematic for two projects I'm working on:

1) For pluggable storage it's problematic, because obviously we can't
   have the nbtree/hash/gist code just assume that the index points to a
   heap.  There's no way we can do catalog lookups to determine the AM,
   and additionally AMs are database specific, so the startup process
   couldn't handle that, even if we somehow managed to get the oid or
   such of the AM handler.

   The zheap patchset currently solves that by having a separate flag
   indicating that the index is over zheap, but that obviously doesn't
   work with table storage intended to be pluggable.


2) For logical decoding on the standby [3], at least in my approach, we
   need to be able to recognize snapshot conflicts not just when the
   database is consistent, but also when it's not yet.  But
   unfortunately, the current approach requires that a consistent state
   has been reached:

    /*
     * In what follows, we have to examine the previous state of the index
     * page, as well as the heap page(s) it points to.  This is only valid if
     * WAL replay has reached a consistent database state; which means that
     * the preceding check is not just an optimization, but is *necessary*. We
     * won't have let in any user sessions before we reach consistency.
     */
    if (!reachedConsistency)
        elog(PANIC, "btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data");


I think there's also the tertiary issue that doing substantial work
during recovery that's not performed on the primary is a bad idea,
because it makes it much more likely for the standby to start lagging
behind. Recovery is single-threaded, whereas records like
xl_btree_delete can be emitted concurrently by every single backend. As
things stand there's no backpressure whatsoever.  This also means that
we repeat that work on every replica.


This leads me to think that we possibly should move computation of the
last removed xid from recovery to the primary, during the generation of
the xl_btree_delete WAL record.
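
A minimal sketch of what that could look like, assuming a new
latestRemovedXid field in xl_btree_delete and a hypothetical
compute_xid_horizon_for_tuples() helper supplied by the table AM
(neither exists today; the local variable names are placeholders too):

    /*
     * In _bt_delitems_delete(), while the index page lock is still held:
     * compute the horizon on the primary and carry it in the WAL record.
     */
    if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
        xlrec_delete.latestRemovedXid =
            compute_xid_horizon_for_tuples(heapRel, deltids, ndeletable);

    /* ... register data and emit XLOG_BTREE_DELETE as before ... */

    /* Redo then no longer needs to touch the heap at all: */
    if (InHotStandby)
        ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, rnode);

Here rnode would be the index's RelFileNode obtained during redo, as it
is for the existing conflict resolution call.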


That obviously has the downside that we're doing the additional work
whenever wal_level >= replica, which previously only was done during
recovery.  I don't really see a good way around that for pluggable
storage, unfortunately - I'd be delighted to hear a good one.  Both
during recovery and on the primary that work happens while the index
page is locked.

In the prototype I've attached, I've attempted to address that by making
the routine less inefficient than before:
- Previously it fetched a buffer for every single tuple, in the order of
  items on the index page. It's pretty common for there to be multiple
  tuples on the same page, and it's more efficient to access pages in
  the on-disk order.
- We didn't do any prefetching, which seems a waste given how easily it
  can be done in this case. For now I've just added unconditional
  prefetching, but I guess that ought to instead be something respecting
  effective_io_concurrency :/  (A sketch of these two points follows
  below the list.)
- It seems to me that it's more likely on the primary that the relevant
  heap pages are already cached: The kill_prior_tuple logic needs to
  have kicked in to start the page level deletion after all, and that
  doesn't WAL log any changes to the heap page.
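
To illustrate the first two points, here is a rough sketch of the
block-ordered access plus prefetching; htids, nhtids and heapRel are
placeholders, the actual code is in the attached prototype:

    /* sort the to-be-visited heap TIDs into on-disk (block) order */
    static int
    cmp_htid_by_block(const void *a, const void *b)
    {
        BlockNumber ba = ItemPointerGetBlockNumber((ItemPointer) a);
        BlockNumber bb = ItemPointerGetBlockNumber((ItemPointer) b);

        if (ba < bb)
            return -1;
        if (ba > bb)
            return 1;
        return 0;
    }

    /* ... in the horizon computation ... */
    qsort(htids, nhtids, sizeof(ItemPointerData), cmp_htid_by_block);

    /* unconditional prefetch pass; ought to respect effective_io_concurrency */
    for (int i = 0; i < nhtids; i++)
        PrefetchBuffer(heapRel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&htids[i]));

    /*
     * Then visit the blocks in order, keeping the current heap buffer
     * pinned while consecutive TIDs fall on the same block, instead of
     * doing one buffer lookup per index item.
     */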

While on my laptop I wasn't able to observe a performance regression due
to these changes, and was able to see recovery performance increase, I
suspect it's possible to see performance regressions on disks that are
less fast than my SSD, in particular on rotational disks.


Talking with Peter Geoghegan we were pondering a few ideas to address
that:
- we could stop doing page level deletion after the tuples on the first
  heap page, if that frees sufficient space on the index page, to stave
  off a page split for longer.  That has the downside that it's possible
  that we'd end up WAL logging more, because we'd repeatedly emit
  xl_btree_delete records.  (A sketch of this follows below the list.)
- We could use the visits to the heap pages to be *more* aggressive
  about deleting index entries. If there are some tuples on a heap page
  that are deletable, and we know about that due to the kill_prior_tuple
  logic, it's fairly likely that other pointers to the same page
  actually are also dead: We could re-check that for all index entries
  pointing to pages that we'd visit anyway.
- We could invent an 'exclusive read' lwlock level, and downgrade the
  index buffer lock while we check the heap pages, and then upgrade
  afterwards.  A cursory look makes that look safe, but it'd obviously
  be a nontrivial change.
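
For illustration, the first idea might look roughly like this; deadtids,
deletable, space_needed and index_item_size() are hypothetical
placeholders:

    /*
     * Walk the LP_DEAD items in heap-block order and stop visiting
     * further heap pages once enough index-page space has been freed to
     * avoid the imminent split.
     */
    Size    freed = 0;
    int     ndeletable = 0;
    int     i = 0;

    while (i < ndeadtids && freed < space_needed)
    {
        BlockNumber blk = ItemPointerGetBlockNumber(&deadtids[i]);

        /* take every dead item pointing into this heap block in one go */
        while (i < ndeadtids &&
               ItemPointerGetBlockNumber(&deadtids[i]) == blk)
        {
            freed += index_item_size(&deadtids[i]);
            deletable[ndeletable++] = deadtids[i];
            i++;
        }

        /* one heap page visit per block: advance latestRemovedXid here */
    }

    /* delete 'deletable', emit a single xl_btree_delete with the horizon */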


Comments? Better ideas? Pitchforks?


Minor aside:

The code currently has these two comments:
     * If there's nothing running on the standby we don't need to derive a
     * full latestRemovedXid value, so use a fast path out of here.  This
     * returns InvalidTransactionId, and so will conflict with all HS
     * transactions; but since we just worked out that that's zero people,
     * it's OK.
     *
     * XXX There is a race condition here, which is that a new backend might
     * start just after we look.  If so, it cannot need to conflict, but this
     * coding will result in throwing a conflict anyway.
    if (CountDBBackends(InvalidOid) == 0)
        return latestRemovedXid;
and
    /*
     * If all heap tuples were LP_DEAD then we will be returning
     * InvalidTransactionId here, which avoids conflicts. This matches
     * existing logic which assumes that LP_DEAD tuples must already be older
     * than the latestRemovedXid on the cleanup record that set them as
     * LP_DEAD, hence must already have generated a conflict.
     */
    return latestRemovedXid;

which contradict each other.

ResolveRecoveryConflictWithSnapshot indeed ignores such conflicts:

    /*
     * If we get passed InvalidTransactionId then we are a little surprised,
     * but it is theoretically possible in normal running. It also happens
     * when replaying already applied WAL records after a standby crash or
     * restart, or when replaying an XLOG_HEAP2_VISIBLE record that marks as
     * frozen a page which was already all-visible.  If latestRemovedXid is
     * invalid then there is no conflict. That rule applies across all record
     * types that suffer from this conflict.
     */
    if (!TransactionIdIsValid(latestRemovedXid))
        return;

which means we don't handle the race condition above correctly, as far
as I can tell?  This seems to have been introduced in [2].  It's
probably not too easy to hit this, because newer transactions on a
standby are likely to have a new enough snapshot, but I think it's
a question of likelihood, not certainty.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=52010027efc8757fdd830a4d0113763a501259bc
[3] https://www.postgresql.org/message-id/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

Attachments

Re: Computing the conflict xid for index page-level-vacuum on primary

From: Alexander Korotkov
Date:
On Fri, Dec 14, 2018 at 4:43 AM Andres Freund <andres@anarazel.de> wrote:
> This leads me to think that we possibly should move computation of the
> last removed xid from recovery to the primary, during the generation of
> the xl_btree_delete WAL record.

Do I understand correctly that we need this xid computation because
we may delete some index tuples using kill_prior_tuple before we prune
the corresponding heap tuples (which would also be replicated and cause
the required conflict)?  If so, then can we just give up on that?  That
is, before setting kill_prior_tuple = true, prune the corresponding heap
tuples.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Computing the conflict xid for index page-level-vacuum on primary

From: Andres Freund
Date:
Hi,

On 2018-12-14 21:39:48 +0300, Alexander Korotkov wrote:
> On Fri, Dec 14, 2018 at 4:43 AM Andres Freund <andres@anarazel.de> wrote:
> > This leads me to think that we possibly should move computation of the
> > last removed xid from recovery to the primary, during the generation of
> > the xl_btree_delete WAL record.
> 
> Do I understand correctly that we need this xid computation because
> we may delete some index tuples using kill_prior_tuple before we prune
> the corresponding heap tuples (which would also be replicated and cause
> the required conflict)?

Correct.


> If so, then can we just give up on that?  That is, before setting
> kill_prior_tuple = true, prune the corresponding heap tuples.

That'd require WAL logging such changes, which'd be pretty bad, because
we'd need to do it one-by-one.  What we could do, and what I suggested
earlier, is that we do a bit more pruning when emitting such WAL
records.


Greetings,

Andres Freund


Re: Computing the conflict xid for index page-level-vacuum on primary

From: Peter Geoghegan
Date:
On Thu, Dec 13, 2018 at 5:42 PM Andres Freund <andres@anarazel.de> wrote:
> This leads me to think that we possibly should move computation of the
> last removed xid from recovery to the primary, during the generation of
> the xl_btree_delete WAL record.

For the record, I share your sense that this isn't a great design for
microvacuuming. It would be better if it happened on the primary,
where we generally have a lot more control/context. Of course, the
devil is in the details.

> Talking with Peter Geoghegan we were pondering a few ideas to address
> that:
> - we could stop doing page level deletion after the tuples on the first
>   heap page, if that frees sufficient space on the index page, to stave
>   off a page split for longer.  That has the downside that it's possible
>   that we'd end up WAL logging more, because we'd repeatedly emit
>   xl_btree_delete records.
> - We could use the visits to the heap pages to be *more* aggressive
>   about deleting index entries. If there are some tuples on a heap page
>   that are deletable, and we know about that due to the kill_prior_tuple
>   logic, it's fairly likely that other pointers to the same page
>   actually are also dead: We could re-check that for all index entries
>   pointing to pages that we'd visit anyway.

Right. As we discussed, B-Tree is concerned solely with delaying and
probably even preventing a page split in its _bt_findsplitloc() path.
It's far from obvious that an eventual page split is inevitable, since
of course the range of values that belong on the leaf page are always
constrained. We can probably afford to be lazy about the things we
actually kill, since we're operating on the primary at a critical
moment: the point that we need some exact amount of free space to
delay a page split. We can free only as much space as needed,
balancing the need to access heap pages against the immediate need for
free space on the leaf page.

Now, I admit that I haven't thought through the implications for WAL
volume at all. However, it's possible that we might actually manage to
*decrease* the WAL volume in many cases by never freeing space that
isn't truly needed. It's not impossible that we'd regularly come out
ahead there, since a split is not inevitable just because we're
running low on free space this instant. And, I suspect that WAL volume
is the only major additional overhead that comes from being more
discerning about how many tuples are actually killed. Eagerly freeing
space to prevent a page split is a high priority for many B-Tree
implementations; the additional fragmentation of the leaf page caused
by a more incremental approach to deletion isn't likely to be a
problem.

Once you establish the principle that you can be lazy here, you also
establish the principle that you can be more eager, finding additional
things to kill that didn't already have their LP_DEAD bits set based
on heuristics (e.g. "I'm visiting the same heap page anyway, might as
well look for extra stuff"). It isn't necessary to actually have any
of that fancy stuff in your initial version, though.
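
If it helps to visualize that, here is a sketch with hypothetical
helpers; the visibility check stands in for something like
HeapTupleSatisfiesVacuum() against a sufficiently old xmin:

    /*
     * While a heap page is pinned anyway to compute the conflict horizon
     * for the LP_DEAD index items, re-check the other index entries on
     * the same index page that point into this heap block.
     */
    for (int j = 0; j < nindexitems; j++)
    {
        ItemPointer htid = index_item_heap_tid(j);

        if (index_item_marked_dead(j))
            continue;           /* already on the kill list */
        if (ItemPointerGetBlockNumber(htid) != this_heap_block)
            continue;           /* would cost an extra heap read */

        /* stand-in for a proper visibility check */
        if (heap_item_dead_to_everyone(hpage, ItemPointerGetOffsetNumber(htid)))
            extra_killable[nextra++] = j;
    }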

-- 
Peter Geoghegan


Re: Computing the conflict xid for index page-level-vacuum on primary

From: Alexander Korotkov
Date:
On Fri, Dec 14, 2018 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-12-14 21:39:48 +0300, Alexander Korotkov wrote:
> > If so, then can we just give up on that?  That is, before setting
> > kill_prior_tuple = true, prune the corresponding heap tuples.
>
> That'd require WAL logging such changes, which'd be pretty bad, because
> we'd need to do it one-by-one.  What we could do, and what I suggested
> earlier, is that we do a bit more pruning when emitting such WAL
> records.

It's not necessarily one-by-one; we could decide to prune the whole heap
page and eventually delete more tuples.  But yes, in the worst case it
would be one-by-one.  I'm not sure what the practical average would
be.

Also, for resolving conflicts we only need to inform the standby about
the latest removed xid.  So, before setting kill_prior_tuple = true, we
might emit a WAL record with just the latest removed xid, but only if
it's greater than the latest removed xid previously written to WAL.  So,
in the worst case it would be one extra WAL record per transaction, but
hopefully it would be much less.
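
A rough sketch of that bookkeeping, with entirely hypothetical names and
record type (nothing like this exists today):

    static TransactionId last_logged_removed_xid = InvalidTransactionId;

    static void
    maybe_log_removed_xid(TransactionId latestRemovedXid)
    {
        /* nothing for the standby to conflict with */
        if (!TransactionIdIsValid(latestRemovedXid))
            return;

        /* already covered by a previously emitted record */
        if (TransactionIdIsValid(last_logged_removed_xid) &&
            !TransactionIdFollows(latestRemovedXid, last_logged_removed_xid))
            return;

        /* emit the (hypothetical) xid-only conflict record here */
        /* XLogBeginInsert(); XLogRegisterData(...); XLogInsert(...); */

        last_logged_removed_xid = latestRemovedXid;
    }

A per-backend static is of course a simplification; a real version would
probably want a shared high-water mark so concurrent backends don't each
re-log the same horizon.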

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company