Re: The plan for FDW-based sharding

From: Craig Ringer
Subject: Re: The plan for FDW-based sharding
Msg-id: CAMsr+YFCPh4TWAPZts7Jdysgt6VOuRH+hyBC6g8bMLD_q0CQvQ@mail.gmail.com
In reply to: Re: The plan for FDW-based sharding  (Kevin Grittner <kgrittn@gmail.com>)
Responses: Re: The plan for FDW-based sharding  (Kevin Grittner <kgrittn@gmail.com>)
List: pgsql-hackers
On 5 March 2016 at 23:41, Kevin Grittner <kgrittn@gmail.com> wrote:

> I'd be really interested in some ideas on how that information might be
> usefully accessed. If we could write info on when to apply commits to the
> xlog in serializable mode that'd be very handy, especially when looking to
> the future with logical decoding of in-progress transactions, parallel
> apply, etc.

Are you suggesting the possibility of holding off on writing the
commit record for a SERIALIZABLE transaction to WAL until it is
known that no other SERIALIZABLE transaction comes ahead of it in
the apparent order of execution?  If so, that's an interesting idea
that I hadn't given much thought to yet -- I had been assuming
current WAL writes, with adjustments to the timing of application
of the records.

I wasn't; I simply wrote less than clearly. I intended to say "from the xlog" where I wrote "to the xlog". Nonetheless, that'd be a completely unrelated but interesting thing to explore...
 
> For parallel apply I anticipated that we'd probably have workers applying
> xacts in parallel and committing them in upstream commit order. They'd
> sometimes deadlock with each other; when this happened all workers whose
> xacts committed after the first aborted xact would have to abort and start
> again. Not ideal, but safe.
>
> Being able to avoid that by using SSI information was in the back of my
> mind, but with no idea how to even begin to tackle it. What you've mentioned
> here is helpful and I'd be interested if you could share a bit more of your
> experience in the area.

My thinking so far has been that reordering the application of
transaction commits on a replica would best be done as the minimal
rearrangement possible from commit order which allows the work of
transactions to become visible in an order consistent with some
one-at-a-time run of those transactions.  Partly that is because
the commit order is something that is fairly obvious to see and is
what most people intuitively look at, even when it is wrong.
Deviating from this intuitive order seems likely to introduce
confusion, even when the results are 100% correct.

The only place you *need* to vary from commit order for correctness
is when there are overlapping SERIALIZABLE transactions, one
modifies data and commits, and another reads the old version of the
data but commits later.

Ah, right. So here, even though X1 commits before X2 while both run concurrently under SSI, the logical order in which the xacts could have occurred serially is the one where X2 runs and commits before X1, since X2 doesn't depend on X1. X2 read the old row version before X1 modified it, so it logically occurs before X1 in the serial rearrangement.
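
To make the reordering concrete, here is a rough sketch in Python of deriving an apply order from commit order plus rw-dependencies. It is only an illustration of the idea, not anything PostgreSQL does today: `apply_order`, its arguments and the xact labels are all made up for the example, only rw-dependencies are modelled, and "stay close to commit order" is approximated by always picking the earliest-committed xact that has no unapplied reader ahead of it.

```python
import heapq

def apply_order(commit_order, rw_deps):
    """Return an apply order consistent with the rw-dependencies.

    commit_order: xact ids in upstream commit order (hypothetical labels).
    rw_deps: (reader, writer) pairs; the reader saw the old row version,
             so it must be applied before the writer.
    """
    position = {x: i for i, x in enumerate(commit_order)}
    successors = {x: [] for x in commit_order}
    pending = {x: 0 for x in commit_order}
    for reader, writer in rw_deps:
        successors[reader].append(writer)
        pending[writer] += 1

    # Heap of commit positions, so ties are broken in favour of commit order.
    ready = [position[x] for x in commit_order if pending[x] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        x = commit_order[heapq.heappop(ready)]
        order.append(x)
        for nxt in successors[x]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, position[nxt])
    if len(order) != len(commit_order):
        raise ValueError("cycle in rw-dependencies; no serial order exists")
    return order

# X1 committed first, but X2 read the row version that X1 replaced.
print(apply_order(["X1", "X2"], [("X2", "X1")]))
```

With no dependencies the function returns commit order unchanged, which matches the point that deviation is only needed for overlapping SERIALIZABLE xacts.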

I don't fully grasp how that can lead to a situation where xacts can commit in an order that's valid upstream but not valid as a downstream apply order. I presume we're looking at read-only logical replicas here (rather than multimaster), and it's only a concern for SERIALIZABLE xacts since a READ COMMITTED xact on the master and replica would both be able to see the state where X1 is committed but X2 isn't yet. But I don't see how a read-only xact in SERIALIZABLE on the replica can get different results from what it'd get with SSI on the master. It's entirely possible for a read xact on the master to get a snapshot after X1 commits and after X2 commits, same as READ COMMITTED. SSI shouldn't AFAIK come into play with no writes to create a pivot. Is that wrong?

If we applied this sequence to the downstream in commit order we'd still get correct results on the heap after applying both. We'd have an intermediate state where X1 is committed but X2 isn't, but we can have the same on the master. SSI doesn't AFAIK mask X1 from becoming visible in a snapshot until X2 commits or anything, right?
 
Due to the action of SSI on the source
machine, you know that there could not be any SERIALIZABLE
transaction which saw the inconsistent state between the two
commits, but on replicas we don't yet manage that.

OK, maybe that's what I'm missing. How exactly does SSI ensure that? (An RTFM link / hint is fine, but I didn't find it in the SSI section of TFM, at least not in a way I recognised.)

The key is that
there is a read-write dependency (a/k/a rw-conflict) between the
two transactions which tells you that the second to commit has to
come before the first in any graph of apparent order of execution.

Yeah, I get that part. How does that stop a 3rd SERIALIZABLE xact from getting a snapshot between the two commits and reading from there?
 
The tricky part is that when there are two overlapping SERIALIZABLE
transactions and one of them has modified data and committed, and
there is an overlapping SERIALIZABLE transaction which is not READ
ONLY which has not yet reached completion (COMMIT or ROLLBACK) the
correct ordering remains in doubt -- there is no way to know which
might need to commit first, or whether it even matters.  I am
skeptical about whether in logical replication (including MMR), it
is going to be possible to manage this by finding "safe snapshots".
The only alternative I can see, though, is to suspend replication
while correct transaction ordering remains in doubt.  A big READ
ONLY transaction would not cause a replication stall, but a big
READ WRITE transaction could cause an indefinite stall.  Simon
seemed to be saying that this is unacceptable, but I tend to think
it is a viable approach for some workloads, especially if the READ
ONLY transaction property is used when possible.

We already have huge replication stalls when big write xacts occur. We don't start sending any data for the xact to a peer until it commits, and once we start we don't send any other xact data until that xact is received (and probably applied) by the peer.

I'd like to address that by introducing xact streaming / interleaved xacts, where we stream big xacts on the wire as they occur and buffer them on the peer, possibly speculatively applying them too. This requires that individual row changes be tagged with subxact IDs and that subxact-to-top-level-xact mapping info be sent, so the peer can accumulate the right xacts into the right buffers. Basically offloading reorder buffering to the peer.
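
As a rough sketch of what the peer side of that could look like (Python, purely illustrative; `PeerXactBuffer` and its methods are invented for this example and are not an existing API, and it assumes the subxact mapping message arrives before the subxact's first change):

```python
class PeerXactBuffer:
    """Accumulates interleaved row changes per top-level xact on the peer."""

    def __init__(self):
        self.top_of = {}    # subxact id -> top-level xact id
        self.changes = {}   # top-level xact id -> buffered row changes

    def assign(self, subxid, top_xid):
        # Mapping info sent by the upstream: subxid belongs to top_xid.
        self.top_of[subxid] = top_xid

    def change(self, xid, row_change):
        # A change is tagged with the (sub)xact id that made it; an id with
        # no mapping is treated as a top-level xact.
        top_xid = self.top_of.get(xid, xid)
        self.changes.setdefault(top_xid, []).append(row_change)

    def _forget(self, top_xid):
        self.top_of = {s: t for s, t in self.top_of.items() if t != top_xid}
        return self.changes.pop(top_xid, [])

    def commit(self, top_xid):
        # Hand back everything buffered for this xact, in arrival order.
        return self._forget(top_xid)

    def abort(self, top_xid):
        # Discard the buffered changes.
        self._forget(top_xid)


buf = PeerXactBuffer()
buf.assign(11, 10)            # subxact 11 belongs to xact 10
buf.change(10, "ins a")
buf.change(20, "ins x")       # another xact interleaved on the wire
buf.change(11, "upd b")
print(buf.commit(20))         # small xact commits without waiting on xact 10
print(buf.commit(10))
```

Subxact rollback and speculative apply are left out; the point is only that a big xact no longer blocks others on the wire, and only its commit has to wait.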

That same mechanism would let replication continue while the logical serializable commit order is in doubt, blocking only the actual commit from proceeding, and only for those xacts. I think.

That said, I'm still clearly fuzzier about the details of what SSI does, what it guarantees and how it works than I thought I was, so I may just be handwaving pointlessly at this point. I'd better read some code...

There might be some wiggle room in terms of letting
non-SERIALIZABLE transactions commit while the ordering of
SERIALIZABLE transactions remains in doubt, but that would involve
allowing bigger deviations from commit order in transaction
application, which may confuse people.  The argument on the other
side is that if they use transaction isolation less strict than
SERIALIZABLE, they are vulnerable to seeing anomalies anyway,
so they must be OK with that.

Yeah. I'd be inclined to do just that, and with that argument.

 
Hopefully this is in some way helpful....
 
Very, thank you.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
