Re: Potential data loss due to race condition during logical replication slot creation
От | Callahan, Drew |
---|---|
Тема | Re: Potential data loss due to race condition during logical replication slot creation |
Дата | |
Msg-id | 2444AA15-D21B-4CCE-8052-52C7C2DAFE5C@amazon.com обсуждение исходный текст |
Ответ на | Potential data loss due to race condition during logical replication slot creation ("Callahan, Drew" <callaan@amazon.com>) |
Ответы |
Re: Potential data loss due to race condition during logical replication slot creation
Re: Potential data loss due to race condition during logical replication slot creation |
Список | pgsql-bugs |
> I think we can prepare an isolation test of the above steps and > include it in contrib/test_decoding/specs. Thanks for the suggestion. Included an isolation test for the exact repro and a couple other tests for fully before and fully after. These two tests don't add a whole lot of value since this is already heavily tested elsewhere, so I'm good with removing. However, thought it made sense to include from a perspective of totality given the test name. > Another simple way to fix this is that we always set > need_full_snapshot to true when slot creation. But it would then make > the snapshot builder include non-catalog-change transactions too while > finding the initial startpoint. Which is not necessary. Since we're not accumulating changes anyway and we never distribute snapshots while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer. This seems pretty contained and wouldn't require the use of a global variable and memory context callbacks which is a bit nicer from a readability standpoint. It's a little inefficient with memory and CPU, especially in the presence of long running transactions, but likely would not be noticeable to most users. > As for the proposed patch, it uses a global variable, InCreate, and > memory context callback to reset it. While I believe this is for not > breaking any ABI as this bug exists on back branches too, I think it's > better to consider how to fix it in HEAD and then how to backpatch it. > We don't necessarily need to have the same fixes for all branches all > the time. Attached a patch that is ABI breaking, but removes the need for a global variable. This relies on pushing the context of whether we are in a create mode to the logical decoding context, which is cleaner, but not as clean as just pushing all the way down to the snapbuilder. I would push down to the snap builder, but the entire object is persisted to disk. This is problematic to me since the create flag would be persisted causing exisiting slots to restore a snapbuild that could have the create flag set to true. This is not problematic as is, since the variable would only be leveraged in SnapBuildRestore() which is only called while inconsistent, but seems like a future bug waiting to happen. We could always clear the flag when persisting, but that seems a bit hacky. We could also fork the snap builder into on-disk and in-memory structs, which would be convenient since we would not need to invalidate exisiting snapshots. However, not sure how people feel about the fork. > IIUC we need to skip snapshot restore only when we're finding the > initial start point. So probably we don't need to set InCreate when > calling create_logical_replication_slot() with find_startpoint = > false. Yep, that's correct. We can safely skip setting InCreate. We could move the setter function slotfuncs.c as one way of doing this. I lean towards setting it no matter what since it is harmless if we never try to find a decoding point and removes the need for future callers of CreateInitDecodingContext() from having to determine if it's safe to be false. Though I agree that InCreate may a bigger hammer than needed which may need to be modified in the future. Thanks for the quick response, Drew Callahan
Вложения
В списке pgsql-bugs по дате отправления: