Re: Why is subscription/t/031_column_list.pl failing so much?

From vignesh C
Subject Re: Why is subscription/t/031_column_list.pl failing so much?
Date
Msg-id CALDaNm3JCsHaNTWdwSo2yAMyoi2AT9kmdwgBTzd20K4x8sdeOA@mail.gmail.com
In reply to Re: Why is subscription/t/031_column_list.pl failing so much?  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, 7 Feb 2024 at 15:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 7, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > More to the point, aren't these proposals just band-aids that
> > > would stabilize the test without fixing the actual problem?
> > > The same thing is likely to happen to people in the field,
> > > unless we do something drastic like removing ALTER SUBSCRIPTION.
> >
> > I've been able to make the 031_column_list.pl failure pretty
> > reproducible by adding a delay in walsender, as attached.
> >
> > While I'm not too familiar with this code, it definitely does appear
> > that the new walsender is told to start up at an LSN before the
> > creation of the publication, and then if it needs to decide whether
> > to stream a particular data change before it's reached that creation,
> > kaboom!
> >
> > I read and understood the upthread worries about it not being
> > a great idea to ignore publication lookup failures, but I really
> > don't see that we have much choice.  As an example, if a subscriber
> > is humming along reading publication pub1, and then someone
> > drops and then recreates pub1 on the publisher, I don't think that
> > the subscriber will be able to advance through that gap if there
> > are any operations within it that require deciding if they should
> > be streamed.
> >
>
> Right. One idea to address those worries was to have a new
> subscription option like ignore_nonexistant_pubs (or some better name
> for such an option). The 'true' value of this new option means that we
> will ignore the publication lookup failures and continue replication,
> the 'false' means give an error as we are doing now. If we agree that
> such an option is useful or at least saves us in some cases as
> discussed in another thread [1], we can keep the default value as true
> so that users don't face such errors by default and also have a way to
> go back to current behavior.
>
> >
> > (That is, contrary to Amit's expectation that
> > DROP/CREATE would mask the problem, I suspect it will instead turn
> > it into a hard failure.  I've not experimented though.)
> >
>
> This is not contrary because I was suggesting to DROP/CREATE
> Subscription whereas you are talking of drop and recreate of
> Publication.
>
> > BTW, this same change breaks two other subscription tests:
> > 015_stream.pl and 022_twophase_cascade.pl.
> > The symptoms are different (no "publication does not exist" errors),
> > so maybe these are just test problems not fundamental weaknesses.
> >
>
> As per the initial analysis, this is because those cases have somewhat
> larger transactions (more than 64kB) under test so it just times out
> waiting for all the data to be replicated. We will do further analysis
> and share the findings.

Yes, these tests fail while waiting for the larger transactions to
catch up: with the sleep added, the transactions need more than the
180-second timeout to replicate. To verify this I tried two things:
a) I increased the timeout to 1800 seconds and verified that both
tests run successfully. b) I reduced the sleep to 1000 microseconds
and verified that both tests run successfully.

So I feel that the failures of 015_stream.pl and
022_twophase_cascade.pl after adding the sleep can be ignored.
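
As a rough back-of-envelope check of the reasoning above, the sketch
below assumes the added sleep runs once per unit of walsender work and
dominates the replication time. The function names and the step count
of 200000 are hypothetical, chosen only for illustration; the original
sleep value from the patch is not stated in this message.

```python
def max_delayed_steps(timeout_s: int, sleep_us: int) -> int:
    """How many sleeps of sleep_us microseconds fit within timeout_s seconds."""
    return (timeout_s * 1_000_000) // sleep_us


def fits_in_timeout(steps: int, sleep_us: int, timeout_s: int) -> bool:
    """True if `steps` sleeps of sleep_us microseconds finish before the timeout."""
    return steps * sleep_us < timeout_s * 1_000_000


# Hypothetical workload: 200000 delayed steps at 1000 microseconds each
# take about 200 seconds, which exceeds a 180-second timeout but fits
# comfortably within 1800 seconds.
STEPS = 200_000  # assumption, not a measured value
print(max_delayed_steps(180, 1000))
print(fits_in_timeout(STEPS, 1000, 180))
print(fits_in_timeout(STEPS, 1000, 1800))
```

Under this simple model, either raising the timeout or shrinking the
sleep moves the same workload back under the limit, which matches the
two experiments described above.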

Regards,
Vignesh


