RE: Synchronizing slots from primary to standby

Поиск
Список
Период
Сортировка
От Zhijie Hou (Fujitsu)
Тема RE: Synchronizing slots from primary to standby
Дата
Msg-id OS0PR01MB57166281EA512BAE9A2F324F94732@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Synchronizing slots from primary to standby  (Peter Smith <smithpb2250@gmail.com>)
Список pgsql-hackers
On Tuesday, January 16, 2024 9:27 AM Peter Smith <smithpb2250@gmail.com> wrote:
> 
> Here are some review comments for patch v61-0002

Thanks for the comments.

> 
> ======
> doc/src/sgml/logical-replication.sgml
> 
> 1.
> +  <sect2 id="logical-replication-failover-examples">
> +   <title>Examples: logical replication failover</title>
> 
> The current documentation structure (after the patch is applied) looks
> like this:
> 
> 30.1. Publication
> 30.2. Subscription
>     30.2.1. Replication Slot Management
>     30.2.2. Examples: Set Up Logical Replication
>     30.2.3. Examples: Deferred Replication Slot Creation
>     30.2.4. Examples: logical replication failover
> 
> I don't think it is ideal.
> 
> Firstly, I think this new section is not just "Examples:"; it is more
> like instructions for steps to check if a successful failover is
> possible. IMO call it something like "Logical Replication Failover" or
> "Replication Slot Failover".
> 
> Secondly, I don't think this new section strictly belongs underneath
> the "Subscription" section anymore because IMO it is just as much
> about the promotion of the publications. Now that you are adding this
> new (2nd) section about slots, I think the whole structure of this
> document should be changed like below:
> 
> SUGGESTION #1 (make a new section 30.3 just for slot-related topics)
> 
> 30.1. Publication
> 30.2. Subscription
>     30.2.1. Examples: Set Up Logical Replication
> 30.3. Logical Replication Slots
>     30.3.1. Replication Slot Management
>     30.3.2. Examples: Deferred Replication Slot Creation
>     30.3.3. Logical Replication Failover
> 
> ~
> 
> SUGGESTION #2 (keep the existing structure, but give the failover its
> own new section 30.3)
> 
> 30.1. Publication
> 30.2. Subscription
>     30.2.1. Replication Slot Management
>     30.2.2. Examples: Set Up Logical Replication
>     30.2.3. Examples: Deferred Replication Slot Creation
> 30.3 Logical Replication Failover

I used this version for now as I am sure about changing other section.

> 
> ~
> 
> SUGGESTION #2a (and maybe later you can extract some of the failover
> examples further)
> 
> 30.1. Publication
> 30.2. Subscription
>     30.2.1. Replication Slot Management
>     30.2.2. Examples: Set Up Logical Replication
>     30.2.3. Examples: Deferred Replication Slot Creation
> 30.3 Logical Replication Failover
>     30.3.1. Examples: Checking if failover ready
> 
> ~~~
> 
> 2.
> +   <para>
> +    In a logical replication setup, if the publisher server is also the primary
> +    server of the streaming replication, the logical slots on the
> primary server
> +    can be synchronized to the standby server by specifying
> <literal>failover = true</literal>
> +    when creating the subscription. Enabling failover ensures a seamless
> +    transition of the subscription to the promoted standby, allowing it to
> +    subscribe to the new primary server without any data loss.
> +   </para>
> 
> I was initially confused by the wording. How about like below:
> 
> SUGGESTION
> When the publisher server is the primary server of a streaming
> replication, the logical slots on that primary server can be
> synchronized to the standby server by specifying <literal>failover =
> true</literal> when creating subscriptions for those publications.
> Enabling failover ensures a seamless transition of those subscriptions
> after the standby is promoted. They can continue subscribing to
> publications now on the new primary server without any data loss.

Changed as suggested.

> 
> ~~~
> 
> 3.
> +   <para>
> +    However, the replication slots are copied asynchronously, which
> means it's necessary
> +    to confirm that replication slots have been synced to the standby server
> +    before the failover happens. Additionally, to ensure a successful failover,
> +    the standby server must not lag behind the subscriber. To confirm
> +    that the standby server is ready for failover, follow these steps:
> +   </para>
> 
> Minor rewording
> 
> SUGGESTION
> Because the slot synchronization logic copies asynchronously, it is
> necessary to confirm that replication slots have been synced to the
> standby server before the failover happens. Furthermore, to ensure a
> successful failover, the standby server must not be lagging behind the
> subscriber. To confirm that the standby server is indeed ready for
> failover, follow these 2 steps:

Changed as suggested.

> 
> ~~~
> 
> 4.
> The instructions said "follow these steps", so the next parts should
> be rendered as 2 "steps" (using <procedure> markup?)
> 
> SUGGESTION (show as steps 1,2  and also some minor rewording of the
> step heading)
> 
> 1. Confirm that all the necessary logical replication slots have been
> synced to the standby server.
> 2. Confirm that the standby server is not lagging behind the subscribers.
> 

Changed as suggested.

> ~~~
> 
> 5.
> +   <para>
> +    Check if all the necessary logical replication slots have been synced to
> +    the standby server.
> +   </para>
> 
> SUGGESTION
> Confirm that all the necessary logical replication slots have been
> synced to the standby server.
> 

Changed as suggested.

> ~~~
> 
> 6.
> +     <listitem>
> +      <para>
> +       On logical subscriber, fetch the slot names that should be synced to
> the
> +       standby that we plan to promote.
> 
> SUGGESTION
> Firstly, on the subscriber node, use the following SQL to identify the
> slot names that should be...
> 

Changed as suggested.

> ~~~
> 
> 7.
> +<programlisting>
> +test_sub=# SELECT
> +               array_agg(slotname) AS slots
> +           FROM
> +           ((
> +               SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid ||
> '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname
> +               FROM pg_control_system() ctl, pg_subscription_rel r,
> pg_subscription s
> +               WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND
> s.subfailover
> +           ) UNION (
> +               SELECT oid AS subid, subslotname as slotname
> +               FROM pg_subscription
> +               WHERE subfailover
> +           ));
> 
> 7a
> Maybe this ought to include "pg_catalog" schemas?

After searching other query examples, I think most of them don’t add this for
either function or system table. So, I didn’t add this.

> 
> ~
> 
> 7b.
> For consistency, maybe it is better to use a table alias "FROM
> pg_subscription s" in the UNION also

Added.

> 
> ~~~
> 
> 8.
> +     <listitem>
> +      <para>
> +       Check that the logical replication slots exist on the standby server.
> 
> SUGGESTION
> Next, check that the logical replication slots identified above exist
> on the standby server.

Changed as suggested.

> 
> ~~~
> 
> 9.
> +<programlisting>
> +test_standby=# SELECT bool_and(synced AND NOT temporary AND
> conflict_reason IS NULL) AS failover_ready
> +               FROM pg_replication_slots
> +               WHERE slot_name in ('slots');
> + failover_ready
> +----------------
> + t
> 
> 9a.
> Maybe this ought to include "pg_catalog" schemas?

Same as above.

> 
> ~
> 
> 9b.
> IIUC that 'slots' reference is supposed to be those names that were
> found in the prior step. If so, then that point needs to be made
> clear, and anyway in this case 'slots' is not compatible with the
> 'sub' name returned by your first SQL.

Changed as suggested.

> 
> ~~~
> 
> 10.
> +     <listitem>
> +      <para>
> +       Query the last replayed WAL on the logical subscriber.
> 
> SUGGESTION
> Firstly, on the subscriber node check the last replayed WAL.
> 
Changed as suggested.

> ~~~
> 
> 11.
> +<programlisting>
> +test_sub=# SELECT
> +               MAX(remote_lsn) AS remote_lsn_on_subscriber
> +           FROM
> +           ((
> +               SELECT (CASE WHEN r.srsubstate = 'f' THEN
> pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' ||
> r.srrelid), false)
> +                           WHEN r.srsubstate = 's' THEN r.srsublsn
> END) as remote_lsn
> +               FROM pg_subscription_rel r, pg_subscription s
> +               WHERE r.srsubstate IN ('f', 's') AND s.oid = r.srsubid
> AND s.subfailover
> +           ) UNION (
> +               SELECT pg_replication_origin_progress(CONCAT('pg_' ||
> s.oid), false) AS remote_lsn
> +               FROM pg_subscription s
> +               WHERE subfailover
> +           ));
> 
> 11a.
> Maybe this ought to include "pg_catalog" schemas?

Same as above.

> 
> ~
> 
> 11b.
> /WHERE subfailover/WHERE s.subfailover/
> 
> ~~~
> 
> 12.
> +     <listitem>
> +      <para>
> +       On the standby server, check that the last-received WAL location
> +       is ahead of the replayed WAL location on the subscriber.
> 
> SUGGESTION
> Next, on the standby server check that the last-received WAL location
> is ahead of the replayed WAL location on the subscriber identified
> above.
> 

Changed as suggested.


> ~~~
> 
> 13.
> +</programlisting></para>
> +     </listitem>
> +     <listitem>
> +      <para>
> +       On the standby server, check that the last-received WAL location
> +       is ahead of the replayed WAL location on the subscriber.
> +<programlisting>
> +test_standby=# SELECT pg_last_wal_receive_lsn() >=
> 'remote_lsn_on_subscriber'::pg_lsn AS failover_ready;
> + failover_ready
> +----------------
> + t
> 
> IIUC the 'remote_lsn_on_subscriber' is supposed to represent the
> substitution of the value found in the subscriber server. In this
> example maybe it would be:
> SELECT pg_last_wal_receive_lsn() >= '0/3000388'::pg_lsn AS failover_ready;
> 
> maybe that point can be made more clearly.

I have changed it to use the actual LSN got in last step.

> 
> ~~~
> 
> 14.
> +   <para>
> +    If the result (failover_ready) of both above steps is true, it means it is
> +    okay to subscribe to the standby server.
> +   </para>
> 
> 14a.
> failover_ready should be rendered as literal.

Added.

> 
> ~
> 
> 14b.
> Does this say what you intended, or did you mean something more like
> "the standby can be promoted and existing subscriptions will be able
> to continue without data loss"

I used the later part of your suggestion as I think promotion
depends not only on logical replication part.

Best Regards,
Hou zj

В списке pgsql-hackers по дате отправления:

Предыдущее
От: shveta malik
Дата:
Сообщение: Re: Synchronizing slots from primary to standby
Следующее
От: shveta malik
Дата:
Сообщение: Re: Synchronizing slots from primary to standby