Thread: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
From:
PG Bug reporting form
Date:
The script may have been broken by formatting in the email. I've attached it to this reply.
From: PG Bug reporting form <noreply@postgresql.org>
Sent: Thursday, April 17, 2025 9:14:19 AM
To: pgsql-bugs@lists.postgresql.org <pgsql-bugs@lists.postgresql.org>
Cc: duffieldzane@gmail.com <duffieldzane@gmail.com>
Subject: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
The following bug has been logged on the website:
Bug reference: 18897
Logged by: Zane Duffield
Email address: duffieldzane@gmail.com
PostgreSQL version: 17.4
Operating system: RHEL9 Linux
Description:
Hi all,
I'm in the process of converting our databases from pglogical logical
replication to the native logical replication implementation on PostgreSQL
17. One of the bugs we encountered and had to work around with pglogical was
the plugin dropping records while converting a streaming replica to
logical via pglogical_create_subscriber (reported
https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
confirm that the native logical replication implementation did not have this
problem, and I've found that it might have a different problem.
In rare cases, I'm finding that the logical replica can start with the
logical decoder stuck on a conflicting primary key even though the replica
received no connections, and only decoded changes from the one primary.
I should say that I've been operating under the assumption that
pg_createsubscriber is designed for use on a replica for a *live* primary
database; if this isn't correct then someone please let me know.
I have a script that I've been using to reproduce the issue (pasted at end
of email, because this bug reporting page doesn't seem to support
attachments). It basically performs a loop that sets up a primary and a
physical replica, generates some load, converts the replica to logical,
waits, and makes sure the row counts are the same.
A few variables at the top of the script may need adjusting, depending on
the system's PostgreSQL installation and the desired locations.
The environment variables NUM_THREADS and INSERT_WIDTH can be used to
control the characteristics of the load generation for the primary.
The script isn't terribly good at reproducing the issue; it takes anywhere
from minutes to hours in my testing.
The two machines I've reproduced the issue on are both running PostgreSQL 17.4
on RHEL9 in a VM; one VM was moderately powerful (12 cores, 256 GB RAM), and
the other was quite weak (2 cores, 2 GB RAM).
I found that many combinations of NUM_THREADS and INSERT_WIDTH were able to
reproduce the issue, but NUM_THREADS=40 and INSERT_WIDTH=1000 is a good
starting point for a powerful machine.
Thanks,
Zane
SCRIPT
```
#!/bin/bash
set -ue
set -o pipefail
cd "${BASH_SOURCE[0]%/*}" || exit
POSTGRES_VERSION=17
# Adjust these for the system under test
PATH="/usr/pgsql-$POSTGRES_VERSION/bin:$PATH"
PRIMARY_DIR="/var/lib/pgsql/$POSTGRES_VERSION/primary"
REPLICA_DIR="/var/lib/pgsql/$POSTGRES_VERSION/replica"
PRIMARY_PORT=5340
REPLICA_PORT=5341
DBNAME="test_db"
TABLE="test_table"
TRIGGER_FILE="./stop_inserting"
CYAN="\e[1;36m"
RESET="\e[0m"
TEST_NUM=0
reset_databases() (
test_start_time=$(date +"%Y-%m-%d_%H-%M-%S")
if pg_ctl status -D "$REPLICA_DIR" > /dev/null 2>&1 && ! pg_ctl stop -m
immediate -D "$REPLICA_DIR"; then
echo "ERROR: Could not stop replica."
return 1
fi
if ! rm -rf "$REPLICA_DIR"; then
echo "ERROR: Could not delete replica directory \"$REPLICA_DIR\"."
return 1
fi
if pg_ctl status -D "$PRIMARY_DIR" > /dev/null 2>&1 && ! pg_ctl stop -m
immediate -D "$PRIMARY_DIR"; then
echo "ERROR: Could not stop primary."
return 1
fi
if ! rm -rf "$PRIMARY_DIR"; then
echo "ERROR: Could not delete primary directory \"$PRIMARY_DIR\"."
return 1
fi
if ! initdb -D "$PRIMARY_DIR" --locale=C; then
echo "ERROR: Could not create primary."
return 1
fi
cat >> "$PRIMARY_DIR"/postgresql.auto.conf << EOF
port = $PRIMARY_PORT
wal_level = logical
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
max_connections = 1000
log_directory = '$PWD/log'
log_filename = 'postgresql-$test_start_time-test-$TEST_NUM.log'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_replication_commands = on
log_duration = off
log_min_duration_statement = 0
log_statement = all
log_replication_commands = on
log_line_prefix = '%m [%p] port=$PRIMARY_PORT %q%u@%d/%a '
EOF
cat >> "$PRIMARY_DIR"/pg_hba.conf << EOF
host replication postgres 127.0.0.1/32 trust
host test_db postgres 127.0.0.1/32 trust
host test_db postgres 127.0.0.1/32 trust
EOF
if ! pg_ctl start -D "$PRIMARY_DIR"; then
echo "ERROR: Could not start primary."
return 1
fi
if ! pg_basebackup -h localhost -U postgres -p "$PRIMARY_PORT" -D "$REPLICA_DIR" -P -Xs -R; then
echo "ERROR: Could not create replica."
return 1
fi
cat >> "$REPLICA_DIR"/postgresql.auto.conf << EOF
port = $REPLICA_PORT
archive_mode = off
hot_standby = on
log_directory = '$PWD/log'
log_filename = 'postgresql-$test_start_time-test-$TEST_NUM.log'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_replication_commands = on
log_duration = off
log_min_duration_statement = 0
log_statement = all
log_replication_commands = on
log_line_prefix = '%m [%p] port=$REPLICA_PORT %q%u@%d/%a '
EOF
if ! pg_ctl start -D "$REPLICA_DIR"; then
echo "ERROR: Could not start replica."
return 1
fi
)
create_test_database() (
psql -p "$PRIMARY_PORT" -c "CREATE DATABASE $DBNAME"
psql -p "$PRIMARY_PORT" -d "$DBNAME" -c "CREATE TABLE $TABLE (f1 int
primary key, insert_time timestamp without time zone)"
)
# Adjust these variables to tweak the load characteristics
NUM_THREADS=${NUM_THREADS:-20}
INSERT_WIDTH=${INSERT_WIDTH:-100}
add_records_to_test_table() (
id=$1
start_time=$(date +%s)
start=$((id * INSERT_WIDTH)) end=0
while true; do
end=$((start + INSERT_WIDTH - 1))
echo "INSERT INTO $TABLE SELECT val, CURRENT_TIMESTAMP FROM
generate_series($start, $end) S(val);" |
psql -p "$PRIMARY_PORT" -d "$DBNAME" > /dev/null
start=$((start + NUM_THREADS * INSERT_WIDTH))
if [[ -f "$TRIGGER_FILE" ]] || (( $(date "+%s") - start_time > 15
)); then
break
fi
done
return 0
)
INSERT_PIDS=()
start_insert_threads() {
echo "*** STARTING $NUM_THREADS LOOPS INSERTING $INSERT_WIDTH RECORDS
PER ITERATION ..."
INSERT_PIDS=()
for id in $(seq 0 $((NUM_THREADS - 1))); do
add_records_to_test_table "$id" &
INSERT_PIDS+=($!)
done
}
create_subscriber() (
echo "*** Stopping replica, then running pg_createsubscriber..."
pg_ctl stop -m fast -D "$REPLICA_DIR"
pg_createsubscriber -D "$REPLICA_DIR" \
--database="$DBNAME" \
--subscription="sub" \
--publication="pub" \
--publisher-server="host=localhost port=$PRIMARY_PORT
dbname=$DBNAME"
pg_ctl start -D "$REPLICA_DIR"
echo "*** Successfully started logical replica on port $REPLICA_PORT."
)
run_test() (
create_test_database
rm -f "$TRIGGER_FILE"
start_insert_threads
sleep 2
create_subscriber
sleep 0.1
touch "$TRIGGER_FILE"
errors=0
for pid in "${INSERT_PIDS[@]}"; do
if ! wait "$pid"; then
echo "ERROR: insert thread with PID $pid failed"
errors=1
fi
done
(( errors )) && return 1
echo "*** ALL INSERT LOOPS FINISHED"
last_src_count=0
last_dest_count=0
# wait until the counts stop increasing
while true; do
src_count=$(psql -qt -p "$PRIMARY_PORT" -d "$DBNAME" -c "SELECT COUNT(f1) FROM $TABLE")
dest_count=$(psql -qt -p "$REPLICA_PORT" -d "$DBNAME" -c "SELECT COUNT(f1) FROM $TABLE")
if [[ $dest_count -ne $last_dest_count ]] || [[ $src_count -ne $last_src_count ]]; then
last_dest_count=$dest_count
last_src_count=$src_count
sleep 0.5s
continue;
fi
echo "SOURCE COUNT = $src_count"
echo "DEST COUNT = $dest_count"
if (( src_count != dest_count )); then
echo "ERROR: record count mismatch"
return 1
fi
break
done
echo -e "*** TEST PASSED\n"
)
for TEST_NUM in {1..10000}; do
echo -e "${CYAN}*** STARTING TEST #$TEST_NUM${RESET}"
reset_databases && run_test
done
```
Attachments
On Thu, Apr 17, 2025 at 4:58 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>
> I should say that I've been operating under the assumption that
> pg_createsubscriber is designed for use on a replica for a *live* primary
> database; if this isn't correct then someone please let me know.
>
As per my understanding, your assumption is correct; otherwise, we would have given an ERROR at the start saying that we don't allow creating a subscriber for a live primary.

--
With Regards,
Amit Kapila.
On Thu, 17 Apr 2025 at 04:58, PG Bug reporting form <noreply@postgresql.org> wrote:
>
> In rare cases, I'm finding that the logical replica can start with the
> logical decoder stuck on a conflicting primary key even though the replica
> received no connections, and only decoded changes from the one primary.
>
I am able to reproduce the issue with the script provided. I am currently analysing the issue.

Thanks and Regards,
Shlok Kyal
Re: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
From:
"Euler Taveira"
Date:
On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
> I'm in the process of converting our databases from pglogical logical
> replication to the native logical replication implementation on PostgreSQL
> 17. One of the bugs we encountered and had to work around with pglogical was
> the plugin dropping records while converting a streaming replica to
> logical via pglogical_create_subscriber (reported
> https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to
> confirm that the native logical replication implementation did not have this
> problem, and I've found that it might have a different problem.
pg_createsubscriber uses a different approach than pglogical. While pglogical
uses a restore point, pg_createsubscriber uses the LSN from the latest
replication slot as a replication start point. The restore point approach is
usually suitable for physical replication but might not cover all scenarios for
logical replication (such as when there are in-progress transactions). Since
creating a logical replication slot finds a consistent decoding start point, it
is a natural choice for starting logical replication (which also needs
to find a decoding start point).
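[Editor's note: to make the "consistent decoding start point" concrete, the sketch below shows that creating a logical slot returns the LSN at which decoding reached a consistent snapshot, which is the kind of LSN Euler describes pg_createsubscriber using as its start point. The port and database follow the reproduction script; the slot name 'demo_slot' is purely illustrative.]
```
# Creating a logical slot reports the consistent decoding start point as "lsn".
psql -p 5340 -d test_db -c \
    "SELECT slot_name, lsn FROM pg_create_logical_replication_slot('demo_slot', 'pgoutput');"

# Drop the demonstration slot so it does not hold back WAL on the primary.
psql -p 5340 -d test_db -c "SELECT pg_drop_replication_slot('demo_slot');"
```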
> I should say that I've been operating under the assumption that
> pg_createsubscriber is designed for use on a replica for a *live* primary
> database; if this isn't correct then someone please let me know.
pg_createsubscriber expects a physical replica that is preferably stopped
before running it.
> I have a script that I've been using to reproduce the issue (pasted at end
> of email, because this bug reporting page doesn't seem to support
> attachments). It basically performs a loop that sets up a primary and a
> physical replica, generates some load, converts the replica to logical,
> waits, and makes sure the row counts are the same.
If I run your test script, it reports
$ NUM_THREADS=40 INSERT_WIDTH=1000 /tmp/logical_stress_test.sh
.
.
*** Successfully started logical replica on port 5341.
*** ALL INSERT LOOPS FINISHED
SOURCE COUNT = 916000
DEST COUNT = 768000
ERROR: record count mismatch
but after some time
$ psql -X -p 5340 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
$ psql -X -p 5341 -c "SELECT count(f1) FROM test_table" -d test_db
count
--------
916000
(1 row)
I also checked the data
$ pg_dump -t test_table -p 5340 -d test_db -f - | sort > /tmp/p.out
$ pg_dump -t test_table -p 5341 -d test_db -f - | sort > /tmp/r.out
$ diff -q /tmp/p.out /tmp/r.out
$ echo $?
0
Your script is not waiting long enough for it to apply the backlog. Unless
you are seeing a different symptom, there is no bug.
You should have used something similar to the wait_for_subscription_sync routine
(Cluster.pm) before counting the rows. That's what is used in the
pg_createsubscriber tests. It guarantees the subscriber has caught up.
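[Editor's note: outside the TAP framework, a rough shell equivalent of that catch-up wait could look like the sketch below. It reuses the reproduction script's variables and the subscription name "sub" that the script passes to pg_createsubscriber; pg_stat_subscription's latest_end_lsn is polled against the primary's current WAL position.]
```
# Capture the primary's WAL position, then poll the subscriber until its apply
# worker reports it has replayed past that point (replaces the row-count loop).
target_lsn=$(psql -qtA -p "$PRIMARY_PORT" -d "$DBNAME" -c "SELECT pg_current_wal_lsn()")
while true; do
    caught_up=$(psql -qtA -p "$REPLICA_PORT" -d "$DBNAME" -c \
        "SELECT latest_end_lsn >= '$target_lsn' FROM pg_stat_subscription WHERE subname = 'sub'")
    [[ "$caught_up" == "t" ]] && break
    sleep 0.5
done
```
Note that in the failure mode reported in this thread the apply worker keeps exiting with a duplicate-key error, so a wait like this would never complete; the error in the subscriber's log is the real signal.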
Hi Euler, thanks for your reply.
On Wed, Apr 23, 2025 at 11:58 AM Euler Taveira <euler@eulerto.com> wrote:
> On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote:
> > I should say that I've been operating under the assumption that
> > pg_createsubscriber is designed for use on a replica for a *live* primary
> > database; if this isn't correct then someone please let me know.
>
> pg_createsubscriber expects a physical replica that is preferably stopped
> before running it.
I think pg_createsubscriber actually gives you an error if the replica is not stopped. I was talking about the primary.
> Your script is not waiting long enough for it to apply the backlog. Unless
> you are seeing a different symptom, there is no bug.
>
> You should have used something similar to the wait_for_subscription_sync routine
> (Cluster.pm) before counting the rows. That's what is used in the
> pg_createsubscriber tests. It guarantees the subscriber has caught up.
It may be true that the script doesn't wait long enough for all systems, but when I reproduced the issue on my machine(s) I confirmed that the logical decoder process was properly stuck on a conflicting primary key, rather than just catching up.
From the log file
2025-04-16 09:17:16.090 AEST [3845786] port=5341 ERROR: duplicate key value violates unique constraint "test_table_pkey"
2025-04-16 09:17:16.090 AEST [3845786] port=5341 DETAIL: Key (f1)=(20700) already exists.
2025-04-16 09:17:16.090 AEST [3845786] port=5341 CONTEXT: processing remote data for replication origin "pg_24576" during message type "INSERT" for replication target relation "public.test_table" in transaction 1581, finished at 0/3720058
2025-04-16 09:17:16.091 AEST [3816845] port=5341 LOG: background worker "logical replication apply worker" (PID 3845786) exited with exit code 1
wait_for_subscription_sync sounds like a better solution than what I have, but you might still be able to reproduce the problem if you increase the sleep interval on line 198.
I wonder if Shlok could confirm whether they found the conflicting primary key in their reproduction?
Thanks,
Zane
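[Editor's note: as an aside, when the apply worker is wedged on a duplicate key as in the log above, one way to get the subscriber moving again (a workaround only, not a fix for the pg_createsubscriber behaviour discussed later in the thread, and it discards the conflicting remote transaction) is to skip the offending transaction using the finish LSN reported in the CONTEXT line. This assumes PostgreSQL 15 or later and superuser access on the subscriber.]
```
# Skip the remote transaction that finished at the LSN from the CONTEXT line
# ("finished at 0/3720058" above); the apply worker then resumes with the next one.
psql -p 5341 -d test_db -c "ALTER SUBSCRIPTION sub SKIP (lsn = '0/3720058');"
```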
I meant to say the logical apply worker was stuck, not the decoder process.
On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote:
>
> pg_createsubscriber uses a different approach than pglogical. While pglogical
> uses a restore point, pg_createsubscriber uses the LSN from the latest
> replication slot as a replication start point.
>
I observed a difference between logical replication and pg_createsubscriber in how the replication origin is set, which can lead to different behaviors during replication restarts.

Consider the following WAL records:
rmgr: Transaction len (rec/tot): 46/46, tx: 767, lsn: 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 IST
rmgr: Standby len (rec/tot): 54/54, tx: 0, lsn: 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755

Behavior in logical replication:
In logical replication, after applying the COMMIT record at LSN 0/01756868, the replication origin is set to the end LSN of the commit record, i.e., 0/01756898. On restart, replication resumes from the next LSN after the COMMIT record, which is 0/01756898 in this case. As a result, the same transaction is not sent again, and duplicate data is avoided.

Behavior in pg_createsubscriber:
However, in the case of pg_createsubscriber, the consistent point used for creating the replication slot on the publisher may be set exactly at the commit LSN (0/01756868, xid 767). When promoting the standby, this same LSN is used as recovery_target_lsn, so the standby recovers up to and including the commit of transaction 767. After promotion, if the replication origin is also set to this same commit LSN, the subscriber will request changes starting from that point. Since the origin doesn't reflect the commit as applied, transaction 767 gets replicated again, leading to duplicate data and possible replication failure.

If the issue is not reproducible using Zane's suggested steps, we can try the following (a sketch of steps 5 and 6 appears after this message):
1) Stop the standby server.
2) Perform an insert transaction and note the commit LSN using pg_waldump.
3) Set up the publisher, replication slot, etc., simulating the pg_createsubscriber behavior.
4) Restart the standby with the promotion configuration from setup_recovery(), setting recovery_target_lsn to the commit LSN from step 2.
5) Create the subscription and set the replication origin to the same commit LSN.
6) Enable the subscription.

This setup should reproduce the issue where the transaction gets applied twice on the subscriber due to the replication origin being aligned with the commit LSN rather than its end.

Thoughts?

Regards,
Vignesh
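[Editor's note: a rough sketch of steps 5 and 6 above, not part of Vignesh's message. It assumes the logical slot "sub" already exists on the publisher (step 3), that $COMMIT_LSN holds the commit LSN noted in step 2, and that the subscription's replication origin is named pg_<subscription oid>, as seen in the server logs quoted later in this thread.]
```
# Create a disabled subscription that reuses the existing slot, move its
# replication origin to the commit LSN, then enable it.
psql -p "$REPLICA_PORT" -d "$DBNAME" <<SQL
CREATE SUBSCRIPTION sub
    CONNECTION 'host=localhost port=$PRIMARY_PORT dbname=$DBNAME'
    PUBLICATION pub
    WITH (create_slot = false, copy_data = false, enabled = false, slot_name = 'sub');
SELECT pg_replication_origin_advance('pg_' || oid, '$COMMIT_LSN')
    FROM pg_subscription WHERE subname = 'sub';
ALTER SUBSCRIPTION sub ENABLE;
SQL
```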
On Wed, 23 Apr 2025 at 08:43, Zane Duffield <duffieldzane@gmail.com> wrote:
>
> I wonder if Shlok could confirm whether they found the conflicting primary key in their reproduction?
>
I have analysed the issue and found that it appears when 'consistent_lsn' is set to a COMMIT lsn during the pg_createsubscriber run.

pg_waldump on the primary node:
rmgr: Standby len (rec/tot): 102/102, tx: 0, lsn: 0/06AEB2B0, prev 0/06AEB270, desc: RUNNING_XACTS nextXid 1128 latestCompletedXid 1114 oldestRunningXid 1115; 13 xacts: 1121 1118 1115 1116 1117 1119 1120 1124 1125 1122 1126 1123 1127
rmgr: Transaction len (rec/tot): 46/46, tx: 1116, lsn: 0/06AEB318, prev 0/06AEB2B0, desc: COMMIT 2025-04-17 19:22:22.789463 IST

Logs for the replica node:
2025-04-17 19:22:34.759 IST [2410569] port=5341 LOG: logical replication apply worker for subscription "sub" has started
2025-04-17 19:22:34.831 IST [2410569] port=5341 ERROR: duplicate key value violates unique constraint "test_table_pkey"
2025-04-17 19:22:34.831 IST [2410569] port=5341 DETAIL: Key (f1)=(402000) already exists.
2025-04-17 19:22:34.831 IST [2410569] port=5341 CONTEXT: processing remote data for replication origin "pg_24576" during message type "INSERT" for replication target relation "public.test_table" in transaction 1116, finished at 0/6AEB318
2025-04-17 19:22:34.835 IST [2410562] port=5341 LOG: background worker "logical replication apply worker" (PID 2410569) exited with exit code 1

Logs for the primary node:
2025-04-17 19:22:34.788 IST [2410570] port=5340 ubuntu@test_db/sub DETAIL: Waiting for transactions (approximately 8) older than 1115 to end.
2025-04-17 19:22:34.788 IST [2410570] port=5340 ubuntu@test_db/sub STATEMENT: START_REPLICATION SLOT "sub" LOGICAL 0/6AEB318 (proto_version '4', origin 'any', publication_names '"pub"')
2025-04-17 19:22:34.825 IST [2410570] port=5340 ubuntu@test_db/sub LOG: logical decoding found consistent point at 0/6AEB2B0
2025-04-17 19:22:34.825 IST [2410570] port=5340 ubuntu@test_db/sub DETAIL: There are no old transactions anymore.
2025-04-17 19:22:34.825 IST [2410570] port=5340 ubuntu@test_db/sub STATEMENT: START_REPLICATION SLOT "sub" LOGICAL 0/6AEB318 (proto_version '4', origin 'any', publication_names '"pub"')
2025-04-17 19:22:34.832 IST [2410570] port=5340 ubuntu@test_db/sub LOG: could not send data to client: Connection reset by peer
2025-04-17 19:22:34.832 IST [2410570] port=5340 ubuntu@test_db/sub CONTEXT: slot "sub", output plugin "pgoutput", in the change callback, associated LSN 0/695BAA8
2025-04-17 19:22:34.832 IST [2410570] port=5340 ubuntu@test_db/sub STATEMENT: START_REPLICATION SLOT "sub" LOGICAL 0/6AEB318 (proto_version '4', origin 'any', publication_names '"pub"')
2025-04-17 19:22:34.832 IST [2410570] port=5340 ubuntu@test_db/sub LOG: released logical replication slot "sub"

Logs for the pg_createsubscriber run:
2025-04-17 19:22:34.361 IST [2410552] port=5341 ubuntu@test_db/pg_createsubscriber LOG: statement: SELECT pg_catalog.pg_replication_origin_advance('pg_24576', '0/6AEB318')
2025-04-17 19:22:34.361 IST [2410552] port=5341 ubuntu@test_db/pg_createsubscriber LOG: duration: 0.640 ms
2025-04-17 19:22:34.362 IST [2410552] port=5341 ubuntu@test_db/pg_createsubscriber LOG: statement: ALTER SUBSCRIPTION "sub" ENABLE

Here '0/06AEB318' is a COMMIT lsn. During the pg_createsubscriber run, the standby's 'recovery_target_lsn' is set to '0/06AEB318', so after the restart the standby already has the records for the transaction committed at '0/06AEB318'. After the subscription is created, the replication origin is advanced to lsn '0/06AEB318'. When the subscription is then enabled, the apply worker tries to apply this transaction again and thus hits a primary key conflict.

In the case of normal logical replication, the replication origin is advanced to the record following the COMMIT lsn (referred to in the code as end_lsn). The occurrence of this failure is rare because it happens only when the 'consistent_lsn' is set to an lsn which refers to a COMMIT record.

Thanks and Regards,
Shlok Kyal
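[Editor's note: one way to check, after the fact, whether the consistent point landed exactly on a COMMIT record is to dump the primary's WAL around that LSN. The LSN range below simply reuses the values from Shlok's dump and is illustrative only; $PRIMARY_DIR comes from the reproduction script.]
```
# Show the records around the consistent_lsn chosen by pg_createsubscriber; if the
# record at that LSN is a COMMIT, the origin-advance problem described above applies.
pg_waldump -p "$PRIMARY_DIR/pg_wal" --start=0/06AEB2B0 --end=0/06AEB360 \
    | grep -E 'COMMIT|RUNNING_XACTS'
```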
On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote:
>
> This setup should reproduce the issue where the transaction gets
> applied twice on the subscriber due to the replication origin being
> aligned with the commit LSN rather than its end.
>
> Thoughts?
>
Hi Vignesh,

I have verified the behaviour shared by you for logical replication and pg_createsubscriber and agree with your analysis. I have also tried the steps shared by you and am able to reproduce the issue.

I am thinking of resolving it by introducing a new API which can give us the next lsn to the one provided. During the pg_createsubscriber run, where we are advancing the replication origin to 'consistent_lsn', I think we should advance it to the next lsn of 'consistent_lsn' instead. I think this will resolve the issue. Thoughts?

Thanks and Regards,
Shlok Kyal
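[Editor's note: to make the "next lsn" idea concrete using the WAL records Vignesh quoted: the COMMIT at 0/01756868 has total length 46, padded to 48 bytes by record alignment, so its end LSN is 0/01756898, which is where the following RUNNING_XACTS record starts and where normal logical replication leaves the origin. pg_lsn arithmetic (available in PostgreSQL 14 and later) shows the same thing; the port and database again follow the reproduction script.]
```
# 0/01756868 plus the aligned record length (48) gives 0/01756898, matching the
# next record's LSN in the dump above -- the position the origin should end up at.
psql -X -p 5340 -d test_db -c "SELECT '0/01756868'::pg_lsn + 48 AS commit_end_lsn;"
```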
On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > > > I'm in the process of converting our databases from pglogical logical > > > replication to the native logical replication implementation on PostgreSQL > > > 17. One of the bugs we encountered and had to work around with pglogical was > > > the plugin dropping records while converting to a streaming replica to > > > logical via pglogical_create_subscriber (reported > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > > confirm that the native logical replication implementation did not have this > > > problem, and I've found that it might have a different problem. > > > > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > > replication slot as a replication start point. The restore point approach is > > > usually suitable to physical replication but might not cover all scenarios for > > > logical replication (such as when there are in progress transactions). Since > > > creating a logical replication slot does find a consistent decoding start > > > point, it is a natural choice to start the logical replication (that also needs > > > to find a decoding start point). > > > > I observed a difference between logical replication and > > pg_createsubscriber in how the replication origin is set, which can > > lead to different behaviors during replication restarts. > > > > Consider the following WAL records: > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > > IST > > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > > > Behavior in Logical Replication > > In logical replication, after applying the COMMIT record at LSN > > 0/01756868, the replication origin is set to the end LSN of the commit > > record, i.e., 0/01756898. On restart, replication resumes from the > > next LSN after COMMIT records, which is 0/01756898 in this case. > > As a result, the same transaction is not sent again, and duplicate > > data is avoided. > > > > Behavior in pg_createsubscriber > > However, in the case of pg_createsubscriber, the consistent point used > > for creating the replication slot on the publisher may be set exactly > > at the commit LSN (0/01756868, xid 767). When promoting the standby, > > this same LSN is used as recovery_target_lsn, so the standby recovers > > up to and including the commit of transaction 767. > > > > After promotion, if the replication origin is also set to this same > > commit LSN, the subscriber will request changes starting from that > > point. Since the origin doesn't reflect the commit as applied, > > transaction 767 gets replicated again, leading to duplicate data and > > possible replication failure. > > > > If the issue is not reproducible using Zane's suggested steps, we can > > try the following: > > 1) Stop the standby server. > > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > > 3) Set up the publisher, replication slot, etc., simulating the > > pg_createsubscriber behavior. 
> > 4) Restart the standby with promotion configuration from > > setup_recovery(), setting recovery_target_lsn to the commit LSN from > > step 2. > > 5) Create the subscription and set the replication origin to the same > > commit LSN. > > 6) Enable the subscription. > > > > This setup should reproduce the issue where the transaction gets > > applied twice on the subscriber due to the replication origin being > > aligned with the commit LSN rather than its end. > > > > Thoughts? > > Hi Vignesh, > > I have verified the behaviour shared by you for Logical Replication > and pg_createsubscriber and agree with your analysis. > I have also tried the steps shared by you and am able to reproduce the issue. > > I am thinking of resolving it by introducing a new API, which can give > us the next lsn to the lsn provided. > During pg_createsusbcriber run where we are advancing the replication > origin to 'consistent_lsn'. I think we should advance it to the next > lsn of 'consistent_lsn' instead. > I think this will resolve the issue. Thoughts? > I have created a patch to fix the issue reported by Zane. I ran the test script with the patch and was not able to reproduce the issue. The script ran for around ~6hrs. The patch applies on HEAD. Thanks and Regards, Shlok Kyal
Attachments
On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > > > I'm in the process of converting our databases from pglogical logical > > > replication to the native logical replication implementation on PostgreSQL > > > 17. One of the bugs we encountered and had to work around with pglogical was > > > the plugin dropping records while converting to a streaming replica to > > > logical via pglogical_create_subscriber (reported > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > > confirm that the native logical replication implementation did not have this > > > problem, and I've found that it might have a different problem. > > > > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > > replication slot as a replication start point. The restore point approach is > > > usually suitable to physical replication but might not cover all scenarios for > > > logical replication (such as when there are in progress transactions). Since > > > creating a logical replication slot does find a consistent decoding start > > > point, it is a natural choice to start the logical replication (that also needs > > > to find a decoding start point). > > > > I observed a difference between logical replication and > > pg_createsubscriber in how the replication origin is set, which can > > lead to different behaviors during replication restarts. > > > > Consider the following WAL records: > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > > IST > > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > > > Behavior in Logical Replication > > In logical replication, after applying the COMMIT record at LSN > > 0/01756868, the replication origin is set to the end LSN of the commit > > record, i.e., 0/01756898. On restart, replication resumes from the > > next LSN after COMMIT records, which is 0/01756898 in this case. > > As a result, the same transaction is not sent again, and duplicate > > data is avoided. > > > > Behavior in pg_createsubscriber > > However, in the case of pg_createsubscriber, the consistent point used > > for creating the replication slot on the publisher may be set exactly > > at the commit LSN (0/01756868, xid 767). When promoting the standby, > > this same LSN is used as recovery_target_lsn, so the standby recovers > > up to and including the commit of transaction 767. > > > > After promotion, if the replication origin is also set to this same > > commit LSN, the subscriber will request changes starting from that > > point. Since the origin doesn't reflect the commit as applied, > > transaction 767 gets replicated again, leading to duplicate data and > > possible replication failure. > > > > If the issue is not reproducible using Zane's suggested steps, we can > > try the following: > > 1) Stop the standby server. > > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > > 3) Set up the publisher, replication slot, etc., simulating the > > pg_createsubscriber behavior. 
> > 4) Restart the standby with promotion configuration from > > setup_recovery(), setting recovery_target_lsn to the commit LSN from > > step 2. > > 5) Create the subscription and set the replication origin to the same > > commit LSN. > > 6) Enable the subscription. > > > > This setup should reproduce the issue where the transaction gets > > applied twice on the subscriber due to the replication origin being > > aligned with the commit LSN rather than its end. > > > > Thoughts? > > Hi Vignesh, > > I have verified the behaviour shared by you for Logical Replication > and pg_createsubscriber and agree with your analysis. > I have also tried the steps shared by you and am able to reproduce the issue. > > I am thinking of resolving it by introducing a new API, which can give > us the next lsn to the lsn provided. > During pg_createsusbcriber run where we are advancing the replication > origin to 'consistent_lsn'. I think we should advance it to the next > lsn of 'consistent_lsn' instead. > I think this will resolve the issue. Thoughts? With this approach, there is a risk of starting from the next WAL record after the consistent point. For example, if the slot returns a consistent point at 0/1715E10, after the fix we would begin replaying from the next WAL record, such as 0/1715E40, which could potentially lead to data loss. As an alternative, we could set recovery_target_inclusive to false in the setup_recovery function. This way, recovery would stop just before the recovery target, allowing the publisher to start replicating exactly from the consistent point. Thoughts? Regards, Vignesh
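To make the alternative concrete, here is a sketch of the recovery parameters that pg_createsubscriber places on the standby, with the proposed change applied. The LSN is a placeholder for the slot's consistent point, and the exact set of parameters written by setup_recovery() may include more than is shown here.

```
# appended to the standby's postgresql.auto.conf by setup_recovery() (sketch)
recovery_target_lsn       = '0/1715E10'   # consistent_lsn returned by the slot
recovery_target_inclusive = false         # proposed change; currently true
recovery_target_action    = 'promote'
```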
On Mon, Apr 28, 2025 at 2:58 PM vignesh C <vignesh21@gmail.com> wrote:
> As an alternative, we could set recovery_target_inclusive to false in
> the setup_recovery function. This way, recovery would stop just before
> the recovery target, allowing the publisher to start replicating
> exactly from the consistent point.
> Thoughts?
This sounds to me like a simpler and more correct solution.
Thanks,
Zane
On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > > > > > I'm in the process of converting our databases from pglogical logical > > > > replication to the native logical replication implementation on PostgreSQL > > > > 17. One of the bugs we encountered and had to work around with pglogical was > > > > the plugin dropping records while converting to a streaming replica to > > > > logical via pglogical_create_subscriber (reported > > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > > > confirm that the native logical replication implementation did not have this > > > > problem, and I've found that it might have a different problem. > > > > > > > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > > > replication slot as a replication start point. The restore point approach is > > > > usually suitable to physical replication but might not cover all scenarios for > > > > logical replication (such as when there are in progress transactions). Since > > > > creating a logical replication slot does find a consistent decoding start > > > > point, it is a natural choice to start the logical replication (that also needs > > > > to find a decoding start point). > > > > > > I observed a difference between logical replication and > > > pg_createsubscriber in how the replication origin is set, which can > > > lead to different behaviors during replication restarts. > > > > > > Consider the following WAL records: > > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > > > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > > > IST > > > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > > > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > > > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > > > > > Behavior in Logical Replication > > > In logical replication, after applying the COMMIT record at LSN > > > 0/01756868, the replication origin is set to the end LSN of the commit > > > record, i.e., 0/01756898. On restart, replication resumes from the > > > next LSN after COMMIT records, which is 0/01756898 in this case. > > > As a result, the same transaction is not sent again, and duplicate > > > data is avoided. > > > > > > Behavior in pg_createsubscriber > > > However, in the case of pg_createsubscriber, the consistent point used > > > for creating the replication slot on the publisher may be set exactly > > > at the commit LSN (0/01756868, xid 767). When promoting the standby, > > > this same LSN is used as recovery_target_lsn, so the standby recovers > > > up to and including the commit of transaction 767. > > > > > > After promotion, if the replication origin is also set to this same > > > commit LSN, the subscriber will request changes starting from that > > > point. Since the origin doesn't reflect the commit as applied, > > > transaction 767 gets replicated again, leading to duplicate data and > > > possible replication failure. 
> > > > > > If the issue is not reproducible using Zane's suggested steps, we can > > > try the following: > > > 1) Stop the standby server. > > > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > > > 3) Set up the publisher, replication slot, etc., simulating the > > > pg_createsubscriber behavior. > > > 4) Restart the standby with promotion configuration from > > > setup_recovery(), setting recovery_target_lsn to the commit LSN from > > > step 2. > > > 5) Create the subscription and set the replication origin to the same > > > commit LSN. > > > 6) Enable the subscription. > > > > > > This setup should reproduce the issue where the transaction gets > > > applied twice on the subscriber due to the replication origin being > > > aligned with the commit LSN rather than its end. > > > > > > Thoughts? > > > > Hi Vignesh, > > > > I have verified the behaviour shared by you for Logical Replication > > and pg_createsubscriber and agree with your analysis. > > I have also tried the steps shared by you and am able to reproduce the issue. > > > > I am thinking of resolving it by introducing a new API, which can give > > us the next lsn to the lsn provided. > > During pg_createsusbcriber run where we are advancing the replication > > origin to 'consistent_lsn'. I think we should advance it to the next > > lsn of 'consistent_lsn' instead. > > I think this will resolve the issue. Thoughts? > > With this approach, there is a risk of starting from the next WAL > record after the consistent point. For example, if the slot returns a > consistent point at 0/1715E10, after the fix we would begin replaying > from the next WAL record, such as 0/1715E40, which could potentially > lead to data loss. > As an alternative, we could set recovery_target_inclusive to false in > the setup_recovery function. This way, recovery would stop just before > the recovery target, allowing the publisher to start replicating > exactly from the consistent point. > Thoughts? This approach looks better to me. I have prepared the patch for the same. With the patch I have run the test script in [1]. It ran for ~7hrs and it did not reproduce the issue. [1]: https://www.postgresql.org/message-id/MEYP282MB17994D100FBBDA3A9FFC9E3AFFBD2%40MEYP282MB1799.AUSP282.PROD.OUTLOOK.COM Thanks and Regards, Shlok Kyal
Attachments
On Tue, 29 Apr 2025 at 13:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > > > On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > > > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > > > > > > > I'm in the process of converting our databases from pglogical logical > > > > > replication to the native logical replication implementation on PostgreSQL > > > > > 17. One of the bugs we encountered and had to work around with pglogical was > > > > > the plugin dropping records while converting to a streaming replica to > > > > > logical via pglogical_create_subscriber (reported > > > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > > > > confirm that the native logical replication implementation did not have this > > > > > problem, and I've found that it might have a different problem. > > > > > > > > > > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > > > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > > > > replication slot as a replication start point. The restore point approach is > > > > > usually suitable to physical replication but might not cover all scenarios for > > > > > logical replication (such as when there are in progress transactions). Since > > > > > creating a logical replication slot does find a consistent decoding start > > > > > point, it is a natural choice to start the logical replication (that also needs > > > > > to find a decoding start point). > > > > > > > > I observed a difference between logical replication and > > > > pg_createsubscriber in how the replication origin is set, which can > > > > lead to different behaviors during replication restarts. > > > > > > > > Consider the following WAL records: > > > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > > > > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > > > > IST > > > > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > > > > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > > > > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > > > > > > > Behavior in Logical Replication > > > > In logical replication, after applying the COMMIT record at LSN > > > > 0/01756868, the replication origin is set to the end LSN of the commit > > > > record, i.e., 0/01756898. On restart, replication resumes from the > > > > next LSN after COMMIT records, which is 0/01756898 in this case. > > > > As a result, the same transaction is not sent again, and duplicate > > > > data is avoided. > > > > > > > > Behavior in pg_createsubscriber > > > > However, in the case of pg_createsubscriber, the consistent point used > > > > for creating the replication slot on the publisher may be set exactly > > > > at the commit LSN (0/01756868, xid 767). When promoting the standby, > > > > this same LSN is used as recovery_target_lsn, so the standby recovers > > > > up to and including the commit of transaction 767. > > > > > > > > After promotion, if the replication origin is also set to this same > > > > commit LSN, the subscriber will request changes starting from that > > > > point. 
Since the origin doesn't reflect the commit as applied, > > > > transaction 767 gets replicated again, leading to duplicate data and > > > > possible replication failure. > > > > > > > > If the issue is not reproducible using Zane's suggested steps, we can > > > > try the following: > > > > 1) Stop the standby server. > > > > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > > > > 3) Set up the publisher, replication slot, etc., simulating the > > > > pg_createsubscriber behavior. > > > > 4) Restart the standby with promotion configuration from > > > > setup_recovery(), setting recovery_target_lsn to the commit LSN from > > > > step 2. > > > > 5) Create the subscription and set the replication origin to the same > > > > commit LSN. > > > > 6) Enable the subscription. > > > > > > > > This setup should reproduce the issue where the transaction gets > > > > applied twice on the subscriber due to the replication origin being > > > > aligned with the commit LSN rather than its end. > > > > > > > > Thoughts? > > > > > > Hi Vignesh, > > > > > > I have verified the behaviour shared by you for Logical Replication > > > and pg_createsubscriber and agree with your analysis. > > > I have also tried the steps shared by you and am able to reproduce the issue. > > > > > > I am thinking of resolving it by introducing a new API, which can give > > > us the next lsn to the lsn provided. > > > During pg_createsusbcriber run where we are advancing the replication > > > origin to 'consistent_lsn'. I think we should advance it to the next > > > lsn of 'consistent_lsn' instead. > > > I think this will resolve the issue. Thoughts? > > > > With this approach, there is a risk of starting from the next WAL > > record after the consistent point. For example, if the slot returns a > > consistent point at 0/1715E10, after the fix we would begin replaying > > from the next WAL record, such as 0/1715E40, which could potentially > > lead to data loss. > > As an alternative, we could set recovery_target_inclusive to false in > > the setup_recovery function. This way, recovery would stop just before > > the recovery target, allowing the publisher to start replicating > > exactly from the consistent point. > > Thoughts? > > This approach looks better to me. > I have prepared the patch for the same. > > With the patch I have run the test script in [1]. It ran for ~7hrs and > it did not reproduce the issue. Can you check and see that the original scenario gets verified, that is the consistent lsn record type is COMMIT. You can use the patch attached which will print the consistent lsn record type. Make sure to create the pg_walinspect extension before running pg_createsubscriber as the pg_walinspect's pg_get_wal_record_info function is used. Regards, Vignesh
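As an aside, the same check can be performed by hand with pg_walinspect's pg_get_wal_record_info() on the publisher (PostgreSQL 15 or newer; the record_type output column assumes a reasonably recent release). The LSN below is a placeholder for the consistent point reported when the slot was created.

```
CREATE EXTENSION IF NOT EXISTS pg_walinspect;

-- In the problematic scenario, the record starting at the consistent point is
-- a Transaction/COMMIT record rather than, say, RUNNING_XACTS.
SELECT start_lsn, end_lsn, resource_manager, record_type
  FROM pg_get_wal_record_info('0/01756868');
```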
Attachments
On Mon, 5 May 2025 at 20:17, vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 29 Apr 2025 at 13:17, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Thu, 24 Apr 2025 at 11:57, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > > > On Thu, 24 Apr 2025 at 09:08, vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > > > On Wed, 23 Apr 2025 at 07:29, Euler Taveira <euler@eulerto.com> wrote: > > > > > > > > > > > > On Wed, Apr 16, 2025, at 8:14 PM, PG Bug reporting form wrote: > > > > > > > > > > > > I'm in the process of converting our databases from pglogical logical > > > > > > replication to the native logical replication implementation on PostgreSQL > > > > > > 17. One of the bugs we encountered and had to work around with pglogical was > > > > > > the plugin dropping records while converting to a streaming replica to > > > > > > logical via pglogical_create_subscriber (reported > > > > > > https://github.com/2ndQuadrant/pglogical/issues/349). I was trying to > > > > > > confirm that the native logical replication implementation did not have this > > > > > > problem, and I've found that it might have a different problem. > > > > > > > > > > > > > > > > > > pg_createsubscriber uses a different approach than pglogical. While pglogical > > > > > > uses a restore point, pg_createsubscriber uses the LSN from the latest > > > > > > replication slot as a replication start point. The restore point approach is > > > > > > usually suitable to physical replication but might not cover all scenarios for > > > > > > logical replication (such as when there are in progress transactions). Since > > > > > > creating a logical replication slot does find a consistent decoding start > > > > > > point, it is a natural choice to start the logical replication (that also needs > > > > > > to find a decoding start point). > > > > > > > > > > I observed a difference between logical replication and > > > > > pg_createsubscriber in how the replication origin is set, which can > > > > > lead to different behaviors during replication restarts. > > > > > > > > > > Consider the following WAL records: > > > > > rmgr: Transaction len (rec/tot): 46/ 46, tx: 767, lsn: > > > > > 0/01756868, prev 0/01756780, desc: COMMIT 2025-04-23 15:49:24.349942 > > > > > IST > > > > > rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: > > > > > 0/01756898, prev 0/01756868, desc: RUNNING_XACTS nextXid 768 > > > > > latestCompletedXid 767 oldestRunningXid 755; 1 xacts: 755 > > > > > > > > > > Behavior in Logical Replication > > > > > In logical replication, after applying the COMMIT record at LSN > > > > > 0/01756868, the replication origin is set to the end LSN of the commit > > > > > record, i.e., 0/01756898. On restart, replication resumes from the > > > > > next LSN after COMMIT records, which is 0/01756898 in this case. > > > > > As a result, the same transaction is not sent again, and duplicate > > > > > data is avoided. > > > > > > > > > > Behavior in pg_createsubscriber > > > > > However, in the case of pg_createsubscriber, the consistent point used > > > > > for creating the replication slot on the publisher may be set exactly > > > > > at the commit LSN (0/01756868, xid 767). When promoting the standby, > > > > > this same LSN is used as recovery_target_lsn, so the standby recovers > > > > > up to and including the commit of transaction 767. 
> > > > > > > > > > After promotion, if the replication origin is also set to this same > > > > > commit LSN, the subscriber will request changes starting from that > > > > > point. Since the origin doesn't reflect the commit as applied, > > > > > transaction 767 gets replicated again, leading to duplicate data and > > > > > possible replication failure. > > > > > > > > > > If the issue is not reproducible using Zane's suggested steps, we can > > > > > try the following: > > > > > 1) Stop the standby server. > > > > > 2) Perform an insert transaction and note the commit LSN using pg_waldump. > > > > > 3) Set up the publisher, replication slot, etc., simulating the > > > > > pg_createsubscriber behavior. > > > > > 4) Restart the standby with promotion configuration from > > > > > setup_recovery(), setting recovery_target_lsn to the commit LSN from > > > > > step 2. > > > > > 5) Create the subscription and set the replication origin to the same > > > > > commit LSN. > > > > > 6) Enable the subscription. > > > > > > > > > > This setup should reproduce the issue where the transaction gets > > > > > applied twice on the subscriber due to the replication origin being > > > > > aligned with the commit LSN rather than its end. > > > > > > > > > > Thoughts? > > > > > > > > Hi Vignesh, > > > > > > > > I have verified the behaviour shared by you for Logical Replication > > > > and pg_createsubscriber and agree with your analysis. > > > > I have also tried the steps shared by you and am able to reproduce the issue. > > > > > > > > I am thinking of resolving it by introducing a new API, which can give > > > > us the next lsn to the lsn provided. > > > > During pg_createsusbcriber run where we are advancing the replication > > > > origin to 'consistent_lsn'. I think we should advance it to the next > > > > lsn of 'consistent_lsn' instead. > > > > I think this will resolve the issue. Thoughts? > > > > > > With this approach, there is a risk of starting from the next WAL > > > record after the consistent point. For example, if the slot returns a > > > consistent point at 0/1715E10, after the fix we would begin replaying > > > from the next WAL record, such as 0/1715E40, which could potentially > > > lead to data loss. > > > As an alternative, we could set recovery_target_inclusive to false in > > > the setup_recovery function. This way, recovery would stop just before > > > the recovery target, allowing the publisher to start replicating > > > exactly from the consistent point. > > > Thoughts? > > > > This approach looks better to me. > > I have prepared the patch for the same. > > > > With the patch I have run the test script in [1]. It ran for ~7hrs and > > it did not reproduce the issue. > > Can you check and see that the original scenario gets verified, that > is the consistent lsn record type is COMMIT. You can use the patch > attached which will print the consistent lsn record type. Make sure to > create the pg_walinspect extension before running pg_createsubscriber > as the pg_walinspect's pg_get_wal_record_info function is used. > Hi Vignesh, Thanks for providing the patch. I have verified the original case with the patch on both HEAD and PG_17. With the changes in [1], I noticed that even if the consistent lsn is set to 'COMMIT' record, we do not get data mismatches or the primary key conflict. I have also attached the logs for test runs with your changes. I have also attached the patches which apply on HEAD and PG_17. 
[1] : https://www.postgresql.org/message-id/CANhcyEVZ-Gf8WJBwXfhrz%2BKg89wnzmTMqEqvu8zESEokn2Q7Gg%40mail.gmail.com Thanks and Regards, Shlok Kyal
Attachments
On Sun, 11 May 2025 at 00:49, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > Thanks for providing the patch. I have verified the original case with > the patch on both HEAD and PG_17. > With the changes in [1], I noticed that even if the consistent lsn is > set to 'COMMIT' record, we do not get data mismatches or the primary > key conflict. > I have also attached the logs for test runs with your changes. > > I have also attached the patches which apply on HEAD and PG_17. The changes look good to me, I don't have any more comments. Regards, Vignesh
On Tue, Apr 29, 2025 at 1:17 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > > > With this approach, there is a risk of starting from the next WAL > > record after the consistent point. For example, if the slot returns a > > consistent point at 0/1715E10, after the fix we would begin replaying > > from the next WAL record, such as 0/1715E40, which could potentially > > lead to data loss. > > As an alternative, we could set recovery_target_inclusive to false in > > the setup_recovery function. This way, recovery would stop just before > > the recovery target, allowing the publisher to start replicating > > exactly from the consistent point. > > Thoughts? > > This approach looks better to me. > I have prepared the patch for the same. > We should find out in which case, and why, the consistent_lsn is the start LSN of a commit record. We use the slot's confirmed_flush LSN as the consistent_lsn, which should normally be the end point of a running_xacts record or the end LSN of a commit record (in case the client sends an ack). If we decide to fix it in the way proposed here, then we also need to investigate whether we still need the additional WAL record added by commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason that additional WAL record was added is discussed in email [1]. [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed -- With Regards, Amit Kapila.
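For reference, the consistent_lsn in question is simply the slot's confirmed_flush LSN, which can be inspected on the publisher while the slot still exists (the slot name is a placeholder):

```
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'sub1_slot';
```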
On Thu, 10 Jul 2025 at 12:33, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Apr 29, 2025 at 1:17 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > > > > > With this approach, there is a risk of starting from the next WAL > > > record after the consistent point. For example, if the slot returns a > > > consistent point at 0/1715E10, after the fix we would begin replaying > > > from the next WAL record, such as 0/1715E40, which could potentially > > > lead to data loss. > > > As an alternative, we could set recovery_target_inclusive to false in > > > the setup_recovery function. This way, recovery would stop just before > > > the recovery target, allowing the publisher to start replicating > > > exactly from the consistent point. > > > Thoughts? > > > > This approach looks better to me. > > I have prepared the patch for the same. > > > > We should find out in which case and why the consisten_lsn is a start > point LSN of a commit record. We use slot's confirm_flush LSN location > as a consistent_lsn, which normally should be the end point of > running_xacts record or commit_end LSN record (in case client sends > ack). > I checked it and here is my analysis: When we create a slot, it returns the confirmed_flush LSN as a consistent_lsn. I noticed that in general when we create a slot, the confirmed_flush is set to the end of a RUNNING_XACT log or we can say start of the next record. And this next record can be anything. Ii can be a COMMIT record for a transaction in another session. I have attached server logs and waldump logs for one of such case reproduced using test script shared in [1]. The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT, FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a RUNNING_XACT is logged. When the command to create a slot is executed, during the first RUNNING_XACT, we get following logs: 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found initial starting point at 0/3C813D8 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: Waiting for transactions (approximately 2) older than 1353 to end. 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false) waldump log corresponding to this: rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn: 0/03C813D8, prev 0/03C81390, desc: RUNNING_XACTS nextXid 1353 latestCompletedXid 1350 oldestRunningXid 1351; 2 xacts: 1352 1351 For second RUNNING_XACT, we get following logs: 2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found initial consistent point at 0/3C97948 2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: Waiting for transactions (approximately 8) older than 1362 to end. 
2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false) waldump record: len (rec/tot): 82/ 82, tx: 0, lsn: 0/03C97948, prev 0/03C97900, desc: RUNNING_XACTS nextXid 1362 latestCompletedXid 1353 oldestRunningXid 1354; 8 xacts: 1357 1358 1355 1354 1359 1361 1356 1360 For third RUNNING_XACT, we get following logs:: 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found consistent point at 0/3CBCC58 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: There are no old transactions anymore. 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false) waldump record: rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367 The consistent point is found at "0/3CBCC58". When slot is created the confirmed_flush is set inside function "DecodingContextFindStartpoint" using: slot->data.confirmed_flush = ctx->reader->EndRecPtr; In our case the value of consistent_lsn is "0/03CBCCA0" (I added some logs and got the value). Logs: 2025-07-20 16:50:18.039 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: #### confirmed_flush = 0/03CBCCA0 inside DecodingContextFindStartpoint 2025-07-20 16:50:18.039 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false) This consistent_lsn "0/03CBCCA0" is nothing but End of RUNNING_XACT ( whose start is "0/3CBCC58"). While the slot is being created a transaction in a concurrent session commits (just after the third RUNNING_XACT) and add a COMMIT log: rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST So, in such cases the consistent LSN can be set to a COMMIT record. > If we decide to fix in the way proposed here, then we also need to > investigate whether we need an additional WAL record added by commit' > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason why that > additional WAL record was added is discussed in email [1]. > > [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed I reverted the change added by commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc and applied my patch and checked the behaviour. And I am able to reproduce the issue the commit was resolving. I think this change is still required. This change is still required because, while recovery is performed in the function 'PerformWalRecovery', when recovery_target_inclusive is set to false, function 'recoveryStopsBefore' is responsible to set whether recovery is finished or not. This function will set 'reachedRecoveryTarget' to true when it satisfy the condition /* Check if target LSN has been reached */ if (recoveryTarget == RECOVERY_TARGET_LSN && !recoveryTargetInclusive && record->ReadRecPtr >= recoveryTargetLSN) Here we are checking if "start of the record" >= recoveryTargetLSN. When a replication slot is created, consistent_lsn is obtained. 
Since this consistent_lsn points to the end of the record (or, equivalently, the start of the next record), there can be a case where no WAL record starts at the consistent LSN. So, during recovery, the startup process will wait until it reads a record starting at or after the consistent LSN (in my testing this wait was ~20 seconds), and this wait can cause the timeout issue. I have manually debugged and checked the above case, and I think the change in commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc is still needed. I have attached the rebased patches for HEAD, REL_18_STABLE and REL_17_STABLE. [1]: https://www.postgresql.org/message-id/MEYP282MB17994D100FBBDA3A9FFC9E3AFFBD2@MEYP282MB1799.AUSP282.PROD.OUTLOOK.COM Thanks, Shlok Kyal
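The WAL sequence analysed above can also be listed from SQL instead of pg_waldump; a sketch using pg_walinspect's pg_get_wal_records_info() (the extension must be installed), with LSN bounds taken from the excerpts in this message (adjust them for your own run):

```
-- Shows the RUNNING_XACTS record at 0/03CBCC58 followed by the COMMIT of
-- xid 1369 at 0/03CBCCA0, i.e. the record starting at the consistent point.
SELECT start_lsn, end_lsn, xid, resource_manager, record_type
  FROM pg_get_wal_records_info('0/03CBCC58', '0/03CBCD00');
```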
Attachments
RE: BUG #18897: Logical replication conflict after using pg_createsubscriber under heavy load
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Shlok, > I checked it and here is my analysis: > > When we create a slot, it returns the confirmed_flush LSN as a > consistent_lsn. I noticed that in general when we create a slot, the > confirmed_flush is set to the end of a RUNNING_XACT log or we can say > start of the next record. And this next record can be anything. Ii can > be a COMMIT record for a transaction in another session. > I have attached server logs and waldump logs for one of such case > reproduced using test script shared in [1]. > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT, > FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a > RUNNING_XACT is logged. ... Thanks for the analysis! It is quite helpful. Based on your points, my understanding is as below. Is it correct? Facts: ===== 1. RUNNING_XACTS records can be generated when the snapshot state is advanced while creating the slot. 2. pg_create_logical_replication_slot() returns the end point of the RUNNING_XACTS record that was generated when the snapshot became SNAPBUILD_CONSISTENT. 3. Some transactions could be started while the snapshot is in the FULL_SNAPSHOT state, and they can be committed after we reached SNAPBUILD_CONSISTENT. Such transactions should be output by the upcoming logical decoding. What happened here: ================= a. confirmed_flush_lsn was 0/03CBCCA0, which is the end of the RUNNING_XACTS record (lsn: 0/03CBCC58). Also, a COMMIT record for txn 1369 is located *just after* that RUNNING_XACTS record [1]. b. pg_createsubscriber set recovery_target_lsn to "0/03CBCCA0", and recovery_target_inclusive was true. This meant the record starting at "0/03CBCCA0" had to be applied. c. The startup process applied up to that point; transaction 1369 was applied and then the standby could be promoted. d. The logical walsender then decoded transaction 1369 and replicated it to the standby. However, it had already been applied by the startup process, so a conflict could happen. [1]: according to the log: ``` ... rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367 rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST ... ``` Best regards, Hayato Kuroda FUJITSU LIMITED
On Tue, 22 Jul 2025 at 17:51, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shlok, > > > I checked it and here is my analysis: > > > > When we create a slot, it returns the confirmed_flush LSN as a > > consistent_lsn. I noticed that in general when we create a slot, the > > confirmed_flush is set to the end of a RUNNING_XACT log or we can say > > start of the next record. And this next record can be anything. Ii can > > be a COMMIT record for a transaction in another session. > > I have attached server logs and waldump logs for one of such case > > reproduced using test script shared in [1]. > > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT, > > FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a > > RUNNING_XACT is logged. > ... > > Thanks for the analysis! It is quite helpful. Based on your point I understood > like below. Are they correct? > > Facts: > ===== > 1. > RUNNING_XACT records can be generated when the snapshot status is advanced while > creating the slot. > 2. > pg_create_logical_replication_slot() returns the end point of RUNNING_XACT. > It was generated when the snapshot becomes SNAPBUILD_CONSISTENT. > 3. > Some transactions could be started while the snapshot is FULL_SNAPSHOT state, and > they can be committed after we reached SNAPBUILD_CONSISTENT. Such transactions > should be output by the upcoming logical decoding. > > What happened here: > ================= > a. > confirmed_flush_lsn was 0/03CBCCA0, which is end of RUNNING_XACT (lsn: 0/03CBCC58). > Also, a COMMIT record for txn 1369 located *just after* the RUNNING_XACT [1]. > b. > pg_createsubscriber set the recovery_target_lsn to "0/03CBCCA0", and > recovery_target_inclusive was true. This meant record stared from "0/03CBCCA0" > must be applied. > c. > startup process applied till that point. Transaction 1369 was applied and then the > standby could be promoted. > e. > logical walsender decoded transaction 1369 and replicated it to the standby. > However, it has already been applied by startup thus conflict could happen. > > [1]: > according to the log: > ``` > ... > rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367 > rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-2016:50:18.031146 IST > ... > ``` > > Best regards, > Hayato Kuroda > FUJITSU LIMITED > Hi Kuroda-san, Thanks for reviewing the thread. Your understanding is correct. Thanks, Shlok Kyal
On Tue, Jul 22, 2025 at 4:54 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Thu, 10 Jul 2025 at 12:33, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > We should find out in which case and why the consisten_lsn is a start > > point LSN of a commit record. We use slot's confirm_flush LSN location > > as a consistent_lsn, which normally should be the end point of > > running_xacts record or commit_end LSN record (in case client sends > > ack). > > > I checked it and here is my analysis: > > When we create a slot, it returns the confirmed_flush LSN as a > consistent_lsn. I noticed that in general when we create a slot, the > confirmed_flush is set to the end of a RUNNING_XACT log or we can say > start of the next record. And this next record can be anything. Ii can > be a COMMIT record for a transaction in another session. > ... > waldump record: > rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: > 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 > latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 > 1368 1367 > The consistent point is found at "0/3CBCC58". > > When slot is created the confirmed_flush is set inside function > "DecodingContextFindStartpoint" using: > slot->data.confirmed_flush = ctx->reader->EndRecPtr; > In our case the value of consistent_lsn is "0/03CBCCA0" (I added some > logs and got the value). Logs: > 2025-07-20 16:50:18.039 IST [1780326] port=5340 > ubuntu@test_db/[unknown] LOG: #### confirmed_flush = 0/03CBCCA0 > inside DecodingContextFindStartpoint > 2025-07-20 16:50:18.039 IST [1780326] port=5340 > ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', > false, false, false) > > This consistent_lsn "0/03CBCCA0" is nothing but End of RUNNING_XACT ( > whose start is "0/3CBCC58"). > > While the slot is being created a transaction in a concurrent session > commits (just after the third RUNNING_XACT) and add a COMMIT log: > rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: > 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 > IST > > So, in such cases the consistent LSN can be set to a COMMIT record. > Your analysis and proposed patch looks good to me. I'll push this patch tomorrow unless Euler or someone thinks otherwise. > > If we decide to fix in the way proposed here, then we also need to > > investigate whether we need an additional WAL record added by commit' > > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason why that > > additional WAL record was added is discussed in email [1]. > > > > [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed > > I reverted the change added by commit > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc and applied my patch and > checked the behaviour. And I am able to reproduce the issue the commit > was resolving. I think this change is still required. > This change is still required because, while recovery is performed in > the function 'PerformWalRecovery', when recovery_target_inclusive is > set to false, function 'recoveryStopsBefore' is responsible to set > whether recovery is finished or not. This function will set > 'reachedRecoveryTarget' to true when it satisfy the condition > /* Check if target LSN has been reached */ > if (recoveryTarget == RECOVERY_TARGET_LSN && > !recoveryTargetInclusive && > record->ReadRecPtr >= recoveryTargetLSN) > Here we are checking if "start of the record" >= recoveryTargetLSN. 
> > When a replication slot is created, consistent_lsn is obtained. Since > this consistent_lsn points to End of the record (or we can say start > of the next record), there can be a case that there is no WAL record > corresponding to the consistent lsn. So, during the recovery, it will > wait till it reads the record corresponding to consistent lsn (during > my testing this wait was around ~20 sec). And this wait can create the > timeout issue. > I have manually debugged and checked the above case and I think the > change in commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc is still > needed. > Agreed. Thanks for the detailed analysis. -- With Regards, Amit Kapila.
On Tue, 22 Jul 2025 at 16:54, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > On Thu, 10 Jul 2025 at 12:33, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Apr 29, 2025 at 1:17 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > > > > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > With this approach, there is a risk of starting from the next WAL > > > > record after the consistent point. For example, if the slot returns a > > > > consistent point at 0/1715E10, after the fix we would begin replaying > > > > from the next WAL record, such as 0/1715E40, which could potentially > > > > lead to data loss. > > > > As an alternative, we could set recovery_target_inclusive to false in > > > > the setup_recovery function. This way, recovery would stop just before > > > > the recovery target, allowing the publisher to start replicating > > > > exactly from the consistent point. > > > > Thoughts? > > > > > > This approach looks better to me. > > > I have prepared the patch for the same. > > > > > > > We should find out in which case and why the consisten_lsn is a start > > point LSN of a commit record. We use slot's confirm_flush LSN location > > as a consistent_lsn, which normally should be the end point of > > running_xacts record or commit_end LSN record (in case client sends > > ack). > > > I checked it and here is my analysis: > > When we create a slot, it returns the confirmed_flush LSN as a > consistent_lsn. I noticed that in general when we create a slot, the > confirmed_flush is set to the end of a RUNNING_XACT log or we can say > start of the next record. And this next record can be anything. Ii can > be a COMMIT record for a transaction in another session. > I have attached server logs and waldump logs for one of such case > reproduced using test script shared in [1]. > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT, > FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a > RUNNING_XACT is logged. > > When the command to create a slot is executed, during the first > RUNNING_XACT, we get following logs: > 2025-07-20 16:50:18.002 IST [1780326] port=5340 > ubuntu@test_db/[unknown] LOG: logical decoding found initial starting > point at 0/3C813D8 > 2025-07-20 16:50:18.002 IST [1780326] port=5340 > ubuntu@test_db/[unknown] DETAIL: Waiting for transactions > (approximately 2) older than 1353 to end. > 2025-07-20 16:50:18.002 IST [1780326] port=5340 > ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', > false, false, false) > > waldump log corresponding to this: > rmgr: Standby len (rec/tot): 58/ 58, tx: 0, lsn: > 0/03C813D8, prev 0/03C81390, desc: RUNNING_XACTS nextXid 1353 > latestCompletedXid 1350 oldestRunningXid 1351; 2 xacts: 1352 1351 > > For second RUNNING_XACT, we get following logs: > 2025-07-20 16:50:18.007 IST [1780326] port=5340 > ubuntu@test_db/[unknown] LOG: logical decoding found initial > consistent point at 0/3C97948 > 2025-07-20 16:50:18.007 IST [1780326] port=5340 > ubuntu@test_db/[unknown] DETAIL: Waiting for transactions > (approximately 8) older than 1362 to end. 
> 2025-07-20 16:50:18.007 IST [1780326] port=5340 > ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', > false, false, false) > > waldump record: > len (rec/tot): 82/ 82, tx: 0, lsn: 0/03C97948, prev > 0/03C97900, desc: RUNNING_XACTS nextXid 1362 latestCompletedXid 1353 > oldestRunningXid 1354; 8 xacts: 1357 1358 1355 1354 1359 1361 1356 > 1360 > > For third RUNNING_XACT, we get following logs:: > 2025-07-20 16:50:18.038 IST [1780326] port=5340 > ubuntu@test_db/[unknown] LOG: logical decoding found consistent point > at 0/3CBCC58 > 2025-07-20 16:50:18.038 IST [1780326] port=5340 > ubuntu@test_db/[unknown] DETAIL: There are no old transactions > anymore. > 2025-07-20 16:50:18.038 IST [1780326] port=5340 > ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', > false, false, false) > > waldump record: > rmgr: Standby len (rec/tot): 70/ 70, tx: 0, lsn: > 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 > latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 > 1368 1367 > The consistent point is found at "0/3CBCC58". > > When slot is created the confirmed_flush is set inside function > "DecodingContextFindStartpoint" using: > slot->data.confirmed_flush = ctx->reader->EndRecPtr; > In our case the value of consistent_lsn is "0/03CBCCA0" (I added some > logs and got the value). Logs: > 2025-07-20 16:50:18.039 IST [1780326] port=5340 > ubuntu@test_db/[unknown] LOG: #### confirmed_flush = 0/03CBCCA0 > inside DecodingContextFindStartpoint > 2025-07-20 16:50:18.039 IST [1780326] port=5340 > ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM > pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', > false, false, false) > > This consistent_lsn "0/03CBCCA0" is nothing but End of RUNNING_XACT ( > whose start is "0/3CBCC58"). > > While the slot is being created a transaction in a concurrent session > commits (just after the third RUNNING_XACT) and add a COMMIT log: > rmgr: Transaction len (rec/tot): 46/ 46, tx: 1369, lsn: > 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 > IST > > So, in such cases the consistent LSN can be set to a COMMIT record. > > > If we decide to fix in the way proposed here, then we also need to > > investigate whether we need an additional WAL record added by commit' > > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason why that > > additional WAL record was added is discussed in email [1]. > > > > [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed > > I reverted the change added by commit > 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc and applied my patch and > checked the behaviour. And I am able to reproduce the issue the commit > was resolving. I think this change is still required. > This change is still required because, while recovery is performed in > the function 'PerformWalRecovery', when recovery_target_inclusive is > set to false, function 'recoveryStopsBefore' is responsible to set > whether recovery is finished or not. This function will set > 'reachedRecoveryTarget' to true when it satisfy the condition > /* Check if target LSN has been reached */ > if (recoveryTarget == RECOVERY_TARGET_LSN && > !recoveryTargetInclusive && > record->ReadRecPtr >= recoveryTargetLSN) > Here we are checking if "start of the record" >= recoveryTargetLSN. > > When a replication slot is created, consistent_lsn is obtained. 
Since > this consistent_lsn points to End of the record (or we can say start > of the next record), there can be a case that there is no WAL record > corresponding to the consistent lsn. So, during the recovery, it will > wait till it reads the record corresponding to consistent lsn (during > my testing this wait was around ~20 sec). And this wait can create the > timeout issue. > I have manually debugged and checked the above case and I think the > change in commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc is still > needed. I agree with your analysis. How about adding a comment in the following lines here: /* Set recovery_target_inclusive = false to avoid reapplying the transaction committed at consistent_lsn after subscription is enabled. */ appendPQExpBufferStr(recoveryconfcontents, - "recovery_target_inclusive = true\n"); + "recovery_target_inclusive = false\n"); appendPQExpBufferStr(recoveryconfcontents, Also the master branch patch can be applied for PG18 branch, just master and branch PG17 branch patch is enough. Regards, Vignesh
On Wed, 23 Jul 2025 at 14:55, vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 22 Jul 2025 at 16:54, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
> >
> > On Thu, 10 Jul 2025 at 12:33, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 1:17 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
> > > >
> > > > On Mon, 28 Apr 2025 at 10:28, vignesh C <vignesh21@gmail.com> wrote:
> > > > >
> > > > > With this approach, there is a risk of starting from the next WAL record after the consistent point. For example, if the slot returns a consistent point at 0/1715E10, after the fix we would begin replaying from the next WAL record, such as 0/1715E40, which could potentially lead to data loss.
> > > > > As an alternative, we could set recovery_target_inclusive to false in the setup_recovery function. This way, recovery would stop just before the recovery target, allowing the publisher to start replicating exactly from the consistent point.
> > > > > Thoughts?
> > > >
> > > > This approach looks better to me.
> > > > I have prepared the patch for the same.
> > >
> > > We should find out in which case and why the consistent_lsn is the start point LSN of a commit record. We use the slot's confirmed_flush LSN location as the consistent_lsn, which normally should be the end point of a running_xacts record or a commit_end LSN record (in case the client sends an ack).
> >
> > I checked it and here is my analysis:
> >
> > When we create a slot, it returns the confirmed_flush LSN as the consistent_lsn. I noticed that, in general, when we create a slot the confirmed_flush is set to the end of a RUNNING_XACTS record, or we can say the start of the next record. And this next record can be anything. It can be a COMMIT record for a transaction in another session.
> > I have attached server logs and waldump logs for one such case, reproduced using the test script shared in [1].
> > The snapbuild machinery has four steps: START, BUILDING_SNAPSHOT, FULL_SNAPSHOT and SNAPBUILD_CONSISTENT. Between each step a RUNNING_XACTS record is logged.
> >
> > When the command to create a slot is executed, during the first RUNNING_XACTS, we get the following logs:
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found initial starting point at 0/3C813D8
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: Waiting for transactions (approximately 2) older than 1353 to end.
> > 2025-07-20 16:50:18.002 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false)
> >
> > waldump log corresponding to this:
> > rmgr: Standby len (rec/tot): 58/58, tx: 0, lsn: 0/03C813D8, prev 0/03C81390, desc: RUNNING_XACTS nextXid 1353 latestCompletedXid 1350 oldestRunningXid 1351; 2 xacts: 1352 1351
> >
> > For the second RUNNING_XACTS, we get the following logs:
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found initial consistent point at 0/3C97948
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: Waiting for transactions (approximately 8) older than 1362 to end.
> > 2025-07-20 16:50:18.007 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false)
> >
> > waldump record:
> > len (rec/tot): 82/82, tx: 0, lsn: 0/03C97948, prev 0/03C97900, desc: RUNNING_XACTS nextXid 1362 latestCompletedXid 1353 oldestRunningXid 1354; 8 xacts: 1357 1358 1355 1354 1359 1361 1356 1360
> >
> > For the third RUNNING_XACTS, we get the following logs:
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: logical decoding found consistent point at 0/3CBCC58
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] DETAIL: There are no old transactions anymore.
> > 2025-07-20 16:50:18.038 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false)
> >
> > waldump record:
> > rmgr: Standby len (rec/tot): 70/70, tx: 0, lsn: 0/03CBCC58, prev 0/03CBCC18, desc: RUNNING_XACTS nextXid 1370 latestCompletedXid 1364 oldestRunningXid 1365; 5 xacts: 1366 1365 1369 1368 1367
> >
> > The consistent point is found at "0/3CBCC58".
> >
> > When the slot is created, confirmed_flush is set inside the function "DecodingContextFindStartpoint" using:
> > slot->data.confirmed_flush = ctx->reader->EndRecPtr;
> > In our case the value of consistent_lsn is "0/03CBCCA0" (I added some logs and got the value). Logs:
> > 2025-07-20 16:50:18.039 IST [1780326] port=5340 ubuntu@test_db/[unknown] LOG: #### confirmed_flush = 0/03CBCCA0 inside DecodingContextFindStartpoint
> > 2025-07-20 16:50:18.039 IST [1780326] port=5340 ubuntu@test_db/[unknown] STATEMENT: SELECT lsn FROM pg_catalog.pg_create_logical_replication_slot('sub', 'pgoutput', false, false, false)
> >
> > This consistent_lsn "0/03CBCCA0" is nothing but the end of the RUNNING_XACTS record (whose start is "0/3CBCC58").
> >
> > While the slot is being created, a transaction in a concurrent session commits (just after the third RUNNING_XACTS) and adds a COMMIT record:
> > rmgr: Transaction len (rec/tot): 46/46, tx: 1369, lsn: 0/03CBCCA0, prev 0/03CBCC58, desc: COMMIT 2025-07-20 16:50:18.031146 IST
> >
> > So, in such cases the consistent LSN can be set to a COMMIT record.
> >
> > > If we decide to fix it in the way proposed here, then we also need to investigate whether we need the additional WAL record added by commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc. The reason why that additional WAL record was added is discussed in email [1].
> > >
> > > [1] - https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed
> >
> > I reverted the change added by commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc, applied my patch, and checked the behaviour. I am able to reproduce the issue that commit was resolving, so I think the change is still required.
> > It is still needed because, while recovery is performed in the function 'PerformWalRecovery', when recovery_target_inclusive is set to false the function 'recoveryStopsBefore' is responsible for deciding whether recovery is finished. This function sets 'reachedRecoveryTarget' to true when the following condition is satisfied:
> > /* Check if target LSN has been reached */
> > if (recoveryTarget == RECOVERY_TARGET_LSN &&
> >     !recoveryTargetInclusive &&
> >     record->ReadRecPtr >= recoveryTargetLSN)
> > Here we are checking whether "start of the record" >= recoveryTargetLSN.
> >
> > When a replication slot is created, the consistent_lsn is obtained. Since this consistent_lsn points to the end of the record (or, we can say, the start of the next record), there can be a case where there is no WAL record corresponding to the consistent_lsn. So, during recovery, it will wait until it reads the record corresponding to the consistent_lsn (during my testing this wait was around ~20 sec). And this wait can create the timeout issue.
> > I have manually debugged and checked the above case, and I think the change in commit 03b08c8f5f3e30c97e5908f3d3d76872dab8a9dc is still needed.
>
> I agree with your analysis.
>
> How about adding a comment in the following lines here:
>
> /* Set recovery_target_inclusive = false to avoid reapplying the
> transaction committed at consistent_lsn after subscription is enabled.
> */
>  appendPQExpBufferStr(recoveryconfcontents,
> -                     "recovery_target_inclusive = true\n");
> +                     "recovery_target_inclusive = false\n");
>  appendPQExpBufferStr(recoveryconfcontents,
>
> Also, the master branch patch can be applied to the PG18 branch, so just the master and PG17 branch patches are enough.
>

Thanks for reviewing the patch. I have added comments and attached the updated patch.
v6_HEAD applies to HEAD and REL_18_STABLE. v6_REL_17 applies to REL_17_STABLE.

Thanks,
Shlok Kyal
Attachments
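For anyone who wants to repeat the inspection described in the quoted analysis, a rough sketch of the commands involved is below. The port, database, slot name, and LSNs are taken from the logs quoted above and would differ on another setup; $PRIMARY_DIR stands in for the primary's data directory from the reproduction script, and the end LSN passed to pg_waldump is an arbitrary point chosen just past the COMMIT record.

```
# Rough sketch of the inspection described above; all values come from the
# logs in this thread and will differ on another setup.

# The slot's confirmed_flush_lsn is what is used as the consistent point
# (0/03CBCCA0 in the quoted logs).
psql -p 5340 -d test_db -c \
  "SELECT slot_name, confirmed_flush_lsn
     FROM pg_catalog.pg_replication_slots
    WHERE slot_name = 'sub';"

# Dump the WAL around the reported consistent point: the RUNNING_XACTS record
# starting at 0/03CBCC58 ends at 0/03CBCCA0, where the concurrent COMMIT begins.
pg_waldump -p "$PRIMARY_DIR/pg_wal" -s 0/03CBCC58 -e 0/03CBCD00
```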
On Wed, Jul 23, 2025 at 6:56 PM Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> Thanks for reviewing the patch. I have added comments and attached the updated patch.
> v6_HEAD applies to HEAD and REL_18_STABLE. v6_REL_17 applies to REL_17_STABLE.
>

Thanks, I have pushed the fix after changing the comments.

--
With Regards,
Amit Kapila.