pg_ctl start may return 0 even if the postmaster has been already started on Windows

From	Hayato Kuroda (Fujitsu)
Subject	pg_ctl start may return 0 even if the postmaster has been already started on Windows
Date	7 September 2023, 07:07:36
Msg-id	TYAPR01MB586654E2D74B838021BE77CAF5EEA@TYAPR01MB5866.jpnprd01.prod.outlook.com
Replies	Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
List	pgsql-hackers

Dear hackers,

While investigating the cfbot failure [1], I found some strange behavior in the
pg_ctl command. What do you think? Is this a bug to be fixed, or is it the
intended behavior?

# Problem

The "pg_ctl start" command returns 0 (success) even if the cluster has
already been started. This occurs on Windows when the command is executed
just after the postmaster starts.


# Analysis

The primary cause is in wait_for_postmaster_start(). This function reads the
postmaster.pid file and checks whether the start command has succeeded.

Check (1) requires that the postmaster was started after our pg_ctl command,
but 2 seconds of slop is allowed.

On Linux and other non-Windows platforms, check (2) is also executed to ensure
that the forked process wrote the file, so this time window is not a problem.
On Windows, however, (2) is skipped, *so the pg_ctl command may succeed if an
existing postmaster was started no more than 2 seconds before it.*

```
        if ((optlines = readfile(pid_file, &numlines)) != NULL &&
            numlines >= LOCK_FILE_LINE_PM_STATUS)
        {
            /* File is complete enough for us, parse it */
            pid_t        pmpid;
            time_t        pmstart;

            /*
             * Make sanity checks.  If it's for the wrong PID, or the recorded
             * start time is before pg_ctl started, then either we are looking
             * at the wrong data directory, or this is a pre-existing pidfile
             * that hasn't (yet?) been overwritten by our child postmaster.
             * Allow 2 seconds slop for possible cross-process clock skew.
             */
            pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
            pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
            if (pmstart >= start_time - 2 && // (1)
#ifndef WIN32
                pmpid == pm_pid // (2)
#else
            /* Windows can only reject standalone-backend PIDs */
                pmpid > 0
#endif

```
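
The effect of skipping check (2) can be illustrated with a small model of the
condition. This is only a sketch: pidfile_matches(), its is_windows flag, and
the sample timestamps and PIDs are made up for illustration and do not exist in
pg_ctl.c, where the choice is made at compile time with #ifndef WIN32.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Simplified model of the pidfile sanity check quoted above.
 * The function name and the is_windows flag are hypothetical.
 */
static bool
pidfile_matches(time_t pmstart, long pmpid,
                time_t start_time, long pm_pid, bool is_windows)
{
    /* Check (1): recorded start time may be at most 2 seconds older. */
    if (pmstart < start_time - 2)
        return false;

    /* Check (2): exact PID match, replaced by "pmpid > 0" on Windows. */
    if (is_windows)
        return pmpid > 0;
    return pmpid == pm_pid;
}

/*
 * Scenario: another postmaster (PID 4321) wrote postmaster.pid one second
 * before pg_ctl (start_time 1000) launched its own child (PID 1234).
 */
void
run_scenarios(void)
{
    /* Non-Windows: the PID mismatch rejects the pre-existing pidfile. */
    assert(!pidfile_matches(999, 4321, 1000, 1234, false));

    /* Windows: only check (1) applies, so the stale pidfile is accepted. */
    assert(pidfile_matches(999, 4321, 1000, 1234, true));

    /* Outside the 2-second window both variants reject it. */
    assert(!pidfile_matches(990, 4321, 1000, 1234, true));
}
```

In this model the only case that differs between the two platforms is a
pre-existing pidfile whose start time falls inside the 2-second slop, which
matches the situation described in the analysis.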

# Appendix - how I found it

I found it while investigating the failure. In the test, "pg_upgrade --check"
is executed just after the old cluster has been started. I checked the output
file [2] and found that the banner says "Performing Consistency Checks", which
means that the parameter live_check was set to false (see
output_check_banner()). This parameter is set to true when the postmaster is
already running at that time and pg_ctl start fails. That is how I found it.

[1]: https://cirrus-ci.com/task/4634769732927488
[2]:
https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
