pg_ctl start may return 0 even if the postmaster has been already started on Windows

From: Hayato Kuroda (Fujitsu)
Msg-id: TYAPR01MB586654E2D74B838021BE77CAF5EEA@TYAPR01MB5866.jpnprd01.prod.outlook.com
Replies: Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows  (Michael Paquier <michael@paquier.xyz>)
List: pgsql-hackers

Dear hackers,

While investigating a cfbot failure [1], I found strange behavior in the pg_ctl
command. What do you think? Is this a bug to be fixed, or is it expected behavior?

# Problem

The "pg_ctl start" command returns 0 (success) even if the cluster has
already been started. This occurs on Windows, when the command is executed
just after the postmaster has started.


# Analysis

The primary cause is in wait_for_postmaster_start(). This function reads the
postmaster.pid file and checks whether the start command completed
successfully.

Check (1) requires that the postmaster was started no earlier than our pg_ctl
command, although up to 2 seconds of slop is allowed.

On non-Windows platforms, check (2) is also performed to ensure that the pid
file was written by the process that pg_ctl forked, so this time window is not
a real problem. On Windows, however, check (2) is skipped, *so pg_ctl may
report success if an existing postmaster was started within the 2 seconds
before pg_ctl ran.*

```
        if ((optlines = readfile(pid_file, &numlines)) != NULL &&
            numlines >= LOCK_FILE_LINE_PM_STATUS)
        {
            /* File is complete enough for us, parse it */
            pid_t        pmpid;
            time_t        pmstart;

            /*
             * Make sanity checks.  If it's for the wrong PID, or the recorded
             * start time is before pg_ctl started, then either we are looking
             * at the wrong data directory, or this is a pre-existing pidfile
             * that hasn't (yet?) been overwritten by our child postmaster.
             * Allow 2 seconds slop for possible cross-process clock skew.
             */
            pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
            pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
            if (pmstart >= start_time - 2 && // (1)
#ifndef WIN32
                pmpid == pm_pid // (2)
#else
            /* Windows can only reject standalone-backend PIDs */
                pmpid > 0
#endif

```
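
To make the time window concrete, here is a minimal sketch of the two checks as a standalone function. The name pidfile_accepted() and the is_windows flag are hypothetical and do not exist in pg_ctl.c. The flag stands in for the #ifndef WIN32 branch so both behaviors can be compared on any platform, and PIDs are plain long values to keep the sketch portable.

```c
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical helper, not part of pg_ctl.c: mirrors the sanity checks in
 * the excerpt above.
 *
 * pmstart / pmpid  : start time and PID read from postmaster.pid
 * start_time       : time at which our pg_ctl started
 * pm_pid           : PID of the process our pg_ctl launched
 * is_windows       : selects the WIN32 branch of the original #ifndef
 */
bool
pidfile_accepted(time_t pmstart, long pmpid,
                 time_t start_time, long pm_pid, bool is_windows)
{
    /* check (1): start time is not older than pg_ctl, with 2 seconds slop */
    bool recent_enough = (pmstart >= start_time - 2);

    /* check (2) on non-Windows; on Windows only "pmpid > 0" is tested */
    bool pid_ok = is_windows ? (pmpid > 0) : (pmpid == pm_pid);

    return recent_enough && pid_ok;
}
```

With a pre-existing postmaster that started one second before pg_ctl (for example pmstart = 999, start_time = 1000) and a recorded PID that differs from pm_pid, the non-Windows branch rejects the pid file while the Windows branch accepts it. That acceptance is the false success described above.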

# Appendix - how I found this

I found this while investigating the failure. In the test, "pg_upgrade --check"
is executed just after the old cluster has been started. I checked the output
file [2] and found that the banner says "Performing Consistency Checks", which
means that the parameter live_check was set to false (see
output_check_banner()). This parameter is set to true when the postmaster is
already running at that time and pg_ctl start fails. Since it was false here,
pg_ctl start must have reported success even though the postmaster was already
running.

[1]: https://cirrus-ci.com/task/4634769732927488
[2]:
https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



