Re: pg_upgrade failing for 200+ million Large Objects

From: Tom Lane
Subject: Re: pg_upgrade failing for 200+ million Large Objects
Date:
Msg-id: 2055911.1702258962@sss.pgh.pa.us
In reply to: Re: pg_upgrade failing for 200+ million Large Objects  ("Kumar, Sachin" <ssetiya@amazon.com>)
Responses: Re: pg_upgrade failing for 200+ million Large Objects  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: pg_upgrade failing for 200+ million Large Objects  ("Kumar, Sachin" <ssetiya@amazon.com>)
List: pgsql-hackers
I spent some time looking at the v7 patch.  I can't help feeling
that this is going off in the wrong direction, primarily for
these reasons:

* It focuses only on cutting the number of transactions needed
to restore a large number of blobs (large objects).  Certainly
that's a pain point, but it's not the only one of this sort.
If you have a lot of tables, restore will consume just as many
transactions as it would for a similar number of blobs --- probably
more, in fact, since we usually need more commands per table than
per blob.

* I'm not too thrilled with the (undocumented) rearrangements in
pg_dump.  I really don't like the idea of emitting a fundamentally
different TOC layout in binary-upgrade mode; that seems unmaintainably
bug-prone.  Plus, the XID-consumption problem is not really confined
to pg_upgrade.

What I think we actually ought to do is one of the alternatives
discussed upthread: teach pg_restore to be able to commit
every so often, without trying to provide the all-or-nothing
guarantees of --single-transaction mode.  This cuts its XID
consumption by whatever multiple "every so often" is, while
allowing us to limit the number of locks taken during any one
transaction.  It also seems a great deal safer than the idea
I floated of not taking locks at all during a binary upgrade;
plus, it has some usefulness with regular pg_restore that's not
under control of pg_upgrade.

So I had a go at coding that, and attached is the result.
It invents a --transaction-size option, and when that's active
it will COMMIT after every N TOC items.  (This seems simpler to
implement and less bug-prone than every-N-SQL-commands.)
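
To make that concrete, here's roughly what the text-mode output
would look like with --transaction-size=2; the CREATE commands are
made-up stand-ins for whatever TOC entries the archive actually
contains:

    pg_restore --transaction-size=2 -f restore.sql dump.custom

    -- restore.sql, schematically:
    BEGIN;
    CREATE TABLE public.t1 (a int);
    CREATE TABLE public.t2 (b int);
    COMMIT;
    BEGIN;
    CREATE TABLE public.t3 (c int);
    ...
    COMMIT;

DATABASE and DATABASE PROPERTIES entries are the exception: those
can't be run inside a transaction block, so the patch commits before
processing them and starts a fresh transaction afterwards.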

I had initially supposed that in a parallel restore we could
have child workers also commit after every N TOC items, but was
soon disabused of that idea.  After a worker processes a TOC
item, any dependent items (such as index builds) might get
dispatched to some other worker, which had better be able to
see the results of the first worker's step.  So at least in
this implementation, we disable the multi-command-per-COMMIT
behavior during the parallel part of the restore.  Maybe that
could be improved in future, but it seems like it'd add a
lot more complexity, and it wouldn't make life any better for
pg_upgrade (which doesn't use parallel pg_restore, and seems
unlikely to want to in future).

I've not spent a lot of effort on pg_upgrade changes here:
I just hard-wired it to select --transaction-size=1000.
Given the default lock table size of 64*100, that gives us
enough headroom for each TOC entry to take half a dozen locks.
We could go higher than that by making pg_upgrade force the
destination postmaster to create a larger-than-default lock
table, but I'm not sure if it's worth any trouble.  We've
already bought three orders of magnitude improvement as it
stands, which seems like enough ambition for today.  (Also,
having pg_upgrade override the user's settings in the
destination cluster might not be without downsides.)
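
(Spelling out that headroom math, under the stock defaults and
ignoring max_prepared_transactions, which defaults to zero: the
shared lock table has about

    max_locks_per_transaction * max_connections = 64 * 100 = 6400

slots, while a worst-case transaction here would need about

    1000 TOC entries * ~6 locks per entry = ~6000

of them, leaving a bit of slack for incidental locks.)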

Another thing I'm wondering about is why this is only a pg_restore
option not also a pg_dump/pg_dumpall option.  I did it like that
because --single-transaction is pg_restore only, but that seems more
like an oversight or laziness than a well-considered decision.
Maybe we should back-fill that omission; but it could be done later.

Thoughts?

            regards, tom lane

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index 1a23874da6..2e3ba80258 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -786,6 +786,30 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>

+     <varlistentry>
+      <term><option>--transaction-size=<replaceable class="parameter">N</replaceable></option></term>
+      <listitem>
+       <para>
+        Execute the restore as a series of transactions, each processing
+        up to <replaceable class="parameter">N</replaceable> database
+        objects.  This option implies <option>--exit-on-error</option>.
+       </para>
+       <para>
+        <option>--transaction-size</option> offers an intermediate choice
+        between the default behavior (one transaction per SQL command)
+        and <option>-1</option>/<option>--single-transaction</option>
+        (one transaction for all restored objects).
+        While <option>--single-transaction</option> has the least
+        overhead, it may be impractical for large databases because the
+        transaction will take a lock on each restored object, possibly
+        exhausting the server's lock table space.
+        Using <option>--transaction-size</option> with a size of a few
+        thousand objects offers nearly the same performance benefits while
+        capping the amount of lock table space needed.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--use-set-session-authorization</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 9ef2f2017e..fbf5f1c515 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -149,7 +149,9 @@ typedef struct _restoreOptions
                                                  * compression */
     int            suppressDumpWarnings;    /* Suppress output of WARNING entries
                                          * to stderr */
-    bool        single_txn;
+
+    bool        single_txn;        /* restore all TOCs in one transaction */
+    int            txn_size;        /* restore this many TOCs per txn, if > 0 */

     bool       *idWanted;        /* array showing which dump IDs to emit */
     int            enable_row_security;
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..600482c93c 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -502,7 +502,28 @@ RestoreArchive(Archive *AHX)
             /* Otherwise, drop anything that's selected and has a dropStmt */
             if (((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0) && te->dropStmt)
             {
+                bool        not_allowed_in_txn = false;
+
                 pg_log_info("dropping %s %s", te->desc, te->tag);
+
+                /*
+                 * In --transaction-size mode, we have to temporarily exit our
+                 * transaction block to drop objects that can't be dropped
+                 * within a transaction.
+                 */
+                if (ropt->txn_size > 0)
+                {
+                    if (strcmp(te->desc, "DATABASE") == 0 ||
+                        strcmp(te->desc, "DATABASE PROPERTIES") == 0)
+                    {
+                        not_allowed_in_txn = true;
+                        if (AH->connection)
+                            CommitTransaction(AHX);
+                        else
+                            ahprintf(AH, "COMMIT;\n");
+                    }
+                }
+
                 /* Select owner and schema as necessary */
                 _becomeOwner(AH, te);
                 _selectOutputSchema(AH, te->namespace);
@@ -615,6 +636,33 @@ RestoreArchive(Archive *AHX)
                         }
                     }
                 }
+
+                /*
+                 * In --transaction-size mode, re-establish the transaction
+                 * block if needed; otherwise, commit after every N drops.
+                 */
+                if (ropt->txn_size > 0)
+                {
+                    if (not_allowed_in_txn)
+                    {
+                        if (AH->connection)
+                            StartTransaction(AHX);
+                        else
+                            ahprintf(AH, "BEGIN;\n");
+                        AH->txnCount = 0;
+                    }
+                    else if (++AH->txnCount >= ropt->txn_size)
+                    {
+                        if (AH->connection)
+                        {
+                            CommitTransaction(AHX);
+                            StartTransaction(AHX);
+                        }
+                        else
+                            ahprintf(AH, "COMMIT;\nBEGIN;\n");
+                        AH->txnCount = 0;
+                    }
+                }
             }
         }

@@ -711,7 +759,11 @@ RestoreArchive(Archive *AHX)
         }
     }

-    if (ropt->single_txn)
+    /*
+     * Close out any persistent transaction we may have.  While these two
+     * cases are started in different places, we can end both cases here.
+     */
+    if (ropt->single_txn || ropt->txn_size > 0)
     {
         if (AH->connection)
             CommitTransaction(AHX);
@@ -772,6 +824,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
      */
     if ((reqs & REQ_SCHEMA) != 0)
     {
+        bool        object_is_db = false;
+
+        /*
+         * In --transaction-size mode, must exit our transaction block to
+         * create a database or set its properties.
+         */
+        if (strcmp(te->desc, "DATABASE") == 0 ||
+            strcmp(te->desc, "DATABASE PROPERTIES") == 0)
+        {
+            object_is_db = true;
+            if (ropt->txn_size > 0)
+            {
+                if (AH->connection)
+                    CommitTransaction(&AH->public);
+                else
+                    ahprintf(AH, "COMMIT;\n\n");
+            }
+        }
+
         /* Show namespace in log message if available */
         if (te->namespace)
             pg_log_info("creating %s \"%s.%s\"",
@@ -822,10 +893,10 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
         /*
          * If we created a DB, connect to it.  Also, if we changed DB
          * properties, reconnect to ensure that relevant GUC settings are
-         * applied to our session.
+         * applied to our session.  (That also restarts the transaction block
+         * in --transaction-size mode.)
          */
-        if (strcmp(te->desc, "DATABASE") == 0 ||
-            strcmp(te->desc, "DATABASE PROPERTIES") == 0)
+        if (object_is_db)
         {
             pg_log_info("connecting to new database \"%s\"", te->tag);
             _reconnectToDB(AH, te->tag);
@@ -951,6 +1022,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel)
         }
     }

+    /*
+     * If we emitted anything for this TOC entry, that counts as one action
+     * against the transaction-size limit.  Commit if it's time to.
+     */
+    if ((reqs & (REQ_SCHEMA | REQ_DATA)) != 0 && ropt->txn_size > 0)
+    {
+        if (++AH->txnCount >= ropt->txn_size)
+        {
+            if (AH->connection)
+            {
+                CommitTransaction(&AH->public);
+                StartTransaction(&AH->public);
+            }
+            else
+                ahprintf(AH, "COMMIT;\nBEGIN;\n\n");
+            AH->txnCount = 0;
+        }
+    }
+
     if (AH->public.n_errors > 0 && status == WORKER_OK)
         status = WORKER_IGNORED_ERRORS;

@@ -1297,7 +1387,12 @@ StartRestoreLOs(ArchiveHandle *AH)
 {
     RestoreOptions *ropt = AH->public.ropt;

-    if (!ropt->single_txn)
+    /*
+     * LOs must be restored within a transaction block, since we need the LO
+     * handle to stay open while we write it.  Establish a transaction unless
+     * there's one being used globally.
+     */
+    if (!(ropt->single_txn || ropt->txn_size > 0))
     {
         if (AH->connection)
             StartTransaction(&AH->public);
@@ -1316,7 +1411,7 @@ EndRestoreLOs(ArchiveHandle *AH)
 {
     RestoreOptions *ropt = AH->public.ropt;

-    if (!ropt->single_txn)
+    if (!(ropt->single_txn || ropt->txn_size > 0))
     {
         if (AH->connection)
             CommitTransaction(&AH->public);
@@ -3149,6 +3244,19 @@ _doSetFixedOutputState(ArchiveHandle *AH)
     else
         ahprintf(AH, "SET row_security = off;\n");

+    /*
+     * In --transaction-size mode, we should always be in a transaction when
+     * we begin to restore objects.
+     */
+    if (ropt && ropt->txn_size > 0)
+    {
+        if (AH->connection)
+            StartTransaction(&AH->public);
+        else
+            ahprintf(AH, "\nBEGIN;\n");
+        AH->txnCount = 0;
+    }
+
     ahprintf(AH, "\n");
 }

@@ -3991,6 +4099,14 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list)
         }
     }

+    /*
+     * In --transaction-size mode, we must commit the open transaction before
+     * dropping the database connection.  This also ensures that child workers
+     * can see the objects we've created so far.
+     */
+    if (AH->public.ropt->txn_size > 0)
+        CommitTransaction(&AH->public);
+
     /*
      * Now close parent connection in prep for parallel steps.  We do this
      * mainly to ensure that we don't exceed the specified number of parallel
@@ -4730,6 +4846,10 @@ CloneArchive(ArchiveHandle *AH)
     clone = (ArchiveHandle *) pg_malloc(sizeof(ArchiveHandle));
     memcpy(clone, AH, sizeof(ArchiveHandle));

+    /* Likewise flat-copy the RestoreOptions, so we can alter them locally */
+    clone->public.ropt = (RestoreOptions *) pg_malloc(sizeof(RestoreOptions));
+    memcpy(clone->public.ropt, AH->public.ropt, sizeof(RestoreOptions));
+
     /* Handle format-independent fields */
     memset(&(clone->sqlparse), 0, sizeof(clone->sqlparse));

@@ -4748,6 +4868,13 @@ CloneArchive(ArchiveHandle *AH)
     /* clone has its own error count, too */
     clone->public.n_errors = 0;

+    /*
+     * Clone connections disregard --transaction-size; they must commit after
+     * each command so that the results are immediately visible to other
+     * workers.
+     */
+    clone->public.ropt->txn_size = 0;
+
     /*
      * Connect our new clone object to the database, using the same connection
      * parameters used for the original connection.
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 917283fd34..c21fdfe596 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -322,6 +322,9 @@ struct _archiveHandle
     char       *currTablespace; /* current tablespace, or NULL */
     char       *currTableAm;    /* current table access method, or NULL */

+    /* in --transaction-size mode, this counts objects emitted in cur xact */
+    int            txnCount;
+
     void       *lo_buf;
     size_t        lo_buf_used;
     size_t        lo_buf_size;
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index c3beacdec1..5ea78cf7cc 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -120,6 +120,7 @@ main(int argc, char **argv)
         {"role", required_argument, NULL, 2},
         {"section", required_argument, NULL, 3},
         {"strict-names", no_argument, &strict_names, 1},
+        {"transaction-size", required_argument, NULL, 5},
         {"use-set-session-authorization", no_argument, &use_setsessauth, 1},
         {"no-comments", no_argument, &no_comments, 1},
         {"no-publications", no_argument, &no_publications, 1},
@@ -289,10 +290,18 @@ main(int argc, char **argv)
                 set_dump_section(optarg, &(opts->dumpSections));
                 break;

-            case 4:
+            case 4:                /* filter */
                 read_restore_filters(optarg, opts);
                 break;

+            case 5:                /* transaction-size */
+                if (!option_parse_int(optarg, "--transaction-size",
+                                      1, INT_MAX,
+                                      &opts->txn_size))
+                    exit(1);
+                opts->exit_on_error = true;
+                break;
+
             default:
                 /* getopt_long already emitted a complaint */
                 pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -337,6 +346,9 @@ main(int argc, char **argv)
     if (opts->dataOnly && opts->dropSchema)
         pg_fatal("options -c/--clean and -a/--data-only cannot be used together");

+    if (opts->single_txn && opts->txn_size > 0)
+        pg_fatal("options -1/--single-transaction and --transaction-size cannot be used together");
+
     /*
      * -C is not compatible with -1, because we can't create a database inside
      * a transaction block.
@@ -484,6 +496,7 @@ usage(const char *progname)
     printf(_("  --section=SECTION            restore named section (pre-data, data, or post-data)\n"));
     printf(_("  --strict-names               require table and/or schema include patterns to\n"
              "                               match at least one entity each\n"));
+    printf(_("  --transaction-size=N         commit after every N objects\n"));
     printf(_("  --use-set-session-authorization\n"
              "                               use SET SESSION AUTHORIZATION commands instead of\n"
              "                               ALTER OWNER commands to set ownership\n"));
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 3960af4036..5cfd2282e1 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -548,6 +548,7 @@ create_new_objects(void)
                   true,
                   true,
                   "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+                  "--transaction-size=1000 "
                   "--dbname postgres \"%s/%s\"",
                   new_cluster.bindir,
                   cluster_conn_opts(&new_cluster),
@@ -586,6 +587,7 @@ create_new_objects(void)
         parallel_exec_prog(log_file_name,
                            NULL,
                            "\"%s/pg_restore\" %s %s --exit-on-error --verbose "
+                           "--transaction-size=1000 "
                            "--dbname template1 \"%s/%s\"",
                            new_cluster.bindir,
                            cluster_conn_opts(&new_cluster),
