Обсуждение: CREATE DATABASE with filesystem cloning

Поиск

Список

Период

Сортировка

CREATE DATABASE with filesystem cloning

От

Thomas Munro

Дата:

07 октября 2023 г., 08:51:45

Hello hackers,

Here is an experimental POC of fast/cheap database cloning.  For
clones from little template databases, no one cares much, but it might
be useful to be able to create a snapshot or fork of very large
database for testing/experimentation like this:

  create database foodb_snapshot20231007 template=foodb strategy=file_clone

It should be a lot faster, and use less physical disk, than the two
existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS,
APFS (= macOS), and it could in theory be extended to other systems
that invented different system calls for this with more work (Solaris,
Windows).  Then extra physical disk space will be consumed only as the
two clones diverge.

It's just like the old strategy=file_copy, except it asks the OS to do
its best copying trick.  If you try it on a system that doesn't
support copy-on-write, then copy_file_range() should fall back to
plain old copy, but it might still be better than we could do, as it
can push copy commands to network storage or physical storage.

Therefore, the usual caveats from strategy=file_copy also apply here.
Namely that it has to perform checkpoints which could be very
expensive, and there are some quirks/brokenness about concurrent
backups and PITR.  Which makes me wonder if it's worth pursuing this
idea.  Thoughts?

I tested on bleeding edge FreeBSD/ZFS, where you need to set sysctl
vfs.zfs.bclone_enabled=1 to enable the optimisation, as it's still a
very new feature that is still being rolled out.  The system call
succeeds either way, but that controls whether the new database
initially shares blocks on disk, or get new copies.  I also tested on
a Mac.  In both cases I could clone large databases in a fraction of a
second.

Вложения

0001-WIP-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch

Re: CREATE DATABASE with filesystem cloning

От

Thomas Munro

Дата:

08 октября 2023 г., 01:42:20

One thing I forgot to mention about this experiment: when used in a
backend I think this probably needs a chunk size (what size?) and a
CFI().  Which is a bit annoying, because for best results we want
SSIZE_MAX, but in the case that copy_file_range() falls back to raw
copy, that'd do I/O work bounded only by segment size :-/  (That's not
a problem that comes up in the similar patch for pg_upgrade.)

Re: CREATE DATABASE with filesystem cloning

От

Andrew Dunstan

Дата:

08 октября 2023 г., 16:20:32

On 2023-10-07 Sa 01:51, Thomas Munro wrote:
> Hello hackers,
>
> Here is an experimental POC of fast/cheap database cloning.  For
> clones from little template databases, no one cares much, but it might
> be useful to be able to create a snapshot or fork of very large
> database for testing/experimentation like this:
>
>    create database foodb_snapshot20231007 template=foodb strategy=file_clone
>
> It should be a lot faster, and use less physical disk, than the two
> existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS,
> APFS (= macOS), and it could in theory be extended to other systems
> that invented different system calls for this with more work (Solaris,
> Windows).  Then extra physical disk space will be consumed only as the
> two clones diverge.
>
> It's just like the old strategy=file_copy, except it asks the OS to do
> its best copying trick.  If you try it on a system that doesn't
> support copy-on-write, then copy_file_range() should fall back to
> plain old copy, but it might still be better than we could do, as it
> can push copy commands to network storage or physical storage.
>
> Therefore, the usual caveats from strategy=file_copy also apply here.
> Namely that it has to perform checkpoints which could be very
> expensive, and there are some quirks/brokenness about concurrent
> backups and PITR.  Which makes me wonder if it's worth pursuing this
> idea.  Thoughts?
>
> I tested on bleeding edge FreeBSD/ZFS, where you need to set sysctl
> vfs.zfs.bclone_enabled=1 to enable the optimisation, as it's still a
> very new feature that is still being rolled out.  The system call
> succeeds either way, but that controls whether the new database
> initially shares blocks on disk, or get new copies.  I also tested on
> a Mac.  In both cases I could clone large databases in a fraction of a
> second.


I've had to disable COW on my BTRFS-resident buildfarm animals (see 
previous discussion re Direct I/O).


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

Thomas Munro

Дата:

09 октября 2023 г., 02:19:30

On Mon, Oct 9, 2023 at 2:20 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I've had to disable COW on my BTRFS-resident buildfarm animals (see
> previous discussion re Direct I/O).

Right, because it is still buggy[1].  I don't see any sign that a fix
has been committed yet, assuming that is the right thing (and it sure
sounds like it).  It means you still have to disable COW to run the
004_io_direct.pl test for now, but that's an independent thing due
hopefully to be fixed soon, and you can still run PostgreSQL just fine
with COW enabled as it is by default as long as you don't turn on
debug_io_direct (which isn't for users yet anyway).

Since I hadn't actually tried out this cloning stuff out on
Linux/btrfs before and was claiming that it should work, I took it for
a quick unscientific spin (literally, this is on a spinning SATA disk
for extra crunchy slowness...).  I created a scale 500 pgbench
database, saw that du -h showed 7.4G, and got these times:

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 124019.885 ms (02:04.020)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 8618.195 ms (00:08.618)

That's something, but not as good as I was expecting, so let's also
try Linux/XFS for reference on the same spinny rust...  One thing I
learned is that if you have an existing XFS partition, it might have
been created without reflinks enabled (see output of xfs_info) as that
was the default not very long ago and it's not changeable later, so on
the box I'm writing from I had to do a fresh mkfs.xfs to see any
benefit from this.

postgres=# create database foodb_copy template=foodb strategy=file_copy;
CREATE DATABASE
Time: 49157.876 ms (00:49.158)
postgres=# create database foodb_clone template=foodb strategy=file_clone;
CREATE DATABASE
Time: 1026.455 ms (00:01.026)

Not bad.  To understand what that did, we can check which physical
blocks on disk hold the first segment of the pgbench_accounts table in
foodb and foodb_clone:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16384/16400
/mnt/xfs/pgdata/base/16384/16400:
    0: [0..1637439]: 977586048..979223487
    1: [1637440..2097151]: 1464966136..1465425847
$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
    0: [0..1637439]: 977586048..979223487
    1: [1637440..2097151]: 1464966136..1465425847

The same blocks.  Now let's update a tuple on the second page of
pgbench_accounts in the clone:

foodb=# update pgbench_accounts set bid = bid + 1 where ctid = '(1, 1)';
UPDATE 1
foodb=# checkpoint;
CHECKPOINT

Now some new physical disk blocks have been allocated just for that
page, but the rest are still clones:

$ sudo xfs_bmap /mnt/xfs/pgdata/base/16419/16400
/mnt/xfs/pgdata/base/16419/16400:
    0: [0..15]: 977586048..977586063
    1: [16..31]: 977586064..977586079
    2: [32..1637439]: 977586080..979223487
    3: [1637440..2097151]: 1464966136..1465425847

I tried changing it to work in 1MB chunks and add the CFI() (v2
attached), and it didn't affect the time measurably and also didn't
generate any extra extents as displayed by xfs_bmap, so the end result
is the same.  I haven't looked into the chunked version on the other
file systems yet.

I don't have the numbers to hand (different machines far from me right
now) but FreeBSD/ZFS and macOS/APFS were on the order of a few hundred
milliseconds for the same scale of pgbench on laptop storage (so not
comparable with the above).

I also tried a -s 5000 database, and saw that XFS could clone a 74GB
database just as fast as the 7.4GB database (still ~1s).  At a guess,
this is going to scale not so much by total data size, but more by
things like number of relations, segment size and internal (invisible)
fragmentation due to previous cloning/update history in
filesystem-dependent ways, since those are the things that generate
extents (contiguous ranges of physical blocks to be referenced by the
new file).

[1] https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c.1676684984.git.boris@bur.io/T/#u

Вложения

v2-0001-WIP-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch

Re: CREATE DATABASE with filesystem cloning

От

Andres Freund

Дата:

10 октября 2023 г., 02:48:27

Hi,

On 2023-10-07 18:51:45 +1300, Thomas Munro wrote:
> It should be a lot faster, and use less physical disk, than the two
> existing strategies on recent-ish XFS, BTRFS, very recent OpenZFS,
> APFS (= macOS), and it could in theory be extended to other systems
> that invented different system calls for this with more work (Solaris,
> Windows).  Then extra physical disk space will be consumed only as the
> two clones diverge.

> It's just like the old strategy=file_copy, except it asks the OS to do
> its best copying trick.  If you try it on a system that doesn't
> support copy-on-write, then copy_file_range() should fall back to
> plain old copy, but it might still be better than we could do, as it
> can push copy commands to network storage or physical storage.
> 
> Therefore, the usual caveats from strategy=file_copy also apply here.
> Namely that it has to perform checkpoints which could be very
> expensive, and there are some quirks/brokenness about concurrent
> backups and PITR.  Which makes me wonder if it's worth pursuing this
> idea.  Thoughts?

I think it'd be interesting to have. For the regression tests we do end up
spending a lot of disk throughput on contents duplicated between
template0/template1/postgres.  And I've plenty of time spent time copying huge
template databases, to have a reproducible starting point for some benchmark
that's expensive to initialize.

If we do this, I think we should consider creating template0, template1 with
the new strategy, so that a new initdb cluster ends up with deduplicated data.

FWIW, I experimented with using cp -c on macos for the initdb template, and
that provided some further gain. I suspect that that gain would increase if
template0/template1/postgres were deduplicated.

> diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c
> index e04bc3941a..8c963ff548 100644
> --- a/src/backend/storage/file/copydir.c
> +++ b/src/backend/storage/file/copydir.c
> @@ -19,14 +19,21 @@
>  #include "postgres.h"
>  
>  #include <fcntl.h>
> +#include <limits.h>
>  #include <unistd.h>
>  
> +#ifdef HAVE_COPYFILE_H
> +#include <copyfile.h>
> +#endif

We already have code around this in src/bin/pg_upgrade/file.c, seems we ought
to move it somewhere in src/port?

Greetings,

Andres Freund

Re: CREATE DATABASE with filesystem cloning

От

Peter Eisentraut

Дата:

11 октября 2023 г., 09:40:19

On 07.10.23 07:51, Thomas Munro wrote:
> Here is an experimental POC of fast/cheap database cloning.

Here are some previous discussions of this:

https://www.postgresql.org/message-id/flat/20131001223108.GG23410%40saarenmaa.fi

https://www.postgresql.org/message-id/flat/511B5D11.4040507%40socialserve.com

https://www.postgresql.org/message-id/flat/bc9ca382-b98d-0446-f699-8c5de2307ca7%402ndquadrant.com

(I don't see any clear conclusions in any of these threads, but it might 
be good to check them in any case.)

Re: CREATE DATABASE with filesystem cloning

От

Robert Haas

Дата:

11 октября 2023 г., 15:35:46

On Mon, Oct 9, 2023 at 7:49 PM Andres Freund <andres@anarazel.de> wrote:
> If we do this, I think we should consider creating template0, template1 with
> the new strategy, so that a new initdb cluster ends up with deduplicated data.

Seems a little questionable given the reports earlier in the thread
about some filesystems still having bugs in this functionality.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

"Dan Langille"

Дата:

17 октября 2023 г., 05:35:02

On Sat, Oct 7, 2023, at 1:51 AM, Thomas Munro wrote:

> I tested on bleeding edge FreeBSD/ZFS, where you need to set sysctl
> vfs.zfs.bclone_enabled=1 to enable the optimisation, as it's still a
> very new feature that is still being rolled out.  The system call
> succeeds either way, but that controls whether the new database
> initially shares blocks on disk, or get new copies.  I also tested on
> a Mac.  In both cases I could clone large databases in a fraction of a
> second.

WOOT! Looking forward to this. It would greatly improve times needed
when deploying test environments.

Thank you.

-- 
  Dan Langille
  dan@langille.org

Re: CREATE DATABASE with filesystem cloning

От

Thomas Munro

Дата:

06 марта, 05:16:38

On Wed, Oct 11, 2023 at 7:40 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 07.10.23 07:51, Thomas Munro wrote:
> > Here is an experimental POC of fast/cheap database cloning.
>
> Here are some previous discussions of this:
>
> https://www.postgresql.org/message-id/flat/20131001223108.GG23410%40saarenmaa.fi
>
> https://www.postgresql.org/message-id/flat/511B5D11.4040507%40socialserve.com
>
> https://www.postgresql.org/message-id/flat/bc9ca382-b98d-0446-f699-8c5de2307ca7%402ndquadrant.com
>
> (I don't see any clear conclusions in any of these threads, but it might
> be good to check them in any case.)

Thanks.  Wow, quite a lot of people have written an experimental patch
like this.  I would say the things that changed since those ones are:

* copy_file_range() became the preferred way to do this on Linux AFAIK
(instead of various raw ioctls)
* FreeBSD adopted Linux's copy_file_range()
* Open ZFS 2.2 implemented range-based cloning
* XFS enabled reflink support by default
* Apple invented ApFS with cloning
* Several OSes adopted XFS, BTRFS, ZFS, ApFS by default
* copy_file_range() went in the direction of not revealing how the
copying is done (no flags to force behaviour)

Here's a rebase.

The main thing that is missing is support for redo.  It's mostly
trivial I think, probably just a record type for "try cloning first"
and then teaching that clone function to fall back to the regular copy
path if it fails in recovery, do you agree with that idea?  Another
approach would be to let it fail if it doesn't work on the replica, so
you don't finish up using dramatically different amounts of disk
space, but that seems terrible because now your replica is broken.  So
probably fallback with logged warning (?), though I'm not sure exactly
which errnos to give that treatment to.

One thing to highlight about COW file system semantics: PostgreSQL
behaves differently when space runs out.  When writing relation data,
eg ZFS sometimes fails like bullet point 2 in this ENOSPC article[1],
while XFS usually fails like bullet point 1.  A database on XFS that
has been cloned in this way might presumably start to fail like bullet
point 2, eg when checkpointing dirty pages, instead of its usual
extension-time-only ENOSPC-rolls-back-your-transaction behaviour.

[1] https://wiki.postgresql.org/wiki/ENOSPC

Вложения

v3-0001-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch

Re: CREATE DATABASE with filesystem cloning

От

Thomas Munro

Дата:

06 марта, 06:00:03

On Wed, Mar 6, 2024 at 3:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a rebase.

Now with a wait event and a paragraph of documentation.

Вложения

v4-0001-CREATE-DATABASE-.-STRATEGY-FILE_CLONE.patch

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

07 мая, 15:00:11

Hi,

On Wed, 6 Mar 2024 at 05:17, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> The main thing that is missing is support for redo.  It's mostly
> trivial I think, probably just a record type for "try cloning first"
> and then teaching that clone function to fall back to the regular copy
> path if it fails in recovery, do you agree with that idea?  Another
> approach would be to let it fail if it doesn't work on the replica, so
> you don't finish up using dramatically different amounts of disk
> space, but that seems terrible because now your replica is broken.  So
> probably fallback with logged warning (?), though I'm not sure exactly
> which errnos to give that treatment to.

We had an off-list talk with Thomas and we thought making this option
GUC instead of SQL command level could solve this problem.

I am posting a new rebased version of the patch with some important changes:

* 'createdb_file_copy_method' GUC is created. Possible values are
'copy' and 'clone'. Copy is the default option. Clone option can be
chosen if the system supports it, otherwise it gives error at the
startup.

* #else part of the clone_file() function calls pg_unreachable()
instead of ereport().

* Documentation updates.

Also, what should happen when the kernel has clone support but the
file system does not?

- I tested this on Linux and copy_file_range() silently uses normal
copy when this happens. I assume the same thing will happen for
FreeBSD because it uses the copy_file_range() function as well.

- I am not sure about MacOS since the force flag
(COPYFILE_CLONE_FORCE) is used. I do not have MacOS so I can not test
it but I assume it will error out when this happens. If that is the
case, is this a problem? I am asking that since this is a GUC now, the
user will have the full responsibility.

Any kind of feedback would be appreciated.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

v5-0001-Use-CLONE-method-with-GUC-on-CREATE-DATABASE-.-ST.patch

Re: CREATE DATABASE with filesystem cloning

От

Ranier Vilela

Дата:

07 мая, 16:28:19

Hi,

Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com> wrote:

>Any kind of feedback would be appreciated.

The flags:
O_WRONLY | O_TRUNC

Allow the OS to make some optimizations, if you deem it possible.

The flag O_RDWR indicates that the file can be read, which is not true in this case.

The destination file will just be written.

best regards,

Ranier Vilela

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

08 мая, 10:37:20

Hi Ranier,

Thanks for looking into this!

I am not sure why but your reply does not show up in the thread, so I
copied your reply and answered it in the thread for visibility.

On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> I know it's coming from copy-and-paste, but
> I believe the flags could be:
> - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>
> The flags:
> O_WRONLY | O_TRUNC
>
> Allow the OS to make some optimizations, if you deem it possible.
>
> The flag O_RDWR indicates that the file can be read, which is not true in this case.
> The destination file will just be written.

You may be right about the O_WRONLY flag but I am not sure about the
O_TRUNC flag.

O_TRUNC from the linux man page [1]: If the file already exists and is
a regular file and the access mode allows writing (i.e., is O_RDWR or
O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
terminal device file, the O_TRUNC flag is ignored. Otherwise, the
effect of O_TRUNC is unspecified.

But it should be already checked if the destination directory already
exists in dbcommands.c file in createdb() function [2], so this should
not happen.

[1] https://man7.org/linux/man-pages/man2/open.2.html

[2]
    /*
     * If database OID is configured, check if the OID is already in use or
     * data directory already exists.
     */
    if (OidIsValid(dboid))
    {
        char       *existing_dbname = get_database_name(dboid);

        if (existing_dbname != NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
                    errmsg("database OID %u is already in use by
database \"%s\"",
                           dboid, existing_dbname));

        if (check_db_file_conflict(dboid))
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE)),
                    errmsg("data directory with the specified OID %u
already exists", dboid));
    }
    else
    {
        /* Select an OID for the new database if is not explicitly
configured. */
        do
        {
            dboid = GetNewOidWithIndex(pg_database_rel, DatabaseOidIndexId,
                                       Anum_pg_database_oid);
        } while (check_db_file_conflict(dboid));
    }

-- 
Regards,
Nazir Bilal Yavuz
Microsoft

Re: CREATE DATABASE with filesystem cloning

От

Ranier Vilela

Дата:

08 мая, 14:16:18

Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:

Hi Ranier,

Thanks for looking into this!

I am not sure why but your reply does not show up in the thread, so I
copied your reply and answered it in the thread for visibility.

On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> I know it's coming from copy-and-paste, but
> I believe the flags could be:
> - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>
> The flags:
> O_WRONLY | O_TRUNC
>
> Allow the OS to make some optimizations, if you deem it possible.
>
> The flag O_RDWR indicates that the file can be read, which is not true in this case.
> The destination file will just be written.

You may be right about the O_WRONLY flag but I am not sure about the
O_TRUNC flag.

O_TRUNC from the linux man page [1]: If the file already exists and is
a regular file and the access mode allows writing (i.e., is O_RDWR or
O_WRONLY) it will be truncated to length 0. If the file is a FIFO or
terminal device file, the O_TRUNC flag is ignored. Otherwise, the
effect of O_TRUNC is unspecified.

O_TRUNC is usually used in conjunction with O_WRONLY.

See the example at:

open.html

O_TRUNC signals the OS to forget the current contents of the file, if it happens to exist.
In other words, there will be no seeks, only and exclusively writes.

But it should be already checked if the destination directory already
exists in dbcommands.c file in createdb() function [2], so this should
not happen.

I'm not sure what you're referring to here.

best regards,

Ranier Vilela

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

08 мая, 14:42:27

Hi,

On Wed, 8 May 2024 at 14:16, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>
> Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>>
>> Hi Ranier,
>>
>> Thanks for looking into this!
>>
>> I am not sure why but your reply does not show up in the thread, so I
>> copied your reply and answered it in the thread for visibility.
>>
>> On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >
>> > I know it's coming from copy-and-paste, but
>> > I believe the flags could be:
>> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>> >
>> > The flags:
>> > O_WRONLY | O_TRUNC
>> >
>> > Allow the OS to make some optimizations, if you deem it possible.
>> >
>> > The flag O_RDWR indicates that the file can be read, which is not true in this case.
>> > The destination file will just be written.
>>
>> You may be right about the O_WRONLY flag but I am not sure about the
>> O_TRUNC flag.
>>
>> O_TRUNC from the linux man page [1]: If the file already exists and is
>> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> effect of O_TRUNC is unspecified.
>
> O_TRUNC is usually used in conjunction with O_WRONLY.
> See the example at:
> open.html
>
> O_TRUNC signals the OS to forget the current contents of the file, if it happens to exist.
> In other words, there will be no seeks, only and exclusively writes.

You are right; the O_TRUNC flag truncates the file, if it happens to
exist. But it should not exist in the first place because the code at
[2] should check this before the OpenTransientFile() call. Even if we
assume somehow the check at [2] does not work, I do not think the
correct operation is to truncate the contents of the existing file.

>>
>> But it should be already checked if the destination directory already
>> exists in dbcommands.c file in createdb() function [2], so this should
>> not happen.
>
> I'm not sure what you're referring to here.

Sorry, I meant that the destination directory / file should not exist
because the code at [2] confirms it does not exist, otherwise it
errors out.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: CREATE DATABASE with filesystem cloning

От

Ranier Vilela

Дата:

08 мая, 15:23:00

Em qua., 8 de mai. de 2024 às 08:42, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:

Hi,

On Wed, 8 May 2024 at 14:16, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>
> Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>>
>> Hi Ranier,
>>
>> Thanks for looking into this!
>>
>> I am not sure why but your reply does not show up in the thread, so I
>> copied your reply and answered it in the thread for visibility.
>>
>> On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >
>> > I know it's coming from copy-and-paste, but
>> > I believe the flags could be:
>> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>> >
>> > The flags:
>> > O_WRONLY | O_TRUNC
>> >
>> > Allow the OS to make some optimizations, if you deem it possible.
>> >
>> > The flag O_RDWR indicates that the file can be read, which is not true in this case.
>> > The destination file will just be written.
>>
>> You may be right about the O_WRONLY flag but I am not sure about the
>> O_TRUNC flag.
>>
>> O_TRUNC from the linux man page [1]: If the file already exists and is
>> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> O_WRONLY) it will be truncated to length 0. If the file is a FIFO or
>> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> effect of O_TRUNC is unspecified.
>
> O_TRUNC is usually used in conjunction with O_WRONLY.
> See the example at:
> open.html
>
> O_TRUNC signals the OS to forget the current contents of the file, if it happens to exist.
> In other words, there will be no seeks, only and exclusively writes.

You are right; the O_TRUNC flag truncates the file, if it happens to
exist. But it should not exist in the first place because the code at
[2] should check this before the OpenTransientFile() call. Even if we
assume somehow the check at [2] does not work, I do not think the
correct operation is to truncate the contents of the existing file.

I don't know if there is a communication problem here.
But what I'm trying to suggest refers to the destination file,

which doesn't matter if it exists or not?

If the destination file does not exist, O_TRUNC is ignored.

If the destination file exists, O_TRUNC truncates the current contents of the file.

I don't see why you think it's a problem to truncate the current content if the destination file exists.
Isn't he going to be replaced anyway?

Unless we want to preserve the current content (destination file), in case the copy/clone fails?

best regards,

Ranier Vilela

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

08 мая, 16:06:45

Hi,

On Wed, 8 May 2024 at 15:23, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 8 de mai. de 2024 às 08:42, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>>
>> Hi,
>>
>> On Wed, 8 May 2024 at 14:16, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >
>> >
>> > Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>> >>
>> >> Hi Ranier,
>> >>
>> >> Thanks for looking into this!
>> >>
>> >> I am not sure why but your reply does not show up in the thread, so I
>> >> copied your reply and answered it in the thread for visibility.
>> >>
>> >> On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >> >
>> >> > I know it's coming from copy-and-paste, but
>> >> > I believe the flags could be:
>> >> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> >> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>> >> >
>> >> > The flags:
>> >> > O_WRONLY | O_TRUNC
>> >> >
>> >> > Allow the OS to make some optimizations, if you deem it possible.
>> >> >
>> >> > The flag O_RDWR indicates that the file can be read, which is not true in this case.
>> >> > The destination file will just be written.
>> >>
>> >> You may be right about the O_WRONLY flag but I am not sure about the
>> >> O_TRUNC flag.
>> >>
>> >> O_TRUNC from the linux man page [1]: If the file already exists and is
>> >> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> >> O_WRONLY) it will be truncated to length 0.  If the file is a FIFO  or
>> >> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> >> effect of O_TRUNC is unspecified.
>> >
>> > O_TRUNC is usually used in conjunction with O_WRONLY.
>> > See the example at:
>> > open.html
>> >
>> > O_TRUNC signals the OS to forget the current contents of the file, if it happens to exist.
>> > In other words, there will be no seeks, only and exclusively writes.
>>
>> You are right; the O_TRUNC flag truncates the file, if it happens to
>> exist. But it should not exist in the first place because the code at
>> [2] should check this before the OpenTransientFile() call. Even if we
>> assume somehow the check at [2] does not work, I do not think the
>> correct operation is to truncate the contents of the existing file.
>
> I don't know if there is a communication problem here.
> But what I'm trying to suggest refers to the destination file,
> which doesn't matter if it exists or not?

I do not think there is a communication problem. Actually it matters
because the destination file should not exist, there is a code [2]
which already checks and confirms that it does not exist.

>
> If the destination file does not exist, O_TRUNC is ignored.
> If the destination file exists, O_TRUNC truncates the current contents of the file.
> I don't see why you think it's a problem to truncate the current content if the destination file exists.
> Isn't he going to be replaced anyway?

'If the destination file does not exist' means the code at [2] works
well and we do not need the O_TRUNC flag.

'If the destination file already exists' means the code at [2] is
broken somehow and there is a high chance that we could truncate
something that we do not want to. For example, there is a foo db and
we want to create bar db, Postgres chose the foo db's location as the
destination of the bar db (which should not happen but let's assume
somehow checks at [2] failed), then we could wrongly truncate the foo
db's contents.

Hence, if Postgres works successfully I think the O_TRUNC flag is
unnecessary but if Postgres does not work successfully, the O_TRUNC
flag could cause harm.

>
> Unless we want to preserve the current content (destination file), in case the copy/clone fails?

Like I said above, Postgres should not choose the existing file as the
destination file.

Also, we have O_CREAT | O_EXCL flags together, from the link [3] you
shared before: If O_CREAT and O_EXCL are set, open() shall fail if the
file exists. So, overwriting the already existing file is already
prevented.

[3] https://pubs.opengroup.org/onlinepubs/009696699/functions/open.html

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: CREATE DATABASE with filesystem cloning

От

Ranier Vilela

Дата:

08 мая, 16:20:54

Em qua., 8 de mai. de 2024 às 10:06, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:

Hi,

On Wed, 8 May 2024 at 15:23, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em qua., 8 de mai. de 2024 às 08:42, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>>
>> Hi,
>>
>> On Wed, 8 May 2024 at 14:16, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >
>> >
>> > Em qua., 8 de mai. de 2024 às 04:37, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>> >>
>> >> Hi Ranier,
>> >>
>> >> Thanks for looking into this!
>> >>
>> >> I am not sure why but your reply does not show up in the thread, so I
>> >> copied your reply and answered it in the thread for visibility.
>> >>
>> >> On Tue, 7 May 2024 at 16:28, Ranier Vilela <ranier.vf@gmail.com> wrote:
>> >> >
>> >> > I know it's coming from copy-and-paste, but
>> >> > I believe the flags could be:
>> >> > - dstfd = OpenTransientFile(tofile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
>> >> > + dstfd = OpenTransientFile(tofile, O_CREAT | O_WRONLY | O_TRUNC | O_EXCL | PG_BINARY);
>> >> >
>> >> > The flags:
>> >> > O_WRONLY | O_TRUNC
>> >> >
>> >> > Allow the OS to make some optimizations, if you deem it possible.
>> >> >
>> >> > The flag O_RDWR indicates that the file can be read, which is not true in this case.
>> >> > The destination file will just be written.
>> >>
>> >> You may be right about the O_WRONLY flag but I am not sure about the
>> >> O_TRUNC flag.
>> >>
>> >> O_TRUNC from the linux man page [1]: If the file already exists and is
>> >> a regular file and the access mode allows writing (i.e., is O_RDWR or
>> >> O_WRONLY) it will be truncated to length 0. If the file is a FIFO or
>> >> terminal device file, the O_TRUNC flag is ignored. Otherwise, the
>> >> effect of O_TRUNC is unspecified.
>> >
>> > O_TRUNC is usually used in conjunction with O_WRONLY.
>> > See the example at:
>> > open.html
>> >
>> > O_TRUNC signals the OS to forget the current contents of the file, if it happens to exist.
>> > In other words, there will be no seeks, only and exclusively writes.
>>
>> You are right; the O_TRUNC flag truncates the file, if it happens to
>> exist. But it should not exist in the first place because the code at
>> [2] should check this before the OpenTransientFile() call. Even if we
>> assume somehow the check at [2] does not work, I do not think the
>> correct operation is to truncate the contents of the existing file.
>
> I don't know if there is a communication problem here.
> But what I'm trying to suggest refers to the destination file,
> which doesn't matter if it exists or not?

I do not think there is a communication problem. Actually it matters
because the destination file should not exist, there is a code [2]
which already checks and confirms that it does not exist.

I got it.

>
> If the destination file does not exist, O_TRUNC is ignored.
> If the destination file exists, O_TRUNC truncates the current contents of the file.
> I don't see why you think it's a problem to truncate the current content if the destination file exists.
> Isn't he going to be replaced anyway?

'If the destination file does not exist' means the code at [2] works
well and we do not need the O_TRUNC flag.

True, the O_TRUNC is ignored in this case.

'If the destination file already exists' means the code at [2] is
broken somehow and there is a high chance that we could truncate
something that we do not want to. For example, there is a foo db and
we want to create bar db, Postgres chose the foo db's location as the
destination of the bar db (which should not happen but let's assume
somehow checks at [2] failed), then we could wrongly truncate the foo
db's contents.

Of course, truncating the wrong file would be pretty bad.

Hence, if Postgres works successfully I think the O_TRUNC flag is
unnecessary but if Postgres does not work successfully, the O_TRUNC
flag could cause harm.

The disaster will happen anyway, but of course we can help in some way.
Even without truncating, the wrong file will be destroyed anyway.

>
> Unless we want to preserve the current content (destination file), in case the copy/clone fails?

Like I said above, Postgres should not choose the existing file as the
destination file.

Also, we have O_CREAT | O_EXCL flags together, from the link [3] you
shared before: If O_CREAT and O_EXCL are set, open() shall fail if the
file exists. So, overwriting the already existing file is already
prevented.

That said, I agree that not using O_TRUNC helps in some way.

best regards,

Ranier Vilela

Re: CREATE DATABASE with filesystem cloning

От

Robert Haas

Дата:

08 мая, 16:57:50

On Tue, May 7, 2024 at 8:00 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> We had an off-list talk with Thomas and we thought making this option
> GUC instead of SQL command level could solve this problem.
>
> I am posting a new rebased version of the patch with some important changes:
>
> * 'createdb_file_copy_method' GUC is created. Possible values are
> 'copy' and 'clone'. Copy is the default option. Clone option can be
> chosen if the system supports it, otherwise it gives error at the
> startup.

This seems like a smart idea, because the type of file copying that we
do during redo need not be the same as what was done when the
operation was originally performed.

I'm not so sure about the GUC name. On the one hand, it feels like
createdb should be spelled out as create_database, but on the other
hand, the GUC name is quite long already. Then again, why would we
make this specific to CREATE DATABASE in the first place? Would we
also want alter_tablespace_file_copy_method and
basic_archive.file_copy_method?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

08 мая, 17:34:20

Hi,

On Wed, 8 May 2024 at 16:58, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 7, 2024 at 8:00 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > We had an off-list talk with Thomas and we thought making this option
> > GUC instead of SQL command level could solve this problem.
> >
> > I am posting a new rebased version of the patch with some important changes:
> >
> > * 'createdb_file_copy_method' GUC is created. Possible values are
> > 'copy' and 'clone'. Copy is the default option. Clone option can be
> > chosen if the system supports it, otherwise it gives error at the
> > startup.
>
> This seems like a smart idea, because the type of file copying that we
> do during redo need not be the same as what was done when the
> operation was originally performed.
>
> I'm not so sure about the GUC name. On the one hand, it feels like
> createdb should be spelled out as create_database, but on the other
> hand, the GUC name is quite long already. Then again, why would we
> make this specific to CREATE DATABASE in the first place? Would we
> also want alter_tablespace_file_copy_method and
> basic_archive.file_copy_method?

I agree that it is already quite long, because of that I chose the
createdb as a prefix. I did not think that file cloning was planned to
be used in other places. If that is the case, does something like
'preferred_copy_method' work? Then, we would mention which places will
be affected with this GUC in the docs.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: CREATE DATABASE with filesystem cloning

От

Robert Haas

Дата:

09 мая, 19:29:26

On Wed, May 8, 2024 at 10:34 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > I'm not so sure about the GUC name. On the one hand, it feels like
> > createdb should be spelled out as create_database, but on the other
> > hand, the GUC name is quite long already. Then again, why would we
> > make this specific to CREATE DATABASE in the first place? Would we
> > also want alter_tablespace_file_copy_method and
> > basic_archive.file_copy_method?
>
> I agree that it is already quite long, because of that I chose the
> createdb as a prefix. I did not think that file cloning was planned to
> be used in other places. If that is the case, does something like
> 'preferred_copy_method' work? Then, we would mention which places will
> be affected with this GUC in the docs.

I would go with file_copy_method rather than preferred_copy_method,
because (1) there's nothing "preferred" about it if we're using it
unconditionally and (2) it's nice to clarify that we're talking about
copying files rather than anything else.

My personal enthusiasm for making platform-specific file copy methods
usable all over PostgreSQL is quite limited. However, it is my
observation that other people seem far more enthusiastic about it than
I am. For example, consider how quickly it got added to
pg_combinebackup. So, I suspect it's smart to plan on anything we add
in this area getting used in a bunch of places. And perhaps it is even
best to think about making it work in all of those places right from
the start. If we build support into copydir and copy_file, then we
just get everything that uses those, and all that remains is to
document was is covered (and add comments so that future patch authors
know they should further update the documentation).

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

16 мая, 15:35:38

Hi,

On Thu, 9 May 2024 at 19:29, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 8, 2024 at 10:34 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > > I'm not so sure about the GUC name. On the one hand, it feels like
> > > createdb should be spelled out as create_database, but on the other
> > > hand, the GUC name is quite long already. Then again, why would we
> > > make this specific to CREATE DATABASE in the first place? Would we
> > > also want alter_tablespace_file_copy_method and
> > > basic_archive.file_copy_method?
> >
> > I agree that it is already quite long, because of that I chose the
> > createdb as a prefix. I did not think that file cloning was planned to
> > be used in other places. If that is the case, does something like
> > 'preferred_copy_method' work? Then, we would mention which places will
> > be affected with this GUC in the docs.
>
> I would go with file_copy_method rather than preferred_copy_method,
> because (1) there's nothing "preferred" about it if we're using it
> unconditionally and (2) it's nice to clarify that we're talking about
> copying files rather than anything else.

Done.

>
> My personal enthusiasm for making platform-specific file copy methods
> usable all over PostgreSQL is quite limited. However, it is my
> observation that other people seem far more enthusiastic about it than
> I am. For example, consider how quickly it got added to
> pg_combinebackup. So, I suspect it's smart to plan on anything we add
> in this area getting used in a bunch of places. And perhaps it is even
> best to think about making it work in all of those places right from
> the start. If we build support into copydir and copy_file, then we
> just get everything that uses those, and all that remains is to
> document was is covered (and add comments so that future patch authors
> know they should further update the documentation).

I updated the documentation and put a comment on top of the copydir()
function to inform that further changes and uses of this function may
require documentation updates.

I also changed O_RDWR to O_WRONLY flag in the clone_file() function
based on Raniers' feedback. Also, does this feature need tests? I
thought about possible test cases but since this feature requires
specific file systems to run, I could not find any.

v6 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

v6-0001-Introduce-file_copy_method-GUC.patch

Re: CREATE DATABASE with filesystem cloning

От

Robert Haas

Дата:

16 мая, 15:40:32

On Thu, May 16, 2024 at 8:35 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> I updated the documentation and put a comment on top of the copydir()
> function to inform that further changes and uses of this function may
> require documentation updates.

I was assuming that the documentation for the file_copy_method was
going to list the things that it controlled, and that the comment was
going to say that you should update that list specifically. Just
saying that you may need to update some part of the documentation in
some way is fairly useless, IMHO.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

16 мая, 16:42:52

Hi,

On Thu, 16 May 2024 at 15:40, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 8:35 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > I updated the documentation and put a comment on top of the copydir()
> > function to inform that further changes and uses of this function may
> > require documentation updates.
>
> I was assuming that the documentation for the file_copy_method was
> going to list the things that it controlled, and that the comment was
> going to say that you should update that list specifically. Just
> saying that you may need to update some part of the documentation in
> some way is fairly useless, IMHO.

Actually, the documentation for the file_copy_method was mentioning
the things it controls; I converted it to an itemized list now. Also,
changed the comment to: 'Further uses of this function requires
updates to the list that GUC controls in its documentation.'. v7 is
attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

v7-0001-Introduce-file_copy_method-GUC.patch

Re: CREATE DATABASE with filesystem cloning

От

Robert Haas

Дата:

16 мая, 17:46:56

On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> Actually, the documentation for the file_copy_method was mentioning
> the things it controls; I converted it to an itemized list now. Also,
> changed the comment to: 'Further uses of this function requires
> updates to the list that GUC controls in its documentation.'. v7 is
> attached.

I think the comments need some wordsmithing.

I don't see why this parameter should be PGC_POSTMASTER.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

21 мая, 11:18:34

Hi,

On Thu, 16 May 2024 at 17:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > Actually, the documentation for the file_copy_method was mentioning
> > the things it controls; I converted it to an itemized list now. Also,
> > changed the comment to: 'Further uses of this function requires
> > updates to the list that GUC controls in its documentation.'. v7 is
> > attached.
>
> I think the comments need some wordsmithing.

I changed it to 'Uses of this function must be documented in the list
of places affected by this GUC.', I am open to any suggestions.

> I don't see why this parameter should be PGC_POSTMASTER.

I changed it to 'PGC_USERSET' now. My initial point was the database
or tablespace to be copied with the same method. I thought copying
some portion of the database with the copy and rest with the clone
could cause potential problems. After a bit of searching, I could not
find any problems related to that.

v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Вложения

v8-0001-Introduce-file_copy_method-GUC.patch

Re: CREATE DATABASE with filesystem cloning

От

Ranier Vilela

Дата:

21 мая, 15:08:13

Em ter., 21 de mai. de 2024 às 05:18, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:

Hi,

On Thu, 16 May 2024 at 17:47, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> > Actually, the documentation for the file_copy_method was mentioning
> > the things it controls; I converted it to an itemized list now. Also,
> > changed the comment to: 'Further uses of this function requires
> > updates to the list that GUC controls in its documentation.'. v7 is
> > attached.
>
> I think the comments need some wordsmithing.

I changed it to 'Uses of this function must be documented in the list
of places affected by this GUC.', I am open to any suggestions.

> I don't see why this parameter should be PGC_POSTMASTER.

I changed it to 'PGC_USERSET' now. My initial point was the database
or tablespace to be copied with the same method. I thought copying
some portion of the database with the copy and rest with the clone
could cause potential problems. After a bit of searching, I could not
find any problems related to that.

v8 is attached.

Hi,

I did some research on the subject and despite being an improvement,

I believe that some terminologies should be changed in this patch.
Although the new function is called *clone_file*, I'm not sure if it really is "clone".
Why MacOS added an API called clonefile [1] and Linux exists
another called FICLONE.[2]
So perhaps it should be treated here as a copy and not a clone?
Leaving it open, is the possibility of implementing a true clone api?
Thoughts?

best regards,

Ranier Vilela

[1] clonefile

[2] ioctl_ficlonerange

--
Regards,
Nazir Bilal Yavuz
Microsoft

Re: CREATE DATABASE with filesystem cloning

От

Nazir Bilal Yavuz

Дата:

03 июня, 15:49:29

Hi,

On Tue, 21 May 2024 at 15:08, Ranier Vilela <ranier.vf@gmail.com> wrote:
>
> Em ter., 21 de mai. de 2024 às 05:18, Nazir Bilal Yavuz <byavuz81@gmail.com> escreveu:
>>
>> Hi,
>>
>> On Thu, 16 May 2024 at 17:47, Robert Haas <robertmhaas@gmail.com> wrote:
>> >
>> > On Thu, May 16, 2024 at 9:43 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>> > > Actually, the documentation for the file_copy_method was mentioning
>> > > the things it controls; I converted it to an itemized list now. Also,
>> > > changed the comment to: 'Further uses of this function requires
>> > > updates to the list that GUC controls in its documentation.'. v7 is
>> > > attached.
>> >
>> > I think the comments need some wordsmithing.
>>
>> I changed it to 'Uses of this function must be documented in the list
>> of places affected by this GUC.', I am open to any suggestions.
>>
>> > I don't see why this parameter should be PGC_POSTMASTER.
>>
>> I changed it to 'PGC_USERSET' now. My initial point was the database
>> or tablespace to be copied with the same method. I thought copying
>> some portion of the database with the copy and rest with the clone
>> could cause potential problems. After a bit of searching, I could not
>> find any problems related to that.
>>
>> v8 is attached.
>
> Hi,
> I did some research on the subject and despite being an improvement,
> I believe that some terminologies should be changed in this patch.
> Although the new function is called *clone_file*, I'm not sure if it really is "clone".
> Why MacOS added an API called clonefile [1] and Linux exists
> another called FICLONE.[2]
> So perhaps it should be treated here as a copy and not a clone?
> Leaving it open, is the possibility of implementing a true clone api?
> Thoughts?

Thank you for mentioning this.

1- I do not know where to look for MacOS' function definitions but I
followed the same links you shared. It says copyfile(...,
COPYFILE_CLONE_FORCE) means 'Clone the file instead. This is a force
flag i.e. if cloning fails, an error is returned....' [1]. It does the
cloning but I still do not know whether there is a difference between
'copyfile() function with COPYFILE_CLONE_FORCE flag' and 'clonefile()'
function.

2- I read a couple of discussions about copy_file_range() and FICLONE.
It seems that the copy_file_range() function is a slightly upgraded
version of FICLONE but less available since it is newer. It still does
the clone [2]. Also, FICLONE is already used in pg_upgrade and
pg_combinebackup.

Both of these copyfile(..., COPYFILE_CLONE_FORCE) and
copy_file_range() functions do the cloning, so I think using clone
terminology is correct. But, using FICLONE instead of
copy_file_range() could be better since it is more widely available.

[1] https://www.unix.com/man-page/mojave/3/copyfile/
[2] https://elixir.bootlin.com/linux/v5.15-rc5/source/fs/read_write.c#L1495

--
Regards,
Nazir Bilal Yavuz
Microsoft

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: CREATE DATABASE with filesystem cloning

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения