Discussion: Unlogged tables cleanup


Unlogged tables cleanup

From:
Konstantin Knizhnik
Date:
Hi, hackers

I wonder if the following behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,100000));
INSERT 0 100000



Now simulate a server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/


So the whole relation directory is removed!
This happens only for the first table created in the tablespace.
If you create the table in the Postgres data directory everything is OK: the
first segment of the relation is truncated but not deleted.
Also, if you create one more unlogged table in the tablespace it is truncated
correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,100000));
INSERT 0 100000
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768_init


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Unlogged tables cleanup

From:
Robert Haas
Date:
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Now simulate server crash using using "pkill -9 postgres".
>
> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
> logfile start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> knizhnik@knizhnik:~/dtm-data$ psql postgres
> psql (10devel)
> Type "help" for help.
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
> No such file or directory
>
> knizhnik@knizhnik:~/dtm-data$ ls fs
> PG_10_201611041
> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>
> So all relation directory is removed!
> It happens only for first table created in tablespace.
> If you create table in Postgres data directory everything is ok: first
> segment of relation is truncated but not deleted.

Whoa.  There should be an _init fork that doesn't get removed, and
without removing the _init fork you shouldn't be able to remove the
directory that contains it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Now simulate server crash using using "pkill -9 postgres".
>>
>> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
>> logfile start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> knizhnik@knizhnik:~/dtm-data$ psql postgres
>> psql (10devel)
>> Type "help" for help.
>>
>> postgres=# select * from foo;
>> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
>> No such file or directory
>>
>> knizhnik@knizhnik:~/dtm-data$ ls fs
>> PG_10_201611041
>> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>>
>> So all relation directory is removed!
>> It happens only for first table created in tablespace.
>> If you create table in Postgres data directory everything is ok: first
>> segment of relation is truncated but not deleted.
>
> Whoa.  There should be an _init fork that doesn't get removed, and
> without removing the _init fork you shouldn't be able to remove the
> directory that contains it.

Hm.. I cannot reproduce what you see on Linux or macOS. Perhaps you
have a standby locally that also points to this tablespace?
-- 
Michael



Re: Unlogged tables cleanup

From:
konstantin knizhnik
Date:
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

>
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository.
Please note that you should create a new database and tablespace to reproduce this issue.
So the whole sequence is actually:

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,100000));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;


> --
> Michael




Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> No, it is latest sources from Postgres repository.
> Please notice that you should create new database and tablespace to reproduce this issue.
> So actually the whole sequence is
>
> mkdir fs
> initdb -D pgsql
> pg_ctl -D pgsql -l logfile start
> psql postgres
> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
> # set default_tablespace=fs;
> # create unlogged table foo(x integer);
> # insert into foo values(generate_series(1,100000));
> # ^D
> pkill -9 postgres
> pg_ctl -D pgsql -l logfile start
> # select * from foo;

OK, I understood what I was missing. This can be reproduced with
wal_level = minimal. When using hot_standby, things work properly.
-- 
Michael



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> No, it is latest sources from Postgres repository.
>> Please notice that you should create new database and tablespace to reproduce this issue.
>> So actually the whole sequence is
>>
>> mkdir fs
>> initdb -D pgsql
>> pg_ctl -D pgsql -l logfile start
>> psql postgres
>> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
>> # set default_tablespace=fs;
>> # create unlogged table foo(x integer);
>> # insert into foo values(generate_series(1,100000));
>> # ^D
>> pkill -9 postgres
>> pg_ctl -D pgsql -l logfile start
>> # select * from foo;
>
> OK, I understood what I was missing. This can be reproduced with
> wal_level = minimal. When using hot_standby things are working
> properly.

Okay, so what happens is that the CREATE TABLESPACE record removes the
tablespace directory and recreates a fresh one, but as no CREATE
records are created for unlogged tables the init fork is not
re-created. It seems to me that we should log a record to recreate the
INIT fork, and that heap_create_with_catalog() is missing something.
Generating a record in RelationCreateStorage() is the more direct way,
and that actually fixes the issue. Now the comments at the top of it
mention that RelationCreateStorage() should only be used to create the
MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
the end of heap_create()?
-- 
Michael



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Okay, so what happens is that the CREATE TABLESPACE record removes the
> tablespace directory and recreates a fresh one, but as no CREATE
> records are created for unlogged tables the init fork is not
> re-created. It seems to me that we should log a record to recreate the
> INIT fork, and that heap_create_with_catalog() is missing something.
> Generating a record in RelationCreateStorage() is the more direct way,
> and that actually fixes the issue. Now the comments at the top of it
> mention that RelationCreateStorage() should only be used to create the
> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
> the end of heap_create()?

Nah. Looking at the code the fix is quite obvious.
heap_create_init_fork() is checking for XLogIsNeeded() to decide if
the INIT forknum should be logged or not. But this is wrong, it needs
to be done unconditionally to ensure that the relation gets created at
replay.
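
To make the failure mode concrete, here is a toy Python model of the replay sequence (illustrative only — the record names and relfilenode are made up, this is not PostgreSQL code): the CREATE TABLESPACE redo wipes the directory, so only files re-created by a later WAL record survive.

```python
def build_wal(log_init_fork):
    # WAL as generated before the crash: the tablespace creation record,
    # then (with the fix) a record re-creating the unlogged table's init fork.
    wal = [("CREATE_TABLESPACE",)]
    if log_init_fork:  # the fix: emit this even when XLogIsNeeded() is false
        wal.append(("LOG_NEWPAGE", "16385_init"))
    return wal

def replay(wal, on_disk):
    # Crash recovery: CREATE_TABLESPACE redo does rmtree() on the tablespace
    # directory and recreates it empty, wiping even fsync'd files.
    for rec in wal:
        if rec[0] == "CREATE_TABLESPACE":
            on_disk.clear()
        elif rec[0] == "LOG_NEWPAGE":
            on_disk.add(rec[1])
    return on_disk

before_crash = {"16385", "16385_init"}  # both forks written and fsync'd

# wal_level = minimal on HEAD: no record for the init fork, so it vanishes
assert "16385_init" not in replay(build_wal(False), set(before_crash))
# with the unconditional log, replay re-creates the init fork
assert "16385_init" in replay(build_wal(True), set(before_crash))
```

This also explains why the problem only shows up for the first table in a fresh tablespace: only then is a CREATE TABLESPACE record replayed ahead of the table's files.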
--
Michael

Attachments

Re: Unlogged tables cleanup

From:
Kuntal Ghosh
Date:
On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Okay, so what happens is that the CREATE TABLESPACE record removes the
>> tablespace directory and recreates a fresh one, but as no CREATE
>> records are created for unlogged tables the init fork is not
>> re-created. It seems to me that we should log a record to recreate the
>> INIT fork, and that heap_create_with_catalog() is missing something.
>> Generating a record in RelationCreateStorage() is the more direct way,
>> and that actually fixes the issue. Now the comments at the top of it
>> mention that RelationCreateStorage() should only be used to create the
>> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
>> the end of heap_create()?
>
> Nah. Looking at the code the fix is quite obvious.
> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
> the INIT forknum should be logged or not. But this is wrong, it needs
> to be done unconditionally to ensure that the relation gets created at
> replay.

I think that we should also update other *buildempty() functions as well.
For example, if the table has a primary key, we'll encounter the error again
for btree index.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Nah. Looking at the code the fix is quite obvious.
>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>> the INIT forknum should be logged or not. But this is wrong, it needs
>> to be done unconditionally to ensure that the relation gets created at
>> replay.
>
> I think that we should also update other *buildempty() functions as well.
> For example, if the table has a primary key, we'll encounter the error again
> for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-gist
are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to test all the AMs:
\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,10000));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (1000000);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the
tables. On HEAD, only 3 are getting created and with the patch all of
them are.
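
To check the fork count mechanically rather than by eyeballing `ls`, a small helper along these lines (my own sketch, not part of the patch; the directory layout below is fabricated for the demo) counts `*_init` files under a tablespace:

```python
import os
import tempfile

def count_init_forks(tablespace_path):
    """Count relation init forks (files ending in '_init') under a
    tablespace directory, recursively."""
    n = 0
    for _root, _dirs, files in os.walk(tablespace_path):
        n += sum(1 for f in files if f.endswith("_init"))
    return n

# Demo on a fake layout mirroring PG_10_201611041/<dboid>/<relfilenode>
with tempfile.TemporaryDirectory() as tbsp:
    d = os.path.join(tbsp, "PG_10_201611041", "12289")
    os.makedirs(d)
    for name in ("32768", "32768_init", "32770", "32770_init"):
        open(os.path.join(d, name), "w").close()
    assert count_init_forks(tbsp) == 2
```

Run against the real tablespace directory after the crash-and-restart cycle, it should report 8 with the patch applied and 3 on HEAD.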
--
Michael

Attachments

Re: Unlogged tables cleanup

From:
Robert Haas
Date:
On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Nah. Looking at the code the fix is quite obvious.
>>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>>> the INIT forknum should be logged or not. But this is wrong, it needs
>>> to be done unconditionally to ensure that the relation gets created at
>>> replay.
>>
>> I think that we should also update other *buildempty() functions as well.
>> For example, if the table has a primary key, we'll encounter the error again
>> for btree index.
>
> Good point. I just had a look at all the AMs: bloom, btree and SP-gist
> are plainly broken. The others are fine. Attached is an updated patch.
>
> Here are the tests I have done with wal_level = minimal to test all the AMs:
> \! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
> create extension bloom;
> create extension btree_gist;
> create extension btree_gin;
> create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
> set default_tablespace = popo;
> create unlogged table foo (a int);
> insert into foo values (generate_series(1,10000));
> create index foo_bt on foo(a);
> create index foo_bloom on foo using bloom(a);
> create index foo_gin on foo using gin (a);
> create index foo_gist on foo using gist (a);
> create index foo_brin on foo using brin (a);
> create unlogged table foo_geo (a box);
> insert into foo_geo values ('(1,2),(2,3)');
> create index foo_spgist ON foo_geo using spgist(a);
>
> -- crash PG
> -- Insert some data
> insert into foo values (1000000);
> insert into foo_geo values ('(50,12),(312,3)');
>
> This should create 8 INIT forks, 6 for the indexes and 2 for the
> tables. On HEAD, only 3 are getting created and with the patch all of
> them are.

The header comment for heap_create_init_fork() says this:

/*
 * Set up an init fork for an unlogged table so that it can be correctly
 * reinitialized on restart.  Since we're going to do an immediate sync, we
 * only need to xlog this if archiving or streaming is enabled.  And the
 * immediate sync is required, because otherwise there's no guarantee that
 * this will hit the disk before the next checkpoint moves the redo pointer.
 */

Your patch causes the code not to match the comment any more.  And the
comment explains why at the time I wrote this code I thought it was OK
to have the XLogIsNeeded() test in there, so it clearly needs
updating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The header comment for heap_create_init_fork() says this:
>
> /*
>  * Set up an init fork for an unlogged table so that it can be correctly
>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>  * only need to xlog this if archiving or streaming is enabled.  And the
>  * immediate sync is required, because otherwise there's no guarantee that
>  * this will hit the disk before the next checkpoint moves the redo pointer.
>  */
>
> Your patch causes the code not to match the comment any more.  And the
> comment explains why at the time I wrote this code I thought it was OK
> to have the XLogIsNeeded() test in there, so it clearly needs
> updating.

Indeed I missed this comment block. Please let me suggest the following instead:
 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
  */
--
Michael

Attachments

Re: Unlogged tables cleanup

From:
Robert Haas
Date:
On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The header comment for heap_create_init_fork() says this:
>>
>> /*
>>  * Set up an init fork for an unlogged table so that it can be correctly
>>  * reinitialized on restart.  Since we're going to do an immediate sync, we
>>  * only need to xlog this if archiving or streaming is enabled.  And the
>>  * immediate sync is required, because otherwise there's no guarantee that
>>  * this will hit the disk before the next checkpoint moves the redo pointer.
>>  */
>>
>> Your patch causes the code not to match the comment any more.  And the
>> comment explains why at the time I wrote this code I thought it was OK
>> to have the XLogIsNeeded() test in there, so it clearly needs
>> updating.
>
> Indeed I missed this comment block. Please let me suggest the following instead:
>  /*
>   * Set up an init fork for an unlogged table so that it can be correctly
> - * reinitialized on restart.  Since we're going to do an immediate sync, we
> - * only need to xlog this if archiving or streaming is enabled.  And the
> - * immediate sync is required, because otherwise there's no guarantee that
> - * this will hit the disk before the next checkpoint moves the redo pointer.
> + * reinitialized on restart.  An immediate sync is required even if the
> + * page has been logged, because the write did not go through
> + * shared_buffers and therefore a concurrent checkpoint may have moved
> + * the redo pointer past our xlog record.
>   */

Hmm.  Well, that deletes the comment that's no longer true, but it
doesn't replace it with any explanation of why we also need to WAL-log
it unconditionally, and I think that explanation is not entirely
trivial?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Indeed I missed this comment block. Please let me suggest the following instead:
>>  /*
>>   * Set up an init fork for an unlogged table so that it can be correctly
>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>> - * only need to xlog this if archiving or streaming is enabled.  And the
>> - * immediate sync is required, because otherwise there's no guarantee that
>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>> + * reinitialized on restart.  An immediate sync is required even if the
>> + * page has been logged, because the write did not go through
>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>> + * the redo pointer past our xlog record.
>>   */
>
> Hmm.  Well, that deletes the comment that's no longer true, but it
> doesn't replace it with any explanation of why we also need to WAL-log
> it unconditionally, and I think that explanation is not entirely
> trivial?

OK, the original code does not give any special reason either
regarding why doing so is safe for archiving or streaming :)

More seriously, if there could be more details regarding that, I would
think that we could say something like "logging the init fork is
mandatory in any case to ensure its on-disk presence when at recovery
replay, even on non-default tablespaces whose base location are
deleted and re-created from scratch if the WAL record in charge of
creating this tablespace gets replayed". The problem shows up because
of tablespaces being deleted at replay at the end... So perhaps this
makes sense. What do you think?
-- 
Michael



Re: Unlogged tables cleanup

From:
Robert Haas
Date:
On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Indeed I missed this comment block. Please let me suggest the following instead:
>>>  /*
>>>   * Set up an init fork for an unlogged table so that it can be correctly
>>> - * reinitialized on restart.  Since we're going to do an immediate sync, we
>>> - * only need to xlog this if archiving or streaming is enabled.  And the
>>> - * immediate sync is required, because otherwise there's no guarantee that
>>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>>> + * reinitialized on restart.  An immediate sync is required even if the
>>> + * page has been logged, because the write did not go through
>>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>>> + * the redo pointer past our xlog record.
>>>   */
>>
>> Hmm.  Well, that deletes the comment that's no longer true, but it
>> doesn't replace it with any explanation of why we also need to WAL-log
>> it unconditionally, and I think that explanation is not entirely
>> trivial?
>
> OK, the original code does not give any special reason either
> regarding why doing so is safe for archiving or streaming :)

Yeah, but surely it's obvious that if you don't WAL log it, it's not
going to work for archiving or streaming.  It's a lot less obvious why
you have to WAL log it when you're not doing either of those things if
you've already written it and fsync'd it locally.  How is it going to
disappear if it's been fsync'd, one wonders?

> More seriously, if there could be more details regarding that, I would
> think that we could say something like "logging the init fork is
> mandatory in any case to ensure its on-disk presence when at recovery
> replay, even on non-default tablespaces whose base location are
> deleted and re-created from scratch if the WAL record in charge of
> creating this tablespace gets replayed". The problem shows up because
> of tablespaces being deleted at replay at the end... So perhaps this
> makes sense. What do you think?

Yes, that's about what I think we need to explain.  Actually, I'm
wondering if we ought to just switch this over to using the delayChkpt
mechanism instead of going through this complicated song-and-dance.
Incurring an immediate sync to avoid having to WAL-log this is a bit
tenuous, but having to WAL-log it AND do the immediate sync just seems
silly, and I'm actually a bit worried that even with your fix there
might still be a latent bug somewhere.  We couldn't switch mechanisms
cleanly in the 9.2 branch (cf.
f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
back-patch it in the form you have it in already, but it's probably
worth switching over at least in master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, but surely it's obvious that if you don't WAL log it, it's not
> going to work for archiving or streaming.  It's a lot less obvious why
> you have to WAL log it when you're not doing either of those things if
> you've already written it and fsync'd it locally.  How is it going to
> disappear if it's been fsync'd, one wonders?

It's not at all obvious that replaying a WAL record for a database
creation or tablespace creation would cause that! I didn't know that
redo was wiping them out with rmtree() at replay before looking at
this bug. One more thing to remember when fixing an issue in this area
in the future.

>> More seriously, if there could be more details regarding that, I would
>> think that we could say something like "logging the init fork is
>> mandatory in any case to ensure its on-disk presence when at recovery
>> replay, even on non-default tablespaces whose base location are
>> deleted and re-created from scratch if the WAL record in charge of
>> creating this tablespace gets replayed". The problem shows up because
>> of tablespaces being deleted at replay at the end... So perhaps this
>> makes sense. What do you think?
>
> Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+    * Write the page and log it unconditionally.  This is important
+    * particularly for indexes created on tablespaces and databases
+    * whose creation happened after the last redo pointer as recovery
+    * removes any of their existing content when the corresponding
+    * create records are replayed.
+    */
I have not been able to use the word "create" any less than that. The
patch is really repetitive, but I think that we had better mention the
need to log the content in each code path and not touch log_newpage(),
to keep it maximally flexible.

> Actually, I'm
> wondering if we ought to just switch this over to using the delayChkpt
> mechanism instead of going through this complicated song-and-dance.
> Incurring an immediate sync to avoid having to WAL-log this is a bit
> tenuous, but having to WAL-log it AND do the immediate sync just seems
> silly, and I'm actually a bit worried that even with your fix there
> might still be a latent bug somewhere.  We couldn't switch mechanisms
> cleanly in the 9.2 branch (cf.
> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
> back-patch it in the form you have it in already, but it's probably
> worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.
--
Michael

Attachments

Re: Unlogged tables cleanup

From:
Robert Haas
Date:
Michael, your greetings were passed on to me with a request that I
look at this thread.

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>>> More seriously, if there could be more details regarding that, I would
>>> think that we could say something like "logging the init fork is
>>> mandatory in any case to ensure its on-disk presence when at recovery
>>> replay, even on non-default tablespaces whose base location are
>>> deleted and re-created from scratch if the WAL record in charge of
>>> creating this tablespace gets replayed". The problem shows up because
>>> of tablespaces being deleted at replay at the end... So perhaps this
>>> makes sense. What do you think?
>>
>> Yes, that's about what I think we need to explain.
>
> OK, what do you think about the attached then?
>
> I came up with something like that for those code paths:
> -   /* Write the page.  If archiving/streaming, XLOG it. */
> +   /*
> +    * Write the page and log it unconditionally.  This is important
> +    * particularly for indexes created on tablespaces and databases
> +    * whose creation happened after the last redo pointer as recovery
> +    * removes any of their existing content when the corresponding
> +    * create records are replayed.
> +    */
> I have not been able to use the word "create" less than that. The
> patch is really repetitive, but I think that we had better mention the
> need of logging the content in each code path and not touch
> log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it.  It might seem that an immediate sync would
be sufficient to guarantee that the file exists on disk, but recovery
itself might remove it while replaying, for example, an
XLOG_DBASE_CREATE record.  Therefore, we need this even when
wal_level=minimal.

>> Actually, I'm
>> wondering if we ought to just switch this over to using the delayChkpt
>> mechanism instead of going through this complicated song-and-dance.
>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>> silly, and I'm actually a bit worried that even with your fix there
>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>> cleanly in the 9.2 branch (cf.
>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>> back-patch it in the form you have it in already, but it's probably
>> worth switching over at least in master.
>
> Thinking through it... Could we not just rip off the sync phase
> because there is delayChkpt? It seems to me that what counts is that
> the commit of the transaction that creates the relation does not get
> past the redo point. It would matter for read uncommitted transactions
> but that concept does not exist in PG, and never will. On
> back-branches it is definitely safer to keep the sync, I am just
> thinking about a HEAD-only optimization here as you do.

Right (I think).  If we set and clear delayChkpt around this work, we
don't need the immediate sync.
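
For illustration, a minimal toy sketch of the delayChkpt idea (made-up classes, not the real shared-memory PGPROC flag): the checkpointer must not establish a redo pointer while any backend is inside the write-then-WAL-log window, so the record is guaranteed to be covered by the next checkpoint and no immediate sync is needed.

```python
class Backend:
    """Stand-in for a backend's PGPROC entry."""
    def __init__(self):
        self.delay_chkpt = False   # the real flag lives in shared memory

class Checkpointer:
    def __init__(self, backends):
        self.backends = backends

    def can_move_redo_pointer(self):
        # The real checkpointer waits until no backend has the flag set
        # before fixing the redo pointer; here we just report whether it may.
        return not any(b.delay_chkpt for b in self.backends)

backend = Backend()
checkpointer = Checkpointer([backend])

backend.delay_chkpt = True             # entering the write-then-WAL-log window
assert not checkpointer.can_move_redo_pointer()
# ... write the init fork and emit its WAL record here ...
backend.delay_chkpt = False            # window closed; checkpoint may proceed
assert checkpointer.can_move_redo_pointer()
```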

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>>> More seriously, if there could be more details regarding that, I would
>>>> think that we could say something like "logging the init fork is
>>>> mandatory in any case to ensure its on-disk presence when at recovery
>>>> replay, even on non-default tablespaces whose base location are
>>>> deleted and re-created from scratch if the WAL record in charge of
>>>> creating this tablespace gets replayed". The problem shows up because
>>>> of tablespaces being deleted at replay at the end... So perhaps this
>>>> makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> -   /* Write the page.  If archiving/streaming, XLOG it. */
>> +   /*
>> +    * Write the page and log it unconditionally.  This is important
>> +    * particularly for indexes created on tablespaces and databases
>> +    * whose creation happened after the last redo pointer as recovery
>> +    * removes any of their existing content when the corresponding
>> +    * create records are replayed.
>> +    */
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it.  It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record.  Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote a bit the patch as attached. What do you think?

>>> Actually, I'm
>>> wondering if we ought to just switch this over to using the delayChkpt
>>> mechanism instead of going through this complicated song-and-dance.
>>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>>> silly, and I'm actually a bit worried that even with your fix there
>>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>>> cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip off the sync phase
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read uncommitted transactions
>> but that concept does not exist in PG, and never will. On
>> back-branches it is definitely safer to keep the sync, I am just
>> thinking about a HEAD-only optimization here as you do.
>
> Right (I think).  If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different from what you mean, I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines, because other transactions
would not refer to it until this transaction has committed. So I am
arguing for just removing the sync phase.
--
Michael

Attachments

Re: Unlogged tables cleanup

From
Robert Haas
Date:
On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> OK, I rewrote a bit the patch as attached. What do you think?

Committed and back-patched all the way back to 9.2.

>> Right (I think).  If we set and clear delayChkpt around this work, we
>> don't need the immediate sync.
>
> My point is a bit different than what you mean I think: the
> transaction creating an unlogged relfilenode would not need to even
> set delayChkpt in the empty() routines because other transactions
> would not refer to it until this transaction has committed. So I am
> arguing about just removing the sync phase.

That doesn't sound right; see the comment for heap_create_init_fork.
Suppose the transaction creating the unlogged table commits, a
checkpoint happens, and then the operating system crashes.  Without
the immediate sync, the operating system crash can cause the un-sync'd
file to be lost, and because of the checkpoint the WAL record that
creates it isn't replayed either.  So the file's just gone.
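Robert's failure sequence can be sketched as a toy timeline model (illustrative only — the function and event names here are invented, not PostgreSQL code):

```python
# Toy model of the scenario above: a file created without an immediate
# sync survives a post-checkpoint OS crash neither on disk nor via WAL
# replay, because replay starts at the checkpoint's redo pointer.

def survives_crash(immediate_sync, checkpoint_after_commit):
    wal_records = ["create_init_fork"]   # logged when the fork is created
    on_disk = immediate_sync             # durable only if fsync'd in time
    if checkpoint_after_commit:
        # The checkpoint moves the redo pointer, so the creation record
        # is never replayed.  It does NOT flush the file either: the
        # write bypassed shared_buffers, so the checkpointer never saw it.
        wal_records = []
    replayed = "create_init_fork" in wal_records
    return on_disk or replayed

# Checkpoint + OS crash without the sync: the file is just gone.
assert survives_crash(False, True) is False
# With the immediate sync it is durable regardless of checkpoints.
assert survives_crash(True, True) is True
```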

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From
Michael Paquier
Date:
On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> OK, I rewrote a bit the patch as attached. What do you think?
>
> Committed and back-patched all the way back to 9.2.

Thanks!

>>> Right (I think).  If we set and clear delayChkpt around this work, we
>>> don't need the immediate sync.
>>
>> My point is a bit different than what you mean I think: the
>> transaction creating an unlogged relfilenode would not need to even
>> set delayChkpt in the empty() routines because other transactions
>> would not refer to it until this transaction has committed. So I am
>> arguing about just removing the sync phase.
>
> That doesn't sound right; see the comment for heap_create_init_fork.
> Suppose the transaction creating the unlogged table commits, a
> checkpoint happens, and then the operating system crashes.  Without
> the immediate sync, the operating system crash can cause the un-sync'd
> file to be lost, and because of the checkpoint the WAL record that
> creates it isn't replayed either.  So the file's just gone.

Doh. That argument would have made sense if the checkpoint actually
flushed the page, which it would only do if the page were in shared
buffers. But that's not the case.
-- 
Michael



Re: [HACKERS] Unlogged tables cleanup

From
Robert Haas
Date:
On Thu, Dec 8, 2016 at 7:49 PM Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> OK, I rewrote a bit the patch as attached. What do you think?
> >
> > Committed and back-patched all the way back to 9.2.
>
> Thanks!

This thread resulted in commit
fa0f466d5329e10b16f3b38c8eaf5306f7e234e8, and today I had cause to
look at that commit again.  The code is now in
heapam_relation_set_new_filenode. I noticed a few things that seem
like they are not quite right.

1. The comment added in that commit says "Recovery may as well remove
it while replaying..." but what is really meant is "Recovery may well
remove it while replaying..."  The phrase "may as well" means that
there isn't really any reason not to do it even if it's not strictly
necessary.  The phrase "may well" means that it's entirely possible.
The latter meaning is the one we want here.

2. The comment as adjusted in that commit refers to needing an
immediate sync "even if the page has been logged, because the write
did not go through shared_buffers," but there is no page and no write,
because an empty heap is just an empty file.   That logic makes sense
for index AMs that bypass shared buffers to write a metapage, e.g.
nbtree, as opposed to others which go through shared_buffers and don't
have the immediate sync, e.g. brin.  However, the heap writes no pages
either through shared buffers or bypassing shared buffers, so either
I'm confused or the comment makes no sense.

3. Before that commit, the comment said that "the immediate sync is
required, because otherwise there's no guarantee that this will hit
the disk before the next checkpoint moves the redo pointer."  However,
that justification seems to imply that *anything* that does smgrcreate +
log_smgrcreate would also need to do smgrimmedsync, and
RelationCreateStorage doesn't.  Unless, for some reason that I'm not
thinking of right now, the init fork has stronger durability
requirements than the main fork, it's hard to understand why
heapam_relation_set_new_filenode can call RelationCreateStorage to do
smgrcreate + log_smgrcreate for the main fork and that's OK, but when
it does the same thing itself for the init fork, it now needs
smgrimmedsync as well.

My guess, just shooting from the hip, is that the smgrimmedsync call
can be removed here.  If that's wrong, then we need a better
explanation for why it's needed, and we possibly need to add it to
every single place that does smgrcreate that doesn't have it already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From
Alvaro Herrera
Date:
I agree that the wording "recovery may as well" is incorrect and that
"may well" makes it correct.

On 2019-May-13, Robert Haas wrote:

> My guess, just shooting from the hip, is that the smgrimmedsync call
> can be removed here.  If that's wrong, then we need a better
> explanation for why it's needed, and we possibly need to add it to
> every single place that does smgrcreate that doesn't have it already.

AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
so it would be sufficient to have the init fork be recovered from WAL
for that to work.  However, we also do ResetUnloggedRelations *before*
replaying WAL in order to remove leftover not-init-fork files, and that
process requires that the init fork is present at that time.  So I think
the immedsync call is necessary (otherwise the cleanup may fail).  I
don't quite understand why the log_smgrcreate is necessary, but I think
it is for reasons that are not adequately explained by the existing
comments.

IMO if you can remove either the immedsync or the log_smgrcreate call
and no test fails, then we're either missing test cases, or (one of) the
calls is unnecessary.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
Hi,

On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote:
> > My guess, just shooting from the hip, is that the smgrimmedsync call
> > can be removed here.  If that's wrong, then we need a better
> > explanation for why it's needed, and we possibly need to add it to
> > every single place that does smgrcreate that doesn't have it already.
> 
> AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> so it would be sufficient to have the init fork be recovered from WAL
> for that to work.  However, we also do ResetUnloggedRelations *before*
> replaying WAL in order to remove leftover not-init-fork files, and that
> process requires that the init fork is present at that time.

What scenario are you precisely wondering about? That
ResetUnloggedRelations() could overwrite the main fork, while not yet
having valid contents (due to the lack of smgrimmedsync())? Shouldn't
that only be possible while still in an inconsistent state? A checkpoint
would have serialized the correct contents, and we'd not reach HS
consistency before having replayed the WAL records that reset the
table and restore the init fork's consistency?


> So I think
> the immedsync call is necessary (otherwise the cleanup may fail).  I
> don't quite understand why the log_smgrcreate is necessary, but I think
> it is for reasons that are not adequately explained by the existing
> comments.

Well, otherwise the relation won't exist on a standby? And if replay
starts from before a database/tablespace creation we'd remove the init
fork. So if it's not in the WAL, we'd lose it.


> IMO if you can remove either the immedsync or the log_smgrcreate call
> and no test fails, then we're either missing test cases, or (one of) the
> calls is unnecessary.

We definitely don't have a high coverage of more complicated recovery
scenarios.

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From
Alvaro Herrera
Date:
On 2019-May-13, Andres Freund wrote:

> On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote:

> > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > so it would be sufficient to have the init fork be recovered from WAL
> > for that to work.  However, we also do ResetUnloggedRelations *before*
> > replaying WAL in order to remove leftover not-init-fork files, and that
> > process requires that the init fork is present at that time.
> 
> What scenario are you precisely wondering about? That
> ResetUnloggedRelations() could overwrite the main fork, while not yet
> having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> that only be possible while still in an inconsistent state? A checkpoint
> would have serialized the correct contents, and we'd not reach HS
> consistency before having replayed that WAL records resetting the table
> and the init fork consistency?

The first ResetUnloggedRelations call occurs before any WAL is replayed,
so the data dir is certainly still in an inconsistent state.  At that point,
we need the init fork files to be present, because the init files are the
indicators of which relations we need to delete the other forks for.
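The cleanup pass described here can be sketched as a small filesystem scan (a simplified model with invented naming — the real logic lives in reinit.c and deals with tablespaces, fork suffixes, and more):

```python
import os
import tempfile

def cleanup_unlogged(dbdir):
    """Toy pre-replay cleanup: for every relfilenode that has an _init
    fork, remove all of its other forks.  Simplified for illustration."""
    names = os.listdir(dbdir)
    unlogged = {n[:-len("_init")] for n in names if n.endswith("_init")}
    for n in names:
        relnode = n.split("_")[0]
        if relnode in unlogged and not n.endswith("_init"):
            os.remove(os.path.join(dbdir, n))

d = tempfile.mkdtemp()
for f in ["16385", "16385_fsm", "16385_init", "16386"]:
    open(os.path.join(d, f), "w").close()
cleanup_unlogged(d)
# The unlogged relation's main and fsm forks are gone; the init fork
# and the unrelated relation 16386 survive.
assert sorted(os.listdir(d)) == ["16385_init", "16386"]
```

Note that if the init fork had been lost in a crash, the stale main fork of 16385 would survive this pass untouched — exactly the hazard being discussed.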

Maybe we can do something lighter than a full immedsync of all the data
for the init file -- it would be sufficient to have the file *exist* --
but I'm not sure this optimization is worth anything.

> > So I think the immedsync call is necessary (otherwise the cleanup
> > may fail).  I don't quite understand why the log_smgrcreate is
> > necessary, but I think it is for reasons that are not adequately
> > explained by the existing comments.
> 
> Well, otherwise the relation won't exist on a standby? And if replay
> starts from before a database/tablespace creation we'd remove the init
> fork. So if it's not in the WAL, we'd lose it.

Ah, of course.  Well, that needs to be in the comments then.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Unlogged tables cleanup

From
Robert Haas
Date:
On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > so it would be sufficient to have the init fork be recovered from WAL
> > for that to work.  However, we also do ResetUnloggedRelations *before*
> > replaying WAL in order to remove leftover not-init-fork files, and that
> > process requires that the init fork is present at that time.
>
> What scenario are you precisely wondering about? That
> ResetUnloggedRelations() could overwrite the main fork, while not yet
> having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> that only be possible while still in an inconsistent state? A checkpoint
> would have serialized the correct contents, and we'd not reach HS
> consistency before having replayed that WAL records resetting the table
> and the init fork consistency?

I think I see what Alvaro is talking about, or at least I think I see
*a* possible problem based on his remarks.

Suppose we create an unlogged table and then crash. The main fork
makes it to disk, and the init fork does not.  Before WAL replay, we
remove any main forks that have init forks, but because the init fork
was lost, that does not happen.  Recovery recreates the init fork.
After WAL replay, we try to copy_file() each _init fork to the
corresponding main fork. That fails, because copy_file() expects to be
able to create the target file, and here it can't do that because it
already exists.

If that's the scenario, I'm not sure the smgrimmedsync() call is
sufficient.  Suppose we log_smgrcreate() but then crash before
smgrimmedsync()... seems like we'd need to do them in the other order,
or else maybe just pass a flag to copy_file() telling it not to be so
picky.
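A minimal sketch of the "less picky" copy suggested here (illustrative Python, not the actual copy_file() in PostgreSQL; paths and contents are invented):

```python
import os
import shutil
import tempfile

def copy_init_to_main(init_path, main_path):
    """Toy post-replay reset: unlike a copy that insists on creating the
    target (O_CREAT | O_EXCL semantics), this tolerates a leftover main
    fork and simply overwrites it."""
    if os.path.exists(main_path):
        os.remove(main_path)  # leftover main fork from before the crash
    shutil.copyfile(init_path, main_path)

d = tempfile.mkdtemp()
init = os.path.join(d, "16385_init")
main = os.path.join(d, "16385")
with open(init, "w") as f:
    f.write("init fork contents")
# Stale main fork that survived because the init fork was lost pre-crash:
with open(main, "w") as f:
    f.write("stale pre-crash data")
copy_init_to_main(init, main)
assert open(main).read() == "init fork contents"
```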

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From
Alvaro Herrera
Date:
On 2019-May-13, Robert Haas wrote:

> I think I see what Alvaro is talking about, or at least I think I see
> *a* possible problem based on his remarks.
> 
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not.  Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen.  Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to be
> able to create the target file, and here it can't do that because it
> already exists.

... right, that seems to be it.

> If that's the scenario, I'm not sure the smgrimmedsync() call is
> sufficient.  Suppose we log_smgrcreate() but then crash before
> smgrimmedsync()... seems like we'd need to do them in the other order,
> or else maybe just pass a flag to copy_file() telling it not to be so
> picky.

Well, if we do modify copy_file and thus we don't think the deletion
step serves any purpose, why not just get rid of the deletion step
entirely?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
Hi,

On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote:
> On 2019-May-13, Andres Freund wrote:
> 
> > On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote:
> 
> > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > > so it would be sufficient to have the init fork be recovered from WAL
> > > for that to work.  However, we also do ResetUnloggedRelations *before*
> > > replaying WAL in order to remove leftover not-init-fork files, and that
> > > process requires that the init fork is present at that time.
> > 
> > What scenario are you precisely wondering about? That
> > ResetUnloggedRelations() could overwrite the main fork, while not yet
> > having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> > that only be possible while still in an inconsistent state? A checkpoint
> > would have serialized the correct contents, and we'd not reach HS
> > consistency before having replayed that WAL records resetting the table
> > and the init fork consistency?
> 
> The first ResetUnloggedRelations call occurs before any WAL is replayed,
> so the data dir certainly still in inconsistent state.  At that point,
> we need the init fork files to be present, because the init files are the
> indicators of what relations we need to delete the other forks for.

Hm. I think this might be a self-made problem. For the main fork, we
don't need this - if the init fork was created before the last
checkpoint/restartpoint, it'll be on-disk. If it was created afterwards,
WAL replay will recreate both the main and init forks. So the problem is just
that the VM fork might survive, because it'll not get nuked given the
current arrangement. Is that what you're thinking about?

I'm doubtful that that is a sane arrangement - there very well could be
tables created and dropped, and then recreated with a recycled oid,
between start and end of recovery. I'm not sure this is actively a
problem for the VM, but I think it's pretty broken for the FSM.

Why isn't the correct answer to nuke all forks during the WAL replay of
the main relation's creation?
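One way to read this suggestion as a replay-time rule (a toy sketch with invented names; real replay would go through smgr and handle tablespaces):

```python
import os
import tempfile

FORKS = ["", "_fsm", "_vm"]  # main, free-space and visibility forks (simplified)

def replay_create_init_fork(dbdir, relnode):
    """Toy replay handler: when the record creating the init fork is
    replayed, first nuke every other fork, so nothing left over from a
    dropped relation that recycled this relfilenode can survive."""
    for suffix in FORKS:
        path = os.path.join(dbdir, relnode + suffix)
        if os.path.exists(path):
            os.remove(path)
    open(os.path.join(dbdir, relnode + "_init"), "w").close()

d = tempfile.mkdtemp()
# Leftover forks from a previous relation that reused relfilenode 16385:
for f in ["16385", "16385_fsm"]:
    open(os.path.join(d, f), "w").close()
replay_create_init_fork(d, "16385")
assert sorted(os.listdir(d)) == ["16385_init"]
```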


> Maybe we can do something lighter than a full immedsync of all the data
> for the init file -- it would be sufficient to have the file *exist* --
> but I'm not sure this optimization is worth anything.

I don't think just that is sufficient in isolation for types of
relations with metapages (e.g. btree) - the init fork constains data
there.


> > > So I think the immedsync call is necessary (otherwise the cleanup
> > > may fail).  I don't quite understand why the log_smgrcreate is
> > > necessary, but I think it is for reasons that are not adequately
> > > explained by the existing comments.
> > 
> > Well, otherwise the relation won't exist on a standby? And if replay
> > starts from before a database/tablespace creation we'd remove the init
> > fork. So if it's not in the WAL, we'd lose it.
> 
> Ah, of course.  Well, that needs to be in the comments then.

I think it is?

         * ... Recovery may as well remove it
     * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
     * record. Therefore, logging is necessary even if wal_level=minimal.

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From
Alvaro Herrera
Date:
On 2019-May-13, Andres Freund wrote:

> On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote:
> > On 2019-May-13, Andres Freund wrote:

> > The first ResetUnloggedRelations call occurs before any WAL is replayed,
> > so the data dir is certainly still in an inconsistent state.  At that point,
> > we need the init fork files to be present, because the init files are the
> > indicators of what relations we need to delete the other forks for.
> 
> Hm. I think this might be a self-made problem. For the main fork, we
> don't need this - if the init fork was created before the last
> checkpoint/restartpoint, it'll be on-disk. If it was created afterwards,
> WAL replay will recreate both the main and init forks. So the problem is just
> that the VM fork might survive, because it'll not get nuked given the
> current arrangement. Is that what you're thinking about?

No, this wasn't what I was trying to explain.  Robert described it
better.

> > Maybe we can do something lighter than a full immedsync of all the data
> > for the init file -- it would be sufficient to have the file *exist* --
> > but I'm not sure this optimization is worth anything.
> 
> I don't think just that is sufficient in isolation for types of
> relations with metapages (e.g. btree) - the init fork contains data
> there.

No, I meant that when doing the initial cleanup (before WAL replay) we
only delete files; and for that we only need to know whether the table
is unlogged, and we know that by testing presence of the init file.  We
do need the contents *after* WAL replay, and for indexes we of course
need the actual contents of the init fork.

> > > Well, otherwise the relation won't exist on a standby? And if replay
> > > starts from before a database/tablespace creation we'd remove the init
> > > fork. So if it's not in the WAL, we'd lose it.
> > 
> > Ah, of course.  Well, that needs to be in the comments then.
> 
> I think it is?
> 
>          * ... Recovery may as well remove it
>      * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>      * record. Therefore, logging is necessary even if wal_level=minimal.

I meant the standby part.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
On 2019-05-13 13:09:17 -0400, Robert Haas wrote:
> On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > > so it would be sufficient to have the init fork be recovered from WAL
> > > for that to work.  However, we also do ResetUnloggedRelations *before*
> > > replaying WAL in order to remove leftover not-init-fork files, and that
> > > process requires that the init fork is present at that time.
> >
> > What scenario are you precisely wondering about? That
> > ResetUnloggedRelations() could overwrite the main fork, while not yet
> > having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> > that only be possible while still in an inconsistent state? A checkpoint
> > would have serialized the correct contents, and we'd not reach HS
> > consistency before having replayed that WAL records resetting the table
> > and the init fork consistency?
> 
> I think I see what Alvaro is talking about, or at least I think I see
> *a* possible problem based on his remarks.
> 
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not.  Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen.  Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to be
> able to create the target file, and here it can't do that because it
> already exists.

Ugh, this is all such a mess. But, isn't this broken independently of
the smgrimmedsync() issue? In a basebackup case, the basebackup could
have included the main fork but not the init fork, or the reverse. WAL
replay *solely* needs to be able to recover from that.  At the very
least we'd have to do the cleanup step after becoming consistent, not
just before recovery even started.


> If that's the scenario, I'm not sure the smgrimmedsync() call is
> sufficient.  Suppose we log_smgrcreate() but then crash before
> smgrimmedsync()... seems like we'd need to do them in the other order,
> or else maybe just pass a flag to copy_file() telling it not to be so
> picky.

I think I mentioned this elsewhere recently: I think our smgr WAL
logging makes this unnecessarily hard. We should just have the initial
smgrcreate record (be it for the main fork, or just the init fork)
carry a flag that tells WAL replay to clean out everything pre-existing.


Hm, I'm also a bit confused as to how ResetUnloggedRelations() got to
take its argument as a bitmask - given that we only pass in individual
flags at the moment.  Seems pretty clear to me that the call at the end
of recovery would need to be UNLOGGED_RELATION_CLEANUP |
UNLOGGED_RELATION_INIT, unless we make some bigger changes?
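For reference, the two flags form a bitmask (the values below mirror reinit.h's 1 << 0 and 1 << 1; the dispatch itself is only a simplified model of how a combined call could run both passes):

```python
UNLOGGED_RELATION_CLEANUP = 1 << 0  # remove non-init forks of unlogged rels
UNLOGGED_RELATION_INIT    = 1 << 1  # copy each _init fork over its main fork

def reset_unlogged_relations(op):
    """Simplified dispatch: a combined mask runs the cleanup pass first,
    then the init-copy pass, in a single call."""
    actions = []
    if op & UNLOGGED_RELATION_CLEANUP:
        actions.append("cleanup")
    if op & UNLOGGED_RELATION_INIT:
        actions.append("init")
    return actions

# The end-of-recovery call suggested above would combine both flags:
assert reset_unlogged_relations(
    UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT
) == ["cleanup", "init"]
```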

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
Hi,

On 2019-05-13 13:21:33 -0400, Alvaro Herrera wrote:
> On 2019-May-13, Robert Haas wrote:
> > If that's the scenario, I'm not sure the smgrimmedsync() call is
> > sufficient.  Suppose we log_smgrcreate() but then crash before
> > smgrimmedsync()... seems like we'd need to do them in the other order,
> > or else maybe just pass a flag to copy_file() telling it not to be so
> > picky.
> 
> Well, if we do modify copy_file and thus we don't think the deletion
> step serves any purpose, why not just get rid of the deletion step
> entirely?

Well, it does serve a purpose:
        /*
         * We're in recovery, so unlogged relations may be trashed and must be
         * reset.  This should be done BEFORE allowing Hot Standby
         * connections, so that read-only backends don't try to read whatever
         * garbage is left over from before.
         */

but I'm somewhat doubtful that this logic is correct.

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
Hi,

On 2019-05-13 13:33:00 -0400, Alvaro Herrera wrote:
> On 2019-May-13, Andres Freund wrote:
> 
> > On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote:
> > > On 2019-May-13, Andres Freund wrote:
> 
> > > The first ResetUnloggedRelations call occurs before any WAL is replayed,
> > > so the data dir is certainly still in an inconsistent state.  At that point,
> > > we need the init fork files to be present, because the init files are the
> > > indicators of what relations we need to delete the other forks for.
> > 
> > Hm. I think this might be a self-made problem. For the main fork, we
> > don't need this - if the init fork was created before the last
> > checkpoint/restartpoint, it'll be on-disk. If it was created afterwards,
> > WAL replay will recreate both the main and init forks. So the problem is just
> > that the VM fork might survive, because it'll not get nuked given the
> > current arrangement. Is that what you're thinking about?

I was wrong here - I thought we WAL logged the main fork creation even
for unlogged tables. I think it's foolish that we don't, but we don't.

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From
Michael Paquier
Date:
On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote:
> I was wrong here - I thought we WAL logged the main fork creation even
> for unlogged tables. I think it's foolish that we don't, but we don't.

Why?  The main fork is not actually necessary, and the beginning of
recovery would make sure that any existing main fork gets removed and
that the main fork gets recreated by copying it from the init fork,
which is WAL-logged.  So we definitely have to log the init fork, but
logging the main fork is just useless data in the WAL record.  Or you
have something else in mind?
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From
Michael Paquier
Date:
On Mon, May 13, 2019 at 11:28:32AM -0400, Robert Haas wrote:
> 1. The comment added in that commit says "Recovery may as well remove
> it while replaying..." but what is really meant is "Recovery may well
> remove it while replaying..."  The phrase "may as well" means that
> there isn't really any reason not to do it even if it's not strictly
> necessary.  The phrase "may well" means that it's entirely possible.
> The latter meaning is the one we want here.

Perhaps I wrote this comment.  Per what you say, indeed we should use
"may well" here, to outline the fact that it is a possibility.  I have
to admit that I don't completely get the difference with "may as
well", which also sounds like a possibility by reading this comment
today, but I am no native speaker.

> 2. The comment as adjusted in that commit refers to needing an
> immediate sync "even if the page has been logged, because the write
> did not go through shared_buffers," but there is no page and no write,
> because an empty heap is just an empty file.   That logic makes sense
> for index AMs that bypass shared buffers to write a metapage, e.g.
> nbtree, as opposed to others which go through shared_buffers and don't
> have the immediate sync, e.g. brin.  However, the heap writes no pages
> either through shared buffers or bypassing shared buffers, so either
> I'm confused or the comment makes no sense.

We need to ensure that the init fork, even if empty, still exists so
that recovery can do its job of detecting that this is an unlogged table
and then recreate the empty main fork.  Perhaps the comment should
insist on the on-disk existence of the relfilenode instead, which makes
logging this page necessary, however empty it is.

> 3. Before that commit, the comment said that "the immediate sync is
> required, because otherwise there's no guarantee that this will hit
> the disk before the next checkpoint moves the redo pointer."  However,
> that justification seems to imply that *anything* that does smgrcreate +
> log_smgrcreate would also need to do smgrimmedsync, and
> RelationCreateStorage doesn't.  Unless, for some reason that I'm not
> thinking of right now, the init fork has stronger durability
> requirements than the main fork, it's hard to understand why
> heapam_relation_set_new_filenode can call RelationCreateStorage to do
> smgrcreate + log_smgrcreate for the main fork and that's OK, but when
> it does the same thing itself for the init fork, it now needs
> smgrimmedsync as well.
>
> My guess, just shooting from the hip, is that the smgrimmedsync call
> can be removed here.  If that's wrong, then we need a better
> explanation for why it's needed, and we possibly need to add it to
> every single place that does smgrcreate that doesn't have it already.

That's mentioned upthread, and my sloppy memory of the matter is that
we need the fsync call to ensure that the init fork is present at the
start of recovery, and that logging makes sure that it does not get
wiped out when replaying a database or tablespace creation record.
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From
Michael Paquier
Date:
On Mon, May 13, 2019 at 10:37:35AM -0700, Andres Freund wrote:
> Ugh, this is all such a mess. But, isn't this broken independently of
> the smgrimmedsync() issue? In a basebackup case, the basebackup could
> have included the main fork, but not the init fork, and the reverse. WAL
> replay *solely* needs to be able to recover from that.  At the very
> least we'd have to do the cleanup step after becoming consistent, not
> just before recovery even started.

Yes, the logic using smgrimmedsync() is race-prone and weaker than
what the index AMs do, in my opinion, even if the failure window is
limited (I think that this is mentioned upthread a bit).  What is
actually preventing us from delaying the checkpointer, as the index
AMs do, when logging the heap init fork?  The fact that the heap init
fork is an empty page?
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From
Andres Freund
Date:
Hi,

On 2019-05-14 12:50:09 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote:
> > I was wrong here - I thought we WAL logged the main fork creation even
> > for unlogged tables. I think it's foolish that we don't, but we don't.
>
> Why?  The main fork is not actually necessary, and the beginning of
> recovery would make sure that any existing main fork gets removed and
> that the main fork gets recreated by copying it from the init fork,
> which is WAL-logged.

Several things:

1) A main fork that's created while a base-backup is in progress might
   only partially exist on a HS node. We won't necessarily remove it,
   because the init fork might *not* have been copied. The init fork
   will be re-created by WAL replay, but that'll be too late for the
   ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) call.

   Which then will

   a) cause ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to fail, because the
      target file already exists, as mentioned elsewhere in this thread

   b) cause trouble when accessing the unlogged table - unfortunately there
      are several ways to access unlogged tables on a HS node - so the
      main fork better either not exist, or be in a valid empty
      state. Case in point:
      COPY blarg TO STDOUT;
      does not throw an error due to unlogged-ness.

   If we instead properly WAL logged the MAIN fork creation b) wouldn't
   be a problem.

   And we could have the UNLOGGED_RELATION_INIT case re-use the
   hashtable built at UNLOGGED_RELATION_CLEANUP time, and avoid a second
   scan of the filesystem. As all relations created after the LSN at
   which crash recovery started would be valid, we would not need to do
   a full second scan: all unlogged relations created after that LSN are
   guaranteed to have a consistent main fork.

2) With a small bit of additional work, the sketch to fix 1) would allow
   us to allow read access to unlogged relations during HS. We'd just
   have to do the UNLOGGED_RELATION_INIT work at the start of recovery.

3) I'm somewhat certain that there are relfilenode reuse issues with the
   current approach, again during base backups (as there the checkpoint
   logic to prevent relfilenode reuse doesn't work). I think you can
   easily end up with main / vm / fsm / init forks that actually don't
   belong to the same relation for a base backup. If you don't force
   delete all other forks at the time the init fork record is replayed,
   there's no defense against that.

There's probably more, but I gotta cook.


> So we definitely have to log the init fork, but logging the main fork
> is just useless data in the WAL record.  Or you have something else in
> mind?

Well, as I argued somewhere else in this thread, I think this all should
just be one WAL record.

Something very roughly like:

struct xl_smgr_create
{
        RelFileNode     rnode;
        ForkNumber    forkNum;
        bool        remove_all_other_forks;
        bool        copy_init_to_main_fork;
        size_t        data_length;
        char        fork_contents[];
};

Presumably with a flags argument instead of separate booleans.


Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From:
Andres Freund
Date:
Hi,

On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 10:37:35AM -0700, Andres Freund wrote:
> > Ugh, this is all such a mess. But, isn't this broken independently of
> > the smgrimmedsync() issue? In a basebackup case, the basebackup could
> > have included the main fork, but not the init fork, and the reverse. WAL
> > replay *solely* needs to be able to recover from that.  At the very
> > least we'd have to do the cleanup step after becoming consistent, not
> > just before recovery even started.
> 
> Yes, the logic using smgrimmedsync() is race-prone and weaker than the
> index AMs in my opinion, even if the failure window is limited (I
> think that this is mentioned upthread a bit).

How's it limited? On a large database a base backup easily can take
*days*. And e.g. VM and FSM can easily have inodes that are much newer
than the main/init forks, so typical base-backups (via OS/glibc
readdir) will sort them at a later point (or it'll be hashed, in which
case it's entirely random), so the window between when the different
forks are copied are large.


> What's actually the reason preventing us from delaying the
> checkpointer like the index AMs for the logging of heap init fork?

I'm not following. What do you mean by "delaying the checkpointer"?

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From:
Michael Paquier
Date:
On Mon, May 13, 2019 at 09:33:52PM -0700, Andres Freund wrote:
> On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
>> What's actually the reason preventing us from delaying the
>> checkpointer like the index AMs for the logging of heap init fork?
>
> I'm not following. What do you mean by "delaying the checkpointer"?

I mean what Robert has mentioned here:
https://www.postgresql.org/message-id/CA+TgmoZ4TWaPCKhF-szV-nPxDXL40zCwm9pNFJZURvRgm2oJzQ@mail.gmail.com

And my gut tells me that he got that right, because we are discussing
race conditions with crashes and checkpoints in-between calls to
smgrimmedsync() and log_newpage().  That could be invasive for
back-branches, but for HEAD this would make the whole init fork
handling saner.
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From:
Andres Freund
Date:
Hi,

On 2019-05-14 14:22:15 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 09:33:52PM -0700, Andres Freund wrote:
> > On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
> >> What's actually the reason preventing us from delaying the
> >> checkpointer like the index AMs for the logging of heap init fork?
> > 
> > I'm not following. What do you mean by "delaying the checkpointer"?
> 
> I mean what Robert has mentioned here:
> https://www.postgresql.org/message-id/CA+TgmoZ4TWaPCKhF-szV-nPxDXL40zCwm9pNFJZURvRgm2oJzQ@mail.gmail.com

That's a proposal, not something we actually ended up doing, though? That's
what confuses me about your earlier paragraph: "like the index AMs"?
Where are we doing that in index AMs?


> And my gut tells me that he got that right, because we are discussing
> race conditions with crashes and checkpoints in-between calls to
> smgrimmedsync() and log_newpage().  That could be invasive for
> back-branches, but for HEAD this would make the whole init fork
> handling saner.

How would this protect against the issues I mentioned where recovery
starts from an earlier checkpoint and the basebackup could pick up a
random set of version of the different forks?

I don't like the idea of expanding the use of delayChkpt further for
common operations, if anything we ought to try to reduce it. But I also
don't see how it'd actually fix the issues, so perhaps that's moot.

Greetings,

Andres Freund



Re: [HACKERS] Unlogged tables cleanup

From:
Robert Haas
Date:
On Tue, May 14, 2019 at 2:19 AM Andres Freund <andres@anarazel.de> wrote:
> How would this protect against the issues I mentioned where recovery
> starts from an earlier checkpoint and the basebackup could pick up a
> random set of version of the different forks?
>
> I don't like the idea of expanding the use of delayChkpt further for
> common operations, if anything we ought to try to reduce it. But I also
> don't see how it'd actually fix the issues, so perhaps that's moot.

Based on the discussion so far, I think there are a number of related
problems here:

1. A base backup might copy the main fork but not the init fork. I
think that's a problem that nobody's thought about before Andres
raised it on this thread.  I may be wrong. Anyway, if that happens,
the start-of-recovery code isn't going to do the right thing, because
it won't know that it's dealing with an unlogged relation.

2. Suppose the system reaches the end of
heapam_relation_set_new_filenode and then the system crashes.  Because
of the smgrimmedsync(), and only because of the smgrimmedsync(), the
init fork would exist at the start of recovery.  However, unlike the
heap, some of the index AMs don't actually have a call to
smgrimmedsync() in their *buildempty() functions.  If I'm not
mistaken, the ones that write pages through shared_buffers do not do
smgrimmedsync, and the ones that use private buffers do perform
smgrimmedsync.  Therefore, the ones that write pages through
shared_buffers are vulnerable to a crash right after the unlogged
table is created.  The init fork could fail to survive the crash, and
therefore the start-of-recovery code would again fail to know that
it's dealing with an unlogged relation.

3. So, why is it like that, anyway?  Why do index AMs that write pages
via shared_buffers not have smgrimmedsync?  The answer is that we did
it like that because we were worrying about a different problem -
specifically, checkpointing.  If I dirty a buffer and write a WAL
record for the changes that I made, it is guaranteed that the dirty
data will make it to disk before the next checkpoint is written; we've
got all sorts of interlocking in BufferSync, SyncOneBuffer, etc. to
make sure that happens.  But if I create a file in the filesystem and
write a WAL record for that change, none of that interlocking does
anything.  So unless I not only create that file but smgrimmedsync()
it before the next checkpoint's redo location is fixed, a crash could
make the file disappear.

For example, consider RelationCreateStorage.  What prevents the
following sequence of events?

S1: smgropen()
S1: smgrcreate()
S1: log_smgrcreate()
-- at this point the checkpointer fixes the redo pointer, writes every
dirty buffer in the system, and completes a checkpoint
S1: commits
-- crash

On restart, I think we could potentially be missing the file
created by smgrcreate(), and we won't replay the WAL record generated
by log_smgrcreate() either because it's before the checkpoint.

I believe that it is one aspect of this third problem that we
previously fixed on this thread, but I think we failed to understand
how general the issue actually is.  It affects _init forks of unlogged
tables, but I think it also affects essentially every other use of
smgrcreate(), whether it's got anything to do with unlogged tables or
not. For an index AM, which has a non-empty initial representation,
the consequences are pretty limited, because the transaction can't
commit having created only an empty fork.  It's got to put some data
into either the main fork or the init fork, as the case may be.  If it
aborts, then we may leave some orphaned files behind, but if we lose
one, we just have fewer orphaned files, so nobody cares.  And if it
commits, then either everything's before the checkpoint (and the file
will have been fsync'd because we fsync'd the buffers), or
everything's after the checkpoint (so we will replay the WAL), or only
the smgrcreate() is before the checkpoint and the rest is after (in
which case XLogReadBufferExtended will do the smgrcreate over again
for us and we'll be fine).  But since the heap uses an empty initial
representation, it enjoys no similar protection.

So, IIUC, the reason why we were talking about delayCkpt before is
because it would allow us to remove the smgrimmedsync() of the
unlogged fork while still avoiding problem #3.  However it does
nothing to protect against problem #1 or #2.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From:
Michael Paquier
Date:
On Mon, May 20, 2019 at 09:16:32AM -0400, Robert Haas wrote:
> Based on the discussion so far, I think there are a number of related
> problems here:

Thanks for the summary.

> 1. A base backup might copy the main fork but not the init fork. I
> think that's a problem that nobody's thought about before Andres
> raised it on this thread.  I may be wrong. Anyway, if that happens,
> the start-of-recovery code isn't going to do the right thing, because
> it won't know that it's dealing with an unlogged relation.

Hmm.  I got to think harder about this one, and I find the argument
of Andres from upthread to use (CLEANUP | INIT) for
ResetUnloggedRelations() at the end of recovery quite sensible:
https://www.postgresql.org/message-id/20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de

It seems that the main risk is to finish with partially-present index
main forks.  For heap, the main fork would still be empty.

> 2. Suppose the system reaches the end of
> heapam_relation_set_new_filenode and then the system crashes.  Because
> of the smgrimmedsync(), and only because of the smgrimmedsync(), the
> init fork would exist at the start of recovery.  However, unlike the
> heap, some of the index AMs don't actually have a call to
> smgrimmedsync() in their *buildempty() functions.  If I'm not
> mistaken, the ones that write pages through shared_buffers do not do
> smgrimmedsync, and the ones that use private buffers do perform
> smgrimmedsync.

Yes, that maps with what I can see in the code for the various empty()
callbacks.

> Therefore, the ones that write pages through
> shared_buffers are vulnerable to a crash right after the unlogged
> table is created.  The init fork could fail to survive the crash, and
> therefore the start-of-recovery code would again fail to know that
> it's dealing with an unlogged relation.

Taking the example of gist, which uses shared buffers for the init
fork logging, we take an exclusive lock on the buffer involved while
logging the init fork, meaning that the checkpoint is not able to
complete until the lock is released and the buffer is flushed.  Do you
have a specific sequence in mind?

> 3. So, why is it like that, anyway?  Why do index AMs that write pages
> via shared_buffers not have smgrimmedsync?  The answer is that we did
> it like that because we were worrying about a different problem -
> specifically, checkpointing.  If I dirty a buffer and write a WAL
> record for the changes that I made, it is guaranteed that the dirty
> data will make it to disk before the next checkpoint is written; we've
> got all sorts of interlocking in BufferSync, SyncOneBuffer, etc. to
> make sure that happens.  But if I create a file in the filesystem and
> write a WAL record for that change, none of that interlocking does
> anything.  So unless I not only create that file but smgrimmedsync()
> it before the next checkpoint's redo location is fixed, a crash could
> make the file disappear.
>
> On restart, I think we could potentially be missing the file
> created by smgrcreate(), and we won't replay the WAL record generated
> by log_smgrcreate() either because it's before the checkpoint.

Right, that's possible.  If you take the case of btree, the problem is
the window between the sync and the page logging which is unsafe.

> I believe that it is one aspect of this third problem that we
> previously fixed on this thread, but I think we failed to understand
> how general the issue actually is.  It affects _init forks of unlogged
> tables, but I think it also affects essentially every other use of
> smgrcreate(), whether it's got anything to do with unlogged tables or
> not. For an index AM, which has a non-empty initial representation,
> the consequences are pretty limited, because the transaction can't
> commit having created only an empty fork.  It's got to put some data
> into either the main fork or the init fork, as the case may be.  If it
> aborts, then we may leave some orphaned files behind, but if we lose
> one, we just have fewer orphaned files, so nobody cares.  And if it
> commits, then either everything's before the checkpoint (and the file
> will have been fsync'd because we fsync'd the buffers), or
> everything's after the checkpoint (so we will replay the WAL), or only
> the smgrcreate() is before the checkpoint and the rest is after (in
> which case XLogReadBufferExtended will do the smgrcreate over again
> for us and we'll be fine).  But since the heap uses an empty initial
> representation, it enjoys no similar protection.

Not sure what to think about this paragraph.  I need to ponder it
more.

> So, IIUC, the reason why we were talking about delayCkpt before is
> because it would allow us to remove the smgrimmedsync() of the
> unlogged fork while still avoiding problem #3.  However it does
> nothing to protect against problem #1 or #2.

Right.  I am not completely sure why #2 is a problem though, please
see my comments above.  For #1 the point raised upthread to do a
cleanup + init at the end of recovery makes sense.
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From:
Robert Haas
Date:
On Tue, May 21, 2019 at 4:17 AM Michael Paquier <michael@paquier.xyz> wrote:
> > 2. Suppose the system reaches the end of
> > heapam_relation_set_new_filenode and then the system crashes.  Because
> > of the smgrimmedsync(), and only because of the smgrimmedsync(), the
> > init fork would exist at the start of recovery.  However, unlike the
> > heap, some of the index AMs don't actually have a call to
> > smgrimmedsync() in their *buildempty() functions.  If I'm not
> > mistaken, the ones that write pages through shared_buffers do not do
> > smgrimmedsync, and the ones that use private buffers do perform
> > smgrimmedsync.
>
> Yes, that maps with what I can see in the code for the various empty()
> callbacks.
>
> > Therefore, the ones that write pages through
> > shared_buffers are vulnerable to a crash right after the unlogged
> > table is created.  The init fork could fail to survive the crash, and
> > therefore the start-of-recovery code would again fail to know that
> > it's dealing with an unlogged relation.
>
> Taking the example of gist which uses shared buffers for the init
> fork logging, we take an exclusive lock on the buffer involved while
> logging the init fork, meaning that the checkpoint is not able to
> complete until the lock is released and the buffer is flushed.  Do you
> have a specific sequence in mind?

Yes.  I thought I had described it.  You create an unlogged table,
with an index of a type that does not smgrimmedsync(), your
transaction commits, and then the system crashes, losing the _init
fork for the index.

There's no checkpoint involved in this scenario, so any argument that
it can't be a problem because of checkpoints is necessarily incorrect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From:
Michael Paquier
Date:
On Tue, May 21, 2019 at 08:39:18AM -0400, Robert Haas wrote:
> Yes.  I thought I had described it.  You create an unlogged table,
> with an index of a type that does not smgrimmedsync(), your
> transaction commits, and then the system crashes, losing the _init
> fork for the index.

The init forks won't magically go away, except in one case: empty
routines not going through shared buffers.

The empty routines which write pages without going through shared
buffers still log them, so those pages could go away if the host
crashes and a checkpoint completed after the page has been logged but
before smgrimmedsync() was called.  Please note the logic in
spgbuildempty() as an example.

Then, empty routines going through shared buffers fill in one or more
buffers, initialize them as empty, dirty them, log the page(s) and
then unlock the buffer(s).  If a crash happens after the transaction
commits, we would still have the init page in WAL, and at the end of
recovery we would know about it.

Or is there something in log_newpage_buffer()/log_newpage() that I am
missing but that you foresee?
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From:
Robert Haas
Date:
On Thu, May 23, 2019 at 2:43 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, May 21, 2019 at 08:39:18AM -0400, Robert Haas wrote:
> > Yes.  I thought I had described it.  You create an unlogged table,
> > with an index of a type that does not smgrimmedsync(), your
> > transaction commits, and then the system crashes, losing the _init
> > fork for the index.
>
> The init forks won't magically go away, except in one case for empty
> routines not going through shared buffers.

No magic is required.  If you haven't called fsync(), the file might
not be there after a system crash.

Going through shared_buffers guarantees that the file will be
fsync()'d before the next checkpoint, but I'm talking about a scenario
where you crash before the next checkpoint.

> Then, empty routines going through shared buffers fill in one or more
> buffers, mark it/them as empty, dirty it/them, log the page(s) and then
> unlock the buffer(s).  If a crash happens after the transaction
> commits, so we would still have the init page in WAL, and at the end
> of recovery we would know about it.

Yeah, but the problem is that the current system requires us to know
about it at the *beginning* of recovery.  See my earlier remarks:

Suppose we create an unlogged table and then crash. The main fork
makes it to disk, and the init fork does not.  Before WAL replay, we
remove any main forks that have init forks, but because the init fork
was lost, that does not happen.  Recovery recreates the init fork.
After WAL replay, we try to copy_file() each _init fork to the
corresponding main fork. That fails, because copy_file() expects to be
able to create the target file, and here it can't do that because it
already exists.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Unlogged tables cleanup

From:
Michael Paquier
Date:
On Thu, May 23, 2019 at 09:14:59AM -0400, Robert Haas wrote:
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not.  Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen.  Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to be
> able to create the target file, and here it can't do that because it
> already exists.

Ah, sorry.  Now I understand your point.  I got your previous message
wrong as I thought about the fact that init forks are logged, hence
after a crash you should still have these at the end of recovery, and
I got confused when you wrote that init forks just disappear.  However
you were referring to the state of a cluster before recovery begins,
and I was thinking about the state where the cluster has run recovery
and has reached consistency.

If we should do something about such cases, then this brings us back
to do (INIT | CLEANUP) at the end of recovery anyway?
--
Michael

Attachments

Re: [HACKERS] Unlogged tables cleanup

From:
Robert Haas
Date:
On Sun, May 26, 2019 at 10:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> If we should do something about such cases, then this brings us back
> to do (INIT | CLEANUP) at the end of recovery anyway?

I believe that could fix problems #1 and #2, but we'd have to think
about what new issues it would introduce.

#3 seems like it needs a different fix.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company