Discussion: Unlogged tables cleanup
Hi, hackers

I wonder if such behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,100000));
INSERT 0 100000

Now simulate a server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So the whole relation directory is removed! It happens only for the first table created in the tablespace. If you create the table in the Postgres data directory, everything is OK: the first segment of the relation is truncated but not deleted. Also, if you create one more unlogged table in the tablespace, it is truncated correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,100000));
INSERT 0 100000
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768_init

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Now simulate server crash using "pkill -9 postgres".
>
> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
> logfile start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> knizhnik@knizhnik:~/dtm-data$ psql postgres
> psql (10devel)
> Type "help" for help.
>
> postgres=# select * from foo;
> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
> No such file or directory
>
> knizhnik@knizhnik:~/dtm-data$ ls fs
> PG_10_201611041
> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>
> So all relation directory is removed!
> It happens only for first table created in tablespace.
> If you create table in Postgres data directory everything is ok: first
> segment of relation is truncated but not deleted.

Whoa.  There should be an _init fork that doesn't get removed, and without removing the _init fork you shouldn't be able to remove the directory that contains it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 10, 2016 at 3:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 11:56 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Now simulate server crash using "pkill -9 postgres".
>>
>> knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l
>> logfile start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> knizhnik@knizhnik:~/dtm-data$ psql postgres
>> psql (10devel)
>> Type "help" for help.
>>
>> postgres=# select * from foo;
>> ERROR:  could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385":
>> No such file or directory
>>
>> knizhnik@knizhnik:~/dtm-data$ ls fs
>> PG_10_201611041
>> knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/
>>
>> So all relation directory is removed!
>> It happens only for first table created in tablespace.
>> If you create table in Postgres data directory everything is ok: first
>> segment of relation is truncated but not deleted.
>
> Whoa.  There should be an _init fork that doesn't get removed, and
> without removing the _init fork you shouldn't be able to remove the
> directory that contains it.

Hm.. I cannot reproduce what you see on Linux or macOS. Perhaps you have a standby running locally that also points to this tablespace?

-- 
Michael
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:
>
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository. Please note that you need to create a new database and tablespace to reproduce this issue. So the whole sequence is:

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,100000));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

> --
> Michael
On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik <k.knizhnik@postgrespro.ru> wrote:
> No, it is latest sources from Postgres repository.
> Please notice that you should create new database and tablespace to reproduce this issue.
> So actually the whole sequence is
>
> mkdir fs
> initdb -D pgsql
> pg_ctl -D pgsql -l logfile start
> psql postgres
> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
> # set default_tablespace=fs;
> # create unlogged table foo(x integer);
> # insert into foo values(generate_series(1,100000));
> # ^D
> pkill -9 postgres
> pg_ctl -D pgsql -l logfile start
> # select * from foo;

OK, I understood what I was missing. This can be reproduced with wal_level = minimal. With wal_level = hot_standby, things work properly.

-- 
Michael
On Thu, Nov 10, 2016 at 4:33 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 4:23 PM, konstantin knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> No, it is latest sources from Postgres repository.
>> Please notice that you should create new database and tablespace to reproduce this issue.
>> So actually the whole sequence is
>>
>> mkdir fs
>> initdb -D pgsql
>> pg_ctl -D pgsql -l logfile start
>> psql postgres
>> # create tablespace fs location '/home/knizhnik/dtm-data/fs';
>> # set default_tablespace=fs;
>> # create unlogged table foo(x integer);
>> # insert into foo values(generate_series(1,100000));
>> # ^D
>> pkill -9 postgres
>> pg_ctl -D pgsql -l logfile start
>> # select * from foo;
>
> OK, I understood what I was missing. This can be reproduced with
> wal_level = minimal. When using hot_standby things are working
> properly.

Okay, so what happens is that the CREATE TABLESPACE record removes the tablespace directory and recreates a fresh one, but as no CREATE records are generated for unlogged tables, the init fork is not re-created. It seems to me that we should log a record to recreate the INIT fork, and that heap_create_with_catalog() is missing something. Generating a record in RelationCreateStorage() is the more direct way, and that actually fixes the issue. However, the comments at the top of it mention that RelationCreateStorage() should only be used to create the MAIN forknum. So, wouldn't a correct fix be to log this INIT record at the end of heap_create()?

-- 
Michael
On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Okay, so what happens is that the CREATE TABLESPACE record removes the
> tablespace directory and recreates a fresh one, but as no CREATE
> records are created for unlogged tables the init fork is not
> re-created. It seems to me that we should log a record to recreate the
> INIT fork, and that heap_create_with_catalog() is missing something.
> Generating a record in RelationCreateStorage() is the more direct way,
> and that actually fixes the issue. Now the comments at the top of it
> mention that RelationCreateStorage() should only be used to create the
> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
> the end of heap_create()?

Nah. Looking at the code, the fix is quite obvious. heap_create_init_fork() checks XLogIsNeeded() to decide whether the INIT forknum should be logged or not. But this is wrong: it needs to be done unconditionally to ensure that the relation gets re-created at replay.

-- 
Michael
Attachments
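[For readers following along, the change being proposed here is small: drop the XLogIsNeeded() test in heap_create_init_fork(). A simplified sketch of the function after the change, paraphrased from src/backend/catalog/heap.c of that era rather than the exact committed hunk:]

```c
void
heap_create_init_fork(Relation rel)
{
	RelationOpenSmgr(rel);
	smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);

	/*
	 * Previously this call was guarded by "if (XLogIsNeeded())".  Logging
	 * the init fork unconditionally ensures that replay re-creates it
	 * even with wal_level = minimal, after a tablespace- or
	 * database-creation record has wiped and re-created the directory
	 * that contained it.
	 */
	log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);

	smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
}
```

[The smgrimmedsync() call stays: the WAL record guarantees re-creation at replay, while the sync guarantees durability when a later checkpoint makes the record unreplayed.]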
On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 5:15 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Okay, so what happens is that the CREATE TABLESPACE record removes the
>> tablespace directory and recreates a fresh one, but as no CREATE
>> records are created for unlogged tables the init fork is not
>> re-created. It seems to me that we should log a record to recreate the
>> INIT fork, and that heap_create_with_catalog() is missing something.
>> Generating a record in RelationCreateStorage() is the more direct way,
>> and that actually fixes the issue. Now the comments at the top of it
>> mention that RelationCreateStorage() should only be used to create the
>> MAIN forknum. So, wouldn't a correct fix be to log this INIT record at
>> the end of heap_create()?
>
> Nah. Looking at the code the fix is quite obvious.
> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
> the INIT forknum should be logged or not. But this is wrong, it needs
> to be done unconditionally to ensure that the relation gets created at
> replay.

I think that we should also update the other *buildempty() functions. For example, if the table has a primary key, we'll encounter the error again for the btree index.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Nah. Looking at the code the fix is quite obvious.
>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>> the INIT forknum should be logged or not. But this is wrong, it needs
>> to be done unconditionally to ensure that the relation gets created at
>> replay.
>
> I think that we should also update other *buildempty() functions as well.
> For example, if the table has a primary key, we'll encounter the error again
> for btree index.

Good point. I just had a look at all the AMs: bloom, btree and SP-GiST are plainly broken. The others are fine. Attached is an updated patch.

Here are the tests I have done with wal_level = minimal to cover all the AMs:

\! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
create extension bloom;
create extension btree_gist;
create extension btree_gin;
create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
set default_tablespace = popo;
create unlogged table foo (a int);
insert into foo values (generate_series(1,10000));
create index foo_bt on foo(a);
create index foo_bloom on foo using bloom(a);
create index foo_gin on foo using gin (a);
create index foo_gist on foo using gist (a);
create index foo_brin on foo using brin (a);
create unlogged table foo_geo (a box);
insert into foo_geo values ('(1,2),(2,3)');
create index foo_spgist ON foo_geo using spgist(a);

-- crash PG
-- Insert some data
insert into foo values (1000000);
insert into foo_geo values ('(50,12),(312,3)');

This should create 8 INIT forks, 6 for the indexes and 2 for the tables. On HEAD, only 3 are getting created; with the patch, all of them are.

-- 
Michael
Attachments
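[The per-AM side of the fix follows the same pattern in each *buildempty() routine: write the init-fork page, then call log_newpage() unconditionally. A sketch of btree's btbuildempty() after the change, paraphrased from src/backend/access/nbtree/nbtree.c of that era; the bloom and SP-GiST routines are analogous:]

```c
void
btbuildempty(Relation index)
{
	Page		metapage;

	/* Construct the btree metapage for the init fork. */
	metapage = (Page) palloc(BLCKSZ);
	_bt_initmetapage(metapage, P_NONE, 0);

	/* Write the page and log it unconditionally -- no XLogIsNeeded() test. */
	PageSetChecksumInplace(metapage, BTREE_METAPAGE);
	smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
			  (char *) metapage, true);
	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
				BTREE_METAPAGE, metapage, false);

	/* The immediate sync is still required; see the rest of the thread. */
	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
}
```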
On Thu, Nov 10, 2016 at 9:25 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Nov 10, 2016 at 10:52 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> On Thu, Nov 10, 2016 at 3:42 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Nah. Looking at the code the fix is quite obvious.
>>> heap_create_init_fork() is checking for XLogIsNeeded() to decide if
>>> the INIT forknum should be logged or not. But this is wrong, it needs
>>> to be done unconditionally to ensure that the relation gets created at
>>> replay.
>>
>> I think that we should also update other *buildempty() functions as well.
>> For example, if the table has a primary key, we'll encounter the error again
>> for btree index.
>
> Good point. I just had a look at all the AMs: bloom, btree and SP-gist
> are plainly broken. The others are fine. Attached is an updated patch.
>
> Here are the tests I have done with wal_level = minimal to test all the AMs:
> \! rm -rf /Users/mpaquier/Desktop/tbsp/PG*
> create extension bloom;
> create extension btree_gist;
> create extension btree_gin;
> create tablespace popo location '/Users/mpaquier/Desktop/tbsp';
> set default_tablespace = popo;
> create unlogged table foo (a int);
> insert into foo values (generate_series(1,10000));
> create index foo_bt on foo(a);
> create index foo_bloom on foo using bloom(a);
> create index foo_gin on foo using gin (a);
> create index foo_gist on foo using gist (a);
> create index foo_brin on foo using brin (a);
> create unlogged table foo_geo (a box);
> insert into foo_geo values ('(1,2),(2,3)');
> create index foo_spgist ON foo_geo using spgist(a);
>
> -- crash PG
> -- Insert some data
> insert into foo values (1000000);
> insert into foo_geo values ('(50,12),(312,3)');
>
> This should create 8 INIT forks, 6 for the indexes and 2 for the
> tables. On HEAD, only 3 are getting created and with the patch all of
> them are.
The header comment for heap_create_init_fork() says this:

/*
 * Set up an init fork for an unlogged table so that it can be correctly
 * reinitialized on restart.  Since we're going to do an immediate sync, we
 * only need to xlog this if archiving or streaming is enabled.  And the
 * immediate sync is required, because otherwise there's no guarantee that
 * this will hit the disk before the next checkpoint moves the redo pointer.
 */

Your patch causes the code not to match the comment any more. And the comment explains why, at the time I wrote this code, I thought it was OK to have the XLogIsNeeded() test in there, so it clearly needs updating.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The header comment for heap_create_init_fork() says this:
>
> /*
>  * Set up an init fork for an unlogged table so that it can be correctly
>  * reinitialized on restart. Since we're going to do an immediate sync, we
>  * only need to xlog this if archiving or streaming is enabled. And the
>  * immediate sync is required, because otherwise there's no guarantee that
>  * this will hit the disk before the next checkpoint moves the redo pointer.
>  */
>
> Your patch causes the code not to match the comment any more. And the
> comment explains why at the time I wrote this code I thought it was OK
> to have the XLogIsNeeded() test in there, so it clearly needs
> updating.

Indeed, I missed this comment block. Please let me suggest the following instead:

 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart. Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled. And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart. An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.
  */

-- 
Michael
Attachments
On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 11:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The header comment for heap_create_init_fork() says this:
>>
>> /*
>>  * Set up an init fork for an unlogged table so that it can be correctly
>>  * reinitialized on restart. Since we're going to do an immediate sync, we
>>  * only need to xlog this if archiving or streaming is enabled. And the
>>  * immediate sync is required, because otherwise there's no guarantee that
>>  * this will hit the disk before the next checkpoint moves the redo pointer.
>>  */
>>
>> Your patch causes the code not to match the comment any more. And the
>> comment explains why at the time I wrote this code I thought it was OK
>> to have the XLogIsNeeded() test in there, so it clearly needs
>> updating.
>
> Indeed I missed this comment block. Please let me suggest the following instead:
>  /*
>   * Set up an init fork for an unlogged table so that it can be correctly
> - * reinitialized on restart. Since we're going to do an immediate sync, we
> - * only need to xlog this if archiving or streaming is enabled. And the
> - * immediate sync is required, because otherwise there's no guarantee that
> - * this will hit the disk before the next checkpoint moves the redo pointer.
> + * reinitialized on restart. An immediate sync is required even if the
> + * page has been logged, because the write did not go through
> + * shared_buffers and therefore a concurrent checkpoint may have moved
> + * the redo pointer past our xlog record.
>   */

Hmm. Well, that deletes the comment that's no longer true, but it doesn't replace it with any explanation of why we also need to WAL-log it unconditionally, and I think that explanation is not entirely trivial?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Indeed I missed this comment block. Please let me suggest the following instead:
>>  /*
>>   * Set up an init fork for an unlogged table so that it can be correctly
>> - * reinitialized on restart. Since we're going to do an immediate sync, we
>> - * only need to xlog this if archiving or streaming is enabled. And the
>> - * immediate sync is required, because otherwise there's no guarantee that
>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>> + * reinitialized on restart. An immediate sync is required even if the
>> + * page has been logged, because the write did not go through
>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>> + * the redo pointer past our xlog record.
>>   */
>
> Hmm. Well, that deletes the comment that's no longer true, but it
> doesn't replace it with any explanation of why we also need to WAL-log
> it unconditionally, and I think that explanation is not entirely
> trivial?

OK, the original code does not give any special reason either regarding why doing so is safe for archiving or streaming :)

More seriously, if more detail is wanted there, I would suggest something like: "logging the init fork is mandatory in any case to ensure its on-disk presence at recovery replay, even on non-default tablespaces whose base locations are deleted and re-created from scratch if the WAL record in charge of creating the tablespace gets replayed". The problem shows up because of tablespaces being deleted at replay, after all... So perhaps this makes sense. What do you think?

-- 
Michael
On Wed, Nov 16, 2016 at 11:55 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Nov 16, 2016 at 7:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 16, 2016 at 3:54 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Indeed I missed this comment block. Please let me suggest the following instead:
>>>  /*
>>>   * Set up an init fork for an unlogged table so that it can be correctly
>>> - * reinitialized on restart. Since we're going to do an immediate sync, we
>>> - * only need to xlog this if archiving or streaming is enabled. And the
>>> - * immediate sync is required, because otherwise there's no guarantee that
>>> - * this will hit the disk before the next checkpoint moves the redo pointer.
>>> + * reinitialized on restart. An immediate sync is required even if the
>>> + * page has been logged, because the write did not go through
>>> + * shared_buffers and therefore a concurrent checkpoint may have moved
>>> + * the redo pointer past our xlog record.
>>>   */
>>
>> Hmm. Well, that deletes the comment that's no longer true, but it
>> doesn't replace it with any explanation of why we also need to WAL-log
>> it unconditionally, and I think that explanation is not entirely
>> trivial?
>
> OK, the original code does not give any special reason either
> regarding why doing so is safe for archiving or streaming :)

Yeah, but surely it's obvious that if you don't WAL-log it, it's not going to work for archiving or streaming. It's a lot less obvious why you have to WAL-log it when you're not doing either of those things, if you've already written it and fsync'd it locally. How is it going to disappear if it's been fsync'd, one wonders?
> More seriously, if there could be more details regarding that, I would
> think that we could say something like "logging the init fork is
> mandatory in any case to ensure its on-disk presence when at recovery
> replay, even on non-default tablespaces whose base location are
> deleted and re-created from scratch if the WAL record in charge of
> creating this tablespace gets replayed". The problem shows up because
> of tablespaces being deleted at replay at the end... So perhaps this
> makes sense. What do you think?

Yes, that's about what I think we need to explain.

Actually, I'm wondering if we ought to just switch this over to using the delayChkpt mechanism instead of going through this complicated song-and-dance. Incurring an immediate sync to avoid having to WAL-log this is a bit tenuous, but having to WAL-log it AND do the immediate sync just seems silly, and I'm actually a bit worried that even with your fix there might still be a latent bug somewhere. We couldn't switch mechanisms cleanly in the 9.2 branch (cf. f21bb9cfb5646e1793dcc9c0ea697bab99afa523), so perhaps we should back-patch it in the form you have it in already, but it's probably worth switching over at least in master.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, but surely it's obvious that if you don't WAL log it, it's not
> going to work for archiving or streaming. It's a lot less obvious why
> you have to WAL log it when you're not doing either of those things if
> you've already written it and fsync'd it locally. How is it going to
> disappear if it's been fsync'd, one wonders?

It's not obvious at all that replaying a WAL record for a database creation or a tablespace creation would cause that! I didn't know before looking at this bug that redo wipes those directories out with rmtree() at replay. One more thing to recall when fixing an issue in this area in the future.

>> More seriously, if there could be more details regarding that, I would
>> think that we could say something like "logging the init fork is
>> mandatory in any case to ensure its on-disk presence when at recovery
>> replay, even on non-default tablespaces whose base location are
>> deleted and re-created from scratch if the WAL record in charge of
>> creating this tablespace gets replayed". The problem shows up because
>> of tablespaces being deleted at replay at the end... So perhaps this
>> makes sense. What do you think?
>
> Yes, that's about what I think we need to explain.

OK, what do you think about the attached then? I came up with something like this for those code paths:

-    /* Write the page.  If archiving/streaming, XLOG it. */
+    /*
+     * Write the page and log it unconditionally. This is important
+     * particularly for indexes created on tablespaces and databases
+     * whose creation happened after the last redo pointer, as recovery
+     * removes any of their existing content when the corresponding
+     * create records are replayed.
+     */

I have not been able to use the word "create" fewer times than that. The patch is quite repetitive, but I think we had better mention the need for logging in each code path rather than touch log_newpage(), to keep it as flexible as possible.
> Actually, I'm wondering if we ought to just switch this over to using
> the delayChkpt mechanism instead of going through this complicated
> song-and-dance. Incurring an immediate sync to avoid having to WAL-log
> this is a bit tenuous, but having to WAL-log it AND do the immediate
> sync just seems silly, and I'm actually a bit worried that even with
> your fix there might still be a latent bug somewhere. We couldn't
> switch mechanisms cleanly in the 9.2 branch (cf.
> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
> back-patch it in the form you have it in already, but it's probably
> worth switching over at least in master.

Thinking through it... Could we not just rip out the sync phase, because there is delayChkpt? It seems to me that what counts is that the commit of the transaction that creates the relation does not get past the redo point. It would matter for read-uncommitted transactions, but that concept does not exist in PG, and never will. On back-branches it is definitely safer to keep the sync; I am just thinking about a HEAD-only optimization here, as you do.

-- 
Michael
Attachments
Michael, your greetings were passed on to me with a request that I look at this thread.

On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> More seriously, if there could be more details regarding that, I would
>>> think that we could say something like "logging the init fork is
>>> mandatory in any case to ensure its on-disk presence when at recovery
>>> replay, even on non-default tablespaces whose base location are
>>> deleted and re-created from scratch if the WAL record in charge of
>>> creating this tablespace gets replayed". The problem shows up because
>>> of tablespaces being deleted at replay at the end... So perhaps this
>>> makes sense. What do you think?
>>
>> Yes, that's about what I think we need to explain.
>
> OK, what do you think about the attached then?
>
> I came up with something like that for those code paths:
> -    /* Write the page. If archiving/streaming, XLOG it. */
> +    /*
> +     * Write the page and log it unconditionally. This is important
> +     * particularly for indexes created on tablespaces and databases
> +     * whose creation happened after the last redo pointer as recovery
> +     * removes any of their existing content when the corresponding
> +     * create records are replayed.
> +     */
> I have not been able to use the word "create" less than that. The
> patch is really repetitive, but I think that we had better mention the
> need of logging the content in each code path and not touch
> log_newpage() to keep it a maximum flexible.

In blinsert.c, nbtree.c, etc. how about:

Write the page and log it. It might seem that an immediate sync would be sufficient to guarantee that the file exists on disk, but recovery itself might remove it while replaying, for example, an XLOG_DBASE_CREATE record. Therefore, we need this even when wal_level=minimal.

>> Actually, I'm wondering if we ought to just switch this over to using
>> the delayChkpt mechanism instead of going through this complicated
>> song-and-dance.
>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>> silly, and I'm actually a bit worried that even with your fix there
>> might still be a latent bug somewhere. We couldn't switch mechanisms
>> cleanly in the 9.2 branch (cf.
>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>> back-patch it in the form you have it in already, but it's probably
>> worth switching over at least in master.
>
> Thinking through it... Could we not just rip off the sync phase
> because there is delayChkpt? It seems to me that what counts is that
> the commit of the transaction that creates the relation does not get
> past the redo point. It would matter for read uncommitted transactions
> but that concept does not exist in PG, and never will. On
> back-branches it is definitely safer to keep the sync, I am just
> thinking about a HEAD-only optimization here as you do.

Right (I think). If we set and clear delayChkpt around this work, we don't need the immediate sync.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
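[A rough sketch of the delayChkpt idea being discussed, to make the proposed control flow concrete. This is hypothetical: the field names and the assumption that the checkpointer's fsync machinery flushes the pending write are guesses against the then-current tree, and this is not what was ultimately committed:]

```c
/* Hypothetical delayChkpt variant of heap_create_init_fork(). */
void
heap_create_init_fork(Relation rel)
{
	/*
	 * Keep any concurrent checkpoint from computing a redo pointer past
	 * our WAL record before the new fork's sync request has been queued.
	 */
	MyPgXact->delayChkpt = true;

	RelationOpenSmgr(rel);
	smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
	log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);

	MyPgXact->delayChkpt = false;

	/*
	 * No smgrimmedsync() here: a checkpoint either waits for us (and is
	 * assumed to flush the fork via its pending-fsync machinery) or its
	 * redo pointer precedes our record, so crash replay re-creates the
	 * fork.
	 */
}
```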
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>>> More seriously, if there could be more details regarding that, I would
>>>> think that we could say something like "logging the init fork is
>>>> mandatory in any case to ensure its on-disk presence when at recovery
>>>> replay, even on non-default tablespaces whose base location are
>>>> deleted and re-created from scratch if the WAL record in charge of
>>>> creating this tablespace gets replayed". The problem shows up because
>>>> of tablespaces being deleted at replay at the end... So perhaps this
>>>> makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> -    /* Write the page. If archiving/streaming, XLOG it. */
>> +    /*
>> +     * Write the page and log it unconditionally. This is important
>> +     * particularly for indexes created on tablespaces and databases
>> +     * whose creation happened after the last redo pointer as recovery
>> +     * removes any of their existing content when the corresponding
>> +     * create records are replayed.
>> +     */
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it. It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record. Therefore, we need this even when
> wal_level=minimal.
OK, I rewrote the patch a bit, as attached. What do you think?

>>> Actually, I'm wondering if we ought to just switch this over to using
>>> the delayChkpt mechanism instead of going through this complicated
>>> song-and-dance. Incurring an immediate sync to avoid having to WAL-log
>>> this is a bit tenuous, but having to WAL-log it AND do the immediate
>>> sync just seems silly, and I'm actually a bit worried that even with
>>> your fix there might still be a latent bug somewhere. We couldn't
>>> switch mechanisms cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip off the sync phase
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read uncommitted transactions
>> but that concept does not exist in PG, and never will. On
>> back-branches it is definitely safer to keep the sync, I am just
>> thinking about a HEAD-only optimization here as you do.
>
> Right (I think). If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is slightly different from what you mean, I think: the transaction creating an unlogged relfilenode would not even need to set delayChkpt in the empty() routines, because other transactions would not refer to it until this transaction has committed. So I am arguing for just removing the sync phase.

-- 
Michael
On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > OK, I rewrote a bit the patch as attached. What do you think? Committed and back-patched all the way back to 9.2. >> Right (I think). If we set and clear delayChkpt around this work, we >> don't need the immediate sync. > > My point is a bit different than what you mean I think: the > transaction creating an unlogged relfilenode would not need to even > set delayChkpt in the empty() routines because other transactions > would not refer to it until this transaction has committed. So I am > arguing about just removing the sync phase. That doesn't sound right; see the comment for heap_create_init_fork. Suppose the transaction creating the unlogged table commits, a checkpoint happens, and then the operating system crashes. Without the immediate sync, the operating system crash can cause the un-sync'd file to disappear, and because of the checkpoint the WAL record that creates it isn't replayed either. So the file's just gone. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
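The commit-checkpoint-crash sequence described above can be traced with a toy model (a sketch in Python with invented names; none of this is PostgreSQL code). The key property being modeled is that the init fork is written outside shared buffers, so a checkpoint neither flushes it nor keeps its creation record in the WAL range replayed after a crash:

```python
# Toy model of WAL, the redo pointer, and unsynced files. Invented names.

class ToyCluster:
    def __init__(self):
        self.wal = []          # list of (lsn, record)
        self.redo_ptr = 0      # crash recovery replays WAL from here
        self.durable = set()   # files safely on disk
        self.dirty = set()     # files written but not yet fsync'd

    def create_init_fork(self, path, immediate_sync):
        self.wal.append((len(self.wal), ("create", path)))
        self.dirty.add(path)
        if immediate_sync:             # the smgrimmedsync() equivalent
            self.durable.add(path)
            self.dirty.discard(path)

    def checkpoint(self):
        # Advances the redo pointer, but does NOT flush writes that went
        # around shared buffers -- which is the init fork's situation.
        self.redo_ptr = len(self.wal)

    def crash_and_recover(self):
        self.dirty.clear()             # the OS crash loses unsynced data
        for lsn, (kind, path) in self.wal:
            if lsn >= self.redo_ptr and kind == "create":
                self.durable.add(path)

def survives(immediate_sync):
    c = ToyCluster()
    c.create_init_fork("base/1/16384_init", immediate_sync)
    c.checkpoint()                     # redo pointer passes the create record
    c.crash_and_recover()
    return "base/1/16384_init" in c.durable

print(survives(immediate_sync=True))   # True
print(survives(immediate_sync=False))  # False -- "the file's just gone"
```

Dropping the checkpoint() call makes both cases survive in this model, since the create record is then replayed; the window only exists once the redo pointer has moved past the record.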
On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> OK, I rewrote a bit the patch as attached. What do you think? > > Committed and back-patched all the way back to 9.2. Thanks! >>> Right (I think). If we set and clear delayChkpt around this work, we >>> don't need the immediate sync. >> >> My point is a bit different than what you mean I think: the >> transaction creating an unlogged relfilenode would not need to even >> set delayChkpt in the empty() routines because other transactions >> would not refer to it until this transaction has committed. So I am >> arguing about just removing the sync phase. > > That doesn't sound right; see the comment for heap_create_init_fork. > Suppose the transaction creating the unlogged table commits, a > checkpoint happens, and then the operating system crashes. Without > the immediate sync, the operating system crash can cause the un-sync'd > file to disappear, and because of the checkpoint the WAL record that > creates it isn't replayed either. So the file's just gone. Doh. That would have made sense if the checkpoint was actually flushing the page if it was in shared buffers. But that's not the case. -- Michael
On Thu, Dec 8, 2016 at 7:49 PM Michael Paquier <michael.paquier@gmail.com> wrote: > On Fri, Dec 9, 2016 at 4:54 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Dec 7, 2016 at 11:20 PM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > >> OK, I rewrote a bit the patch as attached. What do you think? > > > > Committed and back-patched all the way back to 9.2. > > Thanks! This thread resulted in commit fa0f466d5329e10b16f3b38c8eaf5306f7e234e8, and today I had cause to look at that commit again. The code is now in heapam_relation_set_new_filenode. I noticed a few things that seem like they are not quite right. 1. The comment added in that commit says "Recover may as well remove it while replaying..." but what is really meant is "Recovery may well remove it while replaying..." The phrase "may as well" means that there isn't really any reason not to do it even if it's not strictly necessary. The phrase "may well" means that it's entirely possible. The latter meaning is the one we want here. 2. The comment as adjusted in that commit refers to needing an immediate sync "even if the page has been logged, because the write did not go through shared_buffers," but there is no page and no write, because an empty heap is just an empty file. That logic makes sense for index AMs that bypass shared buffers to write a metapage, e.g. nbtree, as opposed to others which go through shared_buffers and don't have the immediate sync, e.g. brin. However, the heap writes no pages either through shared buffers or bypassing shared buffers, so either I'm confused or the comment makes no sense. 3. Before that commit, the comment said that "the immediate sync is required, because otherwise there's no guarantee that this will hit the disk before the next checkpoint moves the redo pointer." However, that justification seems to imply that *anything* that does smgrcreate + log_smgrcreate would also need to do smgrimmedsync, and yet RelationCreateStorage doesn't.
Unless, for some reason that I'm not thinking of right now, the init fork has stronger durability requirements than the main fork, it's hard to understand why heapam_relation_set_new_filenode can call RelationCreateStorage to do smgrcreate + log_smgrcreate for the main fork and that's OK, but when it does the same thing itself for the init fork, it now needs smgrimmedsync as well. My guess, just shooting from the hip, is that the smgrimmedsync call can be removed here. If that's wrong, then we need a better explanation for why it's needed, and we possibly need to add it to every single place that does smgrcreate that doesn't have it already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I agree that the wording "recovery may as well" is incorrect and that "may well" makes it correct. On 2019-May-13, Robert Haas wrote: > My guess, just shooting from the hip, is that the smgrimmedsync call > can be removed here. If that's wrong, then we need a better > explanation for why it's needed, and we possibly need to add it to > every single place that does smgrcreate that doesn't have it already. AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, so it would be sufficient to have the init fork be recovered from WAL for that to work. However, we also do ResetUnloggedRelations *before* replaying WAL in order to remove leftover not-init-fork files, and that process requires that the init fork is present at that time. So I think the immedsync call is necessary (otherwise the cleanup may fail). I don't quite understand why the log_smgrcreate is necessary, but I think it is for reasons that are not adequately explained by the existing comments. IMO if you can remove either the immedsync or the log_smgrcreate call and no test fails, then we're either missing test cases, or (one of) the calls is unnecessary. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
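The two passes Alvaro describes can be sketched over bare file names (a toy model with invented helpers in Python, not the real ResetUnloggedRelations implementation). The point it illustrates is that the pre-replay cleanup pass identifies unlogged relations solely by the presence of an _init file, which is why the init fork must already be durable at that moment:

```python
# Toy model of the two ResetUnloggedRelations() passes, over a set of
# fork file names such as "16384", "16384_fsm", "16384_init".

def cleanup_pass(files):
    # Before WAL replay: for every relfilenode that has an init fork,
    # drop all other forks (their contents may be trash after a crash).
    inits = {f[:-len("_init")] for f in files if f.endswith("_init")}
    return {f for f in files
            if f.endswith("_init") or f.split("_")[0] not in inits}

def init_pass(files):
    # After WAL replay: recreate each main fork from its init fork.
    return files | {f[:-len("_init")] for f in files if f.endswith("_init")}

on_disk = {"16384", "16384_fsm", "16384_init", "16385"}  # 16385 is a logged rel
after_cleanup = cleanup_pass(on_disk)
assert after_cleanup == {"16384_init", "16385"}
assert init_pass(after_cleanup) == {"16384", "16384_init", "16385"}
```

If "16384_init" is missing from on_disk, the cleanup pass keeps the possibly-trashed "16384" and "16384_fsm" untouched, which is the hazard being discussed.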
Hi, On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote: > > My guess, just shooting from the hip, is that the smgrimmedsync call > > can be removed here. If that's wrong, then we need a better > > explanation for why it's needed, and we possibly need to add it to > > every single place that does smgrcreate that doesn't have it already. > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > so it would be sufficient to have the init fork be recovered from WAL > for that to work. However, we also do ResetUnloggedRelations *before* > replaying WAL in order to remove leftover not-init-fork files, and that > process requires that the init fork is present at that time. What scenario are you precisely wondering about? That ResetUnloggedRelations() could overwrite the main fork, while not yet having valid contents (due to the lack of smgrimmedsync())? Shouldn't that only be possible while still in an inconsistent state? A checkpoint would have serialized the correct contents, and we'd not reach HS consistency before having replayed the WAL records resetting the table and the init fork? > So I think > the immedsync call is necessary (otherwise the cleanup may fail). I > don't quite understand why the log_smgrcreate is necessary, but I think > it is for reasons that are not adequately explained by the existing > comments. Well, otherwise the relation won't exist on a standby? And if replay starts from before a database/tablespace creation we'd remove the init fork. So if it's not in the WAL, we'd lose it. > IMO if you can remove either the immedsync or the log_smgrcreate call > and no test fails, then we're either missing test cases, or (one of) the > calls is unnecessary. We definitely don't have a high coverage of more complicated recovery scenarios. Greetings, Andres Freund
On 2019-May-13, Andres Freund wrote: > On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote: > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > > so it would be sufficient to have the init fork be recovered from WAL > > for that to work. However, we also do ResetUnloggedRelations *before* > > replaying WAL in order to remove leftover not-init-fork files, and that > > process requires that the init fork is present at that time. > > What scenario are you precisely wondering about? That > ResetUnloggedRelations() could overwrite the main fork, while not yet > having valid contents (due to the lack of smgrimmedsync())? Shouldn't > that only be possible while still in an inconsistent state? A checkpoint > would have serialized the correct contents, and we'd not reach HS > consistency before having replayed the WAL records resetting the table > and the init fork? The first ResetUnloggedRelations call occurs before any WAL is replayed, so the data dir is certainly still in an inconsistent state. At that point, we need the init fork files to be present, because the init files are the indicators of what relations we need to delete the other forks for. Maybe we can do something lighter than a full immedsync of all the data for the init file -- it would be sufficient to have the file *exist* -- but I'm not sure this optimization is worth anything. > > So I think the immedsync call is necessary (otherwise the cleanup > > may fail). I don't quite understand why the log_smgrcreate is > > necessary, but I think it is for reasons that are not adequately > > explained by the existing comments. > > Well, otherwise the relation won't exist on a standby? And if replay > starts from before a database/tablespace creation we'd remove the init > fork. So if it's not in the WAL, we'd lose it. Ah, of course. Well, that needs to be in the comments then. 
-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote: > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > > so it would be sufficient to have the init fork be recovered from WAL > > for that to work. However, we also do ResetUnloggedRelations *before* > > replaying WAL in order to remove leftover not-init-fork files, and that > > process requires that the init fork is present at that time. > > What scenario are you precisely wondering about? That > ResetUnloggedRelations() could overwrite the main fork, while not yet > having valid contents (due to the lack of smgrimmedsync())? Shouldn't > that only be possible while still in an inconsistent state? A checkpoint > would have serialized the correct contents, and we'd not reach HS > consistency before having replayed that WAL records resetting the table > and the init fork consistency? I think I see what Alvaro is talking about, or at least I think I see *a* possible problem based on his remarks. Suppose we create an unlogged table and then crash. The main fork makes it to disk, and the init fork does not. Before WAL replay, we remove any main forks that have init forks, but because the init fork was lost, that does not happen. Recovery recreates the init fork. After WAL replay, we try to copy_file() each _init fork to the corresponding main fork. That fails, because copy_file() expects to be able to create the target file, and here it can't do that because it already exists. If that's the scenario, I'm not sure the smgrimmedsync() call is sufficient. Suppose we log_smgrcreate() but then crash before smgrimmedsync()... seems like we'd need to do them in the other order, or else maybe just pass a flag to copy_file() telling it not to be so picky. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
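Robert's failure sequence can be traced with a small sketch (invented names, a toy model in Python, not PostgreSQL code): when the init fork never reached disk, the cleanup pass has nothing to key off, the stale main fork survives, and the post-replay copy then trips over it the way copy_file() would:

```python
# Toy walk-through of reset-unlogged handling for one relfilenode,
# modeled as a set of file names on disk.

def reset_unlogged(files):
    if "16384_init" in files:
        files.discard("16384")     # cleanup pass keys off the init fork
    files.add("16384_init")        # WAL replay recreates the init fork
    # Post-replay pass: copy_file() insists on creating the target fresh.
    if "16384" in files:
        raise FileExistsError('could not create file "16384": File exists')
    files.add("16384")
    return files

# Init fork made it to disk before the crash: everything works.
assert reset_unlogged({"16384", "16384_init"}) == {"16384", "16384_init"}

# Init fork was lost (no immediate sync): the stale main fork survives
# cleanup and the final copy fails.
try:
    reset_unlogged({"16384"})
except FileExistsError as e:
    print(e)
```

This is only a model of the scenario as described in the message above; whether the real copy_file() fails in exactly this way, or the cleanup could be made tolerant instead, is what the thread goes on to discuss.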
On 2019-May-13, Robert Haas wrote: > I think I see what Alvaro is talking about, or at least I think I see > *a* possible problem based on his remarks. > > Suppose we create an unlogged table and then crash. The main fork > makes it to disk, and the init fork does not. Before WAL replay, we > remove any main forks that have init forks, but because the init fork > was lost, that does not happen. Recovery recreates the init fork. > After WAL replay, we try to copy_file() each _init fork to the > corresponding main fork. That fails, because copy_file() expects to be > able to create the target file, and here it can't do that because it > already exists. ... right, that seems to be it. > If that's the scenario, I'm not sure the smgrimmedsync() call is > sufficient. Suppose we log_smgrcreate() but then crash before > smgrimmedsync()... seems like we'd need to do them in the other order, > or else maybe just pass a flag to copy_file() telling it not to be so > picky. Well, if we do modify copy_file and thus we don't think the deletion step serves any purpose, why not just get rid of the deletion step entirely? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote: > On 2019-May-13, Andres Freund wrote: > > > On 2019-05-13 12:24:05 -0400, Alvaro Herrera wrote: > > > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > > > so it would be sufficient to have the init fork be recovered from WAL > > > for that to work. However, we also do ResetUnloggedRelations *before* > > > replaying WAL in order to remove leftover not-init-fork files, and that > > > process requires that the init fork is present at that time. > > > > What scenario are you precisely wondering about? That > > ResetUnloggedRelations() could overwrite the main fork, while not yet > > having valid contents (due to the lack of smgrimmedsync())? Shouldn't > > that only be possible while still in an inconsistent state? A checkpoint > > would have serialized the correct contents, and we'd not reach HS > > consistency before having replayed the WAL records resetting the table > > and the init fork? > > The first ResetUnloggedRelations call occurs before any WAL is replayed, > so the data dir is certainly still in an inconsistent state. At that point, > we need the init fork files to be present, because the init files are the > indicators of what relations we need to delete the other forks for. Hm. I think this might be a self-made problem. For the main fork, we don't need this - if the init fork was created before the last checkpoint/restartpoint, it'll be on-disk. If it was created afterwards, WAL replay will recreate both main and init forks. So the problem is just that the VM fork might survive, because it'll not get nuked given the current arrangement. Is that what you're thinking about? I'm doubtful that that is a sane arrangement - there very well could be tables created and dropped, and then recreated with a recycled oid, between start and end of recovery. I'm not sure this is actively a problem for the VM, but I think it's pretty broken for the FSM. 
Why isn't the correct answer to nuke all forks during the WAL replay of the main relation's creation? > Maybe we can do something lighter than a full immedsync of all the data > for the init file -- it would be sufficient to have the file *exist* -- > but I'm not sure this optimization is worth anything. I don't think just that is sufficient in isolation for types of relations with metapages (e.g. btree) - the init fork contains data there. > > > So I think the immedsync call is necessary (otherwise the cleanup > > > may fail). I don't quite understand why the log_smgrcreate is > > > necessary, but I think it is for reasons that are not adequately > > > explained by the existing comments. > > > > Well, otherwise the relation won't exist on a standby? And if replay > > starts from before a database/tablespace creation we'd remove the init > > fork. So if it's not in the WAL, we'd lose it. I think it is? * ... Recovery may as well remove it * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE * record. Therefore, logging is necessary even if wal_level=minimal. Greetings, Andres Freund
On 2019-May-13, Andres Freund wrote: > On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote: > > On 2019-May-13, Andres Freund wrote: > > The first ResetUnloggedRelations call occurs before any WAL is replayed, > > so the data dir is certainly still in an inconsistent state. At that point, > > we need the init fork files to be present, because the init files are the > > indicators of what relations we need to delete the other forks for. > > Hm. I think this might be a self-made problem. For the main fork, we > don't need this - if the init fork was created before the last > checkpoint/restartpoint, it'll be on-disk. If it was created afterwards, > WAL replay will recreate both main and init forks. So the problem is just > that the VM fork might survive, because it'll not get nuked given the > current arrangement. Is that what you're thinking about? No, this wasn't what I was trying to explain. Robert described it better. > > Maybe we can do something lighter than a full immedsync of all the data > > for the init file -- it would be sufficient to have the file *exist* -- > > but I'm not sure this optimization is worth anything. > > I don't think just that is sufficient in isolation for types of > relations with metapages (e.g. btree) - the init fork contains data > there. No, I meant that when doing the initial cleanup (before WAL replay) we only delete files; and for that we only need to know whether the table is unlogged, and we know that by testing presence of the init file. We do need the contents *after* WAL replay, and for indexes we of course need the actual contents of the init fork. > > > Well, otherwise the relation won't exist on a standby? And if replay > > > starts from before a database/tablespace creation we'd remove the init > > > fork. So if it's not in the WAL, we'd lose it. > > > > Ah, of course. Well, that needs to be in the comments then. > > I think it is? > > * ... 
Recovery may as well remove it > * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE > * record. Therefore, logging is necessary even if wal_level=minimal. I meant the standby part. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-05-13 13:09:17 -0400, Robert Haas wrote: > On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote: > > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > > > so it would be sufficient to have the init fork be recovered from WAL > > > for that to work. However, we also do ResetUnloggedRelations *before* > > > replaying WAL in order to remove leftover not-init-fork files, and that > > > process requires that the init fork is present at that time. > > > > What scenario are you precisely wondering about? That > > ResetUnloggedRelations() could overwrite the main fork, while not yet > > having valid contents (due to the lack of smgrimmedsync())? Shouldn't > > that only be possible while still in an inconsistent state? A checkpoint > > would have serialized the correct contents, and we'd not reach HS > > consistency before having replayed that WAL records resetting the table > > and the init fork consistency? > > I think I see what Alvaro is talking about, or at least I think I see > *a* possible problem based on his remarks. > > Suppose we create an unlogged table and then crash. The main fork > makes it to disk, and the init fork does not. Before WAL replay, we > remove any main forks that have init forks, but because the init fork > was lost, that does not happen. Recovery recreates the init fork. > After WAL replay, we try to copy_file() each _init fork to the > corresponding main fork. That fails, because copy_file() expects to be > able to create the target file, and here it can't do that because it > already exists. Ugh, this is all such a mess. But, isn't this broken independently of the smgrimmedsync() issue? In a basebackup case, the basebackup could have included the main fork, but not the init fork, and the reverse. WAL replay *solely* needs to be able to recover from that. At the very least we'd have to do the cleanup step after becoming consistent, not just before recovery even started. 
> If that's the scenario, I'm not sure the smgrimmedsync() call is > sufficient. Suppose we log_smgrcreate() but then crash before > smgrimmedsync()... seems like we'd need to do them in the other order, > or else maybe just pass a flag to copy_file() telling it not to be so > picky. I think I mentioned this elsewhere recently, I think our smgr WAL logging makes this unnecessarily hard. We should just have the initial smgrcreate record (be it for the main fork, or just the init fork) have a flag that tells WAL replay to clean out everything pre-existing. Hm, I'm also a bit confused as to how ResetUnloggedRelations() got to take its argument as a bitmask - given that we only pass in individual flags at the moment. Seems pretty clear to me that the call at the end of recovery would need to be UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT, unless we make some bigger changes? Greetings, Andres Freund
Hi, On 2019-05-13 13:21:33 -0400, Alvaro Herrera wrote: > On 2019-May-13, Robert Haas wrote: > > If that's the scenario, I'm not sure the smgrimmedsync() call is > > sufficient. Suppose we log_smgrcreate() but then crash before > > smgrimmedsync()... seems like we'd need to do them in the other order, > > or else maybe just pass a flag to copy_file() telling it not to be so > > picky. > > Well, if we do modify copy_file and thus we don't think the deletion > step serves any purpose, why not just get rid of the deletion step > entirely? Well, it does serve a purpose: /* * We're in recovery, so unlogged relations may be trashed and must be * reset. This should be done BEFORE allowing Hot Standby * connections, so that read-only backends don't try to read whatever * garbage is left over from before. */ but I'm somewhat doubtful that this logic is correct. Greetings, Andres Freund
Hi, On 2019-05-13 13:33:00 -0400, Alvaro Herrera wrote: > On 2019-May-13, Andres Freund wrote: > > > On 2019-05-13 13:07:30 -0400, Alvaro Herrera wrote: > > > On 2019-May-13, Andres Freund wrote: > > > > The first ResetUnloggedRelations call occurs before any WAL is replayed, > > > so the data dir certainly still in inconsistent state. At that point, > > > we need the init fork files to be present, because the init files are the > > > indicators of what relations we need to delete the other forks for. > > > > Hm. I think this might be a self-made problem. For the main fork, we > > don't need this - if the init fork was created before the last > > checkpoint/restartpoint, it'll be on-disk. If it was created afterwards, > > WAL replay will recreate both main an init fork. So the problem is just > > that the VM fork might survive, because it'll not get nuked given the > > current arrangement. Is that what you're thinking about? I was wrong here - I thought we WAL logged the main fork creation even for unlogged tables. I think it's foolish that we don't, but we don't. Greetings, Andres Freund
On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote: > I was wrong here - I thought we WAL logged the main fork creation even > for unlogged tables. I think it's foolish that we don't, but we don't. Why? The main fork is not actually necessary, and the beginning of recovery would make sure that any existing main fork gets removed and that the main fork gets recreated by copying it from the init fork, which is WAL-logged. So we definitely have to log the init fork, but logging the main fork is just useless data in the WAL record. Or do you have something else in mind? -- Michael
On Mon, May 13, 2019 at 11:28:32AM -0400, Robert Haas wrote: > 1. The comment added in that commit says "Recover may as well remove > it while replaying..." but what is really meant is "Recovery may well > remove it while replaying..." The phrase "may as well" means that > there isn't really any reason not to do it even if it's not strictly > necessary. The phrase "may well" means that it's entirely possible. > The latter meaning is the one we want here. Perhaps I wrote this comment. Per what you say, indeed we should use "may well" here, to outline the fact that it is a possibility. I have to admit that I don't completely get the difference with "may as well", which also sounds like a possibility by reading this comment today, but I am no native speaker. > 2. The comment as adjusted in that commit refers to needing an > immediate sync "even if the page has been logged, because the write > did not go through shared_buffers," but there is no page and no write, > because an empty heap is just an empty file. That logic makes sense > for index AMs that bypass shared buffers to write a metapage, e.g. > nbtree, as opposed to others which go through shared_buffers and don't > have the immediate sync, e.g. brin. However, the heap writes no pages > either through shared buffers or bypassing shared buffers, so either > I'm confused or the comment makes no sense. We need to ensure that the init fork, even if empty, still exists so that recovery can do its job of detecting that this is an unlogged table and then recreate the empty main fork. Perhaps the comment could insist on the on-disk existence of the relfilenode instead, making the logging of this page necessary, however empty it is. > 3. Before that commit, the comment said that "the immediate sync is > required, because otherwise there's no guarantee that this will hit > the disk before the next checkpoint moves the redo pointer." 
However, > that justification seems to imply that *anything* that does smgrcreate + > log_smgrcreate would also need to do smgrimmedsync, and yet > RelationCreateStorage doesn't. Unless, for some reason that I'm not > thinking of right now, the init fork has stronger durability > requirements than the main fork, it's hard to understand why > heapam_relation_set_new_filenode can call RelationCreateStorage to do > smgrcreate + log_smgrcreate for the main fork and that's OK, but when > it does the same thing itself for the init fork, it now needs > smgrimmedsync as well. > > My guess, just shooting from the hip, is that the smgrimmedsync call > can be removed here. If that's wrong, then we need a better > explanation for why it's needed, and we possibly need to add it to > every single place that does smgrcreate that doesn't have it already. That's mentioned upthread, and my sloppy memories on the matter match that we need the fsync call to ensure that the init fork is present at the start of recovery, and that logging makes sure that it does not get wiped out if replaying a database or tablespace record. -- Michael
On Mon, May 13, 2019 at 10:37:35AM -0700, Andres Freund wrote: > Ugh, this is all such a mess. But, isn't this broken independently of > the smgrimmedsync() issue? In a basebackup case, the basebackup could > have included the main fork, but not the init fork, and the reverse. WAL > replay *solely* needs to be able to recover from that. At the very > least we'd have to do the cleanup step after becoming consistent, not > just before recovery even started. Yes, the logic using smgrimmedsync() is race-prone and weaker than the index AMs in my opinion, even if the failure window is limited (I think that this is mentioned upthread a bit). What's actually the reason preventing us from delaying the checkpointer like the index AMs for the logging of heap init fork? The fact that the init heap fork is an empty page? -- Michael
Hi, On 2019-05-14 12:50:09 +0900, Michael Paquier wrote: > On Mon, May 13, 2019 at 10:52:21AM -0700, Andres Freund wrote: > > I was wrong here - I thought we WAL logged the main fork creation even > > for unlogged tables. I think it's foolish that we don't, but we don't. > > Why? The main fork is not actually necessary, and the beginning of > recovery would make sure that any existing main fork gets removed and > that the main fork gets recreated by copying it from the init fork, > which is WAL-logged. Several things: 1) A main fork that's created while a base-backup is in progress might only partially exist on a HS node. We won't necessarily remove it, because the init fork might *not* have been copied. The init fork will be re-created by WAL replay, but that'll be too late for the ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) call. Which then will a) cause ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to fail, because the target file already exists, as mentioned elsewhere in this thread b) cause trouble when accessing the unlogged table - unfortunately there are several ways to access unlogged tables on a HS node - so the main fork better either not exist, or be in a valid empty state. Case in point: COPY blarg TO STDOUT; does not throw an error due to unlogged-ness. If we instead properly WAL logged the MAIN fork creation b) wouldn't be a problem. And we could have the UNLOGGED_RELATION_INIT case re-use the hashtable built at UNLOGGED_RELATION_CLEANUP time, and avoid a second scan of the filesystem. As all relations that were created after the LSN crash recovery started at would be valid, we would not need to do a full second scan: All unlogged relations created after that LSN are guaranteed to have a consistent main fork. 2) With a small bit of additional work, the sketch to fix 1) would allow us to allow read access to unlogged relations during HS. We'd just have to do the UNLOGGED_RELATION_INIT work at the start of recovery. 
3) I'm somewhat certain that there are relfilenode reuse issues with the current approach, again during base backups (as there the checkpoint logic to prevent relfilenode reuse doesn't work). I think you can easily end up with main / vm / fsm / init forks that actually don't belong to the same relation for a base backup. If you don't force delete all other forks at the time the init fork record is replayed, there's no defense against that. There's probably more, but I gotta cook. > So we definitely have to log the init fork, but logging the main fork > is just useless data in the WAL record. Or do you have something else in > mind? Well, as I argued somewhere else in this thread, I think this all should just be one WAL record. Something very roughly like:

struct xl_smgr_create
{
    RelFileNode rnode;
    ForkNumber  forkNum;
    bool        remove_all_other_forks;
    bool        copy_init_to_main_fork;
    size_t      data_length;
    char        fork_contents[];
};

Presumably with a flags argument instead of separate booleans. Greetings, Andres Freund
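As an illustration only (an invented sketch, not proposed code), replay of such a combined record could behave like the following toy model, which tracks one relfilenode's forks as a dict; the remove_all_other_forks flag is what defends against stale forks left over from a recycled relfilenode in a base backup:

```python
# Toy model of replaying a hypothetical combined smgr-create record.

def replay_smgr_create(forks, forknum, contents, *,
                       remove_all_other_forks, copy_init_to_main_fork):
    if remove_all_other_forks:
        # Nuke whatever forks a base backup happened to copy for this
        # (possibly recycled) relfilenode before recreating this one.
        forks.clear()
    forks[forknum] = contents
    if copy_init_to_main_fork:
        forks["main"] = contents
    return forks

# Stale fsm/vm forks from an unrelated relation that reused the oid:
stale = {"fsm": b"old", "vm": b"old"}
replayed = replay_smgr_create(stale, "init", b"",
                              remove_all_other_forks=True,
                              copy_init_to_main_fork=True)
assert replayed == {"init": b"", "main": b""}
```

Without remove_all_other_forks, the stale "fsm" and "vm" entries would survive replay, which is the hazard point 3) describes.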
Hi, On 2019-05-14 13:23:28 +0900, Michael Paquier wrote: > On Mon, May 13, 2019 at 10:37:35AM -0700, Andres Freund wrote: > > Ugh, this is all such a mess. But, isn't this broken independently of > > the smgrimmedsync() issue? In a basebackup case, the basebackup could > > have included the main fork, but not the init fork, and the reverse. WAL > > replay *solely* needs to be able to recover from that. At the very > > least we'd have to do the cleanup step after becoming consistent, not > > just before recovery even started. > > Yes, the logic using smgrimmedsync() is race-prone and weaker than the > index AMs in my opinion, even if the failure window is limited (I > think that this is mentioned upthread a bit). How's it limited? On a large database a base backup easily can take *days*. And e.g. VM and FSM can easily have inodes that are much newer than the main/init forks, so typical base-backups (via OS/glibc readdir) will sort them at a later point (or it'll be hashed, in which case it's entirely random), so the window between when the different forks are copied is large. > What's actually the reason preventing us from delaying the > checkpointer like the index AMs for the logging of heap init fork? I'm not following. What do you mean by "delaying the checkpointer"? Greetings, Andres Freund
On Mon, May 13, 2019 at 09:33:52PM -0700, Andres Freund wrote:
> On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
>> What's actually the reason preventing us from delaying the
>> checkpointer like the index AMs for the logging of heap init fork?
>
> I'm not following. What do you mean by "delaying the checkpointer"?

I mean what Robert has mentioned here:
https://www.postgresql.org/message-id/CA+TgmoZ4TWaPCKhF-szV-nPxDXL40zCwm9pNFJZURvRgm2oJzQ@mail.gmail.com

And my gut tells me that he got that right, because we are discussing
race conditions with crashes and checkpoints in between calls to
smgrimmedsync() and log_newpage(). That could be invasive for
back-branches, but for HEAD this would make the whole init fork handling
saner.
--
Michael
Hi,

On 2019-05-14 14:22:15 +0900, Michael Paquier wrote:
> On Mon, May 13, 2019 at 09:33:52PM -0700, Andres Freund wrote:
> > On 2019-05-14 13:23:28 +0900, Michael Paquier wrote:
> >> What's actually the reason preventing us from delaying the
> >> checkpointer like the index AMs for the logging of heap init fork?
> >
> > I'm not following. What do you mean by "delaying the checkpointer"?
>
> I mean what Robert has mentioned here:
> https://www.postgresql.org/message-id/CA+TgmoZ4TWaPCKhF-szV-nPxDXL40zCwm9pNFJZURvRgm2oJzQ@mail.gmail.com

That's a proposal, not something we actually ended up doing, though?
That's what confuses me about your earlier paragraph: "like the index
AMs"? Where are we doing that in index AMs?

> And my gut tells me that he got that right, because we are discussing
> race conditions with crashes and checkpoints in between calls to
> smgrimmedsync() and log_newpage(). That could be invasive for
> back-branches, but for HEAD this would make the whole init fork
> handling saner.

How would this protect against the issues I mentioned, where recovery
starts from an earlier checkpoint and the basebackup could pick up a
random set of versions of the different forks?

I don't like the idea of expanding the use of delayChkpt further for
common operations; if anything we ought to try to reduce it. But I also
don't see how it'd actually fix the issues, so perhaps that's moot.

Greetings,

Andres Freund
On Tue, May 14, 2019 at 2:19 AM Andres Freund <andres@anarazel.de> wrote:
> How would this protect against the issues I mentioned where recovery
> starts from an earlier checkpoint and the basebackup could pick up a
> random set of versions of the different forks?
>
> I don't like the idea of expanding the use of delayChkpt further for
> common operations, if anything we ought to try to reduce it. But I
> also don't see how it'd actually fix the issues, so perhaps that's
> moot.

Based on the discussion so far, I think there are a number of related
problems here:

1. A base backup might copy the main fork but not the init fork. I think
that's a problem that nobody had thought about before Andres raised it on
this thread. I may be wrong. Anyway, if that happens, the
start-of-recovery code isn't going to do the right thing, because it
won't know that it's dealing with an unlogged relation.

2. Suppose the system reaches the end of heapam_relation_set_new_filenode
and then the system crashes. Because of the smgrimmedsync(), and only
because of the smgrimmedsync(), the init fork would exist at the start of
recovery. However, unlike the heap, some of the index AMs don't actually
have a call to smgrimmedsync() in their *buildempty() functions. If I'm
not mistaken, the ones that write pages through shared_buffers do not do
smgrimmedsync(), and the ones that use private buffers do. Therefore,
the ones that write pages through shared_buffers are vulnerable to a
crash right after the unlogged table is created. The init fork could
fail to survive the crash, and therefore the start-of-recovery code would
again fail to know that it's dealing with an unlogged relation.

3. So, why is it like that, anyway? Why do index AMs that write pages
via shared_buffers not have smgrimmedsync()? The answer is that we did
it like that because we were worrying about a different problem -
specifically, checkpointing.
If I dirty a buffer and write a WAL record for the changes that I made,
it is guaranteed that the dirty data will make it to disk before the next
checkpoint is written; we've got all sorts of interlocking in
BufferSync(), SyncOneBuffer(), etc. to make sure that happens. But if I
create a file in the filesystem and write a WAL record for that change,
none of that interlocking does anything. So unless I not only create
that file but also smgrimmedsync() it before the next checkpoint's redo
location is fixed, a crash could make the file disappear.

For example, consider RelationCreateStorage. What prevents the following
sequence of events?

S1: smgropen()
S1: smgrcreate()
S1: log_smgrcreate()
-- at this point the checkpointer fixes the redo pointer, writes every
   dirty buffer in the system, and completes a checkpoint
S1: commits
-- crash

On restart, I think we could potentially be missing the file created by
smgrcreate(), and we won't replay the WAL record generated by
log_smgrcreate() either, because it's before the checkpoint.

I believe that it is one aspect of this third problem that we previously
fixed on this thread, but I think we failed to understand how general the
issue actually is. It affects _init forks of unlogged tables, but I
think it also affects essentially every other use of smgrcreate(),
whether it's got anything to do with unlogged tables or not. For an
index AM, which has a non-empty initial representation, the consequences
are pretty limited, because the transaction can't commit having created
only an empty fork. It's got to put some data into either the main fork
or the init fork, as the case may be. If it aborts, then we may leave
some orphaned files behind, but if we lose one, we just have fewer
orphaned files, so nobody cares.
And if it commits, then either everything's before the checkpoint (and
the file will have been fsync'd because we fsync'd the buffers), or
everything's after the checkpoint (so we will replay the WAL), or only
the smgrcreate() is before the checkpoint and the rest is after (in which
case XLogReadBufferExtended() will do the smgrcreate() over again for us
and we'll be fine). But since the heap uses an empty initial
representation, it enjoys no similar protection.

So, IIUC, the reason why we were talking about delayChkpt before is
because it would allow us to remove the smgrimmedsync() of the unlogged
fork while still avoiding problem #3. However, it does nothing to
protect against problem #1 or #2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 20, 2019 at 09:16:32AM -0400, Robert Haas wrote:
> Based on the discussion so far, I think there are a number of related
> problems here:

Thanks for the summary.

> 1. A base backup might copy the main fork but not the init fork. I
> think that's a problem that nobody had thought about before Andres
> raised it on this thread. I may be wrong. Anyway, if that happens,
> the start-of-recovery code isn't going to do the right thing, because
> it won't know that it's dealing with an unlogged relation.

Hmm. I need to think harder about this one, and I find the argument of
Andres from upthread to use (CLEANUP | INIT) for ResetUnloggedRelations()
at the end of recovery quite sensible:
https://www.postgresql.org/message-id/20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de
It seems that the main risk is to finish with partially-present index
main forks. For heap, the main fork would still be empty.

> 2. Suppose the system reaches the end of
> heapam_relation_set_new_filenode and then the system crashes. Because
> of the smgrimmedsync(), and only because of the smgrimmedsync(), the
> init fork would exist at the start of recovery. However, unlike the
> heap, some of the index AMs don't actually have a call to
> smgrimmedsync() in their *buildempty() functions. If I'm not
> mistaken, the ones that write pages through shared_buffers do not do
> smgrimmedsync(), and the ones that use private buffers do.

Yes, that matches what I can see in the code for the various
*buildempty() callbacks.

> Therefore, the ones that write pages through
> shared_buffers are vulnerable to a crash right after the unlogged
> table is created. The init fork could fail to survive the crash, and
> therefore the start-of-recovery code would again fail to know that
> it's dealing with an unlogged relation.
Taking the example of gist, which uses shared buffers for the init fork
logging, we take an exclusive lock on the buffer involved while logging
the init fork, meaning that the checkpoint is not able to complete until
the lock is released and the buffer is flushed. Do you have a specific
sequence in mind?

> 3. So, why is it like that, anyway? Why do index AMs that write pages
> via shared_buffers not have smgrimmedsync()? The answer is that we
> did it like that because we were worrying about a different problem -
> specifically, checkpointing. If I dirty a buffer and write a WAL
> record for the changes that I made, it is guaranteed that the dirty
> data will make it to disk before the next checkpoint is written;
> we've got all sorts of interlocking in BufferSync(), SyncOneBuffer(),
> etc. to make sure that happens. But if I create a file in the
> filesystem and write a WAL record for that change, none of that
> interlocking does anything. So unless I not only create that file
> but also smgrimmedsync() it before the next checkpoint's redo
> location is fixed, a crash could make the file disappear.
>
> On restart, I think we could potentially be missing the file created
> by smgrcreate(), and we won't replay the WAL record generated by
> log_smgrcreate() either, because it's before the checkpoint.

Right, that's possible. If you take the case of btree, the problem is
the window between the sync and the page logging, which is unsafe.

> I believe that it is one aspect of this third problem that we
> previously fixed on this thread, but I think we failed to understand
> how general the issue actually is. It affects _init forks of
> unlogged tables, but I think it also affects essentially every other
> use of smgrcreate(), whether it's got anything to do with unlogged
> tables or not. For an index AM, which has a non-empty initial
> representation, the consequences are pretty limited, because the
> transaction can't commit having created only an empty fork.
> It's got to put some data into either the main fork or the init fork,
> as the case may be. If it aborts, then we may leave some orphaned
> files behind, but if we lose one, we just have fewer orphaned files,
> so nobody cares. And if it commits, then either everything's before
> the checkpoint (and the file will have been fsync'd because we
> fsync'd the buffers), or everything's after the checkpoint (so we
> will replay the WAL), or only the smgrcreate() is before the
> checkpoint and the rest is after (in which case
> XLogReadBufferExtended() will do the smgrcreate() over again for us
> and we'll be fine). But since the heap uses an empty initial
> representation, it enjoys no similar protection.

Not sure what to think about this paragraph. I need to ponder it more.

> So, IIUC, the reason why we were talking about delayChkpt before is
> because it would allow us to remove the smgrimmedsync() of the
> unlogged fork while still avoiding problem #3. However, it does
> nothing to protect against problem #1 or #2.

Right. I am not completely sure why #2 is a problem, though; please see
my comments above. For #1, the point raised upthread to do a cleanup +
init at the end of recovery makes sense.
--
Michael
On Tue, May 21, 2019 at 4:17 AM Michael Paquier <michael@paquier.xyz> wrote:
> > 2. Suppose the system reaches the end of
> > heapam_relation_set_new_filenode and then the system crashes.
> > Because of the smgrimmedsync(), and only because of the
> > smgrimmedsync(), the init fork would exist at the start of
> > recovery. However, unlike the heap, some of the index AMs don't
> > actually have a call to smgrimmedsync() in their *buildempty()
> > functions. If I'm not mistaken, the ones that write pages through
> > shared_buffers do not do smgrimmedsync(), and the ones that use
> > private buffers do.
>
> Yes, that matches what I can see in the code for the various
> *buildempty() callbacks.
>
> > Therefore, the ones that write pages through
> > shared_buffers are vulnerable to a crash right after the unlogged
> > table is created. The init fork could fail to survive the crash,
> > and therefore the start-of-recovery code would again fail to know
> > that it's dealing with an unlogged relation.
>
> Taking the example of gist, which uses shared buffers for the init
> fork logging, we take an exclusive lock on the buffer involved while
> logging the init fork, meaning that the checkpoint is not able to
> complete until the lock is released and the buffer is flushed. Do
> you have a specific sequence in mind?

Yes. I thought I had described it. You create an unlogged table, with
an index of a type that does not smgrimmedsync(), your transaction
commits, and then the system crashes, losing the _init fork for the
index. There's no checkpoint involved in this scenario, so any argument
that it can't be a problem because of checkpoints is necessarily
incorrect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 21, 2019 at 08:39:18AM -0400, Robert Haas wrote:
> Yes. I thought I had described it. You create an unlogged table,
> with an index of a type that does not smgrimmedsync(), your
> transaction commits, and then the system crashes, losing the _init
> fork for the index.

The init forks won't magically go away, except in one case for
*buildempty() routines not going through shared buffers. The routines
writing pages that do not go through shared buffers still log them, but
those pages could go away if the host crashes after a checkpoint
completed once the page was logged and before smgrimmedsync() was
called. Please note the logic in spgbuildempty() as an example.

Then, *buildempty() routines going through shared buffers fill in one or
more buffers, mark them as empty, dirty them, log the page(s) and then
unlock the buffer(s). If a crash happens after the transaction commits,
we would still have the init page in WAL, and at the end of recovery we
would know about it. Or is there something in
log_newpage_buffer()/log_newpage() that I am missing but you foresee?
--
Michael
On Thu, May 23, 2019 at 2:43 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, May 21, 2019 at 08:39:18AM -0400, Robert Haas wrote:
> > Yes. I thought I had described it. You create an unlogged table,
> > with an index of a type that does not smgrimmedsync(), your
> > transaction commits, and then the system crashes, losing the _init
> > fork for the index.
>
> The init forks won't magically go away, except in one case for
> *buildempty() routines not going through shared buffers.

No magic is required. If you haven't called fsync(), the file might not
be there after a system crash. Going through shared_buffers guarantees
that the file will be fsync()'d before the next checkpoint, but I'm
talking about a scenario where you crash before the next checkpoint.

> Then, *buildempty() routines going through shared buffers fill in one
> or more buffers, mark them as empty, dirty them, log the page(s) and
> then unlock the buffer(s). If a crash happens after the transaction
> commits, we would still have the init page in WAL, and at the end of
> recovery we would know about it.

Yeah, but the problem is that the current system requires us to know
about it at the *beginning* of recovery. See my earlier remarks:

Suppose we create an unlogged table and then crash. The main fork makes
it to disk, and the init fork does not. Before WAL replay, we remove any
main forks that have init forks, but because the init fork was lost,
that does not happen. Recovery recreates the init fork. After WAL
replay, we try to copy_file() each _init fork to the corresponding main
fork. That fails, because copy_file() expects to be able to create the
target file, and here it can't do that because it already exists.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 23, 2019 at 09:14:59AM -0400, Robert Haas wrote:
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not. Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen. Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to
> be able to create the target file, and here it can't do that because
> it already exists.

Ah, sorry. Now I understand your point. I got your previous message
wrong: I thought about the fact that init forks are logged, hence after
a crash you should still have them at the end of recovery, and I got
confused when you wrote that init forks just disappear. However you
were referring to the state of a cluster before recovery begins, while I
was thinking about the state where the cluster has run recovery and has
reached consistency.

If we should do something about such cases, then this brings us back to
doing (INIT | CLEANUP) at the end of recovery anyway?
--
Michael
On Sun, May 26, 2019 at 10:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> If we should do something about such cases, then this brings us back
> to doing (INIT | CLEANUP) at the end of recovery anyway?

I believe that could fix problems #1 and #2, but we'd have to think
about what new issues it would introduce. #3 seems like it needs a
different fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company