Обсуждение: BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"
BUG #7670: BUG #7545: Unresponsive server with error log reporting: "poll() failed: Invalid argument"
От
igorya.inscriptio@gmail.com
Дата:
The following bug has been logged on the website: Bug reference: 7670 Logged by: Igor Email address: igorya.inscriptio@gmail.com PostgreSQL version: 9.2.1 Operating system: FreeBSD 9.0 Description: = The same a BUG as 7545. Pgsql compiled from FreeBSD ports with options: _OPTIONS_READ=3Dpostgresql-server-9.2.1 _FILE_COMPLETE_OPTIONS_LIST=3D NLS DTRACE PAM LDAP MIT_KRB5 HEIMDAL_KRB5 GSSAPI OPTIMIZED_CFLAGS XML TZDATA DEBUG ICU INTDATE SSL OPTIONS_FILE_SET+=3DNLS OPTIONS_FILE_UNSET+=3DDTRACE OPTIONS_FILE_UNSET+=3DPAM OPTIONS_FILE_UNSET+=3DLDAP OPTIONS_FILE_UNSET+=3DMIT_KRB5 OPTIONS_FILE_UNSET+=3DHEIMDAL_KRB5 OPTIONS_FILE_UNSET+=3DGSSAPI OPTIONS_FILE_UNSET+=3DOPTIMIZED_CFLAGS OPTIONS_FILE_SET+=3DXML OPTIONS_FILE_SET+=3DTZDATA OPTIONS_FILE_UNSET+=3DDEBUG OPTIONS_FILE_UNSET+=3DICU OPTIONS_FILE_SET+=3DINTDATE OPTIONS_FILE_SET+=3DSSL
On 2012-11-17 21:00:36 +0000, igorya.inscriptio@gmail.com wrote: > The following bug has been logged on the website: > > Bug reference: 7670 > Logged by: Igor > Email address: igorya.inscriptio@gmail.com > PostgreSQL version: 9.2.1 > Operating system: FreeBSD 9.0 > Description: > > The same a BUG as 7545. > > Pgsql compiled from FreeBSD ports with options: > > _OPTIONS_READ=postgresql-server-9.2.1 > _FILE_COMPLETE_OPTIONS_LIST= NLS DTRACE PAM LDAP MIT_KRB5 HEIMDAL_KRB5 > GSSAPI OPTIMIZED_CFLAGS XML TZDATA DEBUG ICU INTDATE SSL > OPTIONS_FILE_SET+=NLS > OPTIONS_FILE_UNSET+=DTRACE > OPTIONS_FILE_UNSET+=PAM > OPTIONS_FILE_UNSET+=LDAP > OPTIONS_FILE_UNSET+=MIT_KRB5 > OPTIONS_FILE_UNSET+=HEIMDAL_KRB5 > OPTIONS_FILE_UNSET+=GSSAPI > OPTIONS_FILE_UNSET+=OPTIMIZED_CFLAGS > OPTIONS_FILE_SET+=XML > OPTIONS_FILE_SET+=TZDATA > OPTIONS_FILE_UNSET+=DEBUG > OPTIONS_FILE_UNSET+=ICU > OPTIONS_FILE_SET+=INTDATE > OPTIONS_FILE_SET+=SSL Could you give a bit more context about when exactly that error is happening? It would also be helpful if you could set log_error_verbosity = 'verbose'; in the config file and report back with the complete error message, including line numbers. If the error is easy to reproduce it would also be helpful to see the failing command with ktruss or similar. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
... Could you give a bit more context about when exactly that error is > happening? > No :( This is the random event. It would also be helpful if you could set log_error_verbosity = > 'verbose'; Done. And will wait. -- Regards System and web developer. <http://amkmobile.com>
.. > Could you give a bit more context about when exactly that error is > happening? > It would also be helpful if you could set log_error_verbosity = > 'verbose'; in the config file and report back with the complete error > message, including line numbers. > > I try to restart DB and seems get this situation: === 2012-11-18 18:30:25 EET pid:93744 LOG: received SIGHUP, reloading configuration files 2012-11-18 18:30:25 EET pid:93744 LOG: 00000: parameter "log_error_verbosity" changed to "verbose" 2012-11-18 18:30:25 EET pid:93744 LOCATION: ProcessConfigFile, guc-file.l:318 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: received fast shutdown request 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2187 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: aborting any active transactions 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2211 2012-11-18 18:36:45 EET pid:93749 LOG: 00000: autovacuum launcher shutting down 2012-11-18 18:36:45 EET pid:93749 LOCATION: AutoVacLauncherMain, autovacuum.c:767 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: shutting down 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, xlog.c:7560 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: database system is shut down 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, xlog.c:7582 2012-11-18 18:36:46 EET pid:95528 FATAL: XX000: poll() failed: Invalid argument 2012-11-18 18:36:46 EET pid:95528 LOCATION: WaitLatchOrSocket, pg_latch.c:296 === After several minutes the rc.d script said: pg_ctl: could not start server Examine the log output. Then I did: killall -2 postgres killall -3 postgres And this remains procesess: 95527 pgsql 1 52 0 2159M 82400K pipewr 10 0:00 0.00% postgres 95540 pgsql 1 20 0 2163M 83432K pipewr 0 0:00 0.00% postgres I can stop only via killall -9 :( -- Regards System and web developer. <http://amkmobile.com>
And for my previous letter: I feel :) that the problem is in "logging_collector" setting - I used this feature. But now I comments this setting in config. And will watching .. -- Regards System and web developer. <http://amkmobile.com>
On 2012-11-18 18:53:13 +0200, Igor wrote: > .. > > > Could you give a bit more context about when exactly that error is > > happening? > > It would also be helpful if you could set log_error_verbosity = > > 'verbose'; in the config file and report back with the complete error > > message, including line numbers. > > > > > I try to restart DB and seems get this situation: > > === > 2012-11-18 18:30:25 EET pid:93744 LOG: received SIGHUP, reloading > configuration files > 2012-11-18 18:30:25 EET pid:93744 LOG: 00000: parameter > "log_error_verbosity" changed to "verbose" > 2012-11-18 18:30:25 EET pid:93744 LOCATION: ProcessConfigFile, > guc-file.l:318 > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: received fast shutdown > request > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2187 > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: aborting any active > transactions > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2211 > 2012-11-18 18:36:45 EET pid:93749 LOG: 00000: autovacuum launcher > shutting down > 2012-11-18 18:36:45 EET pid:93749 LOCATION: AutoVacLauncherMain, > autovacuum.c:767 > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: shutting down > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, xlog.c:7560 > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: database system is shut > down > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, xlog.c:7582 > 2012-11-18 18:36:46 EET pid:95528 FATAL: XX000: poll() failed: Invalid > argument > 2012-11-18 18:36:46 EET pid:95528 LOCATION: WaitLatchOrSocket, > pg_latch.c:296 > === Ok, this is already helpful. Do you have anything the log that shows what process pid 95528 is? Could you grep for it? Are you able to recompile postgres with other compilation options and/or some debugging patches applied? The first thing would be to recompile postgres with --enable-cassert... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
2012/11/18 Andres Freund <andres@2ndquadrant.com> > On 2012-11-18 18:53:13 +0200, Igor wrote: > > .. > > > > > Could you give a bit more context about when exactly that error is > > > happening? > > > It would also be helpful if you could set log_error_verbosity = > > > 'verbose'; in the config file and report back with the complete error > > > message, including line numbers. > > > > > > > > I try to restart DB and seems get this situation: > > > > === > > 2012-11-18 18:30:25 EET pid:93744 LOG: received SIGHUP, reloading > > configuration files > > 2012-11-18 18:30:25 EET pid:93744 LOG: 00000: parameter > > "log_error_verbosity" changed to "verbose" > > 2012-11-18 18:30:25 EET pid:93744 LOCATION: ProcessConfigFile, > > guc-file.l:318 > > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: received fast shutdown > > request > > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2187 > > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: aborting any active > > transactions > > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2211 > > 2012-11-18 18:36:45 EET pid:93749 LOG: 00000: autovacuum launcher > > shutting down > > 2012-11-18 18:36:45 EET pid:93749 LOCATION: AutoVacLauncherMain, > > autovacuum.c:767 > > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: shutting down > > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, > xlog.c:7560 > > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: database system is > shut > > down > > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, > xlog.c:7582 > > 2012-11-18 18:36:46 EET pid:95528 FATAL: XX000: poll() failed: > Invalid > > argument > > 2012-11-18 18:36:46 EET pid:95528 LOCATION: WaitLatchOrSocket, > > pg_latch.c:296 > > === > > Ok, this is already helpful. Do you have anything the log that shows > what process pid 95528 is? Could you grep for it? > No, this string - is the last string in the log for this event. I think, I finished the process 95528 by kill -3 (or -2) signal. I did this after unsuccessfully report of the pg_ctl (rc.d script). > Are you able to recompile postgres with other compilation options and/or > some debugging patches applied? > Will try. Give me this patches. > The first thing would be to recompile postgres with --enable-cassert... > That all ? Do you understand my english ? :) -- Regards System and web developer. <http://amkmobile.com>
Igor <igorya.inscriptio@gmail.com> writes: > I feel :) that the problem is in "logging_collector" setting - I used this > feature. But now I comments this setting in config. And will watching .. I was just about to ask what you have log_rotation_age set to ... regards, tom lane
On 2012-11-18 19:22:04 +0200, Igor wrote: > 2012/11/18 Andres Freund <andres@2ndquadrant.com> > > > On 2012-11-18 18:53:13 +0200, Igor wrote: > > > .. > > > > > > > Could you give a bit more context about when exactly that error is > > > > happening? > > > > It would also be helpful if you could set log_error_verbosity = > > > > 'verbose'; in the config file and report back with the complete error > > > > message, including line numbers. > > > > > > > > > > > I try to restart DB and seems get this situation: > > > > > > === > > > 2012-11-18 18:30:25 EET pid:93744 LOG: received SIGHUP, reloading > > > configuration files > > > 2012-11-18 18:30:25 EET pid:93744 LOG: 00000: parameter > > > "log_error_verbosity" changed to "verbose" > > > 2012-11-18 18:30:25 EET pid:93744 LOCATION: ProcessConfigFile, > > > guc-file.l:318 > > > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: received fast shutdown > > > request > > > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2187 > > > 2012-11-18 18:36:45 EET pid:93744 LOG: 00000: aborting any active > > > transactions > > > 2012-11-18 18:36:45 EET pid:93744 LOCATION: pmdie, postmaster.c:2211 > > > 2012-11-18 18:36:45 EET pid:93749 LOG: 00000: autovacuum launcher > > > shutting down > > > 2012-11-18 18:36:45 EET pid:93749 LOCATION: AutoVacLauncherMain, > > > autovacuum.c:767 > > > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: shutting down > > > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, > > xlog.c:7560 > > > 2012-11-18 18:36:45 EET pid:93746 LOG: 00000: database system is > > shut > > > down > > > 2012-11-18 18:36:45 EET pid:93746 LOCATION: ShutdownXLOG, > > xlog.c:7582 > > > 2012-11-18 18:36:46 EET pid:95528 FATAL: XX000: poll() failed: > > Invalid > > > argument > > > 2012-11-18 18:36:46 EET pid:95528 LOCATION: WaitLatchOrSocket, > > > pg_latch.c:296 > > > === > > > > Ok, this is already helpful. Do you have anything the log that shows > > what process pid 95528 is? Could you grep for it? > > > > No, this string - is the last string in the log for this event. Actually I would be interested in *earlier* entries in the log ;) > > Are you able to recompile postgres with other compilation options and/or > > some debugging patches applied? > Will try. Give me this patches. Ok, let me prepare something. > > The first thing would be to recompile postgres with --enable-cassert... > > > > That all ? Yes, that could already catch the source of the error. Note that you probably will want to revert to non --enable-cassert builds after we debugged this, it has a non-negligible overhead. > Do you understand my english ? :) I try ;) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> > Igor <igorya.inscriptio@gmail.com> writes: > > I feel :) that the problem is in "logging_collector" setting - I used > this > > feature. But now I comments this setting in config. And will watching .. > > I was just about to ask what you have log_rotation_age set to ... > > Was: log_rotation_age = 30d -- Regards System and web developer. <http://amkmobile.com>
On 18 November 2012 17:12, Andres Freund <andres@2ndquadrant.com> wrote: > Ok, this is already helpful. Do you have anything the log that shows > what process pid 95528 is? Could you grep for it? I haven't looked at this in detail, but are you sure that this isn't the bug that was fixed by commit e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument" poll() error is too generic to be sure what is really at fault, but this doesn't need to be a FATAL error, it seems. Is this a 32-bit platform? It's probably just that there is a large number of sockets, which Google seems to think can cause this error with poll(), so this seems like a slight variant of the problem reported by Sean Chittenden (kernel resource exhaustion causes pipe creation to fail, with unnecessarily bad consequences for Postgres). -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 2012-11-18 12:29:29 -0500, Tom Lane wrote: > Igor <igorya.inscriptio@gmail.com> writes: > > I feel :) that the problem is in "logging_collector" setting - I used this > > feature. But now I comments this setting in config. And will watching .. > > I was just about to ask what you have log_rotation_age set to ... According to the manpages freebsds poll won't return EINVAL if a contained fd is invalid... I wonder if the the timeout value is screwed and overflowed the cast from 'long' to int and got negative. Igor, I guess youre on a 64bit system? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Igor <igorya.inscriptio@gmail.com> writes: > 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> >> I was just about to ask what you have log_rotation_age set to ... > Was: > log_rotation_age = 30d That's the problem then. Internally that gets converted to milliseconds, which'll overflow at 24-something days. Use a smaller value. So the actual bug here is an over-optimistic maximum value for log_rotation_age in guc.c. After looking through all the callers of WaitLatch, I think the only other similar issue is for wal_sender_timeout, which currently is allowed to range up to INT_MAX; but such values seem pretty useless. I'm inclined to propose limiting both of these to the equivalent of 15 days. Alternatively we could try to rejigger things to prevent asking WaitLatch to wait for more than 2^31 msec, but it's not real clear to me that it's worth the trouble. regards, tom lane
On 2012-11-18 17:36:40 +0000, Peter Geoghegan wrote: > On 18 November 2012 17:12, Andres Freund <andres@2ndquadrant.com> wrote: > > Ok, this is already helpful. Do you have anything the log that shows > > what process pid 95528 is? Could you grep for it? > > I haven't looked at this in detail, but are you sure that this isn't > the bug that was fixed by commit > e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument" > poll() error is too generic to be sure what is really at fault, but > this doesn't need to be a FATAL error, it seems. I looked at that commit at first as well, but I don't really see how it could have fixed that issue. Btw, its not a FATAL but a plain ERROR. > Is this a 32-bit platform? It's probably just that there is a large > number of sockets, which Google seems to think can cause this error > with poll(), so this seems like a slight variant of the problem > reported by Sean Chittenden (kernel resource exhaustion causes pipe > creation to fail, with unnecessarily bad consequences for Postgres). We never have more than 3 fds in the poll. And a failure to create the pipe should have caused problems at other locations before this... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Peter Geoghegan <peter@2ndquadrant.com> writes: > I haven't looked at this in detail, but are you sure that this isn't > the bug that was fixed by commit > e81e8f9342b037246b284bad15e42e21b1929301 ? The "invalid argument" > poll() error is too generic to be sure what is really at fault, but > this doesn't need to be a FATAL error, it seems. He's running BSD, which is not at all wishy-washy about what it means: [EINVAL] The specified time limit is negative. The reason it's FATAL is that it's happening in the syslogger process, which has no error recovery capability. regards, tom lane
2012/11/18 Andres Freund <andres@2ndquadrant.com> > On 2012-11-18 12:29:29 -0500, Tom Lane wrote: > > Igor <igorya.inscriptio@gmail.com> writes: > > > I feel :) that the problem is in "logging_collector" setting - I used > this > > > feature. But now I comments this setting in config. And will watching > .. > > > > I was just about to ask what you have log_rotation_age set to ... > > According to the manpages freebsds poll won't return EINVAL if a > contained fd is invalid... I wonder if the the timeout value is screwed > and overflowed the cast from 'long' to int and got negative. > > Igor, I guess youre on a 64bit system? > Yes. FreeBSD 9.0-RELEASE amd64 And the problems of free resources on the server does not exist. And there is minimal load. -- Regards System and web developer. <http://amkmobile.com>
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> > Igor <igorya.inscriptio@gmail.com> writes: > > 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> > >> I was just about to ask what you have log_rotation_age set to ... > > > Was: > > log_rotation_age = 30d > > That's the problem then. Internally that gets converted to > milliseconds, which'll overflow at 24-something days. Use a smaller > value. > > So the actual bug here is an over-optimistic maximum value for > log_rotation_age in guc.c. After looking through all the callers of > WaitLatch, I think the only other similar issue is for > wal_sender_timeout, which currently is allowed to range up to INT_MAX; > but such values seem pretty useless. > > I'm inclined to propose limiting both of these to the equivalent of 15 > days. # Set this options to: logging_collector = on log_rotation_age = 15d log_rotation_size = 100MB # was early And will watch.. -- Regards System and web developer. <http://amkmobile.com>
On 2012-11-18 12:44:55 -0500, Tom Lane wrote: > Igor <igorya.inscriptio@gmail.com> writes: > > 2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> > >> I was just about to ask what you have log_rotation_age set to ... > > > Was: > > log_rotation_age = 30d > > That's the problem then. Internally that gets converted to > milliseconds, which'll overflow at 24-something days. Use a smaller > value. Nicely guessed. > So the actual bug here is an over-optimistic maximum value for > log_rotation_age in guc.c. After looking through all the callers of > WaitLatch, I think the only other similar issue is for > wal_sender_timeout, which currently is allowed to range up to INT_MAX; > but such values seem pretty useless. > > I'm inclined to propose limiting both of these to the equivalent of 15 > days. Alternatively we could try to rejigger things to prevent asking > WaitLatch to wait for more than 2^31 msec, but it's not real clear to > me that it's worth the trouble. In general I have no problem imposing lower limits, but it seems to be a ugly to get errors for an invalid configuration file after a minor version upgrade. While the wal_sender_timeout isn't really likely to be that high I don't think the log_rotation_age one is unlikely to be set to something in the month range, thats not an unreasonable value. I suggest adding an Assert before the (int)timeout cast, but I guess you already plan to do so. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-11-18 12:44:55 -0500, Tom Lane wrote: >> I'm inclined to propose limiting both of these to the equivalent of 15 >> days. Alternatively we could try to rejigger things to prevent asking >> WaitLatch to wait for more than 2^31 msec, but it's not real clear to >> me that it's worth the trouble. > In general I have no problem imposing lower limits, but it seems to be a > ugly to get errors for an invalid configuration file after a minor > version upgrade. While the wal_sender_timeout isn't really likely to be > that high I don't think the log_rotation_age one is unlikely to be set > to something in the month range, thats not an unreasonable value. Well, we have two reports of people trying such values (assuming that #7545 actually is the same thing), and it didn't work for either of them. I don't think it's a problem to restrict the value to something that will work rather than fail. If you're worried that there's somebody out there using 20 or 21 days as log_rotation_age, I guess we could limit to INT_MAX/1000 seconds or something just less than that. regards, tom lane
2012/11/18 Tom Lane <tgl@sss.pgh.pa.us> > Andres Freund <andres@2ndquadrant.com> writes: > > On 2012-11-18 12:44:55 -0500, Tom Lane wrote: > >> I'm inclined to propose limiting both of these to the equivalent of 15 > >> days. Alternatively we could try to rejigger things to prevent asking > >> WaitLatch to wait for more than 2^31 msec, but it's not real clear to > >> me that it's worth the trouble. > > > In general I have no problem imposing lower limits, but it seems to be a > > ugly to get errors for an invalid configuration file after a minor > > version upgrade. While the wal_sender_timeout isn't really likely to be > > that high I don't think the log_rotation_age one is unlikely to be set > > to something in the month range, thats not an unreasonable value. > > Well, we have two reports of people trying such values (assuming that > #7545 actually is the same thing), and it didn't work for either of > them. I don't think it's a problem to restrict the value to something > that will work rather than fail. > > If you're worried that there's somebody out there using 20 or 21 days > as log_rotation_age, I guess we could limit to INT_MAX/1000 seconds or > something just less than that. > Some experiments are necessary on my side ? -- Regards System and web developer. <http://amkmobile.com>
On 18 November 2012 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, we have two reports of people trying such values (assuming that > #7545 actually is the same thing), and it didn't work for either of > them. I don't think it's a problem to restrict the value to something > that will work rather than fail. Right. sizeof(int) is very probably 4 on all platforms that we support. I see no problem with the proposal. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Am 18.11.2012 um 19:36 schrieb Peter Geoghegan <peter@2ndquadrant.com>: > On 18 November 2012 18:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Well, we have two reports of people trying such values (assuming that >> #7545 actually is the same thing), and it didn't work for either of >> them. I don't think it's a problem to restrict the value to = something >> that will work rather than fail. >=20 > Right. sizeof(int) is very probably 4 on all platforms that we > support. I see no problem with the proposal. Have you cross-checked this on a 64bit platform vs. a 32 bit platform? e.g. on Linux i386 vs. Linux amd64? >=20 > --=20 > Peter Geoghegan http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training and Services >=20 >=20 > --=20 > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs Freundliche Gr=FCsse, micro systems Marc Balmer -- Marc Balmer micro systems, Wiesendamm 2a, Postfach, 4019 Basel fon +41 61 383 05 10, fax +41 61 383 05 12, http://www.msys.ch/
On 2012-11-18 13:18:42 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2012-11-18 12:44:55 -0500, Tom Lane wrote: > >> I'm inclined to propose limiting both of these to the equivalent of 15 > >> days. Alternatively we could try to rejigger things to prevent asking > >> WaitLatch to wait for more than 2^31 msec, but it's not real clear to > >> me that it's worth the trouble. > > > In general I have no problem imposing lower limits, but it seems to be a > > ugly to get errors for an invalid configuration file after a minor > > version upgrade. While the wal_sender_timeout isn't really likely to be > > that high I don't think the log_rotation_age one is unlikely to be set > > to something in the month range, thats not an unreasonable value. > > Well, we have two reports of people trying such values (assuming that > #7545 actually is the same thing), and it didn't work for either of > them. I don't think it's a problem to restrict the value to something > that will work rather than fail. Thats a good point. But #7545 was on OSX so its not really that much evidence a larger value doesn't work on non-bsdish systems. In fact, a setting of 32 days seems not to cause any immediate problems here on linux, even when choosing the timeout in a way it results in a negative timeout value for poll. Not sure what its waiting for, but it doesn't crash. Greetings, Andres Freund
On 18 November 2012 18:42, Marc Balmer <marc@msys.ch> wrote: > Have you cross-checked this on a 64bit platform vs. a 32 bit platform? > e.g. on Linux i386 vs. Linux amd64? Well, strictly speaking sizeof(int) is dictated by both the compiler and CPU architecture in question. I believe that all current Unix-like systems follow LLP64 (or LLP32) in practice. sizeof(int) is 4 on Windows (though, in contrast to LLP64, sizeof(long) is 4 too on 64-bit windows). -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-11-18 13:18:42 -0500, Tom Lane wrote: >> Well, we have two reports of people trying such values (assuming that >> #7545 actually is the same thing), and it didn't work for either of >> them. I don't think it's a problem to restrict the value to something >> that will work rather than fail. > Thats a good point. But #7545 was on OSX so its not really that much > evidence a larger value doesn't work on non-bsdish systems. > In fact, a setting of 32 days seems not to cause any immediate problems > here on linux, even when choosing the timeout in a way it results in a > negative timeout value for poll. Not sure what its waiting for, but it > doesn't crash. If you look closely at what's happening in syslogger.c, you'll realize that it's a phase of the moon kind of behavior, because the next_rotation_time is rounded off to a multiple of the specified log_rotation_age. Then you'll get an overflow (or not) depending on whether now() is within 2^31 msec of that target time. So for example with a 30-day log_rotation_age, you'd be seeing failures during the first five days of every "month", where a month is defined as exactly-30-day intervals since the epoch. And I think there's a timezone correction in there too, which would shift the trouble intervals around for different people. Anyway, even if Linux for some reason doesn't reject negative values, I think we need to limit the GUC's range so we don't try to use them. Maybe you'd get sane behavior and maybe you wouldn't. I notice that the Single Unix Spec doesn't specify EINVAL for negative timeout values, but it also doesn't define what happens for a negative value other than -1. So this is basically an unspecified case and it's incumbent on us to not do it if we want portable behavior. regards, tom lane
On 2012-11-18 14:07:37 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2012-11-18 13:18:42 -0500, Tom Lane wrote: > >> Well, we have two reports of people trying such values (assuming that > >> #7545 actually is the same thing), and it didn't work for either of > >> them. I don't think it's a problem to restrict the value to something > >> that will work rather than fail. > > > Thats a good point. But #7545 was on OSX so its not really that much > > evidence a larger value doesn't work on non-bsdish systems. > > > In fact, a setting of 32 days seems not to cause any immediate problems > > here on linux, even when choosing the timeout in a way it results in a > > negative timeout value for poll. Not sure what its waiting for, but it > > doesn't crash. > > If you look closely at what's happening in syslogger.c, you'll realize > that it's a phase of the moon kind of behavior, because the > next_rotation_time is rounded off to a multiple of the specified > log_rotation_age. Then you'll get an overflow (or not) depending on > whether now() is within 2^31 msec of that target time. So for example > with a 30-day log_rotation_age, you'd be seeing failures during the > first five days of every "month", where a month is defined as > exactly-30-day intervals since the epoch. And I think there's a > timezone correction in there too, which would shift the trouble > intervals around for different people. I verified with strace that I calculated correctly and got a negative value: poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}], 2, -1033055888) = 1 ([{fd=10, revents=POLLIN}]) (that poll finished was due to a SIGHUP) > Anyway, even if Linux for some reason doesn't reject negative values, > I think we need to limit the GUC's range so we don't try to use them. > Maybe you'd get sane behavior and maybe you wouldn't. I notice that > the Single Unix Spec doesn't specify EINVAL for negative timeout > values, but it also doesn't define what happens for a negative value > other than -1. So this is basically an unspecified case and it's > incumbent on us to not do it if we want portable behavior. I aggree that we need to do something. I just think it might be enough to clamp the timeout value passed to WaitLatchOrSocket to the maximum valid value. The rest of SysLoggerMain seems to work correctly in that case. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Marc Balmer <marc@msys.ch> writes: > Am 18.11.2012 um 19:36 schrieb Peter Geoghegan <peter@2ndquadrant.com>: >> Right. sizeof(int) is very probably 4 on all platforms that we >> support. I see no problem with the proposal. > Have you cross-checked this on a 64bit platform vs. a 32 bit platform? > e.g. on Linux i386 vs. Linux amd64? sizeof(int) is most definitely 4 on all platforms that we support. I've lost count of how many places would fall over if that weren't the case, but for sure the tuple packing code would have problems. regards, tom lane
Andres Freund <andres@2ndquadrant.com> writes: > On 2012-11-18 14:07:37 -0500, Tom Lane wrote: >> Anyway, even if Linux for some reason doesn't reject negative values, >> I think we need to limit the GUC's range so we don't try to use them. >> Maybe you'd get sane behavior and maybe you wouldn't. I notice that >> the Single Unix Spec doesn't specify EINVAL for negative timeout >> values, but it also doesn't define what happens for a negative value >> other than -1. So this is basically an unspecified case and it's >> incumbent on us to not do it if we want portable behavior. > I aggree that we need to do something. I just think it might be enough > to clamp the timeout value passed to WaitLatchOrSocket to the maximum > valid value. The rest of SysLoggerMain seems to work correctly in that > case. This seems a rather 64-bit-centric view of the world. If "long" is 32 bits, the value will have overflowed before it ever gets to WaitLatch. I'm thinking it might be reasonable to put an Assert(timeout <= INT_MAX) into WaitLatch, since we're already asserting timeout >= 0. But adding a run-time clamp is not sufficient to fix the problem, because if any overflow happens it's happening upstream. regards, tom lane