Discussion: Notice and share memory corruption
I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2 rpm-s
NOTICE: RegisterSharedInvalid: SI buffer overflow
NOTICE: InvalidateSharedInvalid: cache state reset
Actually I get many of them ;(
I'm running a script that does a bunch of mixed INSERTS, UPDATES, DELETES and SELECTS.
after getting that I'm unable to vacuum database until I reset the OS
Where/how should I start looking (or is it a known problem)
Are there any simple workarounds to stop it happening.
-----------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
> I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> rpm-s
> NOTICE: RegisterSharedInvalid: SI buffer overflow
> NOTICE: InvalidateSharedInvalid: cache state reset
> Actually I get many of them ;(
AFAIK, these are just noise in 7.0. The only reason you see them is
we haven't got round to removing the messages or downgrading them to
elog(DEBUG).
> I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> DELETES and SELECTS.
I'll bet you also have some backends sitting idle with open
transactions? The combination of idle and active backends is what
usually provokes SI overruns.
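To make the idle-plus-active interaction concrete, here is a minimal sketch of a fixed-size invalidation queue with one reader that keeps up and one that never reads. It is a toy model, not PostgreSQL's shared-invalidation code: `SIQueue`, `SI_SLOTS`, `si_insert` and `si_read_all` are hypothetical names, and the four-slot size is an assumption chosen to keep the example short.

```c
#include <stdbool.h>

#define SI_SLOTS   4   /* hypothetical queue size, tiny for illustration */
#define N_BACKENDS 2   /* backend 0 is active, backend 1 sits idle */

typedef struct {
    int  msgs[SI_SLOTS];         /* circular buffer of invalidation messages */
    long head;                   /* total number of messages ever written */
    long next_read[N_BACKENDS];  /* next message each backend will read */
    bool reset[N_BACKENDS];      /* backend must throw away its whole cache */
} SIQueue;

void si_init(SIQueue *q)
{
    q->head = 0;
    for (int b = 0; b < N_BACKENDS; b++) {
        q->next_read[b] = 0;
        q->reset[b] = false;
    }
}

/* Append one message. Returns true if some backend had fallen a full
 * buffer behind and was therefore marked for a cache reset (the toy
 * equivalent of "SI buffer overflow"). */
bool si_insert(SIQueue *q, int msg)
{
    bool overflow = false;

    for (int b = 0; b < N_BACKENDS; b++) {
        if (!q->reset[b] && q->head - q->next_read[b] >= SI_SLOTS) {
            q->reset[b] = true;   /* its unread slot is about to be overwritten */
            overflow = true;
        }
    }
    q->msgs[q->head % SI_SLOTS] = msg;
    q->head++;
    return overflow;
}

/* Backend b catches up. Returns the number of messages consumed, or -1
 * if it had been marked and must reset its cache instead (the toy
 * equivalent of "cache state reset"). */
int si_read_all(SIQueue *q, int b)
{
    if (q->reset[b]) {
        q->reset[b] = false;
        q->next_read[b] = q->head;
        return -1;
    }
    int n = (int) (q->head - q->next_read[b]);
    q->next_read[b] = q->head;
    return n;
}
```

In this model, if backend 0 calls `si_read_all` after every insert and backend 1 never does, the fifth `si_insert` reports an overflow and backend 1's next read returns -1. Nothing is lost beyond the cached state, which fits the description of the notices as noise.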
> after getting that I'm unable to vacuum database until I reset the OS
Define your terms more carefully, please. What do you mean by
"unable to vacuum" --- what happens *exactly*? In any case,
surely it doesn't take an OS reboot to recover. I might believe
you need to restart the postmaster...
regards, tom lane
Tom Lane wrote:
>
> Hannu Krosing <hannu@tm.ee> writes:
> > I get the following on untuned Linux (Redhat 6.2) using stock 7.0.2
> > rpm-s
> > NOTICE: RegisterSharedInvalid: SI buffer overflow
> > NOTICE: InvalidateSharedInvalid: cache state reset
> > Actually I get many of them ;(
> AFAIK, these are just noise in 7.0. The only reason you see them is
> we haven't got round to removing the messages or downgrading them to
> elog(DEBUG).
> > I'm running a script that does a bunch of mixed INSERTS, UPDATES,
> > DELETES and SELECTS.
> I'll bet you also have some backends sitting idle with open
> transactions? The combination of idle and active backends is what
> usually provokes SI overruns.
> > after getting that I'm unable to vacuum database until I reset the OS
> Define your terms more carefully, please. What do you mean by
> "unable to vacuum" --- what happens *exactly*?
NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is referenced (private 0, global 2)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
> In any case,
> surely it doesn't take an OS reboot to recover. I might believe
> you need to restart the postmaster...
on one machine a simple restart worked
Maybe i have to really restart it (instead of doing /etc/rc.d/init.d/postgresql restart)
by running killall -9 /usr/bin/postgres
I was quite sure that just restarting it did not help, but maybe it really did not restart, just claimed to.
On the other I still get
amphora2=# vacuum;
NOTICE: FlushRelationBuffers(item, 30): block 2 is referenced (private 0, global 1)
FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
pqReadData() -- backend closed the channel unexpectedly.
This probably means the backend terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
after stopping postmaster (and checking it is stopped)
I could do a vacuum after restarting the whole machine...
OTOH it _may_ be that someone started another backend right after restart and did something, but must this be a FATAL error ?
-----------
Hannu
Hannu Krosing <hannu@tm.ee> writes:
>> Define your terms more carefully, please. What do you mean by
>> "unable to vacuum" --- what happens *exactly*?
> NOTICE: FlushRelationBuffers(access_right, 2009): block 1944 is
> referenced (private 0, global 2)
> FATAL 1: VACUUM (vc_repair_frag): FlushRelationBuffers returned -2
Oh, that's interesting. This error indicates that some prior
transaction neglected to release a reference count on a shared buffer.
We have seen sporadic reports of this problem in 7.0, but so far no
one has come up with a reproducible example. If you can boil down
your script to something that reproducibly causes the problem then
that'd be a great help in tracking it down.
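As a rough illustration of why one forgotten release is enough to block VACUUM, the sketch below models a shared buffer as a block number plus a pin count. It is a simplified assumption, not the real buffer manager: `SharedBuf`, `buf_pin`, `buf_unpin` and `flush_relation` are hypothetical names, and the -2 return value is borrowed from the error message quoted above only to make the link visible.

```c
#include <stdbool.h>

/* Toy shared buffer: a block number and the number of outstanding pins. */
typedef struct {
    int block;      /* block number held in this buffer */
    int refcount;   /* how many users currently hold a reference */
} SharedBuf;

void buf_pin(SharedBuf *b)
{
    b->refcount++;
}

void buf_unpin(SharedBuf *b)
{
    if (b->refcount > 0)
        b->refcount--;
}

/* Try to drop every buffer of a relation. Returns 0 on success, or -2
 * as soon as any buffer is still referenced (hypothetical convention,
 * mirroring the value reported in the FATAL message). */
int flush_relation(const SharedBuf *bufs, int n)
{
    for (int i = 0; i < n; i++) {
        if (bufs[i].refcount > 0)
            return -2;      /* somebody never released this buffer */
    }
    return 0;
}
```

If a transaction calls `buf_pin` and exits without the matching `buf_unpin`, the count lives on in shared state, so every later `flush_relation` keeps failing until that state is rebuilt. That is consistent with a postmaster restart clearing the problem.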
If you have clients that sometimes disconnect in the middle of a
transaction, it might help to apply the attached patch.
> Maybe i have to really restart it (instead of doing
> /etc/rc.d/init.d/postgresql restart)
> by running killall -9 /usr/bin/postgres
Restarting the postmaster should clear the problem (by releasing and
reinitializing shared memory). I dunno where you got the idea that
kill -9 was a recommended way of shutting down the system, but I sure
wouldn't recommend it. A plain kill on the postmaster ought to do it
(see the pg_ctl script in release 7.0.*).
regards, tom lane
*** src/backend/tcop/postgres.c.orig Sat May 20 22:23:30 2000
--- src/backend/tcop/postgres.c Wed Aug 30 16:47:51 2000
***************
*** 1459,1465 ****
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         proc_exit(0);
      SetProcessingMode(NormalProcessing);
--- 1459,1465 ----
       * Initialize the deferred trigger manager
       */
      if (DeferredTriggerInit() != 0)
!         goto normalexit;
      SetProcessingMode(NormalProcessing);
***************
*** 1479,1490 ****
          TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
          AbortCurrentTransaction();
!         InError = false;
          if (ExitAfterAbort)
!         {
!             ProcReleaseLocks(); /* Just to be sure... */
!             proc_exit(0);
!         }
      }
      Warn_restart_ready = true; /* we can now handle elog(ERROR) */
--- 1479,1489 ----
          TPRINTF(TRACE_VERBOSE, "AbortCurrentTransaction");
          AbortCurrentTransaction();
!         if (ExitAfterAbort)
!             goto errorexit;
!
!         InError = false;
      }
      Warn_restart_ready = true; /* we can now handle elog(ERROR) */
***************
*** 1553,1560 ****
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontend connection during F message input */
!                     pq_close();
!                     proc_exit(0);
                  }
                  break;
--- 1552,1558 ----
                  if (HandleFunctionRequest() == EOF)
                  {
                      /* lost frontend connection during F message input */
!                     goto normalexit;
                  }
                  break;
***************
*** 1608,1618 ****
               */
              case 'X':
              case EOF:
!                 if (!IsUnderPostmaster)
!                     ShutdownXLOG();
!                 pq_close();
!                 proc_exit(0);
!                 break;
              default:
                  elog(ERROR, "unknown frontend message was received");
--- 1606,1612 ----
               */
              case 'X':
              case EOF:
!                 goto normalexit;
              default:
                  elog(ERROR, "unknown frontend message was received");
***************
*** 1642,1651 ****
          if (IsUnderPostmaster)
              NullCommand(Remote);
      }
!     } /* infinite for-loop */
!     proc_exit(0); /* shouldn't get here... */
!     return 1;
  }
  #ifndef HAVE_GETRUSAGE
--- 1636,1655 ----
          if (IsUnderPostmaster)
              NullCommand(Remote);
      }
!     } /* end of main loop */
!
! normalexit:
!     ExitAfterAbort = true; /* ensure we will exit if elog during abort */
!     AbortOutOfAnyTransaction();
!     if (!IsUnderPostmaster)
!         ShutdownXLOG();
!
! errorexit:
!     pq_close();
!     ProcReleaseLocks(); /* Just to be sure... */
!     proc_exit(0);
!     return 1; /* keep compiler quiet */
  }
  #ifndef HAVE_GETRUSAGE
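The shape of the patch, stripped of backend details, is to route every exit from the main loop through two shared labels so that no path can skip the cleanup. The following is a minimal sketch of that pattern only. The label names `normalexit` and `errorexit` come from the patch; `Session`, `Event` and `run_session` are hypothetical stand-ins, and the boolean flags merely represent the work done by the calls named in the comments.

```c
#include <stdbool.h>

/* Hypothetical per-connection state, reduced to three flags. */
typedef struct {
    bool in_transaction;
    bool locks_held;
    bool connection_open;
} Session;

typedef enum {
    EV_QUERY,            /* ordinary work: opens a transaction, takes locks */
    EV_TERMINATE,        /* client sent a terminate message */
    EV_LOST_CONNECTION,  /* client vanished in the middle of a transaction */
    EV_ABORT_EXIT        /* transaction already aborted, exit was requested */
} Event;

int run_session(Session *s, const Event *events, int n)
{
    for (int i = 0; i < n; i++) {
        switch (events[i]) {
        case EV_QUERY:
            s->in_transaction = true;
            s->locks_held = true;
            break;
        case EV_TERMINATE:
        case EV_LOST_CONNECTION:
            goto normalexit;            /* never exit directly from here */
        case EV_ABORT_EXIT:
            s->in_transaction = false;  /* abort has already run */
            goto errorexit;
        }
    }

normalexit:
    s->in_transaction = false;   /* stands in for AbortOutOfAnyTransaction() */

errorexit:
    s->connection_open = false;  /* stands in for pq_close() */
    s->locks_held = false;       /* stands in for ProcReleaseLocks() */
    return 0;
}
```

Feeding this the sequence `EV_QUERY, EV_LOST_CONNECTION` leaves all three flags false, which is the case the patch targets: a client that disconnects mid-transaction still has its transaction aborted and its locks released before the process exits.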