Re: Spontaneous PostgreSQL Server Reboot?
From | Andrew Biagioni |
---|---|
Subject | Re: Spontaneous PostgreSQL Server Reboot? |
Date | |
Msg-id | 406B984E.7030109@e-greek.net |
In response to | Re: Spontaneous PostgreSQL Server Reboot? ("scott.marlowe" <scott.marlowe@ihs.com>) |
List | pgsql-admin |
scott.marlowe wrote:
> On Tue, 30 Mar 2004, Andrew Biagioni wrote:
>
>>Alex,
>>
>>the answer is "no" to all of these. We are a tiny start-up (2 guys, and
>>we do our own cleaning); ambient temperature varies significantly but
>>is not related to the failure, and one machine starts beeping when it
>>gets too hot (then we added an extra case fan); no fancy watchdogs
>>(maybe someday... One can only dream :-> ); three different cases,
>>power supplies, motherboards, etc., etc. (one power supply is
>>extra-large, and that's the machine that started failing first!).
>>
>>We originally blamed the problem on hardware failure (first machine);
>>then on OS version/configuration (second machine); now we're out of
>>things to blame, except maybe unusually bad luck...
>
> What did memtest86 say?
>
> Did the same person build all the machines? I've seen plenty of folks
> build machines and zap the memory when installing it. >95% of all ESD
> failures are partial / delayed failures, so just because a computer boots
> up doesn't mean proper ESD procedures were followed, and if not, and if
> you're in a dry environment like I am (I live in Denver) then it's quite
> possible all three have bad CPU/mobo/memory or something like that.

Two different people built the machines; we're both electrical engineers with plenty of familiarity and experience with static issues, so that particular issue is not likely.

As for memtest86 - I haven't been able to run it on two of the machines yet (they are in production), and I have to restart the third one (it was "retired" after the third time it died on us).

Meanwhile I found out some more details:

- the first machine had a software RAID system that may have been unreliable
- the second machine had a much older kernel and sloppily-updated modules, and it would hang -- not reboot
- the last machine to reboot MAY have been a line power issue (the whole building lost power a few hours later, so I lost some info on other machines' restarting -- I'll dig more).

So -- it's memtest86 and badblocks for all three (as soon as I can), better UPS-ing, updated kernel(s), and checking more machines' logs; then we'll see...

Thanks to you all for the suggestions -- keep them coming!

Andrew