Thread: [BUGS] BUG #14555: EBUSY error on read() on NFS

[BUGS] BUG #14555: EBUSY error on read() on NFS

From
ashwath.rao@altair.com
Date:
The following bug has been logged on the website:

Bug reference:      14555
Logged by:          Ashwath Rao
Email address:      ashwath.rao@altair.com
PostgreSQL version: 9.3.6
Operating system:   SLES11SP4
Description:

We use Postgres version 9.3.6 with our product PBS Professional. We're
having an issue at one of our customers that seems to be caused by the
filesystem used for the datastore, but was only exposed by the PostgreSQL
update. The datastore sometimes does not work. When we try to dump the
datastore, we get messages very similar to those in pg_log:

# /opt/pbs/default/pgsql/bin/pg_dump -U <USER> -p 15007 pbs_datastore >
pbs_datastore_14022017.sql
Password: 
pg_dump: Dumping the contents of table "job_attr" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR:  could not read block 69600 in
file "base/16384/16555": Device or resource busy
pg_dump: The command was: COPY pbs.job_attr (ji_jobid, attr_name,
attr_resource, attr_value, attr_flags) TO stdout;


Once we have this, we seem to have errors only on that one file, but EBUSY
is _really_ puzzling. It's not one of the valid errno settings for a read()
system call.
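
To double-check which errno the message text corresponds to, a minimal sketch in Python can map the text back to symbolic names. The helper name `errno_names_for` is made up for illustration, and the wording "Device or resource busy" is what glibc prints for EBUSY on Linux; other platforms may word it differently.

```python
import errno
import os


def errno_names_for(message):
    """Return the symbolic names of all errno values whose strerror() text matches."""
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


if __name__ == "__main__":
    # Show the numeric value and the platform's text for EBUSY.
    print(errno.EBUSY, os.strerror(errno.EBUSY))
    # Look the logged text up in reverse (Linux/glibc wording assumed).
    print(errno_names_for("Device or resource busy"))
```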

Bizarrely, problems seem to be reported on a small number of blocks:
# grep ERR /panfs/e/PBS/datastore/pg_log/pbs_dataservice_log.Tue
2017-02-14 08:13:33 UTCERROR:  could not read block 69600 in file
"base/16384/16555": Device or resource busy
2017-02-14 08:15:59 UTCERROR:  could not read block 99298 in file
"base/16384/16555": Device or resource busy
2017-02-14 08:37:29 UTCERROR:  could not read block 9608 in file
"base/16384/16555": Device or resource busy

But it's not consistently the same block:
# /opt/pbs/default/pgsql/bin/pg_dump -U <USER> -p 15007 pbs_datastore >
pbs_datastore_17012017_new.sql
Password: 
pg_dump: Dumping the contents of table "job_attr" failed: PQgetResult()
failed.
pg_dump: Error message from server: ERROR:  could not read block 105740 in
file "base/16384/16555": Device or resource busy
pg_dump: The command was: COPY pbs.job_attr (ji_jobid, attr_name,
attr_resource, attr_value, attr_flags) TO stdout;

When you're in this state, then even the usual error recovery methods don't
work reliably:
# $PBS_EXEC/pgsql/bin/psql -U <USER> -p 15007 -d pbs_datastore
Password for user crayadm: 
psql (9.3.6)
Type "help" for help.
 
pbs_datastore=# set search_path to pbs;
SET
pbs_datastore=# set zero_damaged_pages=on;
SET
pbs_datastore=# vacuum full;
ERROR:  could not read block 99298 in file "base/16384/16555": Device or
resource busy


The file can be read *sequentially* without any issue, though:

dd if=/panfs/e/PBS/datastore/base/16384/16555 of=/tmp/16555 
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB) copied, 10.1065 s, 106 MB/s
xcepbs00:~ # echo $?
0


Last time we had this, just making a tarball and untarring it fixed
everything! In other words, the files appear to be readable sequentially
without any issue, but the I/O patterns used by PostgreSQL seem to cause
trouble.
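
The dd run above reads the file front to back in 512-byte records, which is quite different from fetching individual blocks at arbitrary offsets. A rough way to imitate the latter outside the database is sketched below. It assumes the default 8192-byte PostgreSQL block size and uses `pread()` as a stand-in for the seek-then-read sequence the server performs; the function name `probe_blocks` is hypothetical.

```python
import errno
import os
import sys

BLOCK_SIZE = 8192  # assumed: PostgreSQL's default block size (a build-time option)


def probe_blocks(path, block_numbers, block_size=BLOCK_SIZE):
    """Read each listed block at its byte offset and return {block: result}.

    result is "ok", "short read (N bytes)", or the errno name of the failure.
    """
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        for blk in block_numbers:
            try:
                # Positioned read: jumps straight to the block, no sequential scan.
                data = os.pread(fd, block_size, blk * block_size)
            except OSError as exc:
                results[blk] = errno.errorcode.get(exc.errno, str(exc.errno))
            else:
                if len(data) == block_size:
                    results[blk] = "ok"
                else:
                    results[blk] = f"short read ({len(data)} bytes)"
    finally:
        os.close(fd)
    return results


if __name__ == "__main__" and len(sys.argv) > 2:
    # Arguments: file path, then one or more block numbers.
    for blk, res in probe_blocks(sys.argv[1], [int(b) for b in sys.argv[2:]]).items():
        print(blk, res)
```

Saved under a name of your choice (say `probe_blocks.py`), it could be run as `python3 probe_blocks.py /panfs/e/PBS/datastore/base/16384/16555 69600 99298 9608`. If the same blocks fail here with EBUSY, that would point at the filesystem or kernel rather than at anything the server does.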

We are trying to find out what can actually return EBUSY. Is this on a read
call, or could it be another call? Does the message tell us anything about
_where_ in the code the failure occurred?


--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] BUG #14555: EBUSY error on read() on NFS

From
Tom Lane
Date:
ashwath.rao@altair.com writes:
> Once we have this, we seem to have errors only on that one file, but EBUSY
> is _really_ puzzling. It's not one of the valid errno settings for a read()
> system call.

Indeed.  Some googling suggests that this might be a known issue
in the kernel: Red Hat fixed something with similar symptoms in their
RHEL6 series about a year ago.  You might want to check for SLES updates,
and pester SUSE if there's not a fix available.  It seems highly unlikely
that it's Postgres' fault in any meaningful sense, in any case.

(FWIW, a lot of Postgres hackers consider NFS to be too unreliable to
keep a database on.  NFS is great, don't get me wrong, but it's got a
very long track record of intermittent weirdness like this.  If you're
trying to get from three-nines to five-nines reliability, keeping your
data on NFS is a serious stumbling block to getting there.)

            regards, tom lane



Re: [BUGS] BUG #14555: EBUSY error on read() on NFS

From
John R Pierce
Date:
On 2/19/2017 8:43 PM, Tom Lane wrote:
> (FWIW, a lot of Postgres hackers consider NFS to be too unreliable to
> keep a database on.  NFS is great, don't get me wrong, but it's got a
> very long track record of intermittent weirdness like this.  If you're
> trying to get from three-nines to five-nines reliability, keeping your
> data on NFS is a serious stumbling block to getting there.)

For what it's worth (about $0.02), I remember Oracle saying DO NOT USE
NFS for database storage unless it was a NetApp Filer with a specific
set of options configured, and that no other configuration was considered
robust enough for database storage.


-- 
john r pierce, recycling bits in santa cruz




Re: [BUGS] BUG #14555: EBUSY error on read() on NFS

From
Tom Lane
Date:
ashwath.rao@altair.com writes:
> We are trying to find out what can actually return EBUSY. Is this on a read
> call, or could it be another call? Does the message tell us anything about
> _where_ in the code the failure occurred?

BTW, there is only one place in the PG 9.3 sources that can return that
exact error string, but some digging down into the called code says that
the errno could possibly be coming from either open() or read(), depending
on whether the process already had the file open.
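
To illustrate why the message alone cannot distinguish the two cases, here is a small Python sketch of a reader that opens its file on first use. It is not PostgreSQL's code; the class and method names are hypothetical and the 8192-byte block size is an assumed default. An errno raised by either the open or the read ends up in the same message text.

```python
import os

BLOCK_SIZE = 8192  # assumed default PostgreSQL block size


class BlockReadError(Exception):
    """Raised with one fixed message shape, whatever system call failed."""


class LazyBlockFile:
    """Hypothetical stand-in for a storage layer that opens files on demand."""

    def __init__(self, path):
        self.path = path
        self.fd = None  # not opened until the first read

    def read_block(self, blockno):
        try:
            if self.fd is None:
                # First access by this process: a failure here comes from open().
                self.fd = os.open(self.path, os.O_RDONLY)
            # File already open: only the read itself can fail.
            data = os.pread(self.fd, BLOCK_SIZE, blockno * BLOCK_SIZE)
        except OSError as exc:
            # Both failure paths produce the same text, so the message
            # does not reveal which call set errno.
            raise BlockReadError(
                f'could not read block {blockno} in file "{self.path}": {exc.strerror}'
            ) from exc
        return data

    def close(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None
```

Tracing the backend's system calls (for example with strace, if it is available on the host) would be one way to see which call actually returns the error.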

Interestingly, Red Hat's bug database says that they've fixed two separate
kernel bugs resulting in EBUSY-on-NFS errors in the past couple years.
One bug afflicted open() calls and the other read() calls.  I doubt that
either bug was unique to Red Hat's copy of the kernel.

Again, I think you need to be discussing this with SUSE not us.  I'm not
a SUSE user so I don't know where to look to find out about bugs in
SUSE's kernel versions.

            regards, tom lane

