Обсуждение: The ability of postgres to determine loss of files of the main fork

Поиск
Список
Период
Сортировка

The ability of postgres to determine loss of files of the main fork

От
Frits Hoogland
Дата:
 Hi hackers,

Postgres determines the availability of the main fork, the actual data file,
upon reading it. This is also what amcheck will detect.

However, if a relation exceeds 1GB and has more than a single data file
segment, there is the option of silent data loss.  For example, if a table
consists of five segments and the third one goes missing, a sequential scan
will happily conclude that the table consists only of two segments and won't
report an error.  Only an index scan that tries to return a row in the
missing segment will report an error.

Currently, this kind of data loss cannot be detected if you check the
integrity of the table using verify_heapam().  Only if you manually use the
primary key index to obtain data from a missing segment, you will get an error.

Therefore, I would like to request an enhancement: add an option to
verify_heapam() that causes the primary key index to be scanned and makes
sure that all line pointers in the index point to existing tuples.

An alternative might be to track the number of segments of a relation in
pg_class, but that may be difficult to make crash-safe.

Frits Hoogland




Re: The ability of postgres to determine loss of files of the main fork

От
Aleksander Alekseev
Дата:
Hi Frits,

> Therefore, I would like to request an enhancement: add an option to
> verify_heapam() that causes the primary key index to be scanned and makes
> sure that all line pointers in the index point to existing tuples.

I'm a bit puzzled by your emphasis on primary keys. In Postgres it is
legal to have tables without PKs, indexes, or even columns:

=# create table my_table();
=# select * from my_table;

To clarify, are you proposing not to check such tables?

> An alternative might be to track the number of segments of a relation in
> pg_class, but that may be difficult to make crash-safe.

Hm... the fact that we have a segment on disk doesn't mean it is not
empty or not corrupted. Let's say we will add a check you are
proposing. The next person will complain that we don't check the size
of the segments. The next one - about the fact that we don't verify
checksums.

So IMO there is little value in adding a check for the existence of
the segments for a single table. And the *real* check will not differ
much from something like SELECT * FROM my_table, or from making a
complete backup of the database.

-- 
Best regards,
Aleksander Alekseev



Re: The ability of postgres to determine loss of files of the main fork

От
Tom Lane
Дата:
Aleksander Alekseev <aleksander@tigerdata.com> writes:
>> Therefore, I would like to request an enhancement: add an option to
>> verify_heapam() that causes the primary key index to be scanned and makes
>> sure that all line pointers in the index point to existing tuples.

> ... IMO there is little value in adding a check for the existence of
> the segments for a single table. And the *real* check will not differ
> much from something like SELECT * FROM my_table, or from making a
> complete backup of the database.

As Frits mentioned, neither of those actions will really notice if a
table has been truncated via loss of a segment.

However, I think the requested functionality already exists via
contrib/amcheck (see the heapallindexed option).  The user does have
to make a decision which index to check with, but I think that'd be
required anyway --- as you say, there isn't necessarily a primary key.

            regards, tom lane



Re: The ability of postgres to determine loss of files of the main fork

От
Laurenz Albe
Дата:
On Tue, 2025-09-30 at 18:55 +0300, Aleksander Alekseev wrote:
> > Therefore, I would like to request an enhancement: add an option to
> > verify_heapam() that causes the primary key index to be scanned and makes
> > sure that all line pointers in the index point to existing tuples.
>
> I'm a bit puzzled by your emphasis on primary keys. In Postgres it is
> legal to have tables without PKs, indexes, or even columns:
>
> =# create table my_table();
> =# select * from my_table;
>
> To clarify, are you proposing not to check such tables?

If there is no index on a table, there is no way to determine a missing
segment.  I don't think that Frits deliberately wants to *not* check
such tables, it's just that there is no way to do it as things are now.

Just because a check is not possible in certain cases is no good reason
to forgo a check in those cases where it is possible.

Yours,
Laurenz Albe



Re: The ability of postgres to determine loss of files of the main fork

От
Michael Banck
Дата:
Hi,

wow, this is one of the most terrifying threads I've ever seen...

On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> Aleksander Alekseev <aleksander@tigerdata.com> writes:
> >> Therefore, I would like to request an enhancement: add an option to
> >> verify_heapam() that causes the primary key index to be scanned and makes
> >> sure that all line pointers in the index point to existing tuples.
> 
> > ... IMO there is little value in adding a check for the existence of
> > the segments for a single table. And the *real* check will not differ
> > much from something like SELECT * FROM my_table, or from making a
> > complete backup of the database.
> 
> As Frits mentioned, neither of those actions will really notice if a
> table has been truncated via loss of a segment.

Is there a valid case for a missing segment? If not, couldn't this be
caught somewhere in the storage manager?
 
> However, I think the requested functionality already exists via
> contrib/amcheck (see the heapallindexed option).  

It doesn't balk for me, am I doing something wrong?

|mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
|  count
|----------
| 20000000
|(1 row)
|
|mbanck@mbanck-lin-1:~$ rm /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.1
|mbanck@mbanck-lin-1:~$ ls /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462* | grep -v 16462_
|/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462
|/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.2
|mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
|  count  
|---------
| 7995392
|(1 row)
|
|mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -t pgbench_accounts
|pg_amcheck: including database "postgres"
|pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
|pg_amcheck: checking heap table "postgres.public.pgbench_accounts"
|pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
|mbanck@mbanck-lin-1:~$ echo $?
|0
|mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -i pgbench_accounts_pkey
|pg_amcheck: including database "postgres"
|pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
|pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
|mbanck@mbanck-lin-1:~$ echo $?
|0
|mbanck@mbanck-lin-1:~$

And neither pg_checksums nor pg_basebackup catch it either...


Michael



Re: The ability of postgres to determine loss of files of the main fork

От
Arseniy Mukhin
Дата:
Hi,

On Wed, Oct 1, 2025 at 10:02 AM Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> wow, this is one of the most terrifying threads I've ever seen...
>
> On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> > Aleksander Alekseev <aleksander@tigerdata.com> writes:
> > >> Therefore, I would like to request an enhancement: add an option to
> > >> verify_heapam() that causes the primary key index to be scanned and makes
> > >> sure that all line pointers in the index point to existing tuples.
> >
> > > ... IMO there is little value in adding a check for the existence of
> > > the segments for a single table. And the *real* check will not differ
> > > much from something like SELECT * FROM my_table, or from making a
> > > complete backup of the database.
> >
> > As Frits mentioned, neither of those actions will really notice if a
> > table has been truncated via loss of a segment.
>
> Is there a valid case for a missing segment? If not, couldn't this be
> caught somewhere in the storage manager?
>
> > However, I think the requested functionality already exists via
> > contrib/amcheck (see the heapallindexed option).
>
> It doesn't balk for me, am I doing something wrong?
>
> |mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
> |  count
> |----------
> | 20000000
> |(1 row)
> |
> |mbanck@mbanck-lin-1:~$ rm /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.1
> |mbanck@mbanck-lin-1:~$ ls /tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462* | grep -v 16462_
> |/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462
> |/tmp/pg_virtualenv.h8ObRG/data/18/regress/base/5/16462.2
> |mbanck@mbanck-lin-1:~$ psql -c "SELECT COUNT(*) FROM pgbench_accounts"
> |  count
> |---------
> | 7995392
> |(1 row)
> |
> |mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -t pgbench_accounts
> |pg_amcheck: including database "postgres"
> |pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
> |pg_amcheck: checking heap table "postgres.public.pgbench_accounts"
> |pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
> |mbanck@mbanck-lin-1:~$ echo $?
> |0
> |mbanck@mbanck-lin-1:~$ /usr/lib/postgresql/18/bin/pg_amcheck -v --heapallindexed -i pgbench_accounts_pkey
> |pg_amcheck: including database "postgres"
> |pg_amcheck: in database "postgres": using amcheck version "1.5" in schema "public"
> |pg_amcheck: checking btree index "postgres.public.pgbench_accounts_pkey"
> |mbanck@mbanck-lin-1:~$ echo $?
> |0
> |mbanck@mbanck-lin-1:~$

I tried to repeat it and has the same results with bt_index_check().
IIUC the reason amcheck doesn't show any corruption here is that
allheapindexed check just builds bloom filter for all index tuples and
then test every heap tuple against it. So we actually never check that
every index tuple points to the existing segment/page/heap_tuple here.


Best regards,
Arseniy Mukhin



Re: The ability of postgres to determine loss of files of the main fork

От
Jakub Wartak
Дата:
On Wed, Oct 1, 2025 at 9:02 AM Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> wow, this is one of the most terrifying threads I've ever seen...

Same.

> On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
> > Aleksander Alekseev <aleksander@tigerdata.com> writes:
> > >> Therefore, I would like to request an enhancement: add an option to
> > >> verify_heapam() that causes the primary key index to be scanned and makes
> > >> sure that all line pointers in the index point to existing tuples.
> >
> > > ... IMO there is little value in adding a check for the existence of
> > > the segments for a single table. And the *real* check will not differ
> > > much from something like SELECT * FROM my_table, or from making a
> > > complete backup of the database.
> >
> > As Frits mentioned, neither of those actions will really notice if a
> > table has been truncated via loss of a segment.
>
> Is there a valid case for a missing segment? If not, couldn't this be
> caught somewhere in the storage manager?
>

I've took a look on PG17 and in _mfd_openseg() there's if fd < 0
return NULL after open(), but out of it's callers only _mdfd_getseg()
seems to be alerting on that NULL. To me this seems like a bug,
because i've seen way too many times people and software deleting
files randomly. Even simple crashes (with e2fsck, xfs_repair) could
put orphaned inodes into /lost+found. IMHO all files should be opened
at least on startup to check integrity, because the non-zero return
code (during such SELECT) for openat(2) seems o be coming out of
RelationGetNumberOfBlocksInFork()->table_block_relation_size()->smgrnblocks()->mdnblocks()->_mdfd_openseg().
Now if the 1st seg file would be missing we would complain in
mdopenfork(). mdnblocks() says even "all active segments of the
relation are opened...", but even that apparently is not true.

The bigger context seems to be be that 049469e7e7cfe0c69 (2015) could
be culprit here as well, as it is stated there that mdnblocks() could
earlier create zero-length files back in day and it removed that
ereport(ERROR) when unable to access that file.

Another idea (than this being a bug) is that Thomas had a large
relation patchset back in [1], but I wouldn't be a fan of us operating
on 31-32TB files ;)

-J.

[1] -
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com



Re: The ability of postgres to determine loss of files of the main fork

От
Aleksander Alekseev
Дата:
Hi Jakub,

> IMHO all files should be opened at least on startup to check integrity,

That might be a lot of files to open. Even if you can open a file it
doesn't mean it's not empty or is not corrupted.

-- 
Best regards,
Aleksander Alekseev



Re: The ability of postgres to determine loss of files of the main fork

От
Frits Hoogland
Дата:
I am proposing the database to have the ability to detect when it has missing segments.
One of the main and basic properties of a database is to store data consistently and reliably.
If the database cannot detect that is missing things, this is a huge breach to safety and reliability.

And you're right that other things can be checked, this is just about this single ability.
Like Laurenz says, that other checks could possibly be done is not a reason not to look or consider this.

Laurenz makes the right assumption about my message, and sorry to suggest an emphasis on a primary key
for doing the check. A primary key has a high chance of existing, and must contain all table rows pretty much
guaranteed, and therefore is a good candidate for validating the table, but this is really just a practical
way to be able to check if a table has all its segments available without requiring changes to the catalog.

My aim actually is to point out this, what I think, is an omission, where we are missing out.
I don't know what would fit best in the current engine, my suggestion is to keep track of segments, or the last
page that contains data and thus should be available, so that metadata is available to allow the database
to be able to validate if the main fork and all segments are available.


Frits Hoogland




On 1 Oct 2025, at 08:26, Laurenz Albe <laurenz.albe@cybertec.at> wrote:

On Tue, 2025-09-30 at 18:55 +0300, Aleksander Alekseev wrote:
Therefore, I would like to request an enhancement: add an option to
verify_heapam() that causes the primary key index to be scanned and makes
sure that all line pointers in the index point to existing tuples.

I'm a bit puzzled by your emphasis on primary keys. In Postgres it is
legal to have tables without PKs, indexes, or even columns:

=# create table my_table();
=# select * from my_table;

To clarify, are you proposing not to check such tables?

If there is no index on a table, there is no way to determine a missing
segment.  I don't think that Frits deliberately wants to *not* check
such tables, it's just that there is no way to do it as things are now.

Just because a check is not possible in certain cases is no good reason
to forgo a check in those cases where it is possible.

Yours,
Laurenz Albe

Re: The ability of postgres to determine loss of files of the main fork

От
Frits Hoogland
Дата:
Thank you for your answer Tom,

As pointed out in another thread of this topic: using the heapallindexed option, it is 
not possible to detect that the table has missing segments and thus missing data.
What it will detect is if the index is missing data that is existing in the table, it validates
table->index.

Frits Hoogland




On 30 Sep 2025, at 18:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Aleksander Alekseev <aleksander@tigerdata.com> writes:
Therefore, I would like to request an enhancement: add an option to
verify_heapam() that causes the primary key index to be scanned and makes
sure that all line pointers in the index point to existing tuples.

... IMO there is little value in adding a check for the existence of
the segments for a single table. And the *real* check will not differ
much from something like SELECT * FROM my_table, or from making a
complete backup of the database.

As Frits mentioned, neither of those actions will really notice if a
table has been truncated via loss of a segment.

However, I think the requested functionality already exists via
contrib/amcheck (see the heapallindexed option).  The user does have
to make a decision which index to check with, but I think that'd be
required anyway --- as you say, there isn't necessarily a primary key.

regards, tom lane

Re: The ability of postgres to determine loss of files of the main fork

От
Jakub Wartak
Дата:
On Wed, Oct 1, 2025 at 1:46 PM Aleksander Alekseev
<aleksander@tigerdata.com> wrote:
>
> Hi Jakub,
>
> > IMHO all files should be opened at least on startup to check integrity,
>
> That might be a lot of files to open.

I was afraid of that, but let's say modern high-end is 200TB big DB,
that's like 200*1024 1GB files, but I'm getting such time(1) timings
for 204k files on ext4:

$ time ./createfiles                      # real    0m2.157s, it's
open(O_CREAT)+close()
$ time ls -l many_files_dir/ > /dev/null # real    0m0.734s
$ time ./openfiles                          # real    0m0.297s , for
already existing ones (hot)
$ time ./openfiles                          # real    0m1.456s , for
already existing ones (cold, echo 3 > drop_caches sysctl)

Not bad in my book as a one time activity. It could pose a problem
potentially with some high latency open() calls, maybe NFS or
something remote I guess.

> Even if you can open a file it doesn't mean it's not empty

Correct, I haven't investigated that rabbithole...

> or is not corrupted.

I think checksums guard users well in this case as they would get
notified that stuff is wonky (much better than wrong result/silent
data loss)

-J.



Re: The ability of postgres to determine loss of files of the main fork

От
Frits Hoogland
Дата:
Thank you for looking into this Jakub, Aleksander, Michael, Tom,

There are more scenario's where files might be missing: how about a backup somehow missing files?
But also deliberate sabotage, you can hide data in this way, and it would be hard to detect, and even
harder to understand what happened, why it happened and how it happened. I dare to say quite ideal
if you want a name to be removed from a database a covert way, and know something about the
database and have access, or somehow can manipulate in the data directory.

Without having the exact sourcecode files and functions handy, the way I think using the segments
works in postgres, is that the database, because of the absence of any indicator of size, will scan 
the relfilenode, and when it reaches the set limit (1GB), it will try to open the next segment. If that 
segment doesn't exist, the database assumes it's the end of the segment. That is all. That is why there
is no error: it's exactly alike when there truly is no more data, and there is no metadata to understand
there is supposed to be more data.

And therefore I talked about the primary key: that contains "indirect" data to be able to detect if a
table has a missing segment. But I think really what is needed is a max page number to indicate
the known size.



Frits Hoogland




On 1 Oct 2025, at 13:20, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:

On Wed, Oct 1, 2025 at 9:02 AM Michael Banck <mbanck@gmx.net> wrote:

Hi,

wow, this is one of the most terrifying threads I've ever seen...

Same.

On Tue, Sep 30, 2025 at 12:41:29PM -0400, Tom Lane wrote:
Aleksander Alekseev <aleksander@tigerdata.com> writes:
Therefore, I would like to request an enhancement: add an option to
verify_heapam() that causes the primary key index to be scanned and makes
sure that all line pointers in the index point to existing tuples.

... IMO there is little value in adding a check for the existence of
the segments for a single table. And the *real* check will not differ
much from something like SELECT * FROM my_table, or from making a
complete backup of the database.

As Frits mentioned, neither of those actions will really notice if a
table has been truncated via loss of a segment.

Is there a valid case for a missing segment? If not, couldn't this be
caught somewhere in the storage manager?


I've took a look on PG17 and in _mfd_openseg() there's if fd < 0
return NULL after open(), but out of it's callers only _mdfd_getseg()
seems to be alerting on that NULL. To me this seems like a bug,
because i've seen way too many times people and software deleting
files randomly. Even simple crashes (with e2fsck, xfs_repair) could
put orphaned inodes into /lost+found. IMHO all files should be opened
at least on startup to check integrity, because the non-zero return
code (during such SELECT) for openat(2) seems o be coming out of
RelationGetNumberOfBlocksInFork()->table_block_relation_size()->smgrnblocks()->mdnblocks()->_mdfd_openseg().
Now if the 1st seg file would be missing we would complain in
mdopenfork(). mdnblocks() says even "all active segments of the
relation are opened...", but even that apparently is not true.

The bigger context seems to be be that 049469e7e7cfe0c69 (2015) could
be culprit here as well, as it is stated there that mdnblocks() could
earlier create zero-length files back in day and it removed that
ereport(ERROR) when unable to access that file.

Another idea (than this being a bug) is that Thomas had a large
relation patchset back in [1], but I wouldn't be a fan of us operating
on 31-32TB files ;)

-J.

[1] - https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0%3Dm6dDiA%40mail.gmail.com

Re: The ability of postgres to determine loss of files of the main fork

От
Michael Banck
Дата:
Hi,

On Wed, Oct 01, 2025 at 02:05:53PM +0200, Jakub Wartak wrote:
> On Wed, Oct 1, 2025 at 1:46 PM Aleksander Alekseev
> <aleksander@tigerdata.com> wrote:
> > > IMHO all files should be opened at least on startup to check
> > > integrity,

I would say s/startup/crash recovery/, if any.

> > That might be a lot of files to open.
> 
> I was afraid of that, but let's say modern high-end is 200TB big DB,
> that's like 200*1024 1GB files, but I'm getting such time(1) timings
> for 204k files on ext4:
> 
> $ time ./createfiles                      # real    0m2.157s, it's
> open(O_CREAT)+close()
> $ time ls -l many_files_dir/ > /dev/null # real    0m0.734s
> $ time ./openfiles                          # real    0m0.297s , for
> already existing ones (hot)
> $ time ./openfiles                          # real    0m1.456s , for
> already existing ones (cold, echo 3 > drop_caches sysctl)
>
> Not bad in my book as a one time activity. It could pose a problem
> potentially with some high latency open() calls, maybe NFS or
> something remote I guess.

Yeah, did you try on SAN as well? I am doubtful that will be performant.


Michael



Re: The ability of postgres to determine loss of files of the main fork

От
Laurenz Albe
Дата:
On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
> I am proposing the database to have the ability to detect when it has missing segments.

Just a random idea: one solution would be if each segment has a flag that indicates
if that is the last segment or not.  But that would break the on-disk storage format,
unless there is room left for an extra flag somewhere in the current layout.

Yours,
Laurenz Albe



Re: The ability of postgres to determine loss of files of the main fork

От
Andres Freund
Дата:
Hi,

On 2025-10-01 15:39:04 +0200, Laurenz Albe wrote:
> On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
> > I am proposing the database to have the ability to detect when it has missing segments.
> 
> Just a random idea: one solution would be if each segment has a flag that indicates
> if that is the last segment or not.  But that would break the on-disk storage format,
> unless there is room left for an extra flag somewhere in the current layout.

It'd also make extensions / truncations more complicated. I rather doubt we're
going there. Right now relation extension aren't WAL logged. While there might
be reasons to change that, I don't think this is enough justification for
doing so.

Greetings,

Andres



Re: The ability of postgres to determine loss of files of the main fork

От
Frits Hoogland
Дата:
> On 1 Oct 2025, at 15:49, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-10-01 15:39:04 +0200, Laurenz Albe wrote:
>> On Wed, 2025-10-01 at 13:58 +0200, Frits Hoogland wrote:
>>> I am proposing the database to have the ability to detect when it has missing segments.
>>
>> Just a random idea: one solution would be if each segment has a flag that indicates
>> if that is the last segment or not.  But that would break the on-disk storage format,
>> unless there is room left for an extra flag somewhere in the current layout.
>
> It'd also make extensions / truncations more complicated. I rather doubt we're
> going there. Right now relation extension aren't WAL logged. While there might
> be reasons to change that, I don't think this is enough justification for
> doing so.
>

What would be a achievable way of making postgres under the relation size?
How about a field in pg_class that keeps the final data page, so that the catalog
keeps the size, which then allows utilities and the database itself to understand how
many segments should exist?

> Greetings,
>
> Andres