Discussion: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
Hackers,
I am asking for feedback on this. We just migrated a client to PG and all is well, except that the developers have requested a copy of recent production data for testing.
We have a handful of tables that have 90% of the data going back 30 years.
We NEVER restore this data to Dev or Staging. On Oracle, we used a special RMAN backup where these tables had a "WHERE clause" applied to them during the backup/dump process.
It would be awesome if pg_dump offered something similar. I am willing to code this up, but there are questions. And I want to do it in a way that makes the most sense to the PG Community.
My initial thought is a simple filtering file, of the format:
schema.table = WHERE ...
schema.table = WHERE ...
"schema"."Table" = WHERE ...
"schema"."t1" = LIMIT 50000
If --filter-data <filename> is not specified, nothing changes. If it is specified, then for each table pg_dump would check whether the file contains a filter for that table and, if so, read that filter and apply it to the resulting COPY command.
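To make this concrete, suppose a purely hypothetical filter file dev_filter.txt (table and column names are made up) contains:

public.orders = WHERE order_date > now() - interval '28 days'
"public"."audit_log" = LIMIT 50000

Today pg_dump effectively runs, per table:

COPY public.orders TO stdout;

With --filter-data=dev_filter.txt it would instead run:

COPY (SELECT * FROM public.orders WHERE order_date > now() - interval '28 days') TO stdout;
COPY (SELECT * FROM "public"."audit_log" LIMIT 50000) TO stdout;

The output/archive format itself would not change; only which rows are included does.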
I don't believe this impacts pg_restore. But if we wanted the restore process to include some kind of message that "pg_dump --filter-data was used; this is NOT a complete dump!", I would appreciate hearing that, and I would include it in pg_restore.
Just to make the point: the full pg_dump takes 60 minutes (during which we must turn off certain features to avoid errors/locking issues). Excluding these tables takes 2-3 minutes (side-dumping limited versions of them with \COPY takes another 3 minutes). And frankly, we have enough backups of the historical data; we don't need daily snapshots of it, or to carry it around.
Thanks in advance. I am hoping I am not the only one who would benefit from a filtered dump (as opposed to all or nothing).
Finally, I considered using an entire query, which could allow data-masking and more complex queries if there are FKs involved. But that seemed like a much bigger ask (and a potential foot-gun).
PS: A quick-hack feature we could leverage would be a flag so that --exclude-table-data generates an EMPTY .dat file (and the internal toc.dat reference to load that file)... THEN we could simply overwrite that file with the output of our manual \COPY. This would be an almost trivial change and would work for what we do/need, BUT it feels a bit "off".
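For reference, the manual process we use today looks roughly like this (names are made up, and the real thing covers a handful of tables):

pg_dump -Fd -f prod_dump --exclude-table-data='public.audit_log' proddb

psql proddb -c "\copy (SELECT * FROM public.audit_log WHERE logged_at > now() - interval '28 days') TO 'audit_log.csv' WITH (FORMAT csv)"

# ...pg_restore prod_dump into the dev database, then:
psql devdb -c "\copy public.audit_log FROM 'audit_log.csv' WITH (FORMAT csv)"

The quick hack above would essentially let us splice that slice back into the dump itself instead of loading it in a separate step.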
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Tom Lane

Kirk Wolak <wolakk@gmail.com> writes:
> We have a handful of tables that have 90% of the data going back 30 years.
> We NEVER restore this data to Dev or Staging. We used a special RMAN
> backup where these tables had a "WHERE clause" applied to them during the
> backup/dump process.

Have you considered partitioning these tables by date and then not
dumping the older partitions? That would fit into existing
functionality a lot better.

regards, tom lane
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Greg Sabino Mullane
I've seen this idea pop up over the years, and it's still a good one. Rather than invent new flags, I think a better approach would be to convince pg_dump to dump a view, such that table foo has a view fooslice to limit / filter the output. Then we can simply do:
pg_dump -t foo=view:fooslice
and under the hood pg_dump would do
COPY (SELECT * FROM fooslice) TO ...
rather than
COPY foo TO ...
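Here fooslice would just be an ordinary view created ahead of time on the source database, e.g. something like (hypothetical definition):

CREATE VIEW fooslice AS
    SELECT * FROM foo
    WHERE created_at > now() - interval '30 days';

Presumably the archive entry would still be for table foo, so the restore side would need no changes.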
While we could make the view mapping into a separate filtering file as you suggest, that's more complexity and also a little more dangerous in an action-at-a-distance way, so I'd rather have people be very specific in the mapping on the command line.
I could swear I made a POC years ago for something like this, but it's long gone. If I get some time, I'll give it a fresh attempt.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
On Fri, Aug 15, 2025 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kirk Wolak <wolakk@gmail.com> writes:
>> We have a handful of tables that have 90% of the data going back 30 years.
>> We NEVER restore this data to Dev or Staging. We used a special RMAN
>> backup where these tables had a "WHERE clause" applied to them during the
>> backup/dump process.
>
> Have you considered partitioning these tables by date and then not
> dumping the older partitions? That would fit into existing
> functionality a lot better.
>
> regards, tom lane
I had not considered it until now. But since we are talking about keeping only 7-28 days out of decades of data, I would assume that maintaining that partition window in production would take a fair bit of work, making it a tough sell to the client.
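For context, if I understand the suggestion, production would end up looking roughly like this (hypothetical names), plus ongoing maintenance to create and detach partitions over time:

CREATE TABLE audit_log (
    id          bigint       NOT NULL,
    logged_at   timestamptz  NOT NULL,
    payload     text
) PARTITION BY RANGE (logged_at);

CREATE TABLE audit_log_2025_08 PARTITION OF audit_log
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');
-- ...one partition per month, for decades of history...

and the dev/staging dump could then skip the historical partitions' data:

pg_dump -Fd -f dev_dump --exclude-table-data='audit_log_2024_*' proddb

(repeated, or with a broader pattern, for each historical year). It clearly fits existing pg_dump functionality better; the cost is moving the maintenance work into production, plus migrating the existing decades of rows into that layout.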
On Fri, Aug 15, 2025 at 1:25 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> I've seen this idea pop up over the years, and it's still a good one. Rather than invent new flags, I
> think a better approach would be to convince pg_dump to dump a view, such that table foo has a view
> fooslice to limit / filter the output. Then we can simply do:
>
>     pg_dump -t foo=view:fooslice
>
> [...]
>
> While we could make the view mapping into a separate filtering file as you suggest, that's more
> complexity and also a little more dangerous in an action-at-a-distance way, so I'd rather have people
> be very specific in the mapping on the command line.
Hmmm, first, since we are talking about a full pg_dump with some data filtering, I don't think we can avoid adding some kind of switch. And while being able to do it on the command line is great for one or two tables... it gets unwieldy pretty quickly. Also, maintaining the slices inside a production environment gets awkward when different dumps have different parameters (our monthly dump goes back 35 days), so for us, having different files with different settings makes the most sense.
But this is why I am asking for input/feedback. The feedback is producing ideas...
Thanks!
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: jian he

hi.
I just found this
https://www.postgresql.org/message-id/flat/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br%2BYrvM41ZuSPjM2Qvg%40mail.gmail.com
maybe it's not that very helpful.

IMV,
pg_dump --option="view_name"
is better than
pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Nikolay Samokhvalov
On Sat, Aug 23, 2025 at 8:05 AM jian he <jian.universality@gmail.com> wrote:
> hi.
> I just found this
> https://www.postgresql.org/message-id/flat/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br%2BYrvM41ZuSPjM2Qvg%40mail.gmail.com
> maybe it's not that very helpful.
>
> IMV,
> pg_dump --option="view_name"
> is better than
> pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2
In some cases, we cannot or don't want to create a view. For example, view creation might be banned in a project :) to avoid dependency-management headaches. Or we might simply lack the permissions for DDL.
Back to the idea of having the ability to dump with an arbitrary filter – I think it's a great idea. COPY supports an arbitrary SELECT as input and STDOUT as output, and this is what we usually use via psql, but in doing so we lose a lot of the important mechanics pg_dump has.
At the same time, the approach with views brings much more than just filtering out some rows -- we can do arbitrary transformations, including projection, aggregation, and joins.
Ideally, it would be awesome to have a concept of a "virtual view" implemented at the pg_dump level to support any kind of transformation, while avoiding the need for DDL permissions or schema changes. This could give huge freedom and enable lots of workflows (e.g., for testing, replacing actual sensitive values with random data on the fly would be extremely helpful to have!)
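As a trivial illustration, this is the kind of thing we run through psql today (hypothetical table and columns), losing all of pg_dump's machinery in the process:

\copy (SELECT id, md5(email) AS email, created_at FROM users WHERE created_at > now() - interval '30 days') TO 'users_masked.csv' WITH (FORMAT csv)

Letting pg_dump accept the same kind of per-table SELECT mapping would give filtering and masking in one pass, with no server-side DDL required.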
Nik