Discussion: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
Hackers,
I am asking for feedback on this. We just migrated a client to PG and all is well, except that the developers have requested a copy of recent production data for testing.
We have a handful of tables that have 90% of the data going back 30 years.
We NEVER restore this data to Dev or Staging. On Oracle, we used a special RMAN backup where these tables had a "WHERE clause" applied to them during the backup/dump process.
It would be awesome if pg_dump offered something similar. I am willing to code this up, but there are questions. And I want to do it in a way that makes the most sense to the PG Community.
My initial thought is a simple filtering file, of the format:
schema.table = WHERE ...
schema.table = WHERE ...
"schema"."Table" = WHERE ...
"schema"."t1" = LIMIT 50000
If --filter-data <filename> is not specified, nothing changes. If it is specified, then for each table pg_dump would check whether the file contains a filter for that table and, if so, read that filter and apply it to the resulting COPY command.
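To make this concrete, suppose a purely hypothetical filter file dev_filter.txt (table and column names are made up) contains:

public.orders = WHERE order_date > now() - interval '28 days'
"public"."audit_log" = LIMIT 50000

Today pg_dump effectively runs, per table:

COPY public.orders TO stdout;

With --filter-data=dev_filter.txt it would instead run:

COPY (SELECT * FROM public.orders WHERE order_date > now() - interval '28 days') TO stdout;
COPY (SELECT * FROM "public"."audit_log" LIMIT 50000) TO stdout;

The output/archive format itself would not change; only which rows are included does.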
I don't believe this impacts pg_restore. But if we wanted the restore process to include some kind of message that "pg_dump --filter-data was used; this is NOT a complete dump!", I would appreciate hearing that, and I would include it in pg_restore.
Just to make the point: the full pg_dump takes 60 minutes (during which we must turn off certain features to avoid errors/locking issues). Excluding these tables takes 2-3 minutes (side-dumping limited versions of them with \COPY takes another 3 minutes). And frankly, we have enough backups of the historical data; we don't need daily snapshots of it, or to carry it around.
Thanks in advance. I am hoping I am not the only one who would benefit from a filtered dump (as opposed to all or nothing).
Finally, I considered using an entire query, which could allow data-masking and more complex queries if there are FKs involved. But that seemed like a much bigger ask (and a potential foot-gun).
PS: A quick-hack feature we could leverage would be a flag so that --exclude-table-data generates an EMPTY .dat file (and the internal toc.dat reference to load that file)... THEN we could simply overwrite that file with the output of our manual \COPY. This would be an almost trivial change and would work for what we do/need, BUT it feels a bit "off".
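For reference, the manual process we use today looks roughly like this (names are made up, and the real thing covers a handful of tables):

pg_dump -Fd -f prod_dump --exclude-table-data='public.audit_log' proddb

psql proddb -c "\copy (SELECT * FROM public.audit_log WHERE logged_at > now() - interval '28 days') TO 'audit_log.csv' WITH (FORMAT csv)"

# ...pg_restore prod_dump into the dev database, then:
psql devdb -c "\copy public.audit_log FROM 'audit_log.csv' WITH (FORMAT csv)"

The quick hack above would essentially let us splice that slice back into the dump itself instead of loading it in a separate step.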
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Tom Lane

Kirk Wolak <wolakk@gmail.com> writes:
> We have a handful of tables that have 90% of the data going back 30 years.
> We NEVER restore this data to Dev or Staging. We used a special RMAN
> backup where these tables had a "WHERE clause" applied to them during the
> backup/dump process.

Have you considered partitioning these tables by date and then not
dumping the older partitions? That would fit into existing
functionality a lot better.

regards, tom lane
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Greg Sabino Mullane
I've seen this idea pop up over the years, and it's still a good one. Rather than invent new flags, I think a better approach would be to convince pg_dump to dump a view, such that table foo has a view fooslice to limit / filter the output. Then we can simply do:
pg_dump -t foo=view:fooslice
and under the hood pg_dump would do
COPY (SELECT * FROM fooslice) TO ...
rather than
COPY foo TO ...
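Here fooslice would just be an ordinary view created ahead of time on the source database, e.g. something like (hypothetical definition):

CREATE VIEW fooslice AS
    SELECT * FROM foo
    WHERE created_at > now() - interval '30 days';

Presumably the archive entry would still be for table foo, so the restore side would need no changes.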
While we could make the view mapping into a separate filtering file as you suggest, that's more complexity and also a little more dangerous in an action-at-a-distance way, so I'd rather have people be very specific in the mapping on the command line.
I could swear I made a POC years ago for something like this, but it's long gone. If I get some time, I'll give it a fresh attempt.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
On Fri, Aug 15, 2025 at 12:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kirk Wolak <wolakk@gmail.com> writes:
>> We have a handful of tables that have 90% of the data going back 30 years.
>> We NEVER restore this data to Dev or Staging. We used a special RMAN
>> backup where these tables had a "WHERE clause" applied to them during the
>> backup/dump process.
>
> Have you considered partitioning these tables by date and then not
> dumping the older partitions? That would fit into existing
> functionality a lot better.
>
> regards, tom lane
I had not considered it until now. But since we are talking about keeping only 7-28 days out of decades of data, I would assume that maintaining that partition window in production would take a fair bit of work, making it a tough sell to the client.
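For context, if I understand the suggestion, production would end up looking roughly like this (hypothetical names), plus ongoing maintenance to create and detach partitions over time:

CREATE TABLE audit_log (
    id          bigint       NOT NULL,
    logged_at   timestamptz  NOT NULL,
    payload     text
) PARTITION BY RANGE (logged_at);

CREATE TABLE audit_log_2025_08 PARTITION OF audit_log
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');
-- ...one partition per month, for decades of history...

and the dev/staging dump could then skip the historical partitions' data:

pg_dump -Fd -f dev_dump --exclude-table-data='audit_log_2024_*' proddb

(repeated, or with a broader pattern, for each historical year). It clearly fits existing pg_dump functionality better; the cost is moving the maintenance work into production, plus migrating the existing decades of rows into that layout.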
On Fri, Aug 15, 2025 at 1:25 PM Greg Sabino Mullane <htamfids@gmail.com> wrote:
> I've seen this idea pop up over the years, and it's still a good one. Rather than invent new flags, I
> think a better approach would be to convince pg_dump to dump a view, such that table foo has a view
> fooslice to limit / filter the output. Then we can simply do:
>
>     pg_dump -t foo=view:fooslice
>
> [...]
>
> While we could make the view mapping into a separate filtering file as you suggest, that's more
> complexity and also a little more dangerous in an action-at-a-distance way, so I'd rather have people
> be very specific in the mapping on the command line.
Hmmm, first, since we are talking about a full pg_dump with some data filtering, I don't think we can avoid adding some kind of switch. And while being able to do it on the command line is great for one or two tables... it gets unwieldy pretty quickly. Also, maintaining the slices inside a production environment gets awkward when different dumps have different parameters (our monthly dump goes back 35 days), so for us, having different files with different settings makes the most sense.
But this is why I am asking for input/feedback. The feedback is producing ideas...
Thanks!
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: jian he

hi.
I just found this
https://www.postgresql.org/message-id/flat/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br%2BYrvM41ZuSPjM2Qvg%40mail.gmail.com
maybe it's not that very helpful.

IMV,
pg_dump --option="view_name"
is better than
pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2
Re: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
From: Nikolay Samokhvalov
On Sat, Aug 23, 2025 at 8:05 AM jian he <jian.universality@gmail.com> wrote:
> hi.
> I just found this
> https://www.postgresql.org/message-id/flat/CALAY4q8o00Sg5nemi2AuqNuLvmGLaR6br%2BYrvM41ZuSPjM2Qvg%40mail.gmail.com
> maybe it's not that very helpful.
>
> IMV,
> pg_dump --option="view_name"
> is better than
> pg_dump -d cary --where="test1:a3 = ( select max(aa1) from test2 )" > testdump2
In some cases, we cannot or don't want to create a view. For example, view creation might be banned in a project :) to avoid dependency-management headaches. Or we might simply lack the permissions for DDL.
Back to the idea of having the ability to dump with an arbitrary filter – I think it's a great idea. COPY supports an arbitrary SELECT as input and STDOUT as output, and this is what we usually use via psql, but in doing so we lose a lot of the important mechanics pg_dump has.
At the same time, the approach with views brings much more than just filtering out some rows -- we can do arbitrary transformations, including projection, aggregation, and joins.
Ideally, it would be awesome to have a concept of a "virtual view" implemented at the pg_dump level to support any kind of transformation, while avoiding the need for DDL permissions or schema changes. This could give huge freedom and enable lots of workflows (e.g., for testing, replacing actual sensitive values with random data on the fly would be extremely helpful to have!)
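As a trivial illustration, this is the kind of thing we run through psql today (hypothetical table and columns), losing all of pg_dump's machinery in the process:

\copy (SELECT id, md5(email) AS email, created_at FROM users WHERE created_at > now() - interval '30 days') TO 'users_masked.csv' WITH (FORMAT csv)

Letting pg_dump accept the same kind of per-table SELECT mapping would give filtering and masking in one pass, with no server-side DDL required.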
Nik