Thread: is there a way to automate deduplication of strings?


is there a way to automate deduplication of strings?

From: Chris Papademetrious
Date:

Hello everyone! First time poster here.

 

I have a question about deduplicating text strings stored in a database. I am aware of the pattern of creating a separate table for unique values, then referencing those values by key. But this requires some transactional complexity for storage and retrieval, along with cleanup of no-longer-referenced values over time. And, this complexity grows with the number of primary-table columns that use this indirection.

 

I would only use this for (1) seldom-referenced columns that (2) have a high rate of duplication and (3) have an average string length that makes deduplication worthwhile.

 

Are there any native or extension-based methods to simplify this in Postgres? I searched and came up empty, but maybe I’m not searching with the right terms.

 

Thanks!

 

  • Chris

 

 

Re: is there a way to automate deduplication of strings?

From: Greg Sabino Mullane
Date:
It is not quite clear what you are trying to do. Can you provide a small test table to show what you want to achieve?

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support

RE: is there a way to automate deduplication of strings?

From: Chris Papademetrious
Date:

Hi Greg,

 

Thanks for the reply! I tried to be vague to avoid getting distracted by the details, but I think I overdid it!

 

Let’s say I have a table of transactions like this:

 

CREATE TABLE transaction (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent TEXT NOT NULL,
    --
    -- ...more columns here...
    --
);

 

The table can contain millions of transactions. The user_agent field stores information about the application that performed each transaction. These user-agent values will be populated from a relatively small set of unique values. For example,

 

MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID 11111111-1111-1111-1111-111111111111)

 

The values themselves will vary over time (as new versions appear and old versions age out) so the set cannot be hardcoded, but the column will always contain large numbers of duplicate values.

 

I could store the user-agent values in a separate table and reference them by a generated UUID key:

 

CREATE TABLE user_agent (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent TEXT NOT NULL UNIQUE
);

CREATE TABLE transaction (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent_id UUID NOT NULL,
    CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id)
        REFERENCES user_agent(id) ON DELETE CASCADE
);

 

but this adds transactional complexity for storage and retrieval, along with cleanup of no-longer-referenced values over time.
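For example, each stored transaction would need a get-or-create step along these lines (just a sketch using INSERT ... ON CONFLICT; the user-agent string is abbreviated):

```sql
-- Step 1: get-or-create the deduplicated user-agent row.
-- The no-op DO UPDATE makes RETURNING produce the id even when the row
-- already exists (DO NOTHING would return no row on conflict).
INSERT INTO user_agent (user_agent)
VALUES ('MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID ...)')
ON CONFLICT (user_agent) DO UPDATE SET user_agent = EXCLUDED.user_agent
RETURNING id;

-- Step 2: store the transaction using the id returned above.
-- INSERT INTO transaction (user_agent_id /*, ...other columns */)
-- VALUES (<returned id> /*, ... */);
```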

 

I’m wishing for a magic “sparsely stored texts” column in Postgres that performs this deduplication automatically, but I don’t think it exists. So I’m wondering, is there an extension or some other trick to get the space savings without the transactional complexity?

 

  • Chris

 

 

From: Greg Sabino Mullane <htamfids@gmail.com>
Sent: Wednesday, December 31, 2025 10:12 AM
To: Chris Papademetrious <chrispy@synopsys.com>
Cc: pgsql-novice@lists.postgresql.org
Subject: Re: is there a way to automate deduplication of strings?

 


Re: is there a way to automate deduplication of strings?

From: Greg Sabino Mullane
Date:
Ah, okay, so what you want is to normalize the data. Turns out, that's something database purists strive for, so you are in good company. Yes, there is a little more overhead, but that's kind of the cost of using a relational model. You can smooth things out a good bit with views, triggers, and functions. For example, you could create a view that joins the two tables together under the hood; an application or user would simply select from it as if it were the original table:

create view txn as
    select t.id /* plus other fields from "t" here */, u.user_agent
    from transaction t
    join user_agent u on (u.id = t.user_agent_id);

You can also make the view ("txn" in this case) updatable by adding INSTEAD OF triggers on it, which call functions that do the necessary maintenance work behind the scenes. Your application then simply inserts/updates/deletes on the view as if it were the original table.
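A minimal sketch of what that could look like for inserts (the function and trigger names here are made up; the table and view names are from the examples above):

```sql
create function txn_insert() returns trigger language plpgsql as $$
declare
    ua_id uuid;
begin
    -- get-or-create the deduplicated user-agent row; the no-op do-update
    -- makes "returning" hand back the id even when the row already exists
    insert into user_agent (user_agent)
        values (new.user_agent)
        on conflict (user_agent) do update set user_agent = excluded.user_agent
        returning id into ua_id;

    -- store the real row; other columns would be passed through the same way
    insert into transaction (user_agent_id) values (ua_id);
    return new;
end;
$$;

create trigger txn_insert
    instead of insert on txn
    for each row execute function txn_insert();
```

UPDATE and DELETE would get their own INSTEAD OF triggers in the same style.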

https://www.postgresql.org/docs/current/sql-createtrigger.html

We can help with those on this list, of course, but there are some good examples on that page to get you started.

Minor notes:

* Since the user_agent contains a small and finite number of variants, I would not bother with a uuid and just use an int:

create table user_agent (
    id int primary key generated always as identity,
    user_agent text not null unique
);

* You ought to look at the new uuidv7() function, as it improves a lot on regular uuids. See for example:

https://www.thenile.dev/blog/uuidv7

* Rather than ON DELETE CASCADE, you probably want ON DELETE RESTRICT for this use case

* You mentioned "cleanup of no-longer-referenced values over time"

I would not sweat this, at least for this particular example. If those user agents were used once, they may get used again. For different types of data and larger tables, you could write a small sql function that cleans up stray entries and make sure to call that function after a bulk delete, or just cron it to run weekly.
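Such a cleanup function might look like this (a sketch; the name is illustrative, and it assumes the user_agent/transaction schema from earlier in the thread):

```sql
create function prune_user_agents() returns bigint language sql as $$
    -- delete user_agent rows that no transaction references anymore,
    -- and report how many were removed
    with deleted as (
        delete from user_agent ua
        where not exists (
            select 1 from transaction t where t.user_agent_id = ua.id
        )
        returning ua.id
    )
    select count(*) from deleted;
$$;

-- call after a bulk delete, or from a weekly cron job:
-- select prune_user_agents();
```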

Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support