Discussion: PostgreSQL on S3-backed Block Storage with Near-Local Performance


PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
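
For illustration, the pieces fit together roughly like this once a ZeroFS NBD server is listening; the host, port, device, and pool names below are placeholders, not ZeroFS defaults:

```
# Attach the NBD export (older host/port syntax; newer nbd-client versions use -N <export>)
nbd-client 127.0.0.1 10809 /dev/nbd0

# ZFS pool and dataset on the NBD device
zpool create tank /dev/nbd0
zfs create -o mountpoint=/tank/pgdata tank/pgdata
chown postgres:postgres /tank/pgdata

# PostgreSQL runs unmodified on top
sudo -u postgres initdb -D /tank/pgdata
sudo -u postgres pg_ctl -D /tank/pgdata -l /tmp/pg.log start
```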

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 100000
number of transactions actually processed: 5000000/5000000
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency (see the cache-device sketch after this list):
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
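
As a concrete example of the ZFS cache layer in point 2, attaching a fast local partition as L2ARC is a one-liner; the pool and device names here are placeholders:

```
# Add a local NVMe partition as L2ARC for the S3-backed pool
zpool add tank cache /dev/nvme0n1p1

# Check that the cache vdev is present and watch hit rates
zpool status tank
arc_summary | head -n 40   # if the OpenZFS userland tools are installed
```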

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures (a sketch of the replication settings for Architecture 1 follows the diagrams):

Architecture 1


                         PostgreSQL Client
                                   |
                                   | SQL queries
                                   |
                            +--------------+
                            |  PG Proxy    |
                            | (HAProxy/    |
                            |  PgBouncer)  |
                            +--------------+
                               /        \
                              /          \
                   Synchronous            Synchronous
                   Replication            Replication
                            /              \
                           /                \
              +---------------+        +---------------+
              | PostgreSQL 1  |        | PostgreSQL 2  |
              | (Primary)     |◄------►| (Standby)     |
              +---------------+        +---------------+
                      |                        |
                      |  POSIX filesystem ops  |
                      |                        |
              +---------------+        +---------------+
              |   ZFS Pool 1  |        |   ZFS Pool 2  |
              | (3-way mirror)|        | (3-way mirror)|
              +---------------+        +---------------+
               /      |      \          /      |      \
              /       |       \        /       |       \
        NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
             |        |        |           |        |        |
        +--------++--------++--------++--------++--------++--------+
        |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
        +--------++--------++--------++--------++--------++--------+
             |         |         |         |         |         |
             |         |         |         |         |         |
        S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
        (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                    /
                 \                  /
                  Same ZFS Pool (NBD)
                         |
                  6 Global ZeroFS
                         |
                      S3 Regions
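
For Architecture 1, the synchronous replication between the two instances is plain PostgreSQL streaming replication; the relevant settings are roughly as follows (host and application names are placeholders):

```
# postgresql.conf on PostgreSQL 1 (primary)
synchronous_standby_names = 'pg2'    # commit waits for the standby named pg2
synchronous_commit = on

# postgresql.conf on PostgreSQL 2 (standby)
primary_conninfo = 'host=pg1.internal port=5432 user=replicator application_name=pg2'
```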


The main advantages I see are:
1. Dramatic cost reduction for large datasets
2. Simplified geo-distribution
3. Effectively unlimited storage capacity
4. Built-in encryption and compression

Looking forward to your feedback and questions!

Best,
Pierre

P.S. The full project includes a custom NFS filesystem too.



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Laurenz Albe
Date:
On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
> Looking forward to your feedback and questions!

I think the biggest hurdle you will have to overcome is to
convince notoriously paranoid DBAs that this tall stack
provides reliable service, honors fsync() etc.

Performance is great, but it is not everything.  If things
perform surprisingly well, people become suspicious.

> P.S. The full project includes a custom NFS filesystem too.

"NFS" is a key word that does not inspire confidence in
PostgreSQL circles...

Yours,
Laurenz Albe



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
Hi Laurenz,

> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.

Indeed, but that doesn't have to be "sudden." I think we need to gain confidence in the whole system gradually, starting with throwaway workloads (e.g., persistent volumes in CI), then moving to data we can afford to lose, then backups, and finally to production data.

>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...

I've had my fair share of major annoyances with NFS too!

I think bad experiences with NFS basically come down to bad hardware, bad NFS server implementations, and the kernel treating it mostly like a "local" filesystem (in terms of failure behavior).

So when it doesn't work well, everything goes down.

But the protocols themselves are not inherently bad; they are actually quite elegant. NFSv3 is just what you need to reach (very close to) POSIX compliance. The NFS server implementation in ZeroFS passes all 8,662 tests in https://github.com/Barre/pjdfstest_nfs.

https://github.com/Barre/ZeroFS/actions/runs/16367571315/job/46248240251#step:11:9376
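
If you want to reproduce that run locally, it is just pjdfstest driven by prove against a mounted ZeroFS NFS export; roughly like this, where the mount address, port, and paths are assumptions (see the ZeroFS README for the real ones):

```
# Mount the ZeroFS NFS export over NFSv3 (address and export path assumed)
mount -t nfs -o vers=3,tcp,nolock 127.0.0.1:/ /mnt/zerofs

# Run the pjdfstest suite from inside the mounted filesystem
cd /mnt/zerofs
prove -rv /path/to/pjdfstest/tests
```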

For database workloads specifically, users will probably prefer running something like ZFS on top of the NBD server rather than using NFS directly.

Best,
Pierre

On Fri, Jul 18, 2025, at 06:40, Laurenz Albe wrote:
> On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
>> Looking forward to your feedback and questions!
>
> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.
>
> Performance is great, but it is not everything.  If things
> perform surprisingly well, people become suspicious.
>
>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...
>
> Yours,
> Laurenz Albe



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Seref Arikan
Date:
Sorry, this was meant to go to the whole group:

Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A ZFS striped pool across the 3 devices
- A 200 GB ZFS L2ARC
- Postgres tuned for the available memory, with synchronous_commit = off, wal_init_zero = off, and wal_recycle = off (see the sketch below)
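
In other words, roughly the following; the device names are placeholders, and the GUCs are the ones listed above:

```
# Striped pool across the three ZeroFS NBD devices, plus a 200 GB L2ARC partition
zpool create pgpool /dev/nbd0 /dev/nbd1 /dev/nbd2
zpool add pgpool cache /dev/nvme0n1p2

# postgresql.conf excerpt
synchronous_commit = off
wal_init_zero = off
wal_recycle = off
```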

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
Sorry, this was meant to go to the whole group:

Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs: you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre




Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Seref Arikan
Date:
Thanks, I learned something else: I didn't know Hetzner offered S3-compatible storage.

The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3. 

Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :) 

Cheers,
Seref




Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
> The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt-cheap, and they don't bill for operations. So if you run ZeroFS on that, you only pay for raw storage at €4.99 a month.

Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a multi-terabyte Postgres database for less than €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :)

Yes, I need to try that!

Best,
Pierre

On Fri, Jul 18, 2025, at 14:55, Seref Arikan wrote:
Thanks, I learned something else: I didn't know Hetzner offered S3 compatible storage. 

The interesting thing is, a few searches about the performance return mostly negative impressions about their object storage in comparison to the original S3. 

Finding out what kind of performance your benchmarks would yield on a pure AWS setting would be interesting. I am not asking you to do that, but you may get even better performance in that case :) 

Cheers,
Seref




Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
> "NFS" is a key word that does not inspire confidence in
PostgreSQL circles...

Coming back to this, I just implemented 9P, which should translate to proper fsync semantics.

mount -t 9p -o trans=tcp,port=5564,version=9p2000.L,msize=65536,access=user 127.0.0.1 /mnt/9p
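
If you want that mount to persist across reboots, the equivalent /etc/fstab entry would look something like this (same address and port assumptions as the command above):

```
# /etc/fstab sketch for the 9P mount shown above
127.0.0.1  /mnt/9p  9p  trans=tcp,port=5564,version=9p2000.L,msize=65536,access=user,_netdev  0  0
```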

Best,
Pierre

On Fri, Jul 18, 2025, at 06:40, Laurenz Albe wrote:
> On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
>> Looking forward to your feedback and questions!
>
> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.
>
> Performance is great, but it is not everything.  If things
> perform surprisingly well, people become suspicious.
>
>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...
>
> Yours,
> Laurenz Albe



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Nico Williams
Date:
On Fri, Jul 18, 2025 at 06:40:58AM +0200, Laurenz Albe wrote:
> On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
> > Looking forward to your feedback and questions!
> 
> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.

Is there a test suite that can be used to test PG's ACIDity in the face
of simulated power failures?

> Performance is great, but it is not everything.  If things
> perform surprisingly well, people become suspicious.

+1

> > P.S. The full project includes a custom NFS filesystem too.
> 
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...

Certainly NFSv3 should.  NFSv4 is much safer but I've no experience
running PG on it and I assume there will be cases where recovery from
network and/or server failures is slow.



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Nico Williams
Date:
On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
> - Postgres configured accordingly memory-wise as well as with
>   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.

Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
it's not safe _unless_ you have a local, fast, persistent ZIL device
(which I assume you don't).
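
(For reference, a dedicated SLOG/ZIL device as described here is a single command to add; the pool and device names below are placeholders.)

```
# Put the ZFS intent log on a fast, power-loss-protected local device
zpool add tank log /dev/nvme1n1
```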

Nico
-- 



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
It's not "safe" or "unsafe"; there are mountains of valid workloads that don't require synchronous_commit. synchronous_commit doesn't make your system automatically safe either, and if that's a requirement, there are many workarounds, as you suggested. It certainly doesn't make the setup useless.

Best,
Pierre

On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>> - Postgres configured accordingly memory-wise as well as with
>>   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
> it's not safe _unless_ you have a local, fast, persistent ZIL device
> (which I assume you don't).
>
> Nico
> --



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Jeff Ross
Date:
On 7/24/25 13:50, Pierre Barre wrote:

> It's not "safe" or "unsafe"; there are mountains of valid workloads that don't require synchronous_commit.
> synchronous_commit doesn't make your system automatically safe either, and if that's a requirement, there are many
> workarounds, as you suggested. It certainly doesn't make the setup useless.
>
> Best,
> Pierre
>
> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>> - Postgres configured accordingly memory-wise as well as with
>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>> (which I assume you don't).
>>
>> Nico
>> --
This then begs the obvious question: how fast is this with synchronous_commit = on?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
Marco Torres
Date:
My humble take on this project: well done! You are opening the doors to work on a much-needed endeavor, decoupling compute from storage, and potentially elaborating on other projects for an active/active cluster! I applaud you.



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From:
"Pierre Barre"
Date:
Hi Marco,

Thanks for the kind words!

> and potentially elaborating on other projects for an active/active cluster! I applaud you.


I definitely want to write a proof of concept when I get some time.

Best,
Pierre

On Fri, Jul 25, 2025, at 00:21, Marco Torres wrote:
My humble take on this project: well done! You are opening the doors to work on a much-needed endeavor, decoupling compute from storage, and potentially elaborating on other projects for an active/active cluster! I applaud you.



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
> This then begs the obvious question of how fast is this with
> synchronous_commit = on?

Probably not awful, especially with commit_delay.
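
For context, commit_delay lets several concurrent commits share a single WAL flush. A minimal sketch of the kind of tuning in question (illustrative values only, not a tested recommendation):

```
# postgresql.conf sketch (illustrative values only)
synchronous_commit = on    # every commit waits for its WAL flush
commit_delay = 1000        # microseconds a committing backend waits so that
                           # other concurrent commits can share the same flush
commit_siblings = 5        # only apply the delay when at least this many
                           # other transactions are active
```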

I'll try that and report back.

Best,
Pierre

On Fri, Jul 25, 2025, at 00:03, Jeff Ross wrote:
> On 7/24/25 13:50, Pierre Barre wrote:
>
>> It’s not “safe” or “unsafe”; there are mountains of valid workloads which don’t require synchronous_commit. synchronous_commit doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds. As you suggested, it certainly doesn’t make the setup useless.
>>
>> Best,
>> Pierre
>>
>> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>>> - Postgres configured accordingly memory-wise as well as with
>>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>>> (which I assume you don't).
>>>
>>> Nico
>>> --
> This then begs the obvious question of how fast is this with
> synchronous_commit = on?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
Hi,

I went ahead and did that test.

Here is the PostgreSQL config I used for reference (note the WAL options (wal_recycle, wal_init_zero) as well as full_page_writes = off, because ZeroFS cannot have torn writes by design).

https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d
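
For readers who skip the gist, the settings discussed in this thread boil down to an excerpt like this (values as stated above and earlier in the thread; the gist has the full configuration):

```
# Relevant postgresql.conf excerpt (see the gist for the rest)
synchronous_commit = off   # switched to 'on' for the second run below
full_page_writes = off     # ZeroFS cannot produce torn writes by design
wal_init_zero = off
wal_recycle = off
```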

Test was running on Azure with Standard D16ads v5 (16 vcpus, 64 GiB memory)

This time, I didn't run ZFS with L2ARC, I just mounted ZeroFS with 9p.
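
Roughly, the 9p mount looks like this (the address, port and msize here are placeholders rather than ZeroFS defaults):

```
# Hypothetical 9p mount of a ZeroFS export; address, port and msize are placeholders
sudo mount -t 9p -o trans=tcp,port=5564,version=9p2000.L,msize=1048576 127.0.0.1 /mnt_9p
```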

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 6.239 ms
initial connection time = 68.922 ms
tps = 16026.940646 (without initial connection time)


synchronous_commit = on

postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 50000/50000
number of failed transactions: 0 (0.000%)
latency average = 197.723 ms
initial connection time = 46.089 ms
tps = 252.878721 (without initial connection time)


Not great bare-bones with synchronous_commit, but still usable!

Best,
Pierre

On Fri, Jul 25, 2025, at 00:44, Pierre Barre wrote:
>> This then begs the obvious question of how fast is this with
>> synchronous_commit = on?
>
> Probably not awful, especially with commit_delay.
>
> I'll try that and report back.
>
> Best,
> Pierre
>
> On Fri, Jul 25, 2025, at 00:03, Jeff Ross wrote:
>> On 7/24/25 13:50, Pierre Barre wrote:
>>
>>> It’s not “safe” or “unsafe”; there are mountains of valid workloads which don’t require synchronous_commit. synchronous_commit doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds. As you suggested, it certainly doesn’t make the setup useless.
>>>
>>> Best,
>>> Pierre
>>>
>>> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>>>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>>>> - Postgres configured accordingly memory-wise as well as with
>>>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>>>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>>>> (which I assume you don't).
>>>>
>>>> Nico
>>>> --
>> This then begs the obvious question of how fast is this with
>> synchronous_commit = on?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
I built Postgres (same version, 16.9) but with --with-block-size=32 (I'd really love it if this were an initdb-time flag!) and did some more testing.
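
Roughly, the build looks like the sketch below (the stock configure option is spelled --with-blocksize; the prefix and data paths are placeholders, not the exact ones used):

```
# Sketch of a 32 kB block-size build; prefix and data directory are placeholders
./configure --with-blocksize=32 --prefix=/usr/local/pgsql-32k
make -j"$(nproc)" && make install
# Clusters created by this build are on-disk incompatible with stock 8 kB builds,
# so it needs its own initdb and data directory:
/usr/local/pgsql-32k/bin/initdb -D /srv/pgsql-32k/data
```

The results with this build follow.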

synchronous_commit = off

postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 5.727 ms
initial connection time = 59.223 ms
tps = 17460.128835 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 301.800 ms
initial connection time = 62.237 ms
tps = 331.345391 (without initial connection time)

=====================================

Then, using the same setup (same server, same Postgres build), I created a ZeroFS NBD device with ext4 on top:

/dev/nbd0 on /mnt_9p type ext4 (rw,relatime,stripe=32)
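
Roughly, the device setup amounts to something like this (localhost:10809 is an assumption based on the standard NBD port shown in the architecture diagram, not necessarily the exact invocation; adjust to your nbd-client version and export name):

```
# Hypothetical NBD attach + ext4 setup; localhost:10809 is assumed
sudo modprobe nbd
sudo nbd-client 127.0.0.1 10809 /dev/nbd0   # attach the ZeroFS NBD export
sudo mkfs.ext4 /dev/nbd0
sudo mount /dev/nbd0 /mnt_9p                # reusing the existing mount point
```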

synchronous_commit = off

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 3.615 ms
initial connection time = 45.653 ms
tps = 27665.373366 (without initial connection time)

synchronous_commit = on

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 1000
number of transactions actually processed: 100000/100000
number of failed transactions: 0 (0.000%)
latency average = 337.762 ms
initial connection time = 43.969 ms
tps = 296.066616 (without initial connection time)

Best,
Pierre


On Fri, Jul 25, 2025, at 11:25, Pierre Barre wrote:
> Hi,
>
> I went ahead and did that test.
>
> Here is the postgresql config I used for reference (note the wal
> options (recycle, init_zero) as well as full_page_writes = off, because
> ZeroFS cannot have torn writes by design).
>
> https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d
>
> Test was running on Azure with Standard D16ads v5 (16 vcpus, 64 GiB memory)
>
> This time, I didn't run ZFS with L2ARC, I just mounted ZeroFS with 9p.
>
> synchronous_commit = off
>
> postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 100
> number of threads: 40
> maximum number of tries: 1
> number of transactions per client: 1000
> number of transactions actually processed: 100000/100000
> number of failed transactions: 0 (0.000%)
> latency average = 6.239 ms
> initial connection time = 68.922 ms
> tps = 16026.940646 (without initial connection time)
>
>
> synchronous_commit = on
>
> postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 50
> number of threads: 15
> maximum number of tries: 1
> number of transactions per client: 1000
> number of transactions actually processed: 50000/50000
> number of failed transactions: 0 (0.000%)
> latency average = 197.723 ms
> initial connection time = 46.089 ms
> tps = 252.878721 (without initial connection time)
>
>
> Not great barebones with with synchronous_commit, but still usable!
>
> Best,
> Pierre
>
> On Fri, Jul 25, 2025, at 00:44, Pierre Barre wrote:
>>> This then begs the obvious question of how fast is this with
>>> synchronous_commit = on?
>>
>> Probably not awful, especially with commit_delay.
>>
>> I'll try that and report back.
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 25, 2025, at 00:03, Jeff Ross wrote:
>>> On 7/24/25 13:50, Pierre Barre wrote:
>>>
>>>> It’s not “safe” or “unsafe”; there are mountains of valid workloads which don’t require synchronous_commit. synchronous_commit doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds. As you suggested, it certainly doesn’t make the setup useless.
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>>>>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>>>>> - Postgres configured accordingly memory-wise as well as with
>>>>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>>>>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>>>>> (which I assume you don't).
>>>>>
>>>>> Nico
>>>>> --
>>> This then begs the obvious question of how fast is this with
>>> synchronous_commit = on?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
And finally, some read-only benchmarks with the same Postgres build.

9P:

postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.539 ms
initial connection time = 59.157 ms
tps = 185652.686153 (without initial connection time)


ext4:

postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 10000 bench -S
pgbench (16.9 (Ubuntu 16.10-1))
starting vacuum...end.
starting vacuum pgbench_accounts...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 100
number of threads: 40
maximum number of tries: 1
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
number of failed transactions: 0 (0.000%)
latency average = 0.547 ms
initial connection time = 44.054 ms
tps = 182836.180428 (without initial connection time)

Best,
Pierre


On Sat, Jul 26, 2025, at 03:16, Pierre Barre wrote:
> I built postgres (same version, 16.9) but --with-block-size=32 (I'd
> really love if this would be a initdb time flag!) and did some more
> testing:
>
> synchronous_commit = off
>
> postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 10000 bench
> pgbench (16.9 (Ubuntu 16.10-1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 100
> number of threads: 40
> maximum number of tries: 1
> number of transactions per client: 10000
> number of transactions actually processed: 1000000/1000000
> number of failed transactions: 0 (0.000%)
> latency average = 5.727 ms
> initial connection time = 59.223 ms
> tps = 17460.128835 (without initial connection time)
>
> synchronous_commit = on
>
> postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
> pgbench (16.9 (Ubuntu 16.10-1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 100
> number of threads: 40
> maximum number of tries: 1
> number of transactions per client: 1000
> number of transactions actually processed: 100000/100000
> number of failed transactions: 0 (0.000%)
> latency average = 301.800 ms
> initial connection time = 62.237 ms
> tps = 331.345391 (without initial connection time)
>
> =====================================
>
> Then, using the same setup (same server, same postgres build), I create
> a ZeroFS NBD device with ext4 on top
>
> /dev/nbd0 on /mnt_9p type ext4 (rw,relatime,stripe=32)
>
> synchronous_commit = off
>
> postgres@zerofs:/mnt_9p$ pgbench -vvv -c 100 -j 40 -t 10000 bench
> pgbench (16.9 (Ubuntu 16.10-1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 100
> number of threads: 40
> maximum number of tries: 1
> number of transactions per client: 10000
> number of transactions actually processed: 1000000/1000000
> number of failed transactions: 0 (0.000%)
> latency average = 3.615 ms
> initial connection time = 45.653 ms
> tps = 27665.373366 (without initial connection time)
>
> synchronous_commit = on
>
> postgres@zerofs:/root$ pgbench -vvv -c 100 -j 40 -t 1000 bench
> pgbench (16.9 (Ubuntu 16.10-1))
> starting vacuum...end.
> starting vacuum pgbench_accounts...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 50
> query mode: simple
> number of clients: 100
> number of threads: 40
> maximum number of tries: 1
> number of transactions per client: 1000
> number of transactions actually processed: 100000/100000
> number of failed transactions: 0 (0.000%)
> latency average = 337.762 ms
> initial connection time = 43.969 ms
> tps = 296.066616 (without initial connection time)
>
> Best,
> Pierre
>
>
> On Fri, Jul 25, 2025, at 11:25, Pierre Barre wrote:
>> Hi,
>>
>> I went ahead and did that test.
>>
>> Here is the postgresql config I used for reference (note the wal
>> options (recycle, init_zero) as well as full_page_writes = off, because
>> ZeroFS cannot have torn writes by design).
>>
>> https://gist.github.com/Barre/8d68f0d00446389998a31f4e60f3276d
>>
>> Test was running on Azure with Standard D16ads v5 (16 vcpus, 64 GiB memory)
>>
>> This time, I didn't run ZFS with L2ARC, I just mounted ZeroFS with 9p.
>>
>> synchronous_commit = off
>>
>> postgres@zerofs:~$ pgbench -vvv -c 100 -j 40 -t 1000 bench
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> starting vacuum pgbench_accounts...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 100
>> number of threads: 40
>> maximum number of tries: 1
>> number of transactions per client: 1000
>> number of transactions actually processed: 100000/100000
>> number of failed transactions: 0 (0.000%)
>> latency average = 6.239 ms
>> initial connection time = 68.922 ms
>> tps = 16026.940646 (without initial connection time)
>>
>>
>> synchronous_commit = on
>>
>> postgres@zerofs:~$ pgbench -vvv -c 50 -j 15 -t 1000 bench
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> starting vacuum pgbench_accounts...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 1000
>> number of transactions actually processed: 50000/50000
>> number of failed transactions: 0 (0.000%)
>> latency average = 197.723 ms
>> initial connection time = 46.089 ms
>> tps = 252.878721 (without initial connection time)
>>
>>
>> Not great barebones with with synchronous_commit, but still usable!
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 25, 2025, at 00:44, Pierre Barre wrote:
>>>> This then begs the obvious question of how fast is this with
>>>> synchronous_commit = on?
>>>
>>> Probably not awful, especially with commit_delay.
>>>
>>> I'll try that and report back.
>>>
>>> Best,
>>> Pierre
>>>
>>> On Fri, Jul 25, 2025, at 00:03, Jeff Ross wrote:
>>>> On 7/24/25 13:50, Pierre Barre wrote:
>>>>
>>>>> It’s not “safe” or “unsafe”; there are mountains of valid workloads which don’t require synchronous_commit. synchronous_commit doesn’t make your system automatically safe either, and if that’s a requirement, there are many workarounds. As you suggested, it certainly doesn’t make the setup useless.
>>>>>
>>>>> Best,
>>>>> Pierre
>>>>>
>>>>> On Thu, Jul 24, 2025, at 21:44, Nico Williams wrote:
>>>>>> On Fri, Jul 18, 2025 at 12:57:39PM +0200, Pierre Barre wrote:
>>>>>>> - Postgres configured accordingly memory-wise as well as with
>>>>>>>    synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>>>> Bingo.  That's why it's fast (synchronous_commit = off).  It's also why
>>>>>> it's not safe _unless_ you have a local, fast, persistent ZIL device
>>>>>> (which I assume you don't).
>>>>>>
>>>>>> Nico
>>>>>> --
>>>> This then begs the obvious question of how fast is this with
>>>> synchronous_commit = on?



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
Vladimir Churyukin
Date:
A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will have to have functionality to sync in-memory states between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres simply was not built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting!. Great work. Can you clarify how exactly you're running postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Infinite storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>


Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
What you describe doesn’t look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per second, and still has very durable and highly available storage as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want something this unusual; those are just not the use-cases I want to address, because I believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will have to have functionality to sync in-memory states between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic. Most likely do a lot of other changes in a lot of places, Postgres was not just built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting!. Great work. Can you clarify how exactly you're running postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Infinite storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>


Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
Vladimir Churyukin
Date:
Sorry, I was referring to this:

>  But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning

Some pretty well-known cases of storage/compute separation (Aurora, Neon) also share the storage between instances, which is why I'm a bit confused by your reply. I thought you were thinking about this approach too, which is why I mentioned what kind of challenges one may have on that path.


On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre@barre.sh> wrote:
What you describe doesn’t look like something very useful for the vast majority of projects that needs a database. Why would you even want that if you can avoid it? 

If your “single node” can handle tens / hundreds of thousands requests per second, still have very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want very weird like this, that’s just not the use-cases I want to address, because I believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will have to have functionality to sync in-memory states between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic. Most likely do a lot of other changes in a lot of places, Postgres was not just built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting!. Great work. Can you clarify how exactly you're running postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Infinite storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>


Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.

> Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers as I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
Sorry, I was referring to this:

>  But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning

Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,
that's why I'm a bit confused by your reply. I thought you're thinking about this approach too, that's why I mentioned what kind of challenges one may have on that path.


On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre@barre.sh> wrote:

What you describe doesn’t look like something very useful for the vast majority of projects that needs a database. Why would you even want that if you can avoid it? 

If your “single node” can handle tens / hundreds of thousands requests per second, still have very durable and highly available storage, as well as fast recovery mechanisms, what’s the point?

I am not trying to cater to extreme outliers that may want very weird like this, that’s just not the use-cases I want to address, because I believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will have to have functionality to sync in-memory states between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic. Most likely do a lot of other changes in a lot of places, Postgres was not just built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting!. Great work. Can you clarify how exactly you're running postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>
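>>> The synchronous replication legs shown above are ordinary PostgreSQL streaming replication. A hedged sketch of the relevant settings (host name, user, and the standby's application_name are placeholders, and the standby is assumed to have been created with pg_basebackup plus a standby.signal file):
>>>
>>> ```
>>> # On the primary: wait for the standby to confirm before commits return.
>>> psql -c "ALTER SYSTEM SET synchronous_commit = 'on';"
>>> psql -c "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (pg2)';"
>>> psql -c "SELECT pg_reload_conf();"
>>>
>>> # On the standby: stream from the primary while running on its own
>>> # ZFS/ZeroFS pool (connection details are placeholders).
>>> psql -c "ALTER SYSTEM SET primary_conninfo = 'host=pg1.example.com user=replicator application_name=pg2';"
>>> psql -c "SELECT pg_reload_conf();"
>>> ```
>>>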
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Infinite storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>



Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

From
"Pierre Barre"
Date:
Also, Neon [0] and Aurora [1] pricing is so high that it seems to make most use cases impractical (well, if you want a managed offering...). Neon's top public tier doesn't even match what a single modern dedicated server (or virtual machine) can provide. I would have thought decoupling compute and storage would make the offerings cheaper, if anything.

Taking my own Merklemap [2] use case, where I run a 30 TB database, and applying Neon pricing (and I don't doubt that the non-public pricing would be even more expensive than that):

Storage Scaling:

- Business plan: 500 GB -> $700/month
- You need: 30,000 GB (30 TB)
- Scaling factor: 60x
- Linear estimate: $700 × 60 = $42,000/month
- Total 12 months cost: $504,000

Aurora calculation [3]:

- Instance type: db.r5.24xlarge
- Monthly cost: $21,887.28
- Total 12 months cost: $262,647.36

Now, calculating the same 30TB with the same instance type and S3 storage [4]:

- Instance Type: r5.24xlarge
- Monthly cost: $5,555.04
- Total 12 months cost: $66,660.48

But more interestingly, you don't need to use AWS at all anymore: you can move your setup anywhere at this point and get a similar level of reliability (and simplicity) with much cheaper services.

Hetzner ccx63 + Cloudflare R2:

- Hetzner ccx63: €287.99/month ≈ $338/month
- R2 storage (30TB): 30,000 GB × $0.015 = $450/month
- R2 operations: would need to be measured to estimate properly, but will probably be negligible.
- Total monthly: ~$760
- Total 12 months cost: $9,120

Best,
Pierre


On Sat, Jul 26, 2025, at 09:51, Pierre Barre wrote:
Ah, by "shared storage" I mean that each node can acquire exclusivity, not that they can both R/W to it at the same time.
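
To make "acquire exclusivity" concrete, here is one way it could be enforced at the ZFS layer; this is an illustration using standard OpenZFS multihost protection, not necessarily what the setup in this thread does, and the pool name is an assumption.

```
# Give each node a stable hostid and enable multihost (MMP), so a pool
# that is still imported on another live host refuses to import here.
sudo zgenhostid
sudo zpool set multihost=on pgpool

# Moving ownership is then an explicit export on one node...
sudo zpool export pgpool
# ...followed by an import on the other.
sudo zpool import pgpool
```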

> Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,

That model is cool, but I think it's more of a solution for outliers as I was suggesting, not something that most would or should want.

Best,
Pierre

On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
Sorry, I was referring to this:

>  But when PostgreSQL instances share storage rather than replicate:
> - Consistency seems maintained (same data)
> - Availability seems maintained (client can always promote an accessible node)
> - Partitions between PostgreSQL nodes don't prevent the system from functioning

Some pretty well-known cases of storage / compute separation (Aurora, Neon) also share the storage between instances,
that's why I'm a bit confused by your reply. I thought you were thinking about this approach too; that's why I mentioned what kind of challenges one may run into on that path.


On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pierre@barre.sh> wrote:

What you describe doesn't look like something very useful for the vast majority of projects that need a database. Why would you even want that if you can avoid it?

If your “single node” can handle tens or hundreds of thousands of requests per second, while still having very durable and highly available storage as well as fast recovery mechanisms, what's the point?

I am not trying to cater to extreme outliers that may want something very unusual like this; those just aren't the use cases I want to address, because I believe they are few and far between.

Best,
Pierre 

On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
A shared storage would require a lot of extra work. That's essentially what AWS Aurora does.
You will need functionality to sync in-memory state between nodes, because all the instances will have cached data that can easily become stale on any write operation.
That alone is not that simple. You will have to modify some locking logic, and most likely make a lot of other changes in a lot of places; Postgres was just not built with the assumption that the storage can be shared.

-Vladimir

On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pierre@barre.sh> wrote:
Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs: you choose between consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them.

How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
    |                               |
    ↓ (partition here)              ↓
PostgreSQL Primary              PostgreSQL Standby
    |                               |
    └───────────┬───────────────────┘
                ↓
         Shared ZFS Pool
                |
         6 Global ZeroFS instances
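
For illustration only, a failover in that shared-storage model might look like the sketch below: the surviving node takes exclusive ownership of the pool and is promoted. The pool name and data directory are assumptions, and actual fencing of the old primary is deliberately left out.

```
# On the surviving node: take over the shared pool. -f overrides the old
# host's claim, so the previous primary MUST already be fenced off
# (stopped / disconnected from ZeroFS) before running this.
sudo zpool import -f pgpool

# Promote the local instance to primary (paths assumed; PostgreSQL 12+).
sudo -u postgres /usr/lib/postgresql/16/bin/pg_ctl promote -D /var/lib/postgresql/16/main
# or, from a client connection:
psql -c "SELECT pg_promote();"
```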

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise, as well as with synchronous_commit = off, wal_init_zero = off, and wal_recycle = off.
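>
> As a rough sketch of that setup (ZeroFS ports, device names, and the L2ARC partition are assumptions; memory settings omitted):
>
> ```
> # Three ZeroFS NBD exports attached as local block devices.
> sudo nbd-client 127.0.0.1 10809 /dev/nbd0
> sudo nbd-client 127.0.0.1 10810 /dev/nbd1
> sudo nbd-client 127.0.0.1 10811 /dev/nbd2
>
> # Striped pool across the three devices, plus a 200 GB local partition as L2ARC.
> sudo zpool create tank /dev/nbd0 /dev/nbd1 /dev/nbd2
> sudo zpool add tank cache /dev/sdb1
>
> # The WAL-related settings mentioned above.
> psql -c "ALTER SYSTEM SET synchronous_commit = off;"
> psql -c "ALTER SYSTEM SET wal_init_zero = off;"
> psql -c "ALTER SYSTEM SET wal_recycle = off;"
> psql -c "SELECT pg_reload_conf();"
> ```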
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting! Great work. Can you clarify how exactly you're running Postgres in your tests? A specific AWS service? What's the test infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pierre@barre.sh> wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.
>>>
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 100000
>>> number of transactions actually processed: 5000000/5000000
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.
>>>
>>> ## How It Works
>>>
>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
>>> 2. Multiple cache layers hide S3 latency:
>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>    b. ZeroFS memory cache for metadata and hot data
>>>    c. Optional local disk cache
>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>
>>> ## Geo-Distributed PostgreSQL
>>>
>>> Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.
>>>
>>> Example architectures:
>>>
>>> Architecture 1
>>>
>>>
>>>                          PostgreSQL Client
>>>                                    |
>>>                                    | SQL queries
>>>                                    |
>>>                             +--------------+
>>>                             |  PG Proxy    |
>>>                             | (HAProxy/    |
>>>                             |  PgBouncer)  |
>>>                             +--------------+
>>>                                /        \
>>>                               /          \
>>>                    Synchronous            Synchronous
>>>                    Replication            Replication
>>>                             /              \
>>>                            /                \
>>>               +---------------+        +---------------+
>>>               | PostgreSQL 1  |        | PostgreSQL 2  |
>>>               | (Primary)     |◄------►| (Standby)     |
>>>               +---------------+        +---------------+
>>>                       |                        |
>>>                       |  POSIX filesystem ops  |
>>>                       |                        |
>>>               +---------------+        +---------------+
>>>               |   ZFS Pool 1  |        |   ZFS Pool 2  |
>>>               | (3-way mirror)|        | (3-way mirror)|
>>>               +---------------+        +---------------+
>>>                /      |      \          /      |      \
>>>               /       |       \        /       |       \
>>>         NBD:10809 NBD:10810 NBD:10811  NBD:10812 NBD:10813 NBD:10814
>>>              |        |        |           |        |        |
>>>         +--------++--------++--------++--------++--------++--------+
>>>         |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>         +--------++--------++--------++--------++--------++--------+
>>>              |         |         |         |         |         |
>>>              |         |         |         |         |         |
>>>         S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>         (us-east) (eu-west) (ap-south) (us-west) (eu-north) (ap-east)
>>>
>>> Architecture 2:
>>>
>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>                 \                    /
>>>                  \                  /
>>>                   Same ZFS Pool (NBD)
>>>                          |
>>>                   6 Global ZeroFS
>>>                          |
>>>                       S3 Regions
>>>
>>>
>>> The main advantages I see are:
>>> 1. Dramatic cost reduction for large datasets
>>> 2. Simplified geo-distribution
>>> 3. Infinite storage capacity
>>> 4. Built-in encryption and compression
>>>
>>> Looking forward to your feedback and questions!
>>>
>>> Best,
>>> Pierre
>>>
>>> P.S. The full project includes a custom NFS filesystem too.
>>>
>