Discussion: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Noel Grandin
Date:
Hi

Hacker from another open-source DB here (h2database.com).

How does postgresql handle the following situation?

(1) a table containing a LOB column
(2) a query that does

    ResultSet rs = query("select lob_column from table_foo");
    while (rs.next())
    {
        retrieve_lob_data(rs.getLob(1));
        // ... very long-running stuff here ...
    }

In the face of concurrent updates that might overwrite the existing LOB data, how does PostgreSQL handle this?

Does it keep the LOB data around until the ResultSet/Connection is closed?
Or does it impose some extra constraint on the client side? E.g., explicitly opening and closing a transaction, and only wiping the "old" LOB data when the transaction is closed?

I ask because I have implemented two of the four LOB implementations that H2 has used, and we are still having trouble :-(

Regards, Noel.

Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Tom Lane
Date:
Noel Grandin <noelgrandin@gmail.com> writes:
> Hacker from another open-source DB here (h2database.com).

> How does postgresql handle the following situation?

> (1) a table containing a LOB column

Postgres doesn't really do LOB in the same sense that some other DBs
have, so you'd need to specify what you have in mind in Postgres
terms to get a useful answer.

We do have a concept of "large objects" named by OIDs, but they're
much more of a manually-managed, nontransparent feature than typical
LOB implementations.  I don't think our JDBC driver implements the
sort of syntax you sketch (I could be wrong though, not much of a
JDBC guy).
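
For concreteness, here is a minimal JDBC sketch of that manually-managed flavor, using the
server-side lo_create()/lo_put() functions (a sketch only: it assumes an open
java.sql.Connection named conn, and the table_foo/lob_oid names, with lob_oid being a
column of type oid, are invented for illustration):

    // Create a large object, write some bytes into it, and store its OID in an
    // ordinary table column.  The LOB payload itself lives in pg_largeobject.
    conn.setAutoCommit(false);            // large-object access needs a transaction

    long oid;
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT lo_create(0)")) {
        rs.next();
        oid = rs.getLong(1);              // server-assigned OID of the new object
    }

    byte[] payload = "hello, large object".getBytes();
    try (PreparedStatement ps = conn.prepareStatement("SELECT lo_put(?::oid, 0, ?)")) {
        ps.setLong(1, oid);
        ps.setBytes(2, payload);
        ps.execute();                     // write the payload at offset 0
    }

    try (PreparedStatement ps =
             conn.prepareStatement("INSERT INTO table_foo (lob_oid) VALUES (?::oid)")) {
        ps.setLong(1, oid);
        ps.executeUpdate();
    }
    conn.commit();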

Having said that ...

> In the face of concurrent updates that might overwrite the existing LOB
> data, how does PostgreSQL handle this?

... reading from a large object follows the same MVCC rules we use
for all other data.  We allow multiple versions of a tuple to exist
on-disk, and we don't clean out old versions until no live transaction
can "see" them anymore.  So data consistency is just a matter of using
the same "snapshot" (which selects appropriate tuple versions) across
however many queries you want consistent results from.  If somebody
writes new data meanwhile, it doesn't matter because that tuple version
is invisible to your snapshot.

> Or does it impose some extra constraint on the client side? E.g.,
> explicitly opening and closing a transaction, and only wipe the "old" LOB
> data when the transaction is closed?

From a client's perspective, the two options are "snapshots last for
one query" and "snapshots last for one transaction".  You signify which
one you want by selecting a transaction isolation mode when you begin
the transaction.
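
In JDBC terms that choice might look like the following sketch (connection setup is
assumed, not taken from this thread):

    // How long a snapshot lasts is selected via the isolation level:
    //   READ COMMITTED  -> every statement takes a fresh snapshot ("one query")
    //   REPEATABLE READ -> the snapshot taken by the first query is reused for
    //                      the rest of the transaction ("one transaction")
    conn.setAutoCommit(false);
    conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
    // ... run as many queries as need to see one consistent state ...
    conn.commit();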

            regards, tom lane



Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Noel Grandin
Date:
Thanks for the answers.

So, H2, like PostgreSQL, also internally has (a) an MVCC engine and (b) LOBs existing as an on-the-side extra thing.

On Fri, 24 Feb 2023 at 17:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Postgres doesn't really do LOB in the same sense that some other DBs
> have, so you'd need to specify what you have in mind in Postgres
> terms to get a useful answer.

So, specifically, the primary problem we have is this:

(1) A typical small query returns all of its data in a stream to the client.
(2) Which means that, from the server's perspective, the transaction is closed the moment the last record in the stream is pushed to the client.
(3) Which means that, in the face of concurrent updates, the underlying MVCC data behind the query might be long gone from the server by the time the client has finished reading the result set.
(4) However, with LOBs, the client doesn't get the LOB in the result-set data stream; it gets a special identifier (a hash), which it uses to fetch the LOB data from the server in chunks.
(5) Which means that the lifetime of an individual LOB is just horrible to manage, as the sketch below illustrates.

At the moment the implementation I have satisfies the needs of clients in terms of correctness (crosses fingers), but it is horrible in terms of performance because of how long it has to keep LOB data around.
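
A hypothetical sketch of that hazard, using generic JDBC names rather than the actual
H2 client internals (stmt and process() are assumed):

    // By the time this loop runs, the server may already regard the transaction
    // as finished: the whole row stream, which carries only LOB identifiers and
    // not the LOB bytes, has been pushed to the client.
    ResultSet rs = stmt.executeQuery("SELECT lob_column FROM table_foo");
    while (rs.next()) {
        Blob lob = rs.getBlob(1);                  // just a handle so far
        try (InputStream in = lob.getBinaryStream()) {
            process(in);                           // chunked fetches happen here,
                                                   // possibly long after the MVCC
                                                   // snapshot has been discarded
        }
    }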

Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Tom Lane
Date:
Noel Grandin <noelgrandin@gmail.com> writes:
> On Fri, 24 Feb 2023 at 17:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Postgres doesn't really do LOB in the same sense that some other DBs
>> have, so you'd need to specify what you have in mind in Postgres
>> terms to get a useful answer.

> So, specifically, the primary problem we have is this:

> (1) A typical small query returns all of its data in a stream to the client
> (2) which means that, from the server's perspective, the transaction is
> closed the moment the last record in the stream is pushed to the client.
> (3) which means that, in the face of concurrent updates, the underlying
> MVCC data in the query might be long-gone from the server by the time the
> client has finished reading the result set.
> (4) However, with LOBs, the client doesn't get the LOB in the result set
> data stream, it gets a special identifier (a hash), which it uses to fetch
> LOB data from the server in chunks
> (5) Which means that the lifetime of an individual LOB is just horrible
> At the moment the implementation I have satisfies the needs of clients in
> terms of correctness (crosses fingers), but is horrible in terms of
> performance because of how long it has to keep LOB data around.

Yeah, Postgres has an analogous kind of problem.  Our standard way to
use "large objects" is to store their identifying OIDs in tables,
fetch the desired OID with a regular SQL query, and then open and read
(or write) the large object using its OID.  So you have a hazard of
time skew between what you saw in the table and what you see in the
large object.  We pretty much lay that problem off on the clients: if
they want consistency of those views they need to make sure that the
same snapshot is used for both the SQL query and the large-object
read.  That's not hard to do, but it isn't the default behavior,
and in particular they can *not* close the transaction that read the
OID if they'd like to read a matching state of the large object.
So far there's not been a lot of complaints about that ...
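
As a hedged sketch of that client-side pattern (conn is an open java.sql.Connection,
and the table_foo/lob_oid/id names are invented for illustration): keep the OID-fetching
query and the large-object read in one REPEATABLE READ transaction so they share a
snapshot.

    // Fetch the OID and read the large object under the same snapshot.
    conn.setAutoCommit(false);
    conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);

    long oid;
    try (PreparedStatement ps =
             conn.prepareStatement("SELECT lob_oid FROM table_foo WHERE id = ?")) {
        ps.setInt(1, 42);
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            oid = rs.getLong(1);
        }
    }

    // Still the same transaction, hence the same snapshot: this sees the LOB
    // state matching the row above, even if a concurrent writer has since
    // replaced it.
    byte[] data;
    try (PreparedStatement ps = conn.prepareStatement("SELECT lo_get(?::oid)")) {
        ps.setLong(1, oid);
        try (ResultSet rs = ps.executeQuery()) {
            rs.next();
            data = rs.getBytes(1);
        }
    }
    conn.commit();    // only now can superseded large-object versions be reclaimed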

            regards, tom lane



Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Noel Grandin
Date:


On Sat, 25 Feb 2023 at 08:33, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, Postgres has an analogous kind of problem.  Our standard way to
> use "large objects" is to store their identifying OIDs in tables,
> ...
> and in particular they can *not* close the transaction that read the
> OID if they'd like to read a matching state of the large object.
> So far there's not been a lot of complaints about that ...


OK, so it seems like so far my design is not far off the PostgreSQL design (which is very comforting). 

I wonder if the difference is in the client<->server protocol.

Does PostgreSQL hold the transaction open until the client side has closed the result set (or possibly the query object; I'm not sure about the PostgreSQL API here)?

H2 has a very simple client-server protocol: the client simply sends a query and gets back a result-set stream, and there is no explicit acknowledgement of when the client closes the result set. That means the MVCC transaction is typically closed by the time the client even starts reading the result set.

Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Tom Lane
Date:
Noel Grandin <noelgrandin@gmail.com> writes:
> OK, so it seems like so far my design is not far off the PostgreSQL design
> (which is very comforting).

> I wonder if the difference is in the client<->server protocol.

That could be a piece of the puzzle, yeah.

> Does PostgreSQL hold the transaction open until the client side has closed
> the resultset (or the query object possibly, not sure about the PostgreSQL
> API here).

We use single-threaded server processes, so we couldn't close the
transaction (or more to the point, drop the query's snapshot) until
we've computed and sent the whole resultset.  I should think that
there's a similar requirement even if multi-threaded: if you do MVCC
at all then you have to hold your snapshot (or whatever mechanism
you use) until the resultset is all computed, or else later rows
in the query result might be wrong.

In the scenario I'm describing with a query fetching some large
object OID(s) followed by separate queries retrieving those large
objects, we put it on the client to create an explicit transaction
block around those queries (i.e. send BEGIN and COMMIT commands),
and to select a transaction mode that causes the same snapshot to
be used across the whole transaction.  If the client fails to do
this, there could be concurrency anomalies.  Any one of those
queries will still deliver self-consistent results, but they
might not match up with earlier or later queries.

            regards, tom lane



Re: how does postgresql handle LOB/CLOB/BLOB column data that dies before the query ends

From:
Noel Grandin
Date:


On Sat, 25 Feb 2023 at 19:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That could be a piece of the puzzle, yeah.


Thank you very much, this conversation has been a great help.

Regards, Noel Grandin