Обсуждение: Document efficient self-joins / UPDATE LIMIT techniques.

Поиск
Список
Период
Сортировка

Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
This patch adds a few examples to demonstrate the following:

* The existence of the ctid column on every table
* The utility of ctds in self joins
* A practical usage of SKIP LOCKED

The reasoning for this is a bit long, but if you're interested, keep reading.

In the past, there has been a desire to see a LIMIT clause of some sort on UPDATE and DELETE statements. The reason for this usually stems from having a large archive or backfill operation that if done in one single transaction would overwhelm normal operations, either by the transaction failing outright, locking too many rows, flooding the WAL causing replica lag, or starving other processes of limited I/O.

The reasons for not adding a LIMIT clause are pretty straightforward: it isn't in the SQL Standard, and UPDATE/DELETE operations are unordered operations, so updating 1000 rows randomly isn't a great idea. The people wanting the LIMIT clause were undeterred by this, because they know that they intend to keep issuing updates until they run out of rows to update.

Given these limitations, I would write something like this:

WITH doomed AS (
    SELECT t.id
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
WHERE id IN (SELECT id FROM doomed );

This wouldn't interfere with any other updates, so I felt good about it running when the system was not-too-busy. I'd then write a script to run that in a loop, with sleeps to allow the replicas a chance to catch their breath. Then, when the rowcount finally dipped below 1000, I'd issue the final

DELETE FROM my_table WHERE expiration_date < :'some_archive_date';

And this was ok, because at that point I have good reason to believe that there are at most 1000 rows lingering out there, so waiting on locks for those was no big deal.

But a query like this involves one scan along one index (or worse, a seq scan) followed by another scan, either index or seq. Either way, we're taking up a lot of cache with rows we don't even care about.

Then in v12, the query planner got hip to bitmap tidscans, allowing for this optimization:

WITH doomed AS (
    SELECT t.ctid AS tid
    FROM my_table AS t
    WHERE t.expiration_date < :'some_archive_date'
    FOR UPDATE SKIP LOCKED
    LIMIT 1000 )
DELETE FROM my_table
USING doomed WHERE my_table.ctid = doomed.tid;

And this works pretty well, especially if you set up a partial index to meet the quals in the CTE. But we don't document this anywhere, and until UPDATE and DELETE get a LIMIT clause, we probably should document this workaround.
Вложения

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
On Wed, Jun 28, 2023 at 2:20 PM Corey Huinker <corey.huinker@gmail.com> wrote:
This patch adds a few examples to demonstrate the following:

Bumping so CF app can see thread.

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
jian he
Дата:
Hi.
-----------------------------------------
In cases where a DML operation involving many rows must be performed,
and that table experiences numerous other simultaneous DML operations,
a FOR UPDATE clause used in conjunction with SKIP LOCKED can be useful
for performing partial DML operations:

WITH mods AS (SELECT ctid FROM mytable
              WHERE status = 'active' AND retries > 10
              ORDER BY id FOR UPDATE SKIP LOCKED)
UPDATE mytable SET status = 'failed'
FROM mods WHERE mytable.ctid = mods.ctid

This allows the DML operation to be performed in parts, avoiding
locking, until such time as the set of rows that remain to be modified
is small enough that the locking will not affect overall performance,
at which point the same statement can be issued without the SKIP
LOCKED clause to ensure that no rows were overlooked.
----------------------------------
mods found out the ctids to be updated, update mytable actually do the update.
I didn't get "This allows the DML operation to be performed in parts".

omit "at which point", the last sentence still makes sense. so I
didn't get "at which point"?

I am not native english speaker.



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Laurenz Albe
Дата:
On Wed, 2023-06-28 at 14:20 -0400, Corey Huinker wrote:
> This patch adds a few examples to demonstrate the following:
>
> * The existence of the ctid column on every table
> * The utility of ctds in self joins
> * A practical usage of SKIP LOCKED

I had a look at your patch, and I am in favor of the general idea.

Style considerations:
---------------------

I think the SQL statements should end with semicolons.  Our SQL examples
are usually written like that.

Our general style with CTEs seems to be (according to
https://www.postgresql.org/docs/current/queries-with.html):

 WITH quaxi AS (
     SELECT ...
 )
 SELECT ...;

About the DELETE example:
-------------------------

The text suggests that a single, big DELETE operation can consume
too many resources.  That may be true, but the sum of your DELETEs
will consume even more resources.

In my experience, the bigger problem with bulk deletes like that is
that you can run into deadlocks easily, so maybe that would be a
better rationale to give.  You could say that with this technique,
you can force the lock to be taken in a certain order, which will
avoid the possibility of deadlock with other such DELETEs.

About the SELECT example:
-------------------------

That example belongs to UPDATE, I'd say, because that is the main
operation.

The reason you give (avoid excessive locking) is good.
Perhaps you could mention that updating in batches also avoids
excessive bload (if you VACUUM between the batches).

About the UPDATE example:
-------------------------

I think that could go, because it is pretty similar to the previous
one.  You even use ctid in both examples.

Status set to "waiting for author".

Yours,
Laurenz Albe



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:

I think the SQL statements should end with semicolons.  Our SQL examples
are usually written like that.

ok

 

Our general style with CTEs seems to be (according to
https://www.postgresql.org/docs/current/queries-with.html):

 WITH quaxi AS (
     SELECT ...
 )
 SELECT ...;

done
 

About the DELETE example:
-------------------------

The text suggests that a single, big DELETE operation can consume
too many resources.  That may be true, but the sum of your DELETEs
will consume even more resources.

In my experience, the bigger problem with bulk deletes like that is
that you can run into deadlocks easily, so maybe that would be a
better rationale to give.  You could say that with this technique,
you can force the lock to be taken in a certain order, which will
avoid the possibility of deadlock with other such DELETEs.

I've changed the wording to address your concerns:

   While doing this will actually increase the total amount of work performed, it can break the work into chunks that have a more acceptable impact on other workloads.

 

About the SELECT example:
-------------------------

That example belongs to UPDATE, I'd say, because that is the main
operation.

I'm iffy on that suggestion. A big part of putting it in SELECT was the fact that it shows usage of SKIP LOCKED and FOR UPDATE.
 

The reason you give (avoid excessive locking) is good.
Perhaps you could mention that updating in batches also avoids
excessive bload (if you VACUUM between the batches).

I went with:

   This technique has the additional benefit that it can reduce the overal bloat of the updated table if the table can be vacuumed in between batch updates.
 

About the UPDATE example:
-------------------------

I think that could go, because it is pretty similar to the previous
one.  You even use ctid in both examples.

It is similar, but the idea here is to aid in discovery. A user might miss the technique for update if it's only documented in delete, and even if they did see it there, they might not realize that it works for both UPDATE and DELETE. We could make reference links from one to the other, but that seems like extra work for the reader.
 
Вложения

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Laurenz Albe
Дата:
On Tue, 2023-10-31 at 14:12 -0400, Corey Huinker wrote:
>
>
> > About the SELECT example:
> > -------------------------
> >
> > That example belongs to UPDATE, I'd say, because that is the main
> > operation.
>
> I'm iffy on that suggestion. A big part of putting it in SELECT was the fact
> that it shows usage of SKIP LOCKED and FOR UPDATE.

I can accept that.

>
> > About the UPDATE example:
> > -------------------------
> >
> > I think that could go, because it is pretty similar to the previous
> > one.  You even use ctid in both examples.
>
> It is similar, but the idea here is to aid in discovery. A user might miss the
> technique for update if it's only documented in delete, and even if they did see
> it there, they might not realize that it works for both UPDATE and DELETE.
> We could make reference links from one to the other, but that seems like extra
> work for the reader.

I am talking about the similarity between the SELECT and the UPDATE example.
I don't agree with bloating the documentation with redundant examples just
to save a user a click.

I like the idea of a link. Perhaps:

  If you need to perform a large UPDATE in batches to avoid excessive bloat,
  deadlocks or to reduce the load on the server, look at the example in <link>.

Other observations:

  @@ -234,6 +234,35 @@ DELETE FROM films
      In some cases the join style is easier to write or faster to
      execute than the sub-select style.
     </para>
  +  <para>
  +   In situations where a single operation would consume too many resources,
  +   either causing the operation to fail or negatively impacting other workloads,
  +   it may be desirable to break up a large <command>DELETE</command> into
  +   multiple separate commands. While doing this will actually increase the
  +   total amount of work performed, it can break the work into chunks that have
  +   a more acceptable impact on other workloads.  The
  +   <glossterm linkend="glossary-sql-standard">SQL standard</glossterm> does
  +   not define a <literal>LIMIT</literal> clause for <command>DELETE</command>
  +   operations, but it is possible get the equivalent functionality through the
  +   <literal>USING</literal> clause to a
  +   <link linkend="queries-with">Common Table Expression</link> which identifies
  +   a subset of rows to be deleted, locks those rows, and returns their system
  +   column <link linkend="ddl-system-columns-ctid">ctid</link> values:

I don't think that reducing the load on the server is such a great use case
that we should recommend it as "best practice" in the documentation (because,
as your patch now mentions, it doesn't reduce the overall load).

I also don't think we need a verbal description of what the following query does.

How about something like:

"If you have to delete lots of rows, it can make sense to perform the operation
 in several smaller batches to reduce the risk of deadlocks.  The
 <glossterm linkend="glossary-sql-standard">SQL standard</glossterm> does
 not define a <literal>LIMIT</literal> clause for <command>DELETE</command>,
 but it is possible to achieve a similar effect with a self-join on
 the system column <link linkend="ddl-system-columns-ctid">ctid</link>:"

  +<programlisting>
  +WITH delete_batch AS (
  +  SELECT l.ctid
  +  FROM user_logs AS l
  +  WHERE l.status = 'archived'
  +  ORDER BY l.creation_date
  +  LIMIT 10000
  +  FOR UPDATE
  +)
  +DELETE FROM user_logs AS ul
  +USING delete_branch AS del
  +WHERE ul.ctid = del.ctid;
  +</programlisting>
  +  This allows for flexible search criteria within the CTE and an efficient self-join.
  +  </para>

The last sentence is redundant, I'd say.

But you could add:

"An added benefit is that by using an <literal>ORDER BY</literal> clause in
 the subquery, you can determine the order in which the rows will be locked
 and deleted, which will prevent deadlocks with other statements that lock
 the rows in the same order."

But if you do that, you had better use "ORDER BY id" or something else that
looks more like a unique column.

--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1679,6 +1679,30 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5;
     condition is not textually within the sub-query.
    </para>

+   <para>
+    In cases where a <acronym>DML</acronym> operation involving many rows

I think we should avoid using DML.  Beginner might not know it, and it is
not an index term.  My suggestion is "data modification statement/operation".

+    must be performed, and that table experiences numerous other simultaneous
+    <acronym>DML</acronym> operations, a <literal>FOR UPDATE</literal> clause
+    used in conjunction with <literal>SKIP LOCKED</literal> can be useful for
+    performing partial <acronym>DML</acronym> operations:
+
+<programlisting>
+WITH mods AS (
+    SELECT ctid FROM mytable
+    WHERE status = 'active' AND retries > 10
+    ORDER BY id FOR UPDATE SKIP LOCKED
+)
+UPDATE mytable SET status = 'failed'
+FROM mods WHERE mytable.ctid = mods.ctid;
+</programlisting>
+
+    This allows the <acronym>DML</acronym> operation to be performed in parts, avoiding locking,
+    until such time as the set of rows that remain to be modified is small enough

"until such time as" does not sound English to me.  "Until the number of rows that remain"
would be better, in my opinion.

+    that the locking will not affect overall performance, at which point the same

"that the locking" --> "that locking them"

+    statement can be issued without the <literal>SKIP LOCKED</literal> clause to ensure
+    that no rows were overlooked. This technique has the additional benefit that it can reduce
+    the overal bloat of the updated table if the table can be vacuumed in between batch updates.
+   </para>

"overal" --> "overall"

I don't think you should use "vacuum" as a verb.
Suggestion: "if you perform <command>VACUUM</command> on the table between individual
update batches".


Yours,
Laurenz Albe



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
vignesh C
Дата:
On Tue, 31 Oct 2023 at 23:42, Corey Huinker <corey.huinker@gmail.com> wrote:
>>
>>
>> I think the SQL statements should end with semicolons.  Our SQL examples
>> are usually written like that.
>
>
> ok
>
>
>>
>>
>> Our general style with CTEs seems to be (according to
>> https://www.postgresql.org/docs/current/queries-with.html):
>>
>>  WITH quaxi AS (
>>      SELECT ...
>>  )
>>  SELECT ...;
>
>
> done
>
>>
>>
>> About the DELETE example:
>> -------------------------
>>
>> The text suggests that a single, big DELETE operation can consume
>> too many resources.  That may be true, but the sum of your DELETEs
>> will consume even more resources.
>>
>> In my experience, the bigger problem with bulk deletes like that is
>> that you can run into deadlocks easily, so maybe that would be a
>> better rationale to give.  You could say that with this technique,
>> you can force the lock to be taken in a certain order, which will
>> avoid the possibility of deadlock with other such DELETEs.
>
>
> I've changed the wording to address your concerns:
>
>    While doing this will actually increase the total amount of work performed, it can break the work into chunks that
havea more acceptable impact on other workloads. 
>
>
>>
>>
>> About the SELECT example:
>> -------------------------
>>
>> That example belongs to UPDATE, I'd say, because that is the main
>> operation.
>
>
> I'm iffy on that suggestion. A big part of putting it in SELECT was the fact that it shows usage of SKIP LOCKED and
FORUPDATE. 
>
>>
>>
>> The reason you give (avoid excessive locking) is good.
>> Perhaps you could mention that updating in batches also avoids
>> excessive bload (if you VACUUM between the batches).
>
>
> I went with:
>
>    This technique has the additional benefit that it can reduce the overal bloat of the updated table if the table
canbe vacuumed in between batch updates. 
>
>>
>>
>> About the UPDATE example:
>> -------------------------
>>
>> I think that could go, because it is pretty similar to the previous
>> one.  You even use ctid in both examples.
>
>
> It is similar, but the idea here is to aid in discovery. A user might miss the technique for update if it's only
documentedin delete, and even if they did see it there, they might not realize that it works for both UPDATE and
DELETE.We could make reference links from one to the other, but that seems like extra work for the reader. 

I have changed the status of commitfest entry to "Returned with
Feedback" as Laurenz's comments have not yet been resolved. Please
handle the comments and update the commitfest entry accordingly.

Regards,
Vignesh



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
I have changed the status of commitfest entry to "Returned with
Feedback" as Laurenz's comments have not yet been resolved. Please
handle the comments and update the commitfest entry accordingly.


Here's another attempt, applying Laurenz's feedback:

I removed all changes to the SELECT documentation. That might seem strange given that the heavy lifting happens in the SELECT, but I'm working from the assumption that people's greatest need for a ctid self-join will be because they are trying to find the LIMIT keyword on UPDATE/DELETE and coming up empty.

Because the join syntax is subtly different between UPDATE and DELETE, I've kept code examples in both, but the detailed explanation is in UPDATE under the anchor "update-limit" and the DELETE example links to it.

Вложения

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Laurenz Albe
Дата:
On Sat, 2024-02-03 at 15:27 -0500, Corey Huinker wrote:
>
> Here's another attempt, applying Laurenz's feedback:

I like this patch much better.

Some comments:

> --- a/doc/src/sgml/ref/delete.sgml
> +++ b/doc/src/sgml/ref/delete.sgml
> @@ -234,6 +234,24 @@ DELETE FROM films
>     In some cases the join style is easier to write or faster to
>     execute than the sub-select style.
>    </para>
> +  <para id="delete-limit">
> +   While there is no <literal>LIMIT</literal> clause for
> +   <command>DELETE</command>, it is possible to get a similar effect
> +   using the method for <command>UPDATE</command> operations described
> +   <link linkend="update-limit">in greater detail here</link>.
> +<programlisting>
> +WITH delete_batch AS (
> +  SELECT l.ctid
> +  FROM user_logs AS l
> +  WHERE l.status = 'archived'
> +  ORDER BY l.creation_date
> +  LIMIT 10000
> +  FOR UPDATE
> +)
> +DELETE FROM user_logs AS ul
> +USING delete_branch AS del
> +WHERE ul.ctid = del.ctid;
> +</programlisting></para>
>   </refsect1>
>
>   <refsect1>

- About the style: there is usually an empty line between an ending </para>
  and the next starting <para>.  It does not matter for correctness, but I
  think it makes the source easier to read.

- I would rather have only "here" as link text rather than "in greater details
  here".  Even better would be something that gives the reader a clue where
  the link will take her, like
  <link linkend="update-limit">the documentation of <command>UPDATE</command></link>.

- I am not sure if it is necessary to have the <programlisting> at all.
  I'd say that it is just a trivial variation of the UPDATE example.
  On the other hand, a beginner might find the example useful.
  Not sure.

If I had my way, I'd just keep the first paragraph, something like

  <para id="delete-limit">
   While there is no <literal>LIMIT</literal> clause for
   <command>DELETE</command>, it is possible to get a similar effect
   using a self-join with a common table expression as described in the
   <link linkend="update-limit"><command>UPDATE</command> examples</link>.
  </para>


> diff --git a/doc/src/sgml/ref/update.sgml b/doc/src/sgml/ref/update.sgml
> index 2ab24b0523..49e0dc29de 100644
> --- a/doc/src/sgml/ref/update.sgml
> +++ b/doc/src/sgml/ref/update.sgml
> @@ -434,7 +434,6 @@ UPDATE wines SET stock = stock + 24 WHERE winename = 'Chateau Lafite 2003';
>  COMMIT;
>  </programlisting>
>    </para>
> -
>    <para>
>     Change the <structfield>kind</structfield> column of the table
>     <structname>films</structname> in the row on which the cursor

Please don't.


I'm mostly fine with the UPDATE example.

> +   it can make sense to perform the operation in smaller batches. Performing a
> +   <command>VACUUM</command> operation on the table in between batches can help
> +   reduce table bloat. The

I think the "in" before between is unnecessary and had better be removed, but
I'll defer to the native speaker.

Yours,
Laurenz Albe



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:

- About the style: there is usually an empty line between an ending </para>
  and the next starting <para>.  It does not matter for correctness, but I
  think it makes the source easier to read.

Done. I've seen them with spaces and without, and have no preference.
 

- I would rather have only "here" as link text rather than "in greater details
  here".  Even better would be something that gives the reader a clue where
  the link will take her, like
  <link linkend="update-limit">the documentation of <command>UPDATE</command></link>.

Done. 

- I am not sure if it is necessary to have the <programlisting> at all.
  I'd say that it is just a trivial variation of the UPDATE example.
  On the other hand, a beginner might find the example useful.
  Not sure.

I think a beginner would find it useful. The join syntax for DELETE is different from UPDATE in a way that has never made sense to me, and a person with only the UPDATE example might try just replacing UPDATE WITH DELETE and eliminating the SET clause, and frustration would follow. We have an opportunity to show the equivalent join in both cases, let's use it.

 
I think the "in" before between is unnecessary and had better be removed, but
I'll defer to the native speaker.

The "in" is more common when spoken. Removed.
Вложения

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Laurenz Albe
Дата:
On Mon, 2024-02-12 at 11:45 -0500, Corey Huinker wrote:
>
> > - I am not sure if it is necessary to have the <programlisting> at all.
> >   I'd say that it is just a trivial variation of the UPDATE example.
> >   On the other hand, a beginner might find the example useful.
> >   Not sure.
>
> I think a beginner would find it useful. The join syntax for DELETE is different from
> UPDATE in a way that has never made sense to me, and a person with only the UPDATE
> example might try just replacing UPDATE WITH DELETE and eliminating the SET clause,
> and frustration would follow. We have an opportunity to show the equivalent join in
> both cases, let's use it.

I think we can leave the decision to the committer.

> > I think the "in" before between is unnecessary and had better be removed, but
> > I'll defer to the native speaker.
>
> The "in" is more common when spoken. Removed.

The "in" is appropriate for intransitive use:
"I've been here and I've been there and I've been in between."
But: "I have been between here and there."

Do you plan to add it to the commitfest?  If yes, I'd set it "ready for committer".

Yours,
Laurenz Albe



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
Do you plan to add it to the commitfest?  If yes, I'd set it "ready for committer".

Commitfest entry reanimated. 

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Laurenz Albe
Дата:
On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready for committer".
>
> Commitfest entry reanimated. 

Truly... you created a revenant in the already closed commitfest.

I closed that again and added a new entry in the open commitfest.

Yours,
Laurenz Albe



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
"Joel Jacobson"
Дата:
On Tue, Feb 13, 2024, at 10:28, Laurenz Albe wrote:
> On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
>> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready for committer".
>> 
>> Commitfest entry reanimated. 
>
> Truly... you created a revenant in the already closed commitfest.
>
> I closed that again and added a new entry in the open commitfest.
>
> Yours,
> Laurenz Albe

This thread reminded me of the old discussion "LIMIT for UPDATE and DELETE" from 2014 [1].

Back in 2014, it was considered a "fringe feature" by some. It is thought to be more commonplace today?

/Joel

[1] https://www.postgresql.org/message-id/flat/CADB9FDf-Vh6RnKAMZ4Rrg_YP9p3THdPbji8qe4qkxRuiOwm%3Dmg%40mail.gmail.com



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
On Tue, Feb 13, 2024 at 11:51 AM Joel Jacobson <joel@compiler.org> wrote:
On Tue, Feb 13, 2024, at 10:28, Laurenz Albe wrote:
> On Mon, 2024-02-12 at 12:24 -0500, Corey Huinker wrote:
>> > Do you plan to add it to the commitfest?  If yes, I'd set it "ready for committer".
>>
>> Commitfest entry reanimated.
>
> Truly... you created a revenant in the already closed commitfest.
>
> I closed that again and added a new entry in the open commitfest.
>
> Yours,
> Laurenz Albe

This thread reminded me of the old discussion "LIMIT for UPDATE and DELETE" from 2014 [1].

Back in 2014, it was considered a "fringe feature" by some. It is thought to be more commonplace today?

/Joel

[1] https://www.postgresql.org/message-id/flat/CADB9FDf-Vh6RnKAMZ4Rrg_YP9p3THdPbji8qe4qkxRuiOwm%3Dmg%40mail.gmail.com

This patch came out of a discussion at the last PgCon with the person who made the "fringe feature" quote, who seemed quite supportive of documenting the technique. The comment may have been in regards to actually implementing a LIMIT clause on UPDATE and DELETE, which isn't in the SQL standard and would be difficult to implement as the two statements have no concept of ordering. Documenting the workaround would alleviate some interest in implementing a nonstandard feature.

As for whether it's commonplace, when I was a consultant I had a number of customers that I had who bemoaned how large updates caused big replica lag, basically punishing access to records they did care about in order to properly archive or backfill records they don't care about. I used the technique a lot, putting the update/delete in a loop, and often running multiple copies of the same script at times when I/O contention was low, but if load levels rose it was trivial to just kill a few of the scripts until things calmed down.

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
"Joel Jacobson"
Дата:
On Tue, Feb 13, 2024, at 23:56, Corey Huinker wrote:
> This patch came out of a discussion at the last PgCon with the person 
> who made the "fringe feature" quote, who seemed quite supportive of 
> documenting the technique. The comment may have been in regards to 
> actually implementing a LIMIT clause on UPDATE and DELETE, which isn't 
> in the SQL standard and would be difficult to implement as the two 
> statements have no concept of ordering. Documenting the workaround 
> would alleviate some interest in implementing a nonstandard feature.

Thanks for sharing the background story.

> As for whether it's commonplace, when I was a consultant I had a number 
> of customers that I had who bemoaned how large updates caused big 
> replica lag, basically punishing access to records they did care about 
> in order to properly archive or backfill records they don't care about. 
> I used the technique a lot, putting the update/delete in a loop, and 
> often running multiple copies of the same script at times when I/O 
> contention was low, but if load levels rose it was trivial to just kill 
> a few of the scripts until things calmed down.

I've also used the technique quite a lot, but only using the PK,
didn't know about the ctid trick, so many thanks for documenting it.

/Joel



Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Corey Huinker
Дата:
> As for whether it's commonplace, when I was a consultant I had a number
> of customers that I had who bemoaned how large updates caused big
> replica lag, basically punishing access to records they did care about
> in order to properly archive or backfill records they don't care about.
> I used the technique a lot, putting the update/delete in a loop, and
> often running multiple copies of the same script at times when I/O
> contention was low, but if load levels rose it was trivial to just kill
> a few of the scripts until things calmed down.

I've also used the technique quite a lot, but only using the PK,
didn't know about the ctid trick, so many thanks for documenting it.

tid-scans only became a thing a few versions ago (12?). Prior to that, PK was the only way to go.

 

Re: Document efficient self-joins / UPDATE LIMIT techniques.

От
Tom Lane
Дата:
Corey Huinker <corey.huinker@gmail.com> writes:
>> I've also used the technique quite a lot, but only using the PK,
>> didn't know about the ctid trick, so many thanks for documenting it.

> tid-scans only became a thing a few versions ago (12?). Prior to that, PK
> was the only way to go.

I think we had TID scans for awhile before it was possible to use
them in joins, although I don't recall the details of that.
Anyway, pushed after some additional wordsmithing.

            regards, tom lane