Обсуждение: database must be vacuumed with transactions

Поиск
Список
Период
Сортировка

database must be vacuumed with transactions

От
Dave Johansen
Дата:
The database I work on started spitting out the following message last night:
WARNING:  database "ops" must be vacuumed within 9361597 transaction

I ran VACUUM as soon as I got in, but then shortly after, it stopped accepting transactions. I stopped the database and opened single user mode to run VACUUM. So my questions are:
1) How long should I expect this to take (database has about 5 TB of data)?
2) Is there a faster way to make the database be operational again?

Thanks,
Dave

Re: database must be vacuumed with transactions

От
Dave Johansen
Дата:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
The database I work on started spitting out the following message last night:
WARNING:  database "ops" must be vacuumed within 9361597 transaction

I ran VACUUM as soon as I got in, but then shortly after, it stopped accepting transactions. I stopped the database and opened single user mode to run VACUUM. So my questions are:
1) How long should I expect this to take (database has about 5 TB of data)?
2) Is there a faster way to make the database be operational again?

Sorry for the multiple emails, but I forgot to mention that I'm using 8.4 on RHEL 6.4.

Re: database must be vacuumed with transactions

От
John Scalia
Дата:
Here many of our databases are in the TB size range. Vacuums often take several hours to complete, and we have schedule maintenance time for a vacuum full of around an average of 4 hours.

On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
The database I work on started spitting out the following message last night:
WARNING:  database "ops" must be vacuumed within 9361597 transaction

I ran VACUUM as soon as I got in, but then shortly after, it stopped accepting transactions. I stopped the database and opened single user mode to run VACUUM. So my questions are:
1) How long should I expect this to take (database has about 5 TB of data)?
2) Is there a faster way to make the database be operational again?

Sorry for the multiple emails, but I forgot to mention that I'm using 8.4 on RHEL 6.4.

Re: database must be vacuumed with transactions

От
Dave Johansen
Дата:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
The database I work on started spitting out the following message last night:
WARNING:  database "ops" must be vacuumed within 9361597 transaction

I ran VACUUM as soon as I got in, but then shortly after, it stopped accepting transactions. I stopped the database and opened single user mode to run VACUUM. So my questions are:
1) How long should I expect this to take (database has about 5 TB of data)?
2) Is there a faster way to make the database be operational again?

Sorry for the multiple emails, but I forgot to mention that I'm using 8.4 on RHEL 6.4.

Once the VACUUM was started to get the database back online, we did some diagnostics and it appears that there was a VERY large number of transactions done in the last few weeks and that was the source of this problem. It appears that there's a txid_current() function ( http://www.postgresql.org/docs/8.4/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT ), but is there a way to get the XID for all open connections so we can find the offending connection and fix the issue so this won't happen again?

Thanks,
Dave

Re: database must be vacuumed with transactions

От
John Scalia
Дата:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

On Thu, Jun 25, 2015 at 11:30 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 8:49 AM, Dave Johansen <davejohansen@gmail.com> wrote:
The database I work on started spitting out the following message last night:
WARNING:  database "ops" must be vacuumed within 9361597 transaction

I ran VACUUM as soon as I got in, but then shortly after, it stopped accepting transactions. I stopped the database and opened single user mode to run VACUUM. So my questions are:
1) How long should I expect this to take (database has about 5 TB of data)?
2) Is there a faster way to make the database be operational again?

Sorry for the multiple emails, but I forgot to mention that I'm using 8.4 on RHEL 6.4.

Once the VACUUM was started to get the database back online, we did some diagnostics and it appears that there was a VERY large number of transactions done in the last few weeks and that was the source of this problem. It appears that there's a txid_current() function ( http://www.postgresql.org/docs/8.4/static/functions-info.html#FUNCTIONS-TXID-SNAPSHOT ), but is there a way to get the XID for all open connections so we can find the offending connection and fix the issue so this won't happen again?

Thanks,
Dave

Re: database must be vacuumed with transactions

От
Dave Johansen
Дата:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Re: database must be vacuumed with transactions

От
John Scalia
Дата:
Logging?

On Thu, Jun 25, 2015 at 11:42 AM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Re: database must be vacuumed with transactions

От
Keith
Дата:


On Thu, Jun 25, 2015 at 2:42 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Look at the log_line_prefix options. To get everything you're looking for to try and identify where this is coming from look at using at a minimum: %p, %r, %m, %x, %u, %d

http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX

You'll also have to make sure log_min_duration_statement is set to low enough to log the queries you're looking for. Beware setting disabling it or setting it too low, though, if you have a very high number of queries being run. This can quickly overload your IO and disk space.

If you get the database running again, you can also look at the pg_stat_statements contrib module to log query counts and parameterized versions of all queries being run.

http://www.postgresql.org/docs/8.4/static/pgstatstatements.html

And I can't let this email go without say you REALLY need to look at upgrading your database. 8.4 is no longer being supported and receives no security or bug fixes.


Re: database must be vacuumed with transactions

От
Dave Johansen
Дата:
On Thu, Jun 25, 2015 at 2:23 PM, Keith <keith@keithf4.com> wrote:
On Thu, Jun 25, 2015 at 2:42 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Look at the log_line_prefix options. To get everything you're looking for to try and identify where this is coming from look at using at a minimum: %p, %r, %m, %x, %u, %d

http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX

We found the source of the issue. It's a function that we're using to all a table to be populated from multiple processes. You can see the details here:
http://www.postgresql.org/message-id/CAAcYxUf9o3NpiKaHdeShO_iJjm92314eX6ko0JvJidmksutkJg@mail.gmail.com

Any ideas on how we could handle this with causing it to increment the XID?
 
You'll also have to make sure log_min_duration_statement is set to low enough to log the queries you're looking for. Beware setting disabling it or setting it too low, though, if you have a very high number of queries being run. This can quickly overload your IO and disk space.

This probably isn't feasible, because we run a lot of queries with most of them being very short duration.
 
If you get the database running again, you can also look at the pg_stat_statements contrib module to log query counts and parameterized versions of all queries being run.

http://www.postgresql.org/docs/8.4/static/pgstatstatements.html

We got it up and we will be turning this module on to help diagnose issues like this in the future.
 
And I can't let this email go without say you REALLY need to look at upgrading your database. 8.4 is no longer being supported and receives no security or bug fixes.

We are looking into upgrading to RHEL 7 which comes with Postgres 9.2 but that's a SLOW process.

Re: database must be vacuumed with transactions

От
Keith
Дата:


On Thu, Jun 25, 2015 at 8:57 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 2:23 PM, Keith <keith@keithf4.com> wrote:
On Thu, Jun 25, 2015 at 2:42 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Look at the log_line_prefix options. To get everything you're looking for to try and identify where this is coming from look at using at a minimum: %p, %r, %m, %x, %u, %d

http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX

We found the source of the issue. It's a function that we're using to all a table to be populated from multiple processes. You can see the details here:
http://www.postgresql.org/message-id/CAAcYxUf9o3NpiKaHdeShO_iJjm92314eX6ko0JvJidmksutkJg@mail.gmail.com

Any ideas on how we could handle this with causing it to increment the XID?

Not sure on how to avoid the xid increment on this. I doubt this can be avoided, but hopefully someone else can respond to your other email with a more definitive answer. What I would recommend instead in the mean time to help deal with the XID issue is some close monitoring of your current txid vs the autovacuum_max_freeze_age. And get some sort of externally scheduled job to more aggressively vacuum the database based on that. And if you have any static, unwritten tables, you definitely want to VACUUM FREEZE these to avoid the txid building on them. The check_postgres script (https://bucardo.org/wiki/Check_postgres) has this monitoring available as an option. Or if you can't use it, you can at least look at its source code and get the queries it uses and incorporate it into your monitoring solution.

And while this won't help you now, 9.5 has this "do nothing" upsert you're trying to do here built in

http://www.depesz.com/2015/05/10/waiting-for-9-5-add-support-for-insert-on-conflict-do-nothingupdate/

For now, though, the function is the only reliable way to do what you're doing.

 
 
You'll also have to make sure log_min_duration_statement is set to low enough to log the queries you're looking for. Beware setting disabling it or setting it too low, though, if you have a very high number of queries being run. This can quickly overload your IO and disk space.

This probably isn't feasible, because we run a lot of queries with most of them being very short duration.

Just want to clarify, I meant to say you don't want to set it to 0 since that would log everything. Disabling it by setting it to -1 would cause it to not log statements based on their runtime at all, which is the default.

 
If you get the database running again, you can also look at the pg_stat_statements contrib module to log query counts and parameterized versions of all queries being run.

http://www.postgresql.org/docs/8.4/static/pgstatstatements.html

We got it up and we will be turning this module on to help diagnose issues like this in the future.
 
And I can't let this email go without say you REALLY need to look at upgrading your database. 8.4 is no longer being supported and receives no security or bug fixes.

We are looking into upgrading to RHEL 7 which comes with Postgres 9.2 but that's a SLOW process.

Recommend going straight to 9.4, especially with a "SLOW" major upgrading process. ;) Or, if the major upgrade is still really far off in reality, might want to just wait for 9.5 and to get the built in UPSERT. It's currently planned for an October release, but there's no guarantee there.

There are official repos for 9.4 on RHEL 7, and there will be ones for 9.5 as well when it hits.

http://www.postgresql.org/download/linux/redhat/
http://yum.postgresql.org/repopackages.php

Re: database must be vacuumed with transactions

От
Dave Johansen
Дата:
On Thu, Jun 25, 2015 at 6:39 PM, Keith <keith@keithf4.com> wrote:
On Thu, Jun 25, 2015 at 8:57 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 2:23 PM, Keith <keith@keithf4.com> wrote:
On Thu, Jun 25, 2015 at 2:42 PM, Dave Johansen <davejohansen@gmail.com> wrote:
On Thu, Jun 25, 2015 at 11:38 AM, John Scalia <jayknowsunix@gmail.com> wrote:
The pg_stat_activity table will show you which processes are doing what, and you'll be able to see their process ID from the O/S. That might help you.

Yes, but I need to see which process ID is doing a TON of small transactions which is causing the XID to increment at an unexpectedly high rate and so my question is "if there's a way for me to get the XID for every connection/query?".

Look at the log_line_prefix options. To get everything you're looking for to try and identify where this is coming from look at using at a minimum: %p, %r, %m, %x, %u, %d

http://www.postgresql.org/docs/8.4/static/runtime-config-logging.html#GUC-LOG-LINE-PREFIX

We found the source of the issue. It's a function that we're using to all a table to be populated from multiple processes. You can see the details here:
http://www.postgresql.org/message-id/CAAcYxUf9o3NpiKaHdeShO_iJjm92314eX6ko0JvJidmksutkJg@mail.gmail.com

Any ideas on how we could handle this with causing it to increment the XID?

Not sure on how to avoid the xid increment on this. I doubt this can be avoided, but hopefully someone else can respond to your other email with a more definitive answer.

I think it would be nice to make Postgres not increment the XID in this situation, if it hasn't been fixed already,
 
What I would recommend instead in the mean time to help deal with the XID issue is some close monitoring of your current txid vs the autovacuum_max_freeze_age. And get some sort of externally scheduled job to more aggressively vacuum the database based on that. And if you have any static, unwritten tables, you definitely want to VACUUM FREEZE these to avoid the txid building on them. The check_postgres script (https://bucardo.org/wiki/Check_postgres) has this monitoring available as an option. Or if you can't use it, you can at least look at its source code and get the queries it uses and incorporate it into your monitoring solution.

About a week before the problematic process started, we did a change in our schema that affected >75% of the rows in our database. So there was also a LARGE chunk of data that hadn't been frozen but that needed to and that just made the problem worse. From this we've learned, that doing appropriate VACUUM after a large import/re-write of data can help things out. We've also added monitoring as you've recommended to be sure and prevent this sort of issue.
 
And while this won't help you now, 9.5 has this "do nothing" upsert you're trying to do here built in

http://www.depesz.com/2015/05/10/waiting-for-9-5-add-support-for-insert-on-conflict-do-nothingupdate/

For now, though, the function is the only reliable way to do what you're doing.

Yes, sadly it looks like we're stuck with this solution for now.
 
You'll also have to make sure log_min_duration_statement is set to low enough to log the queries you're looking for. Beware setting disabling it or setting it too low, though, if you have a very high number of queries being run. This can quickly overload your IO and disk space.

This probably isn't feasible, because we run a lot of queries with most of them being very short duration.

Just want to clarify, I meant to say you don't want to set it to 0 since that would log everything. Disabling it by setting it to -1 would cause it to not log statements based on their runtime at all, which is the default.

We'll probably turn on logging when the query takes more than some sizeable fraction of a second just to monitor the health of the database and query usage in general.
 
 If you get the database running again, you can also look at the pg_stat_statements contrib module to log query counts and parameterized versions of all queries being run.

http://www.postgresql.org/docs/8.4/static/pgstatstatements.html

We got it up and we will be turning this module on to help diagnose issues like this in the future.
 
And I can't let this email go without say you REALLY need to look at upgrading your database. 8.4 is no longer being supported and receives no security or bug fixes.

We are looking into upgrading to RHEL 7 which comes with Postgres 9.2 but that's a SLOW process.

Recommend going straight to 9.4, especially with a "SLOW" major upgrading process. ;) Or, if the major upgrade is still really far off in reality, might want to just wait for 9.5 and to get the built in UPSERT. It's currently planned for an October release, but there's no guarantee there.

There are official repos for 9.4 on RHEL 7, and there will be ones for 9.5 as well when it hits.

http://www.postgresql.org/download/linux/redhat/
http://yum.postgresql.org/repopackages.php

The problem is just life cycle. Postgres is a lot better than most software out there and has a 5 year life cycle, but we honestly rely on RHEL's 10 year cycle a LOT. I've been pushing very hard to upgrade to RHEL 7 and we're barely getting to the point of being able to begin testing now that it's over 1 year after its release. So honestly, using the newer versions of Postgres provided on postgresql.org just isn't feasible because by the time we get everything approved and tested, there are only a few years of use left in the support life cycle and that makes it a deal breaker for us.