Обсуждение: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database

Поиск
Список
Период
Сортировка

BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15640
Logged by:          Mikhail Kiyanov
Email address:      m.kiianov@rtk-dc.ru
PostgreSQL version: 9.6.11
Operating system:   Debian GNU/Linux 9.1 (stretch)
Description:

Hi!
We are currently having a problems with our production database.
The problem is that database goes to crash and after that put into recovery
mode with the following errors. Please see log files.

2019-02-18 05:30:35 GMT [40461]: [28685-1] db=vc_production,user=estore
LOCATION:  exec_execute_message, postgres.c:2052
2019-02-18 05:30:36 GMT [45739]: [1-1] db=,user= LOG:  00000: automatic
analyze of table "template1.pg_catalog.pg_shdepend" system usage: CPU
0.00s/0.00u sec
elapsed 0.07 sec
2019-02-18 05:30:36 GMT [45739]: [2-1] db=,user= LOCATION:  do_analyze_rel,
analyze.c:665
2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL:  XX000: cannot read
pg_class without having selected a database
2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION:  ScanPgRelation,
relcache.c:314
2019-02-18 05:30:36 GMT [40267]: [7-1] db=,user= LOG:  00000: autovacuum
launcher process (PID 40378) exited with exit code 1
2019-02-18 05:30:36 GMT [40267]: [8-1] db=,user= LOCATION:  LogChildExit,
postmaster.c:3552
2019-02-18 05:30:36 GMT [40267]: [9-1] db=,user= LOG:  00000: terminating
any other active server processes
2019-02-18 05:30:36 GMT [40267]: [10-1] db=,user= LOCATION:
HandleChildCrash, postmaster.c:3294
2019-02-18 05:30:36 GMT [44893]: [5-1] db=vc_production,user=vcfront
WARNING:  57P02: terminating connection because of crash of another server
process
2019-02-18 05:30:36 GMT [44893]: [6-1] db=vc_production,user=vcfront DETAIL:
 The postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally and
possibly corrupted shared memory.

It happened several times. And everytime it happens at different time.

Some addtitional information about our server:

postgres=# select version();
                                                               version
                                                         

--------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.6.11 on x86_64-pc-linux-gnu (Debian 9.6.11-1.pgdg90+1),
compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
(1 row)

$ cat /etc/postgresql/9.6/main/postgresql.conf

shared_preload_libraries = 'pg_stat_statements, timescaledb,
pg_wait_sampling, repmgr, auto_explain'
track_io_timing = on
listen_addresses = '0.0.0.0'
port = 5432
superuser_reserved_connections = 3
unix_socket_group = ''
unix_socket_permissions = 0777
log_filename = 'postgresql-%a.log'
log_rotation_age = 1d
log_truncate_on_rotation = on
log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u '
log_lock_waits = on
log_temp_files = 0
log_connections = on
log_checkpoints = on
log_disconnections = on
log_directory = '/storage/log/postgresql'

log_error_verbosity = verbose

logging_collector = on
log_min_duration_statement = 1000
log_autovacuum_min_duration = 10
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
effective_cache_size = 200GB
shared_buffers = 32GB
fsync = on
autovacuum_max_workers = 25
bgwriter_delay = 10
bgwriter_lru_maxpages = 800
vacuum_cost_delay = 1
autovacuum_analyze_scale_factor = 0.01
data_directory = '/var/lib/postgresql/9.6/main'
hba_file = '/etc/postgresql/9.6/main/pg_hba.conf'

# master specific
wal_level = 'hot_standby'
max_wal_senders = 10
max_replication_slots = 9
wal_keep_segments = 100
max_connections = 500
work_mem = 400MB
max_wal_size = 140GB
min_wal_size = 40GB
autovacuum_work_mem = 11GB
autovacuum_vacuum_scale_factor = 0.01
autovacuum_freeze_max_age = 1000000000
autovacuum_multixact_freeze_max_age = 800000000
autovacuum_naptime = 1
maintenance_work_mem = 4096MB
random_page_cost = 1
synchronous_commit = off

# Extentions
auto_explain.log_min_duration = '5s'

vc_production=# select extname,extversion from pg_extension;
      extname       | extversion
--------------------+------------
 plpgsql            | 1.0
 pageinspect        | 1.5
 citext             | 1.3
 pg_buffercache     | 1.2
 pg_stat_statements | 1.4
 pgstattuple        | 1.4
 pg_repack          | 1.4.4
 pg_wait_sampling   | 1.1
 timescaledb        | 0.8.0
(9 rows)

Some tables have personal autovacuum settings:

vc_production=# \d+ estore.events_0
                                                            Table
"estore.events_0"
   Column   |            Type             |
Modifiers                          | Storage  | Stats target | Description

------------+-----------------------------+------------------------------------------------------------+----------+--------------+-------------
 id         | bigint                      | not null default
nextval('estore.events_id_seq'::regclass) | plain    |              |
 camera_uid | text                        | not null
                          | extended |              |
 source     | smallint                    | not null
                          | plain    |              |
 kind       | integer                     | not null
                          | plain    |              |
 begin      | timestamp without time zone | not null
                          | plain    |              |
 end        | timestamp without time zone |
                          | plain    |              |
 properties | jsonb                       | not null
                          | extended |              |
 is_active  | boolean                     | not null
                          | plain    |              |
Indexes:
    "events_0_pkey" PRIMARY KEY, btree (id), tablespace "local_01_ts"
    "events_0_camera_uid_begin_idx" btree (camera_uid, begin), tablespace
"local_01_ts"
    "events_0_camera_uid_end_idx" btree (camera_uid, "end"), tablespace
"local_01_ts"
    "events_0_camera_uid_expr_idx" btree (camera_uid, (properties ->>
'uid'::text)), tablespace "local_01_ts"
    "events_0_camera_uid_kind_source_idx" btree (camera_uid, kind, source),
tablespace "local_01_ts"
    "events_0_camera_uid_source_begin_end_idx" btree (camera_uid, source,
begin, "end"), tablespace "local_01_ts"
    "events_0_end_idx" btree ("end"), tablespace "local_01_ts"
    "events_0_src_knd_cmr_uid_prpt_idx" btree (source, kind, camera_uid,
(properties ->> 'channel_uid'::text)), tablespace "local_01_ts"
Check constraints:
    "events_0_camera_uid_check" CHECK (camera_uid >= '0'::text AND
camera_uid < '1'::text)
Inherits: estore.events
Tablespace: "estore_ts"
Options: autovacuum_vacuum_scale_factor=0.001

vc_production=# \d+ violet.alarm_events
                                                             Table
"violet.alarm_events"
   Column   |            Type             |
Modifiers                             | Storage  | Stats target |
Description

------------+-----------------------------+------------------------------------------------------------------+----------+--------------+-------------
 id         | bigint                      | not null default
nextval('violet.alarm_events_id_seq'::regclass) | plain    |
|
 camera_id  | integer                     | not null
                                | plain    |              |
 kind       | integer                     | not null
                                | plain    |              |
 begin      | timestamp without time zone | not null
                                | plain    |              |
 end        | timestamp without time zone |
                                | plain    |              |
 uid        | text                        | not null
                                | extended |              |
 properties | jsonb                       | not null default '{}'::jsonb
                                | extended |              |
 created_at | timestamp without time zone | not null default now()
                                | plain    |              |
 updated_at | timestamp without time zone | not null default now()
                                | plain    |              |
 deleted_at | timestamp without time zone |
                                | plain    |              |
Indexes:
    "alarm_events_new_pkey" PRIMARY KEY, btree (id), tablespace
"violet_ts"
    "alarm_events_new_begin_camera_id_idx" btree (begin, camera_id),
tablespace "violet_ts"
    "alarm_events_new_begin_idx" btree (begin), tablespace "violet_ts"
    "alarm_events_new_camera_id_idx" btree (camera_id), tablespace
"violet_ts"
    "alarm_events_new_camera_id_uid_idx" btree (camera_id, uid), tablespace
"violet_ts"
Foreign-key constraints:
    "alarm_events_camera_id_fkey" FOREIGN KEY (camera_id) REFERENCES
violet.cameras(id)
Tablespace: "violet_ts"
Options: autovacuum_vacuum_cost_delay=0,
autovacuum_vacuum_scale_factor=0.00001

This is a physical server with:
$ free -m
              total        used        free      shared  buff/cache
available
Mem:         515897       47349        4112       34023      464436
431171
Swap:             0           0           0

$ grep 'processor' -c /proc/cpuinfo
80

model name      : Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

The database located on ssd disks.

This database is mission critical for our business.
Would you be so kind to help us with our problem.


Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database

От
Alvaro Herrera
Дата:
On 2019-Feb-18, PG Bug reporting form wrote:


> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL:  XX000: cannot read pg_class without having selected a
database
> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION:  ScanPgRelation, relcache.c:314

Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10)
that only hits autovacuum launcher after a relcache init removal at just
the wrong time, or something like that?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database

От
Tom Lane
Дата:
PG Bug reporting form <noreply@postgresql.org> writes:
> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL:  XX000: cannot read
> pg_class without having selected a database
> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION:  ScanPgRelation,
> relcache.c:314
> 2019-02-18 05:30:36 GMT [40267]: [7-1] db=,user= LOG:  00000: autovacuum
> launcher process (PID 40378) exited with exit code 1

Hm.  So something is trying to access pg_class (perhaps only on the way to
reading something else) inside the autovacuum launcher, which it cannot
do because the launcher isn't connected to any specific database.

I have not heard of such behavior in core Postgres, which makes me think
that the problem is in one of your extensions.  You mention

> shared_preload_libraries = 'pg_stat_statements, timescaledb,
> pg_wait_sampling, repmgr, auto_explain'

I'd suggest removing entries from shared_preload_libraries one at a
time to see if the problem stops occurring.  (Obviously, you have to
restart the postmaster each time you change the list.)  Once you've
found the culprit, closer analysis would be possible.

            regards, tom lane


Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Feb-18, PG Bug reporting form wrote:
>> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL:  XX000: cannot read pg_class without having selected a
database
>> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION:  ScanPgRelation, relcache.c:314

> Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10)
> that only hits autovacuum launcher after a relcache init removal at just
> the wrong time, or something like that?

I think/hope that the launcher isn't connected to sinval, either.

            regards, tom lane


Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database

От
Andres Freund
Дата:
Hi,

On 2019-02-18 11:29:49 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Feb-18, PG Bug reporting form wrote:
> >> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL:  XX000: cannot read pg_class without having selected a
database
> >> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION:  ScanPgRelation, relcache.c:314
> 
> > Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10)
> > that only hits autovacuum launcher after a relcache init removal at just
> > the wrong time, or something like that?

I don't immediately see how that'd be the culprit - we'd processed
sinvals in some cases before that change too. I'm inclined to think
Tom's guess is the right one, but ...

My initial guess would be that one of the extensions registered a
relcache callback, and does catalog accesses in an unconditional manner.


> I think/hope that the launcher isn't connected to sinval, either.

Hm, I don't think that's true? The launcher does InitPostgres() and
everything (albeit without a database). And it needs to process
relmapper updates, otherwise it can't access pg_database. And that's
accessed via a normal heap_open().

Greetings,

Andres Freund


Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database

От
Alvaro Herrera
Дата:
On 2019-Feb-18, Andres Freund wrote:


> My initial guess would be that one of the extensions registered a
> relcache callback, and does catalog accesses in an unconditional manner.

Hmm but why is extension code running in autovacuum launcher?


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database

От
Andres Freund
Дата:
On 2019-02-18 13:44:12 -0300, Alvaro Herrera wrote:
> On 2019-Feb-18, Andres Freund wrote:
> 
> 
> > My initial guess would be that one of the extensions registered a
> > relcache callback, and does catalog accesses in an unconditional manner.
> 
> Hmm but why is extension code running in autovacuum launcher?

shared_preload_libraries + CacheRegisterRelcacheCallback().


Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Feb-18, Andres Freund wrote:
>> My initial guess would be that one of the extensions registered a
>> relcache callback, and does catalog accesses in an unconditional manner.

> Hmm but why is extension code running in autovacuum launcher?

Anything registered in shared_preload_libraries is going to be active
in every process.  I don't particularly believe Andres' theory that
CacheRegisterRelcacheCallback is involved, but it seems very likely
that one of these extensions has hooked into some code path that the
launcher can traverse, and it's doing more than is safe there.

If this weren't a production database, I'd suggest changing that error
to a PANIC so we could get a stack trace.  That would almost certainly
pinpoint the culprit.

            regards, tom lane


Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database

От
Andres Freund
Дата:
On 2019-02-18 12:32:43 -0500, Tom Lane wrote:
> I don't particularly believe Andres' theory that
> CacheRegisterRelcacheCallback is involved

It's how I'd broken postgres temporarily while converting citus from a
fork into an extension :)