Re: very long secondary->primary switch time
From: Tomas Pospisek
Subject: Re: very long secondary->primary switch time
Date:
Msg-id: 3cc70134-f869-05f7-f970-30583fa859b6@sourcepole.ch
In reply to: Re: very long secondary->primary switch time (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general
Hi Tom, hi list participants,

thanks a lot for replying, Tom.

On 27.04.21 22:16, Tom Lane wrote:
> Tomas Pospisek <tpo2@sourcepole.ch> writes:
>> I maintain a postgresql cluster that does failover via patroni. The
>> problem is that after a failover happens it takes the secondary too
>> long (that is about 35min) to come up and answer queries. The log of
>> the secondary looks like this:
>
>> 04:00:29.777 [9679] LOG: received promote request
>> 04:00:29.780 [9693] FATAL: terminating walreceiver process due to
>> administrator command
>> 04:00:29.780 [9679] LOG: invalid record length at 320/B95A1EE0:
>> wanted 24, got 0
>> 04:00:29.783 [9679] LOG: redo done at 320/B95A1EA8
>> 04:00:29.783 [9679] LOG: last completed transaction was at log time
>> 2021-03-03 03:57:46.466342+01
>
>> 04:35:00.982 [9679] LOG: selected new timeline ID: 15
>> 04:35:01.404 [9679] LOG: archive recovery complete
>> 04:35:02.337 [9662] LOG: database system is ready to accept
>> connections
>
>> The cluster is "fairly large" with thousands of DBs (sic!) and ~1TB
>> of data.
>
> Hm. WAL replay is already done at the "redo done" entry. There is a
> checkpoint after that, I believe, and there may be some effort to
> search for dead files as well. Still, if your I/O subsystem is better
> than a wet noodle, 35 minutes is a long time to finish that.
>
> One thing I'm not sure about is whether we try to do the checkpoint
> at maximum speed. If you have set GUC options to throttle checkpoint
> I/O hard, that could perhaps explain this.

I didn't do much customization of postgresql settings and am mostly
running what was provided by the upstreams. These are my explicit
settings (all the rest is defaults):

allow_system_table_mods: 'off'
autovacuum: 'off'
checkpoint_completion_target: '0.7'
effective_cache_size: 16 GB
maintenance_work_mem: 1 GB
max_connections: 200
max_wal_size: 16 GB
shared_buffers: 8 GB
shared_preload_libraries: pg_stat_statements
track_activities: 'on'
track_activity_query_size: 32 kB
track_functions: all
wal_keep_segments: 100
work_mem: 64 MB

> You could possibly learn more by strace'ing the startup process to
> see what it's doing.

Will do, thanks, however I'm dreading the next failover downtime :-(

> Also, what PG version is that exactly?

12.6-1.pgdg20.04+1

Thankful for any help or pointers regarding the long promotion time,
*t
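
P.S. For the archives: this is how I intend to double-check the
checkpoint throttling knobs Tom mentioned (a minimal sketch; the list
of GUCs is my own guess at the relevant ones, Tom didn't enumerate
them):

  psql -c "SELECT name, setting, unit
           FROM pg_settings
           WHERE name IN ('checkpoint_completion_target',
                          'checkpoint_timeout',
                          'checkpoint_flush_after',
                          'max_wal_size');"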
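
P.P.S. And the strace invocation I plan to try on the startup process
during the next promotion (untested here; <PID> stands for the startup
process id I still have to look up at that moment):

  # the startup process advertises itself in its process title
  ps ax -o pid,command | grep '[s]tartup'
  # attach with timestamps (-tt) and per-syscall durations (-T)
  strace -p <PID> -tt -T -f -o /tmp/startup.strace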