Обсуждение: Segfault during queries
Hi all,
I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.
However, during this process postgresql segfaults, usually on COPY commands that have this shape:
COPY
(
SELECT row_to_json(t)
FROM (
SELECT lead_tag."objectId" AS "lead_id",
lead_tag."tag" AS "tag_id"
FROM (
SELECT "objectId",
jsonb_array_elements_text("tags") AS tag
FROM "Lead"
WHERE true
AND "Lead"."organization" = 'I6JDWAaZx5'
AND tags IS NOT NULL
AND "Lead"."updatedAt" >= '2015-01-01'
AND "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdout
Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.
What I see in the error logs is:
2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT 2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault
Any help would be greatly appreciated, thanks!
-Tyler
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault hasturned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to thequery being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any havebeen recently added, any could be removed temporarily, and so on. * It's only 2 or 3 cases in all that time, but still, it was never core PG. -- Scott Ribe scott_ribe@elevated-dev.com https://www.linkedin.com/in/scottribe/ > On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote: > > Hi all, > > I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes withmany small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram. > > However, during this process postgresql segfaults, usually on COPY commands that have this shape: > > COPY > ( > SELECT row_to_json(t) > FROM ( > SELECT lead_tag."objectId" AS "lead_id", > lead_tag."tag" AS "tag_id" > FROM ( > SELECT "objectId", > jsonb_array_elements_text("tags") AS tag > FROM "Lead" > WHERE true > AND "Lead"."organization" = 'I6JDWAaZx5' > AND tags IS NOT NULL > AND "Lead"."updatedAt" >= '2015-01-01' > AND "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TOstdout > > Is there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m runningPostgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration. > > What I see in the error logs is: > > 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 0 > 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… thequery above...) TO STDOUT > 2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation fault > > > Any help would be greatly appreciated, thanks! > > -Tyler
Tyler Brock <tyler.brock@gmail.com> writes: > Is there anything I can do to prevent this or anything i can look at to try > and diagnose what is happening here? I’m running Postgres 12.7 and cannot > tell if this is a symptom of a bug or just misconfiguration. > What I see in the error logs is: > 2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: > cache lookup failed for type 0 2021-10-29 11:08:10 > UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT > ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT 2021-10-29 11:08:10 > UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: > Segmentation fault That's most certainly a bug. Can you get a stack trace from the crash? https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane
Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.
-Tyler
-Tyler
On Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:
Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.
* It's only 2 or 3 cases in all that time, but still, it was never core PG.
--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:Hi all,I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.However, during this process postgresql segfaults, usually on COPY commands that have this shape:COPY(SELECT row_to_json(t)FROM (SELECT lead_tag."objectId" AS "lead_id",lead_tag."tag" AS "tag_id"FROM (SELECT "objectId",jsonb_array_elements_text("tags") AS tagFROM "Lead"WHERE trueAND "Lead"."organization" = 'I6JDWAaZx5'AND tags IS NOT NULLAND "Lead"."updatedAt" >= '2015-01-01'AND "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdoutIs there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.What I see in the error logs is:2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 02021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation faultAny help would be greatly appreciated, thanks!-Tyler
It doesn’t seem like upgrading plv8 did the trick unfortunately. I’m going to try and get those stack traces but it may be difficult/impossible because we are running postgres via AWS RDS and don’t have control of the underlying VM.
-Tyler
-Tyler
On Oct 29, 2021 at 11:08:57 AM, Tyler Brock <tyler.brock@gmail.com> wrote:
Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.
-TylerOn Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.
* It's only 2 or 3 cases in all that time, but still, it was never core PG.
--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:Hi all,I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.However, during this process postgresql segfaults, usually on COPY commands that have this shape:COPY(SELECT row_to_json(t)FROM (SELECT lead_tag."objectId" AS "lead_id",lead_tag."tag" AS "tag_id"FROM (SELECT "objectId",jsonb_array_elements_text("tags") AS tagFROM "Lead"WHERE trueAND "Lead"."organization" = 'I6JDWAaZx5'AND tags IS NOT NULLAND "Lead"."updatedAt" >= '2015-01-01'AND "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdoutIs there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.What I see in the error logs is:2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 02021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation faultAny help would be greatly appreciated, thanks!-Tyler
Hey all, so i solved it but I’m not sure how…
-Tyler
Some combination of upgrading plv8, upgrading from postgres 12.7 to 12.8 and upgrading psycopg minor version did the trick and we are no longer seeing segfaults when we run these queries.
AWS was not helpful in assisting with core dumps + stack traces (as you could imagine).
Their suggestion was to set verbosity to “verbose”… would that have yielded any useful info for you all or would debugging this have been impossible on RDS?
I greatly appreciate all the help.
-Tyler
On Oct 29, 2021 at 1:36:12 PM, Tyler Brock <tyler.brock@gmail.com> wrote:
It doesn’t seem like upgrading plv8 did the trick unfortunately. I’m going to try and get those stack traces but it may be difficult/impossible because we are running postgres via AWS RDS and don’t have control of the underlying VM.
-TylerOn Oct 29, 2021 at 11:08:57 AM, Tyler Brock <tyler.brock@gmail.com> wrote:Interesting! Thank you, I am seeing that we have an outdated version of plv8 that does not match what we should be running for pg12.x, let me try dropping that extension and recreating it with a newer version.
-TylerOn Oct 29, 2021 at 10:27:45 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote:Of course I cannot say if this is your case, but I've been using PG since 2004 and every single instance* of a segfault has turned out to be caused by some 3rd-party extension or other. And yes, the extension was never related in any way to the query being run, it just corrupted memory such that PG would die later. So I would audit extensions, and see if any have been recently added, any could be removed temporarily, and so on.
* It's only 2 or 3 cases in all that time, but still, it was never core PG.
--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/On Oct 29, 2021, at 8:05 AM, Tyler Brock <tyler.brock@gmail.com> wrote:Hi all,I keep getting segfaults when running queries against postgresql replicas. We have an ETL job that hits these nodes with many small queries (no more than 8 concurrently) and the box has 16 cores and plenty of I/O and ram.However, during this process postgresql segfaults, usually on COPY commands that have this shape:COPY(SELECT row_to_json(t)FROM (SELECT lead_tag."objectId" AS "lead_id",lead_tag."tag" AS "tag_id"FROM (SELECT "objectId",jsonb_array_elements_text("tags") AS tagFROM "Lead"WHERE trueAND "Lead"."organization" = 'I6JDWAaZx5'AND tags IS NOT NULLAND "Lead"."updatedAt" >= '2015-01-01'AND "Lead"."updatedAt" < '2021-10-30T11:05:40.389773+00:00' ) AS lead_tag) t) TO stdoutIs there anything I can do to prevent this or anything i can look at to try and diagnose what is happening here? I’m running Postgres 12.7 and cannot tell if this is a symptom of a bug or just misconfiguration.What I see in the error logs is:2021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:ERROR: cache lookup failed for type 02021-10-29 11:08:10 UTC:172.23.17.171(54332):postgres@postgres:[22200]:STATEMENT: COPY (SELECT ROW_TO_JSON(t) FROM (… the query above...) TO STDOUT2021-10-29 11:08:10 UTC::@:[19406]:LOG: server process (PID 22200) was terminated by signal 11: Segmentation faultAny help would be greatly appreciated, thanks!-Tyler