Обсуждение: Fix generate_useful_gather_paths for parallel unsafe pathkeys
Over on the -bugs list we had a report [1] of a seg fault with incremental sort. The short of the investigation there was that a subplan wasn't being serialized since it wasn't parallel safe, and that attempting to initialize that subplan resulted in the seg fault. Tom pushed a change to raise an ERROR when this occurs so that at least we don't crash the server. There was some small amount of discussion about this on a thread about a somewhat similar issue [2] (where volatile expressions were the issue), but that thread resulted in a committed patch, and beyond that it seems that this issue deserves its own discussion. I've been digging further into the problem, and my first question was "why doesn't this result in a similar error but with a full sort when incremental sort is disabled?" After all, we have code to consider a gather merge + full sort on the cheapest partial path in generate_useful_gather_paths(), and the plan that I get on Luis's test case with incremental sort disabled has an full sort above a gather, which presumably it'd be cheaper to push down into the parallel plan (using gather merge instead of gather). It turns out that there's a separate bug here: I'm guessing that in the original commit of generate_useful_gather_paths() we had a copy/paste thinko in this code: /* * If the path has no ordering at all, then we can't use either * incremental sort or rely on implicit sorting with a gather * merge. */ if (subpath->pathkeys == NIL) continue; The comment claims that we should skip subpaths that aren't sorted at all since we can't possibly use a gather merge directly or with an incremental sort applied first. It's true that we can't do either of those things unless the subpath is already at least partially sorted. But subsequently we attempt to construct a gather merge path on top of a full sort (for the cheapest path), and there's no reason to limit that to partially sorted paths (indeed in the presence of incremental sort it seems unlikely that that would be the cheapest path). We still don't want to build incremental sort paths if the subpath has no pathkeys, but that will imply presorted_keys = 0, which we already check. So 0001 removes that branch entirely. And, as expected, we now get a gather merge + full sort the resulting error about subplan not being initialized. With that oddity out of the way, I started looking at the actually reported problem, and nominally issue is that we have a correlated subquery (in the final target list and the sort clause), and correlated subqueries (not sure why exactly in this case; see [3]) aren't considered parallel-safe. As to why that's happening: 1. find_em_expr_usable_for_sorting_rel doesn't exclude that expression, and arguably correctly so in the general case since one might (though we don't now) use this same kind of logic for non-parallel plans. 2. generate_useful_pathkeys_for_relation is noting that it'd be useful to sort on that expression and sees it as safe in part due to the (1). 3. We create a sort node that includes that expression in its output. That seems pretty much the same (in terms of safety given generalized input paths/plans, not in terms of what Robert brought up in [4]) as what would happen in the prepare_sort_from_pathkeys call in create_gather_merge_plan. So to fix this problem I've added the option to find_em_expr_for_rel to restrict to parallel-safe expressions in 0002. Given point 3 above (and the discussion with Robert earlier) above it seems to me that there's a bigger problem here that just hasn't happened to have been exposed yet: in particular the prepare_sort_from_pathkeys call in create_gather_merge_plan seems suspect to me. But it's also possible other usages of prepare_sort_from_pathkeys could be suspect given other seemingly unrelated changed to path generation, since nothing other than implicit knowledge about the conventions here prevents us from introducing paths generating sorts with expressions not in the target list (in fact a part of the issue here IMO is that non-var expression pathkeys are added to the target list after path generation and selection is completed). Alternatively we could "fix" this by being even more conservative in find_em_expr_usable_for_sorting_rel and disregard any expressions not currently found in the target list. But that seems unsatisfactory to me since, given we know that there are cases where prepare_sort_from_paths is explicitly intended to add pathkey expressions to the target list, we'd be excluding possible useful cases (and indeed we already have, albeit contrived, test cases that fail if that change is made). There exists a very similar path in the postgres fdw code (in fact the function name is the same: get_useful_pathkeys_for_relation) since find_em_expr_for_rel is much simpler (it doesn't do many safety checks at all like we do in find_em_expr_usable_for_sorting_rel), but I think that's safe (though I haven't thought about it a ton) because I believe those are sort keys we're going to send to the foreign server anyway, as opposed to sorting ourselves (but I haven't verified that that's the case and that we never create sort nodes there). James 1: https://www.postgresql.org/message-id/622580997.37108180.1604080457319.JavaMail.zimbra%40siscobra.com.br 2: https://www.postgresql.org/message-id/CAJGNTeNaxpXgBVcRhJX%2B2vSbq%2BF2kJqGBcvompmpvXb7pq%2BoFA%40mail.gmail.com 3: https://www.postgresql.org/message-id/CAAaqYe_vihKjc%2B8LuQa49EHW4%2BKfefb3wHqPYFnCuUqozo%2BLFg%40mail.gmail.com 4: https://www.postgresql.org/message-id/CA%2BTgmobX2079GNJWJVFjo5CmwTg%3DoQQOybFQL2CyZxMY59P6BA%40mail.gmail.com
Вложения
On 11/21/20 2:55 AM, James Coleman wrote: > Over on the -bugs list we had a report [1] of a seg fault with > incremental sort. The short of the investigation there was that a > subplan wasn't being serialized since it wasn't parallel safe, and > that attempting to initialize that subplan resulted in the seg fault. > Tom pushed a change to raise an ERROR when this occurs so that at > least we don't crash the server. > > There was some small amount of discussion about this on a thread about > a somewhat similar issue [2] (where volatile expressions were the > issue), but that thread resulted in a committed patch, and beyond that > it seems that this issue deserves its own discussion. > > I've been digging further into the problem, and my first question was > "why doesn't this result in a similar error but with a full sort when > incremental sort is disabled?" After all, we have code to consider a > gather merge + full sort on the cheapest partial path in > generate_useful_gather_paths(), and the plan that I get on Luis's test > case with incremental sort disabled has an full sort above a gather, > which presumably it'd be cheaper to push down into the parallel plan > (using gather merge instead of gather). > > It turns out that there's a separate bug here: I'm guessing that in > the original commit of generate_useful_gather_paths() we had a > copy/paste thinko in this code: > > /* > * If the path has no ordering at all, then we can't use either > * incremental sort or rely on implicit sorting with a gather > * merge. > */ > if (subpath->pathkeys == NIL) > continue; > > The comment claims that we should skip subpaths that aren't sorted at > all since we can't possibly use a gather merge directly or with an > incremental sort applied first. It's true that we can't do either of > those things unless the subpath is already at least partially sorted. > But subsequently we attempt to construct a gather merge path on top of > a full sort (for the cheapest path), and there's no reason to limit > that to partially sorted paths (indeed in the presence of incremental > sort it seems unlikely that that would be the cheapest path). > > We still don't want to build incremental sort paths if the subpath has > no pathkeys, but that will imply presorted_keys = 0, which we already > check. > > So 0001 removes that branch entirely. And, as expected, we now get a > gather merge + full sort the resulting error about subplan not being > initialized. > Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks for spotting it! The 0001 patch seems fine to me. > With that oddity out of the way, I started looking at the actually > reported problem, and nominally issue is that we have a correlated > subquery (in the final target list and the sort clause), and > correlated subqueries (not sure why exactly in this case; see [3]) > aren't considered parallel-safe. > > As to why that's happening: > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that > expression, and arguably correctly so in the general case since one > might (though we don't now) use this same kind of logic for > non-parallel plans. > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful > to sort on that expression and sees it as safe in part due to the (1). > 3. We create a sort node that includes that expression in its output. > That seems pretty much the same (in terms of safety given generalized > input paths/plans, not in terms of what Robert brought up in [4]) as > what would happen in the prepare_sort_from_pathkeys call in > create_gather_merge_plan. > > So to fix this problem I've added the option to find_em_expr_for_rel > to restrict to parallel-safe expressions in 0002. > OK, that seems like a valid fix. I wonder if we can have EC with members mixing parallel-safe and parallel-unsafe expressions? I guess no, but it's more a feeling than a clear reasoning and so I wonder what would happen in such cases. > Given point 3 above (and the discussion with Robert earlier) above it > seems to me that there's a bigger problem here that just hasn't > happened to have been exposed yet: in particular the > prepare_sort_from_pathkeys call in create_gather_merge_plan seems > suspect to me. But it's also possible other usages of > prepare_sort_from_pathkeys could be suspect given other seemingly > unrelated changed to path generation, since nothing other than > implicit knowledge about the conventions here prevents us from > introducing paths generating sorts with expressions not in the target > list (in fact a part of the issue here IMO is that non-var expression > pathkeys are added to the target list after path generation and > selection is completed). > > Alternatively we could "fix" this by being even more conservative in > find_em_expr_usable_for_sorting_rel and disregard any expressions not > currently found in the target list. But that seems unsatisfactory to > me since, given we know that there are cases where > prepare_sort_from_paths is explicitly intended to add pathkey > expressions to the target list, we'd be excluding possible useful > cases (and indeed we already have, albeit contrived, test cases that > fail if that change is made). > > There exists a very similar path in the postgres fdw code (in fact the > function name is the same: get_useful_pathkeys_for_relation) since > find_em_expr_for_rel is much simpler (it doesn't do many safety checks > at all like we do in find_em_expr_usable_for_sorting_rel), but I think > that's safe (though I haven't thought about it a ton) because I > believe those are sort keys we're going to send to the foreign server > anyway, as opposed to sorting ourselves (but I haven't verified that > that's the case and that we never create sort nodes there). > Yeah. I think the various FDW-related restrictions may easily make that safe, for the reasons you already mentioned. But also because IIRC FDWs in general are not parallel-safe, so there's no risk of running foreign scans under Gather [Merge]. Thanks for the patches, I'll take a closer look next week. The 0002 patch is clearly something we need to backpatch, not sure about 0001 as it essentially enables new behavior in some cases (Sort on unsorted paths below Gather Merge). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 11/21/20 2:55 AM, James Coleman wrote: > > Over on the -bugs list we had a report [1] of a seg fault with > > incremental sort. The short of the investigation there was that a > > subplan wasn't being serialized since it wasn't parallel safe, and > > that attempting to initialize that subplan resulted in the seg fault. > > Tom pushed a change to raise an ERROR when this occurs so that at > > least we don't crash the server. > > > > There was some small amount of discussion about this on a thread about > > a somewhat similar issue [2] (where volatile expressions were the > > issue), but that thread resulted in a committed patch, and beyond that > > it seems that this issue deserves its own discussion. > > > > I've been digging further into the problem, and my first question was > > "why doesn't this result in a similar error but with a full sort when > > incremental sort is disabled?" After all, we have code to consider a > > gather merge + full sort on the cheapest partial path in > > generate_useful_gather_paths(), and the plan that I get on Luis's test > > case with incremental sort disabled has an full sort above a gather, > > which presumably it'd be cheaper to push down into the parallel plan > > (using gather merge instead of gather). > > > > It turns out that there's a separate bug here: I'm guessing that in > > the original commit of generate_useful_gather_paths() we had a > > copy/paste thinko in this code: > > > > /* > > * If the path has no ordering at all, then we can't use either > > * incremental sort or rely on implicit sorting with a gather > > * merge. > > */ > > if (subpath->pathkeys == NIL) > > continue; > > > > The comment claims that we should skip subpaths that aren't sorted at > > all since we can't possibly use a gather merge directly or with an > > incremental sort applied first. It's true that we can't do either of > > those things unless the subpath is already at least partially sorted. > > But subsequently we attempt to construct a gather merge path on top of > > a full sort (for the cheapest path), and there's no reason to limit > > that to partially sorted paths (indeed in the presence of incremental > > sort it seems unlikely that that would be the cheapest path). > > > > We still don't want to build incremental sort paths if the subpath has > > no pathkeys, but that will imply presorted_keys = 0, which we already > > check. > > > > So 0001 removes that branch entirely. And, as expected, we now get a > > gather merge + full sort the resulting error about subplan not being > > initialized. > > > > Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks > for spotting it! The 0001 patch seems fine to me. > > > With that oddity out of the way, I started looking at the actually > > reported problem, and nominally issue is that we have a correlated > > subquery (in the final target list and the sort clause), and > > correlated subqueries (not sure why exactly in this case; see [3]) > > aren't considered parallel-safe. > > > > As to why that's happening: > > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that > > expression, and arguably correctly so in the general case since one > > might (though we don't now) use this same kind of logic for > > non-parallel plans. > > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful > > to sort on that expression and sees it as safe in part due to the (1). > > 3. We create a sort node that includes that expression in its output. > > That seems pretty much the same (in terms of safety given generalized > > input paths/plans, not in terms of what Robert brought up in [4]) as > > what would happen in the prepare_sort_from_pathkeys call in > > create_gather_merge_plan. > > > > So to fix this problem I've added the option to find_em_expr_for_rel > > to restrict to parallel-safe expressions in 0002. > > > > OK, that seems like a valid fix. I wonder if we can have EC with members > mixing parallel-safe and parallel-unsafe expressions? I guess no, but > it's more a feeling than a clear reasoning and so I wonder what would > happen in such cases. An immediately contrived case would be something like "t.x = unsafe(t.y)". As to whether that can actually result in us wanting to sort by the safe member before we can execute the unsafe member (e.g., in a filter), I'm less sure, but it seems at least plausible to me (or I don't see anything inherently preventing it). > > Given point 3 above (and the discussion with Robert earlier) above it > > seems to me that there's a bigger problem here that just hasn't > > happened to have been exposed yet: in particular the > > prepare_sort_from_pathkeys call in create_gather_merge_plan seems > > suspect to me. But it's also possible other usages of > > prepare_sort_from_pathkeys could be suspect given other seemingly > > unrelated changed to path generation, since nothing other than > > implicit knowledge about the conventions here prevents us from > > introducing paths generating sorts with expressions not in the target > > list (in fact a part of the issue here IMO is that non-var expression > > pathkeys are added to the target list after path generation and > > selection is completed). > > > > Alternatively we could "fix" this by being even more conservative in > > find_em_expr_usable_for_sorting_rel and disregard any expressions not > > currently found in the target list. But that seems unsatisfactory to > > me since, given we know that there are cases where > > prepare_sort_from_paths is explicitly intended to add pathkey > > expressions to the target list, we'd be excluding possible useful > > cases (and indeed we already have, albeit contrived, test cases that > > fail if that change is made). > > > > There exists a very similar path in the postgres fdw code (in fact the > > function name is the same: get_useful_pathkeys_for_relation) since > > find_em_expr_for_rel is much simpler (it doesn't do many safety checks > > at all like we do in find_em_expr_usable_for_sorting_rel), but I think > > that's safe (though I haven't thought about it a ton) because I > > believe those are sort keys we're going to send to the foreign server > > anyway, as opposed to sorting ourselves (but I haven't verified that > > that's the case and that we never create sort nodes there). > > > > Yeah. I think the various FDW-related restrictions may easily make that > safe, for the reasons you already mentioned. But also because IIRC FDWs > in general are not parallel-safe, so there's no risk of running foreign > scans under Gather [Merge]. There are some comments (IIRC in the parallel docs) about FDWs being able to be marked as parallel-safe, but I'm not sure if that facility is in play. Either way, it's quite a different case. > Thanks for the patches, I'll take a closer look next week. The 0002 > patch is clearly something we need to backpatch, not sure about 0001 as > it essentially enables new behavior in some cases (Sort on unsorted > paths below Gather Merge). Thanks for taking a look. I had the same question re: backporting. On the one hand it will change behavior (this is always a bit of a gray area since in one sense bugs change behavior), but IMO it's not a new feature, because the code clearly intended to have that feature in the first place (it creates gather merges on top of a full sort). So I'd lean towards considering it a bug fix, but I'm also not going to make that a hill to die on. One semi-related question: do you think we should add a comment to prepare_sort_from_pathkeys explaining that modifying it may mean find_em_expr_for_rel needs to be updated also? James
On Mon, Nov 23, 2020 at 8:35 AM James Coleman <jtc331@gmail.com> wrote: > > On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > ... > > Thanks for the patches, I'll take a closer look next week. The 0002 > > patch is clearly something we need to backpatch, not sure about 0001 as > > it essentially enables new behavior in some cases (Sort on unsorted > > paths below Gather Merge). > > Thanks for taking a look. > > I had the same question re: backporting. On the one hand it will > change behavior (this is always a bit of a gray area since in one > sense bugs change behavior), but IMO it's not a new feature, because > the code clearly intended to have that feature in the first place (it > creates gather merges on top of a full sort). So I'd lean towards > considering it a bug fix, but I'm also not going to make that a hill > to die on. > > One semi-related question: do you think we should add a comment to > prepare_sort_from_pathkeys explaining that modifying it may mean > find_em_expr_for_rel needs to be updated also? Here's an updated patch series. 0001 and 0002 as before, but with some minor cleanup. 0003 disallows SRFs in these sort expressions (per discussion over at [1]). 0004 removes the search through the target for matching volatile expressions (per discussion at [2]). 0005 adds the comment I mentioned above clarifying these two functions are linked. James 1: https://www.postgresql.org/message-id/295524.1606246314%40sss.pgh.pa.us 2: https://www.postgresql.org/message-id/124400.1606159456%40sss.pgh.pa.us
Вложения
On 11/29/20 3:26 PM, James Coleman wrote: > On Mon, Nov 23, 2020 at 8:35 AM James Coleman <jtc331@gmail.com> wrote: >> >> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> ... >>> Thanks for the patches, I'll take a closer look next week. The 0002 >>> patch is clearly something we need to backpatch, not sure about 0001 as >>> it essentially enables new behavior in some cases (Sort on unsorted >>> paths below Gather Merge). >> >> Thanks for taking a look. >> >> I had the same question re: backporting. On the one hand it will >> change behavior (this is always a bit of a gray area since in one >> sense bugs change behavior), but IMO it's not a new feature, because >> the code clearly intended to have that feature in the first place (it >> creates gather merges on top of a full sort). So I'd lean towards >> considering it a bug fix, but I'm also not going to make that a hill >> to die on. >> >> One semi-related question: do you think we should add a comment to >> prepare_sort_from_pathkeys explaining that modifying it may mean >> find_em_expr_for_rel needs to be updated also? > > Here's an updated patch series. > > 0001 and 0002 as before, but with some minor cleanup. > 0001 - seems fine 0002 - Should we rename the "parallel" parameter to something more descriptive, like "require_parallel_safe"? > 0003 disallows SRFs in these sort expressions (per discussion over at [1]). > OK, but I'd move the define from tlist.c to tlist.h, not optimizer.h. > 0004 removes the search through the target for matching volatile > expressions (per discussion at [2]). > OK > 0005 adds the comment I mentioned above clarifying these two functions > are linked. > Makes sense. I wonder if we need to be more specific about how consistent those two places need to be. Do they need to match 1:1, or how do people in the future decide which restrictions need to be in both places? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Nov 29, 2020 at 4:20 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 11/29/20 3:26 PM, James Coleman wrote: > > On Mon, Nov 23, 2020 at 8:35 AM James Coleman <jtc331@gmail.com> wrote: > >> > >> On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra > >> <tomas.vondra@enterprisedb.com> wrote: > >>> ... > >>> Thanks for the patches, I'll take a closer look next week. The 0002 > >>> patch is clearly something we need to backpatch, not sure about 0001 as > >>> it essentially enables new behavior in some cases (Sort on unsorted > >>> paths below Gather Merge). > >> > >> Thanks for taking a look. > >> > >> I had the same question re: backporting. On the one hand it will > >> change behavior (this is always a bit of a gray area since in one > >> sense bugs change behavior), but IMO it's not a new feature, because > >> the code clearly intended to have that feature in the first place (it > >> creates gather merges on top of a full sort). So I'd lean towards > >> considering it a bug fix, but I'm also not going to make that a hill > >> to die on. > >> > >> One semi-related question: do you think we should add a comment to > >> prepare_sort_from_pathkeys explaining that modifying it may mean > >> find_em_expr_for_rel needs to be updated also? > > > > Here's an updated patch series. > > > > 0001 and 0002 as before, but with some minor cleanup. > > > > 0001 - seems fine > > 0002 - Should we rename the "parallel" parameter to something more > descriptive, like "require_parallel_safe"? Done. > > 0003 disallows SRFs in these sort expressions (per discussion over at [1]). > > > > OK, but I'd move the define from tlist.c to tlist.h, not optimizer.h. IS_SRF_CALL doesn't really seem to be particularly specific conceptually to target lists (and of course now we're using it in equivclass.c), so I'd tried to put it somewhere more general. Is moving it to tlist.h mostly to minimize changes? Or do you think it fits more naturally with the tlist code (I might be missing something)? > > 0004 removes the search through the target for matching volatile > > expressions (per discussion at [2]). > > > > OK > > > 0005 adds the comment I mentioned above clarifying these two functions > > are linked. > > > > Makes sense. I wonder if we need to be more specific about how > consistent those two places need to be. Do they need to match 1:1, or > how do people in the future decide which restrictions need to be in both > places? At this point I'm not sure I'd have a good way to describe that: one is a clear superset of the other (and we already talk in the comments about the other being a subset), but it's really going to be about "what do we want to allow to be sorted proactively"...hmm, that thought made me take a swing at it; let me know what you think. James
Вложения
Hi, I reviewed the patch series, tweaked a couple comments, added commit messages etc. Barring objections, I'll push this in a couple days. One thing that annoys me is that it breaks ABI because it adds two new parameters to find_em_expr_usable_for_sorting_rel, but I don't think we can get around that. We could assume require_parallel_safe=true, but we still need to pass the PlannerInfo pointer. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
- 0001-Consider-unsorted-paths-in-generate_useful_gather-v4.patch
- 0002-Check-parallel-safety-in-generate_useful_gather_p-v4.patch
- 0003-Disallow-SRFs-when-considering-sorts-below-Gather-v4.patch
- 0004-Don-t-search-for-volatile-expr-in-find_em_expr_us-v4.patch
- 0005-Improve-find_em_expr_usable_for_sorting_rel-comme-v4.patch
Hi, I've pushed (and backpatched) all the fixes posted to this thread. I believe that covers all the incremental sort fixes, so I've marked [1] as committed. [1] https://commitfest.postgresql.org/31/2754/ regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company