Discussion: A costing analysis tool
I'm looking at trying to fix some clear flaws in costing which cause some of our real-world queries to choose sub-optimal plans under PostgreSQL. It's clear that there needs to be a tool to analyze the accuracy of costing for a variety of queries, both to direct any efforts to fix problems and to test for possible costing regressions. As far as I can tell, no such tool currently exists. If I've missed something, please let me know, even if it's ad hoc or incomplete.

Note that I'm talking about a tool strictly to check the accuracy of the estimated costs of plans chosen by the planner, nothing else.

I'm at the rough planning stage, and would appreciate any feedback on my thoughts before I start actual development. Considerations, in no particular order (but numbered for easy reference):

(1) Most of the time, on a well-configured database, PostgreSQL chooses a plan which performs very well. Many test cases need to cover these normal cases, both to set a baseline and to catch any regression.

(2) A large database must be created for these tests, since many issues don't show up in small tables. The same data must be generated in every database, so results are comparable and reproducible.

(3) Developers should be able to easily add test cases, either for their own use or contributed to the community.

(4) The same test query should be able to easily run in different permutations of the following:

(a) With no data cached before the run, or as fully cached as possible.
(b) With various enable_xxx settings on or off.
(c) With or without significant dead space in the tables/indexes.

(5) The tool needs to be able to run in a variety of OS environments. At a minimum, this means some way to pick up configuration information to specify how to start and stop the back end, and how to flush the system cache.

(6) The relative costs of various plans shift dramatically when C asserts are enabled. To avoid misleading results, the tool should warn the user when run on a build configured with --enable-cassert, and all results from such an environment should be conspicuously identified as such.

(7) I envision a process to create a test database, populate it, run a series of test cases with EXPLAIN ANALYZE, capture the results, parse the results and store them in a database, analyze the results to find means and standard deviations both overall and for each type of plan, and report summaries and outliers -- with references to the test cases. The primary statistic of interest is actual time divided by cost. This seems like it would be of interest overall, and within the permutations mentioned above for a single query. (A sketch of this analysis step follows below.)

A reasonable set of test cases would illuminate where costing adjustments would provide the most benefit for the least risk. By reining in the most extreme outliers, we could allow the planner to recognize the best plan among the options it is considering with greater accuracy. This tool could be used as a regression test to ensure that a costing adjustment didn't distort the cost of something which had been accurately costed.

So, what do you think?

-Kevin
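A minimal sketch of the analysis step described in item (7), assuming captured rows of (test case, node type, estimated cost, actual milliseconds). The table layout, names, and the two-standard-deviation outlier threshold are illustrative assumptions, not part of the proposal; the first two sample rows use Merge Join numbers from output later in this thread, and the third is invented.

------------------------------------------------------------
# Hypothetical sketch of the analysis in (7): group captured
# (test_case, node_type, estimated_cost, actual_ms) rows by node
# type, compute the actual-time/cost ratio statistics, and flag
# samples more than two standard deviations from the mean.
import math

def ratio_stats(rows):
    by_type = {}
    for test_case, node_type, cost, actual_ms in rows:
        by_type.setdefault(node_type, []).append((test_case, actual_ms / cost))
    for node_type, samples in by_type.items():
        ratios = [r for (_, r) in samples]
        mean = sum(ratios) / len(ratios)
        var = sum((r - mean) ** 2 for r in ratios) / len(ratios)
        sd = math.sqrt(var)
        print("%-20s mean=%.6f stddev=%.6f n=%d"
              % (node_type, mean, sd, len(ratios)))
        for test_case, r in samples:
            if sd > 0 and abs(r - mean) > 2 * sd:
                print("  outlier: %s ratio=%.6f" % (test_case, r))

ratio_stats([
    ("q1-cached",    "Merge Join", 176218.09,  4500.785),
    ("q1-noncached", "Merge Join", 176218.09, 32071.776),
    ("q2-cached",    "Merge Join", 150000.00,  3900.000),   # invented row
])
------------------------------------------------------------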
Have you looked at the TODO list to see our previous ideas on tuning diagnostics?

---------------------------------------------------------------------------

Kevin Grittner wrote:
> I'm looking at trying to fix some clear flaws in costing which cause
> some of our real-world queries to choose sub-optimal plans under
> PostgreSQL. It's clear that there needs to be a tool to analyze the
> accuracy of costing for a variety of queries, both to direct any
> efforts to fix problems and to test for possible costing regressions.
> [...]
> So, what do you think?
>
> -Kevin

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Note that I'm talking about a tool strictly to check the accuracy of > the estimated costs of plans chosen by the planner, nothing else. We could definitely do with some infrastructure for testing this. I concur with Bruce's suggestion that you should comb the archives for previous discussions --- but if you can work on it, great! > (2) A large database must be created for these tests, since many > issues don't show up in small tables. The same data must be generated > in every database, so results are comparable and reproducable. Reproducibility is way harder than it might seem at first glance. What's worse, the obvious techniques for creating reproducible numbers amount to eliminating variables that are important in the real world. (One of which is size of database --- some people care about performance of DBs that fit comfortably in RAM...) Realistically, the planner is never going to have complete information. We need to design planning models that generally get the right answer, but are not so complicated that they are (a) impossible to maintain or (b) take huge amounts of time to compute. (We're already getting some flak on the time the planner takes.) So there is plenty of need for engineering compromise here. Still, you can't engineer without raw data, so I'm all for creating a tool that lets us gather real-world cost data. The only concrete suggestion I have at the moment is to not design the tool directly around "measure the ratio of real time to cost". That's only meaningful if the planner's cost model is already basically correct and you are just in need of correcting the cost multipliers. What we need for the near term is ways of quantifying cases where the cost models are just completely out of line with reality. regards, tom lane
Yes I have looked at the TODO list. There is arguably a relationship to:

* Have EXPLAIN ANALYZE highlight poor optimizer estimates

* Log queries where the optimizer row estimates were dramatically different from the number of rows actually found?

Neither of these, however, provides a systematic way to identify problem areas in costing. Nor do they provide systematic regression testing when costing is modified.

I was largely motivated to think in the direction of starting with the tool I describe by this post:

http://archives.postgresql.org/pgsql-hackers/2005-10/msg00434.php

Also, Tom Lane mentioned the need for test cases and doubt about whether a particular fix would help or hurt overall. For example:

http://archives.postgresql.org/pgsql-hackers/2005-10/msg00417.php

The tool I propose would be "non-invasive" -- it would be a client to the back end, to help guide and check the actual back end enhancements.

This all started because in some of our real-life queries the optimizer is looking at a reasonable set of available plans, and picking one which runs several times slower than one of the alternatives. The problem is clearly that the cost numbers don't approximate reality closely enough. I'm not convinced that the proposed adjustment is a good idea -- it might cause other queries which run fine now to shift to a suboptimal plan, and it might not go far enough toward solving the problem case. The best solution might be somewhat more sophisticated. I suspect that consideration of effective cache size and the expected iteration count might be necessary to get consistently good cost estimates without breaking anything else. Nobody wants me to try something like that without a good way to do regression testing. At least, that's the impression I've gotten. And really, it's hard to pin down where the problem really lies without a tool like this. Personally, I suspect that part of the problem is an underestimation of the cost of the sort or the mergejoin.

I had read through the TODO list several times, and in response to your post searched it again for key words like: tune, tuning, diagnostic, cost, estimate, and plan. I haven't been able to spot anything that seems to address the area covered by the proposed tool. Is there something I'm overlooking?

My client is willing to pay for my time to address the issue which is causing them a problem, and share that work with the PostgreSQL community. I don't think I'd get the same response regarding something which is not a demonstrated problem for them. I'm certainly not looking to get adversarial with anyone, or to bypass any part of the process. I am continually impressed by the quality of PostgreSQL, and even more impressed by the people posting to these lists, and the assistance they provide to the community. My client and I both hope to give something back as it meshes with our needs and falls within the capabilities of our staff.

If this idea survives the conceptual discussions, I'll suggest a TODO item (if nobody beats me to it), so that it's "on the record" -- that seems only reasonable, to prevent duplicate efforts.

Thanks for your response, and any further pointers you can provide.

-Kevin

>>> Bruce Momjian <pgman@candle.pha.pa.us> 10/12/05 8:27 PM >>>
Have you looked at the TODO list to see our previous ideas on tuning
diagnostics?
Good points, Tom. (I wish my client's email software supported quoting so that I could post replies closer to your points. Sorry 'bout that.)

I tried searching the archives, though, and the words I could think to search with generated so many hits that it seemed more or less like a sequential search of the archives, which is daunting. If you have any particular references, suggestions for search strings I might have missed, or even a time range when you think it was discussed, I'll gladly go looking again. I'm not out to reinvent the wheel, lever, or any other basic idea.

To cover the "database fits in RAM" situation, we could load some data, run test cases twice, using only the info from the second run, and never flush. Then we could load more data and get on to the cases where not everything is cached. I don't think we can get huge -- these tests have to run in a reasonable amount of time, but I hope we can load enough to get the major scaling effects covered. (A sketch of a driver for these cached/non-cached runs follows below.)

So far my wildest dreams have not gone beyond a few simple math operations to get to a cost estimate. Only testing will tell, but I don't think it will be significant compared to the other things going on in the planner. (Especially if I can compensate by talking you into letting me drop that ceil function on the basis that without it we're getting the statistical average of the possible actual costs.) It's even possible that more accurate costing of the current alternatives will reduce the need for other, more expensive, optimizer enhancements. (That glass is half FULL, I SWEAR it!)

How do you establish that a cost estimate is completely out of line with reality except by comparing its runtime/estimate ratio with others? Unless you're saying not to look at just the summary level, in which case I totally agree -- any one subplan which has an unusual ratio in either direction needs to be examined. If you're getting at something else, please elaborate -- I don't want to miss anything.

Thanks for your response.

-Kevin

>>> Tom Lane <tgl@sss.pgh.pa.us> 10/13/05 12:01 AM >>>
We could definitely do with some infrastructure for testing this. I
concur with Bruce's suggestion that you should comb the archives for
previous discussions --- but if you can work on it, great!
[...]
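Since the reply above mentions running each test twice and using only the second, fully cached run, here is a rough sketch of what such a driver could look like. Everything here is an assumption for illustration: the pg_ctl data directory, the database name, and especially the Linux-specific drop_caches flush command (root-only, and kernel-dependent); the proposal itself calls for making these commands configurable per platform.

------------------------------------------------------------
# Hypothetical driver for the cached/non-cached permutations in
# (4)(a).  The commands below are placeholders to be read from a
# per-platform configuration file, not fixed choices.
import os, time

CONFIG = {
    'stop_cmd':  'pg_ctl stop -D /usr/local/pgsql/data -m fast',
    'start_cmd': 'pg_ctl start -D /usr/local/pgsql/data -w',
    'flush_cmd': 'sync && echo 3 > /proc/sys/vm/drop_caches',  # needs root
}

def run_query(sql):
    # placeholder: the real tool would use a database driver and
    # capture the EXPLAIN ANALYZE output instead of shelling out
    t0 = time.time()
    os.system('psql -d costtest -c "%s" > /dev/null' % sql)
    return time.time() - t0

def timed_run(sql, cached):
    if cached:
        run_query(sql)             # priming run; discard its timing
        return run_query(sql)      # measure the fully-cached run
    os.system(CONFIG['stop_cmd'])  # cold-cache run: stop, flush, restart
    os.system(CONFIG['flush_cmd'])
    os.system(CONFIG['start_cmd'])
    return run_query(sql)
------------------------------------------------------------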
Ouch! I just remembered locale and character sets and encoding. I can't even begin to get my head around what to do with those, unless it is just to make the tool agnostic regarding those issues and test against a variety of setups. Does that seem adequate?

I flash back to my first attempts to use LIKE 'XXX%' under PostgreSQL...

-Kevin

>>> Tom Lane <tgl@sss.pgh.pa.us> 10/13/05 12:01 AM >>>
Reproducibility is way harder than it might seem at first glance.
What's worse, the obvious techniques for creating reproducible numbers
amount to eliminating variables that are important in the real world.
(One of which is size of database --- some people care about
performance of DBs that fit comfortably in RAM...)
Kevin,

> I'm looking at trying to fix some clear flaws in costing which cause
> some of our real-world queries to choose sub-optimal plans under
> PostgreSQL. It's clear that there needs to be a tool to analyze the
> accuracy of costing for a variety of queries, both to direct any
> efforts to fix problems and to test for possible costing regressions.
> As far as I can tell, no such tool currently exists. If I've missed
> something, please let me know, even if it's ad hoc or incomplete.

Actually, this is pretty much exactly what I've been thinking about for the last year. I'm very happy that someone else is interested in working on it.

> (2) A large database must be created for these tests, since many
> issues don't show up in small tables. The same data must be generated
> in every database, so results are comparable and reproducible.
>
> (3) Developers should be able to easily add test cases, either for
> their own use or contributed to the community.

Sure. However, I think it's important to separate the test cases from the cost collection tool. Our *best* test cases will be real production applications. For synthetic test cases, we can look to improving DBT-OSDL, Jan-TPCW, OSDBB and eDB's test (if they ever publish it). The only thing that mess of tests is lacking is easy setup and portability.

> (7) I envision a process to create a test database, populate it, run a
> series of test cases with EXPLAIN ANALYZE, capture the results, parse
> the results and store them in a database, analyze the results to find
> means and standard deviations both overall and for each type of plan,
> and report summaries and outliers -- with references to the test cases.
> The primary statistic of interest is actual time divided by cost. This
> seems like it would be of interest overall, and within the permutations
> mentioned above for a single query.

I would actually like to do this differently. I think an asynchronous logging mechanism is more useful, because there are cost estimation problems which don't show up except under conditions of concurrency and heavy server load. For this reason, it's very important that this kind of cost collection could be performed on a production application. What that would mean is some process whereby the system could sample, say, 5% of the queries being run (at random) and run EXPLAIN ANALYZEs against them, logging the results in a way that could be tabularized.

Speaking of which, I think you're missing an important first step: tabular output for EXPLAIN ANALYZE. A whole host of query testing tools could be developed if it were easy to shove EA results into a format where statistics could be run on them. Without it, it's pretty hard to do the rest of the testing.

> So, what do you think?

How much time do you have to spend on this? I'd like to offer you the TestPerf project on pgfoundry (www.pgfoundry.org/projects/testperf) as a container for your work on this idea. I also have access to a variety of test machines for performance tests.

--Josh
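A client-side sketch of the 5% sampling idea described above, assuming any DB-API style cursor (for example from a Python PostgreSQL driver). A real collector would presumably live server-side; the SELECT-only guard is there because EXPLAIN ANALYZE actually executes the statement, so sampling a write would apply it twice.

------------------------------------------------------------
# Hypothetical sketch: wrap query execution so a random 5% of
# SELECTs are also run under EXPLAIN ANALYZE, with the plan text
# appended to a log for later tabulation.
import random

SAMPLE_RATE = 0.05

def execute_with_sampling(cursor, sql, logfile):
    if sql.lstrip().upper().startswith('SELECT') and \
            random.random() < SAMPLE_RATE:
        cursor.execute('EXPLAIN ANALYZE ' + sql)
        plan = '\n'.join(row[0] for row in cursor.fetchall())
        logfile.write(plan + '\n---\n')
    # the application's real run; note the sampled statement is
    # executed twice, which a server-side collector would avoid
    cursor.execute(sql)
------------------------------------------------------------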
Thanks, Josh, for the feedback.

It sounds as though you are more focused on picking up costing problems which happen during production -- which is clearly valuable, but addresses a somewhat different set of needs than I was looking at. That said, it seems like there is potential to share significant code between the two techniques. We'll have to see if we can work that out.

I didn't want to broach the subject of the programming language for this at the early conceptual stages, but if we're talking about code sharing, it can't wait too long, so I'll jump in with it now. I was considering using python to program the tool I was discussing. If python is used, I don't care whether there is any change to EXPLAIN ANALYZE -- it only takes a few lines of code to pull out what I need in the current form. My concern is whether python is supported on all of the target platforms. Python does well running queries directly against PostgreSQL, and is fine for shelling out to run commands (such as those needed to stop the back end, flush cache, and start the back end again). I think I will be significantly more productive at this in python than if I used C or perl, but if it's not accessible to the PostgreSQL community as a whole, I'll cope. Comments, anyone?

Perhaps the place we can share code is starting at the point where EXPLAIN ANALYZE results have been inserted into a database. The analysis and reporting from that point might be something which could be common code.

I'm not yet familiar with DBT-OSDL, Jan-TPCW, OSDBB and eDB, but I'll look them up -- that's exactly the sort of suggestion I was hoping to get, so that I don't need to start from scratch in generating the test data. Anyone want to point out something else I should consider?

I need to have somewhere for the work to live, and I quite frankly would just as soon dodge the overhead of setting up and maintaining something, so if no one has objections or other suggestions, I'm inclined to take you up on your offer to use your testperf project. Does anyone think some other location would be more appropriate?

How much time is a question I'll have to discuss with my client after the concept has been firmed up and I work out a design from which I can estimate. My off-the-cuff guess is that it will require, and I can get approval for, about three FTE weeks. Mixed in with other things which require my attention, that's probably spread over two to three calendar months. If we run into critical optimization problems, this could get a boost in priority, which would shorten the timeline. It's also possible I might have to set it aside to work on some issue which comes out of the blue -- I never know for sure, so I don't want anyone to count on this for anything with a hard deliverable date until we actually have the working tool.

If we get into much more detail, I assume we should take this off-list.

-Kevin

>>> Josh Berkus <josh@agliodbs.com> 10/13/05 12:25 PM >>>
Actually, this is pretty much exactly what I've been thinking about for
the last year. I'm very happy that someone else is interested in
working on it.
[...]
On Thu, Oct 13, 2005 at 01:52:10PM -0500, Kevin Grittner wrote:
> Thanks, Josh, for the feedback.
>
> It sounds as though you are more focused on picking up costing
> problems which happen during production -- which is clearly
> valuable, but addresses a somewhat different set of needs than
> I was looking at. That said, it seems like there is potential to share
> significant code between the two techniques. We'll have to see if
> we can work that out.

Firstly, I really hope you get further with this than I did a while ago when I attempted. It's certainly a worthy goal.

Secondly, while checking for problems in production systems is good, it's not going to help with fixing the cost model. For that you need raw data. My basic plan was to set up tables of different sizes and attempt to run queries such as:

- Index Scan on each table with different types of keys and coverage.
- Seq Scan
- Nested loop, etc...

I did reach the point where I was wishing I could just give PostgreSQL the plan and tell it to execute it. :)

The point of the exercise is to be able to derive correlations so you could, from the plan, calculate the actual costs. For example, run a nested loop with an inner index scan once, twice, three times etc so we can actually *see* what the cache effects are. I got stuck on working out how to force the optimiser to produce the plan I want. I didn't try too hard though. The enable_xxx options should be enough, hopefully.

Of course you want to run it with different numbers of shared buffers to see how they affect the results. And then you ideally want the results for several different machines, different disk subsystems, memory types, etc and placed on a nice web page so other people can run correlations on the data themselves. This is essentially what you already came up with. Note that for these purposes the actual estimates by PostgreSQL are irrelevant.

However, I strongly suggest finding a way of collating the results publicly from lots of people, because digging for correlations is something lots of people can hammer on and is really hard to program.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
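A sketch of forcing different join strategies with the enable_xxx settings mentioned above. The particular settings listed and the use of SET LOCAL inside a transaction (so each setting reverts automatically at transaction end) are assumptions for illustration; the cursor is any DB-API cursor.

------------------------------------------------------------
# Hypothetical sketch: run the same query under each enable_xxx
# permutation so the actual times of the different plan shapes
# can be compared.  SET LOCAL lasts only until the transaction
# ends, so every run starts from the default settings.
PERMUTATIONS = [
    ('default',      None),
    ('no mergejoin', 'SET LOCAL enable_mergejoin = off'),
    ('no nestloop',  'SET LOCAL enable_nestloop = off'),
    ('no hashjoin',  'SET LOCAL enable_hashjoin = off'),
]

def explain_each_way(cursor, sql):
    for label, setting in PERMUTATIONS:
        cursor.execute('BEGIN')
        if setting:
            cursor.execute(setting)
        cursor.execute('EXPLAIN ANALYZE ' + sql)
        for row in cursor.fetchall():
            print('%-13s | %s' % (label, row[0]))
        cursor.execute('ROLLBACK')   # discard the SET LOCAL
------------------------------------------------------------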
Thanks for the well wishes.

It sounds like you were addressing a slightly different problem -- more ambitious than what I propose to tackle as a first step. If I understand you, you were trying to develop your own predictive costing formulas based on plans. I'm merely talking about a tool to evaluate the relative accuracy of the predictions generated by PostgreSQL. So for the purposes of the proposed tool, the PostgreSQL estimates are central. The ultimate goal is to be able to spot where the current calculations are least accurate, so that any adjustments can be made where they are most needed.

You will notice that my working assumptions start with the observation that most of the time, PostgreSQL does very well. I have certainly found that to be the case, compared to a commercial product running the same queries against the same data. I'm operating on the assumption that relatively minor adjustments to current techniques can take off what rough edges there are.

That said, there's certainly overlap between your effort and what I'm going to be developing. Do you have anything from your work which might save me some time?

Your point regarding a convenient way for people to submit results from diverse environments, with a nice web presentation of collated results, is well taken. I'm not sure my "off-the-cuff" estimate from an earlier post allows enough time for that -- I'll have to be sure to include something at least adequate and expandable in the design. I probably won't try for anything too fancy. I'm at my best on frameworky internal sorts of things. There's a chance that I may be able to talk my client into putting a web app guy on this for a few days to make it pretty. You never know.

Would it make more sense for my client to host something like that on their servers, or would it be more appropriate to have something which would install on a postgresql.org server? If the latter, what sorts of technologies are supported? (My client is partial to cocoon.)

-Kevin

>>> Martijn van Oosterhout <kleptog@svana.org> 10/13/05 2:41 PM >>>
Firstly, I really hope you get further with this than I did a while ago
when I attempted. It's certainly a worthy goal.
[...]
On Thu, Oct 13, 2005 at 05:39:32PM -0500, Kevin Grittner wrote:
> That said, there's certainly overlap between your effort and
> what I'm going to be developing. Do you have anything from
> your work which might save me some time?

Not really. I got stuck in the query design phase. I didn't even generate any tables :(

> There's a chance that I may be able to talk my client into
> putting a web app guy on this for a few days to make it pretty.

You misunderstand. I don't think a pretty website is seriously needed. Just a consistent output format with the results in a tarball, complete with information about the system itself, that people can upload. Then anyone can download a few and combine them as they want. Just an ftp server, maybe gforge or whatever.

As for the use of cost estimates, I think it's the same problem from different angles. Comparing the estimates with actual time is determining where the current system is wrong. Comparing actual times with each other might tell you what the model should be. As has been pointed out, the raw data is what's needed. From there you can draw conclusions.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
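A sketch of the "results tarball" idea just described: bundle the raw result files together with enough system information for later correlation. The file names and the particular metadata captured are assumptions, and os.uname() is Unix-only.

------------------------------------------------------------
# Hypothetical sketch: package result files plus basic system
# info into a tarball that people can upload and others can
# download and correlate.
import os, tarfile, time

def bundle_results(result_files, outname='costing-results.tar.gz'):
    # record enough about the machine for later correlation
    sysinfo = open('sysinfo.txt', 'w')
    sysinfo.write('date: %s\n' % time.ctime())
    sysinfo.write('uname: %s\n' % ' '.join(os.uname()))
    sysinfo.write('pg_version: %s\n'
                  % os.popen('psql --version').read().strip())
    sysinfo.close()

    tar = tarfile.open(outname, 'w:gz')
    tar.add('sysinfo.txt')
    for f in result_files:
        tar.add(f)
    tar.close()
    return outname
------------------------------------------------------------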
Kevin,

> It sounds as though you are more focused on picking up costing
> problems which happen during production -- which is clearly
> valuable, but addresses a somewhat different set of needs than
> I was looking at. That said, it seems like there is potential to share
> significant code between the two techniques. We'll have to see if
> we can work that out.

Hmmmm. I think we're not communicating yet. I wanted to get across two points:

1) The "cost accuracy statistics collector" should be a *separate* tool from the performance test. This will allow the collector to be used with a variety of different test frameworks, increasing its use and our general knowledge of costing under different systems. It will also allow us to use pre-built tests, which will save time.

2) The collector should be designed in such a way as to allow collection of data from production databases, for two reasons:

a) There are factors which affect cost, like concurrency, system load and swapping, that tend not to occur on test systems. Ignoring these factors will make our cost model fragile and no improvement over the current code.

b) Far more (like 10x) people in the community would be willing to run a relatively non-intrusive tool against their production system than would be willing to set up a full-blown performance test harness.

The more information about the greater variety of systems and application designs we collect, the more accurate our cost model is. Therefore we want a tool which can be used by as many people as possible, which means that production systems need to be an option.

> I didn't want to broach the subject of the programming language
> for this at the early conceptual stages, but if we're talking about
> code sharing, it can't wait too long, so I'll jump in with it now. I was
> considering using python to program the tool I was discussing.

I see nothing wrong with using Python.

> If
> python is used, I don't care whether there is any change to EXPLAIN
> ANALYZE -- it only takes a few lines of code to pull out what I need
> in the current form.

Hmmm ... I think you're collecting less data than I would consider necessary for full analysis of complex queries. Could you give an example?

> My concern is whether python is supported on
> all of the target platforms.

I think python support is broad enough that using it is not an inhibitor. We're not talking Haskell, after all.

> I think I will be significantly more
> productive at this in python than if I used C or perl, but if it's not
> accessible to the PostgreSQL community as a whole, I'll cope.
> Comments, anyone?

There are actually lots of Python people in the community, and Python is easy to learn/read if you are experienced with C, Perl, or Ruby. So I have no objections.

> I need to have somewhere for the work to live, and I quite frankly
> would just as soon dodge the overhead of setting up and maintaining
> something, so if no one has objections or other suggestions, I'm
> inclined to take you up on your offer to use your testperf project.
> Does anyone think some other location would be more appropriate?

More appropriate than pgFoundry? I can't imagine. You'll need to register as a user on pgFoundry, and send me your user name.

> If we get into much more detail, I assume we should take this
> off-list.

Well, once you get going we'll use the testperf-general list, which doesn't get much traffic these days. pgFoundry also supports bug tracking, task management, and document sharing, you should check it out.
--Josh

--
Josh Berkus
Aglio Database Solutions
San Francisco
I think I get your point now. If I understand it, you could accomplish what you want under my rough ("exists only in my head so far") design by creating your own test cases and putting together a script to run just those. I would be extremely leery of comparing tests against a database under load with tests run against an otherwise idle system -- you would expect the time per cost unit to increase -- but as long as the load remained relatively constant, you could get useful information from comparing the various tests from that system under that load. Are we tracking now?

I propose capturing only three values from the output of explain analyze, and saving it with many columns of context information. Below is a sample script to capture the three columns and display them with the derived ratio and two very limited bits of context. Don't worry about the lack of proper context just yet, we're talking about what I propose to capture from the output files.

Here's the script:

------------------------------------------------------------
# simplistic example of parsing through PostgreSQL explain analyze output
import sys

def printHuman(filename, lineNum, operation, cost, actualTime):
    print "%-60s %3s %-20s %10.2f %12.3f %f" \
        % (filename, lineNum, operation, float(cost),
           float(actualTime), float(actualTime) / float(cost))

for inputfilename in sys.argv[1:]:
    inputfile = open(inputfilename, 'r')
    inputlines = inputfile.readlines()
    inputfile.close()
    for indx in range(len(inputlines)):
        line = inputlines[indx].lstrip()
        if line[:2] == '->':
            # node name: everything up to the '(' opening the cost info
            opName = line[3:line.find('(')].strip()
            # strip trailing " on <table>" / " using <index>" from the name
            tailstripwords = (' on ', ' using ')
            for word in tailstripwords:
                onpos = opName.find(word)
                if onpos > -1:
                    opName = opName[:onpos]
            # estimated total cost: the number after '..' in cost=a..b
            dotspos = line.find('..')
            cost = line[dotspos + 2: line.find(' ', dotspos)]
            # actual total time: the number after '..' in actual time=a..b
            nextdotspos = line.find('..', line.find('actual time'))
            actTime = line[nextdotspos + 2: line.find(' ', nextdotspos)]
            printHuman(inputfilename, indx, opName, cost, actTime)
------------------------------------------------------------

and here's the output based on the plans shown in an earlier thread ( http://archives.postgresql.org/pgsql-hackers/2005-10/msg00303.php ):

------------------------------------------------------------
[kgrittn@WCS45792 kgrittn]$ python parse_plan.py *.eplain.analyze
milw-cal-daterange-bare-cached.eplain.analyze                  1 Merge Join        176218.09     4500.785 0.025541
milw-cal-daterange-bare-cached.eplain.analyze                  3 Sort               48975.72      812.160 0.016583
milw-cal-daterange-bare-cached.eplain.analyze                  5 Bitmap Heap Scan   47081.47      427.463 0.009079
milw-cal-daterange-bare-cached.eplain.analyze                  7 Bitmap Index Scan    297.07       19.089 0.064258
milw-cal-daterange-bare-cached.eplain.analyze                  9 Sort              126945.94     3124.966 0.024617
milw-cal-daterange-bare-cached.eplain.analyze                 11 Bitmap Heap Scan  119071.84     1195.302 0.010038
milw-cal-daterange-bare-cached.eplain.analyze                 13 Bitmap Index Scan   1037.84       54.917 0.052915
milw-cal-daterange-bare-noncached.eplain.analyze               1 Merge Join        176218.09    32071.776 0.182000
milw-cal-daterange-bare-noncached.eplain.analyze               3 Sort               48975.72     5954.706 0.121585
milw-cal-daterange-bare-noncached.eplain.analyze               5 Bitmap Heap Scan   47081.47     5560.895 0.118112
milw-cal-daterange-bare-noncached.eplain.analyze               7 Bitmap Index Scan    297.07       52.675 0.177315
milw-cal-daterange-bare-noncached.eplain.analyze               9 Sort              126945.94    25551.291 0.201277
milw-cal-daterange-bare-noncached.eplain.analyze              11 Bitmap Heap Scan  119071.84    23588.122 0.198100
milw-cal-daterange-bare-noncached.eplain.analyze              13 Bitmap Index Scan   1037.84      146.412 0.141074
milw-cal-daterange-nomergejoin-cached.eplain.analyze           1 Nested Loop       298740.94     1464.720 0.004903
milw-cal-daterange-nomergejoin-cached.eplain.analyze           2 Index Scan         49419.45      412.268 0.008342
milw-cal-daterange-nomergejoin-cached.eplain.analyze           4 Index Scan             9.93        0.019 0.001913
milw-cal-daterange-nomergejoin-noncached.eplain.analyze        1 Nested Loop       298740.94    14277.124 0.047791
milw-cal-daterange-nomergejoin-noncached.eplain.analyze        2 Index Scan         49419.45    11510.723 0.232919
milw-cal-daterange-nomergejoin-noncached.eplain.analyze        4 Index Scan             9.93        0.062 0.006244
------------------------------------------------------------

Besides the additional context info, I expect to be storing the log of the ratio, since it seems to make more sense to average and look for outliers based on that than the raw ratio. (A short illustration of why follows below.) Is there anything else you think should be captured? (Keep in mind that I'm trying to keep the scope of this very focused, so it can actually get done, but I'm not averse to capturing something else that's just sitting there if it's useful to someone.)

-Kevin

>>> Josh Berkus <josh@agliodbs.com> 10/14/05 11:20 AM >>>
Hmmmm. I think we're not communicating yet. I wanted to get across two
points:

1) The "cost accuracy statistics collector" should be a *separate* tool
from the performance test.
[...]
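A short illustration of the log-of-the-ratio point, using four of the time-per-cost-unit ratios from the sample output above: averaging log(actual/cost) treats a ratio twice the center and one half the center as equally distant, which the arithmetic mean of raw ratios does not.

------------------------------------------------------------
# Why log(actual/cost) averages better than the raw ratio.
# The sample ratios are the Merge Join and Nested Loop rows
# (cached and non-cached) from the output above.
import math

ratios = [0.025541, 0.182000, 0.004903, 0.047791]
logs = [math.log(r) for r in ratios]
mean_log = sum(logs) / len(logs)
print('geometric mean ratio: %.6f' % math.exp(mean_log))
for r in ratios:
    # distance from the center in doublings, signed by direction
    print('ratio %.6f is %+.2f log2 units from center'
          % (r, (math.log(r) - mean_log) / math.log(2)))
------------------------------------------------------------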
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > I propose capturing only three values from the output of explain > analyze, and saving it with many columns of context information. You really have to capture the rowcounts (est and actual) too. Otherwise you can't tell if it's a costing problem or a statistics problem. More generally, I think that depending entirely on EXPLAIN ANALYZE numbers is a bad idea, because the overhead of EXPLAIN ANALYZE is both significant and variable depending on the plan structure. The numbers that I think we must capture are the top-level EXPLAIN cost and the actual runtime of the query (*without* EXPLAIN). Those are the things we would like to get to track closely. EXPLAIN ANALYZE is incredibly valuable as context for such numbers, but it's not the thing we actually wish to optimize. > Besides the additional context info, I expect to be storing the log > of the ratio, since it seems to make more sense to average and > look for outliers based on that than the raw ratio. Why would you store anything but raw data? Easily-derivable numbers should be computed while querying the database, not kept in it. regards, tom lane
I have to keep a very narrow focus on this, or it is likely that nothing will come of it. The particular area which is my target here is the accuracy of the cost values on the subplans considered by the optimizer. As previously stated, we're getting hurt by cases where the optimizer looks at two plans and picks the slower one because the cost numbers don't correlate well to actual run time, even with accurate statistics. With the query I've been using as an example, the numbers are skewed against the faster plan whether nothing is cached or everything is cached. Without reasonably accurate costing, all the work the optimizer does to come up with logically equivalent paths to the same result doesn't wind up helping as much as it should. You could have some very clever technique which was faster, but ignored (or slower, yet chosen).

Of course, if running with EXPLAIN ANALYZE significantly distorts the run time, the whole effort is doomed at the outset. Can you quantify the distortion you mention? Do you know what sorts of conditions are most affected? At a minimum, it sounds like the tool should have the capability to repeat the tests with and without EXPLAIN ANALYZE, and use the ratio of the two times as a reliability factor. (A sketch of this check follows below.) Unfortunately, that means doubling the number of cache flushes, which is likely to be the most time-consuming part of running the tests. On the bright side, we would capture the top level runtimes you want.

You make a good point about the expected versus actual rows. The ratio I've been looking at should perhaps be multiplied by (actual rowcount / estimated rowcount). And with that information in the database, there would be more options for massaging it for different perspectives.

I was speaking loosely when I said that I would store the derived data -- I would probably start with a view to provide the heavily used derived values, and only materialize them in the base tables if necessary for performance. I would not want the application to calculate the heavily used derived values every time, due to the bloat in development time and maintenance effort to repeat the expression everywhere.

I had a thought over lunch regarding something else we might want to capture from the EXPLAIN ANALYZE output -- we could reference the enclosing plan level in bill-of-materials fashion. I don't see an immediate use case for it, but it is sitting there, pretty much free for the taking, and seems potentially useful.

-Kevin

>>> Tom Lane <tgl@sss.pgh.pa.us> 10/14/05 1:37 PM >>>
You really have to capture the rowcounts (est and actual) too.
Otherwise you can't tell if it's a costing problem or a statistics
problem.
[...]
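A sketch of the reliability factor just proposed: time the query bare, then under EXPLAIN ANALYZE, and record how much the instrumentation inflated the runtime. The cursor is any DB-API cursor; note that both runs here hit a warm cache, so the cold-cache case would still need the flush procedure sketched earlier.

------------------------------------------------------------
# Hypothetical sketch: measure how much EXPLAIN ANALYZE's
# instrumentation inflates a query's runtime.
import time

def instrumentation_factor(cursor, sql):
    t0 = time.time()
    cursor.execute(sql)
    cursor.fetchall()
    bare = time.time() - t0

    t0 = time.time()
    cursor.execute('EXPLAIN ANALYZE ' + sql)
    plan = cursor.fetchall()
    instrumented = time.time() - t0

    # > 1.0 means EXPLAIN ANALYZE slowed the query down; large
    # values mean the per-node timings deserve suspicion
    return instrumented / bare, plan
------------------------------------------------------------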
Dang. Obviously, that's inverted. Also, I'd need to factor in the setup time. Bother. Are you sure we can't just make sure the test scripts operate against tables with accurate statistics?

>>> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> 10/14/05 3:34 PM >>>
The ratio I've been looking at should perhaps be multiplied by
(actual rowcount / estimated rowcount).
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Dang. Obviously, that's inverted. Also, I'd need to > factor in the setup time. Bother. Are you sure we can't > just make sure the test scripts operate against tables > with accurate statistics? Well, the point of my comment was that you should record the rowcounts so that you know whether or not a particular test case was skewed by inaccurate stats. Whether you choose to disregard those cases or try to include them in the analysis is up to you --- but assuming that every number submitted to you is unaffected by bad stats is just asking for trouble. regards, tom lane
On Fri, Oct 14, 2005 at 03:34:43PM -0500, Kevin Grittner wrote:
> Of course, if running with EXPLAIN ANALYZE significantly
> distorts the run time, the whole effort is doomed at the outset.
> Can you quantify the distortion you mention?

To do the calculations for EXPLAIN ANALYZE, PostgreSQL will call gettimeofday() once (or possibly twice) for every iteration of every node in the execution plan. This is usually (but not always) a kernel call, so if there are a lot of rows being processed compared with the amount of other calculation happening, the results are distorted.

This is unfortunate because EXPLAIN ANALYZE is an immensely useful tool, as far as it goes. I've pondered if some kind of userspace timing mechanism could be introduced (possibly using builtin CPU cycle counters) to reduce the cost. It does, however, remain a cost.

Given that you can see how many times gettimeofday() was called, you may be able to correct the error. I haven't tried that though.

Hope this helps,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
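A sketch of the correction suggested above: calibrate the per-call cost of the timer, then subtract it from a node's reported actual time based on how many timer calls the instrumentation made. The "two calls per row per loop" figure is an assumption taken from the once-or-possibly-twice-per-iteration description above, not a measured constant, and time.time() stands in for gettimeofday().

------------------------------------------------------------
# Hypothetical sketch: subtract estimated timer overhead from a
# node's EXPLAIN ANALYZE actual time.
import time

def timer_overhead(samples=100000):
    # calibrate the average cost of one timer call, in seconds
    t0 = time.time()
    for _ in range(samples):
        time.time()             # stand-in for gettimeofday()
    return (time.time() - t0) / samples

def corrected_ms(actual_ms, rows, loops, per_call=None):
    if per_call is None:
        per_call = timer_overhead()
    calls = 2 * rows * loops    # assumed: start + stop per row, per loop
    return actual_ms - calls * per_call * 1000.0
------------------------------------------------------------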
Tom Lane <tgl@sss.pgh.pa.us> writes:
> More generally, I think that depending entirely on EXPLAIN ANALYZE
> numbers is a bad idea, because the overhead of EXPLAIN ANALYZE is both
> significant and variable depending on the plan structure. The numbers
> that I think we must capture are the top-level EXPLAIN cost and the
> actual runtime of the query (*without* EXPLAIN).

If you really want to do this I think it pays to go down to the raw data that EXPLAIN is really calculating: the resulting time is a sum of the time spent in various operations. So x milliseconds were spent in sequential block fetches, y milliseconds were spent in random access block fetches, z milliseconds were spent in index lookups, etc.

Of course we can't really instrument every operation and record the actual time spent in all of these things, only the total. My point is that it's a linear equation, the sum of the average times for these various operations with factors representing their counts.

The cost that the optimizer calculates is similar. It's a linear equation, where it predicts x' sequential block reads, y' random access block reads, and z' index lookups, etc. The resulting cost is just the sum of these things.

If the optimizer didn't collapse the cost for each node into a single value and instead retained the individual parameters at each node, it could bubble those values all the way up to the surface, then use the configuration options like random_page_cost etc to calculate the resulting cost once. But then you could gather the raw data where each line is a linear equation and the actual calculated time spent in the query is the sum. The resulting matrix of linear equations could then be thrown at some linear solver to tell you what your system's "real" random_page_cost, index_tuple_cost, etc are.

Of course the fact that these values aren't really constant across queries, and also that the parameters being entered are just estimates to begin with, would mean there wouldn't be a perfect solution, but I bet there's research into how to come up with good compromise solutions. Hell, I bet there are even good algorithms for telling you which parameters are sketchy and are probably being poorly estimated or don't represent real-world constant operations well.

Doing all this would mean a lot more far-reaching changes than just looking at EXPLAIN output of course :/

-- greg
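A toy version of the linear-solver idea above, in pure Python: each query contributes one equation (operation counts times per-unit costs equals observed time), and an overdetermined set of them is fitted by least squares via the normal equations. The operation categories, counts, and times are all invented; they are constructed so the fit recovers per-unit costs of 1.0, 4.0 and 0.01 exactly.

------------------------------------------------------------
# Hypothetical sketch: fit "real" per-operation costs from a
# matrix of (operation counts, observed time) equations.
def lstsq(A, b):
    rows, n = len(A), len(A[0])
    # normal equations: (A^T A) x = A^T b
    M = [[sum(A[k][i] * A[k][j] for k in range(rows)) for j in range(n)]
         for i in range(n)]
    v = [sum(A[k][i] * b[k] for k in range(rows)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = col
        for r in range(col + 1, n):
            if abs(M[r][col]) > abs(M[piv][col]):
                piv = r
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = v[i]
        for j in range(i + 1, n):
            s -= M[i][j] * x[j]
        x[i] = s / M[i][i]
    return x

# columns: seq page reads, random page reads, tuples processed
A = [[1000.0,   10.0, 50000.0],
     [ 200.0,  500.0, 20000.0],
     [5000.0,    0.0, 80000.0],
     [  50.0, 2000.0, 10000.0]]
b = [1540.0, 2400.0, 5800.0, 8150.0]   # observed ms per query
print(lstsq(A, b))   # -> [1.0, 4.0, 0.01], the fitted per-unit costs
------------------------------------------------------------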
Martijn van Oosterhout <kleptog@svana.org> writes:
> This is unfortunate because EXPLAIN ANALYZE is an immensely useful tool,
> as far as it goes. I've pondered if some kind of userspace timing
> mechanism could be introduced (possibly using builtin CPU cycle
> counters) to reduce the cost. It does, however, remain a cost.

I wonder if there's a good case for a version of explain analyze that runs the query and outputs the plan along with row counts but not timing for each row. You would still be able to see if the estimates are correct. And it would have basically no overhead, so you could use it under a production environment.

> Given that you can see how many times gettimeofday() was called, you may
> be able to correct the error. I haven't tried that though.

I tried, it seems like it should be trivial but I got bogged down in details. Removing profiling overhead is pretty standard for profilers to do though. It really seems to me like it ought to be done. It still wouldn't let you use EXPLAIN ANALYZE under production without a lot of overhead though.

-- greg
Greg Stark <gsstark@mit.edu> writes:
> If the optimizer didn't collapse the cost for each node into a single value
> and instead retained the individual parameters at each node, it could bubble
> those values all the way up to the surface, then use the configuration options
> like random_page_cost etc to calculate the resulting cost once.

Hardly --- how will you choose the best subplans if you don't calculate their costs? It might be possible to remember where the costs came from, but I'm unconvinced that there's much gold to be mined that way.

I'm also a bit suspicious of the "it's all a linear equation" premise, because the fact of the matter is that the cost estimates are already nonlinear, and are likely to get more so rather than less so as we learn more. A case in point is that the reason nestloop costing sucks so badly at the moment is that it fails to account for cache effects in repeated scans ... which is definitely a nonlinear effect.

regards, tom lane
On Sat, Oct 15, 2005 at 05:53:45PM -0400, Greg Stark wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>
> > This is unfortunate because EXPLAIN ANALYZE is an immensly useful tool,
> > as far as it goes. I've pondered if some kind of userspace timing
> > mechanism could be introduced (possibly using builtin CPU cycle
> > counters) to reduce the cost. It does, however, remain a cost.
>
> I wonder if there's a good case for a version of explain analyze that runs the
> query and outputs the plan along with row counts but not timing for each row.
> You would still be able to see if the estimates are correct. And it would have
> basically no overhead so you could use it under a production environment.

That's an interesting thought. Just counting tuples and loops. Wouldn't
be too hard to implement.

BTW, for those people thinking of using hardware CPU cycle counters, it
won't work. Think multiple CPUs, systems with variable CPU clock speeds,
and CPUs idling when there's nothing to do. There is no useful
relationship between a CPU cycle counter and wall-clock time over the
sort of time intervals we're interested in.

Interestingly, I notice the Windows port of PostgreSQL uses the
QueryPerformanceCounter() function. I tried playing with it under Linux
and found that Linux suspends the CPU while waiting for things to
happen. So:

sleep(1)                 ~  20 million cycles
busy loop for 1 second   ~ 800 million cycles (CPU speed)

So, what's good for battery and power usage is bad for accurate timings.
Basically, on Linux it would seriously underestimate the time for
blocking system calls on an otherwise idle system. So it works for
Windows because they don't do this...

> > Given that you can see how many times gettimeday() was called, you may
> > be able to correct the error. I havn't tried that though.
>
> I tried, it seems like it should be trivial but I got bogged down in details.
> Removing profiling overhead is pretty standard for profilers to do though. It
> really seems to me like it ought to be done. It still wouldn't let you use
> EXPLAIN ANALYZE under production without a lot of overhead though.

It occurred to me as I wrote it to try to build it into the system. But
I have no idea what kind of issues you might have run into. Any
references?

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
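The idle-clock effect is easy to demonstrate without touching the TSC at
all. A rough analogue in Python: process CPU time, like an
idle-suspended cycle counter, stops advancing while the process blocks,
so a sleep and a busy loop of the same wall-clock length diverge
(numbers will vary by machine):

import time

def measure(label, work):
    # Compare wall-clock time against process CPU time for a workload.
    wall0, cpu0 = time.perf_counter(), time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"{label}: wall={wall:.3f}s cpu={cpu:.3f}s")

def busy_one_second():
    deadline = time.perf_counter() + 1.0
    while time.perf_counter() < deadline:
        pass

measure("sleep(1) ", lambda: time.sleep(1))  # cpu stays near zero
measure("busy loop", busy_one_second)        # cpu tracks the wall clock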
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Hardly --- how will you choose the best subplans if you don't calculate
> their costs?

Uhm, good point. But as you say, not really a problem.

> I'm also a bit suspicious of the "it's all a linear equation" premise,
> because the fact of the matter is that the cost estimates are already
> nonlinear, and are likely to get more so rather than less so as we learn
> more. A case in point is that the reason nestloop costing sucks so
> badly at the moment is that it fails to account for cache effects in
> repeated scans ... which is definitely a nonlinear effect.

That only means the relationship between the estimates for the outside
of the nested loop and the estimates inside the loop isn't simple. But
the individual estimates for both nodes are still just linear equations
themselves. That is, the actual cost for each node is just the result of
a simple linear equation over all the parameters estimated at that node.

I think they *have* to be linear equations; if not, the units can't work
out properly. Every operation takes time, and the total amount of time
spent in the query is just the sum of all the time spent in those
operations. There just aren't very many operations that make much sense
on measures of time, after all.

In fact, I wonder if breaking out the individual parameters would offer
a way out of the nested-loop knot. If you know how much time the plan
inside the nested loop is estimated to spend in index lookups
specifically, you could discount that but keep the other parameters at
full value. Or something like that. It might require breaking
random_page_cost into two or three different parameters that would
normally have the same cost but aren't handled the same, like
random_heap_cost, random_leaf_cost, and random_nonleaf_cost.

-- 
greg
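Spelled out, the per-node model Greg is defending here is (my notation,
not anything from the code):

\[ t_{\mathrm{node}} = \sum_{p} n_p \, c_p \]

where n_p is the estimated count of operation p at that node and c_p is
the average time per operation. Each term must carry units of time,
which is the dimensional argument above; the nonlinearity Tom describes
then lives in how the n_p (and effective c_p) are derived across nodes,
not in the form of each node's total.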
Martijn van Oosterhout <kleptog@svana.org> writes:
> Interestingly, I notice the windows port of PostgreSQL uses the
> QueryPerformanceCounter() function. I tried playing with it under linux
> and found that Linux suspends the CPU while waiting for things to
> happen. So:
> sleep(1) ~ 20 million cycles
> busy loop for 1 second ~ 800 million cycles (CPU speed)
> So, what's good for battery and power usage is bad for accurate
> timings. Basically, on Linux it would seriously underestimate the time
> for blocking system calls on an otherwise idle system. So, it works for
> Windows because they don't do this...

Hmm ... are we *sure* they don't do that? The QueryPerformanceCounter
implementation was added just recently, and I'm not sure it's been
tested under any wide range of scenarios. Maybe we will find that it
doesn't work :-(

			regards, tom lane
On Sat, Oct 15, 2005 at 06:48:25PM -0400, Tom Lane wrote:
> > So, what's good for battery and power usage is bad for accurate
> > timings. Basically, on Linux it would seriously underestimate the time
> > for blocking system calls on an otherwise idle system. So, it works for
> > Windows because they don't do this...
>
> Hmm ... are we *sure* they don't do that? The QueryPerformanceCounter
> implementation was added just recently, and I'm not sure it's been
> tested under any wide range of scenarios. Maybe we will find that it
> doesn't work :-(

I know Windows didn't use to. Remember those old Cyrix chips that were
always warm and tended to overheat? On Linux they worked great because
they were always 10+ degrees colder than when running under Windows.
There was even a little Cyrix-specific utility to enhance the effect.

The documentation [1][2] states the risk that CPU switching may cause
issues, but doesn't mention only using it for short intervals or that
the clock may freeze. I would have thought they'd mention it if it were
an issue. While we're on the subject, I notice the code uses
GetTimerFrequency() as the baseline, whereas the documentation clearly
states you should use QueryPerformanceFrequency(), which may be zero if
there is no such frequency. Note, this doesn't actually matter unless
you're assuming it's actually measuring milliseconds. It does say the
frequency can't change, which is bizarre since I'm sure I heard about
laptops running at multiple frequencies to save power, even for Windows.

Finally, if it is using the RDTSC instruction, it won't work on anything
below Pentium class, as that was the first Intel chip to have any kind
of performance counters. No idea if anyone has tested how Windows reacts
in that case or if we care, but it's something to look out for.

Here is the program [3] (only for gcc) I used to test. On my laptop it
outputs:

Using sleep()
00000074:2C039B85 - 00000074:2E34B476 (1.004 sec) = 36772081 cycles (36.614 MHz)
Busy wait
00000074:2E36A220 - 00000074:5E00CE4A (1.000 sec) = 801778730 cycles (801.778 MHz)

The first number varies considerably depending on what I'm doing at the
time. It would be interesting to see how it works on other OSes, in
particular Windows.

Have a nice day,

[1] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/windowing/timers/timerreference/timerfunctions/queryperformancecounter.asp
[2] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winui/winui/windowsuserinterface/windowing/timers/timerreference/timerfunctions/queryperformancefrequency.asp
[3] http://svana.org/kleptog/pgsql/rdtsc.c

-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Greg,

> Or something like that. It might require breaking random_page_cost into two
> or three different parameters that would normally have the same cost but
> aren't handled the same, like random_heap_cost, random_leaf_cost, and
> random_nonleaf_cost.

Gods forbid. People don't know how to use random_page_cost as it is; how
are they going to handle having 5 different parameters?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Kevin,

> I have to keep a very narrow focus on this, or there is likely that
> nothing will come of it. The particular area which is my target
> here is the accuracy of the cost values on the subplans
> considered by the optimizer.

Sure. What the rest of us are focused on is helping you build a
generally useful tool which can be used to solve future problems and
unexpected performance issues as well. I really see us needing to
collect all of the information possible from EXPLAIN ANALYZE ... not
collecting just three bits and throwing the rest of the stuff away.

I'd use a structure more like:

query_instances (
    query            text
    run_instance     int
    estimated_cost   float
    analyze_time     float
    actual_time      float
    time_run         timestamp
    GUCs             text[]
)

query_steps (
    query_instance   FK
    step_id          SERIAL
    parent_step      FK
    node_name        text
    node_type        text FK
    cost_start       float
    cost_end         float
    est_rows         INT8
    time_start       float
    time_end         float
    actual_rows      INT8
    loops            INT8
    db_object        name
    condition_type   text FK
    condition_detail text
)

so, for example, the query step:

"  ->  Seq Scan on detail0009  (cost=0.00..20500.11 rows=26 width=1005) (actual time=453.000..5983.000 rows=53588 loops=1)"
"        Filter: ((txcontenttype ~~* '%html%'::text) AND ((vchost)::text ~~* '%www.%'::text))"

could be displayed as:

query_instance    12008
step_id           14701
parent_step       14698
node_name         Seq Scan on detail0009
node_type         Seq Scan
cost_start        0
cost_end          20500.11
est_rows          26
time_start        453.0
time_end          5983.0
actual_rows       53588
loops             1
db_object         detail0009
condition_type    Filter
condition_detail  ((txcontenttype ~~* '%html%'::text) AND ((vchost)::text ~~* '%www.%'::text))

By collecting all of this data, you make it possible to perform other
sorts of analysis on the cost estimates. For example, statistical
analysis might tell us that 3-or-more-condition filters take
significantly longer to execute than single-condition filters, which
would be important to know for the cost model. Limiting it to collecting
only 3 of the 13 bits of node data produced by EXPLAIN ANALYZE would
very much limit the usefulness of the tool and the reliability of its
statistics.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> Greg,
>
> > Or something like that. It might require breaking random_page_cost into two
> > or three different parameters that would normally have the same cost but
> > aren't handled the same, like random_heap_cost, random_leaf_cost, and
> > random_nonleaf_cost.
>
> Gods forbid. People don't know how to use random_page_cost as it is; how are
> they going to handle having 5 different parameters?

Well, I wasn't anticipating actually exposing each of these parameters
as separate GUC variables. I was anticipating having each of them carry
the same cost, but possibly be treated differently by outer plan nodes.

For instance, a nested loop iterating an inner node 100 times with those
three costs estimated at <20,15,10> could, instead of calculating an
estimated cost of <2000,1500,1000>, calculate <2000,150,10> --
discounting leaf-page accesses by 90% and nonleaf pages by 99%, but not
discounting heap accesses at all.

When the time comes to calculate a single total cost (either for the
final plan to display with EXPLAIN, or for a decision picking between
alternate plans), all three might be costed equally using a single
random_page_cost GUC.

-- 
greg
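As a toy illustration of the vector-valued costing Greg describes
(function names and discount factors invented for the example; only the
<20,15,10> numbers come from the message above):

# Per-node cost kept as separate components instead of one scalar.
# Component order: (random_heap, random_leaf, random_nonleaf).
INNER_COST = (20.0, 15.0, 10.0)

def nestloop_cost(inner, loops, leaf_discount=0.90, nonleaf_discount=0.99):
    """Scale an inner plan's cost vector for repeated execution,
    discounting index-page re-reads that are likely to be cached."""
    heap, leaf, nonleaf = inner
    return (heap * loops,
            leaf * loops * (1 - leaf_discount),
            nonleaf * loops * (1 - nonleaf_discount))

def collapse(vector, random_page_cost=1.0):
    """Collapse to a single figure only when a total is needed,
    pricing all components with one GUC, as the message suggests."""
    return sum(component * random_page_cost for component in vector)

cost = nestloop_cost(INNER_COST, loops=100)
print([round(c, 2) for c in cost])   # [2000.0, 150.0, 10.0]
print(round(collapse(cost), 2))      # 2160.0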
On Fri, Oct 14, 2005 at 05:14:43PM +0200, Martijn van Oosterhout wrote:
> On Thu, Oct 13, 2005 at 05:39:32PM -0500, Kevin Grittner wrote:
> > That said, there's certainly overlap between your effort and
> > what I'm going to be developing. Do you have anything from
> > your work which might save me some time?
>
> Not really. I got stuck in the query design phase. I didn't even
> generate any tables :(
>
> > There's a chance that I may be able to talk my client into
> > putting a web app guy on this for a few days to make it pretty.
>
> You misunderstand. I don't think a pretty website is seriously needed.
> Just a consistant output format with the results in a tarball complete
> with information about the system itself that people can upload. Then
> anyone can download a few and combine them as they want. Just an ftp
> server, maybe gforge or whatever.

Well, it would be best if inbound results ended up in a database that
people could query against. That would make it very easy for people to
look for results they're interested in. Of course that doesn't preclude
making the raw results available too.

BTW, this is something I'm also very interested in.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Sun, Oct 16, 2005 at 02:23:41PM -0700, Josh Berkus wrote:
> Greg,
>
> > Or something like that. It might require breaking random_page_cost into two
> > or three different parameters that would normally have the same cost but
> > aren't handled the same, like random_heap_cost, random_leaf_cost, and
> > random_nonleaf_cost.
>
> Gods forbid. People don't know how to use random_page_cost as it is; how are
> they going to handle having 5 different parameters?

Well, until we try it we won't know whether there's actually a need for
this or not...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Fri, Oct 14, 2005 at 02:37:37PM -0400, Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > I propose capturing only three values from the output of explain
> > analyze, and saving it with many columns of context information.
>
> You really have to capture the rowcounts (est and actual) too.
> Otherwise you can't tell if it's a costing problem or a statistics
> problem.
>
> More generally, I think that depending entirely on EXPLAIN ANALYZE
> numbers is a bad idea, because the overhead of EXPLAIN ANALYZE is both
> significant and variable depending on the plan structure. The numbers
> that I think we must capture are the top-level EXPLAIN cost and the
> actual runtime of the query (*without* EXPLAIN). Those are the things
> we would like to get to track closely. EXPLAIN ANALYZE is incredibly
> valuable as context for such numbers, but it's not the thing we actually
> wish to optimize.

The problem with that is that we lose all data on per-node costs versus
what the planner thought should happen. ISTM it would be better to run
all 3 scenarios: EXPLAIN, EXPLAIN ANALYZE, and SELECT count(*).

As for the caching issue that raises, I don't buy into the theory that
all testing should be done with nothing in the cache, because that's
entirely unrealistic in most cases. I think what makes a lot more sense
is to do two runs, clearing the cache and swapping the order of the
EXPLAIN ANALYZE and count(*) runs between the two. That would give us a
complete set of data: not only would we know how things break down at a
node-by-node level, we'd also know how caching affected things. Given
some clever data-mining, one could probably even produce cost estimates
for the overhead of EXPLAIN ANALYZE, which could be factored into
further analysis.

Of course, the one downside is that this doubles the amount of time it
takes for a test to run...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
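A minimal sketch of a driver for those three scenarios, assuming Python
with psycopg2 (the driver choice is an assumption, not part of the
proposal; the cache flush is left to the tool's configuration):

import time
import psycopg2

def run_scenarios(dsn, query):
    """One pass of the three scenarios for a single test query."""
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    cur.execute("EXPLAIN " + query)          # planner estimates only
    plan = [row[0] for row in cur.fetchall()]

    cur.execute("EXPLAIN ANALYZE " + query)  # per-node timings
    analyzed = [row[0] for row in cur.fetchall()]

    start = time.perf_counter()              # top-level runtime,
    cur.execute("SELECT count(*) FROM (" + query + ") AS t")  # no EXPLAIN
    cur.fetchall()
    runtime = time.perf_counter() - start

    conn.close()
    return plan, analyzed, runtime

# A second, cache-cleared run would swap the EXPLAIN ANALYZE and
# count(*) steps, per the two-run proposal above.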
On Sat, Oct 15, 2005 at 04:04:52PM +0200, Martijn van Oosterhout wrote:
> On Fri, Oct 14, 2005 at 03:34:43PM -0500, Kevin Grittner wrote:
> > Of course, if running with EXPLAIN ANALYZE significantly
> > distorts the run time, the whole effort is doomed at the outset.
> > Can you quantify the distortion you mention? Do you know
>
> To do the calculations for EXPLAIN ANALYZE, PostgreSQL will call
> gettimeofday() once (or possibly twice) for every iteration of every
> node in the execution plan. This is usually (but not always) a kernel
> call so if there are a lot of rows being processed compared with the
> amount of other calculations happening, the results are distorted.
>
> This is unfortunate because EXPLAIN ANALYZE is an immensly useful tool,
> as far as it goes. I've pondered if some kind of userspace timing
> mechanism could be introduced (possibly using builtin CPU cycle
> counters) to reduce the cost. It does, however, remain a cost.
>
> Given that you can see how many times gettimeday() was called, you may
> be able to correct the error. I havn't tried that though.

DTrace (http://www.opensolaris.org/os/community/dtrace/) is another
possibility, although AFAIK it's only available on OpenSolaris right
now. But I've also heard that the BSD guys are pretty stoked about it,
so it might become a standard across multiple OSes.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
On Fri, Oct 14, 2005 at 03:34:43PM -0500, Kevin Grittner wrote:
> of the two times as a reliability factor. Unfortunately, that
> means doubling the number of cache flushes, which is likely
> to be the most time-consuming part of running the tests. On
> the bright side, we would capture the top level runtimes you
> want.

Actually, if you shut down the database and run this bit of code with a
high enough number you should have a nicely cleaned cache.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* Allocate and zero argv[1] megabytes to displace the OS cache. */
    if (!calloc(atoi(argv[1]), 1024 * 1024))
    {
        printf("Error allocating memory.\n");
    }
    return 0;
}

Running that on a dual Opteron (842's, I think) gives:

decibel@fritz.1[16:35]~:10>time ./a.out 3300
3.142u 8.940s 0:40.62 29.7%     5+4302498k 0+0io 2pf+0w

That was on http://stats.distributed.net and resulted in about 100MB
being paged to disk. With 3000 it only took 20 seconds, but might not
have cleared 100% of memory.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Thanks to all who have been offering suggestions. I have been reading
them and will try to incorporate as much as possible.

I have already reworked that little brain-dead python script into
something which uses a regular expression to pick off all of the data
from each cost/timing line (including the first one), and tracks the
hierarchy. I'll put all of these into the analysis database. Due to
client operational problems I've been called in on, I haven't done much
more than that so far. I'll try to firm up a proposed schema for the
data soon, along with some idea of what a test case definition will look
like. Then I'll probably have to set it aside for two or three weeks.

I'll attach the current plan scanner for review, comment, and
improvement. Also, someone may want to look at the results from a few
queries to get ideas going on how they want to use the data.

Regarding the idea of a site where results could be posted and loaded
into a database which would be available for public access -- I agree
that would be great; however, my client is not willing to take that on.
If anyone wants to volunteer, that would be fantastic.

-Kevin

>>> "Jim C. Nasby" <jnasby@pervasive.com> >>>
[snip]
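A minimal sketch of what such a regex-based scanner might look like in
Python; the pattern and field names are guesses at what the script
captures, not Kevin's actual code:

import re

NODE_RE = re.compile(
    r'^(?P<indent>\s*)(?:->\s+)?(?P<node>\S.*?)\s+'
    r'\(cost=(?P<cost_start>[\d.]+)\.\.(?P<cost_end>[\d.]+) '
    r'rows=(?P<est_rows>\d+) width=\d+\)'
    r'(?: \(actual time=(?P<time_start>[\d.]+)\.\.(?P<time_end>[\d.]+) '
    r'rows=(?P<actual_rows>\d+) loops=(?P<loops>\d+)\))?')

def scan_plan(lines):
    """Yield (depth, fields) for each cost/timing line, using
    indentation to track the node hierarchy."""
    stack = []  # indent widths of enclosing nodes
    for line in lines:
        m = NODE_RE.match(line)
        if not m:
            continue  # Filter:, Sort Key:, etc. belong to prior node
        indent = len(m.group('indent'))
        while stack and stack[-1] >= indent:
            stack.pop()
        stack.append(indent)
        yield len(stack) - 1, m.groupdict()

plan = [
    "Seq Scan on detail0009  (cost=0.00..20500.11 rows=26 width=1005) "
    "(actual time=453.000..5983.000 rows=53588 loops=1)",
]
for depth, fields in scan_plan(plan):
    print(depth, fields['node'], fields['cost_end'], fields['time_end'])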
If we stored the actual queries and the EXPLAIN ANALYZE results (when
generated) in the database, what would be the purpose of the node_name,
db_object, and condition_detail columns? They don't seem like they would
be useful for statistical analysis, and it seems like the information
would be more useful in context. Are these columns really needed?

For a given node_type, are there multiple valid condition_type values?
If so, I need to modify my python script to capture this. If not, I
don't see a need to store it.

-Kevin

>>> Josh Berkus <josh@agliodbs.com> >>>
[snip]
Kevin,

> If we stored the actual queries and the EXPLAIN ANALYZE results (when
> generated) in the database, what would be the purpose of the node_name,
> db_object, and condition_detail columns? They don't seem like they
> would be useful for statistical analysis, and it seems like the
> information would be more useful in context. Are these columns really
> needed?

Yes. For example, the only way you're going to analyze index costing by
type of index is if the index name is stored somewhere (db_object) so
that it can be matched to its characteristics. For condition_detail,
again, we could determine that (for example) we have costing problems
when filters involve more than 2 columns or complex expressions.
Node_name is actually duplicative of some of the other columns, so I
suppose it could be dropped.

> For a given node_type, are there multiple valid condition_type values?
> If so, I need to modify my python script to capture this. If not, I
> don't see a need to store it.

I'm not sure. Even if there aren't now, there could be in the future.
I'm more focused on supporting cross-node-type conditions. For example,
"Filter" conditions can apply to a variety of node types (Index Scan,
Merge Join, Subquery Scan, Seq Scan, aggregates). If we were costing
Filters, we'd want to be able to aggregate their stats regardless of the
node in which they occurred.

I'm also really unclear on why you're so focused on storing less
information rather than more. In an "investigation" tool like this, it's
important to collect as much data as possible because we don't know
what's going to be valuable until we analyze it. You seem to be starting
out with the idea that you *already* know exactly where the problems are
located, in which case why develop a tool at all? Just fix the problem.

-- 
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
Summary of the schema I'm considering. Comments welcome. When it gets
down to the detail, it may make sense to combine or split some of these.
For example, runtime_options should probably not have a column for each
currently known option, but a child table which maps to all non-default
option values.

submitter
    Identifies who submitted test results. Name, optional info such as
    email address and organization.

runtime_environment
    Identifies the overall test environment. OS with distribution &
    version; CPU number, type, speed; RAM; background load; static
    configuration; etc. This provides context for a series of tests, to
    see how the numbers look in a given environment.

dataset_characteristics
    Identifies data metrics which may affect costing accuracy. Table
    counts, row counts, column counts, disk space used, level of
    fragmentation. Maybe some of the "standard" tests will share common
    dataset characteristics across multiple environments.

cache_state
    Identifies the level of initial caching for a test, and the degree
    to which the referenced data can be cached during execution.

runtime_options
    Identifies the runtime choices in effect for a query run. The state
    of EXPLAIN, ANALYZE, enable_xxx, and dynamic configuration settings,
    such as random_page_cost.

query
    Identifies a test query, possibly run by many people in many
    environments against various datasets with different cache states
    and runtime options.

test_result_summary
    Ties a query to details about a run, with a summary of results. Run
    time from the client perspective, rows returned.

test_result_step_detail
    Shows EXPLAIN ANALYZE information (if any) for each step.
I'm not interested in storing less information. I'm trying to make sure
that all redundant information is justified. Since I plan to store the
actual query text and the full EXPLAIN ANALYZE output, every column I
pull out and put in another table is redundant data. The questions are:
why do we have it, and is it going to be used heavily enough that it
should be stored redundantly?

If the goal is to be able to get at the number of filters, etc., perhaps
another table which digests this into a form which can be used in
summary queries is what's needed. If this is a rare type of query which
few people will run, maybe they can go to the EXPLAIN ANALYZE output,
using the line number info, and grab it from there.

-Kevin

>>> Josh Berkus <josh@agliodbs.com> >>>
I'm also really unclear on why you're so focused on storing less
information rather than more. In an "investigation" tool like this, it's
important to collect as much data as possible because we don't know
what's going to be valuable until we analyze it. You seem to be starting
out with the idea that you *already* know exactly where the problems are
located, in which case why develop a tool at all? Just fix the problem.
Kevin,

> When it gets down to the detail, it may make sense to combine
> or split some of these. For example, runtime_options should
> probably not have a column for each currently known option,
> but a child table which maps to all non-default option values.

I'm a little cautious about storing only non-defaults; the defaults have
changed from version to version, after all. If we did that, we'd need to
have a "defaults" table in the db as a reference list.

Also, we'll need to store runtime options both on the "machine" level
and on the "query" level, in order to allow testing of changing an
enable_* or other query-cost option at runtime. Not sure how to capture
this, though.

-- 
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco
On Tuesday 18 October 2005 18:05, Kevin Grittner wrote:
> Regarding the idea of a site where results could be posted
> and loaded into a database which would be available for
> public access -- I agree that would be great; however, my
> client is not willing to take that on. If anyone wants to
> volunteer, that would be fantastic.

Josh, does the setup on the foundry homepages give you enough to do
this?

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Maybe we could associate a set of defaults with runtime_environment, and
you would associate any overrides with runtime_options. Does this
address both your concerns?

>>> Josh Berkus <josh@agliodbs.com> >>>
[snip]
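A sketch of how that split could be resolved when computing a run's
effective settings; the option names and the merge rule here are my
assumptions, not part of the proposal:

# Defaults recorded once per runtime_environment; overrides recorded
# per runtime_options row.  Effective settings for a run are the merge.
ENV_DEFAULTS = {
    "random_page_cost": 4.0,
    "enable_nestloop": "on",
    "work_mem": "1MB",
}

def effective_options(env_defaults, run_overrides):
    """Per-run overrides shadow the environment-level defaults."""
    merged = dict(env_defaults)
    merged.update(run_overrides)
    return merged

run = effective_options(ENV_DEFAULTS, {"enable_nestloop": "off"})
print(run["enable_nestloop"])   # off
print(run["random_page_cost"])  # 4.0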
Robert Treat wrote:
> On Tuesday 18 October 2005 18:05, Kevin Grittner wrote:
>> Regarding the idea of a site where results could be posted
>> and loaded into a database which would be available for
>> public access -- I agree that would be great; however, my
>> client is not willing to take that on. If anyone wants to
>> volunteer, that would be fantastic.
>
> Josh, does the setup on the foundry homepages give you enough to do this?

pgFoundry currently does not provide either a database or web programs
for groups, for security reasons. All you get is the ability to make
static pages. The new machine will provide both, we hope.

cheers

andrew
This would require capture of information beyond what I was thinking
about in terms of schema. Do you think we need to capture just index
type, or something more? Do you propose that we capture the pg_*
metadata related to every object referenced in the plan, every object
related to every table in the query, or everything in the entire
database?

>>> Josh Berkus <josh@agliodbs.com> >>>
[snip] the only way you're going to analyze index costing by type of
index is if the index name is stored somewhere (db_object) so that it
can be matched to its characteristics. [snip]
Kevin,

> This would require capture of information beyond what I was thinking
> about in terms of schema. Do you think we need to capture just index
> type, or something more? Do you propose that we capture the pg_*
> metadata related to every object referenced in the plan, every object
> related to every table in the query, or everything in the entire
> database?

Not right away. However, having collected the info in the first place
would give us the option of duping a copy of the system tables so that
we could mine them for more information. Also, for our test harness, we
will have pre-programmed object information for the objects we use.

-- 
--Josh
Josh Berkus
Aglio Database Solutions
San Francisco