Discussion: pg_stat_statements: faster search by queryid
Hi hackers,

Aleksandra Bondar and I are proposing the following patch for pg_stat_statements.

The idea.
---------
Currently, to get statistics on a specific query, you execute

    SELECT * FROM pg_stat_statements WHERE queryid = specific_queryid;

This takes a long time because the pg_stat_statements() function first
forms tuples for all the statistics it has, and only then are they
filtered by the WHERE clause. If we provided a function like
pg_stat_statements_by_queryid(queryid bigint), which filtered by queryid
while scanning pgss_hash and returned only the statistics with the
specified queryid, it would be much faster.

We can also easily add filtering by userid and dbid, which leads to a
function like pg_stat_statements_filtered(queryid bigint, userid Oid,
dbid Oid). If a parameter is not specified, it defaults to 0, which means
no filtering on that parameter, much like pg_stat_statements_reset()
chooses which statistics to clear. With no parameters specified,
pg_stat_statements_filtered() returns everything pg_stat_statements()
would return. That led me to the idea of extending the
pg_stat_statements() function instead of adding a new one: calling
pg_stat_statements() the old way produces the same results, and
specifying the new parameters produces filtered results.

The patch.
----------
The attached patch implements the extended pg_stat_statements() function.
I can always rewrite the patch to add a new function and leave
pg_stat_statements() as it is, though, if you think a separate function
for filtering is better. So far we've only written the code and want your
opinion on it. If you like the idea, we'll also provide tests and docs.
Any suggestions are welcome.

Benchmarking.
-------------
We prepared a simple test case to show the performance improvement.
Download the attached script pg_stat_statements_prepare.sql and run the
following in psql:

    CREATE EXTENSION pg_stat_statements;

    -- Fill in pg_stat_statements statistics
    \i /path/to/pg_stat_statements_prepare.sql

    -- Get a random query ID
    SELECT queryid AS rand_queryid
    FROM pg_stat_statements
    WHERE queryid IS NOT NULL
    ORDER BY random()
    LIMIT 1 \gset

    -- Turn on time measurement
    \timing

    -- Get statistics the old way
    SELECT * FROM pg_stat_statements WHERE queryid = :rand_queryid;

    -- Get statistics the new way
    SELECT * FROM pg_stat_statements(true, queryid => :rand_queryid);

On my machine the new way is at least two times faster. I also compared
the timing of the old way on master with and without the patch; the
difference is within the standard deviation.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
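In spirit, the proposed filtering is a skip inside
pg_stat_statements_internal()'s existing scan of pgss_hash, applied
before any tuple is built. The following is a minimal sketch only,
assuming the 0-means-no-filter convention described above; variable and
field names follow pg_stat_statements.c, and the attached patch is the
authoritative implementation:

```
/*
 * Sketch: filtering inside pg_stat_statements_internal()'s scan.
 * A zero key value means "don't filter on this key", mirroring
 * pg_stat_statements_reset().  Names follow pg_stat_statements.c.
 */
HASH_SEQ_STATUS hash_seq;
pgssEntry  *entry;

hash_seq_init(&hash_seq, pgss_hash);
while ((entry = hash_seq_search(&hash_seq)) != NULL)
{
    if (queryid != 0 && entry->key.queryid != (uint64) queryid)
        continue;               /* skip before any tuple is formed */
    if (OidIsValid(userid) && entry->key.userid != userid)
        continue;
    if (OidIsValid(dbid) && entry->key.dbid != dbid)
        continue;

    /* ... form and emit the tuple for this entry as before ... */
}
```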
Hi,

Thanks for raising this.

> This takes a long time because the pg_stat_statements() function first
> forms tuples for all the statistics it has, and only then are they
> filtered by the WHERE clause.

I am curious about the specific use case where this becomes a concern.
How often are you querying pg_stat_statements for individual entries?

My initial thought is that this patch does not remove the issue of
loading the entire query text (just to return one or a few entries).
I suspect that repeatedly loading the query text into memory for lookups
could be a bigger problem. One option would be to pass showtext = false
for such lookups, but that makes the function more cumbersome to use and
understand.

I do think having the ability to look up a specific entry based on a
key (that is, hash_search instead of hash_seq_search) would be useful.
But we also need to consider how to handle query text lookups so that we
are not forced to load the entire text file into memory.

For what it is worth, I have been thinking about what it would take to
move query texts into shared memory, which could make this type of
filtering more practical.

Just my 2c.

--
Sami Imseih
Amazon Web Services (AWS)
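To illustrate the distinction being drawn between the two access
patterns, here is a sketch (not code from the patch): pgss_hash is keyed
by (userid, dbid, queryid, toplevel), so a direct probe replaces the full
scan but needs every key field filled in:

```
/*
 * Sketch: hash_seq_search() vs. hash_search().  Names follow
 * pg_stat_statements.c.
 */
HASH_SEQ_STATUS hash_seq;
pgssHashKey key;
pgssEntry  *entry;

/* Full scan: visits every entry, O(n) in the number of statements. */
hash_seq_init(&hash_seq, pgss_hash);
while ((entry = hash_seq_search(&hash_seq)) != NULL)
{
    /* inspect entry */
}

/* Keyed lookup: a single probe, but the complete key is required. */
memset(&key, 0, sizeof(key));
key.userid = userid;
key.dbid = dbid;
key.queryid = queryid;
key.toplevel = true;        /* part of the key too; see entry_reset() */
entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
```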
Thank you for your feedback.

> My initial thought is that this patch does not remove the issue of
> loading the entire query text (just to return one or a few entries).

This patch is not intended to address the issue of loading the file with
query texts. We only try to avoid forming and handling tuples that we
don't really need. Forming a tuple in pg_stat_statements_internal()
includes calling CStringGetTextDatum(), which allocates memory and
copies the query text. Avoiding that for queries we don't need already
makes a big difference.

> I do think having the ability to look up a specific entry based on a
> key (that is, hash_search instead of hash_seq_search) would be useful.

That's a great idea, thanks! I'm going to try that and include it in the
next version of the patch if I succeed.

> For what it is worth, I have been thinking about what it would take to
> move query texts into shared memory, which could make this type of
> filtering more practical.

As far as I can tell, pg_stat_statements stores query texts in an
external file to avoid problems with long query texts. Here is a quote
from the docs:

> The representative query texts are kept in an external disk file, and
> do not consume shared memory. Therefore, even very lengthy query texts
> can be stored successfully.

One idea would be to cache some of the texts (the short ones) in shared
memory. That would require adding a GUC that limits query text length,
or perhaps one that limits the total amount of memory used for caching.
Then, for an individual query, pg_stat_statements would first look up
the text in the cache and only go to disk if it's not found. I'm not
sure I want to do that, though...

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
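For context on the per-row cost being avoided, this is simplified from
the existing pg_stat_statements_internal() code, with error handling and
encoding conversion omitted: each emitted row fetches its text out of
the loaded buffer and copies it into a freshly allocated text Datum.

```
/*
 * Simplified from pg_stat_statements_internal(): per-row query text
 * handling when showtext = true.  Error handling and encoding
 * conversion are omitted.
 */
char *qstr = qtext_fetch(entry->query_offset, entry->query_len,
                         qbuffer, qbuffer_size);

if (qstr)
    values[i++] = CStringGetTextDatum(qstr);    /* palloc + copy per row */
```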
> > My initial thought is that this patch does not remove the issue of
> > loading the entire query text (just to return one or a few entries).
>
> This patch is not intended to address the issue of loading the file with
> query texts. We only try to avoid forming and handling tuples that we
> don't really need. Forming a tuple in pg_stat_statements_internal()
> includes calling CStringGetTextDatum(), which allocates memory and
> copies the query text. Avoiding that for queries we don't need already
> makes a big difference.

Yes, but my point is that if someone repeatedly looks up
pg_stat_statements with filters, they will end up loading the query
texts multiple times.

For example:

```
select * from pg_stat_statements where queryid in (10000, 20000, 30000);
```

will load the query text file only once to retrieve these 3 query IDs.

If I instead do this with the proposed patch:

```
select * from pg_stat_statements(true, queryid => 10000);
select * from pg_stat_statements(true, queryid => 20000);
select * from pg_stat_statements(true, queryid => 30000);
```

or

```
select * from pg_stat_activity a, pg_stat_statements(true, queryid => a.query_id);
```

I will have to load the query text file into memory on every invocation
of pg_stat_statements.

> > For what it is worth, I have been thinking about what it would take to
> > move query texts into shared memory, which could make this type of
> > filtering more practical.
>
> As far as I can tell, pg_stat_statements stores query texts in an
> external file to avoid problems with long query texts. Here is a quote
> from the docs:
>
> > The representative query texts are kept in an external disk file, and
> > do not consume shared memory. Therefore, even very lengthy query texts
> > can be stored successfully.

pg_stat_statements_internal() must load the file when showtext = true,
which is the default:

```
qbuffer = qtext_load_file(&qbuffer_size);
```

--
Sami Imseih
Amazon Web Services (AWS)
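For reference, here is roughly what that call costs on every invocation.
This is heavily simplified from qtext_load_file() in pg_stat_statements.c
(error handling omitted): the entire external text file is read into a
freshly allocated buffer each time.

```
/*
 * Heavily simplified from qtext_load_file() in pg_stat_statements.c:
 * each call opens the external text file and reads all of it into a
 * malloc'd buffer.  Error handling omitted.
 */
static char *
qtext_load_file(Size *buffer_size)
{
    char       *buf;
    int         fd;
    struct stat stat;

    fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY);
    fstat(fd, &stat);
    buf = (char *) malloc(stat.st_size);
    read(fd, buf, stat.st_size);            /* one full read per call */
    CloseTransientFile(fd);

    *buffer_size = stat.st_size;
    return buf;
}
```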
>> I do think having the ability to look up a specific entry based on a
>> key (that is, hash_search instead of hash_seq_search) would be useful.
>
> That's a great idea, thanks! I'm going to try that and include it in the
> next version of the patch if I succeed.

Here is the second version of the patch, as promised. I used hash_search
in the case where all three key arguments are provided, the same way it
is done in the entry_reset() function.

The diff doesn't look very pleasant. Basically, I just moved the code
that forms one tuple into a new pg_stat_statements_handle_entry()
function so it can be used in pg_stat_statements_internal().

I also had second thoughts about adding a new struct just to pass the
three key arguments as one filter to the internal function. In the v2
patch I pass the arguments as they are. I'm not sure which option is
better. In any case, it should be done the same way in both entry_reset()
and pg_stat_statements_internal(), so if you think adding a struct
pgssFilter was a good idea, I'll rewrite the patch to use it in both
pg_stat_statements_internal() and entry_reset().

On Thu, Sep 18, 2025 at 6:33 PM Sami Imseih <samimseih@gmail.com> wrote:
>
> Yes, but my point is that if someone repeatedly looks up
> pg_stat_statements with filters, they will end up loading the query
> texts multiple times.
>
> For example:
>
> ```
> select * from pg_stat_statements where queryid in (10000, 20000, 30000);
> ```
>
> will load the query text file only once to retrieve these 3 query IDs.
>
> If I instead do this with the proposed patch:
>
> ```
> select * from pg_stat_statements(true, queryid => 10000);
> select * from pg_stat_statements(true, queryid => 20000);
> select * from pg_stat_statements(true, queryid => 30000);
> ```
>
> or
>
> ```
> select * from pg_stat_activity a, pg_stat_statements(true, queryid => a.query_id);
> ```
>
> I will have to load the query text file into memory on every invocation
> of pg_stat_statements.

You are right. At some point, if information about multiple queries is
needed, a single SELECT from pg_stat_statements followed by filtering
will be more efficient than calling pg_stat_statements() with filters
multiple times. That's something that should be documented.

Best regards,
Karina Litskevich
Postgres Professional: http://postgrespro.com/
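To make the v2 shape concrete, here is a sketch of the structure
described above. The helper name pg_stat_statements_handle_entry() comes
from the mail, but its signature, the entry_matches_filter() helper, and
the surrounding control flow are illustrative guesses; refer to the
attached v2 patch for the real code.

```
/*
 * Sketch of the v2 structure; see the attached patch for the real code.
 * pg_stat_statements_handle_entry() is the helper mentioned above (its
 * signature here is a guess); entry_matches_filter() is hypothetical.
 */
if (queryid != 0 && OidIsValid(userid) && OidIsValid(dbid))
{
    /* Fast path: all three keys given, probe pgss_hash directly. */
    pgssHashKey key;
    pgssEntry  *entry;

    memset(&key, 0, sizeof(key));
    key.userid = userid;
    key.dbid = dbid;
    key.queryid = (uint64) queryid;

    /* toplevel is part of the key: probe both values, as entry_reset() does */
    for (int i = 0; i < 2; i++)
    {
        key.toplevel = (i == 1);
        entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
        if (entry)
            pg_stat_statements_handle_entry(rsinfo, entry, showtext,
                                            qbuffer, qbuffer_size);
    }
}
else
{
    /* Otherwise, fall back to the sequential scan with per-key filters. */
    HASH_SEQ_STATUS hash_seq;
    pgssEntry  *entry;

    hash_seq_init(&hash_seq, pgss_hash);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        if (entry_matches_filter(entry, queryid, userid, dbid))
            pg_stat_statements_handle_entry(rsinfo, entry, showtext,
                                            qbuffer, qbuffer_size);
    }
}
```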