Thread: [HACKERS] libpq Alternate Row Processor
The guts of pqRowProcessor in libpq does a good bit of work to maintain the internal data structure of a PGresult. There are a few use cases where the caller doesn't need the ability to access the result set row by row, column by column using PQgetvalue. Think of an ORM that is just going to copy the data from PGresult for each row into its own structures.

I've got a working proof of concept that allows the caller to attach a callback that pqRowProcessor will call instead of going thru its own routine. This eliminates all the copying of data from the PGconn buffer to a PGresult buffer and then ultimately a series of PQgetvalue calls by the client. The callback allows the caller to receive each row's data directly from the PGconn buffer.

It would require exposing struct pgDataValue in libpq-fe.h. The prototype for the callback pointer would be:

    int (*PQrowProcessorCB)(PGresult*, const PGdataValue*, int col_count, void *user_data);

My initial testing shows a significant performance improvement. I'd like some opinions on this before wiring up a performance proof and updating the documentation for a formal patch submission.

Kyle Gearhart
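A minimal sketch of how an application might consume rows through a callback of the proposed shape. It does not use libpq: DemoResult, DemoDataValue, DemoRowProcessorCB, Account, AccountList and collect_account are hypothetical stand-ins invented for illustration, and the convention "return 1 on success, 0 on failure" is an assumption, since the proposal only gives the prototype. The column values are treated as text-format data that is not NUL-terminated, as it would be when pointing into a receive buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for PGresult and PGdataValue. */
typedef struct DemoResult DemoResult;
typedef struct {
    int         len;    /* length in bytes, or -1 for SQL NULL */
    const char *value;  /* points into the receive buffer, not NUL-terminated */
} DemoDataValue;

/* Same shape as the prototype given in the proposal. */
typedef int (*DemoRowProcessorCB)(DemoResult *res, const DemoDataValue *cols,
                                  int col_count, void *user_data);

/* Application-side row type, as an ORM might define it. */
typedef struct { long aid; long abalance; } Account;
typedef struct { Account *rows; size_t count; size_t capacity; } AccountList;

enum { INT_TEXT_MAX = 32 };  /* scratch space for one integer in text form */

/* Parse a text-format integer that is not NUL-terminated. */
static long parse_long(const DemoDataValue *col)
{
    char tmp[INT_TEXT_MAX];

    assert(col->len >= 0 && col->len < INT_TEXT_MAX);
    memcpy(tmp, col->value, (size_t) col->len);
    tmp[col->len] = '\0';
    return strtol(tmp, NULL, 10);
}

/* Callback: copy one row straight into the application's array.
 * Assumed convention: return 1 on success, 0 on failure. */
static int collect_account(DemoResult *res, const DemoDataValue *cols,
                           int col_count, void *user_data)
{
    AccountList *list = user_data;

    (void) res;
    if (col_count != 2)
        return 0;
    for (int i = 0; i < 2; i++)
        if (cols[i].len < 0 || cols[i].len >= INT_TEXT_MAX)
            return 0;   /* NULL or oversized value: reject the row */

    if (list->count == list->capacity) {
        size_t   ncap = list->capacity ? list->capacity * 2 : 16;
        Account *tmp = realloc(list->rows, ncap * sizeof(Account));

        if (tmp == NULL)
            return 0;
        list->rows = tmp;
        list->capacity = ncap;
    }
    list->rows[list->count].aid = parse_long(&cols[0]);
    list->rows[list->count].abalance = parse_long(&cols[1]);
    list->count++;
    return 1;
}
```

The point of the design is that the row lands in the application's own structure in one step, with no intermediate per-row result object.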
On 2/3/17 3:53 PM, Kyle Gearhart wrote:
> The guts of pqRowProcessor in libpq does a good bit of work to maintain the internal data structure of a PGresult. There are a few use cases where the caller doesn't need the ability to access the result set row by row, column by column using PQgetvalue. Think of an ORM that is just going to copy the data from PGresult for each row into its own structures.
>
> I've got a working proof of concept that allows the caller to attach a callback that pqRowProcessor will call instead of going thru its own routine. This eliminates all the copying of data from the PGconn buffer to a PGresult buffer and then ultimately a series of PQgetvalue calls by the client. The callback allows the caller to receive each row's data directly from the PGconn buffer.
>
> It would require exposing struct pgDataValue in libpq-fe.h. The prototype for the callback pointer would be:
> int (*PQrowProcessorCB)(PGresult*, const PGdataValue*, int col_count, void *user_data);
>
> My initial testing shows a significant performance improvement. I'd like some opinions on this before wiring up a performance proof and updating the documentation for a formal patch submission.

I just did essentially the same thing for SPI (use a callback to allow the caller to handle the tuple instead of shoving it into a tuplestore). A simple test in plpython showed a 460% improvement.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Kyle Gearhart <kyle.gearhart@indigohill.io> writes:
> The guts of pqRowProcessor in libpq does a good bit of work to maintain the internal data structure of a PGresult. There are a few use cases where the caller doesn't need the ability to access the result set row by row, column by column using PQgetvalue. Think of an ORM that is just going to copy the data from PGresult for each row into its own structures.

It seems like you're sort of reinventing "single row mode":
https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html

Do we really need yet another way of breaking the unitary-query-result abstraction?

regards, tom lane
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]:
> Kyle Gearhart <kyle.gearhart@indigohill.io> writes:
>> The guts of pqRowProcessor in libpq does a good bit of work to maintain the internal data structure of a PGresult. There are a few use cases where the caller doesn't need the ability to access the result set row by row, column by column using PQgetvalue. Think of an ORM that is just going to copy the data from PGresult for each row into its own structures.
>
> It seems like you're sort of reinventing "single row mode":
> https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html
>
> Do we really need yet another way of breaking the unitary-query-result abstraction?

If it's four times faster...then the option should be available in libpq. I'm traveling tomorrow but will try to get a patch and proof with pgbench dataset up by the middle of the week.

The performance gains are consistent with Jim Nasby's findings with SPI.

Kyle Gearhart
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]:
>> Kyle Gearhart <kyle.gearhart@indigohill.io> writes:
>>> The guts of pqRowProcessor in libpq does a good bit of work to maintain the internal data structure of a PGresult. There are a few use cases where the caller doesn't need the ability to access the result set row by row, column by column using PQgetvalue. Think of an ORM that is just going to copy the data from PGresult for each row into its own structures.
>>
>> It seems like you're sort of reinventing "single row mode":
>> https://www.postgresql.org/docs/devel/static/libpq-single-row-mode.html
>>
>> Do we really need yet another way of breaking the unitary-query-result abstraction?
>
> If it's four times faster...then the option should be available in libpq. I'm traveling tomorrow but will try to get a patch and proof with pgbench dataset up by the middle of the week.

Attached is a proof, test program and test results. No documentation changes have been included at this time. It was tested against a pgbench_accounts record set with 100,000 records.

Overall, wall clock improves 24%. User time elapsed is a 430% improvement. About half the time is spent waiting on the IO with the callback. With the regular pqRowProcessor only about 16% of the time is spent waiting on IO.

The test program follows the pgbench program's command line options, with an added parameter called "m", short for mode. Set the option to "row" for single row processing and "cb" for callback processing. I did not provision for the test program to accept a password from a prompt, you'll have to pass that in the arguments.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
On 2/8/17 5:11 PM, Kyle Gearhart wrote:
> Overall, wall clock improves 24%. User time elapsed is a 430% improvement. About half the time is spent waiting on the IO with the callback. With the regular pqRowProcessor only about 16% of the time is spent waiting on IO.

To wit...

              real    user    sys
single row    0.214   0.131   0.048
callback      0.161   0.030   0.051

Those are averaged over 11 runs.

Can you run a trace to see where all the time is going in the single row case? I don't see an obvious time-suck with a quick look through the code. It'd be interesting to see how things change if you eliminate the filler column from the SELECT.

Also, the backend should be buffering ~8kb of data before handing that to the socket. If that's more than the kernel can buffer I'd expect a serious performance hit.

--
Jim Nasby
On 2/9/17 7:15 PM, Jim Nasby wrote:
> Can you run a trace to see where all the time is going in the single row case? I don't see an obvious time-suck with a quick look through the code. It'd be interesting to see how things change if you eliminate the filler column from the SELECT.

Traces are attached, these are with callgrind.

profile_nofiller.txt: single row without filler column
profile_filler.txt: single row with filler column
profile_filler_callback.txt: callback with filler column

pqResultAlloc looks to hit malloc pretty hard. The callback reduces all of that to a single malloc for each row.

Without the filler, here is the average over 11 runs:

              Real    user    sys
Callback      .133    .033    .035
Single Row    .170    .112    .029

For the callback case, it's slightly higher than the prior results with the filler column.
Attachments
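One way to picture "a single malloc for each row" is sketched below. This is an assumption about how an application-side callback could store a row, not the code from the attached test program: DemoDataValue, PackedRow and pack_row are hypothetical names. The pointer array, the length array and all column bytes share one allocation, so a row costs one malloc and one free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PGdataValue. */
typedef struct {
    int         len;    /* length in bytes, or -1 for SQL NULL */
    const char *value;  /* not NUL-terminated */
} DemoDataValue;

/* One row; values and lens point into the same allocation as the struct. */
typedef struct {
    int    ncols;
    char **values;  /* NUL-terminated copies, or NULL for SQL NULL */
    int   *lens;
} PackedRow;

/* Copy every column of a row using exactly one malloc. Free with free(). */
static PackedRow *pack_row(const DemoDataValue *cols, int ncols)
{
    size_t     data_bytes = 0;
    size_t     n;
    PackedRow *row;
    char      *p;

    assert(cols != NULL && ncols > 0);
    n = (size_t) ncols;
    for (int i = 0; i < ncols; i++)
        if (cols[i].len >= 0)
            data_bytes += (size_t) cols[i].len + 1;   /* +1 for the NUL */

    /* Layout: [PackedRow][char *values[n]][int lens[n]][column bytes] */
    row = malloc(sizeof(PackedRow) + n * sizeof(char *) + n * sizeof(int)
                 + data_bytes);
    if (row == NULL)
        return NULL;

    row->ncols = ncols;
    row->values = (char **) (row + 1);
    row->lens = (int *) (row->values + n);
    p = (char *) (row->lens + n);

    for (int i = 0; i < ncols; i++) {
        row->lens[i] = cols[i].len;
        if (cols[i].len < 0) {
            row->values[i] = NULL;
            continue;
        }
        if (cols[i].len > 0)
            memcpy(p, cols[i].value, (size_t) cols[i].len);
        p[cols[i].len] = '\0';
        row->values[i] = p;
        p += cols[i].len + 1;
    }
    return row;
}
```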
On Mon, Feb 13, 2017 at 8:46 AM, Kyle Gearhart <kyle.gearhart@indigohill.io> wrote:
> On 2/9/17 7:15 PM, Jim Nasby wrote:
>> Can you run a trace to see where all the time is going in the single row case? I don't see an obvious time-suck with a quick look through the code. It'd be interesting to see how things change if you eliminate the filler column from the SELECT.
>
> Traces are attached, these are with callgrind.
>
> profile_nofiller.txt: single row without filler column
> profile_filler.txt: single row with filler column
> profile_filler_callback.txt: callback with filler column
>
> pqResultAlloc looks to hit malloc pretty hard. The callback reduces all of that to a single malloc for each row.

Couldn't that be optimized, say, by preserving malloc'd memory when in single row mode and recycling it? (IIRC during the single row mode discussion this optimization was voted down).

A barebones callback mode ISTM is a complete departure from the classic PGresult interface. This code is pretty unpleasant IMO:

    acct->abalance = *((int*)PQgetvalue(res, 0, i));
    acct->abalance = __bswap_32(acct->abalance);

Your code is faster but foists a lot of the work on the user, so it's kind of cheating in a way (although very carefully written applications might be able to benefit).

merlin
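The recycling idea raised above can be sketched generically as a row buffer that grows when needed and is otherwise reused from one row to the next. This is only an illustration of the technique, not libpq's allocator: RowBuffer, rowbuf_prepare and rowbuf_free are hypothetical names, and the 256-byte starting size is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A buffer that is kept between rows instead of being freed each time. */
typedef struct {
    char         *buf;
    size_t        capacity;
    unsigned long grow_count;   /* how often realloc was actually needed */
} RowBuffer;

/* Make sure the buffer can hold `needed` bytes and return it.
 * Returns NULL on allocation failure; the old buffer stays valid then. */
static char *rowbuf_prepare(RowBuffer *rb, size_t needed)
{
    assert(rb != NULL);
    if (needed > SIZE_MAX / 2)
        return NULL;            /* avoid overflow while doubling */

    if (needed > rb->capacity) {
        size_t ncap = rb->capacity ? rb->capacity : 256;
        char  *tmp;

        while (ncap < needed)
            ncap *= 2;
        tmp = realloc(rb->buf, ncap);
        if (tmp == NULL)
            return NULL;
        rb->buf = tmp;
        rb->capacity = ncap;
        rb->grow_count++;
    }
    return rb->buf;             /* same memory as for the previous row */
}

/* Release the buffer once the whole result set has been consumed. */
static void rowbuf_free(RowBuffer *rb)
{
    free(rb->buf);
    rb->buf = NULL;
    rb->capacity = 0;
}
```

In this sketch each call overwrites the previous row, so any pointer the caller kept into the earlier row's data becomes stale. That trade-off is inherent to recycling one buffer.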
On Mon, Feb 13, 2017 Merlin Moncure wrote:
> A barebones callback mode ISTM is a complete departure from the classic PGresult interface. This code is pretty unpleasant IMO:
>
>     acct->abalance = *((int*)PQgetvalue(res, 0, i));
>     acct->abalance = __bswap_32(acct->abalance);
>
> Your code is faster but foists a lot of the work on the user, so it's kind of cheating in a way (although very carefully written applications might be able to benefit).

The bit you call out above is for single row mode. Binary mode is a slippery slope, with or without the proposed callback. Let's remember that one of the biggest, often overlooked, gains when using an ORM is that it abstracts all this mess away. The goal here is to prevent all the ORM/framework folks from having to implement protocol. Otherwise they get to wait on libpq to copy from the socket to the PGconn buffer to the PGresult structure to their buffers. The callback keeps the slowest guy on the team...on the bench.

Kyle Gearhart
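For the binary-mode snippet quoted above, a more portable decode is sketched here. Binary-format int4 values arrive as four bytes in network (big-endian) byte order; the quoted code casts the value pointer to int* and calls the glibc-specific __bswap_32, which assumes an aligned buffer and a little-endian host. decode_int4_be is a hypothetical helper, not a libpq function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode a binary-format int4 column value: 4 bytes, big-endian.
 * `buf` may be unaligned and is not NUL-terminated.
 * Works the same on little-endian and big-endian hosts.
 */
static int32_t decode_int4_be(const char *buf, int len)
{
    const unsigned char *b = (const unsigned char *) buf;
    uint32_t             u;
    int32_t              v;

    assert(buf != NULL && len == 4);
    u = ((uint32_t) b[0] << 24) | ((uint32_t) b[1] << 16) |
        ((uint32_t) b[2] << 8)  |  (uint32_t) b[3];
    memcpy(&v, &u, sizeof v);   /* reinterpret the bits as signed */
    return v;
}
```

With a binary-format result, the two quoted lines could then become a single call such as acct->abalance = decode_int4_be(PQgetvalue(res, 0, i), PQgetlength(res, 0, i));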
On 2/13/17 8:46 AM, Kyle Gearhart wrote:
> profile_filler.txt
> 61,410,901 ???:_int_malloc [/usr/lib64/libc-2.17.so]
> 38,321,887 ???:_int_free [/usr/lib64/libc-2.17.so]
> 31,400,139 ???:pqResultAlloc [/usr/local/pgsql/lib/libpq.so.5.10]
> 22,839,505 ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
> 17,600,004 ???:pqRowProcessor [/usr/local/pgsql/lib/libpq.so.5.10]
> 16,002,817 ???:malloc [/usr/lib64/libc-2.17.so]
> 14,716,359 ???:pqGetInt [/usr/local/pgsql/lib/libpq.so.5.10]
> 14,400,000 ???:check_tuple_field_number [/usr/local/pgsql/lib/libpq.so.5.10]
> 13,800,324 main.c:main [/usr/local/src/postgresql-perf/test]
>
> profile_filler_callback.txt
> 16,842,303 ???:pqParseInput3 [/usr/local/pgsql/lib/libpq.so.5.10]
> 14,810,783 ???:_int_malloc [/usr/lib64/libc-2.17.so]
> 12,616,338 ???:pqGetInt [/usr/local/pgsql/lib/libpq.so.5.10]
> 10,000,000 ???:pqSkipnchar [/usr/local/pgsql/lib/libpq.so.5.10]
> 9,200,004 main.c:process_callback [/usr/local/src/postgresql-perf/test]

Wow, that's a heck of a difference. There's a ton of places where the backend copies data for no other purpose than to put it into a different memory context. I'm wondering if there's improvement to be had there as well, or whether palloc is so much faster than malloc that it's not an issue. I suspect that some of the effects are being masked by other things since presumably palloc and memcpy are pretty cheap on small volumes of data...

--
Jim Nasby