Discussion: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq
I am getting very poor performance using libpq to process very large TEXT columns (300MB+). I suspect it is I/O related but can't be sure.
Has anyone had experience with the same issue who can help me resolve it?
--cnemelka
I believe large columns get put into a TOAST table. Max page size is 8k. So you'll have lots of pages per row that need to be joined with a size like that: https://www.postgresql.org/docs/9.5/static/storage-toast.html
Aldo Sarmiento
President & CTO
8687 Research Dr, Irvine, CA 92618
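For reference, the heap-versus-TOAST split can be measured directly. A minimal sketch, not from the thread, using libpq (the table name "mytable" is hypothetical; adjust for your schema):

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("");   /* rely on PG* environment variables */
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* pg_total_relation_size() includes TOAST (and indexes);
     * pg_relation_size() is the main heap only. */
    PGresult *res = PQexec(conn,
        "SELECT pg_size_pretty(pg_relation_size('mytable')), "
        "       pg_size_pretty(pg_total_relation_size('mytable'))");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("heap: %s, total incl. TOAST: %s\n",
               PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));

    PQclear(res);
    PQfinish(conn);
    return 0;
}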
Yes, but I should be able to read them much faster. The psql client can display an 11MB column in a little over a minute, while in C using the libpq library it takes over an hour.
Does anyone have experience with the same issue who can help me resolve it?
--cnemelka
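One way to rule the result packaging in or out is to fetch the value in slices rather than as a single 300MB+ field. A minimal sketch, not Cory's actual code, assuming a hypothetical table "mytable" with a TEXT column "bigcol" keyed by "id", and a single-byte encoding (substring() counts characters while PQgetlength() returns bytes):

#include <stdio.h>
#include <libpq-fe.h>

#define CHUNK (8 * 1024 * 1024)             /* per round trip; tune to taste */

static void read_big_column(PGconn *conn)
{
    long offset = 1;                        /* substring() is 1-based */
    for (;;) {
        char sql[256];
        snprintf(sql, sizeof sql,
                 "SELECT substring(bigcol FROM %ld FOR %d) "
                 "FROM mytable WHERE id = 1", offset, CHUNK);
        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) != PGRES_TUPLES_OK) {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            return;
        }
        int len = PQgetlength(res, 0, 0);   /* bytes; no strlen() scan */
        if (len == 0) {                     /* past the end of the value */
            PQclear(res);
            break;
        }
        /* process len bytes at PQgetvalue(res, 0, 0) here */
        offset += len;                      /* OK for single-byte encodings */
        PQclear(res);
    }
}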
All I am doing is iterating through the characters, so I know it isn't my code.
--cnemelka
It's probably worth removing the iterating code Just In Case.
Apologies for egg-suck-education, but I assume you're not doing something silly like:

for (i = 0; i < strlen(bigtextstring); i++) {
    ....
}
I know it sounds stupid, but you'd be amazed how many times that crops up, and for small strings it doesn't matter, but for large strings it's catastrophic.
Geoff
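The usual fix for that pattern is to compute the length once, before the loop. With libpq there is no need for strlen() at all: PQgetlength() returns the field's size in bytes without scanning it. A minimal sketch (the newline-counting body is just a stand-in for real processing):

#include <libpq-fe.h>

static long count_newlines(PGresult *res, int row, int col)
{
    const char *val = PQgetvalue(res, row, col);
    int len = PQgetlength(res, row, col);   /* O(1): libpq tracks field sizes */
    long n = 0;
    for (int i = 0; i < len; i++)
        if (val[i] == '\n')
            n++;
    return n;
}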
I'll take out all the code that isn't directly related to reading the data and see if that helps. That was the next step I intended anyway.
Thank you for the reply.
--cnemelka
Cory Nemelka <cnemelka@gmail.com> writes:
> Yes, but I should be able to read them much faster. The psql client can
> display an 11MB column in a little over a minute, while in C using the
> libpq library it takes over an hour.

Well, of course psql relies on libpq, so it seems unlikely that libpq itself is where the time is going. Have you tried applying a profiler? "perf" or "oprofile" or a similar tool ought to pinpoint the culprit pretty easily.

regards, tom lane
As an aside, any halfway decent optimizer would realize that the result of strlen() is unchanging as long as the contents of what it's passed aren't modified. That's a common enough pattern that it should be checked for.
What about buffer size? Are you using a smaller fetch size that results in lots of little packets?
Is there a way to specify a connection that uses compression?
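On compression: libpq has no general wire-compression knob, and there is no client-side fetch-size setting either (a result is delivered whole; cursors or single-row mode split rows, not one huge column). The closest thing is SSL-level compression via the sslcompression connection parameter. A sketch, assuming a hypothetical host/database and an OpenSSL build with compression enabled:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* sslcompression=1 asks the SSL layer to compress the stream;
     * both ends must agree, and many SSL builds disable compression. */
    PGconn *conn = PQconnectdb(
        "host=db.example.com dbname=mydb sslmode=require sslcompression=1");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    /* ... fetch the big column as usual ... */
    PQfinish(conn);
    return 0;
}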
> As an aside, any halfway decent optimizer would realize that the result of strlen() is unchanging as long as the contents of what it's passed aren't modified. That's a common enough pattern that it should be checked for.
IME this is a myth perpetuated by bad computer science lecturers who haven't thought through the consequences of what they're saying. strlen() can change because of changes inside the loop, but also because of other threads in the system; I've not yet seen a compiler optimise that away, and neither should it, IMO.
G
In that case you must put a read lock on the string that covers the loop. If you're in a multi-threaded environment and not using locks when appropriate then all bets are off.
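A minimal sketch of that point, using a hypothetical pthreads reader: once the lock is held for the whole loop, hoisting strlen() out of the condition is safe anyway:

#include <string.h>
#include <pthread.h>

static pthread_rwlock_t buf_lock = PTHREAD_RWLOCK_INITIALIZER;

static long reader(const char *shared_buf)
{
    long n = 0;
    pthread_rwlock_rdlock(&buf_lock);   /* writers are excluded for the loop */
    size_t len = strlen(shared_buf);    /* now safe to compute just once */
    for (size_t i = 0; i < len; i++)
        n += (shared_buf[i] != '\0');   /* stand-in for real processing */
    pthread_rwlock_unlock(&buf_lock);
    return n;
}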
On 21 Oct 2017 12:32, "Bear Giles" <bgiles@coyotesong.com> wrote:
> In that case you must put a read lock on the string that covers the loop. If you're in
> a multi-threaded environment and not using locks when appropriate then all bets are off.

You reckon a compiler can decide to blow up your code by making assumptions like that? Your loop could set a variable for a state machine in a processing thread to modify the string. That doesn't preclude correct locking behaviour.

If you think that's too contrived then forget threads: you could make a shared library call that the compiler can't assess at compile time and that could change the string.

Yes, in either case, using strlen() to check for that is poor code, but the compiler can't assume you're not using poor code. This argument is pretty pointless; the only way to be sure to avoid the problem is to assume that the compiler won't optimize bad code.

FWIW gcc 4.8.5 with -O3 doesn't optimize away strlen() even in code this simple:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int i;
    char *buff;

    buff = malloc(strlen(argv[1]) + 1);      /* +1 for the terminating NUL */
    for (i = 0; i < strlen(argv[1]); i++) {  /* strlen() runs every pass */
        buff[i] = argv[1][i];
    }
    buff[i] = '\0';
    printf("%s", buff);
    return 0;
}

The generated loop calls strlen() on every iteration:

.L3:
        movzbl  0(%rbp,%rbx), %edx
        movb    %dl, (%r12,%rbx)
        movq    8(%r13), %rbp
        addq    $1, %rbx
.L2:
        movq    %rbp, %rdi
        call    strlen
        cmpq    %rax, %rbx
        jb      .L3

However, it _does_ optimize this code (same includes):

int main(int argc, char **argv)
{
    int i;
    char *buff;
    char *buff2;

    buff2 = strdup(argv[1]);
    buff = malloc(strlen(buff2) + 1);
    for (i = 0; i < strlen(buff2); i++) {
        buff[i] = buff2[i];
    }
    buff[i] = '\0';
    printf("%s", buff);
    return 0;
}

I assume that's because it can be certain at compile time that, since both buff and buff2 are local, nothing else is going to modify the source string (without some stack smashing, anyway).

Geoff
I wrote:
> I assume that's because it can be certain at compile time that, since
> both buff and buff2 are local, nothing else is going to modify the
> source string (without some stack smashing, anyway).

Ugh. Of course they're not local: they're malloc'd; however, it's still reasonable to assume that (since the pointers aren't passed anywhere) the buffers are untouched elsewhere, unless other threads are misbehaving.

Geoff
I meant the developer needs to add the locks, not the compiler. I think they're just advisory, but everyone always writes well-behaved code, right? ;-)