Speeding up unicode decoding
From: Daniele Varrazzo
Subject: Speeding up unicode decoding
Date:
Msg-id: CA+mi_8ayXAKRtNiEAv6sNA7T4vTd-EOAbO6eg2+RKuZniLBcrw@mail.gmail.com
List: psycopg
Hello,

I've taken a look at issue https://github.com/psycopg/psycopg2/issues/473,
where it is reported that SQLAlchemy is faster than psycopg at decoding
unicode (i.e. it is faster for SQLAlchemy to have psycopg return byte
strings and decode them itself than to ask psycopg to return unicode
directly).

It seems from the discussion linked in the ticket that a relevant
improvement can come from caching the codec. I've tried a quick test:
storing in the connection a pointer to a fast C decode function for known
codecs (e.g. for a utf8 connection, a pointer to PyUnicode_DecodeUTF8).
The results are totally worth more work.

This script
<https://gist.github.com/dvarrazzo/43b43d6ae96e13319cb085a3efe92ac8>
generates unicode data on the server and measures the decode time
(decoding happens at fetch*() time, so the operation is not I/O bound
but bound by CPU and memory access).

Decoding 400K of 1KB strings shows a 17% speedup:

    $ PYTHONPATH=orig python ./timeenc.py -s 1000 -c 4096 -m 100
    timing for strsize: 1000, chrrange: 4096, mult: 100
    times: 2.588915, 2.310006, 2.308195, 2.305879, 2.304648 sec
    best: 2.304648 sec

    $ PYTHONPATH=fast python ./timeenc.py -s 1000 -c 4096 -m 100
    timing for strsize: 1000, chrrange: 4096, mult: 100
    times: 2.159055, 1.922977, 1.922651, 1.933926, 1.932110 sec
    best: 1.922651 sec

Because the codec lookup overhead is paid per string, not per byte of
data, the improvement is larger when the same amount of data is decoded
in more, shorter strings: 55% for 4M of 100B strings:

    $ PYTHONPATH=orig python ./timeenc.py -s 100 -c 4096 -m 1000
    timing for strsize: 100, chrrange: 4096, mult: 1000
    times: 5.997742, 5.909936, 5.914419, 5.967713, 6.779648 sec
    best: 5.909936 sec

    $ PYTHONPATH=fast python ./timeenc.py -s 100 -c 4096 -m 1000
    timing for strsize: 100, chrrange: 4096, mult: 1000
    times: 2.738192, 2.669642, 2.647298, 2.657130, 2.651866 sec
    best: 2.647298 sec

Other things to do:

- the lookup can be cached also for encodings other than the two blessed
  ones for which there is a public C function in the Python API (in
  Python terms, this is like saving codecs.getdecoder() instead of
  calling codecs.decode(); see the sketch after this message);

- encoding data sent to the connection can be optimised the same way
  (see the second sketch below).

If someone wants to contribute to the idea, the first commit is in the
branch <https://github.com/psycopg/psycopg2/tree/fast-codecs>. Any
feedback or help is welcome.
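To make the per-string lookup cost concrete, here is a minimal
pure-Python sketch of the caching idea (this is not psycopg code; the
encoding, data sizes and iteration counts are arbitrary choices for
illustration): resolving the codec once and reusing the bound decoder
avoids the registry lookup on every string, which is what caching the
decoder in the connection buys at the C level.

    # Sketch only: compare a per-string codec lookup with a decoder
    # resolved once up front, as a cached one would be in the connection.
    import codecs
    import timeit

    data = [("x" * 100).encode("iso8859-2")] * 10000  # many short strings

    def lookup_each_time():
        # codecs.decode() resolves the codec on every call
        return [codecs.decode(b, "iso8859-2") for b in data]

    def cached_decoder():
        # resolve the codec once, then call the bound function directly;
        # the decoder returns a (string, bytes_consumed) tuple
        decode = codecs.getdecoder("iso8859-2")
        return [decode(b)[0] for b in data]

    print("lookup each time:", min(timeit.repeat(lookup_each_time, number=10)))
    print("cached decoder:  ", min(timeit.repeat(cached_decoder, number=10)))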
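The encode direction admits the same trick; a hedged sketch, again with
an arbitrary encoding standing in for the connection's:

    import codecs

    encode = codecs.getencoder("iso8859-2")   # resolved once per connection
    payload = encode("a query parameter")[0]  # called once per string sent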
-- Daniele