Re: Parallel copy

From	Ants Aasma
Subject	Re: Parallel copy
Date	18 February 2020, 12:29:20
Msg-id	CANwKhkPmM18UYpOt_AEB4JC6fa0dfA1PfgiQyNzeNUxEpG=XUw@mail.gmail.com
In reply to	Re: Parallel copy (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: Parallel copy
List	pgsql-hackers

On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is something similar to what I also had in mind for this idea.  I
> had thought of handing over a complete chunk (64K or whatever we
> decide).  The one thing that slightly bothers me is that we will add
> some additional overhead of copying to and from shared memory which
> was earlier from local process memory.  And, the tokenization (finding
> line boundaries) would be serial.  I think that tokenization should be
> a small part of the overall work we do during the copy operation, but
> will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly
fread()/pq_copymsgbytes() into shared memory, and the workers can run
CopyReadLineText() inner loop directly off of the buffer in shared memory.
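
A minimal sketch of that zero-copy layout, not code from PostgreSQL or from any patch in this thread. The Chunk struct and the chunk_fill()/chunk_next_line() helpers are hypothetical names, the buffer is ordinary process memory standing in for a shared memory segment, and CSV quoting (a newline inside a quoted field) is ignored, which the real CopyReadLineText() has to handle:

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 65536		/* the 64K chunk size floated upthread */

/* Hypothetical stand-in for one chunk placed in shared memory. */
typedef struct Chunk
{
	size_t		used;			/* number of valid bytes in data[] */
	char		data[CHUNK_SIZE];
} Chunk;

/*
 * Reader side: fread() straight into the chunk, so the input never
 * passes through a process-local staging buffer.
 */
static size_t
chunk_fill(Chunk *c, FILE *in)
{
	c->used = fread(c->data, 1, CHUNK_SIZE, in);
	return c->used;
}

/*
 * Worker side: locate the next line in place.  Returns a pointer into
 * the chunk and stores the line length (without the newline) in *len,
 * advancing *pos past the newline.  Returns NULL when no complete line
 * is left; a trailing partial line would continue in the next chunk.
 */
static const char *
chunk_next_line(const Chunk *c, size_t *pos, size_t *len)
{
	const char *start;
	const char *nl;

	if (*pos >= c->used)
		return NULL;

	start = c->data + *pos;
	nl = memchr(start, '\n', c->used - *pos);
	if (nl == NULL)
		return NULL;

	*len = (size_t) (nl - start);
	*pos += *len + 1;
	return start;
}
```

The worker only ever receives pointers into the chunk, so the single copy is the one fread() performs on the reader side.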

For serial performance of tokenization into lines, I really think a
SIMD-based approach will be fast enough for quite some time. I hacked up the
code in the simdcsv project to only tokenize on line endings and it was able
to tokenize a CSV file with short lines at 8+ GB/s. There are going to be many
other bottlenecks before this one becomes the limiting factor. Patch attached
if you'd like to try that out.
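
To illustrate the general technique only (this is not the attached patch and not simdcsv code), here is a small sketch that finds newline offsets 16 bytes at a time with SSE2. The function name find_line_endings() is made up, __builtin_ctz assumes GCC or Clang, quoted fields are ignored, and on builds without SSE2 the function falls back to the scalar loop alone:

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Store the offsets of '\n' bytes found in buf[0..len) into out[],
 * stopping once max_out offsets have been written.  Returns the number
 * of offsets stored.
 */
static size_t
find_line_endings(const char *buf, size_t len, size_t *out, size_t max_out)
{
	size_t		n = 0;
	size_t		i = 0;

#if defined(__SSE2__)
	const __m128i newline = _mm_set1_epi8('\n');

	for (; i + 16 <= len && n < max_out; i += 16)
	{
		/* Compare 16 input bytes against '\n' in one instruction. */
		__m128i		block = _mm_loadu_si128((const __m128i *) (buf + i));
		unsigned	mask = (unsigned)
			_mm_movemask_epi8(_mm_cmpeq_epi8(block, newline));

		/* Each set bit in mask marks one newline; peel them off in order. */
		while (mask != 0 && n < max_out)
		{
			out[n++] = i + (size_t) __builtin_ctz(mask);
			mask &= mask - 1;	/* clear the lowest set bit */
		}
	}
#endif

	/* Scalar loop for the tail (or for everything without SSE2). */
	for (; i < len && n < max_out; i++)
	{
		if (buf[i] == '\n')
			out[n++] = i;
	}

	return n;
}
```

Blocks that contain no newline cost one load, one compare and one movemask, which is why short per-byte branches disappear from the hot path.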

Regards,
Ants Aasma

Attachments

simdcsv-find-only-lineendings.diff
