Re: WIP Incremental JSON Parser

From: Robert Haas
Subject: Re: WIP Incremental JSON Parser
Date:
Msg-id: CA+TgmoZKqZ+xXN550e2DXM-F4ELCE+Ee=Ogy4pZZMhu9SBRmzQ@mail.gmail.com
In reply to: WIP Incremental JSON Parser  (Andrew Dunstan <andrew@dunslane.net>)
Responses: Re: WIP Incremental JSON Parser  (Andrew Dunstan <andrew@dunslane.net>)
           Re: WIP Incremental JSON Parser  (Nico Williams <nico@cryptonector.com>)
List: pgsql-hackers
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> Quite a long time ago Robert asked me about the possibility of an
> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
> performance is significantly worse than that of the current Recursive
> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
> and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.

> One possible use would be in parsing large manifest files for
> incremental backup. However, it struck me a few days ago that this might
> not work all that well. The current parser and the new parser both
> palloc() space for each field name and scalar token in the JSON (unless
> they aren't used, which is normally not the case), and they don't free
> it, so that particularly if done in frontend code this amounts to a
> possible memory leak, unless the semantic routines do the freeing
> themselves. So while we can save some memory by not having to slurp in
> the whole JSON in one hit, we aren't saving any of that other allocation
> of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
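
For illustration, here's a minimal sketch of the kind of read loop I
have in mind. The json_incremental_parse() function and the
JsonIncrementalState type are hypothetical stand-ins for whatever API
the WIP parser ends up with:

    /*
     * Sketch only: json_incremental_parse() and JsonIncrementalState
     * are hypothetical; palloc/pfree as in src/include/utils/palloc.h
     * (or fe_memutils.h in frontend code), FILE/fread from <stdio.h>.
     */
    #define CHUNK_SIZE (1024 * 1024)        /* 1MB working buffer */

    static bool
    parse_manifest_incrementally(FILE *fp, JsonIncrementalState *state)
    {
        char   *buf = palloc(CHUNK_SIZE);
        bool    ok = true;

        for (;;)
        {
            size_t  nread = fread(buf, 1, CHUNK_SIZE, fp);
            bool    is_last = (nread < CHUNK_SIZE); /* short read => EOF */

            /*
             * Feed one chunk; the parser has to buffer any partial
             * token internally and resume on the next call.
             */
            if (!json_incremental_parse(state, buf, nread, is_last))
            {
                ok = false;             /* parse error */
                break;
            }
            if (is_last)
                break;
        }

        pfree(buf);
        return ok && !ferror(fp);
    }

Peak memory there is the 1MB buffer plus whatever the parser and the
callbacks retain, rather than the whole 2GB document.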

As for that other issue, we could address it either by having the
semantic routines free the memory if they don't need it, or
alternatively by having the parser itself free the memory after
invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical one. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if it wants, or (2) that it's a palloc'd chunk
that the callback must not free, or (3) that it's not a palloc'd
chunk. We can further document the memory context in which the chunk
will be allocated, if applicable, and when/if the parser will free it.
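
To make (1) concrete, a scalar callback might look something like
this; the signature matches the shape of the existing jsonapi
semantic actions, but the ownership convention and the
ManifestParseState fields are just illustrative:

    /*
     * Sketch only: ManifestParseState is a made-up semantic-state
     * struct; JsonParseErrorType, JsonTokenType and JSON_SUCCESS are
     * as in src/include/common/jsonapi.h. Convention (1): the token
     * is a palloc'd chunk that the callback owns and may free.
     */
    static JsonParseErrorType
    manifest_scalar(void *semstate, char *token, JsonTokenType type)
    {
        ManifestParseState *parse = semstate;

        if (parse->want_value)
            parse->value = token;   /* take ownership of the chunk */
        else
            pfree(token);           /* not needed: free it right away */

        return JSON_SUCCESS;
    }

Under convention (2), the pfree() would instead happen inside the
parser right after the callback returns, and a callback that wanted
to keep the value would have to pstrdup() it.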

--
Robert Haas
EDB: http://www.enterprisedb.com


