Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2020-01-23 18:04, Robert Haas wrote:
>> Now, you might say "well, why don't we just do an encoding
>> conversion?", but we can't. When the filesystem tells us what the file
>> names are, it does not tell us what encoding the person who created
>> those files had in mind. We don't know that they had *any* encoding in
>> mind. IIUC, a file in the data directory can have a name that consists
>> of any sequence of bytes whatsoever, so long as it doesn't contain
>> prohibited characters like a path separator or \0 byte. But only some
>> of those possible octet sequences can be stored in a manifest that has
>> to be valid UTF-8.
> I think it wouldn't be unreasonable to require that file names in the
> database directory be consistently encoded (as defined by pg_control,
> probably). After all, this information is sometimes also shown in
> system views, so it's already difficult to process total junk. In
> practice, this shouldn't be an onerous requirement.
I don't entirely follow why we're discussing this at all, if the
requirement is backing up a PG data directory. There are not, and
are never likely to be, any legitimate files with non-ASCII names
in that context. Why can't we just skip any such files?
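To make the distinction concrete, here is a rough sketch of the two checks being discussed: "is this name valid UTF-8?" (what a UTF-8 manifest would need) versus "is this name pure ASCII?" (what skipping would test). This is only an illustration, not anything in the tree; the function names are hypothetical, and it is in Python rather than C just for brevity.

```python
import os


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name decodes as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_ascii(name: bytes) -> bool:
    """Return True if every byte of the raw file name is below 0x80."""
    return all(b < 0x80 for b in name)


def split_by_ascii(names):
    """Split raw names into (kept, skipped): ASCII-only names are kept."""
    kept = [n for n in names if is_ascii(n)]
    skipped = [n for n in names if not is_ascii(n)]
    return kept, skipped


def list_backup_candidates(directory: bytes):
    """Hypothetical helper: list one directory and apply the ASCII rule.

    Passing a bytes path makes os.listdir() return the names as raw
    bytes, so no encoding is assumed when reading them.
    """
    return split_by_ascii(sorted(os.listdir(directory)))
```

For example, split_by_ascii([b"base", b"caf\xe9", b"pg_wal"]) keeps b"base" and b"pg_wal" and skips b"caf\xe9", which is a Latin-1 name that is not valid UTF-8 either. Every ASCII name is also valid UTF-8, so anything that passes the skip rule could be stored in the manifest as-is.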
regards, tom lane