Обсуждение: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

Поиск
Список
Период
Сортировка

COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
Hi,

From time to time, I need to deal with bizarrely delimited text files,
having to use tools such as awk/sed/perl to reformat the files
so that they can be copied into PostgreSQL.

If such files could be imported to a table with a single text column,
we could then use PostgreSQL's now really fast regex-engine
to do data cleaning/reformatting, instead of having to rely on external awk-ward tools.

Furthermore, sometimes you don't want to clean/reformat the data at all,
but simply import the text lines "as is" without modifications,
such as when wanting to import unformatted log files,
where the log lines can contain any characters.

Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?

    ERROR:  COPY delimiter cannot be newline or carriage return

That is, to change E'\n' to be a valid delimiter, which would simply read each line
delimited by newlines, as a single column.

The hack I'm currently abusing is to find some one-byte character that is not present anywhere in the text file,
and then to use that character as a delimiter. This doesn't work when needing to deal with a text file
which content is unknown at the time when writing the code though, so it's mostly useful for throwaway one-off queries.

Thoughts?

/Joel

MySQL seems to already support using \n as a delimiter (I haven't verified it myself though) [1]



Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"David G. Johnston"
Дата:
On Wed, May 5, 2021 at 8:31 AM Joel Jacobson <joel@compiler.org> wrote:
Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?


Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly what it says and does exactly what you desire?

David J.

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Chapman Flack
Дата:
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

Regards,
-Chap



Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Isaac Morland
Дата:
On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.

Would DELIMITER NULL make sense? The existing values are literal strings so NULL fits with that. Do we already have NONE as a keyword somewhere? It's listed in the keyword appendix to the documentation but I can't think of where it is used off the top of my head.

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"David G. Johnston"
Дата:
On Wed, May 5, 2021 at 10:34 AM Isaac Morland <isaac.morland@gmail.com> wrote:
On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.


This.  When dealing with COPY it's expected that each line becomes its own row.  On the server you can do pg_read_file() if you need the entire file to be considered a single value.  psql (\I and variables) is a bit more hackey, but I'd rather see that improved directly anyway if the goal is to try and make getting the "whole document" easier - copy isn't the right API for that IMO.

David J.

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
On Wed, May 5, 2021, at 19:58, David G. Johnston wrote:
On Wed, May 5, 2021 at 10:34 AM Isaac Morland <isaac.morland@gmail.com> wrote:
On Wed, 5 May 2021 at 13:23, Chapman Flack <chap@anastigmatix.net> wrote:
On 05/05/21 13:02, David G. Johnston wrote:
> Why not just allow: "DELIMITER NONE" to be valid syntax meaning exactly
> what it says and does exactly what you desire?

What would it mean? That you get one column, multiple rows of text
corresponding to "lines" delimited by something, or that you get one
column, one row of text for the entire content of the file?

It means no column delimiter. In other words, there is no character which marks the end of a data value, so the entire line is a single data value.


This.  When dealing with COPY it's expected that each line becomes its own row.  On the server you can do pg_read_file() if you need the entire file to be considered a single value.  psql (\I and variables) is a bit more hackey, but I'd rather see that improved directly anyway if the goal is to try and make getting the "whole document" easier - copy isn't the right API for that IMO.

I think you misunderstood the problem.
I don't want the entire file to be considered a single value.
I want each line to become its own row, just a row with a single column.

So I actually think COPY seems like a perfect match for the job,
since it does precisely that, except there is no delimiter in this case.

I'm currently using the pg_read_file()-hack in a project,
and even though it can read files up to 1GB,
using e.g. regexp_split_to_table() to split on E'\n'
seems to need 4x as much memory, so it only
works with files less than ~256MB.

SELECT COUNT(*) FROM regexp_split_to_table(repeat(E'\n',1000000000),E'\n');
ERROR:  invalid memory alloc request size 4000000004
Time: 4151.374 ms (00:04.151)

/Joel

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
On Wed, May 5, 2021, at 19:34, Isaac Morland wrote:
Would DELIMITER NULL make sense? The existing values are literal strings so NULL fits with that. Do we already have NONE as a keyword somewhere? It's listed in the keyword appendix to the documentation but I can't think of where it is used off the top of my head.

+1 to using some keyword. NULL or NONE seems fine to me. Or maybe WITHOUT DELIMITER?

/Joel

Re: Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Tom Lane
Дата:
"Joel Jacobson" <joel@compiler.org> writes:
> I think you misunderstood the problem.
> I don't want the entire file to be considered a single value.
> I want each line to become its own row, just a row with a single column.

> So I actually think COPY seems like a perfect match for the job,
> since it does precisely that, except there is no delimiter in this case.

Well, there's more to it than just the column delimiter.

* What about \N being converted to NULL?
* What about \. being treated as EOF?
* Do you want to turn off the special behavior of backslash (ESCAPE)
  altogether?
* What about newline conversions (\r\n being seen as just \n, etc)?

I'm inclined to think that "use pg_read_file and then split at newlines"
might be a saner answer than delving into all these fine points.
Not least because people yell when you add cycles to the COPY
inner loops.

> I'm currently using the pg_read_file()-hack in a project,
> and even though it can read files up to 1GB,
> using e.g. regexp_split_to_table() to split on E'\n'
> seems to need 4x as much memory, so it only
> works with files less than ~256MB.

Yeah, that's because of the conversion to "chr".  But a regexp
is overkill for that anyway.  Don't we have something that will
split on simple substring matches?

            regards, tom lane



Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Andrew Dunstan
Дата:
On 5/5/21 2:45 PM, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> I think you misunderstood the problem.
>> I don't want the entire file to be considered a single value.
>> I want each line to become its own row, just a row with a single column.
>> So I actually think COPY seems like a perfect match for the job,
>> since it does precisely that, except there is no delimiter in this case.
> Well, there's more to it than just the column delimiter.
>
> * What about \N being converted to NULL?
> * What about \. being treated as EOF?
> * Do you want to turn off the special behavior of backslash (ESCAPE)
>   altogether?
> * What about newline conversions (\r\n being seen as just \n, etc)?
>
> I'm inclined to think that "use pg_read_file and then split at newlines"
> might be a saner answer than delving into all these fine points.
> Not least because people yell when you add cycles to the COPY
> inner loops.



+1


Also we have generally been resistant to supporting odd formats. FDWs
can help here (e.g. file_text_array), but they can't use STDIN IIRC.


>
>> I'm currently using the pg_read_file()-hack in a project,
>> and even though it can read files up to 1GB,
>> using e.g. regexp_split_to_table() to split on E'\n'
>> seems to need 4x as much memory, so it only
>> works with files less than ~256MB.
> Yeah, that's because of the conversion to "chr".  But a regexp
> is overkill for that anyway.  Don't we have something that will
> split on simple substring matches?
>
>             



Not that I know of. There is split_part but I don't think that's fit for
purpose here. Do we need one, or have I missed something?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Andrew Dunstan
Дата:
On 5/5/21 3:36 PM, Justin Pryzby wrote:
> On Wed, May 05, 2021 at 02:45:41PM -0400, Tom Lane wrote:
>>> I'm currently using the pg_read_file()-hack in a project,
>>> and even though it can read files up to 1GB,
>>> using e.g. regexp_split_to_table() to split on E'\n'
>>> seems to need 4x as much memory, so it only
>>> works with files less than ~256MB.
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?
> For v14
>
> commit 66f163068030b5c5fe792a0daee27822dac43791
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Wed Sep 2 18:23:56 2020 -0400
>
>     Add string_to_table() function.
>

Ha! just in time :-)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Tom Lane
Дата:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 5/5/21 2:45 PM, Tom Lane wrote:
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?

> Not that I know of. There is split_part but I don't think that's fit for
> purpose here. Do we need one, or have I missed something?

[ checks manual ... ]

string_to_array or string_to_table would do, I think.

            regards, tom lane



Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
On Wed, May 5, 2021, at 21:51, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 5/5/21 2:45 PM, Tom Lane wrote:
>> Yeah, that's because of the conversion to "chr".  But a regexp
>> is overkill for that anyway.  Don't we have something that will
>> split on simple substring matches?

> Not that I know of. There is split_part but I don't think that's fit for
> purpose here. Do we need one, or have I missed something?

[ checks manual ... ]

string_to_array or string_to_table would do, I think.

regards, tom lane


Thanks for these new functions, they seem really useful for a lot of cases.

However, I see two problems with using string_to_table() for this particular use-case.
 
- Doesn't work with files larger than 1GB, due to pg_read_file()'s limit.
- 68% slower than using the COPY-hack.

% ls -lah foo.txt
-rw-r--r--  1 joel  staff   623M May  6 07:31 foo.txt

% wc -l foo.txt
6771864 foo.txt

# \d txt
               Table "public.txt"
Column | Type | Collation | Nullable | Default
--------+------+-----------+----------+---------
line   | text |           |          |

# COPY txt (line) FROM 'foo.txt' DELIMITER '"';
COPY 6771864
Time: 9829.707 ms (00:09.830)
Time: 9552.286 ms (00:09.552)
Time: 9483.115 ms (00:09.483)

# TRUNCATE txt;
TRUNCATE TABLE

# INSERT INTO txt (line) SELECT string_to_table(pg_read_file('foo.txt'),E'\n');
INSERT 0 6771865
Time: 16556.078 ms (00:16.556)
Time: 14720.343 ms (00:14.720)
Time: 17266.088 ms (00:17.266)

/Joel

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Thomas Kellerer
Дата:
Joel Jacobson schrieb am 05.05.2021 um 17:30:
> Could it be an idea to exploit the fact that DELIMITER E'\n' is currently an error?
>
>     ERROR:  COPY delimiter cannot be newline or carriage return
>
> That is, to change E'\n' to be a valid delimiter, which would simply read each line
> delimited by newlines, as a single column.
>
> The hack I'm currently abusing is to find some one-byte character that is not present anywhere in the text file,
> and then to use that character as a delimiter. This doesn't work when needing to deal with a text file
> which content is unknown at the time when writing the code though, so it's mostly useful for throwaway one-off
queries.

What about

    delimiter E'\1'

The probability that a file contains the ASCII "character" 1 seems rather low.

Thomas





Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
> I think you misunderstood the problem.
> I don't want the entire file to be considered a single value.
> I want each line to become its own row, just a row with a single column.

> So I actually think COPY seems like a perfect match for the job,
> since it does precisely that, except there is no delimiter in this case.

Well, there's more to it than just the column delimiter.

* What about \N being converted to NULL?
* What about \. being treated as EOF?
* Do you want to turn off the special behavior of backslash (ESCAPE)
  altogether?
* What about newline conversions (\r\n being seen as just \n, etc)?

I'm inclined to think that "use pg_read_file and then split at newlines"
might be a saner answer than delving into all these fine points.
Not least because people yell when you add cycles to the COPY
inner loops.

Thanks for providing strong arguments why the COPY approach is a dead-end, I agree.

However, as demonstrated in my previous email, using

   string_to_table(pg_read_file( filename ), E'\n')

has its performance as well as max size issues.

Maybe these two problems could be solved by combining the two functions into one?

   file_to_table ( filename text, delimiter text [, null_string text ] ) → setof text

I'm thinking thanks to returning "setof text", such a function could read a stream,
and return a line as soon as a delimiter is encountered, not having to keep
the entire file in memory at any time.

/Joel

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Darafei "Komяpa" Praliaskouski
Дата:
I have similar problems and what is really needed is a way to get a file from client side into a server side object that can be dealt with later.
The most popular way is COPY and it is built into the psql tool. In general it supports \copy wrapper, and there is COPY FROM STDIN. However, it is not available to the files that are not following the csv-like structure. I had to use it for XML and huge JSON files before, and it's always `sed` before the import and a replace() after.

pg_read_file does not help on cloud and managed installs of postgres here.

What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.



On Thu, May 6, 2021 at 9:14 AM Joel Jacobson <joel@compiler.org> wrote:
On Wed, May 5, 2021, at 20:45, Tom Lane wrote:
"Joel Jacobson" <joel@compiler.org> writes:
> I think you misunderstood the problem.
> I don't want the entire file to be considered a single value.
> I want each line to become its own row, just a row with a single column.

> So I actually think COPY seems like a perfect match for the job,
> since it does precisely that, except there is no delimiter in this case.

Well, there's more to it than just the column delimiter.

* What about \N being converted to NULL?
* What about \. being treated as EOF?
* Do you want to turn off the special behavior of backslash (ESCAPE)
  altogether?
* What about newline conversions (\r\n being seen as just \n, etc)?

I'm inclined to think that "use pg_read_file and then split at newlines"
might be a saner answer than delving into all these fine points.
Not least because people yell when you add cycles to the COPY
inner loops.

Thanks for providing strong arguments why the COPY approach is a dead-end, I agree.

However, as demonstrated in my previous email, using

   string_to_table(pg_read_file( filename ), E'\n')

has its performance as well as max size issues.

Maybe these two problems could be solved by combining the two functions into one?

   file_to_table ( filename text, delimiter text [, null_string text ] ) → setof text

I'm thinking thanks to returning "setof text", such a function could read a stream,
and return a line as soon as a delimiter is encountered, not having to keep
the entire file in memory at any time.

/Joel


--
Darafei "Komяpa" Praliaskouski
OSM BY Team - http://openstreetmap.by/

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Isaac Morland
Дата:
On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
 
What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.

Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.

Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:
On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
 
What I would prefer is some new COPY mode like RAW that will just push whatever it gets on the stdin/input into the cell on the server side. This way it can be proxied by psql, utilize existing infra for passing streams and be used in shell scripting.

Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.

Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?


If the file is on the server, you can use pg_read_binary_file() for that purpose.

/Joel

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Isaac Morland
Дата:
On Thu, 6 May 2021 at 12:02, Joel Jacobson <joel@compiler.org> wrote:
On Thu, May 6, 2021, at 13:41, Isaac Morland wrote:
Yes! A significant missing feature is “take this arbitrary bucket of bits and move it to/from the database from/to this file without modification of any kind”. There are all sorts of tutorials on the Web about how to almost do it or fake it or convert to hex or whatever but I’ve never seen a nice simple explanation of “here is the command:”.

Of course there is a lot to think about. Coming out of the database, the query result must be exactly one row containing exactly one column; how do we handle other database results? Coming into the database, only one data value can come from a single file; so how do we populate the other columns of whatever table we copy to?
If the file is on the server, you can use pg_read_binary_file() for that purpose.

Yes, sorry, I should have explicitly said “on the client” somewhere up there. Getting files from the DB server into the DB is no problem.

Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
Andrew Dunstan
Дата:
On 5/6/21 7:41 AM, Isaac Morland wrote:
> On Thu, 6 May 2021 at 02:21, Darafei "Komяpa" Praliaskouski
> <me@komzpa.net <mailto:me@komzpa.net>> wrote:
>  
>
>     What I would prefer is some new COPY mode like RAW that will just
>     push whatever it gets on the stdin/input into the cell on the
>     server side. This way it can be proxied by psql, utilize existing
>     infra for passing streams and be used in shell scripting.
>
>
> Yes! A significant missing feature is “take this arbitrary bucket of
> bits and move it to/from the database from/to this file without
> modification of any kind”. There are all sorts of tutorials on the Web
> about how to almost do it or fake it or convert to hex or whatever but
> I’ve never seen a nice simple explanation of “here is the command:”.
>
> Of course there is a lot to think about. Coming out of the database,
> the query result must be exactly one row containing exactly one
> column; how do we handle other database results? Coming into the
> database, only one data value can come from a single file; so how do
> we populate the other columns of whatever table we copy to?
>


What I'd like is something a bit more general. We could extend the FDW
API to allow for a CopyStdin handler or some such. The input could be
sent unfiltered to the handler, which would do whatever it liked with
it. That way FDWs like file_fdw and file_text_array_fdw could read from
stdin, for example. I'm pretty sure it would handle the OP's use case.
The downside is you'd need to write an FDW handler, but that's not too
hard, and there are lots of examples.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Daniel Verite"
Дата:
    Darafei "Komяpa" Praliaskouski wrote:

> What I would prefer is some new COPY mode like RAW that will just push
> whatever it gets on the stdin/input into the cell on the server side. This
> way it can be proxied by psql, utilize existing infra for passing streams
> and be used in shell scripting.

COPY RAW has been proposed and discussed quite a bit previously:
https://commitfest.postgresql.org/12/676/


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite



Re: COPY table_name (single_column) FROM 'iso-8859-1.txt' DELIMITER E'\n'

От
"Joel Jacobson"
Дата:
What about:

   COPY ... FROM ... WITH PATTERN 'regexp_pattern'

Where the columns would be matched with the capture groups.

This could handle the quite common case of varying white-space as column separators:

   COPY log (col1, col2, col3) FROM 'log.txt' WITH PATTERN '^(\S+)\s+(\S+)\s+(\S+)$'

This could also handle $SUBJECT:

   COPY table_name (single_column) FROM 'unknown.txt' WITH PATTERN '^(.*)$';

And lots of other more complex use-cases.

/Joel