Re: Using COPY to import large xml file
From | Adrian Klaver |
---|---|
Subject | Re: Using COPY to import large xml file |
Date | |
Msg-id | d3c834a0-edb0-4363-d882-e36422ebcdc9@aklaver.com |
In reply to | Re: Using COPY to import large xml file (Anto Aravinth <anto.aravinth.cse@gmail.com>) |
List | pgsql-general |
On 06/25/2018 07:25 AM, Anto Aravinth wrote:
> Thanks a lot. But I got a lot of challenges! Looks like the SO data
> contains a lot of tabs within itself, so the tab delimiter didn't work
> for me. I thought I could give a special delimiter, but it looks like
> PostgreSQL COPY allows only one character as the delimiter :(

I use | as it is rarely found in data itself.

> Sad, I guess the only way is to insert, or do a thorough serialization
> of my data into something that COPY can understand.
>
> On Mon, Jun 25, 2018 at 8:09 AM, Tim Cross <theophilusx@gmail.com> wrote:
>
>> On Mon, 25 Jun 2018 at 11:38, Anto Aravinth
>> <anto.aravinth.cse@gmail.com> wrote:
>>
>>> On Mon, Jun 25, 2018 at 3:44 AM, Tim Cross <theophilusx@gmail.com> wrote:
>>>
>>>> Anto Aravinth <anto.aravinth.cse@gmail.com> writes:
>>>>
>>>>> Thanks for the response. I'm not sure how long this tool takes for
>>>>> the 70GB data.
>>>>>
>>>>> I used node to stream the xml files into inserts, which was very
>>>>> slow. Actually the xml contains 40 million records, out of which 10
>>>>> million took around 2 hrs using nodejs. Hence, I thought I would use
>>>>> the COPY command, as suggested on the internet.
>>>>>
>>>>> Definitely will try the code and let you know. But it looks like it
>>>>> uses the same INSERT, not COPY; interesting if it runs quickly on my
>>>>> machine.
>>>>>
>>>>> On Sun, Jun 24, 2018 at 9:23 PM, Adrien Nayrat
>>>>> <adrien.nayrat@anayrat.info> wrote:
>>>>>
>>>>>> On 06/24/2018 05:25 PM, Anto Aravinth wrote:
>>>>>>> Hello Everyone,
>>>>>>>
>>>>>>> I have downloaded the Stackoverflow posts xml (contains all SO
>>>>>>> questions till date). The file is around 70GB. I wanna import the
>>>>>>> data in those xml to my table. Is there a way to do so in postgres?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anto.
>>>>>>
>>>>>> Hello Anto,
>>>>>>
>>>>>> I used this tool:
>>>>>> https://github.com/Networks-Learning/stackexchange-dump-to-postgres
>>>>
>>>> If you are using nodejs, then you can easily use the pg-copy-streams
>>>> module to insert the records into your database. I've been using this
>>>> for inserting large numbers of records from NetCDF files. It takes
>>>> between 40 and 50 minutes to insert 60 million+ records, and we are
>>>> doing additional calculations on the values, not just inserting them;
>>>> plus we are inserting over the network, into a database which is also
>>>> performing other processing.
>>>>
>>>> We found a significant speed improvement with COPY over blocks of
>>>> insert transactions, which was faster than just individual inserts.
>>>> The only downside with using COPY is that it either completely works
>>>> or completely fails, and when it fails, it can be tricky to work out
>>>> which record is causing the failure. A benefit of using blocks of
>>>> transactions is that you have more fine-grained control, allowing you
>>>> to recover from some errors or providing more specific detail
>>>> regarding the cause of the error.
>>>
>>> Sure, let me try that. I have a question here: COPY usually works when
>>> you move data from files to your postgres instance, right? Now in
>>> node.js, processing the whole file, can I use COPY programmatically,
>>> like COPY Stackoverflow <calculated value at run time>? Because from
>>> the doc:
>>>
>>> https://www.postgresql.org/docs/9.2/static/sql-copy.html
>>>
>>> I don't see it's possible. Maybe I need to convert the files to
>>> something COPY understands first?
>>>
>>> Anto.
>>
>> Yes. Essentially what you do is create a stream and feed whatever
>> information you want to copy into that stream. PG sees the data as if
>> it was seeing each line in a file, so you push data onto the stream
>> where each item is separated by a tab (or whatever). Here is the basic
>> low-level function I use (don't know how the formatting will go!):
>>
>> async function copyInsert(sql, stringifyFN, records) {
>>   const logName = `${moduleName}.copyInsert`;
>>   var client;
>>
>>   assert.ok(Array.isArray(records), "The records arg must be an array");
>>   assert.ok(typeof(stringifyFN) === "function",
>>             "The stringifyFN arg must be a function");
>>   return getClient()
>>     .then(c => {
>>       client = c;
>>       return new Promise(function(resolve, reject) {
>>         var stream, rs;
>>         var idx = 0;
>>
>>         function done() {
>>           releaseClient(client);
>>           client = undefined;
>>           resolve(idx + 1);
>>         }
>>
>>         function onError(err) {
>>           if (client !== undefined) {
>>             releaseClient(client);
>>           }
>>           reject(new VError(err, `${logName}: COPY failed at record ${idx}`));
>>         }
>>
>>         function arrayRead() {
>>           if (idx === records.length) {
>>             rs.push(null);
>>           } else {
>>             let rec = records[idx];
>>             rs.push(stringifyFN(rec));
>>             idx += 1;
>>           }
>>         }
>>
>>         rs = new Readable();
>>         rs._read = arrayRead;
>>         rs.on("error", onError);
>>         stream = client.query(copyFrom(sql));
>>         stream.on("error", onError);
>>         stream.on("end", done);
>>         rs.pipe(stream);
>>       });
>>     })
>>     .catch(err => {
>>       throw new VError(err, `${logName} Failed COPY insert`);
>>     });
>> }
>>
>> and I will call it like
>>
>> copyInsert(sql, stringifyClimateRecord, records)
>>
>> where the sql and stringifyClimateRecord arguments are
>>
>> const sql = `COPY access_s.climate_data_ensemble_${ensemble} `
>>       + "(forecast_dt,awap_id,rain,temp_min,temp_max,rsds,"
>>       + "vprp_09,vprp_15,wind_speed) FROM STDIN";
>>
>> function stringifyClimateRecord(rec) {
>>   return `'${rec[0].format("YYYY-MM-DD")}'\t${rec[2]}\t${rec[3]}\t`
>>     + `${rec[4]}\t${rec[5]}\t${rec[6]}\t${rec[7]}\t${rec[8]}\t${rec[9]}\n`;
>> }
>>
>> stringifyClimateRecord returns a record to be inserted as a 'line' into
>> the stream, with values separated by tabs. records is an array of data
>> records where each record is an array.
>>
>> --
>> regards,
>>
>> Tim
>>
>> --
>> Tim Cross

--
Adrian Klaver
adrian.klaver@aklaver.com
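[Editor's note] On the delimiter problem raised above: with PostgreSQL's default text-format COPY, tabs and newlines embedded in the data do not force a different delimiter, because they can be escaped as \t and \n in the stream. A minimal sketch of such an escaper (the function name escapeCopyText is mine, not from the thread):

```javascript
// Escape one value for PostgreSQL text-format COPY (tab-delimited).
// Backslashes must be escaped first, then the control characters that
// COPY would otherwise read as column or row separators.
function escapeCopyText(value) {
  if (value === null || value === undefined) return "\\N"; // COPY's null marker
  return String(value)
    .replace(/\\/g, "\\\\")
    .replace(/\t/g, "\\t")
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r");
}

// A row with an embedded tab survives the default delimiter:
const line = [escapeCopyText("Q: how?\tA: like this"), escapeCopyText(null)]
  .join("\t") + "\n";
```

Each escaped value could then be joined with tabs inside a stringify function like Tim's, instead of switching the delimiter to |.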
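[Editor's note] Tim's point that COPY either completely works or completely fails suggests a middle ground: send the data in fixed-size batches, so a failure is localized to one batch. A hypothetical helper (batchRecords is not from the thread) could look like:

```javascript
// Split an array of records into batches of at most batchSize. Each
// batch can then go through a COPY (or one INSERT transaction) on its
// own, so a failure only requires inspecting batchSize records.
function batchRecords(records, batchSize) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}
```

Each batch could be fed in sequence to a function like Tim's copyInsert above.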