I'm playing with various data models to compare performance and practicalities, and I'm not sure whether I should be surprised by the numbers I'm getting. I hope this report isn't too wishy-washy for reasoned comment.
One model says a genotype is defined as follows:
Table "public.oldstyle" +-------------+--------------+-----------+ | Column | Type | Modifiers | +-------------+--------------+-----------+ | id | uuid | not null | | sample_name | text | not null | | marker_name | text | not null | | allele1 | character(1) | | | allele2 | character(1) | | +-------------+--------------+-----------+ (0. id is a Primary Key) (1. Take what you will from the table name.) (2. I hadn't thought of "char" type at this point) (3. Ultimately the names would become ids, RI included) (4. We're loading 39 samples and ~950K markers)
I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours (800+ records/sec). Then I tried COPY, and killed that after 11.25 hours when I realised that I had added a non-unique index on the name fields after the first load. By that point it was on line 28301887, so ~75% done, which implies it would have taken ~15 hours to complete.
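For reference, the load loop was essentially the following. This is only a minimal sketch of the batched jOOQ insert: the connection details, input file name, tab-separated layout, and client-side UUID generation are my assumptions, not the actual code.

import org.jooq.DSLContext;
import org.jooq.Query;
import org.jooq.SQLDialect;
import org.jooq.impl.DSL;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class GenotypeLoader {
    // Flush a batch every 1000 lines, as in the run described above.
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/genotypes", "user", "pass")) { // hypothetical connection
            DSLContext ctx = DSL.using(conn, SQLDialect.POSTGRES);
            List<Query> batch = new ArrayList<>(BATCH_SIZE);

            try (BufferedReader in = Files.newBufferedReader(Path.of("genotypes.tsv"))) { // hypothetical file
                String line;
                while ((line = in.readLine()) != null) {
                    // Assumed layout: sample_name, marker_name, allele1, allele2.
                    String[] f = line.split("\t");
                    batch.add(ctx.insertInto(DSL.table(DSL.name("oldstyle")))
                            .columns(DSL.field(DSL.name("id")),
                                     DSL.field(DSL.name("sample_name")),
                                     DSL.field(DSL.name("marker_name")),
                                     DSL.field(DSL.name("allele1")),
                                     DSL.field(DSL.name("allele2")))
                            .values(UUID.randomUUID(), f[0], f[1], f[2], f[3]));
                    if (batch.size() == BATCH_SIZE) {
                        ctx.batch(batch).execute();   // one round trip per 1000 rows
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                ctx.batch(batch).execute();           // flush the remainder
            }
        }
    }
}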
Would the overhead of the index likely explain this decrease in throughput?
Absolutely.
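The usual fix is to drop the index, bulk-load, and rebuild the index once at the end. A sketch of that pattern using the PostgreSQL JDBC driver's CopyManager; the index name, connection details, and file name are made up, and the input file is assumed to carry all five columns, uuid included.

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CopyLoader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/genotypes", "user", "pass")) { // hypothetical connection
            try (Statement st = conn.createStatement()) {
                // Every insert otherwise has to maintain the btree as well;
                // dropping it first and rebuilding once is far cheaper.
                st.execute("DROP INDEX IF EXISTS oldstyle_names_idx"); // hypothetical index name
            }

            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            try (Reader in = Files.newBufferedReader(Path.of("genotypes.tsv"))) { // hypothetical file
                // COPY ... FROM STDIN defaults to tab-separated text input.
                long rows = copy.copyIn(
                        "COPY oldstyle (id, sample_name, marker_name, allele1, allele2) FROM STDIN",
                        in);
                System.out.println("Loaded " + rows + " rows");
            }

            try (Statement st = conn.createStatement()) {
                // One bulk index build instead of 37M incremental updates.
                st.execute("CREATE INDEX oldstyle_names_idx ON oldstyle (sample_name, marker_name)");
            }
        }
    }
}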
Impatience got the better of me and I killed the second COPY as well. This time it had done 54% of the file in 6.75 hours, extrapolating to roughly 12.5 hours to do the whole thing.
Are you sure you actually dropped the indices? (And the primary key?)
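You can verify from the application side by listing what the standard pg_indexes view still reports for the table (the connection details here are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IndexCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/genotypes", "user", "pass"); // hypothetical connection
             Statement st = conn.createStatement();
             // pg_indexes also lists the unique index backing the primary key.
             ResultSet rs = st.executeQuery(
                     "SELECT indexname, indexdef FROM pg_indexes" +
                     " WHERE tablename = 'oldstyle'")) {
            while (rs.next()) {
                System.out.println(rs.getString("indexname") + ": "
                        + rs.getString("indexdef"));
            }
        }
    }
}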
I get about 375,000 lines per second with no indexes, triggers, constraints.