[BioC] RdbiPgSQL: slowness in dbWriteTable

Sean Davis sdavis2 at mail.nih.gov
Tue Sep 4 22:45:45 CEST 2007


Jeff Gentry wrote:
> Hi there ...
> 
> In the past I've had some difficulty inserting some larger tables into
> postgres using dbWriteTable() from RdbiPgSQL, in terms of speed
> performance but now I've run into some data.frames which are just
> too large (although they don't seem to be all that large in practical
> terms).
> 
> As an example, I was attempting to insert the assayData from two
> SnpCallSets into a DB over the weekend, 6 tables each of ~350 columns and
> 250K rows.  When I came back this morning (~3.5 days), it hadn't even
> completed one of these tables, with both CPU and RAM maxxed (on a machine
> w/ 16GB of RAM and a pretty decent CPU).
> 
> Anyone have ideas on a faster method to get these tables inserted?  One
> possibility that I've found in the past to be quicker is to manually loop
> INSERT statements, but the problem there is detection of data types (which
> I believe is the real slowdown from the RdbiPgSQL package in the first
> place) and proper quoting/non-quoting.

Hi, Jeff.  The fastest way to get data into postgresql is to use some
version of the copy protocol.  I do not know what RdbiPgSQL uses for
doing dbWriteTable.  I did a quick test on my machine (4Gb RAM,
3-year-old Mac G5) and was able to insert 250,000 rows by 350 columns of
floats in about 2 minutes.  In order to do this:

1)  Create the table.  You can generate the create statements in R using
the class of the various columns of your dataframe and combinations of
paste.  Alternatively, use dbWriteTable with a couple of rows to get the
table creation done automatically.  If you use dbWriteTable, do not
forget to do a TRUNCATE before loading data.

2)  Write your data.frame to disk, tab-delimited text, no column names,
with NULL values as "\N"

3)  start up psql and do:

  \copy tablename from filename

Generally, looping over inserts (or any form of insert statement) will
be much slower than using "copy".  Also, make sure that you drop indexes
before using copy (or doing inserts, for that matter)--these will kill
you on a large insert to a large table.  If you do need to do inserts,
be sure to issue a "begin" before doing the inserts and a "commit" at
the end; isolating things in a transaction block will gain you speed.

Hope that helps.

Sean



More information about the Bioconductor mailing list