[R] RMySQL - Bulk loading data and creating FK links

Matthew Dowle mdowle at mdowle.plus.com
Wed Jan 27 14:56:20 CET 2010


How many columns, and of what type are the columns ? As Olga asked too, it 
would be useful to know more about what you're really trying to do.

3.5m rows is not actually that many rows, even for 32bit R.  Its depends on 
the columns and what you want to do with those columns.

At the risk of suggesting something before we know the full facts,  one 
possibility is to load the data from flat file into data.table. Use setkey() 
to set your keys. Use tables() to summarise your various tables. Then do 
your joins etc all-in-R.  data.table has fast ways to do those sorts of 
joins (but we need more info about your task).

Alternatively,  you could check out the sqldf website. There is an 
sqlread.csv (or similar name) which can read your files directly into SQL 
instead of going via R. Gabor has some nice examples there about that and 
its faster.

You use some buzzwords which makes me think that SQL may not be appropriate 
for your task though.  Can't say for sure (because we don't have enough 
information) but its possible you are struggling because SQL has no row 
ordering concept built in.  That might be why you've created an increment 
field?  Do your queries include "order by incrementing field"? SQL is not 
good at "first" and "last" type logic. An all-in-R solution may well be 
better, since R is very good with ordered vectors. A 1GB data.table (or 
data.frame) for example,  at 3.5m rows,  could have 76 integer columns, or 
38 double columns. 1GB is well within 32bit and allows some space for 
working copies, depending on what you want to do with the data.  If you have 
38 or less columns,  or you have 64bit,  then an all-in-R solution *might* 
get your task done quicker,  depending on what your real goal is.

If this sounds plausible,  you could post more details and, if its 
appropriate, and luck is on your side, someone might even sketch out how to 
do an all-in-R solution.


"Nathan S. Watson-Haigh" <nathan.watson-haigh at csiro.au> wrote in message 
news:4B5FDE1B.10806 at csiro.au...
>I have a table (contact) with several fields and it's PK is an auto 
>increment field. I'm bulk loading data to this table from files which if 
>successful will be about 3.5million rows (approx 16000 rows per file). 
>However, I have a linking table (an_contact) to resolve a m:m relationship 
>between the an and contact tables. How can I retrieve the PK's for the data 
>bulk loaded into contact so I can insert the relevant data into an_contact.
>
> I currently load the data into contact using:
> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>
> But I then need to get all the PK's which this dbWriteTable() appended to 
> the contact table so I can load the data into my an_contact link table. I 
> don't want to issue a separate INSERT query for each row in dat and then 
> use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows to 
> insert!
>
> Any pointers welcome,
> Nathan
>
> -- 
> --------------------------------------------------------
> Dr. Nathan S. Watson-Haigh
> OCE Post Doctoral Fellow
> CSIRO Livestock Industries
> University Drive
> Townsville, QLD 4810
> Australia
>
> Tel: +61 (0)7 4753 8548
> Fax: +61 (0)7 4753 8600
> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>



More information about the R-help mailing list