[R] RMySQL - Bulk loading data and creating FK links

Gabor Grothendieck ggrothendieck at gmail.com
Wed Jan 27 15:29:14 CET 2010


On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> How many columns, and of what type are the columns ? As Olga asked too, it
> would be useful to know more about what you're really trying to do.
>
> 3.5m rows is not actually that many rows, even for 32bit R.  Its depends on
> the columns and what you want to do with those columns.
>
> At the risk of suggesting something before we know the full facts,  one
> possibility is to load the data from flat file into data.table. Use setkey()
> to set your keys. Use tables() to summarise your various tables. Then do
> your joins etc all-in-R.  data.table has fast ways to do those sorts of
> joins (but we need more info about your task).
>
> Alternatively,  you could check out the sqldf website. There is an
> sqlread.csv (or similar name) which can read your files directly into SQL

read.csv.sql

> instead of going via R. Gabor has some nice examples there about that and
> its faster.
>
> You use some buzzwords which makes me think that SQL may not be appropriate
> for your task though.  Can't say for sure (because we don't have enough
> information) but its possible you are struggling because SQL has no row
> ordering concept built in.  That might be why you've created an increment

In the SQLite database it automatically assigns a self incrementing
hidden column called rowid to each row. e.g. using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R we
can display the rowid column explicitly by referring to it in our
select statement:

> library(sqldf)
> BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
> sqldf("select rowid, * from BOD")
  rowid Time demand
1     1    1    8.3
2     2    2   10.3
3     3    3   19.0
4     4    4   16.0
5     5    5   15.6
6     6    7   19.8


> field?  Do your queries include "order by incrementing field"? SQL is not
> good at "first" and "last" type logic. An all-in-R solution may well be

In SQLite you can get the top 3 values, say, like this (continuing the
prior example):

> sqldf("select * from BOD order by Time desc limit 3")
  Time demand
1    7   19.8
2    5   15.6
3    4   16.0

> better, since R is very good with ordered vectors. A 1GB data.table (or
> data.frame) for example,  at 3.5m rows,  could have 76 integer columns, or
> 38 double columns. 1GB is well within 32bit and allows some space for
> working copies, depending on what you want to do with the data.  If you have
> 38 or less columns,  or you have 64bit,  then an all-in-R solution *might*
> get your task done quicker,  depending on what your real goal is.
>
> If this sounds plausible,  you could post more details and, if its
> appropriate, and luck is on your side, someone might even sketch out how to
> do an all-in-R solution.
>
>
> "Nathan S. Watson-Haigh" <nathan.watson-haigh at csiro.au> wrote in message
> news:4B5FDE1B.10806 at csiro.au...
>>I have a table (contact) with several fields and it's PK is an auto
>>increment field. I'm bulk loading data to this table from files which if
>>successful will be about 3.5million rows (approx 16000 rows per file).
>>However, I have a linking table (an_contact) to resolve a m:m relationship
>>between the an and contact tables. How can I retrieve the PK's for the data
>>bulk loaded into contact so I can insert the relevant data into an_contact.
>>
>> I currently load the data into contact using:
>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>
>> But I then need to get all the PK's which this dbWriteTable() appended to
>> the contact table so I can load the data into my an_contact link table. I
>> don't want to issue a separate INSERT query for each row in dat and then
>> use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows to
>> insert!
>>
>> Any pointers welcome,
>> Nathan
>>
>> --
>> --------------------------------------------------------
>> Dr. Nathan S. Watson-Haigh
>> OCE Post Doctoral Fellow
>> CSIRO Livestock Industries
>> University Drive
>> Townsville, QLD 4810
>> Australia
>>
>> Tel: +61 (0)7 4753 8548
>> Fax: +61 (0)7 4753 8600
>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list