[R] RMySQL - Bulk loading data and creating FK links
Gabor Grothendieck
ggrothendieck at gmail.com
Wed Jan 27 15:29:14 CET 2010
On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> How many columns, and of what type are the columns ? As Olga asked too, it
> would be useful to know more about what you're really trying to do.
>
> 3.5m rows is not actually that many rows, even for 32bit R. Its depends on
> the columns and what you want to do with those columns.
>
> At the risk of suggesting something before we know the full facts, one
> possibility is to load the data from flat file into data.table. Use setkey()
> to set your keys. Use tables() to summarise your various tables. Then do
> your joins etc all-in-R. data.table has fast ways to do those sorts of
> joins (but we need more info about your task).
>
> Alternatively, you could check out the sqldf website. There is an
> sqlread.csv (or similar name) which can read your files directly into SQL
read.csv.sql
> instead of going via R. Gabor has some nice examples there about that and
> its faster.
>
> You use some buzzwords which makes me think that SQL may not be appropriate
> for your task though. Can't say for sure (because we don't have enough
> information) but its possible you are struggling because SQL has no row
> ordering concept built in. That might be why you've created an increment
In the SQLite database it automatically assigns a self incrementing
hidden column called rowid to each row. e.g. using SQLite via the
sqldf package on CRAN and the BOD data frame which is built into R we
can display the rowid column explicitly by referring to it in our
select statement:
> library(sqldf)
> BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
> sqldf("select rowid, * from BOD")
rowid Time demand
1 1 1 8.3
2 2 2 10.3
3 3 3 19.0
4 4 4 16.0
5 5 5 15.6
6 6 7 19.8
> field? Do your queries include "order by incrementing field"? SQL is not
> good at "first" and "last" type logic. An all-in-R solution may well be
In SQLite you can get the top 3 values, say, like this (continuing the
prior example):
> sqldf("select * from BOD order by Time desc limit 3")
Time demand
1 7 19.8
2 5 15.6
3 4 16.0
> better, since R is very good with ordered vectors. A 1GB data.table (or
> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
> 38 double columns. 1GB is well within 32bit and allows some space for
> working copies, depending on what you want to do with the data. If you have
> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
> get your task done quicker, depending on what your real goal is.
>
> If this sounds plausible, you could post more details and, if its
> appropriate, and luck is on your side, someone might even sketch out how to
> do an all-in-R solution.
>
>
> "Nathan S. Watson-Haigh" <nathan.watson-haigh at csiro.au> wrote in message
> news:4B5FDE1B.10806 at csiro.au...
>>I have a table (contact) with several fields and it's PK is an auto
>>increment field. I'm bulk loading data to this table from files which if
>>successful will be about 3.5million rows (approx 16000 rows per file).
>>However, I have a linking table (an_contact) to resolve a m:m relationship
>>between the an and contact tables. How can I retrieve the PK's for the data
>>bulk loaded into contact so I can insert the relevant data into an_contact.
>>
>> I currently load the data into contact using:
>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>
>> But I then need to get all the PK's which this dbWriteTable() appended to
>> the contact table so I can load the data into my an_contact link table. I
>> don't want to issue a separate INSERT query for each row in dat and then
>> use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows to
>> insert!
>>
>> Any pointers welcome,
>> Nathan
>>
>> --
>> --------------------------------------------------------
>> Dr. Nathan S. Watson-Haigh
>> OCE Post Doctoral Fellow
>> CSIRO Livestock Industries
>> University Drive
>> Townsville, QLD 4810
>> Australia
>>
>> Tel: +61 (0)7 4753 8548
>> Fax: +61 (0)7 4753 8600
>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list