[R] RMySQL - Bulk loading data and creating FK links

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 28 14:52:57 CET 2010


Are you claiming that SQL is that utopia?  SQL is a row store.  It cannot 
give the user the benefits of column store.

For example, why does SQL take 113 seconds in the example in this thread :
http://tolstoy.newcastle.edu.au/R/e9/help/10/01/1872.html
but data.table takes 5 seconds to get the same result ? How come the high 
level language SQL doesn't appear to hide the user from this detail ?

If you are just describing utopia, then of course I agree.  It would be 
great to have a language which hid us from this.  In the meantime the user 
has choices, and the best choice depends on the task and the real goal.

"Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message 
news:971536df1001280428p345f8ff4v5f3a80c13f96dd15 at mail.gmail.com...
Its only important internally.  Externally its undesirable that the
user have to get involved in it.  The idea of making software easy to
write and use is to hide the implementation and focus on the problem.
That is why we use high level languages, object orientation, etc.

On Thu, Jan 28, 2010 at 4:37 AM, Matthew Dowle <mdowle at mdowle.plus.com> 
wrote:
> How it represents data internally is very important, depending on the real
> goal :
> http://en.wikipedia.org/wiki/Column-oriented_DBMS
>
>
> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
> news:971536df1001271710o4ea62333l7f1230b860114c7e at mail.gmail.com...
> How it represents data internally should not be important as long as
> you can do what you want. SQL is declarative so you just specify what
> you want rather than how to get it and invisibly to the user it
> automatically draws up a query plan and then uses that plan to get the
> result.
>
> On Wed, Jan 27, 2010 at 12:48 PM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Exactly. SQL requires use of order by. It knows the order, but it isn't
>> ordered. Thats not good, but might be fine, depending on what the real
>> goal
>> is.
>>
>>
>> "Gabor Grothendieck" <ggrothendieck at gmail.com> wrote in message
>> news:971536df1001270629w4795da89vb7d77af6e4e8be7f at mail.gmail.com...
>> On Wed, Jan 27, 2010 at 8:56 AM, Matthew Dowle <mdowle at mdowle.plus.com>
>> wrote:
>>> How many columns, and of what type are the columns ? As Olga asked too,
>>> it
>>> would be useful to know more about what you're really trying to do.
>>>
>>> 3.5m rows is not actually that many rows, even for 32bit R. Its depends
>>> on
>>> the columns and what you want to do with those columns.
>>>
>>> At the risk of suggesting something before we know the full facts, one
>>> possibility is to load the data from flat file into data.table. Use
>>> setkey()
>>> to set your keys. Use tables() to summarise your various tables. Then do
>>> your joins etc all-in-R. data.table has fast ways to do those sorts of
>>> joins (but we need more info about your task).
>>>
>>> Alternatively, you could check out the sqldf website. There is an
>>> sqlread.csv (or similar name) which can read your files directly into 
>>> SQL
>>
>> read.csv.sql
>>
>>> instead of going via R. Gabor has some nice examples there about that 
>>> and
>>> its faster.
>>>
>>> You use some buzzwords which makes me think that SQL may not be
>>> appropriate
>>> for your task though. Can't say for sure (because we don't have enough
>>> information) but its possible you are struggling because SQL has no row
>>> ordering concept built in. That might be why you've created an increment
>>
>> In the SQLite database it automatically assigns a self incrementing
>> hidden column called rowid to each row. e.g. using SQLite via the
>> sqldf package on CRAN and the BOD data frame which is built into R we
>> can display the rowid column explicitly by referring to it in our
>> select statement:
>>
>>> library(sqldf)
>>> BOD
>> Time demand
>> 1 1 8.3
>> 2 2 10.3
>> 3 3 19.0
>> 4 4 16.0
>> 5 5 15.6
>> 6 7 19.8
>>> sqldf("select rowid, * from BOD")
>> rowid Time demand
>> 1 1 1 8.3
>> 2 2 2 10.3
>> 3 3 3 19.0
>> 4 4 4 16.0
>> 5 5 5 15.6
>> 6 6 7 19.8
>>
>>
>>> field? Do your queries include "order by incrementing field"? SQL is not
>>> good at "first" and "last" type logic. An all-in-R solution may well be
>>
>> In SQLite you can get the top 3 values, say, like this (continuing the
>> prior example):
>>
>>> sqldf("select * from BOD order by Time desc limit 3")
>> Time demand
>> 1 7 19.8
>> 2 5 15.6
>> 3 4 16.0
>>
>>> better, since R is very good with ordered vectors. A 1GB data.table (or
>>> data.frame) for example, at 3.5m rows, could have 76 integer columns, or
>>> 38 double columns. 1GB is well within 32bit and allows some space for
>>> working copies, depending on what you want to do with the data. If you
>>> have
>>> 38 or less columns, or you have 64bit, then an all-in-R solution *might*
>>> get your task done quicker, depending on what your real goal is.
>>>
>>> If this sounds plausible, you could post more details and, if its
>>> appropriate, and luck is on your side, someone might even sketch out how
>>> to
>>> do an all-in-R solution.
>>>
>>>
>>> "Nathan S. Watson-Haigh" <nathan.watson-haigh at csiro.au> wrote in message
>>> news:4B5FDE1B.10806 at csiro.au...
>>>>I have a table (contact) with several fields and it's PK is an auto
>>>>increment field. I'm bulk loading data to this table from files which if
>>>>successful will be about 3.5million rows (approx 16000 rows per file).
>>>>However, I have a linking table (an_contact) to resolve a m:m
>>>>relationship
>>>>between the an and contact tables. How can I retrieve the PK's for the
>>>>data
>>>>bulk loaded into contact so I can insert the relevant data into
>>>>an_contact.
>>>>
>>>> I currently load the data into contact using:
>>>> dbWriteTable(con, "contact", dat, append=TRUE, row.names=FALSE)
>>>>
>>>> But I then need to get all the PK's which this dbWriteTable() appended
>>>> to
>>>> the contact table so I can load the data into my an_contact link table.
>>>> I
>>>> don't want to issue a separate INSERT query for each row in dat and 
>>>> then
>>>> use MySQLs LAST_INSERT_ID() function....not when I have 3.5million rows
>>>> to
>>>> insert!
>>>>
>>>> Any pointers welcome,
>>>> Nathan
>>>>
>>>> --
>>>> --------------------------------------------------------
>>>> Dr. Nathan S. Watson-Haigh
>>>> OCE Post Doctoral Fellow
>>>> CSIRO Livestock Industries
>>>> University Drive
>>>> Townsville, QLD 4810
>>>> Australia
>>>>
>>>> Tel: +61 (0)7 4753 8548
>>>> Fax: +61 (0)7 4753 8600
>>>> Web: http://www.csiro.au/people/Nathan.Watson-Haigh.html
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



More information about the R-help mailing list