[R] merging and working with BIG data sets. Is sqldf the best way??

Chris Howden chris at trickysolutions.com.au
Fri Oct 15 04:56:28 CEST 2010


Thanks for the suggestion and code Gabor,

I've tried creating 2 indices:

1) just for the variable I intend to merge on
2) on the entire data set I am merging (which I think is the one I should
be using??)

However neither seemed to work. The first was still going after 2 hours,
and the second after 12 hours, so I stopped the join.

If it's not too much bother I was wondering if U could let me know which
index I should be using?


Or in other words since I plan to merge using POINTID do I create an index
on

system.time(sqldf("create index ai1 on A(POINTID)"))
system.time(sqldf("create index ai2 on B(POINTID)"))

or

system.time(sqldf("create index ai1 on A(POINTID,alistair)"))
system.time(sqldf("create index ai2 on B(POINTID, alistair_range)")



I'm now using the following join statement
system.time(data2 <- sqldf("select * from A natural join B"))


thanks

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP development, Data Analysis,
Modelling, and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris at trickysolutions.com.au


-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Thursday, 14 October 2010 9:02 AM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] merging and working with BIG data sets. Is sqldf the best
way??

On Tue, Oct 12, 2010 at 2:39 AM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> I’m working with some very big datasets (each dataset has 11 million
rows
> and 2 columns). My first step is to merge all my individual data sets
> together (I have about 20)
>
> I’m using the following command from sqldf
>
>               data1 <- sqldf("select A.*, B.* from A inner join B
> using(ID)")
>
> But it’s taking A VERY VERY LONG TIME to merge just 2 of the datasets
(well
> over 2 hours, possibly longer since it’s still going).

You need to add indexes to your tables.   See example 4i on the sqldf home
page
http://sqldf.googlecode.com
This can result in huge speedups for large tables.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



More information about the R-help mailing list