[R] merging and working with BIG data sets. Is sqldf the best way??

Fri Oct 15 12:14:33 CEST 2010

Thanks for the advice Gabor,

I was indeed not starting and finishing with sqldf(). Which was why it was
not working for me. Please forgive a blatantly obvious mistake.

I have tried what U suggested and unfortunately R is still having problems
doing the join. The problem seems to be one of memory since I am receiving
the below error message when I run the natural join using sqldf.
Error: cannot allocate vector of size 250.0 Mb
Timing stopped at: 32.8 0.48 34.79

I have tried it on a subset of the data and it works. So I think it's a
memory issue, being caused by my very large dataset (11 million rows and 2
columns).

I think I may have to admit that R cannot do this (on my machine). And try
doing it in a full blown database such as postgre.

Unless U (or anyone else) have any other suggestions???

Thanks again for your help.

For anyone who's interested here's all my code and R log.
##
# Info on input data
##
> class(A)
[1] "data.frame"
> class(B)
[1] "data.frame"

> names(A)
[1] "POINTID"  "alistair"
> names(B)
[1] "POINTID"        "alistair_range"

> dim(A)
[1] 11048592        2
> dim(B)
[1] 11048592        2

##
# Tried the join with an index on the entire data set
##
> sqldf()
<SQLiteConnection:(3852,0)>

>  system.time(sqldf("create index ai1 on A(POINTID, alistair)"))
   user  system elapsed
  76.85    0.34   79.67

>  system.time(sqldf("create index ai2 on B(POINTID, alistair_range)"))
   user  system elapsed
  75.43    0.43   77.16

> system.time(sqldf("select * from main.A natural join main.B"))
Error: cannot allocate vector of size 250.0 Mb
Timing stopped at: 32.8 0.48 34.79

> sqldf()
Error in sqliteCloseConnection(conn, ...) :
  RS-DBI driver: (close the pending result sets before closing this
connection)

##
# Also tried the join with an index built from only the variable I intend
to merge on, since I wasn't exactly sure which index was correct.
##
> sqldf()
<SQLiteConnection:(3852,1)>

> system.time(sqldf("create index ai1 on A(POINTID)"))
   user  system elapsed
  66.67    0.44   69.28

> system.time(sqldf("create index ai2 on B(POINTID)"))
   user  system elapsed
  68.18    0.31   68.73

> system.time(sqldf("select * from main.A natural join main.B"))
Error: cannot allocate vector of size 31.2 Mb
Timing stopped at: 10.56 0.04 10.87

> sqldf()
Error in sqliteCloseConnection(conn, ...) :
  RS-DBI driver: (close the pending result sets before closing this
connection)

##
# and some memory info
##
> memory.size()
[1] 412.6

> memory.size(NA)
[1] 4095

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP development, Data Analysis,
Modelling, and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris at trickysolutions.com.au

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Friday, 15 October 2010 1:03 PM
To: Chris Howden
Cc: r-help at r-project.org
Subject: Re: [R] merging and working with BIG data sets. Is sqldf the best
way??

On Thu, Oct 14, 2010 at 10:56 PM, Chris Howden
<chris at trickysolutions.com.au> wrote:
> Thanks for the suggestion and code Gabor,
>
> I've tried creating 2 indices:
>
> 1) just for the variable I intend to merge on
> 2) on the entire data set I am merging (which I think is the one I
should
> be using??)
>
> However neither seemed to work. The first was still going after 2 hours,
> and the second after 12 hours, so I stopped the join.
>
> If it's not too much bother I was wondering if U could let me know which
> index I should be using?
>
>
> Or in other words since I plan to merge using POINTID do I create an
index
> on
>
> system.time(sqldf("create index ai1 on A(POINTID)"))
> system.time(sqldf("create index ai2 on B(POINTID)"))
>
> or
>
> system.time(sqldf("create index ai1 on A(POINTID,alistair)"))
> system.time(sqldf("create index ai2 on B(POINTID, alistair_range)")
>
>
>
> I'm now using the following join statement
> system.time(data2 <- sqldf("select * from A natural join B"))
>

If you only ran the three sqldf statements you mentioned in your post
(thereby omitting 2 of the 5 sqldf calls in example 4i):

sqldf("create...")
sqldf("create...")
sqldf("select...")

then what you are doing is to create a database, upload your data to
it, create an index on it, destroy the database, then create a second
database, upload the data to this second database, create an second
index and then destroy that database too and then finally create a
third database, upload the data to it and then do a join without any
indexes.

You must bracket all this with empty sqldf calls as shown in 4i to
force persistence:

sqldf()
sqldf("create...")
sqldf("create...")
sqldf("select...")
sqldf()

or else put all of the sql statements in a vector to one sqldf call:

sqldf(c("create...", "create...", "select..."))

Also you can replace "select ..." with "explain query plan select
...", and it will let you know which indexes its actually using.  e.g.
in example 4i if we do that:

> sqldf("explain query plan select * from main.DF1 natural join main.DF2")
  order from                   detail
1     0    0                TABLE DF1
2     1    1 TABLE DF2 WITH INDEX ai2

we see that it really only used index ai2 and not index ai1 at all so
we could have saved the 19 seconds that it took to create ai1 as it
was never used.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com