[R] Merging big data sets

William Dunlap wdunlap at tibco.com
Mon Sep 9 19:10:39 CEST 2013


Your sample code does not run, and you didn't show the format of
your inputs.  If your 'ttoevP' is a table without duplicate Start/End
pairs that maps Start/End to OEV.T, then using match() and subscripting
can be quicker than merge().  E.g.,

   f0 <- function (dataP, ttoevP)
   {
       # encode each Start/End pair as a single string so match() can look it up
       encode <- function(df) paste(df$Start, df$End, sep = "\r")
       if (anyDuplicated(ttoevPEncoded <- encode(ttoevP))) {
           stop("duplicated Start/End pairs in ttoevP")
       }
       # find each dataP row's Start/End pair in ttoevP and copy OEV.T across
       i <- match(encode(dataP), ttoevPEncoded)
       dataP$OEV.T <- ttoevP$OEV.T[i]
       dataP
   }

I made sample inputs with

makeData <- function (nrow, nTimes)
{
    # random Start/End pairs with Start <= End
    Start <- trunc(runif(nrow, 1, nTimes))
    End <- trunc(runif(nrow, Start, nTimes))
    dataP <- data.frame(Start = Start, End = End)
    # lookup table with one OEV.T value for each Start <= End pair
    ttoevP <- expand.grid(Start = seq_len(nTimes), End = seq_len(nTimes))
    ttoevP <- ttoevP[ttoevP$Start <= ttoevP$End, ]
    ttoevP$OEV.T <- paste(ttoevP$Start, ttoevP$End, sep = "-")
    list(dataP = dataP, ttoevP = ttoevP)
}

For nrow=4*10^4 and nTimes=10^3, f0 took 1.5 seconds and merge took 10.5.
Aside from row order, they produced the same output.
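
The comparison can be reproduced with something along these lines
(f1 below is simply a plain left merge() on Start/End, written here
as a stand-in for the merge approach):

    f1 <- function(dataP, ttoevP)
        merge(dataP, ttoevP, by = c("Start", "End"), all.x = TRUE)
    inp <- makeData(nrow = 4*10^4, nTimes = 10^3)
    system.time(r0 <- f0(inp$dataP, inp$ttoevP))
    system.time(r1 <- f1(inp$dataP, inp$ttoevP))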

You can make your looping solution faster by removing repeated
operations from the loop, especially operations on a data.frame
(vector operations are much faster, and no operation at all is
faster still).  E.g., replace
    for(i in seq_len(nrow(df))) {
         df$column[i] <- func(i)
    }
with
    column <- df$column
    for(i in seq_len(nrow(df))) {
        column[i] <- func(i)
    }
    df$column <- column
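
A toy illustration of the difference (df, its size, and func are made
up here just to show the pattern):

    df <- data.frame(column = numeric(10^4))
    func <- function(i) i^2
    # slow: subassigns into the data.frame on every iteration
    system.time(for (i in seq_len(nrow(df))) df$column[i] <- func(i))
    # faster: work on an ordinary vector and write it back once
    system.time({
        column <- df$column
        for (i in seq_len(nrow(df))) column[i] <- func(i)
        df$column <- column
    })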

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Renger van Nieuwkoop
> Sent: Monday, September 09, 2013 5:45 AM
> To: r-help at r-project.org
> Subject: [R] Merging big data sets
> 
> Hi
> I have 6 rather big data sets (between 400000 and 800000 lines) on transport data (times,
> distances and travelers between nodes). They all have a common index (start-end nodes).
> I want to aggregate this data, but for that I have to merge them.
> I tried to use "merge" with the result that R (3.0.1) crashes (Windows 8 machine, 16 Gb
> Ram).
> Then I tried the join from the data.table package. Here I got the message that 2^34 is too
> big (no idea why it is 2^34 as it is a left join).
> Then I decided to do a loop using the tables and assigning them, which takes a very, very
> long time (still running at the moment).
> 
> Here is the code:
> for (i in 1:length(dataP$Start)){
>     c<-dataP$Start[i]
>     d<-dataP$End[i]
>     dataP[J(c,d)]$OEV.T<-ttoevP[J(c,d)]$OEV.T
> }
> 
> dataP has 800'000 lines and ttoevP has about 500'000 lines.
> 
> Any hints to speed up this process are welcome.
> 
> Renger
> _________________________________________
> Centre of Economic Research (CER-ETH)
> Zürichbergstrasse 18 (ZUE)
> CH - 8032 Zürich
> +41 44 632 02 63
> rengerv at etzh.ch
> blog.modelworks.ch
> 
> 
> 	[[alternative HTML version deleted]]
