[R] Filling out a data frame row by row.... slow!

ilai keren at math.montana.edu
Wed Feb 15 20:41:57 CET 2012


First, in R there is no need to declare the dimensions of your objects
before they are populated, so couldn't you save some run time by
skipping the double data.frame step?
> df<- data.frame()
> df
data frame with 0 columns and 0 rows
> for(i in 1:100) for(j in 1:3) df[i,j]<- runif(1)
> str(df)
'data.frame':   100 obs. of  3 variables:
 ...
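
For comparison, here is a minimal sketch of building the same 100 x 3
frame in one step rather than cell by cell. The names m and df2 are
only for illustration and are not from the original post.

```r
# Draw all 300 values at once, then convert to a data frame in one step.
# m and df2 are hypothetical names used only for this sketch.
set.seed(1)
m <- matrix(runif(100 * 3), nrow = 100, ncol = 3)
df2 <- as.data.frame(m)   # columns get the default names V1, V2, V3
str(df2)
```

This avoids the repeated reallocation that growing df one cell at a
time can cause.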

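Second, about populating an environment: ?assign might work better for
you. (Note that the assign() call below replaces all of 'a' on every
pass instead of setting element i, so the two timings are not a
like-for-like comparison.)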
> e<- new.env()
> system.time(for(i in 1:10000) e$a[i]<- rnorm(1,i))
   user  system elapsed
   0.97    0.00    0.96
> rm(e)
> e<- new.env()
> system.time(for(i in 1:10000) assign('a',rnorm(1,i),envir=e))
   user  system elapsed
   0.17    0.00    0.17
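
When the column name is only known at run time, one possible approach
is to fill an ordinary local vector and store it in the environment
once, under the computed name. This is only a sketch: colname, n and
buf are hypothetical names, and it assumes the number of values is
known (or can be bounded) before the loop starts.

```r
e <- new.env()
colname <- "x"     # assumed: a column name read from the file at run time
n <- 10000         # assumed: number of values is known up front

buf <- numeric(n)  # preallocate a plain local vector
for (i in seq_len(n)) buf[i] <- i + 0.1

# Store the finished vector once, under the computed name.
assign(colname, buf, envir = e)
head(get(colname, envir = e))
```

The environment is touched once per column here, instead of once per
element.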

Third, how are you reading in the file, and what does "not knowing in
advance..." mean? Bill's suggestion not to populate the data.frame
line by line is probably the "real" solution to your problem;
otherwise it's a little like kicking a turtle to make it go
faster... try to find a rabbit instead.

Posting a minimal example of your file format would have really
helped. Often, using ?scan to read the whole file (or big chunks of
it) into R, followed by a custom formatting function that uses ?grep
and ?strsplit to reconstruct the data you want in columns, removes
the need to populate a data frame line by line.
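
As a rough sketch of that idea, suppose the file holds one
"name=value" record per line. That format is purely an assumption,
since the real one was not posted, and txt, key, value and dat are
hypothetical names.

```r
# Hypothetical input: "name=value" records, one per line.
txt <- c("x=1.5", "y=2.0", "x=2.5", "y=3.0", "x=3.5", "y=4.0")

# On a real file this would be something like:
#   lines <- scan("myfile.txt", what = "", sep = "\n", quiet = TRUE)
lines <- scan(text = txt, what = "", sep = "\n", quiet = TRUE)

# Keep only well-formed records, then split each into name and value.
lines <- grep("=", lines, value = TRUE, fixed = TRUE)
parts <- strsplit(lines, "=", fixed = TRUE)
key   <- vapply(parts, `[`, "", 1)
value <- as.numeric(vapply(parts, `[`, "", 2))

# Group the values by column name and build the frame in a single call.
# This assumes every column ends up with the same number of values.
dat <- as.data.frame(split(value, key))
dat
```

The column names come from the data themselves, so they do not need
to be known before the file is read.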

Hope this helps

Elai


> One complication is I don't know the names of the columns I'm assigning to
> before I read them off the file. And crazily, if I change this:
>       data$x[i] <- i + 0.1
>
> where data is an environment and x a primitive vector, to use a computed
> name instead:
>
>  data[[colname]][i] <- i + 0.1
>
> Then I get back to way-superlinear performance. Eventually I found I could
> work around it like:
>
> eval(substitute(var[ix] <- data,
>                          list(var=as.name(colname), ix=i, data = i+0.1)),
>               envir = data)
>
> but... as workarounds go that seems to be on the crazy nuts end of the
> scale. Why does [[]] impose a performance penalty?
>
> Peter
>


