[R] Extract values from multiple lists

Dénes Tóth toth.denes at ttk.mta.hu
Wed Dec 17 11:46:17 CET 2014


Dear Jeff,

On 12/17/2014 01:46 AM, Jeff Newmiller wrote:
> You are chasing ghosts of performance past, Denes.

In terms of memory efficiency, yes. In terms of CPU time, there can be a
significant difference; see below.


> The data.frame
> function causes no problems, and if it is used then the OP would not
> need to presume they know the internal structure of the data frame.
> See below. (I am using R3.1.2.)
>
> a1 <- list(x = rnorm(1e6), y = rnorm(1e6))
> a2 <- list(x = rnorm(1e6), y = rnorm(1e6))
> a3 <- list(x = rnorm(1e6), y = rnorm(1e6))
>
> # get names of the objects
> out_names <- ls(pattern="a[[:digit:]]$")
>
> # amount of memory allocated
> gc(reset=TRUE)
>
> # Explicitly call data frame
> out2 <- data.frame( a1=a1[["x"]], a2=a2[["x"]], a3=a3[["x"]] )
>
> # No copying.
> gc()
>
> # Your suggested retrieval method
> out3a <- lapply( lapply( out_names, get ), "[[", "x" )
> names( out3a ) <- out_names
> # The "obvious" way to finish the job works fine.
> out3 <- do.call( data.frame, out3a )

BTW, the even more "obvious" as.data.frame() produces the same result
with an even more intuitive interface.
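
For a small named list, the two calls can be compared side by side. This is only a minimal sketch in base R; the list `small` and its values are made up for illustration and do not come from the thread.

```r
# toy named list with equal-length components (hypothetical data)
small <- list(a1 = c(1.5, 2.5, 3.5), a2 = c(4, 5, 6), a3 = c(7, 8, 9))

# the do.call route: each list element becomes a named argument
df_a <- do.call(data.frame, small)

# the as.data.frame route: takes the list directly
df_b <- as.data.frame(small)

# compare the two results
all.equal(df_a, df_b)
```

Both calls go through the usual data.frame checks (equal lengths, valid names), which is what makes them safe but also what costs time when the list has many elements.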

However, for lists with a larger number of elements the transformation 
to a data.frame can be pretty slow. In the toy example, we created only 
a three-element list. Let's increase it a little bit.

---

# this is not even that large
datlen <- 1e2
listlen <- 1e5

# create a toy list
mylist <- matrix(seq_len(datlen * listlen),
                  nrow = datlen, ncol = listlen)
mylist <- lapply(seq_len(ncol(mylist)), function(i) mylist[, i])
names(mylist) <- paste0("V", seq_len(listlen))


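# define the more efficient function ---
# note that class(x) <- ... comes first: the assignment makes a
# (shallow) copy of the list inside the function, so setattr, which
# works by reference, does not modify the attributes of the original
# input (see ?setattr, you have to be careful)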
setAttrib <- function(x) {
     class(x) <- "data.frame"
     data.table::setattr(x, "row.names", seq_along(x[[1]]))
     x
}

# benchmarking
# (we do not need microbenchmark here because the differences are
# extremely large) - on my machine, 9.4 sec, 8.1 sec vs 0.15 sec
gc(reset=TRUE)
system.time(df1 <- do.call(data.frame, mylist))
gc()
system.time(df2 <- as.data.frame(mylist))
gc()
system.time(df3 <- setAttrib(mylist))
gc()

# check results
identical(df1, df2)
identical(df1, df3)

----

Of course, for small datasets one should use the built-in, safe
functions (either do.call(data.frame, ...) or as.data.frame()). BTW, for
the original three-element list, these are even faster than the workaround.
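
If the attribute-setting shortcut is used anyway, some of the safety of the built-in functions can be restored with a few cheap checks. The following is only a sketch under stated assumptions: the helper name `as_df_fast` is hypothetical, it uses base R (`attr<-` and `.set_row_names()`) instead of data.table::setattr, and it assumes that every list element is a vector that is valid as a data frame column.

```r
# Hypothetical helper (not from the thread): turn a named list of
# equal-length vectors into a data.frame by setting attributes only.
as_df_fast <- function(x) {
    stopifnot(is.list(x), length(x) > 0L, !is.null(names(x)))
    n <- lengths(x)
    if (any(n != n[1L])) {
        stop("all list elements must have the same length")
    }
    # compact row names, the form data.frame() itself stores
    attr(x, "row.names") <- .set_row_names(n[[1L]])
    class(x) <- "data.frame"
    x
}

# usage on a tiny example (hypothetical data)
cols <- list(V1 = 1:3, V2 = 4:6)
df_fast <- as_df_fast(cols)
all.equal(df_fast, data.frame(V1 = 1:3, V2 = 4:6))

# unequal lengths are rejected instead of producing a corrupt data frame
bad <- tryCatch(as_df_fast(list(V1 = 1:3, V2 = 1:2)),
                error = function(e) conditionMessage(e))
```

The assignments inside the function act on a copy of the list itself, not of the column vectors, so the caller's list should stay a plain list. Column names are not checked for duplicates or syntactic validity here, which data.frame() would do.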

All the best,
   Denes




>
> # No copying... well, you do end up with a new list in out3, but the
> # data itself doesn't get copied.
> gc()
>
>
> On Tue, 16 Dec 2014, Dénes Tóth wrote:
>
>> On 12/16/2014 06:06 PM, SH wrote:
>>> Dear List,
>>>
>>> I hope this posting is not redundant.  I have several list outputs
>>> with the
>>> same components.  I ran a function with three different scenarios below
>>> (e.g., scen1, scen2, and scen3,...,scenN).  I would like to extract the
>>> same components and group them as a data frame.  For example,
>>> pop.inf.r1 <- scen1[['pop.inf.r']]
>>> pop.inf.r2 <- scen2[['pop.inf.r']]
>>> pop.inf.r3 <- scen3[['pop.inf.r']]
>>> ...
>>> pop.inf.rN<-scenN[['pop.inf.r']]
>>> new.df <- data.frame(pop.inf.r1, pop.inf.r2, pop.inf.r3,...,pop.inf.rN)
>>>
>>> My final output would be 'new.df'.  Could you help me do that
>>> efficiently?
>>
>> If efficiency is of concern, do not use data.frame() but create a list
>> and add the required attributes with data.table::setattr (the setattr
>> function of the data.table package). (You can also consider creating a
>> data.table instead of a data.frame.)
>>
>> # some largish lists
>> a1 <- list(x = rnorm(1e6), y = rnorm(1e6))
>> a2 <- list(x = rnorm(1e6), y = rnorm(1e6))
>> a3 <- list(x = rnorm(1e6), y = rnorm(1e6))
>>
>> # amount of memory allocated
>> gc(reset=TRUE)
>>
>> # get names of the objects
>> out_names <- ls(pattern="a[[:digit:]]$")
>>
>> # create a list
>> out <- lapply(lapply(out_names, get), "[[", "x")
>>
>> # note that no copying occurred
>> gc()
>>
>> # decorate the list
>> data.table::setattr(out, "names", out_names)
>> data.table::setattr(out, "row.names", seq_along(out[[1]]))
>> class(out) <- "data.frame"
>>
>> # still no copy
>> gc()
>>
>> # output
>> head(out)
>>
>>
>> HTH,
>>  Denes
>>
>>
>>>
>>> Thanks in advance,
>>>
>>> Steve
>>>
>>> P.S.:  Below are some examples of summary outputs.
>>>
>>>
>>>> summary(scen1)
>>>                  Length Class  Mode
>>> aql                1   -none- numeric
>>> rql                1   -none- numeric
>>> alpha              1   -none- numeric
>>> beta               1   -none- numeric
>>> n.sim              1   -none- numeric
>>> N                  1   -none- numeric
>>> n.sample           1   -none- numeric
>>> n.acc              1   -none- numeric
>>> lot.inf.r          1   -none- numeric
>>> pop.inf.n       2000   -none- list
>>> pop.inf.r       2000   -none- list
>>> pop.decision.t1 2000   -none- list
>>> pop.decision.t2 2000   -none- list
>>> sp.inf.n        2000   -none- list
>>> sp.inf.r        2000   -none- list
>>> sp.decision     2000   -none- list
>>>> summary(scen2)
>>>                  Length Class  Mode
>>> aql                1   -none- numeric
>>> rql                1   -none- numeric
>>> alpha              1   -none- numeric
>>> beta               1   -none- numeric
>>> n.sim              1   -none- numeric
>>> N                  1   -none- numeric
>>> n.sample           1   -none- numeric
>>> n.acc              1   -none- numeric
>>> lot.inf.r          1   -none- numeric
>>> pop.inf.n       2000   -none- list
>>> pop.inf.r       2000   -none- list
>>> pop.decision.t1 2000   -none- list
>>> pop.decision.t2 2000   -none- list
>>> sp.inf.n        2000   -none- list
>>> sp.inf.r        2000   -none- list
>>> sp.decision     2000   -none- list
>>>> summary(scen3)
>>>                  Length Class  Mode
>>> aql                1   -none- numeric
>>> rql                1   -none- numeric
>>> alpha              1   -none- numeric
>>> beta               1   -none- numeric
>>> n.sim              1   -none- numeric
>>> N                  1   -none- numeric
>>> n.sample           1   -none- numeric
>>> n.acc              1   -none- numeric
>>> lot.inf.r          1   -none- numeric
>>> pop.inf.n       2000   -none- list
>>> pop.inf.r       2000   -none- list
>>> pop.decision.t1 2000   -none- list
>>> pop.decision.t2 2000   -none- list
>>> sp.inf.n        2000   -none- list
>>> sp.inf.r        2000   -none- list
>>> sp.decision     2000   -none- list
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                        Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------


