[R] How to quickly convert a data.frame into a structure of lists

William Dunlap wdunlap at tibco.com
Wed Aug 10 21:55:56 CEST 2011


Here is code to transform the matrix that by() or array(split())
produces into a list of lists, along with timings for the various
approaches.  Using split(), either directly or via by() or tapply(),
saves a lot of time.
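
To see why split() helps, here is a minimal sketch on toy data (hypothetical
values, not part of the timing test below). A single split() call on both
factors groups C for every A/B combination, instead of rescanning all of A
and B once per pair of levels.

```r
# Toy data (hypothetical values, same shape as the example in the question).
df <- data.frame(A = factor(c("a", "a", "b", "b")),
                 B = factor(c("X", "X", "Y", "Z")),
                 C = c(1, 2, 3, 4))

# Splitting by a list of factors groups C by their interaction.
# Group names are "<A level>.<B level>", with the levels of A varying fastest.
groups <- split(df$C, df[c("A", "B")])

names(groups)     # "a.X" "b.X" "a.Y" "b.Y" "a.Z" "b.Z"
groups[["a.X"]]   # the C values for A == "a" and B == "X"
groups[["a.Y"]]   # empty combinations are kept as zero-length vectors
```

Because the levels of A vary fastest, the result can be laid out as an
nlevels(A) by nlevels(B) matrix without reordering.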

f0 <- function(df) {
    # original code with typos fixed.
    list_structure <- lapply(levels(df$A), function(levelA) {
        lapply(levels(df$B), function(levelB) {df$C[df$A==levelA & df$B==levelB]})
    })
    # Apply the names:
    names(list_structure)<-levels(df$A)
    for (i in 1:length(list_structure)) {
        names(list_structure[[i]])<-levels(df$B)
    }
    list_structure
}

f0a <- function(df) {
    # slightly faster version of f0, taking repeated
    # calculations out of loops.
    A <- df$A
    B <- df$B
    C <- df$C
    levelsA <- structure(levels(A), names=levels(A))
    levelsB <- structure(levels(B), names=levels(B))
    lapply(levelsA, function(levelA) {
            tmpA <- A == levelA # this is responsible for most of the savings
            lapply(levelsB, function(levelB) {C[tmpA & B==levelB]})
    })
}

f1 <- function(df) {
    # DM's code
    by(df$C, df[,1:2], identity)
}

f2 <- function(df) {
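    # WD's code
    AB<- df[c("A","B")]
    # lapply() for dimnames: sapply() would return a matrix, not a list,
    # if A and B happened to have the same number of levels.
    array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=lapply(AB, levels))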
}

matrix2ListOfRows <- function(mat) {
    # convert a matrix to a list of its rows, converting dimnames to names.
    retval <- structure(as.vector(mat), names=rep(colnames(mat), each=nrow(mat)))
    retval <- split(retval, row(mat))
    names(retval) <- rownames(mat)
    retval
}
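
A small worked example may make the shape of the result clearer. This sketch
repeats f2 and matrix2ListOfRows so that it runs on its own (using lapply()
for the dimnames so that they are always a list), and applies them to a
hypothetical four-row data frame like the one in the original question.

```r
f2 <- function(df) {
    AB <- df[c("A", "B")]
    array(split(df$C, AB), dim = sapply(AB, nlevels), dimnames = lapply(AB, levels))
}

matrix2ListOfRows <- function(mat) {
    # convert a matrix to a list of its rows, converting dimnames to names.
    retval <- structure(as.vector(mat), names = rep(colnames(mat), each = nrow(mat)))
    retval <- split(retval, row(mat))
    names(retval) <- rownames(mat)
    retval
}

# Toy data (hypothetical). The factors are built explicitly, because recent
# versions of R no longer convert strings to factors in data.frame().
df <- data.frame(A = factor(c("a", "a", "b", "b")),
                 B = factor(c("X", "X", "Y", "Z")),
                 C = c(1, 2, 3, 4))

mat <- f2(df)            # 2 x 3 matrix whose cells are vectors of C values
mat[["a", "X"]]          # matrix-style access

ll <- matrix2ListOfRows(mat)
ll$a$X                   # same cell, list-of-lists access
ll$b$Z
```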

The test involves 10^5 rows of data with 26 levels for A and 200 for B.

> r200 <- as.character(as.roman(1:200))
> set.seed(1)
> df <- data.frame(A=factor(sample(letters, size=1e5, replace=TRUE), levels=letters),
+                  B=factor(sample(r200, size=1e5, replace=TRUE), levels=r200),
+                  C=1:1e5)
> system.time(ls0 <- f0(df))
   user  system elapsed 
  74.08    2.34   76.60 
> system.time(ls0a <- f0a(df))
   user  system elapsed 
  43.09    0.44   43.73 
> all.equal(ls0, ls0a)
[1] TRUE
> system.time(ls2 <- matrix2ListOfRows(f2(df)))
   user  system elapsed 
   0.09    0.02    0.11 
> all.equal(ls0, ls2)
[1] TRUE
> system.time(ls1 <- matrix2ListOfRows(f1(df)))
   user  system elapsed 
   0.69    0.00    0.69 
> all.equal(ls0, ls1)
[1] TRUE
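
If the list of lists is all that is needed, the matrix step can probably be
skipped by nesting split() calls. This is an untimed sketch, not one of the
approaches benchmarked above, and f3 is a hypothetical name.

```r
f3 <- function(df) {
    # Hypothetical alternative, not timed in this thread:
    # split C and B by A, then split each piece of C by the matching piece of B.
    # The pieces of B keep all of B's levels, so empty A/B combinations
    # show up as zero-length vectors.
    Map(split, split(df$C, df$A), split(df$B, df$A))
}

# Toy data (hypothetical values).
df <- data.frame(A = factor(c("a", "a", "b", "b")),
                 B = factor(c("X", "X", "Y", "Z")),
                 C = c(1, 2, 3, 4))

ls3 <- f3(df)
ls3$a$X
ls3$b$Z
```

It still relies on split() rather than on repeated logical subsetting, but
how it compares in speed with f2 would need to be measured.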


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Wednesday, August 10, 2011 10:05 AM
> To: Duncan Murdoch; Frederic F
> Cc: r-help at r-project.org
> Subject: Re: [R] How to quickly convert a data.frame into a structure of lists
> 
> I was going to suggest
>   > AB <- df[c("A","B")]
>   > ls2 <- array(split(df$C, AB), dim=sapply(AB, nlevels), dimnames=sapply(AB, levels))
> which produces a matrix very similar to what Duncan's by() call produces
>   > ls1 <- by(df$C, df[,1:2], identity)
> E.g.,
>   > ls2[["a","X"]]
>   [1] 1 2
>   > ls1[["a","X"]]
>   [1] 1 2
>   > ls1[["a","Y"]] # by assigns NULL to unoccupied slots
>   NULL
>   > ls2[["a","Y"]] # split gives the same type to all slots, copied from input
>   numeric(0)
> 
> They both are quick because they use split() to avoid the repeated
> evaluations of
>   bigVector[ anotherBigVector == scalar ]
> that your nested (not imbricated) loops do.  If you really need to convert
> the matrix to a list of lists that will probably be a quick transformation.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Duncan Murdoch
> > Sent: Wednesday, August 10, 2011 9:43 AM
> > To: Frederic F
> > Cc: r-help at r-project.org
> > Subject: Re: [R] How to quickly convert a data.frame into a structure of lists
> >
> > On 10/08/2011 10:30 AM, Frederic F wrote:
> > > Hello Duncan,
> > >
> > > Here is a small example to illustrate what I am trying to do.
> > >
> > > # Example data.frame
> > > df=data.frame(A=c("a","a","b","b"), B=c("X","X","Y","Z"), C=c(1,2,3,4))
> > > #   A B C
> > > # 1 a X 1
> > > # 2 a X 2
> > > # 3 b Y 3
> > > # 4 b Z 4
> > >
> > > ### First way of getting the list structure (ls1) using imbricated lapply loops:
> > > # Get the structure and populate it:
> > > ls1<-lapply(levels(df$A), function(levelA) {
> > >        lapply(levels(df$B), function(levelB) {df$C[df$A==levelA & df$B==levelB]})
> > > })
> > > # Apply the names:
> > > names(ls1)<-levels(df$A)
> > > for (i in 1:length(ls1)) {names(ls1[[i]])<-levels(df$B)}
> > >
> > > # Result:
> > > ls1$a$X
> > > # [1] 1 2
> > > ls1$b$Z
> > > # [1] 4
> > >
> > > The data.frame will always be 'complete', i.e., there will be a value in
> > > every row for every column.
> > > I want to produce a structure like this one quickly (I am aiming for something
> > > below 10 seconds) for a dataset containing between 1 and 2 million rows.
> > >
> >
> > I don't know what the timing would be like for your real data, but this
> > does look like by() would work:
> >
> > ls1 <- by(df$C, df[,1:2], identity)
> >
> > When I repeat the rows of df a million times each, this finishes in a
> > few seconds.  It would definitely be slower if there were more levels of
> > A or B.
> >
> > Now ls1 will be a matrix whose entries are the subsets of C that you
> > want, so you can see your two results with slightly different syntax:
> >
> >  > ls1[["a", "X"]]
> > [1] 1 2
> >  > ls1[["b","Z"]]
> > [1] 4
> >
> > Duncan Murdoch
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.


