[R] Dataframe of factors transform speed?
jim holtman
jholtman at gmail.com
Sun Jul 22 04:26:12 CEST 2007
The problem is in the way that 'as.data.frame' works. Use Rprof on a
small list and you will see where it is spending its time.
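For example, here is a minimal sketch of that profiling step (the object and
file names are just for illustration, not from the thread):

small <- lapply(1:200, function(i)
    factor(sample(c("AA", "AB", "BB"), 100, replace = TRUE)))
names(small) <- paste("snp", seq_along(small), sep = "")
Rprof("asdf.prof")                       # start profiling to a file
junk <- as.data.frame(small)             # the slow step
Rprof(NULL)                              # stop profiling
summaryRprof("asdf.prof")$by.self        # see where the time is spent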
Now, if you are really sure that all your data is consistent with being
a data frame, you can create the data frame structure yourself. Not that
I would advocate it in general, but if you look at the output of 'dput'
on a data frame, you can construct your own (a sketch follows the
transcript below).
Here it took 20 seconds to create the test data as a list of 50,000
columns, and only 2 seconds to create the data frame from that.
> set.seed(123)
> n <- 50000
> system.time({
+ genoT <- lapply(1:n, function(i) factor(sample(c("AA",
+ "AB", "BB"), 1000, prob=c(1000, 1, 1), rep=T)))
+ })
user system elapsed
20.85 0.12 22.83
> names(genoT) = paste("snp", 1:n, sep="")
>
> # create your own data frame structure -- if you are real sure of your data
>
> system.time(genoTz <- structure(genoT, .Names=names(genoT),
+ row.names=c(NA, -length(genoT[[1]])), class='data.frame'))
user system elapsed
2.00 0.08 2.11
> str(genoTz)
'data.frame': 1000 obs. of 50000 variables:
$ snp1 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp2 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp3 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp4 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp5 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp6 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
$ snp8 : Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp9 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp10 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
$ snp11 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
>
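For reference, a minimal sketch of the 'dput' idea mentioned above (the helper
name is made up; it skips all the checking that as.data.frame does, so only
use it when you are certain every element really is an equal-length column):

dput(data.frame(snp1 = factor(c("AA", "AB")), snp2 = factor(c("BB", "AB"))))
# shows the 'names', 'row.names' and 'class' attributes that structure() supplies

quick_df <- function(cols) {
    stopifnot(is.list(cols), length(unique(sapply(cols, length))) == 1)
    structure(cols, class = "data.frame",
              row.names = c(NA, -length(cols[[1]])))
}
genoTz2 <- quick_df(genoT)   # same result as the structure() call above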
On 7/21/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
> Jim,
>
> No, this is _not_ the problem. If you go back to my 1st mail, I have a monster
> (at least it was one when I purchased it) with 32GB (sic :-) of RAM and 4
> dual-core AMD64 285 CPUs (the fastest at that time and still pretty fast now :-)
>
> The machine starts paging when I run 2 copies of R working on two things
> like that :-). If you look at my last e-mail, I found a solution, but I
> still have no clue why the heck x<-as.data.frame(y), where y is a list
> of the same columns, takes forever, and this is the thing that killed me
> before.
>
> Thanks,
> Latchezar
>
> > -----Original Message-----
> > From: jim holtman [mailto:jholtman at gmail.com]
> > Sent: Saturday, July 21, 2007 5:33 PM
> > To: Latchezar Dimitrov
> > Cc: Benilton Carvalho; r-help at stat.math.ethz.ch
> > Subject: Re: [R] Dataframe of factors transform speed?
> >
> > One of the problems is that you are probably paging on your
> > system with an object that size (240000 x 1000). This is
> > about 1GB for a single object:
> >
> > > set.seed(123)
> > > n <- 240000
> > > system.time({
> > + genoT <- lapply(1:n, function(i) factor(sample(c("AA", "AB", "BB"),
> > + 1000, prob=c(1000, 1, 1), rep=T)))
> > + })
> > user system elapsed
> > 95.00 0.61 104.71
> > > names(genoT) = paste("snp", 1:n, sep="")
> > >
> > > object.size(genoT)
> > [1] 1045258752
> > >
> >
> > I can create it on my 2GB machine as a list, but have
> > problems converting it to a dataframe because I don't have
> > enough memory.
> >
> > So unless you have at least 4GB on your system, it might take
> > a long time. Look at your performance measurements on your
> > system and see if you have run out of physical memory and are paging.
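(For what it's worth, a quick sketch of checking that from inside R; on Unix
you can also watch top or vmstat while the conversion runs:)

gc()                          # memory currently used / max used by R so far
object.size(genoT) / 2^20     # size of the big list in MB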
> >
> > On 7/21/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
> > > Hi,
> > >
> > > Thanks for the help. My 1st question is still unanswered though :-)
> > > Please see below.
> > >
> > > > -----Original Message-----
> > > > From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> > > > Sent: Friday, July 20, 2007 3:30 AM
> > > > To: Latchezar Dimitrov
> > > > Cc: r-help at stat.math.ethz.ch
> > > > Subject: Re: [R] Dataframe of factors transform speed?
> > > >
> > > > set.seed(123)
> > > > genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB",
> > > > "BB"), 1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> > > > names(genoT) = paste("snp", 1:240000, sep="")
> > > > genoT = as.data.frame(genoT)
> > >
> > > Now this _is_ the problem. Everything before converting to data.frame
> > > worked almost instantaneously; however, as.data.frame runs forever.
> > > Obviously there is some scalability/memory-management issue. When I
> > > tried my own method, creating a new result dataframe (instead of
> > > modifying the old one), it worked like a charm for the 1st 100 cols,
> > > ~0.3s. I figured 300,000 cols should take ~1000s. Nope! It ran for
> > > about 50,000(!)s to finish only about 42,000 cols.
> > >
> > > BTW, which version of R are you running?
> > >
> > > Now here's what I "discovered" further.
> > >
> > > #-- create a 1-col frame:
> > > geno <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),
> > >                    row.names=c(rownames(geno.GASP),rownames(geno.JAG)))
> > >
> > > #-- main code; I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
> > > #-- i.e., adding 1000 cols to geno each time
> > >
> > > system.time(
> > > # for(j in 1:(ncol(geno.GASP ))){
> > > for(j in 3001:(4000 )){
> > > gt.GASP<-geno.GASP[[j]]
> > > for(l in 1:length(gt.GASP@levels)){
> > >   levels(gt.GASP)[l] <- switch(gt.GASP@levels[l],AA="0",AB="1",BB="2")
> > > }
> > > gt.JAG <- geno.JAG[[j]]
> > > # for(l in 1:length(gt.JAG@levels)){
> > > #   levels(gt.JAG)[l] <- switch(gt.JAG@levels[l],AA="0",AB="1",BB="2")
> > > # }
> > > geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
> > > ### factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
> > > ,as.numeric(factor(gt.JAG, levels=0:2))-1
> > > )
> > > ,levels=0:2
> > > )
> > > }
> > > )
> > >
> > > Times (each one is for a 1000 cols!):
> > > [1] 26.673 0.032 26.705 0.000 0.000
> > > [1] 77.186 0.037 77.225 0.000 0.000
> > > [1] 128.165 0.042 128.209 0.000 0.000
> > > [1] 180.940 0.047 180.989 0.000 0.000
> > >
> > > See the big diff and the scaling I mentioned above?
> > >
> > > Furthermore, I removed the geno[[j]] assignment while keeping the
> > > operation itself, i.e., replaced that line with the ### line above. Times:
> > >
> > > [1] 0.857 0.008 0.865 0.000 0.000
> > >
> > > Huh!? What the heck! That's my second question :-) Any ideas?
> > >
> > > I still believe my method is near optimal. Of course I have to
> > > somehow get rid of the assignment bottleneck.
> > >
> > > For now the lesson is: "God bless lists"
> > >
> > > Here is my final solution:
> > >
> > > > system.time({
> > > + geno.GASP.L<-lapply(geno.GASP
> > > + ,function(x){
> > > +                      for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
> > > + factor(x,levels=0:2)
> > > + }
> > > + )
> > > + geno.JAG.L <-lapply(geno.JAG
> > > + ,function(x){
> > > + #                    for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
> > > + factor(x,levels=0:2)
> > > + }
> > > + )
> > > + })
> > > [1] 192.800 1.566 194.413 0.000 0.000 !!!!!!!!! :-)))))
> > > > system.time({
> > > + class (geno.GASP.L)<-"data.frame"
> > > + row.names(geno.GASP.L)<-row.names(geno.GASP)
> > > + class (geno.JAG.L )<-"data.frame"
> > > + row.names(geno.JAG.L )<-row.names(geno.JAG )
> > > + })
> > > [1] 12.156 0.001 12.155 0.000 0.000
> > > > system.time({
> > > + geno<-rbind(geno.GASP.L,geno.JAG.L)
> > > + })
> > > [1] 1542.340 9.072 2066.310 0.000 0.000
> > >
> > > I logged my notes here as I was trying various things. Part of the
> > > reason is my two questions:
> > >
> > > "What was wrong with me?" and
> > > "What the heck?!" (remember them from above? :-)))
> > >
> > > which still remain unanswered :-(
> > >
> > > I would have had a lot of fun if I didn't have to have this done by ...
> > > yesterday :-))
> > >
> > > Thanks a lot for the help
> > >
> > > Latchezar
> > >
> > > > dim(genoT)
> > > > class(genoT)
> > > > system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB", "BB"))-1))
> > > > ##
> > > > ##
> > > > user system elapsed
> > > > 119.288 0.004 119.339
> > > >
> > > > (for all 240K)
> > > >
> > > > best,
> > > > b
> > > >
> > > > ps: note that "out" is a list.
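A minimal sketch (not tested at this scale) of turning that list back into
uniform three-level factor columns and stamping it as a data frame, combining
the match() idea above with the structure() trick near the top of this message
(genoT here is the list of factor columns):

out <- lapply(genoT, function(x)
    factor(match(x, c("AA", "AB", "BB")) - 1, levels = 0:2))
out <- structure(out, row.names = c(NA, -length(out[[1]])),
                 class = "data.frame")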
> > > >
> > > > On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> > > > >> Sent: Friday, July 20, 2007 12:25 AM
> > > > >> To: Latchezar Dimitrov
> > > > >> Cc: r-help at stat.math.ethz.ch
> > > > >> Subject: Re: [R] Dataframe of factors transform speed?
> > > > >>
> > > > >> it looks like whatever method you used to genotype the
> > > > >> 1002 samples on the STY array gave you a transposed matrix of
> > > > >> genotype calls. :-)
> > > > >
> > > > > It only looks like that :-)
> > > > >
> > > > > Otherwise it is a correctly created dataframe of 1002 samples X (big
> > > > > number) of columns (SNP genotypes). It worked perfectly until I
> > > > > decided to put together two cohorts that had already been processed
> > > > > independently in R. I got stuck because of my lack of foresight;
> > > > > otherwise I would have put 3 dummy lines w/ AA, AB, and BB in each one
> > > > > to make sure all 3 genotypes are present, and that's it! Lesson for
> > > > > the future :-)
> > > > >
> > > > > Maybe I am not using "columns" and "rows" appropriately here, but the
> > > > > dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) -
> > > > > as str says, 1002 observ. of (big number) vars.
> > > > >
> > > > >>
> > > > >> i'd use:
> > > > >>
> > > > >> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> > > > >>
> > > > >> as a starting point... but I don't think that would be efficient
> > > > >> (as you'd need to fix one column at a time - lapply).
> > > > >
> > > > > No, it was not efficient at all. As a matter of fact, nothing is more
> > > > > efficient than loading already-read data, alas :-(
> > > > >
> > > > >>
> > > > >> i'd preprocess yourFile before trying to load it:
> > > > >>
> > > > >> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e
> > > > >> 's/BB/3/g' > outFile
> > > > >>
> > > > >> and, now, in R:
> > > > >>
> > > > >> genoT = read.table(outFile, header=TRUE)
> > > > >
> > > > > ... Too late ;-) As must be clear by now, I have two dataframes I
> > > > > want to put together with rbind(geno1,geno2). The issue again is the
> > > > > "uniformization" of factor variables w/ missing levels - they ended
> > > > > up with, e.g., levels AA,BB on one and levels AB,BB on the other,
> > > > > which means as.numeric of AA is 1 on the 1st and as.numeric of AB is
> > > > > 1 on the second - a complete mess. That's why I tried to make both
> > > > > uniform, i.e. levels "AA","AB", and "BB" for every SNP, and then
> > > > > rbind works.
> > > > >
> > > > > In any case my 1st question remains: "What's wrong with me?" :-)
> > > > >
> > > > > Thanks,
> > > > > Latchezar
> > > > >
> > > > >>
> > > > >> b
> > > > >>
> > > > >> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
> > > > >>
> > > > >>> Hello,
> > > > >>>
> > > > >>> This is a speed question. I have a dataframe genoT:
> > > > >>>
> > > > >>>> dim(genoT)
> > > > >>> [1] 1002 238304
> > > > >>>
> > > > >>>> str(genoT)
> > > > >>> 'data.frame': 1002 obs. of 238304 variables:
> > > > >>> $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> > > > >>> $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> > > > >>> $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> > > > >>> $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> > > > >>> $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> > > > >>> $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> > > > >>> $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> > > > >>> $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> > > > >>> $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> > > > >>> $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> > > > >>> $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> > > > >>> $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> > > > >>> $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> > > > >>> $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> > > > >>> $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> > > > >>> $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> > > > >>> $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> > > > >>> $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> > > > >>> $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
> > > > >>>
> > > > >>> Its columns are factors with different numbers of levels (from 1 to
> > > > >>> 3 - that's what I got from read.table, i.e., it dropped missing
> > > > >>> levels). I want to convert them to uniform factors with 3 levels.
> > > > >>> The 1st 10 columns above show already-converted columns and the rest
> > > > >>> are not yet converted.
> > > > >>> Here's my attempt, which is a complete failure speed-wise:
> > > > >>>
> > > > >>>> system.time(
> > > > >>> + for(j in 1:(10 )){                 #-- this is to try the 1st 10 cols and measure the time; it otherwise is ncol(genoT) instead of 10
> > > > >>> +   gt<-genoT[[j]]                   #-- this is to avoid 2D indices
> > > > >>> +   for(l in 1:length(gt@levels)){
> > > > >>> +     levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2")   #-- convert levels to "0","1", or "2"
> > > > >>> +     genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
> > > > >>> +   }
> > > > >>> + }
> > > > >>> + )
> > > > >>> [1] 785.085 4.358 789.454 0.000 0.000
> > > > >>>
> > > > >>> 789s for 10 columns only!
> > > > >>>
> > > > >>> To me it seems like replacing 10 x 3 levels and then making a factor
> > > > >>> of a 1002-element vector x 10 is a "negligible" amount of operations
> > > > >>> needed.
> > > > >>>
> > > > >>> So, what's wrong with me? Any idea how to significantly accelerate
> > > > >>> the transformation, or (to go back to the very beginning) how to make
> > > > >>> read.table use a fixed set of levels ("AA","AB", and "BB") and not
> > > > >>> drop any (missing) level?
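One possible answer to the read.table part of that question, sketched here
with a made-up file name: read everything in as character first, then impose
the full level set afterwards, so no level can ever be dropped:

genoT <- read.table("yourFile", header=TRUE, colClasses="character")
genoT[] <- lapply(genoT, factor, levels=c("AA", "AB", "BB"))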
> > > > >>>
> > > > >>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
> > > > >>>
> > > > >>> The machine has 32G RAM and AMD Opteron 285 CPUs (2.? GHz), so
> > > > >>> that's not it.
> > > > >>>
> > > > >>> Thank you very much for the help,
> > > > >>>
> > > > >>> Latchezar Dimitrov,
> > > > >>> Analyst/Programmer IV,
> > > > >>> Wake Forest University School of Medicine,
> > > > >>> Winston-Salem, North Carolina, USA
> > > > >>>
> > > > >>
> > > >
> > >
> > >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 646 9390
> >
> > What is the problem you are trying to solve?
> >
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?