# [R] Dataframe of factors transform speed?

Latchezar Dimitrov ldimitro at wfubmc.edu
Sat Jul 21 06:26:36 CEST 2007

```
Hi,

Thanks for the help. My 1st question is still unanswered though :-) Please
see below.

> -----Original Message-----
> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> Sent: Friday, July 20, 2007 3:30 AM
> To: Latchezar Dimitrov
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
>
> set.seed(123)
> genoT = lapply(1:240000, function(i) factor(sample(c("AA", "AB", "BB"),
>     1000, prob=sample(c(1, 1000, 1000), 3), rep=T)))
> names(genoT) = paste("snp", 1:240000, sep="")
> genoT = as.data.frame(genoT)

Now this _is_ the problem. Everything before converting to data.frame
worked almost instantaneously; however, as.data.frame runs forever.
Obviously there is some scalability/memory management issue. When I
tried my own method but creating a new result dataframe (instead of
modifying the old one), it worked like a charm for the 1st 100 cols:
~0.3s. I figured 300,000 cols should take ~1000s. Nope! It ran for about
50,000(!)s to finish only about 42,000 cols.

BTW, what ver. of R is yours?

Now here's what I "discovered" further.

#-- create a 1-col frame:
geno   <- data.frame(c(geno.GASP[[1]],geno.JAG[[1]]),
                     row.names=c(rownames(geno.GASP),rownames(geno.JAG)))

#-- main code; I repeated it w/ j in 1:1000, 2001:3000, and 3001:4000,
#-- i.e., adding 1000 cols to geno each time

system.time(
#   for(j in 1:(ncol(geno.GASP      ))){
for(j in 3001:(4000              )){
gt.GASP<-geno.GASP[[j]]
for(l in 1:length(gt.GASP@levels)){
levels(gt.GASP)[l] <- switch(gt.GASP@levels[l],AA="0",AB="1",BB="2")
}
gt.JAG <-geno.JAG [[j]]
#      for(l in 1:length(gt.JAG @levels)){
#        levels(gt.JAG )[l] <- switch(gt.JAG @levels[l],AA="0",AB="1",BB="2")
#      }
geno[[j]]<-factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
###               factor(c(as.numeric(factor(gt.GASP,levels=0:2))-1
,as.numeric(factor(gt.JAG, levels=0:2))-1
)
,levels=0:2
)
}
)

Times (each one is for a 1000 cols!):
[1] 26.673  0.032 26.705  0.000  0.000
[1] 77.186  0.037 77.225  0.000  0.000
[1] 128.165   0.042 128.209   0.000   0.000
[1] 180.940   0.047 180.989   0.000   0.000

See the big diff and the scaling I mentioned above?

Furthermore, I removed the geno[[j]] assignment while keeping the
operation itself, i.e., replaced it with the ### line above. Times:

[1] 0.857 0.008 0.865 0.000 0.000

Huh!? What the heck! That's my second question :-) Any ideas?

I still believe my method is near optimal. Of course I have to somehow
get rid of the assignment bottleneck.

For now the lesson is: "God bless lists"

Here is my final solution:

> system.time({
+     geno.GASP.L<-lapply(geno.GASP
+                        ,function(x){
+                           for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+     geno.JAG.L <-lapply(geno.JAG
+                        ,function(x){
+ #                         for(l in 1:length(x@levels)){levels(x)[l] <- switch(x@levels[l],AA="0",AB="1",BB="2")}
+                           factor(x,levels=0:2)
+                         }
+                  )
+ })
[1] 192.800   1.566 194.413   0.000   0.000   !!!!!!!!! :-)))))
> system.time({
+     class    (geno.GASP.L)<-"data.frame"
+     row.names(geno.GASP.L)<-row.names(geno.GASP)
+     class    (geno.JAG.L )<-"data.frame"
+     row.names(geno.JAG.L )<-row.names(geno.JAG )
+ })
[1] 12.156  0.001 12.155  0.000  0.000
> system.time({
+     geno<-rbind(geno.GASP.L,geno.JAG.L)
+ })
[1] 1542.340    9.072 2066.310    0.000    0.000

I logged my notes here as I was trying various things. Partly the reason
is my two questions:

"What was wrong with me?" and
"What the heck?!" remember above? :-)))

which  still remain unanswered :-(

I would have had a lot of fun if I did not have to have this done by ...
Yesterday :-))

Thanks a lot for the help

Latchezar

> dim(genoT)
> class(genoT)
> system.time(out <- lapply(genoT, function(x) match(x, c("AA", "AB",
> "BB"))-1))
> ##
> ##
>     user  system elapsed
> 119.288   0.004 119.339
>
> (for all 240K)
>
> best,
> b
>
> ps: note that "out" is a list.
>
> On Jul 20, 2007, at 2:01 AM, Latchezar Dimitrov wrote:
>
> > Hi,
> >
> >> -----Original Message-----
> >> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu]
> >> Sent: Friday, July 20, 2007 12:25 AM
> >> To: Latchezar Dimitrov
> >> Cc: r-help at stat.math.ethz.ch
> >> Subject: Re: [R] Dataframe of factors transform speed?
> >>
> >> it looks like that whatever method you used to genotype the
> >> 1002 samples on the STY array gave you a transposed matrix of
> >> genotype calls. :-)
> >
> > It only looks like :-)
> >
> > Otherwise it is a correctly created dataframe of 1002 samples X (big
> > number) of columns (SNP genotypes). It worked perfectly until I
> > decided to put together two cohorts already processed independently
> > in R. I got stuck through my lack of foresight. Otherwise I would
> > have put 3 dummy lines w/ AA, AB, and BB on each one to make sure all 3
> > genotypes are present and that's it! Lesson for the future :-)
> >
> > Maybe I am not using columns and rows appropriately here but the
> > dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-)
> > - as str says, 1002 observ. of (big number) vars.
> >
> >>
> >> i'd use:
> >>
> >> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> >>
> >> as a starting point... but I don't think that would be efficient (as
> >> you'd need to fix one column at a time - lapply).
> >
> > No it was not efficient at all. 'matter of fact nothing is more
> >
> >>
> >> i'd preprocess yourFile before trying to load it:
> >>
> >> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
> >>
> >> and, now, in R:
> >>
> >
> > ... Too late ;-) As must be clear by now, I have two dataframes I want
> > to put together with rbind(geno1,geno2). The issue again is
> > "uniformization" of factor variables w/ missing levels - they ended
> > up with levels AA,BB on one of them and levels AB,BB on the other, which
> > means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on
> > the second - a complete mess. That's why I tried to make both uniform,
> > i.e. levels "AA","AB", and "BB" for every SNP, and then rbind works.
> >
> > In any case my 1st question remains: "What's wrong with me?" :-)
> >
> > Thanks,
> > Latchezar
> >
> >>
> >> b
> >>
> >> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
> >>
> >>> Hello,
> >>>
> >>> This is a speed question. I have a dataframe genoT:
> >>>
> >>>> dim(genoT)
> >>> [1]   1002 238304
> >>>
> >>>> str(genoT)
> >>> 'data.frame':   1002 obs. of  238304 variables:
> >>>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >>>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> >>>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >>>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> >>>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> >>>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >>>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> >>>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >>>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> >>>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> >>>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> >>>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> >>>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> >>>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> >>>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> >>>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> >>>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> >>>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >>>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
> >>>
> >>> Its columns are factors with different numbers of levels (from 1 to 3 -
> >>> that's what I got from read.table, i.e., it dropped missing levels). I
> >>> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> >>> above show already converted columns and the rest are not yet converted.
> >>> Here's my attempt, which is a complete failure speed-wise:
> >>>
> >>>> system.time(
> >>> +     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> >>> +        gt<-genoT[[j]]          #-- this is to avoid 2D indices
> >>> +        for(l in 1:length(gt@levels)){
> >>> +          levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> >>> +          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
> >>> +        }
> >>> +     }
> >>> + )
> >>> [1] 785.085   4.358 789.454   0.000   0.000
> >>>
> >>> 789s for 10 columns only!
> >>>
> >>> To me it seems like replacing 10 x 3 levels and then making a factor
> >>> of a 1002-element vector x 10 is a "negligible" amount of operations.
> >>>
> >>> So, what's wrong with me? Any idea how to significantly accelerate the
> >>> transformation or (to go to the very beginning) to make read.table use
> >>> a fixed set of levels ("AA","AB", and "BB") and not drop any (missing)
> >>> level?
> >>>
> >>> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
> >>>
> >>> The machine has 32G RAM and an AMD Opteron 285 (2.? GHz), so that's
> >>> not it.
> >>>
> >>> Thank you very much for the help,
> >>>
> >>> Latchezar Dimitrov,
> >>> Analyst/Programmer IV,
> >>> Wake Forest University School of Medicine, Winston-Salem, North
> >>> Carolina, USA
> >>>
> >>> ______________________________________________
> >>> R-help at stat.math.ethz.ch mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> http://www.R-project.org/posting-
> >>> guide.html and provide commented, minimal, self-contained,
> >>> reproducible code.
> >>
>

```
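
For readers following along, the approach the thread converges on (recode each column as a plain list element, and only build the data frame at the very end) can be sketched on toy data as below. This is only an illustration under stated assumptions: `make_cohort()`, `relevel_snp()`, `geno.A` and `geno.B` are hypothetical names standing in for the poster's `geno.GASP` and `geno.JAG`, and the sizes are tiny. It uses `factor(x, levels =, labels =)` to map `"AA"`, `"AB"`, `"BB"` to `"0"`, `"1"`, `"2"` in one call per column instead of looping over levels with `switch()`.

```r
# Toy stand-in for the two cohorts; all names here are hypothetical.
set.seed(1)
calls <- c("AA", "AB", "BB")

make_cohort <- function(n_samples, n_snps) {
  cols <- lapply(seq_len(n_snps), function(i) {
    # Use only two of the three genotypes per SNP so that levels are
    # missing, mimicking what read.table produced in the original post.
    present <- sample(calls, 2)
    factor(sample(present, n_samples, replace = TRUE))
  })
  names(cols) <- paste0("snp", seq_len(n_snps))
  cols  # deliberately a plain list, not a data frame
}

geno.A <- make_cohort(50, 200)
geno.B <- make_cohort(30, 200)

# Recode one column: fixed level set, relabelled "0"/"1"/"2".
relevel_snp <- function(x) {
  factor(as.character(x), levels = calls, labels = c("0", "1", "2"))
}

A <- lapply(geno.A, relevel_snp)
B <- lapply(geno.B, relevel_snp)

# Stack the two cohorts SNP by SNP while everything is still a list.
combined <- Map(function(a, b) {
  factor(c(as.character(a), as.character(b)), levels = c("0", "1", "2"))
}, A, B)

# Turn the list into a data frame once, by setting attributes directly.
# Assumption: all columns have equal length and valid names, because this
# shortcut skips the checks that as.data.frame() would perform.
geno <- combined
attr(geno, "row.names") <- seq_len(length(geno[[1]]))
class(geno) <- "data.frame"

dim(geno)
```

The design choice here is to avoid assigning into a data frame inside a loop. The timings in the post are consistent with each `geno[[j]] <- ...` costing more as the frame grows, whereas list elements can be replaced cheaply; whether that is the full explanation for the R version used in the thread is not established here.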