[R] Dataframe of factors transform speed?

jim holtman jholtman at gmail.com
Fri Jul 20 06:47:49 CEST 2007


Is this what you want?  It took 0.01 seconds to convert the 20 columns
(1000 rows each) of the test data:

> # create some data     (1000 rows with 20 columns)
> n <- 20
> result <- list()
> vals <- c("AA", "AB", "BB")
> for (i in 1:n){
+     result[[as.character(i)]] <- sample(vals, 1000, replace=TRUE,
+                                         prob=c(9000,1,1))
+ }
> result.df <- do.call('data.frame', result)
>
>
> str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 1 level "AA": 1 1 1 1 1 1 1 1 1 1 ...
>
> # go through each column and convert the factors according to 'vals' above
> system.time({      # time to convert 20 columns
+     x <- lapply(result.df, function(facts){
+         factor(match(as.character(facts), vals) - 1, levels=0:2)
+     })
+     result.df <- do.call('data.frame', x)
+ })
   user  system elapsed
   0.01    0.00    0.01
>
> str(result.df)
'data.frame':   1000 obs. of  20 variables:
 $ X1 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X3 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X4 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X5 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X6 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X7 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X8 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X9 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X10: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X11: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X12: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X13: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X14: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X15: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X16: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X17: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X18: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X19: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ X20: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
>
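
One caveat with the match() call above: a column that already has the
levels "0","1","2" would come back as all NA, because none of its values
appear in 'vals'. Since your genoT mixes converted and unconverted
columns, here is a minimal sketch of a helper that handles both cases and
keeps NA as NA. The name recode_geno and the small test data frame are
made up for illustration; adjust to your data.

```r
# Hypothetical helper (not from the original post): recode one genotype
# column to a factor with the fixed levels "0","1","2".
recode_geno <- function(f, from = c("AA", "AB", "BB"), to = c("0", "1", "2")) {
    ch  <- as.character(f)
    idx <- match(ch, from)                  # 1..3 for AA/AB/BB, NA otherwise
    already <- is.na(idx) & ch %in% to      # values that are already "0","1","2"
    idx[already] <- match(ch[already], to)
    factor(to[idx], levels = to)            # true NAs stay NA
}

# Small made-up example with unconverted, partial, and converted columns
genoT <- data.frame(
    a = factor(c("AA", "AB", NA, "AA")),
    b = factor(c("BB", "BB", "BB", "BB")),
    c = factor(c("0", "2", "1", NA), levels = c("0", "1", "2"))
)

# Replace all columns in one step, keeping the column names
genoT[] <- lapply(genoT, recode_geno)
str(genoT)
```

The conversion is done on the columns as a list and the data frame is
filled in once, rather than assigning genoT[[j]] inside a loop. On a data
frame with 238304 columns, I suspect the repeated assignment is where
most of your time went, though I have not measured it on your data.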


On 7/19/07, Latchezar Dimitrov <ldimitro at wfubmc.edu> wrote:
> Hello,
>
> This is a speed question. I have a dataframe genoT:
>
> > dim(genoT)
> [1]   1002 238304
>
> > str(genoT)
> 'data.frame':   1002 obs. of  238304 variables:
>  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
>  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
>  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
>  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
>  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
>  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
>  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
>  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
>  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
>  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
>  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
>  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
>  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
>  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
>  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
>  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
>  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
>
> Its columns are factors with different numbers of levels (from 1 to 3 -
> that's what I got from read.table, i.e., it dropped missing levels). I
> want to convert it to uniform factors with 3 levels. The 1st 10 rows
> above show already converted columns and the rest are not yet converted.
> Here's my attempt, which is a complete failure in terms of speed:
>
> > system.time(
> +     #-- this is to try 1st 10 cols and measure the time,
> +     #-- it otherwise is ncol(genoT) instead of 10
> +     for(j in 1:(10         )){
> +        gt<-genoT[[j]]          #-- this is to avoid 2D indices
> +        for(l in 1:length(gt@levels)){
> +          #-- convert levels to "0","1", or "2"
> +          levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2")
> +          #-- make a 3-level factor and put it back
> +          genoT[[j]]<-factor(gt,levels=0:2)
> +        }
> +     }
> + )
> [1] 785.085   4.358 789.454   0.000   0.000
>
> 789s for 10 columns only!
>
> To me it seems like replacing 10 x 3 levels and then making a factor of
> 1002 element vector x 10 is a "negligible" amount of operations needed.
>
> So, what's wrong with me? Any idea how to accelerate significantly the
> transformation or (to go to the very beginning) to make read.table use a
> fixed set of levels ("AA","AB", and "BB") and not to drop any (missing)
> level?
>
> R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
>
> The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not
> it.
>
> Thank you very much for the help,
>
> Latchezar Dimitrov,
> Analyst/Programmer IV,
> Wake Forest University School of Medicine,
> Winston-Salem, North Carolina, USA
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
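
On the read.table part of the question: as far as I know, read.table
cannot be given a fixed set of levels directly, but you can read
the columns as character and then build the factors yourself with explicit
levels, so no level is dropped. A minimal sketch follows; the inline text
and the SNP_* column names are made-up stand-ins for your real file, and
I am assuming a whitespace-separated file with a header line.

```r
# Made-up stand-in for the real genotype file
txt <- "SNP_1 SNP_2 SNP_3
AA AB AA
AA NA BB
AA AB AA"

# Read every column as character so read.table builds no factors itself
raw <- read.table(text = txt, header = TRUE, colClasses = "character")

# Build factors with the full, fixed set of levels for every column
geno_levels <- c("AA", "AB", "BB")
geno <- raw
geno[] <- lapply(raw, factor, levels = geno_levels)
str(geno)
```

If you want "0","1","2" straight away, adding labels = c("0","1","2") to
the factor() call should do the recoding in the same step.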


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


