[R] Dataframe of factors transform speed?

Latchezar Dimitrov ldimitro at wfubmc.edu
Fri Jul 20 05:51:24 CEST 2007


Hello,

This is a speed question. I have a dataframe genoT: 

> dim(genoT)
[1]   1002 238304

> str(genoT)
'data.frame':   1002 obs. of  238304 variables:
 $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
 $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
 $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
 $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
 $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
 $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
 $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
 $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
 $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
 $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
 $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
 $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
 $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
 $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
 $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
 $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
 $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...

Its columns are factors with differing numbers of levels (from 1 to 3 -
that's what I got from read.table, i.e., it dropped the levels that do
not occur in a column). I want to convert them to uniform factors with 3
levels. The first 10 columns in the str() output above are already
converted; the rest are not yet converted. Here's my attempt, which is a
complete failure in terms of speed:

> system.time(
+     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
+        gt<-genoT[[j]]          #-- this is to avoid 2D indices
+        for(l in 1:length(gt@levels)){
+          levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
+          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
+        }
+     }
+ )
[1] 785.085   4.358 789.454   0.000   0.000

789s for 10 columns only!

To me it seems that replacing 10 x 3 levels and then building 10 factors
from 1002-element vectors should be a "negligible" amount of work.
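
As a rough check of that estimate, here is a minimal sketch of the same
recoding done in a single pass over the columns. The toy data frame, the
lookup vector `recode` and the helper `to3` are made up for illustration
and are not part of the original session. It assumes that most of the
time in the loop above goes into the repeated `genoT[[j]] <- ...`
assignment on a very wide data frame rather than into the factor
conversion itself; that is a guess and was not measured on this setup.

```r
# Hypothetical small stand-in for genoT: mixed coding, some levels missing.
genoT <- data.frame(
  s1 = factor(c("0", "2", NA, "1")),        # already recoded column
  s2 = factor(c("AA", "AB", "AA", "AA")),   # "BB" never observed
  s3 = factor(c("BB", "BB", "BB", "BB"))    # only one level present
)

# Lookup that maps either coding onto the common labels "0", "1", "2".
recode <- c(AA = "0", AB = "1", BB = "2", "0" = "0", "1" = "1", "2" = "2")

# Convert one column: translate the labels, then fix the level set.
to3 <- function(x) {
  factor(unname(recode[as.character(x)]), levels = c("0", "1", "2"))
}

# Convert all columns in one lapply() and assign them back once,
# instead of assigning into the data frame inside a loop.
genoT[] <- lapply(genoT, to3)

str(genoT)
```

The explicit `levels = c("0", "1", "2")` is what keeps levels that do not
occur in a column, and NA entries stay NA.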

So, what am I doing wrong? Any idea how to speed up the transformation
significantly or (to go back to the very beginning) how to make
read.table use a fixed set of levels ("AA", "AB", and "BB") and not drop
any level that is missing from a column?
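
One possible way to get a fixed level set at read time, sketched under
assumptions: the columns are read as plain character strings and then
turned into factors with explicit levels. The inline text `txt` and the
column names are invented here, since the real file and the read.table
call that produced genoT are not shown.

```r
# Hypothetical three-row input standing in for the real genotype file.
txt <- "SNP_1 SNP_2 SNP_3
AA AB BB
AA AA BB
AB AA NA"

# Read every column as character so read.table builds no factors itself.
raw <- read.table(text = txt, header = TRUE,
                  colClasses = "character", na.strings = "NA")

# Build factors with the full, fixed level set for every column.
geno <- raw
geno[] <- lapply(raw, factor, levels = c("AA", "AB", "BB"))

str(geno)
```

With this approach a column that contains only "BB" still ends up as a
factor with the three levels "AA", "AB", "BB".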

R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit

The machine has 32G RAM and an AMD Opteron 285 (2.? GHz), so the
hardware is not the problem.

Thank you very much for the help,

Latchezar Dimitrov,
Analyst/Programmer IV,
Wake Forest University School of Medicine,
Winston-Salem, North Carolina, USA


