[R] Dataframe of factors transform speed?

Fri Jul 20 08:01:27 CEST 2007

Hi,

> -----Original Message-----
> From: Benilton Carvalho [mailto:bcarvalh at jhsph.edu] 
> Sent: Friday, July 20, 2007 12:25 AM
> To: Latchezar Dimitrov
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Dataframe of factors transform speed?
> 
> it looks like that whatever method you used to genotype the 
> 1002 samples on the STY array gave you a transposed matrix of 
> genotype calls. :-)

It only looks like that :-)

Otherwise it is a correctly created dataframe of 1002 samples X (big
number) of columns (SNP genotypes). It worked perfectly until I decided
to put together two cohorts already processed independently in R. I got
stuck because of my lack of foresight. Otherwise I would have put 3 dummy
lines w/ AA, AB, and BB on each one to make sure all 3 genotypes are
present and that's it! Lesson for the future :-)

Maybe I am not using columns and rows appropriately here but the
dataframe is correct (I have not used FORTRAN since FORTRAN IV ;-) - as
str says 1002 observ. of (big number) vars.

> 
> i'd use:
> 
> genoT = read.table(yourFile, stringsAsFactors = FALSE)
> 
> as a starting point... but I don't think that would be 
> efficient (as you'd need to fix one column at a time - lapply).

No, it was not efficient at all. As a matter of fact, nothing is more
efficient than loading already read data, alas :-(
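
For reference, a minimal sketch of the "read as character, then fix the
columns with lapply" idea quoted above. The tiny inline table and the names
txt and gtLevels are assumptions made up for illustration; they stand in
for the real file and are not from the original data.

```r
# Hypothetical three-sample, two-SNP input standing in for yourFile.
txt <- "SNP1 SNP2
AA AB
AA BB
AB BB"

# Read genotype calls as plain character strings, not factors.
genoT <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

# Convert every column in one pass; naming the levels explicitly keeps
# all three genotypes even when a column never shows one of them.
gtLevels <- c("AA", "AB", "BB")
genoT[] <- lapply(genoT, factor, levels = gtLevels)

str(genoT)
```

Building the list of columns with lapply and assigning it back once avoids
repeated assignment into the dataframe inside a loop. Whether that is fast
enough for 238304 columns is untested here.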

> 
> i'd preprocess yourFile before trying to load it:
> 
> cat yourFile | sed -e 's/AA/1/g' | sed -e 's/AB/2/g' | sed -e 's/BB/3/g' > outFile
> 
> and, now, in R:
> 
> genoT = read.table(outFile, header=TRUE)

... Too late ;-) As must be clear by now, I have two dataframes I want to
put together with rbind(geno1,geno2). The issue again is
"uniformization" of factor variables w/ missing levels - they ended up
with levels AA,BB on one of them and levels AB,BB on the other, which
means as.numeric of AA is 1 on the 1st and as.numeric of AB is 1 on the
second - complete mess. That's why I tried to make both uniform, i.e.
levels "AA","AB", and "BB" for every SNP, and then rbind works.
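
To illustrate the mismatch and the "make both uniform" step, here is a small
sketch. The two one-column dataframes and the helper unifyLevels are
hypothetical, invented for this example; only the names geno1, geno2 and the
three genotype labels come from the discussion above.

```r
gtLevels <- c("AA", "AB", "BB")

# Two hypothetical cohorts; each one is missing a different genotype.
geno1 <- data.frame(SNP1 = factor(c("AA", "BB", "AA")))   # levels AA, BB
geno2 <- data.frame(SNP1 = factor(c("AB", "BB")))         # levels AB, BB

# The integer codes disagree: code 1 means AA in geno1 but AB in geno2.
as.numeric(geno1$SNP1)
as.numeric(geno2$SNP1)

# Re-create each column with the full level set, going through character
# so that the labels, not the old integer codes, decide the mapping.
unifyLevels <- function(df) {
  df[] <- lapply(df, function(x) factor(as.character(x), levels = gtLevels))
  df
}

genoAll <- rbind(unifyLevels(geno1), unifyLevels(geno2))
as.numeric(genoAll$SNP1)
```

After unifying, code 1 always means AA, 2 means AB and 3 means BB in both
cohorts and in the combined dataframe.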

In any case my 1st question remains: "What's wrong with me?" :-)

Thanks,
Latchezar

> 
> b
> 
> On Jul 19, 2007, at 11:51 PM, Latchezar Dimitrov wrote:
> 
> > Hello,
> >
> > This is a speed question. I have a dataframe genoT:
> >
> >> dim(genoT)
> > [1]   1002 238304
> >
> >> str(genoT)
> > 'data.frame':   1002 obs. of  238304 variables:
> >  $ SNP_A.4261647: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >  $ SNP_A.4261610: Factor w/ 3 levels "0","1","2": 1 1 3 3 1 1 1 2 2 2 ...
> >  $ SNP_A.4261601: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261704: Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
> >  $ SNP_A.4261563: Factor w/ 3 levels "0","1","2": 3 1 2 1 2 3 2 3 3 1 ...
> >  $ SNP_A.4261554: Factor w/ 3 levels "0","1","2": 1 1 NA 1 NA 2 1 1 2 1 ...
> >  $ SNP_A.4261666: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >  $ SNP_A.4261634: Factor w/ 3 levels "0","1","2": 3 3 2 3 3 3 3 3 3 2 ...
> >  $ SNP_A.4261656: Factor w/ 3 levels "0","1","2": 1 1 2 1 1 1 1 1 1 2 ...
> >  $ SNP_A.4261637: Factor w/ 3 levels "0","1","2": 1 3 2 3 2 1 2 1 1 3 ...
> >  $ SNP_A.4261597: Factor w/ 3 levels "AA","AB","BB": 2 2 3 3 3 2 1 2 2 3 ...
> >  $ SNP_A.4261659: Factor w/ 3 levels "AA","AB","BB": 3 3 3 3 3 3 3 3 3 3 ...
> >  $ SNP_A.4261594: Factor w/ 3 levels "AA","AB","BB": 2 2 2 1 1 1 2 2 2 2 ...
> >  $ SNP_A.4261698: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261538: Factor w/ 3 levels "AA","AB","BB": 2 3 2 2 3 2 2 1 1 2 ...
> >  $ SNP_A.4261621: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261553: Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261528: Factor w/ 2 levels "AA","AB": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261579: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 2 1 1 1 2 ...
> >  $ SNP_A.4261513: Factor w/ 3 levels "AA","AB","BB": 2 1 2 2 2 NA 1 NA 2 1 ...
> >  $ SNP_A.4261532: Factor w/ 3 levels "AA","AB","BB": 1 2 2 1 1 1 3 1 1 1 ...
> >  $ SNP_A.4261600: Factor w/ 2 levels "AB","BB": 2 2 2 2 2 2 2 2 2 2 ...
> >  $ SNP_A.4261706: Factor w/ 2 levels "AA","BB": 1 1 1 1 1 1 1 1 1 1 ...
> >  $ SNP_A.4261575: Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 1 1 1 2 2 1 ...
> >
> > Its columns are factors with different numbers of levels (from 1 to 3 -
> > that's what I got from read.table, i.e., it dropped missing levels). I
> > want to convert it to uniform factors with 3 levels. The 1st 10 rows
> > above show already converted columns and the rest are not yet converted.
> > Here's my attempt, which is a complete failure as far as speed goes:
> >
> >> system.time(
> > +     for(j in 1:(10         )){ #-- this is to try 1st 10 cols and measure the time, it otherwise is ncol(genoT) instead of 10
> > +        gt<-genoT[[j]]          #-- this is to avoid 2D indices
> > +        for(l in 1:length(gt@levels)){
> > +          levels(gt)[l] <- switch(gt@levels[l],AA="0",AB="1",BB="2") #-- convert levels to "0","1", or "2"
> > +          genoT[[j]]<-factor(gt,levels=0:2)   #-- make a 3-level factor and put it back
> > +        }
> > +     }
> > + )
> > [1] 785.085   4.358 789.454   0.000   0.000
> >
> > 789s for 10 columns only!
> >
> > To me it seems like replacing 10 x 3 levels and then making a factor of
> > 1002 element vector x 10 is a "negligible" amount of operations needed.
> >
> > So, what's wrong with me? Any idea how to accelerate significantly the
> > transformation or (to go to the very beginning) to make read.table use
> > a fixed set of levels ("AA","AB", and "BB") and not to drop any
> > (missing) level?
> >
> > R-devel_2006-08-26, Sun Solaris 10 OS - x86 64-bit
> >
> > The machine is with 32G RAM and AMD Opteron 285 (2.? GHz) so it's not it.
> >
> > Thank you very much for the help,
> >
> > Latchezar Dimitrov,
> > Analyst/Programmer IV,
> > Wake Forest University School of Medicine, Winston-Salem, North 
> > Carolina, USA
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html and provide commented,
> > minimal, self-contained, reproducible code.
>