[R] Loop avoidance and logical subscripts

Thu May 21 19:18:52 CEST 2009

retama wrote:
> Patrick Burns kindly provided an article about this issue called 'The R
> Inferno'. However, I will expand a little bit my question because I think it
> is not clear and, if I coud improve the code it will be more understandable
> to other users reading this messages when I will paste it :)
> 
> In my example, I have a dataframe with several hundreds of DNA sequences in
> the column data$sequences (each value is a long string written in an
> alphabet of four characters, which are A, C, T and G). I'm trying to know
> parameter number of Gs plus Cs over the total  [G+C/(A+T+C+G)] in each
> sequence. In example, data$sequence [1] is something like AATTCCCGGGGGG but
> a little bit longer, and, its G+C content is 0.69 . I need to compute a
> vector with all G+C contents (in my example, in data$GCsequence, in which
> data$GCsequence[1] is 0.69).

A very efficient way to do this is

   library(Biostrings)
   dna = DNAStringSet(data$sequence)
   alf = alphabetFrequency(dna, baseOnly=TRUE)
   gc = rowSums(alf[,c("G", "C")]) / rowSums(alf)

this takes about .8 second for 3 million 36mers, for instance. 
Biostrings is installed with

   source('http://bioconductor.org/biocLite.R')
   biocLite('Biostrings')

Martin

> 
> So the question was if making a loop and a combination of values with c() or
> cbind() or with logical subscripts is ok or not. And which approach should
> produce better results in terms of efficiency (my script goes really slow).
> 
> Thank you,
> 
> Retama
> 
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793