[R] Loop avoidance and logical subscripts
Martin Morgan
mtmorgan at fhcrc.org
Thu May 21 19:18:52 CEST 2009
retama wrote:
> Patrick Burns kindly provided an article about this issue called 'The R
> Inferno'. However, I will expand a little bit my question because I think it
> is not clear and, if I coud improve the code it will be more understandable
> to other users reading this messages when I will paste it :)
>
> In my example, I have a dataframe with several hundreds of DNA sequences in
> the column data$sequences (each value is a long string written in an
> alphabet of four characters, which are A, C, T and G). I'm trying to know
> parameter number of Gs plus Cs over the total [G+C/(A+T+C+G)] in each
> sequence. In example, data$sequence [1] is something like AATTCCCGGGGGG but
> a little bit longer, and, its G+C content is 0.69 . I need to compute a
> vector with all G+C contents (in my example, in data$GCsequence, in which
> data$GCsequence[1] is 0.69).
A very efficient way to do this is
library(Biostrings)
dna = DNAStringSet(data$sequence)
alf = alphabetFrequency(dna, baseOnly=TRUE)
gc = rowSums(alf[,c("G", "C")]) / rowSums(alf)
this takes about .8 second for 3 million 36mers, for instance.
Biostrings is installed with
source('http://bioconductor.org/biocLite.R')
biocLite('Biostrings')
Martin
>
> So the question was if making a loop and a combination of values with c() or
> cbind() or with logical subscripts is ok or not. And which approach should
> produce better results in terms of efficiency (my script goes really slow).
>
> Thank you,
>
> Retama
>
>
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
More information about the R-help
mailing list