[R] Loop avoidance and logical subscripts

Thu May 21 19:18:28 CEST 2009

On 21-May-09 16:56:23, retama wrote:
> Patrick Burns kindly provided an article about this issue called
> 'The R Inferno'. However, I will expand on my question a little,
> because I think it was not clear and, if I can improve the code,
> it will be more understandable to other users reading these messages
> when I paste it :)
> 
> In my example, I have a dataframe with several hundreds of DNA
> sequences in the column data$sequences (each value is a long string
> written in an alphabet of four characters, which are A, C, T and G).
> I'm trying to compute the G+C content, i.e. the number of Gs plus Cs
> over the total [(G+C)/(A+T+C+G)], in each sequence. For example,
> data$sequences[1] is something like AATTCCCGGGGGG but a little bit
> longer, and its G+C content is 0.69. I need to compute a vector with
> all G+C contents (in my example, data$GCsequence, in which
> data$GCsequence[1] is 0.69).
> 
> So the question was whether building up the values in a loop with
> c() or cbind(), or using logical subscripts, is OK or not, and which
> approach should give better results in terms of efficiency (my
> script runs really slowly).
> 
> Thank you,
> Retama

Perhaps the following could be the basis of your code for the bigger
problem:

  S <- unlist(strsplit("AATTCCCGGGGGG",""))
  S
#  [1] "A" "A" "T" "T" "C" "C" "C" "G" "G" "G" "G" "G" "G"
  (sum((S=="C")|(S=="G")))
# [1] 9
  (sum((S=="C")|(S=="G")))/length(S)
# [1] 0.6923077

You could build a function along those lines, to evaluate what you
want for any given string, and then sapply() it over the elements
(which are the separate character strings) of data$sequences
(which is presumably a vector of character strings; apply() itself
is meant for matrices and arrays, so sapply() is the one to use on
a vector).
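
As a minimal sketch of that suggestion (the names gc_content, seqs and
gc_regex are made up for illustration, and seqs merely stands in for
data$sequences, which is assumed to be a character vector), one could
wrap the calculation in a function and apply it to every string. The
last lines show an alternative that avoids splitting the strings at
all, by deleting everything that is not C or G and comparing lengths;
it may well be faster on long sequences, though that has not been
timed here.

```r
# Hypothetical helper: G+C fraction of a single sequence string.
gc_content <- function(s) {
  bases <- unlist(strsplit(s, ""))          # split into single characters
  sum(bases == "C" | bases == "G") / length(bases)
}

# Hypothetical example data standing in for data$sequences.
seqs <- c("AATTCCCGGGGGG", "ATATAT", "GCGC")

# Apply the helper to each string; vapply() checks that every result
# is a single number and returns a plain numeric vector.
gc <- vapply(seqs, gc_content, numeric(1), USE.NAMES = FALSE)
gc   # approximately 0.692, 0, 1

# Alternative without splitting: drop every character that is not
# C or G, then divide the remaining length by the full length.
gc_regex <- nchar(gsub("[^CG]", "", seqs)) / nchar(seqs)
```

With the real data this would become something like
data$GCsequence <- vapply(data$sequences, gc_content, numeric(1)),
provided data$sequences is of type character rather than a factor
(wrap it in as.character() otherwise). Either way the whole result
vector is created in one go, rather than being grown with c() inside
a loop.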

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 21-May-09                                       Time: 18:18:24
------------------------------ XFMail ------------------------------