# [R] Loop avoidance and logical subscripts

(Ted Harding) Ted.Harding at manchester.ac.uk
Thu May 21 19:18:28 CEST 2009

```
On 21-May-09 16:56:23, retama wrote:
> 'The R Inferno'. However, I will expand my question a little,
> because I think it is not clear and, if I could improve the code,
> it will be more understandable to other users reading these messages
> when I paste it :)
>
> In my example, I have a dataframe with several hundred DNA
> sequences in the column data$sequences (each value is a long string
> written in an alphabet of four characters, which are A, C, T and G).
> I'm trying to compute the number of Gs plus Cs over the total,
> (G+C)/(A+T+C+G), in each sequence. For example, a value of
> data$sequences is something like AATTCCCGGGGGG but a little bit
> longer, and its G+C content is 0.69. I need to compute a vector with
> all G+C contents (in my example, data$GCsequence, whose value for
> that sequence is 0.69).
>
> So the question was whether building the result in a loop, combining
> values with c() or cbind() or with logical subscripts, is OK or not,
> and which approach would be more efficient (my script runs really
> slowly).
>
> Thank you,
> Retama

Perhaps the following could be the basis of your code for the bigger
problem:

S <- unlist(strsplit("AATTCCCGGGGGG",""))
S
#   "A" "A" "T" "T" "C" "C" "C" "G" "G" "G" "G" "G" "G"
(sum((S=="C")|(S=="G")))
#  9
(sum((S=="C")|(S=="G")))/length(S)
#  0.6923077

You could build a function along those lines, to evaluate what you
want for any given string; and then sapply() it over the elements
(which are the separate character strings) of data$sequences
(which is presumably a vector of character strings).

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 21-May-09                                       Time: 18:18:24
------------------------------ XFMail ------------------------------

```
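The suggestion in the reply could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `gc_content` and the toy data frame are made up here, and the column names `sequences` and `GCsequence` are taken from the question. The second computation, based on `gsub()` and `nchar()`, is an alternative that was not part of the original reply; it avoids splitting each string into single characters.

```r
# Hypothetical helper (not from the original post):
# G+C fraction of a single sequence string.
gc_content <- function(s) {
  bases <- unlist(strsplit(s, ""))            # split into single characters
  sum(bases == "C" | bases == "G") / length(bases)
}

# Toy data frame standing in for the poster's 'data'; contents are made up.
data <- data.frame(
  sequences = c("AATTCCCGGGGGG", "ATAT", "GGCC"),
  stringsAsFactors = FALSE
)

# One function call per sequence: no explicit loop, and no vector grown
# step by step with c().
data$GCsequence <- sapply(data$sequences, gc_content, USE.NAMES = FALSE)

# Alternative (an assumption, not the reply's method): delete every character
# that is not G or C, then compare string lengths. gsub() and nchar() are
# vectorised over the whole column.
gc_vec <- nchar(gsub("[^GC]", "", data$sequences)) / nchar(data$sequences)

print(data$GCsequence)
print(all.equal(data$GCsequence, gc_vec))
```

Both versions assume upper-case sequences containing only A, C, G and T; lower-case input or ambiguity codes such as N would need extra handling.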