[R] For help in R coding

David Winsemius dwinsemius at comcast.net
Sat Jul 2 00:25:20 CEST 2011


On Jul 1, 2011, at 12:47 PM, Bansal, Vikas wrote:

> Dear all,
>
> I am doing a project on variant calling using R. I am working on a
> pileup file. There are 10 columns in my data frame and I want to
> count the number of A, C, G and T in each row for column 9. An example
> of column 9 is given below:
>
>            .a,g,,
>            .t,t,,
>            .,c,c,
>            .,a,,,
>            .,t,t,t
>            .c,,g,^!.
>            .g,ggg.^!,
>            .$,,,,,.,
>            a,g,,t,
>            ,,,,,.,^!.
>            ,$,,,,.,.
>
> This is a bit confusing for me, as these characters are all in one
> column. How can we scan each row and print the number of A, C, G and T
> for that row?

Seems a bit clunky but this does the job (first the data):
 > txt <- " .a,g,,
+            .t,t,,
+            .,c,c,
+            .,a,,,
+            .,t,t,t
+            .c,,g,^!.
+            .g,ggg.^!,
+            .$,,,,,.,
+            a,g,,t,
+            ,,,,,.,^!.
+            ,$,,,,.,."

 > txtvec <- readLines(textConnection(txt))
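With the real data, the strings would come from column 9 of the data frame rather than from a pasted literal. A minimal sketch, assuming a tab-separated file with no header; the two sample lines and the column contents are made up for illustration and only mimic the 10-column layout described in the question. Quote and comment handling are switched off because quality strings can contain characters such as ' and #.

```r
# Write two made-up, tab-separated lines to a temporary file (hypothetical data).
tmp <- tempfile(fileext = ".pileup")
writeLines(c("chr1\t100\tA\tA\t30\t0\t60\t3\t.a,g,,\tII#I'I",
             "chr1\t101\tG\tT\t28\t12\t60\t4\t.,t,t,t\tI'II#"),
           tmp)

# quote = "" and comment.char = "" stop read.table() from treating
# ' and # inside the fields as quoting or comment characters.
pile <- read.table(tmp, sep = "\t", header = FALSE, quote = "",
                   comment.char = "", stringsAsFactors = FALSE)

col9 <- pile[[9]]   # character vector, one string per row
unlink(tmp)
```

The resulting character vector can be used wherever txtvec is used in the rest of this message.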

Now the clunky solution. Basically, it subtracts 1 from the count of
"fragments" that result from splitting each string on each letter in
turn. It could be made prettier with a function that did the job.

 > # strsplit() drops a trailing empty fragment, so a letter in the last
 > # position would be missed; pad each string with a non-letter first.
 > pad <- function(x, ch) strsplit(paste(x, "."), split=ch)
 > data.frame(A = unlist(lapply( lapply( sapply(txtvec, pad, ch="a"), length) , "-", 1)),
+ C = unlist(lapply( lapply( sapply(txtvec, pad, ch="c"), length) , "-", 1)),
+ G = unlist(lapply( lapply( sapply(txtvec, pad, ch="g"), length) , "-", 1)),
+ T = unlist(lapply( lapply( sapply(txtvec, pad, ch="t"), length) , "-", 1)) )
                       A C G T
  .a,g,,               1 0 1 0
            .t,t,,     0 0 0 2
            .,c,c,     0 2 0 0
            .,a,,,     1 0 0 0
            .,t,t,t    0 0 0 3
            .c,,g,^!.  0 1 1 0
            .g,ggg.^!, 0 0 4 0
            .$,,,,,.,  0 0 0 0
            a,g,,t,    1 0 1 1
            ,,,,,.,^!. 0 0 0 0
            ,$,,,,.,.  0 0 0 0

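This has the advantage that the input strings end up as row names, which
was a surprise (sapply() uses a character input vector as names by
default, and data.frame() picks those names up as row names).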
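As for a function that does the job: here is a minimal sketch that counts the matches directly with gregexpr() instead of counting fragments, so it does not matter where in the string a letter sits. The name count_bases and its bases argument are made up for this example, and the three test strings are taken from the data above with the leading blanks removed.

```r
# count_bases() is a hypothetical helper, not part of base R.
# It returns one row per input string and one column per letter.
count_bases <- function(x, bases = c("a", "c", "g", "t")) {
  counts <- lapply(bases, function(ch) {
    hits <- gregexpr(ch, x, fixed = TRUE)
    # gregexpr() reports -1 when there is no match, so count positions > 0
    vapply(hits, function(h) sum(h > 0), integer(1))
  })
  names(counts) <- toupper(bases)
  as.data.frame(counts)
}

res <- count_bases(c(".a,g,,", ".,t,t,t", ".g,ggg.^!,"))
print(res)
#   A C G T
# 1 1 0 1 0
# 2 0 0 0 3
# 3 0 0 4 0
```

Because fixed = TRUE is used, the letters are matched literally and symbols such as ^ and $ in the strings need no special treatment.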

If you wanted to count "A" and "a" as equivalent, then the split
argument should be "a|A" (and likewise for the other letters).
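The same idea can be sketched without spelling out both cases in the pattern, by asking gregexpr() to ignore case. This is only an illustration: count_ci is a made-up helper name and the two strings are invented test input, not taken from the data above.

```r
x <- c("A,a,t", ".G,g,,")   # made-up strings for illustration

# count_ci() is a hypothetical helper: number of case-insensitive
# matches of the pattern ch in each element of x.
count_ci <- function(x, ch) {
  m <- regmatches(x, gregexpr(ch, x, ignore.case = TRUE))
  vapply(m, length, integer(1))
}

n_a <- count_ci(x, "a")   # 2 0
n_g <- count_ci(x, "g")   # 0 2
```

Whether upper and lower case should be pooled depends on what the case distinction means in your file, so treat this as optional.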


> Most of the rows have "." and "," and other symbols, but we will
> ignore them. I just want to run a loop with a counter which will count
> the number of A, C, G and T for each row and give output something
> like this:
>
>
> A   C   G  T
> 1   0   1  0
> 0   0   0  2
> 0   2   0  0
> 1   0   0  0
> 0   0   0  3
>
> This output is for first 5 rows from the example given above.
>
> I am new to R. Can you please help me? I will be very thankful to you.
>
>
>
> Thanking you,
> Warm Regards
> Vikas Bansal
> Msc Bioinformatics
> Kings College London
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT


