[R] For help in R coding

David Winsemius dwinsemius at comcast.net
Sat Jul 2 22:04:18 CEST 2011


On reflection and a bit of testing I think the best approach would be  
to use gregexpr. For counting the number of commas, this appears quite  
straightforward.

 > sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1)  
length(x) else 0 )
  [1] 3 3 3 4 3 3 2 6 4 6 6
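
As far as I can tell, the off-by-one trouble with the strsplit approach quoted further down comes from how strsplit treats a match at the very end of a string: no trailing empty fragment is produced, whereas a match at the start does produce a leading "". A small sketch with toy strings (mine, not taken from the posted data):

```r
# Toy strings (illustrative only): each contains exactly two commas.
ends_with_comma   <- "a,b,"
starts_with_comma <- ",a,b"

# A match at the end of the string yields no trailing "" fragment ...
frag_end <- strsplit(ends_with_comma, ",", fixed = TRUE)[[1]]
# ... whereas a match at the start yields a leading "" fragment.
frag_start <- strsplit(starts_with_comma, ",", fixed = TRUE)[[1]]

length(frag_end)    # number of fragments when the string ends with a comma
length(frag_start)  # number of fragments when the string starts with a comma
```

So "number of fragments minus 1" is only a reliable count when the string does not end in the character being counted, which is presumably why counting the match positions from gregexpr behaves better here.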

It easily generalizes to the period and to the `|` (or) operation on letters.
(I did need to add the check, since each element of the gregexpr result has
length at least one but holds the value -1 when there is no match.)

 > sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1)  
length(x) else 0 )
  [1] 0 2 0 0 3 0 0 0 1 0 0
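
If this is needed for several characters at once, the same idea could be wrapped in a small helper. This is only a sketch: count_matches is a name I made up for illustration, and txtvec is retyped here without the leading whitespace so that the block runs on its own.

```r
# Hypothetical helper (not part of base R or of the earlier posts):
# count the matches of a pattern in each element of a character vector.
count_matches <- function(x, pattern, ...) {
  m <- gregexpr(pattern, x, ...)
  # gregexpr() reports "no match" as -1, so guard before taking length()
  vapply(m, function(pos) if (pos[1] != -1) length(pos) else 0L, integer(1))
}

# The example data from the original question, one string per row.
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t", ".c,,g,^!.",
            ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,", ",,,,,.,^!.", ",$,,,,.,.")

counts <- data.frame(
  A       = count_matches(txtvec, "a", ignore.case = TRUE),
  C       = count_matches(txtvec, "c", ignore.case = TRUE),
  G       = count_matches(txtvec, "g", ignore.case = TRUE),
  T       = count_matches(txtvec, "t", ignore.case = TRUE),
  periods = count_matches(txtvec, ".", fixed = TRUE),  # literal ".", no regex
  commas  = count_matches(txtvec, ",", fixed = TRUE)
)
counts
```

Using fixed = TRUE for the punctuation sidesteps the escaping question, and ignore.case = TRUE stands in for patterns like "t|T".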


On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:

> Hi:
>
> There seems to be a problem if the string ends in , or . , which makes
> it difficult for strsplit() to pick up if it is splitting on those
> characters. Here is an alternative, splitting on individual characters
> and using charmatch() instead:
>
> charsum <- function(s, char) {
>    u <- strsplit(s, "")
>    sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>   }
>
> unname(sapply(txtvec, function(x) charsum(x, ',')))
> unname(sapply(txtvec, function(x) charsum(x, '.')))
>
> Putting this into a data frame,
>
> dfout <- data.frame(periods = unname(sapply(txtvec, function(x)
> charsum(x, '.'))),
>                                commas = unname(sapply(txtvec,
> function(x) charsum(x, ','))) )
> txtvec
>
> HTH,
> Dennis
>
> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>
>>>
>>>
>>>>> Dear all,
>>>>>
>>>>> I am doing a project on variant calling using R. I am working on a
>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>> count the number of A, C, G and T in each row for column 9. An example
>>>>> of column 9 is given below:
>>>>>
>>>>>         .a,g,,
>>>>>         .t,t,,
>>>>>         .,c,c,
>>>>>         .,a,,,
>>>>>         .,t,t,t
>>>>>         .c,,g,^!.
>>>>>         .g,ggg.^!,
>>>>>         .$,,,,,.,
>>>>>         a,g,,t,
>>>>>         ,,,,,.,^!.
>>>>>         ,$,,,,.,.
>>>>>
>>>>> This is a bit confusing for me, as these characters are in one column.
>>>>> How can we scan them for each row to print the number of A, C, G and T
>>>>> for each row?
>>>>
>>>> Seems a bit clunky but this does the job (first the data):
>>>>
>>>> > txt <- " .a,g,,
>>>> +            .t,t,,
>>>> +            .,c,c,
>>>> +            .,a,,,
>>>> +            .,t,t,t
>>>> +            .c,,g,^!.
>>>> +            .g,ggg.^!,
>>>> +            .$,,,,,.,
>>>> +            a,g,,t,
>>>> +            ,,,,,.,^!.
>>>> +            ,$,,,,.,."
>>>>
>>>> > txtvec <- readLines(textConnection(txt))
>>>>
>>>> Now the clunky solution. It basically subtracts 1 from the counts of
>>>> "fragments" that result from splitting on each letter in turn. It could
>>>> be made prettier with a function that did the job.
>>>>
>>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>> split="a"), length) , "-", 1)),
>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>>> length) , "-", 1)),
>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>>> length) , "-", 1)),
>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>>> length) , "-", 1)) )
>>>>                     A C G T
>>>> .a,g,,               1 0 1 0
>>>>          .t,t,,     0 0 0 2
>>>>          .,c,c,     0 2 0 0
>>>>          .,a,,,     1 0 0 0
>>>>          .,t,t,t    0 0 0 2
>>>>          .c,,g,^!.  0 1 1 0
>>>>          .g,ggg.^!, 0 0 4 0
>>>>          .$,,,,,.,  0 0 0 0
>>>>          a,g,,t,    1 0 1 1
>>>>          ,,,,,.,^!. 0 0 0 0
>>>>          ,$,,,,.,.  0 0 0 0
>>>>
>>>> Has the advantage that the input data ends up as rownames, which was a
>>>> surprise.
>>>>
>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>> argument should be "a|A"
>>>>
>>>>
>>>
>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT LIKE
>>>>> THIS.
>>>
>>> BUT CAN I COUNT . AND , ALSO USING-
>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>> split=".|,"), length) , "-", 1)),
>>>
>>> I TRIED IT BUT IT'S NOT WORKING. IT IS GIVING OUTPUT, BUT IN SOME PLACES
>>> IT IS SHOWING TOO MANY . AND , AND IN OTHERS IT IS NOT EVEN
>>> COUNTING AND JUST SHOWS 0.
>>
>> You need to use valid regex expressions for 'split'. Since "." is a
>> special character in a regex (it matches any character), it needs to be
>> escaped when you want the literal to be recognized as such. The "," is not
>> special, but escaping it does no harm.
>>
>> I haven't figured out why but you need to drop the final operation of
>> subtracting 1 from the values when counting commas:
>>
>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>                             split="\\."), length) , "-", 1))
>>  ,commas = unlist( lapply( sapply(txtvec, strsplit,
>>                             split="\\,"), length) ) )
>>                       periods commas
>>  .a,g,,                      1      3
>>            .t,t,,           1      3
>>            .,c,c,           1      3
>>            .,a,,,           1      4
>>            .,t,t,t          1      4
>>            .c,,g,^!.        1      4
>>            .g,ggg.^!,       2      2
>>            .$,,,,,.,        2      6
>>            a,g,,t,          0      4
>>            ,,,,,.,^!.       1      7
>>            ,$,,,,.,.        1      7
>>
>> --
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
>>

David Winsemius, MD
West Hartford, CT


