[R] For help in R coding

Sun Jul 3 04:57:33 CEST 2011

On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:

> DEAR ALL,
> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>
> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txt=df[,9]
> txtvec <- readLines(textConnection(txt))
> dad=data.frame(A = unlist(sapply(gregexpr("A|a", txtvec),
>   function(x) if ( x[[1]] != -1) length(x) else 0 )),
> C = unlist(sapply(gregexpr("C|c", txtvec),
>   function(x) if ( x[[1]] != -1) length(x) else 0 )),
> G = unlist(sapply(gregexpr("G|g", txtvec),
>   function(x) if ( x[[1]] != -1) length(x) else 0 )),
> T = unlist(sapply(gregexpr("T|t", txtvec),
>   function(x) if ( x[[1]] != -1) length(x) else 0 )),
> N = unlist(sapply(gregexpr("\\,|\\.", txtvec),
>   function(x) if ( x[[1]] != -1) length(x) else 0 )))
>

The unlist operation is unnecessary since the sapply operation returns  
a vector.  (It doesn't hurt, but it is unnecessary.)
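
As a rough sketch of how the same table could be built without unlist() and with less repetition: the helper name count_matches is made up for illustration, and the txtvec below is just the eleven example strings from earlier in the thread, not column 9 of the real pileup file. It uses vapply() so the result is guaranteed to be an integer vector, and character classes such as "[Aa]" in place of "A|a".

```r
# Hypothetical helper (not from the thread): number of matches of
# 'pattern' in each element of the character vector 'x'.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x),
         # gregexpr() returns -1 when there is no match at all
         function(m) if (m[1] == -1L) 0L else length(m),
         integer(1))
}

# The example strings posted earlier in the thread
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t",
            ".c,,g,^!.", ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,",
            ",,,,,.,^!.", ",$,,,,.,.")

dad <- data.frame(A = count_matches("[Aa]", txtvec),
                  C = count_matches("[Cc]", txtvec),
                  G = count_matches("[Gg]", txtvec),
                  T = count_matches("[Tt]", txtvec),
                  N = count_matches("[,.]", txtvec))  # "." is literal inside [ ]
dad
```

With real data, txtvec would simply be df[, 9]; the readLines(textConnection()) step should not be needed when the column is already a character vector.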
>
>
>
>
> Thanking you,
> Warm Regards
> Vikas Bansal
> Msc Bioinformatics
> Kings College London
> ________________________________________
> From: David Winsemius [dwinsemius at comcast.net]
> Sent: Saturday, July 02, 2011 9:04 PM
> To: Dennis Murphy
> Cc: r-help at r-project.org; Bansal, Vikas
> Subject: Re: [R] For help in R coding
>
> On reflection and a bit of testing I think the best approach would be
> to use gregexpr. For counting the number of commas, this appears quite
> straightforward.
>
>> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
>  [1] 3 3 3 4 3 3 2 6 4 6 6
>
> It easily generalizes to period and the `|` (or) operation on letters.
> (I did need to add the check, since the result of gregexpr always has
> length at least one but has value -1 when there is no match.)
>
>> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
>  [1] 0 2 0 0 3 0 0 0 1 0 0
>
>
> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>
>> Hi:
>>
>> There seems to be a problem if the string ends in , or . , which makes
>> it difficult for strsplit() to pick up if it is splitting on those
>> characters. Here is an alternative, splitting on individual characters
>> and using charmatch() instead:
>>
>> charsum <- function(s, char) {
>>   u <- strsplit(s, "")
>>   sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>>  }
>>
>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>
>> Putting this into a data frame,
>>
>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x)
>> charsum(x, '.'))),
>>                               commas = unname(sapply(txtvec,
>> function(x) charsum(x, ','))) )
>> txtvec
>>
>> HTH,
>> Dennis
>>
>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>
>>>>
>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I am doing a project on variant calling using R. I am working on a
>>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>>> count the number of A, C, G and T in each row for column 9. An
>>>>>> example of column 9 is given below:
>>>>>>
>>>>>>        .a,g,,
>>>>>>        .t,t,,
>>>>>>        .,c,c,
>>>>>>        .,a,,,
>>>>>>        .,t,t,t
>>>>>>        .c,,g,^!.
>>>>>>        .g,ggg.^!,
>>>>>>        .$,,,,,.,
>>>>>>        a,g,,t,
>>>>>>        ,,,,,.,^!.
>>>>>>        ,$,,,,.,.
>>>>>>
>>>>>> This is a bit confusing for me, as these characters are all in one
>>>>>> column. How can we scan them for each row to print the number of
>>>>>> A, C, G and T for each row?
>>>>>
>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>
>>>>>> txt <- " .a,g,,
>>>>>
>>>>> +            .t,t,,
>>>>> +            .,c,c,
>>>>> +            .,a,,,
>>>>> +            .,t,t,t
>>>>> +            .c,,g,^!.
>>>>> +            .g,ggg.^!,
>>>>> +            .$,,,,,.,
>>>>> +            a,g,,t,
>>>>> +            ,,,,,.,^!.
>>>>> +            ,$,,,,.,."
>>>>>
>>>>>> txtvec <- readLines(textConnection(txt))
>>>>>
>>>>> Now the clunky solution. Basically, it subtracts 1 from the count of
>>>>> "fragments" that result from splitting on each letter in turn. It
>>>>> could be made prettier with a function that did the job.
>>>>>
>>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>>
>>>>> split="a"), length) , "-", 1)),
>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>>>> length) , "-", 1)),
>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>>>> length) , "-", 1)),
>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>>>> length) , "-", 1)) )
>>>>>                    A C G T
>>>>> .a,g,,             1 0 1 0
>>>>>         .t,t,,     0 0 0 2
>>>>>         .,c,c,     0 2 0 0
>>>>>         .,a,,,     1 0 0 0
>>>>>         .,t,t,t    0 0 0 2
>>>>>         .c,,g,^!.  0 1 1 0
>>>>>         .g,ggg.^!, 0 0 4 0
>>>>>         .$,,,,,.,  0 0 0 0
>>>>>         a,g,,t,    1 0 1 1
>>>>>         ,,,,,.,^!. 0 0 0 0
>>>>>         ,$,,,,.,.  0 0 0 0
>>>>>
>>>>> Has the advantage that the input data ends up as rownames, which
>>>>> was a
>>>>> surprise.
>>>>>
>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>> argument should be "a|A"
>>>>>
>>>>>
>>>>
>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>> LIKE
>>>>>> THIS.
>>>>
>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>> split=".|,"), length) , "-", 1)),
>>>>
>>>> I TRIED IT BUT IT'S NOT WORKING. IT IS GIVING THE OUTPUT, BUT IN SOME
>>>> PLACES IT IS SHOWING TOO MANY . AND , AND ELSEWHERE IT IS NOT EVEN
>>>> COUNTING AND JUST SHOWING 0.
>>>
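>>> You need to use valid regex expressions for 'split'. Since "." is a
>>> special character in a regex (it matches any single character), it
>>> needs to be escaped when you want the literal to be recognized as
>>> such. A comma has no special meaning, so escaping "," is harmless but
>>> not required.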
>>>
>>> I haven't figured out why, but you need to drop the final operation of
>>> subtracting 1 from the values when counting commas:
>>>
>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>                            split="\\."), length) , "-", 1))
>>> ,commas = unlist( lapply( sapply(txtvec, strsplit,
>>>                            split="\\,"), length) ) )
>>>                      periods commas
>>> .a,g,,                     1      3
>>>           .t,t,,           1      3
>>>           .,c,c,           1      3
>>>           .,a,,,           1      4
>>>           .,t,t,t          1      4
>>>           .c,,g,^!.        1      4
>>>           .g,ggg.^!,       2      2
>>>           .$,,,,,.,        2      6
>>>           a,g,,t,          0      4
>>>           ,,,,,.,^!.       1      7
>>>           ,$,,,,.,.        1      7
>>>
>>> --
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>
> David Winsemius, MD
> West Hartford, CT
>
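Coming back to the off-by-one puzzle with strsplit() in the quoted messages above: as far as I can tell from ?strsplit, a match at the end of a string does not produce a trailing empty piece, so "number of pieces minus one" is only right when the string does not end in the separator. A small sketch using one of the example strings (the variable names are just for illustration, and lengths() assumes a reasonably recent R):

```r
x <- ".,t,t,t"   # one of the example strings from the thread

# Splitting on "t": the final "t" leaves no empty piece behind it,
# so the pieces are ".,"  ","  ","  (three pieces for three t's).
pieces_t <- lengths(strsplit(x, "t", fixed = TRUE))
count_t  <- lengths(regmatches(x, gregexpr("t", x, fixed = TRUE)))

# Splitting on ",": the string does not end in ",", so pieces - 1 is right.
pieces_comma <- lengths(strsplit(x, ",", fixed = TRUE))
count_comma  <- lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))

c(pieces_t - 1L, count_t)          # 2 3
c(pieces_comma - 1L, count_comma)  # 3 3
```

That would explain why the T column showed 2 for ".,t,t,t" and why the comma counts only came out right for some rows. Counting matches directly with gregexpr() avoids the issue, since it does not depend on where in the string the matches fall.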

David Winsemius, MD
West Hartford, CT