[R] For help in R coding
David Winsemius
dwinsemius at comcast.net
Sun Jul 3 04:57:33 CEST 2011
On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:
> DEAR ALL,
> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>
> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txt=df[,9]
> txtvec <- readLines(textConnection(txt))
> dad=data.frame(
>   A = unlist(sapply(gregexpr("A|a", txtvec),
>         function(x) if ( x[[1]] != -1) length(x) else 0 )),
>   C = unlist(sapply(gregexpr("C|c", txtvec),
>         function(x) if ( x[[1]] != -1) length(x) else 0 )),
>   G = unlist(sapply(gregexpr("G|g", txtvec),
>         function(x) if ( x[[1]] != -1) length(x) else 0 )),
>   T = unlist(sapply(gregexpr("T|t", txtvec),
>         function(x) if ( x[[1]] != -1) length(x) else 0 )),
>   N = unlist(sapply(gregexpr("\\,|\\.", txtvec),
>         function(x) if ( x[[1]] != -1) length(x) else 0 )))
>
The unlist operation is unnecessary since the sapply operation returns
a vector. (It doesn't hurt, but it is unnecessary.)
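
For what it is worth, here is a minimal sketch of the same table without unlist() and without repeating the anonymous function five times. The helper name count_matches is my own invention, not something from the earlier messages, and txtvec is typed in from the example data rather than read from Case2.pileup.

```r
# Hypothetical helper (name is an assumption): number of regex matches
# in each element of a character vector.
count_matches <- function(pattern, x) {
  vapply(gregexpr(pattern, x),
         function(m) if (m[[1]] != -1) length(m) else 0L,
         integer(1))
}

# Example data from earlier in the thread, entered by hand.
txtvec <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t", ".c,,g,^!.",
            ".g,ggg.^!,", ".$,,,,,.,", "a,g,,t,", ",,,,,.,^!.", ",$,,,,.,.")

dad <- data.frame(A = count_matches("A|a", txtvec),
                  C = count_matches("C|c", txtvec),
                  G = count_matches("G|g", txtvec),
                  T = count_matches("T|t", txtvec),
                  N = count_matches("\\,|\\.", txtvec))
```

vapply() is used instead of sapply() only so that the result is guaranteed to be an integer vector; sapply() would give the same numbers here.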
>
>
>
>
> Thanking you,
> Warm Regards
> Vikas Bansal
> Msc Bioinformatics
> Kings College London
> ________________________________________
> From: David Winsemius [dwinsemius at comcast.net]
> Sent: Saturday, July 02, 2011 9:04 PM
> To: Dennis Murphy
> Cc: r-help at r-project.org; Bansal, Vikas
> Subject: Re: [R] For help in R coding
>
> On reflection and a bit of testing I think the best approach would be
> to use gregexpr. For counting the number of commas, this appears quite
> straightforward.
>
>> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
> [1] 3 3 3 4 3 3 2 6 4 6 6
>
> It easily generalizes to periods and to the `|` (or) operation on letters.
> (I did need to add the check, since each element of the gregexpr result
> always has length at least one, but holds the value -1 when there is no
> match.)
>
>> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1)
> length(x) else 0 )
> [1] 0 2 0 0 3 0 0 0 1 0 0
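
To make the reason for the check concrete, this small sketch looks at what gregexpr() returns for one string with no match and one with two. The variable names are only illustrative.

```r
m <- gregexpr("t|T", c(".a,g,,", ".t,t,,"))

no_match  <- m[[1]]   # first string contains no t or T
two_match <- m[[2]]   # second string contains two

length(no_match)      # 1, even though nothing matched
no_match[1]           # -1, the no-match marker
length(two_match)     # 2

# An equivalent count without an explicit if/else: match positions are
# always >= 1, so counting the positive entries gives 0 when nothing matched.
counts <- vapply(m, function(p) sum(p > 0), integer(1))
```

Either form works; the if/else version just makes the special case visible.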
>
>
> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>
>> Hi:
>>
>> There seems to be a problem if the string ends in , or . , which makes
>> it difficult for strsplit() to pick up if it is splitting on those
>> characters. Here is an alternative, splitting on individual characters
>> and using charmatch() instead:
>>
>> charsum <- function(s, char) {
>> u <- strsplit(s, "")
>> sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>> }
>>
>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>
>> Putting this into a data frame,
>>
>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x)
>> charsum(x, '.'))),
>> commas = unname(sapply(txtvec,
>> function(x) charsum(x, ','))) )
>> txtvec
>>
>> HTH,
>> Dennis
>>
>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>
>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>
>>>>
>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> I am doing a project on variant calling using R. I am working on a
>>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>>> count the number of A, C, G and T in each row for column 9. An
>>>>>> example of column 9 is given below:
>>>>>>
>>>>>> .a,g,,
>>>>>> .t,t,,
>>>>>> .,c,c,
>>>>>> .,a,,,
>>>>>> .,t,t,t
>>>>>> .c,,g,^!.
>>>>>> .g,ggg.^!,
>>>>>> .$,,,,,.,
>>>>>> a,g,,t,
>>>>>> ,,,,,.,^!.
>>>>>> ,$,,,,.,.
>>>>>>
>>>>>> This is a bit confusing for me, as all of these characters are in
>>>>>> one column. How can we scan each row and print the number of A, C,
>>>>>> G and T for that row?
>>>>>
>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>
>>>>>> txt <- " .a,g,,
>>>>>
>>>>> + .t,t,,
>>>>> + .,c,c,
>>>>> + .,a,,,
>>>>> + .,t,t,t
>>>>> + .c,,g,^!.
>>>>> + .g,ggg.^!,
>>>>> + .$,,,,,.,
>>>>> + a,g,,t,
>>>>> + ,,,,,.,^!.
>>>>> + ,$,,,,.,."
>>>>>
>>>>>> txtvec <- readLines(textConnection(txt))
>>>>>
>>>>> Now the clunky solution. It basically subtracts 1 from the counts of
>>>>> "fragments" that result from splitting on each letter in turn. It
>>>>> could be made prettier with a function that did the job.
>>>>>
>>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>>
>>>>> split="a"), length) , "-", 1)),
>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>>>> length) , "-", 1)),
>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>>>> length) , "-", 1)),
>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>>>> length) , "-", 1)) )
>>>>> A C G T
>>>>> .a,g,, 1 0 1 0
>>>>> .t,t,, 0 0 0 2
>>>>> .,c,c, 0 2 0 0
>>>>> .,a,,, 1 0 0 0
>>>>> .,t,t,t 0 0 0 2
>>>>> .c,,g,^!. 0 1 1 0
>>>>> .g,ggg.^!, 0 0 4 0
>>>>> .$,,,,,., 0 0 0 0
>>>>> a,g,,t, 1 0 1 1
>>>>> ,,,,,.,^!. 0 0 0 0
>>>>> ,$,,,,.,. 0 0 0 0
>>>>>
>>>>> Has the advantage that the input data ends up as rownames, which
>>>>> was a
>>>>> surprise.
>>>>>
>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>> argument should be "a|A"
>>>>>
>>>>>
>>>>
>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>> LIKE
>>>>>> THIS.
>>>>
>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>> split=".|,"), length) , "-", 1)),
>>>>
>>>> I TRIED IT BUT IT'S NOT WORKING. IT GIVES OUTPUT, BUT IN SOME PLACES
>>>> IT SHOWS TOO MANY . AND , AND ELSEWHERE IT DOES NOT COUNT THEM AT ALL
>>>> AND JUST SHOWS 0.
>>>
>>> You need to use valid regex expressions for 'split'. Since "." is a
>>> special character (it matches any single character), it needs to be
>>> escaped when you want the literal to be recognized as such. Escaping
>>> "," is harmless but not required.
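
A short sketch of the difference, using the first row of the example data. It compares the unescaped pattern from the question with three ways of asking for the literal characters. Nothing here goes beyond base R's strsplit().

```r
x <- ".a,g,,"

# Unescaped "." matches any character, so every character is treated as a
# delimiter and all the pieces come back empty.
strsplit(x, ".|,")[[1]]

# Three ways to ask for a literal period and/or comma.
strsplit(x, "\\.|,")[[1]]             # escape the period
strsplit(x, "[.,]")[[1]]              # character class: "." is literal inside [ ]
strsplit(x, ".", fixed = TRUE)[[1]]   # fixed = TRUE: no regex interpretation
```

The character class is probably the easiest to read when several punctuation characters are involved.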
>>>
>>> I haven't figured out why but you need to drop the final operation
>>> of
>>> subtracting 1 from the values when counting commas:
>>>
>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>> split="\\."), length) , "-", 1))
>>> ,commas = unlist( lapply( sapply(txtvec, strsplit,
>>> split="\\,"), length) ) )
>>> periods commas
>>> .a,g,, 1 3
>>> .t,t,, 1 3
>>> .,c,c, 1 3
>>> .,a,,, 1 4
>>> .,t,t,t 1 4
>>> .c,,g,^!. 1 4
>>> .g,ggg.^!, 2 2
>>> .$,,,,,., 2 6
>>> a,g,,t, 0 4
>>> ,,,,,.,^!. 1 7
>>> ,$,,,,.,. 1 7
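
A likely explanation for the off-by-one, sketched below: strsplit() does not return a trailing empty piece when the string ends with the delimiter, so the number of pieces equals the number of delimiters for strings that end in one, and is one more than that otherwise. The helper names pieces and n_commas are made up for this illustration.

```r
# Number of pieces strsplit() returns when splitting on a literal comma.
pieces <- function(s) length(strsplit(s, ",", fixed = TRUE)[[1]])

pieces("a,b,")   # 2 pieces for 2 commas: the trailing empty piece is dropped
pieces("a,b")    # 2 pieces for 1 comma

# So neither length nor length - 1 is right for every row. Counting the
# matches directly does not depend on how the string ends.
n_commas <- function(s) {
  m <- gregexpr(",", s, fixed = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}

n_commas(".,t,t,t")   # 3 (does not end in a comma)
n_commas(".a,g,,")    # 3 (ends in a comma)
```

This is why rows such as .,t,t,t show 4 in the commas column above although they contain only three commas.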
>>>
>>> --
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>
> David Winsemius, MD
> West Hartford, CT
>
David Winsemius, MD
West Hartford, CT