Bansal, Vikas
vikas.bansal at kcl.ac.uk
Sat Jul 2 22:46:32 CEST 2011
Dear all,
I tried this code and it runs perfectly:
# Read the pileup file, keeping every column as character
df <- read.table("Case2.pileup", fill = TRUE, sep = "\t", colClasses = "character")
txt <- df[, 9]
txtvec <- readLines(textConnection(txt))
# Per row, count the matches of each pattern; gregexpr() returns -1 when nothing matches
dad <- data.frame(
  A = unlist(sapply(gregexpr("A|a", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
  C = unlist(sapply(gregexpr("C|c", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
  G = unlist(sapply(gregexpr("G|g", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
  T = unlist(sapply(gregexpr("T|t", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
  N = unlist(sapply(gregexpr("\\,|\\.", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)))
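The five near-identical column expressions above could be folded into one helper. The following is only a sketch, not part of the original post: `count_matches` is a hypothetical name, the four-row `txtvec` is a cut-down sample of the example data from this thread, and `lengths()` assumes R 3.2.0 or later.

```r
# Hypothetical helper (not from the original post): count regex matches per element.
count_matches <- function(pattern, x, ignore.case = TRUE) {
  m <- gregexpr(pattern, x, ignore.case = ignore.case)
  # regmatches() returns character(0) when nothing matches, so lengths() gives 0
  # and no explicit check for the -1 sentinel is needed.
  lengths(regmatches(x, m))
}

# Cut-down sample of the column 9 strings quoted in this thread
txtvec <- c(".a,g,,", ".t,t,,", ".,t,t,t", ",$,,,,.,.")

counts <- data.frame(
  A = count_matches("a", txtvec),
  C = count_matches("c", txtvec),
  G = count_matches("g", txtvec),
  T = count_matches("t", txtvec),
  N = count_matches("[.,]", txtvec)  # same column name as above: periods plus commas
)
counts$T  # 0 2 3 0
```

Using a character class such as `[.,]` avoids having to escape the period, since `.` is literal inside brackets.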
Thanking you,
Warm Regards
Vikas Bansal
MSc Bioinformatics
King's College London
________________________________________
From: David Winsemius [dwinsemius at comcast.net]
Sent: Saturday, July 02, 2011 9:04 PM
To: Dennis Murphy
Cc: r-help at r-project.org; Bansal, Vikas
Subject: Re: [R] For help in R coding
On reflection and a bit of testing I think the best approach would be
to use gregexpr. For counting the number of commas, this appears quite
straightforward.
> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1)
length(x) else 0 )
[1] 3 3 3 4 3 3 2 6 4 6 6
It easily generalizes to the period and to the `|` (or) operation on letters.
(I did need to add the check, since the length of each gregexpr() result is
always at least one, but it has the value -1 when there is no match.)
> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1)
length(x) else 0 )
[1] 0 2 0 0 3 0 0 0 1 0 0
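To see why the check against -1 is needed, it may help to look at what gregexpr() returns for a single string with and without a match. This is a small illustration only; the variable names `hit` and `miss` are made up for the example.

```r
# One string that contains "t" twice and one that does not contain it at all
hit  <- gregexpr("t", ".t,t,,")[[1]]
miss <- gregexpr("t", ".a,g,,")[[1]]

as.integer(hit)   # 2 4  (start positions of the two matches)
length(hit)       # 2

as.integer(miss)  # -1   (sentinel meaning "no match")
length(miss)      # 1, so length() alone would wrongly report one match
```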
On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
> Hi:
>
> There seems to be a problem if the string ends in , or . , which makes
> it difficult for strsplit() to pick up if it is splitting on those
> characters. Here is an alternative, splitting on individual characters
> and using charmatch() instead:
>
> charsum <- function(s, char) {
> u <- strsplit(s, "")
> sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
> }
>
> unname(sapply(txtvec, function(x) charsum(x, ',')))
> unname(sapply(txtvec, function(x) charsum(x, '.')))
>
> Putting this into a data frame,
>
> dfout <- data.frame(periods = unname(sapply(txtvec, function(x)
> charsum(x, '.'))),
> commas = unname(sapply(txtvec,
> function(x) charsum(x, ','))) )
> txtvec
>
> HTH,
> Dennis
>
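The per-character idea in the quoted message can also be written with a plain equality test instead of charmatch(). This is a sketch of that variant, not Dennis's code: `count_char` is a hypothetical name, and it assumes `char` is a single character.

```r
# Hypothetical variant of the quoted charsum(): split each string into single
# characters and count how many are equal to `char`.
count_char <- function(s, char) {
  chars <- strsplit(s, "")  # one character vector per input string
  vapply(chars, function(ch) sum(ch == char), integer(1))
}

rows <- c(".,t,t,t", "a,g,,t,")  # two rows from the example data
count_char(rows, ",")  # 3 4
count_char(rows, ".")  # 1 0
```

Because the comparison is on literal characters, no regex escaping is involved, and a trailing `,` or `.` is counted like any other.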
> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>
>>>
>>>
>>>>> Dear all,
>>>>>
>>>>> I am doing a project on variant calling using R. I am working on a
>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>> count the number of A, C, G and T in each row for column 9. An example
>>>>> of column 9 is given below:
>>>>>
>>>>> .a,g,,
>>>>> .t,t,,
>>>>> .,c,c,
>>>>> .,a,,,
>>>>> .,t,t,t
>>>>> .c,,g,^!.
>>>>> .g,ggg.^!,
>>>>> .$,,,,,.,
>>>>> a,g,,t,
>>>>> ,,,,,.,^!.
>>>>> ,$,,,,.,.
>>>>>
>>>>> This is a bit confusing for me, as these characters are in one
>>>>> column. How can we scan each row and print the number of A, C, G
>>>>> and T for that row?
>>>>
>>>> Seems a bit clunky but this does the job (first the data):
>>>>
>>>> > txt <- " .a,g,,
>>>> + .t,t,,
>>>> + .,c,c,
>>>> + .,a,,,
>>>> + .,t,t,t
>>>> + .c,,g,^!.
>>>> + .g,ggg.^!,
>>>> + .$,,,,,.,
>>>> + a,g,,t,
>>>> + ,,,,,.,^!.
>>>> + ,$,,,,.,."
>>>>
>>>> > txtvec <- readLines(textConnection(txt))
>>>>
>>>> Now the clunky solution. Basically, it subtracts 1 from the counts of
>>>> "fragments" that result from splitting on each letter in turn. It
>>>> could be made prettier with a function that did the job.
>>>>
>>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )
>>>>            A C G T
>>>> .a,g,,     1 0 1 0
>>>> .t,t,,     0 0 0 2
>>>> .,c,c,     0 2 0 0
>>>> .,a,,,     1 0 0 0
>>>> .,t,t,t    0 0 0 2
>>>> .c,,g,^!.  0 1 1 0
>>>> .g,ggg.^!, 0 0 4 0
>>>> .$,,,,,.,  0 0 0 0
>>>> a,g,,t,    1 0 1 1
>>>> ,,,,,.,^!. 0 0 0 0
>>>> ,$,,,,.,.  0 0 0 0
>>>>
>>>> Has the advantage that the input data ends up as rownames, which
>>>> was a surprise.
>>>>
>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>> argument should be "a|A"
>>>>
>>>>
>>>
>>> As you mentioned, if I want to count A and a I should split like
>>> this.
>>>
>>> But can I also count . and , using:
>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>> split=".|,"), length) , "-", 1)),
>>>
>>> I tried it, but it is not working. It gives output, but in some places
>>> it shows too many . and , and elsewhere it does not count them at all
>>> and just shows 0.
>>
>> You need to use valid regular expressions for 'split'. Since "." is a
>> special character in a regex (it matches any character), it needs to be
>> escaped when you want the literal to be recognized as such. Escaping ","
>> is harmless, although a comma is not special.
>>
>> I haven't figured out why but you need to drop the final operation of
>> subtracting 1 from the values when counting commas:
>>
>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit,
>> split="\\."), length) , "-", 1))
>> ,commas = unlist( lapply( sapply(txtvec, strsplit,
>> split="\\,"), length) ) )
>>            periods commas
>> .a,g,,           1      3
>> .t,t,,           1      3
>> .,c,c,           1      3
>> .,a,,,           1      4
>> .,t,t,t          1      4
>> .c,,g,^!.        1      4
>> .g,ggg.^!,       2      2
>> .$,,,,,.,        2      6
>> a,g,,t,          0      4
>> ,,,,,.,^!.       1      7
>> ,$,,,,.,.        1      7
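The puzzle noted above (having to drop the subtraction of 1 when counting commas) appears to come from a documented property of strsplit(): a match at the very end of a string does not produce a trailing empty piece. The sketch below illustrates this on two rows from the example data; `count_commas` is a hypothetical helper and `lengths()` assumes R 3.2.0 or later.

```r
# Two rows taken from the example data in this thread
ends_in_comma <- "a,g,,t,"  # 4 commas, last character is a comma
ends_in_base  <- ".,t,t,t"  # 3 commas, last character is a letter

pieces_comma <- strsplit(ends_in_comma, ",")[[1]]
pieces_base  <- strsplit(ends_in_base, ",")[[1]]

pieces_comma          # "a" "g" "" "t"  (no trailing "" after the final comma)
length(pieces_comma)  # 4: equals the comma count only because the row ends in a comma
length(pieces_base)   # 4: one more than the 3 commas actually present

# Counting the matches directly does not depend on how the string ends
count_commas <- function(x) lengths(regmatches(x, gregexpr(",", x, fixed = TRUE)))
count_commas(c(ends_in_comma, ends_in_base))  # 4 3
```

So neither "pieces" nor "pieces minus 1" is right for every row, which is consistent with the table above reporting 4 commas for `.,t,t,t` and 7 for rows that contain 6.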
>>
>> --
>>
>> David Winsemius, MD
>> West Hartford, CT
>>
