[R] For help in R coding
David Winsemius
dwinsemius at comcast.net
Sun Jul 3 21:58:15 CEST 2011
On Jul 3, 2011, at 1:07 PM, Bansal, Vikas wrote:
> Yes, you are right. The unlist operation is unnecessary; I tried it
> yesterday and the code works without it. But I have one more problem
> that I worked on all day without finding a solution. As I told you, I
> am new to R. How can I use an if condition in the following code?
>
> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
> txtvec <- readLines(textConnection(df[,9]))
> dad=data.frame(
>   A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   N = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0))
>
>
> Now my problem is that my data frame also has the letters A, C, G and T
> in the 3rd column. The commas (,) and dots (.) in column 9 stand for
> whichever letter is in column 3. I want to use an if condition like
> this:
>
> if column 3 has A then
>   A = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
> else
>   A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
>
> and likewise for C, G and T: if column 3 has that letter, count the
> matches of "\\,|\\." for it, otherwise count the matches of the letter
> itself ("C|c", "G|g" or "T|t").
>
I finally figured out that you wanted this:
> dat$newcol <- apply(dat, 1, function(x) gsub("\\,|\\.", x[3], x[9]) )
# That replaces every instance of "," or "." in col9 with the letter in col3
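
For anyone who prefers not to run apply() over the whole data frame, the same substitution can be sketched on two plain vectors. This is only an illustration under assumptions: `ref` and `bases` are hypothetical stand-ins for columns 3 and 9, filled with three rows from the example data.

```r
# Hypothetical stand-ins for column 3 (reference base) and column 9 (read bases)
ref   <- c("T", "A", "C")
bases <- c(".a,g,,", ".t,t,,", "a,g,,t,")

# For each row, replace every "." or "," with that row's reference base.
# Inside a character class, "." is a literal dot and needs no escaping.
newcol <- unname(mapply(function(r, b) gsub("[.,]", r, b), ref, bases))
newcol
```

mapply() walks the two vectors in parallel, so each string gets the replacement letter from its own row.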
# Then the same old routine as yesterday
> dat$A <- sapply(gregexpr("A|a", dat[,"newcol"]), function(x) if (x[[1]] != -1) length(x) else 0 )
> dat$C <- sapply(gregexpr("C|c", dat[,"newcol"]), function(x) if (x[[1]] != -1) length(x) else 0 )
> dat$G <- sapply(gregexpr("G|g", dat[,"newcol"]), function(x) if (x[[1]] != -1) length(x) else 0 )
> dat$T <- sapply(gregexpr("T|t", dat[,"newcol"]), function(x) if (x[[1]] != -1) length(x) else 0 )
> dat[, c("A","C", "G", "T")]
   A C G T
1  1 0 1 4
2  4 0 0 2
3  4 2 0 0
4  1 5 0 0
5  0 0 4 3
6  5 1 1 0
7  4 0 4 0
8  8 0 0 0
9  1 4 1 1
10 0 0 0 8
11 0 0 0 8
>
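
The four nearly identical sapply() calls could be folded into one small helper. The sketch below is one possible way to do that, not a tested drop-in: `count_base` is a hypothetical name, and `newcol` here holds only the first three substituted strings from the example.

```r
# Hypothetical helper: case-insensitive count of one letter in each string
count_base <- function(x, base) {
  m <- gregexpr(base, x, ignore.case = TRUE)
  # gregexpr() gives a length-1 vector holding -1 when there is no match
  vapply(m, function(pos) if (pos[1] != -1) length(pos) else 0L, integer(1))
}

# First three rows of the example after the "." and "," substitution
newcol <- c("TaTgTT", "AtAtAA", "AAcAcA")

# One column of counts per letter; the names become the column names
counts <- sapply(c(A = "A", C = "C", G = "G", T = "T"), count_base, x = newcol)
counts
```

The result is a matrix with one row per input string; as.data.frame(counts) would turn it into a data frame if that is more convenient.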
> So I want to code so that it will give the output like this-
>
> DATA FRAME (Input)
>
> col3   col9
> T      .a,g,,
> A      .t,t,,
> A      .,c,c,
> C      .,a,,,
> G      .,t,t,t
> A      .c,,g,^!.
> A      .g,ggg.^!,
> A      .$,,,,,.,
> C      a,g,,t,
> T      ,,,,,.,^!.
> T      ,$,,,,.,.
>
>
> output
>
> A C G T
> 1 0 1 4
> 4 0 0 2
> 4 2 0 0
> 1 5 0 0
> 0 0 4 3
>
>
>
> This is the output for first five rows.
>
>
>
> Can you please show me how to use this if condition in your code, or
> whether it can be done some other way without an if condition?
>
> On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:
>
>> DEAR ALL,
>> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>>
>> df=read.table("Case2.pileup",fill=T,sep="\t",colClasses = "character")
>> txt=df[,9]
>> txtvec <- readLines(textConnection(txt))
>> dad=data.frame(
>>   A = unlist(sapply(gregexpr("A|a", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   C = unlist(sapply(gregexpr("C|c", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   G = unlist(sapply(gregexpr("G|g", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   T = unlist(sapply(gregexpr("T|t", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   N = unlist(sapply(gregexpr("\\,|\\.", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)))
>>
>
> The unlist operation is unnecessary since the sapply operation returns
> a vector. (It doesn't hurt, but it is unnecessary.)
>>
>>
>>
>>
>> Thanking you,
>> Warm Regards
>> Vikas Bansal
>> Msc Bioinformatics
>> Kings College London
>>
>> On reflection and a bit of testing I think the best approach would be
>> to use gregexpr. For counting the number of commas, this appears quite
>> straightforward.
>>
>>> sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 3 3 3 4 3 3 2 6 4 6 6
>>
>> It easily generalizes to periods and to the `|` (or) operation on
>> letters. (I did need to add the check, since each element returned by
>> gregexpr always has length at least one but holds the value -1 when
>> there is no match.)
>>
>>> sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 0 2 0 0 3 0 0 0 1 0 0
>>
>>
>> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>>
>>> Hi:
>>>
>>> There seems to be a problem if the string ends in , or . , which
>>> makes
>>> it difficult for strsplit() to pick up if it is splitting on those
>>> characters. Here is an alternative, splitting on individual
>>> characters
>>> and using charmatch() instead:
>>>
>>> charsum <- function(s, char) {
>>> u <- strsplit(s, "")
>>> sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>>> }
>>>
>>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>>
>>> Putting this into a data frame,
>>>
>>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
>>>                     commas  = unname(sapply(txtvec, function(x) charsum(x, ','))) )
>>> txtvec
>>>
>>> HTH,
>>> Dennis
>>>
>>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>
>>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>>
>>>>>
>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I am doing a project on variant calling using R. I am working on a
>>>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>>>> count the number of A, C, G and T in each row of column 9. An
>>>>>>> example of column 9 is given below:
>>>>>>>
>>>>>>> .a,g,,
>>>>>>> .t,t,,
>>>>>>> .,c,c,
>>>>>>> .,a,,,
>>>>>>> .,t,t,t
>>>>>>> .c,,g,^!.
>>>>>>> .g,ggg.^!,
>>>>>>> .$,,,,,.,
>>>>>>> a,g,,t,
>>>>>>> ,,,,,.,^!.
>>>>>>> ,$,,,,.,.
>>>>>>>
>>>>>>> This is a bit confusing for me because all these characters are in
>>>>>>> one column. How can we scan each row and print the number of A, C,
>>>>>>> G and T for that row?
>>>>>>
>>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>>
>>>>>>> txt <- " .a,g,,
>>>>>>
>>>>>> + .t,t,,
>>>>>> + .,c,c,
>>>>>> + .,a,,,
>>>>>> + .,t,t,t
>>>>>> + .c,,g,^!.
>>>>>> + .g,ggg.^!,
>>>>>> + .$,,,,,.,
>>>>>> + a,g,,t,
>>>>>> + ,,,,,.,^!.
>>>>>> + ,$,,,,.,."
>>>>>>
>>>>>>> txtvec <- readLines(textConnection(txt))
>>>>>>
>>>>>> Now the clunky solution. It basically subtracts 1 from the count of
>>>>>> "fragments" that result from splitting on each letter in turn. It
>>>>>> could be made prettier with a function that did the job.
>>>>>>
>>>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>>>
>>>>>> split="a"), length) , "-", 1)),
>>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"),
>>>>>> length) , "-", 1)),
>>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"),
>>>>>> length) , "-", 1)),
>>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"),
>>>>>> length) , "-", 1)) )
>>>>>> A C G T
>>>>>> .a,g,, 1 0 1 0
>>>>>> .t,t,, 0 0 0 2
>>>>>> .,c,c, 0 2 0 0
>>>>>> .,a,,, 1 0 0 0
>>>>>> .,t,t,t 0 0 0 2
>>>>>> .c,,g,^!. 0 1 1 0
>>>>>> .g,ggg.^!, 0 0 4 0
>>>>>> .$,,,,,., 0 0 0 0
>>>>>> a,g,,t, 1 0 1 1
>>>>>> ,,,,,.,^!. 0 0 0 0
>>>>>> ,$,,,,.,. 0 0 0 0
>>>>>>
>>>>>> Has the advantage that the input data ends up as rownames, which
>>>>>> was a
>>>>>> surprise.
>>>>>>
>>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>>> argument should be "a|A"
>>>>>>
>>>>>>
>>>>>
>>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>>> LIKE
>>>>>>> THIS.
>>>>>
>>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>> split=".|,"), length) , "-", 1)),
>>>>>
>>>>> I TRIED IT BUT ITS NOT WORKING.IT IS GIVING THE OUTPUT BUT AT SOME
>>>>> PLACES
>>>>> IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN
>>>>> CALCULATING AND JUST SHOWING 0.
>>>>
>>>> You need to use valid regex expressions for 'split'. Since "." is a
>>>> special character (it matches any character), it needs to be escaped
>>>> when you want the literal to be recognized as such. Escaping "," is
>>>> not required but is harmless.
>>>>
>>>> I haven't figured out why but you need to drop the final operation
>>>> of
>>>> subtracting 1 from the values when counting commas:
>>>>
>>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec,
>>>> strsplit,
>>>> split="\\."), length) , "-", 1))
>>>> ,commas = unlist( lapply( sapply(txtvec, strsplit,
>>>> split="\\,"), length) ) )
>>>> periods commas
>>>> .a,g,, 1 3
>>>> .t,t,, 1 3
>>>> .,c,c, 1 3
>>>> .,a,,, 1 4
>>>> .,t,t,t 1 4
>>>> .c,,g,^!. 1 4
>>>> .g,ggg.^!, 2 2
>>>> .$,,,,,., 2 6
>>>> a,g,,t, 0 4
>>>> ,,,,,.,^!. 1 7
>>>> ,$,,,,.,. 1 7
>>>>
>>>> --
>>>>
>>
>
>
David Winsemius, MD
West Hartford, CT
