[R] For help in R coding

Mon Jul 4 00:10:40 CEST 2011

________________________________________
From: David Winsemius [dwinsemius at comcast.net]
Sent: Sunday, July 03, 2011 7:08 PM
To: Bansal, Vikas
Cc: Dennis Murphy; r-help at r-project.org
Subject: Re: [R] For help in R coding

On Jul 3, 2011, at 1:07 PM, Bansal, Vikas wrote:

> Yes, you are right. The unlist operation is unnecessary; I tried it
> yesterday and the code works without it. But I have one more problem
> that I worked on all day without finding a solution. As I told you,
> I am new to R. I want to ask how I can use an if condition in the
> following code:
>
> df=read.table("Case2.pileup",fill=TRUE,sep="\t",colClasses = "character")
> txtvec <- readLines(textConnection(df[,9]))
> dad=data.frame(
>   A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   N = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0))
>
>
> Now my problem is that my data frame also has the letters A, C, G and T
> in the 3rd column. The commas (,) and dots (.) in column 9 stand for
> the letter that is in column 3. I want to use an if condition like
> this:
>
> if column 3 of my data frame has A then
>   A = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
> else
>   A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
>
> if column 3 of my data frame has C then
>   C = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
> else
>   C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
>
> if column 3 of my data frame has G then
>   G = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
> else
>   G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
>
> if column 3 of my data frame has T then
>   T = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
> else
>   T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
>
>
> So I want to write code that gives output like this:
>
> DATA FRAME (Input)
>
>   col 3     col 9
>   T         .a,g,,
>   A         .t,t,,
>   A         .,c,c,
>   C         .,a,,,
>   G         .,t,t,t
>   A         .c,,g,^!.
>   A         .g,ggg.^!,
>   A         .$,,,,,.,
>   C         a,g,,t,
>   T         ,,,,,.,^!.
>   T         ,$,,,,.,.
>
> Output
>
>   A    C    G    T
>   1    0    1    4
>   4    0    0    2
>   4    2    0    0
>   1    5    0    0
>   0    0    4    3
>
> This is the output for the first five rows.

I was unable to follow the logic, and because complete output was not
offered, I am unable to check my guesses against your full
specifications.

Oh, sorry. I will explain it again. As I told you, my data frame has ten columns, but I am working on the 3rd and 9th columns. To count the number of A, C, G, T, (.) and (,) we used the following code:
> df=read.table("Case2.pileup",fill=TRUE,sep="\t",colClasses = "character")
> txtvec <- readLines(textConnection(df[,9]))
> dad=data.frame(
>   A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0),
>   N = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0))
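
The five near-identical sapply() calls above could be folded into one small helper. What follows is only a sketch under assumptions: count_matches and reads are illustrative names that do not appear in the thread, reads holds the first five sample strings in place of df[,9], and the character class [.,] is used as an equivalent of the "\\,|\\." pattern.

```r
# Sample strings copied from the thread (a stand-in for df[, 9]).
reads <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t")

# gregexpr() returns -1 (a vector of length 1) when nothing matches,
# so the count has to be forced to 0 in that case.
# count_matches is a hypothetical helper name, not from the thread.
count_matches <- function(pattern, x) {
  sapply(gregexpr(pattern, x),
         function(m) if (m[[1]] != -1) length(m) else 0L)
}

dad <- data.frame(
  A = count_matches("A|a", reads),
  C = count_matches("C|c", reads),
  G = count_matches("G|g", reads),
  T = count_matches("T|t", reads),
  N = count_matches("[.,]", reads)   # dots and commas together
)
print(dad)
```

Keeping the -1 check in one place means a change to the counting rule only has to be made once.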

Now, in the 3rd column of my data frame I have the characters A, C, G or T, so my 3rd and 9th columns look like this:

  col 3     col 9
  T         .a,g,,
  A         .t,t,,
  A         .,c,c,
  C         .,a,,,
  G         .,t,t,t
  A         .c,,g,^!.
  A         .g,ggg.^!,
  A         .$,,,,,.,
  C         a,g,,t,
  T         ,,,,,.,^!.
  T         ,$,,,,.,.

Initially we were working on the 9th column only, counting A, C, G, T, (.) and (,) separately with the code you provided, shown above.
But now I want this: if column 3 has T, then the count for T should be set equal to the number of . and , characters, as in the output I showed you:

Output

  A    C    G    T
  1    0    1    4
  4    0    0    2
  4    2    0    0
  1    5    0    0
  0    0    4    3

In the first row of my input I have T in the 3rd column, so T = the total number of . and , characters, which is 4, and a and g are 1 each.
In the second row of my input I have A in the 3rd column, so A should equal the total number of (.) and (,) characters, which is 4, and the remaining characters are the 2 t's.

That is why I wrote this using an if condition:
if column 3 of my data frame has A then
  A = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  A = sapply(gregexpr("A|a", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has C then
  C = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  C = sapply(gregexpr("C|c", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has G then
  G = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  G = sapply(gregexpr("G|g", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

if column 3 of my data frame has T then
  T = sapply(gregexpr("\\,|\\.", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)
else
  T = sapply(gregexpr("T|t", df[,9]), function(x) if (x[[1]] != -1) length(x) else 0)

The code is the same; I just want to add a condition so that it checks whether the character in column 3 is A and, if so, sets the count of A equal to the total number of . and , characters (and likewise for C, G and T).

Should I explain better or can you please tell me which thing is not clear?
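
One possible way to express the condition described above, without writing a separate if for each letter, is to count everything first and then overwrite the column named by column 3 using matrix indexing. This is a sketch under assumptions, not the method given in the thread: ref and reads are illustrative names standing in for df[,3] and df[,9], count_matches is a hypothetical helper, the data are the first five sample rows, and lengths() requires R 3.2.0 or later. It also assumes the count for the reference base should be replaced by the dot and comma total, as stated in the description.

```r
# First five sample rows from the thread; 'ref' stands in for df[, 3]
# and 'reads' for df[, 9] (both names are illustrative).
ref   <- c("T", "A", "A", "C", "G")
reads <- c(".a,g,,", ".t,t,,", ".,c,c,", ".,a,,,", ".,t,t,t")

# regmatches() yields character(0) when nothing matches, so lengths()
# gives 0 without a special case for "no match".
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

bases <- c("A", "C", "G", "T")

# One column per base, counting upper and lower case together.
counts <- sapply(bases, function(b) {
  count_matches(paste0("[", b, tolower(b), "]"), reads)
})

# Number of '.' and ',' characters in each row.
n_ref <- count_matches("[.,]", reads)

# For each row, overwrite the column named by the reference base.
idx <- cbind(seq_along(ref), match(ref, bases))
ok  <- !is.na(idx[, 2])   # skip rows whose reference is not A, C, G or T
counts[idx[ok, , drop = FALSE]] <- n_ref[ok]

dad <- as.data.frame(counts)
print(dad)
#   A C G T
# 1 1 0 1 4
# 2 4 0 0 2
# 3 4 2 0 0
# 4 1 5 0 0
# 5 0 0 4 3
```

The two-column index matrix selects one cell per row, so each row is updated in the column chosen by its own reference base, which is what the chain of if conditions was trying to do.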

>
--
David.
>
>
>
> Can you please show me how to use this if condition in your code,
> or whether we can do it some other way instead of an if condition?
>
> ________________________________________
> From: David Winsemius [dwinsemius at comcast.net]
> Sent: Sunday, July 03, 2011 3:57 AM
> To: Bansal, Vikas
> Cc: Dennis Murphy; r-help at r-project.org
> Subject: Re: [R] For help in R coding
>
> On Jul 2, 2011, at 4:46 PM, Bansal, Vikas wrote:
>
>> DEAR ALL,
>> I TRIED THIS CODE AND THIS IS RUNNING PERFECTLY...
>>
>> df=read.table("Case2.pileup",fill=TRUE,sep="\t",colClasses = "character")
>> txt=df[,9]
>> txtvec <- readLines(textConnection(txt))
>> dad=data.frame(
>>   A = unlist(sapply(gregexpr("A|a", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   C = unlist(sapply(gregexpr("C|c", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   G = unlist(sapply(gregexpr("G|g", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   T = unlist(sapply(gregexpr("T|t", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)),
>>   N = unlist(sapply(gregexpr("\\,|\\.", txtvec), function(x) if (x[[1]] != -1) length(x) else 0)))
>>
>
> The unlist operation is unnecessary since the sapply operation returns
> a vector.  (It doesn't hurt, but it is unnecessary.)
>>
>>
>>
>>
>> Thanking you,
>> Warm Regards
>> Vikas Bansal
>> Msc Bioinformatics
>> Kings College London
>> ________________________________________
>> From: David Winsemius [dwinsemius at comcast.net]
>> Sent: Saturday, July 02, 2011 9:04 PM
>> To: Dennis Murphy
>> Cc: r-help at r-project.org; Bansal, Vikas
>> Subject: Re: [R] For help in R coding
>>
>> On reflection and a bit of testing I think the best approach would be
>> to use gregexpr. For counting the number of commas, this appears quite
>> straightforward.
>>
>> > sapply(gregexpr("\\,", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 3 3 3 4 3 3 2 6 4 6 6
>>
>> It easily generalizes to period and the `|` (or) operation on letters.
>> (I did need to add the check, since each element returned by gregexpr
>> always has length at least one but has value -1 when there is no match.)
>>
>> > sapply(gregexpr("t|T", txtvec), function(x) if ( x[[1]] != -1) length(x) else 0 )
>> [1] 0 2 0 0 3 0 0 0 1 0 0
>>
>>
>> On Jul 2, 2011, at 3:22 PM, Dennis Murphy wrote:
>>
>>> Hi:
>>>
>>> There seems to be a problem if the string ends in , or . , which makes
>>> it difficult for strsplit() to pick up if it is splitting on those
>>> characters. Here is an alternative, splitting on individual characters
>>> and using charmatch() instead:
>>>
>>> charsum <- function(s, char) {
>>>  u <- strsplit(s, "")
>>>  sum(sapply(u, function(x) charmatch(x, char)), na.rm = TRUE)
>>> }
>>>
>>> unname(sapply(txtvec, function(x) charsum(x, ',')))
>>> unname(sapply(txtvec, function(x) charsum(x, '.')))
>>>
>>> Putting this into a data frame,
>>>
>>> dfout <- data.frame(periods = unname(sapply(txtvec, function(x) charsum(x, '.'))),
>>>                     commas  = unname(sapply(txtvec, function(x) charsum(x, ','))))
>>> txtvec
>>>
>>> HTH,
>>> Dennis
>>>
>>> On Sat, Jul 2, 2011 at 10:19 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>
>>>> On Jul 2, 2011, at 12:34 PM, Bansal, Vikas wrote:
>>>>
>>>>>
>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I am doing a project on variant calling using R. I am working on a
>>>>>>> pileup file. There are 10 columns in my data frame and I want to
>>>>>>> count the number of A, C, G and T in each row for column 9. An
>>>>>>> example of column 9 is given below:
>>>>>>>
>>>>>>>       .a,g,,
>>>>>>>       .t,t,,
>>>>>>>       .,c,c,
>>>>>>>       .,a,,,
>>>>>>>       .,t,t,t
>>>>>>>       .c,,g,^!.
>>>>>>>       .g,ggg.^!,
>>>>>>>       .$,,,,,.,
>>>>>>>       a,g,,t,
>>>>>>>       ,,,,,.,^!.
>>>>>>>       ,$,,,,.,.
>>>>>>>
>>>>>>> This is a bit confusing for me, as these characters are in one column.
>>>>>>> How can we scan them for each row to print the number of A, C, G and T
>>>>>>> for each row?
>>>>>>
>>>>>> Seems a bit clunky but this does the job (first the data):
>>>>>>
>>>>>> > txt <- " .a,g,,
>>>>>> +            .t,t,,
>>>>>> +            .,c,c,
>>>>>> +            .,a,,,
>>>>>> +            .,t,t,t
>>>>>> +            .c,,g,^!.
>>>>>> +            .g,ggg.^!,
>>>>>> +            .$,,,,,.,
>>>>>> +            a,g,,t,
>>>>>> +            ,,,,,.,^!.
>>>>>> +            ,$,,,,.,."
>>>>>>
>>>>>> > txtvec <- readLines(textConnection(txt))
>>>>>>
>>>>>> Now the clunky solution. Basically it subtracts 1 from the counts of
>>>>>> "fragments" that result from splitting on each letter in turn. Could
>>>>>> be made prettier with a function that did the job.
>>>>>>
>>>>>> > data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit, split="a"), length) , "-", 1)),
>>>>>> + C = unlist(lapply( lapply( sapply(txtvec, strsplit, split="c"), length) , "-", 1)),
>>>>>> + G = unlist(lapply( lapply( sapply(txtvec, strsplit, split="g"), length) , "-", 1)),
>>>>>> + T = unlist(lapply( lapply( sapply(txtvec, strsplit, split="t"), length) , "-", 1)) )
>>>>>>                   A C G T
>>>>>> .a,g,,               1 0 1 0
>>>>>>        .t,t,,     0 0 0 2
>>>>>>        .,c,c,     0 2 0 0
>>>>>>        .,a,,,     1 0 0 0
>>>>>>        .,t,t,t    0 0 0 2
>>>>>>        .c,,g,^!.  0 1 1 0
>>>>>>        .g,ggg.^!, 0 0 4 0
>>>>>>        .$,,,,,.,  0 0 0 0
>>>>>>        a,g,,t,    1 0 1 1
>>>>>>        ,,,,,.,^!. 0 0 0 0
>>>>>>        ,$,,,,.,.  0 0 0 0
>>>>>>
>>>>>> Has the advantage that the input data ends up as rownames, which
>>>>>> was a
>>>>>> surprise.
>>>>>>
>>>>>> If you wanted to count "A" and "a" as equivalent, then the split
>>>>>> argument should be "a|A"
>>>>>>
>>>>>>
>>>>>
>>>>>>> AS YOU MENTIONED THAT IF I WANT TO COUNT A AND a I SHOULD SPLIT
>>>>>>> LIKE
>>>>>>> THIS.
>>>>>
>>>>> BUT CAN I COUNT . AND , ALSO USING-
>>>>> data.frame(A = unlist(lapply( lapply( sapply(txtvec, strsplit,
>>>>> split=".|,"), length) , "-", 1)),
>>>>>
>>>>> I TRIED IT BUT ITS NOT WORKING.IT IS GIVING THE OUTPUT BUT AT SOME
>>>>> PLACES
>>>>> IT IS SHOWING MORE NUMBER OF . AND , AND SOMEWHERE IT IS NOT EVEN
>>>>> CALCULATING AND JUST SHOWING 0.
>>>>
>>>> You need to use valid regex expressions for 'split'. Since "." is a
>>>> special character in a regex, it needs to be escaped when you want the
>>>> literal to be recognized as such (escaping "," as well is harmless).
>>>>
>>>> I haven't figured out why but you need to drop the final operation of
>>>> subtracting 1 from the values when counting commas:
>>>>
>>>> data.frame(periods = unlist(lapply( lapply( sapply(txtvec, strsplit, split="\\."), length) , "-", 1)),
>>>>            commas = unlist( lapply( sapply(txtvec, strsplit, split="\\,"), length) ) )
>>>>                     periods commas
>>>> .a,g,,                      1      3
>>>>          .t,t,,           1      3
>>>>          .,c,c,           1      3
>>>>          .,a,,,           1      4
>>>>          .,t,t,t          1      4
>>>>          .c,,g,^!.        1      4
>>>>          .g,ggg.^!,       2      2
>>>>          .$,,,,,.,        2      6
>>>>          a,g,,t,          0      4
>>>>          ,,,,,.,^!.       1      7
>>>>          ,$,,,,.,.        1      7
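
A likely explanation for the comma counts above, based on the documented behavior of strsplit(): when a match falls at the very end of a string, no trailing empty piece is returned, so a string ending in the separator yields one piece fewer than expected. The sketch below uses two of the sample strings from the thread; x, n_commas and n_pieces are illustrative names only.

```r
# Two sample strings from the thread: one ends in a comma, one does not.
x <- c(".a,g,,", ".,t,t,t")

# Actual number of commas, counted with gregexpr().
n_commas <- sapply(gregexpr(",", x, fixed = TRUE),
                   function(m) if (m[[1]] != -1) length(m) else 0L)

# Number of pieces produced by splitting on a comma.
n_pieces <- sapply(strsplit(x, ",", fixed = TRUE), length)

print(data.frame(x, n_commas, n_pieces))
#         x n_commas n_pieces
# 1  .a,g,,        3        3
# 2 .,t,t,t        3        4
```

Both strings contain three commas, yet the piece counts differ, so neither "pieces" nor "pieces minus 1" is correct for every row. Counting matches directly avoids the problem.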
>>>>
>>>> --
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>