[R] How to do the same thing for all levels of a column?

Tue Jul 24 18:37:44 CEST 2012

OK, I admit it: I re-read what you wrote and now I'm confused. Is:

> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x)))

            X1  X2        X3    X4  X5  X6    X7  X8
[1,] 0.1428571 0.2 0.2857143 0.125 0.2 0.2 0.125 0.2
[2,] 0.4285714 0.2 0.1428571 0.250 0.4 0.2 0.375 0.2
[3,] 0.1428571 0.4 0.2857143 0.375 0.2 0.2 0.250 0.4
[4,] 0.2857143 0.2 0.2857143 0.250 0.2 0.4 0.250 0.2

what you want?

-- Bert
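
A note on the call above: when tapply() is given no function, it returns only the group index of each row, so the matrix shown holds scaled factor codes rather than proportions. A minimal sketch of a weighted version follows, assuming that f was meant to be myfile$Time_zero and that the weights should be summed within each residue. Only X1 and X2 of the posted data are rebuilt here to keep it short; weighted_prop is a name made up for this illustration.

```r
# Rebuild the first two residue columns of the sample data from the thread.
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = factor(c("L", "T", "L", "R")),
  X2 = factor(c("E", "E", "M", "E"))
)

# For each residue column, sum Time_zero within each amino acid,
# then divide by the grand total so each position sums to 1.
# Assumption: Time_zero is the weight the original poster wants.
weighted_prop <- lapply(myfile[, -c(1, 2)], function(x) {
  prop.table(tapply(myfile$Time_zero, x, sum))
})

weighted_prop$X1
weighted_prop$X2
```

With the full data frame, the same lapply() call should cover X1 through X8 without changes, since it simply skips the first two columns.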
On Tue, Jul 24, 2012 at 9:17 AM, Bert Gunter <bgunter at gene.com> wrote:
> The OP's request is a bit ambiguous to me: at a given residue, do you
> wish to calculate the proportions for only those amino acids that
> appear at that residue, or do you wish to include the proportions for
> all amino acids, some of which might then be 0?
>
> Assuming the former, then I don't think one needs to go to the lengths
> described by John below.
>
> Using your example (thanks!), the following seems to suffice:
>
>> sapply(myfile[,-c(1,2)],function(x)prop.table(table(x)))
>
> $X1
> x
>    L    R    T
> 0.50 0.25 0.25
>
> $X2
> x
>    E    M
> 0.75 0.25
>
> $X3
> x
>    N    Y
> 0.25 0.75
>
> $X4
> x
>    I    L    Q
> 0.25 0.50 0.25
>
> $X5
> x
>    I    V
> 0.75 0.25
>
> $X6
> x
>    P    S
> 0.75 0.25
>
> $X7
> x
>    D    E    G
> 0.25 0.50 0.25
>
> $X8
> x
>    A    C
> 0.75 0.25
>
>
> This could, of course, then be modified to add zero proportions for
> all non-appearing amino acids.
>
> -- Cheers,
> Bert
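
One way the modification mentioned above might look, as a sketch rather than a definitive recipe: re-level every residue column on a shared alphabet before tabulating, so that table() reports zero counts for letters that do not appear. The alphabet here is just the set of letters seen in the data (all_aa and props are illustrative names); the full set of 20 amino acid codes could be substituted. Only X1 and X2 of the posted data are rebuilt.

```r
# Two residue columns of the sample data from the thread.
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = factor(c("L", "T", "L", "R")),
  X2 = factor(c("E", "E", "M", "E"))
)

# All amino acids observed anywhere in the residue columns.
residues <- myfile[, -c(1, 2)]
all_aa <- sort(unique(unlist(lapply(residues, as.character))))

# Re-level each column on the shared alphabet, then tabulate.
# Every table now has the same length, so sapply() returns a matrix
# with one row per amino acid and one column per position.
props <- sapply(residues, function(x) {
  prop.table(table(factor(x, levels = all_aa)))
})

props
```

Because every column now shares the same levels, the result is a rectangular matrix instead of a ragged list, which is easier to plot or export.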
>
> On Tue, Jul 24, 2012 at 8:18 AM, John Kane <jrkrideau at inbox.com> wrote:
>>
>>    I think this does what you want using two packages, plyr and reshape2, that
>>    you may have to install.  If so, install.packages(c("plyr", "reshape2"))
>>    should do the trick.
>>
>>    library(plyr)
>>    library(reshape2)
>>    # using the supplied data frame 'myfile' from below
>>    time0total  <-  sum(myfile[, 2])
>>    mydata  <-  myfile[, 2:10]   # Time_zero plus residue columns X1..X8
>>    md1  <-  melt(mydata, id = "Time_zero")
>>    # weighted proportion of each residue at each position
>>    ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
>>
>>
>>    John Kane
>>    Kingston ON Canada
>>
>>    -----Original Message-----
>>    From: zj29 at cornell.edu
>>    Sent: Tue, 24 Jul 2012 10:25:21 -0400
>>    To: jrkrideau at inbox.com
>>    Subject: Re: [R] How to do the same thing for all levels of a column?
>>
>>    Hi John,
>>    Thank you for the tips. My apologies about the unreadable sample data...
>>    So here is the output of the sample data, and hopefully it works this time
>>    :)
>>    myfile  <-  structure(list(Proteins = structure(1:4, .Label = c("p1", "p2",
>>    "p3", "p4"), class = "factor"), Time_zero = c(0.0050723, 0.0002731,
>>    9.76e-05, 0.0002077), X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L",
>>    "R", "T"), class = "factor"), X2 = structure(c(1L, 1L, 2L, 1L
>>    ), .Label = c("E", "M"), class = "factor"), X3 = structure(c(2L,
>>    1L, 2L, 2L), .Label = c("N", "Y"), class = "factor"), X4 = structure(c(1L,
>>    2L,  3L,  2L),  .Label  =  c("I",  "L",  "Q"), class = "factor"), X5 =
>>    structure(c(1L,
>>    2L, 1L, 1L), .Label = c("I", "V"), class = "factor"), X6 = structure(c(1L,
>>    1L, 1L, 2L), .Label = c("P", "S"), class = "factor"), X7 = structure(c(1L,
>>    3L,  2L,  2L),  .Label  =  c("D",  "E",  "G"), class = "factor"), X8 =
>>    structure(c(1L,
>>    1L,  2L,  1L),  .Label  =  c("A",  "C"),  class = "factor")), .Names =
>>    c("Proteins",
>>    "Time_zero", "X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"), row.names =
>>    c(NA,
>>    4L), class = "data.frame")
>>    And here is my original question:
>>    Basically, I have a bunch of protein sequences composed of different amino
>>    acid residues, and each residue is represented by an uppercase letter. I
>>    want  to  calculate the ratio of different amino acid residues at each
>>    position of the proteins.
>>
>>    If  I  name  this table as myfile.txt, I have the following scripts to
>>    calculate the ratio of each amino acid residue at position 1:
>>
>>    # showing levels of the 3rd column, which means the types of residues
>>
>>    >myfile[,3]
>>
>>
>>    # calculating the ratio of L
>>
>>    >list=c(which(myfile[,3]=="L"))
>>
>>    >time0total=sum(myfile[,2])
>>
>>    >AA_L=0
>>
>>    >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>
>>    >ratio_L=AA_L/time0total
>>
>>
>>    So how can I write a script to do the same thing for the other two levels (T
>>    and R) in column 3, and also do this for every column that contains amino
>>    acid residues?
>>
>>    Thanks a lot!
>>
>>    Regards,
>>
>>    Zhao
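
For reference, the loop in the question adds up Time_zero over the rows where column 3 equals "L" and divides by the column total. A vectorized sketch of the same calculation is below; it uses only X1 from the posted data, and the names rows_L, is_L and ratio_L_vec are illustrative. It also avoids calling a variable list, which is easy to confuse with the base function list().

```r
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = factor(c("L", "T", "L", "R"))
)

# Loop version, following the question: accumulate Time_zero row by row
# for the rows whose residue at position 1 is "L".
rows_L <- which(myfile[, 3] == "L")
AA_L <- 0
for (i in seq_along(rows_L)) {
  AA_L <- AA_L + myfile[rows_L[i], 2]
}
ratio_L <- AA_L / sum(myfile[, 2])

# Vectorized equivalent: logical indexing replaces the loop.
is_L <- myfile$X1 == "L"
ratio_L_vec <- sum(myfile$Time_zero[is_L]) / sum(myfile$Time_zero)

# Compare the two results.
all.equal(ratio_L, ratio_L_vec)
```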
>>    2012/7/24 John Kane <[1]jrkrideau at inbox.com>
>>
>>      First thing is to supply the data in a usable format.  As it is, it is
>>      essentially unreadable.  All R beginners do this. :)
>>      Have a look at the dput function  (?dput) for a good way to supply sample
>>      data in an email.
>>      If you have a large dataset probably a few dozen lines of data would be
>>      fine.
>>      Something like dput(head(mydata)) should be fine.  Just copy and paste the
>>      output into your email.
>>      Welcome to R.  I think you will like it.
>>      John Kane
>>      Kingston ON Canada
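
A small sketch of the dput() round trip described above, using a made-up stand-in data frame (mydata here is hypothetical, not the poster's data). The text printed by dput() is what gets pasted into the email; a reader can then evaluate it to rebuild the object.

```r
# A small stand-in data frame (hypothetical; not the poster's data).
mydata <- data.frame(
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05),
  X1 = factor(c("L", "T", "L"))
)

# dput() prints an R expression that rebuilds the object.
dput(head(mydata))

# Simulate the reader's side: capture that text and evaluate it.
txt <- capture.output(dput(head(mydata)))
rebuilt <- eval(parse(text = txt))

# Inspect the rebuilt object; column types, including the factor, survive.
str(rebuilt)
```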
>>
>>    > -----Original Message-----
>>    > From: [2]zj29 at cornell.edu
>>    > Sent: Mon, 23 Jul 2012 18:01:11 -0400
>>    > To: [3]r-help at r-project.org
>>    > Subject: [R] How to do the same thing for all levels of a column?
>>    >
>>    > Dear all,
>>    >
>>    >
>>    >
>>    > I am an R beginner, and I am looking for a way to do the same thing for
>>    > all levels of a column in a table.
>>    >
>>    > Basically, I have a bunch of protein sequences composed of different
>>    > amino acid residues, and each residue is represented by an uppercase
>>    > letter. I want to calculate the ratio of different amino acid residues
>>    > at each position of the proteins. Here is an example table:
>>    >
>>    > Proteins  Time_zero  1  2  3  4  5  6  7  8
>>    > p1        0.0050723  L  E  Y  I  I  P  D  A
>>    > p2        0.0002731  T  E  N  L  V  P  G  A
>>    > p3        9.757E-05  L  M  Y  Q  I  P  E  C
>>    > p4        0.0002077  R  E  Y  L  I  S  E  A
>>    >
>>    >
>>    >
>>    > If I name this table as myfile.txt, I have the following scripts to
>>    > calculate the ratio of each amino acid residue at position 1:
>>    >
>>    > # showing levels of the 3rd column, which means the types of residues
>>    >
>>    > >myfile[,3]
>>    >
>>    >
>>    >
>>    > # calculating the ratio of L
>>    >
>>    > >list=c(which(myfile[,3]=="L"))
>>    >
>>    > >time0total=sum(myfile[,2])
>>    >
>>    > >AA_L=0
>>    >
>>    > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
>>    >
>>    > >ratio_L=AA_L/time0total
>>    >
>>    >
>>    >
>>    > So how can I write a script to do the same thing for the other two levels
>>    > (T and R) in column 3, and also do this for every column that contains
>>    > amino acid residues?
>>    >
>>    >
>>    >
>>    > Many thanks for any help you could give me on this topic! :)
>>    >
>>    >
>>    >
>>    > Regards,
>>    >
>>    > Zhao
>>    > --
>>    > Zhao JIN
>>    > Ph.D. Candidate
>>    > Ruth Ley Lab
>>    > 467 Biotech
>>    > Field of Microbiology, Cornell University
>>    > Lab: 607.255.4954
>>    > Cell: 412.889.3675
>>    >
>>
>>      >       [[alternative HTML version deleted]]
>>      >
>>      > ______________________________________________
>>      > [4]R-help at r-project.org mailing list
>>      > [5]https://stat.ethz.ch/mailman/listinfo/r-help
>>      > PLEASE do read the posting guide
>>      > [6]http://www.R-project.org/posting-guide.html
>>      > and provide commented, minimal, self-contained, reproducible code.
>>
>>
>> References
>>
>>    1. mailto:jrkrideau at inbox.com
>>    2. mailto:zj29 at cornell.edu
>>    3. mailto:r-help at r-project.org
>>    4. mailto:R-help at r-project.org
>>    5. https://stat.ethz.ch/mailman/listinfo/r-help
>>    6. http://www.R-project.org/posting-guide.html
>>    7. http://www.inbox.com/marineaquarium
>>    8. http://www.inbox.com/earth
>>    9. http://www.inbox.com/earth
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
