[R] How to do the same thing for all levels of a column?

John Kane jrkrideau at inbox.com
Wed Jul 25 15:44:58 CEST 2012


   No, it's actually telling ddply to split by the two variables (variable,
   value), if I understand your question correctly.
   The confusion is my fault. I tend to be lazy when running examples and did
   not rename the melt() output to something meaningful. I sometimes forget
   that it's not just me reading the code.
   If you run:
   md1 <- melt(mydata, id = "Time_zero",
               variable.name = "xvars",
               value.name = "aminos")
   ddply(md1, .(xvars, aminos), summarise, sum = sum(Time_zero)/time0total)
   I think it will show what is happening.
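
For readers without plyr and reshape2 at hand, the same split by two variables can be sketched in base R. This is only an illustration, not the code that was run in this thread: it rebuilds the sample data from the dput() output quoted further down, stacks the residue columns into a long table by hand, and uses aggregate() in place of ddply(). The names pos, long and res are made up for this sketch.

```r
# Sample data rebuilt from the dput() output quoted later in the thread.
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"), X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"), X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"), X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"), X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

time0total <- sum(myfile$Time_zero)
pos <- paste0("X", 1:8)   # names of the residue columns

# Long format: one row per (protein, position), similar to the melt() result.
long <- data.frame(
  Time_zero = rep(myfile$Time_zero, times = length(pos)),
  xvars     = rep(pos, each = nrow(myfile)),
  aminos    = unlist(lapply(myfile[pos], as.character), use.names = FALSE)
)

# Split by BOTH columns: one group per residue at each position.
res <- aggregate(Time_zero ~ xvars + aminos, data = long, FUN = sum)
res$sum <- res$Time_zero / time0total
res <- res[order(res$xvars, res$aminos), c("xvars", "aminos", "sum")]
print(res, row.names = FALSE)
```

Each row of res is one (position, residue) pair, which is why both names appear inside .( ): splitting by the residue alone would pool, say, the L at position 1 with the L at position 4.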



   John Kane
   Kingston ON Canada

   -----Original Message-----
   From: zj29 at cornell.edu
   Sent: Tue, 24 Jul 2012 15:26:52 -0400
   To: gunter.berton at gene.com
   Subject: Re: [R] How to do the same thing for all levels of a column?

   Hi John and Bert,
   Thank you so much for your replies. Both of your scripts worked well, so now
   I've learnt two ways to do it. :)
   Bert: I was not very clear on what I wanted to do. I just would like to
   calculate the residues shown in the table, not all residues. The apply
   functions are amazing!
   John: as I am still digesting the code, I am not sure I fully understood
   the argument .(variable, value) in the ddply line. The description of ddply
   says that .variables gives the variables to split the data frame by, as
   quoted variables, a formula or a character vector. So does
   .(variable, value) tell R to split the data frame by value, that is, by the
   types of amino acid residues?
   Thank you all again.
   Cheers,
   Zhao
   2012/7/24 Bert Gunter <[1]gunter.berton at gene.com>

     ... and I neglected to mention that f = myfile[,2]
     Sigh....  More coffee needed.
     -- Bert
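
Combining this definition of f with the corrected sapply() call quoted just below gives a version that can be pasted into a fresh session. It is a sketch under a few assumptions: the data frame is rebuilt from the dput() output quoted further down, lapply() is used instead of sapply() only so that the result is always a list, and props is a name introduced here.

```r
# Sample data rebuilt from the dput() output quoted later in the thread.
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"), X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"), X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"), X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"), X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

f <- myfile[, 2]   # the Time_zero values, used as weights

# For each residue column: sum the weights per residue, then scale to sum to 1.
props <- lapply(myfile[, -c(1, 2)],
                function(x) prop.table(tapply(f, x, sum)))
props$X1
```

Dropping the weights (table(x) instead of tapply(f, x, sum)) gives plain counts per residue, which is the difference between the two sets of output quoted below.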

   On Tue, Jul 24, 2012 at 9:43 AM, Bert Gunter <[2]bgunter at gene.com> wrote:
   > Sorry. Typo in my previous. Should be:
   >
   >> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x,sum)))
   > $X1
   >          L          R          T
   > 0.91491320 0.03675651 0.04833030
   >
   > $X2
   >         E         M
   > 0.9827278 0.0172722
   >
   > $X3
   >         N         Y
   > 0.0483303 0.9516697
   >
   > $X4
   >         I         L         Q
   > 0.8976410 0.0850868 0.0172722
   >
   > $X5
   >         I         V
   > 0.9516697 0.0483303
   >
   > $X6
   >          P          S
   > 0.96324349 0.03675651
   >
   > $X7
   >         D         E         G
   > 0.8976410 0.0540287 0.0483303
   >
   > $X8
   >         A         C
   > 0.9827278 0.0172722
   >
   >
   >
   > On Tue, Jul 24, 2012 at 9:37 AM, Bert Gunter <[3]bgunter at gene.com> wrote:
   >> OK, I admit it: I re-read what you wrote and now I'm confused. Is:
   >>
   >>> sapply(myfile[,-c(1,2)],function(x)prop.table(tapply(f,x)))
   >>
   >>             X1  X2        X3    X4  X5  X6    X7  X8
   >> [1,] 0.1428571 0.2 0.2857143 0.125 0.2 0.2 0.125 0.2
   >> [2,] 0.4285714 0.2 0.1428571 0.250 0.4 0.2 0.375 0.2
   >> [3,] 0.1428571 0.4 0.2857143 0.375 0.2 0.2 0.250 0.4
   >> [4,] 0.2857143 0.2 0.2857143 0.250 0.2 0.4 0.250 0.2
   >>
   >> what you want?
   >>
   >> -- Bert
   >> On Tue, Jul 24, 2012 at 9:17 AM, Bert Gunter <[4]bgunter at gene.com> wrote:
   >>> The OP's request is a bit ambiguous to me: at a given residue, do you
   >>> wish to calculate the proportions for only those amino acids that
   >>> appear at that residue, or do you wish to include the proportions for
   >>> all amino acids, some of which might then be 0?
   >>>
   >>> Assuming the former, then I don't think one needs to go to the lengths
   >>> described by John below.
   >>>
   >>> Using your example (thanks!), the following seems to suffice:
   >>>
   >>>> sapply(myfile[,-c(1,2)],function(x)prop.table(table(x)))
   >>>
   >>> $X1
   >>> x
   >>>    L    R    T
   >>> 0.50 0.25 0.25
   >>>
   >>> $X2
   >>> x
   >>>    E    M
   >>> 0.75 0.25
   >>>
   >>> $X3
   >>> x
   >>>    N    Y
   >>> 0.25 0.75
   >>>
   >>> $X4
   >>> x
   >>>    I    L    Q
   >>> 0.25 0.50 0.25
   >>>
   >>> $X5
   >>> x
   >>>    I    V
   >>> 0.75 0.25
   >>>
   >>> $X6
   >>> x
   >>>    P    S
   >>> 0.75 0.25
   >>>
   >>> $X7
   >>> x
   >>>    D    E    G
   >>> 0.25 0.50 0.25
   >>>
   >>> $X8
   >>> x
   >>>    A    C
   >>> 0.75 0.25
   >>>
   >>>
   >>> This could, of course, then be modified to add zero proportions for
   >>> all non-appearing amino acids.
   >>>
   >>> -- Cheers,
   >>> Bert
   >>>
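
One way to make that modification is sketched below. It assumes that "all amino acids" means every residue observed anywhere in the table rather than the full set of twenty, and it uses unweighted counts as in the table(x) call above. The names residue_cols, all_aa and props0 are introduced for this sketch only.

```r
# Sample data rebuilt from the dput() output quoted later in the thread.
myfile <- data.frame(
  Proteins  = c("p1", "p2", "p3", "p4"),
  Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
  X1 = c("L", "T", "L", "R"), X2 = c("E", "E", "M", "E"),
  X3 = c("Y", "N", "Y", "Y"), X4 = c("I", "L", "Q", "L"),
  X5 = c("I", "V", "I", "I"), X6 = c("P", "P", "P", "S"),
  X7 = c("D", "G", "E", "E"), X8 = c("A", "A", "C", "A"),
  stringsAsFactors = TRUE
)

residue_cols <- myfile[, -c(1, 2)]

# Union of every residue seen at any position.
all_aa <- sort(unique(unlist(lapply(residue_cols, as.character))))

# Give every column the same factor levels so table() also reports zero counts.
props0 <- sapply(residue_cols, function(x) {
  prop.table(table(factor(x, levels = all_aa)))
})
round(props0, 2)
```

Because every column now has the same levels, sapply() can simplify the result to a matrix with one row per residue and one column per position.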
   >>> On Tue, Jul 24, 2012 at 8:18 AM, John Kane <[5]jrkrideau at inbox.com> wrote:
   >>>>
   >>>>    I think this does what you want using two packages, plyr and
   >>>>    reshape2, that you may have to install.  If so,
   >>>>    install.packages(c("plyr", "reshape2")) should do the trick.
   >>>>    library(plyr)
   >>>>    library(reshape2)
   >>>>    # using supplied file 'myfile' from below
   >>>>    time0total = sum(myfile[,2])
   >>>>    mydata  <-  myfile[, 2:10]
   >>>>    md1  <-  melt(mydata, id = "Time_zero")
   >>>>    ddply(md1, .(variable, value), summarise, sum = sum(Time_zero)/time0total)
   >>>>
   >>>>
   >>>>    John Kane
   >>>>    Kingston ON Canada
   >>>>
   >>>>    -----Original Message-----
   >>>>    From: [6]zj29 at cornell.edu
   >>>>    Sent: Tue, 24 Jul 2012 10:25:21 -0400
   >>>>    To: [7]jrkrideau at inbox.com
   >>>>    Subject: Re: [R] How to do the same thing for all levels of a column?
   >>>>
   >>>>    Hi John,
   >>>>    Thank you for the tips. My apologies about the unreadable sample
   >>>>    data... So here is the output of the sample data, and hopefully it
   >>>>    works this time :)
   >>>>    myfile <- structure(list(
   >>>>      Proteins = structure(1:4, .Label = c("p1", "p2", "p3", "p4"),
   >>>>        class = "factor"),
   >>>>      Time_zero = c(0.0050723, 0.0002731, 9.76e-05, 0.0002077),
   >>>>      X1 = structure(c(1L, 3L, 1L, 2L), .Label = c("L", "R", "T"),
   >>>>        class = "factor"),
   >>>>      X2 = structure(c(1L, 1L, 2L, 1L), .Label = c("E", "M"),
   >>>>        class = "factor"),
   >>>>      X3 = structure(c(2L, 1L, 2L, 2L), .Label = c("N", "Y"),
   >>>>        class = "factor"),
   >>>>      X4 = structure(c(1L, 2L, 3L, 2L), .Label = c("I", "L", "Q"),
   >>>>        class = "factor"),
   >>>>      X5 = structure(c(1L, 2L, 1L, 1L), .Label = c("I", "V"),
   >>>>        class = "factor"),
   >>>>      X6 = structure(c(1L, 1L, 1L, 2L), .Label = c("P", "S"),
   >>>>        class = "factor"),
   >>>>      X7 = structure(c(1L, 3L, 2L, 2L), .Label = c("D", "E", "G"),
   >>>>        class = "factor"),
   >>>>      X8 = structure(c(1L, 1L, 2L, 1L), .Label = c("A", "C"),
   >>>>        class = "factor")),
   >>>>      .Names = c("Proteins", "Time_zero", "X1", "X2", "X3", "X4", "X5",
   >>>>        "X6", "X7", "X8"),
   >>>>      row.names = c(NA, 4L), class = "data.frame")
   >>>>    And here is my original question:
   >>>>    Basically, I have a bunch of protein sequences composed of different
   >>>>    amino acid residues, and each residue is represented by an uppercase
   >>>>    letter. I want to calculate the ratio of different amino acid
   >>>>    residues at each position of the proteins.
   >>>>
   >>>>    If I name this table as myfile.txt, I have the following scripts to
   >>>>    calculate the ratio of each amino acid residue at position 1:
   >>>>
   >>>>    # showing levels of the 3rd column, which means the types of residues
   >>>>
   >>>>    >myfile[,3]
   >>>>
   >>>>
   >>>>    # calculating the ratio of L
   >>>>
   >>>>    >list=c(which(myfile[,3]=="L"))
   >>>>
   >>>>    >time0total=sum(myfile[,2])
   >>>>
   >>>>    >AA_L=0
   >>>>
   >>>>    >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
   >>>>
   >>>>    >ratio_L=AA_L/time0total
   >>>>
   >>>>
   >>>>    So how can I write a script to do the same thing for the other two
   >>>>    levels (T and R) in column 3, and also do this for every column that
   >>>>    contains amino acid residues?
   >>>>
   >>>>    Thanks a lot!
   >>>>
   >>>>    Regards,
   >>>>
   >>>>    Zhao
   >>>>    2012/7/24 John Kane <[1][8]jrkrideau at inbox.com>
   >>>>
   >>>>      First thing is to supply the data in a usable format.  As it is,
   >>>>      it is essentially unreadable.  All R beginners do this. :)
   >>>>      Have a look at the dput function (?dput) for a good way to supply
   >>>>      sample data in an email.
   >>>>      If you have a large dataset, a few dozen lines of data would
   >>>>      probably be fine.
   >>>>      Something like dput(head(mydata)) should be fine.  Just copy and
   >>>>      paste the output into your email.
   >>>>      Welcome to R.  I think you will like it.
   >>>>      John Kane
   >>>>      Kingston ON Canada
   >>>>
   >>>>    > -----Original Message-----
   >>>>    > From: [2][9]zj29 at cornell.edu
   >>>>    > Sent: Mon, 23 Jul 2012 18:01:11 -0400
   >>>>    > To: [3][10]r-help at r-project.org
   >>>>    > Subject: [R] How to do the same thing for all levels of a column?
   >>>>    >
   >>>>    > Dear all,
   >>>>    >
   >>>>    >
   >>>>    >
   >>>>    > I am an R beginner, and I am looking for a way to do the same thing
   >>>>    > for all levels of a column in a table.
   >>>>    >
   >>>>    > Basically, I have a bunch of protein sequences composed of different
   >>>>    > amino acid residues, and each residue is represented by an uppercase
   >>>>    > letter. I want to calculate the ratio of different amino acid
   >>>>    > residues at each position of the proteins. Here is an example table:
   >>>>    >
   >>>>    > Proteins  Time_zero  1  2  3  4  5  6  7  8
   >>>>    > p1        0.0050723  L  E  Y  I  I  P  D  A
   >>>>    > p2        0.0002731  T  E  N  L  V  P  G  A
   >>>>    > p3        9.757E-05  L  M  Y  Q  I  P  E  C
   >>>>    > p4        0.0002077  R  E  Y  L  I  S  E  A
   >>>>    >
   >>>>    >
   >>>>    > If I name this table as myfile.txt, I have the following scripts to
   >>>>    > calculate the ratio of each amino acid residue at position 1:
   >>>>    >
   >>>>    > # showing levels of the 3rd column, which means the types of residues
   >>>>    >
   >>>>    > >myfile[,3]
   >>>>    >
   >>>>    > # calculating the ratio of L
   >>>>    >
   >>>>    > >list=c(which(myfile[,3]=="L"))
   >>>>    >
   >>>>    > >time0total=sum(myfile[,2])
   >>>>    >
   >>>>    > >AA_L=0
   >>>>    >
   >>>>    > >for (i in 1:length(list)){AA_L=sum(myfile[list[[i]],2]+AA_L)}
   >>>>    >
   >>>>    > >ratio_L=AA_L/time0total
   >>>>    >
   >>>>    > So how can I write a script to do the same thing for the other two
   >>>>    > levels (T and R) in column 3, and also do this for every column that
   >>>>    > contains amino acid residues?
   >>>>    >
   >>>>    >
   >>>>    >
   >>>>    > Many thanks for any help you could give me on this topic! :)
   >>>>    >
   >>>>    >
   >>>>    >
   >>>>    > Regards,
   >>>>    >
   >>>>    > Zhao
   >>>>    > --
   >>>>    > Zhao JIN
   >>>>    > Ph.D. Candidate
   >>>>    > Ruth Ley Lab
   >>>>    > 467 Biotech
   >>>>    > Field of Microbiology, Cornell University
   >>>>    > Lab: 607.255.4954
   >>>>    > Cell: 412.889.3675
   >>>>    >
   >>>>
   >>>>      >       [[alternative HTML version deleted]]
   >>>>      >
   >>>>      > ______________________________________________
   >>>>      > [4][11]R-help at r-project.org mailing list
   >>>>      > [5][12]https://stat.ethz.ch/mailman/listinfo/r-help
   >>>>      > PLEASE do read the posting guide
   >>>>      > [6][13]http://www.R-project.org/posting-guide.html
   >>>>      > and provide commented, minimal, self-contained, reproducible code.
   >>>>
   >>>> References
   >>>>
   >>>>    1. mailto:[16]jrkrideau at inbox.com
   >>>>    2. mailto:[17]zj29 at cornell.edu
   >>>>    3. mailto:[18]r-help at r-project.org
   >>>>    4. mailto:[19]R-help at r-project.org
   >>>>    5. [20]https://stat.ethz.ch/mailman/listinfo/r-help
   >>>>    6. [21]http://www.R-project.org/posting-guide.html
   >>>>    7. [22]http://www.inbox.com/marineaquarium
   >>>>    8. [23]http://www.inbox.com/earth
   >>>>    9. [24]http://www.inbox.com/earth
   >>>
   >>>
   >>>
   >>> --
   >>>
   >>> Bert Gunter
   >>> Genentech Nonclinical Biostatistics
   >>>
   >>> Internal Contact Info:
   >>> Phone: 467-7374
   >>> Website:
   >>>
   >>> [28]http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm


References

   1. mailto:gunter.berton at gene.com
   2. mailto:bgunter at gene.com
   3. mailto:bgunter at gene.com
   4. mailto:bgunter at gene.com
   5. mailto:jrkrideau at inbox.com
   6. mailto:zj29 at cornell.edu
   7. mailto:jrkrideau at inbox.com
   8. mailto:jrkrideau at inbox.com
   9. mailto:zj29 at cornell.edu
  10. mailto:r-help at r-project.org
  11. mailto:R-help at r-project.org
  12. https://stat.ethz.ch/mailman/listinfo/r-help
  13. http://www.R-project.org/posting-guide.html
  14. http://www.inbox.com/marineaquarium
  15. http://www.inbox.com/earth
  16. mailto:jrkrideau at inbox.com
  17. mailto:zj29 at cornell.edu
  18. mailto:r-help at r-project.org
  19. mailto:R-help at r-project.org
  20. https://stat.ethz.ch/mailman/listinfo/r-help
  21. http://www.R-project.org/posting-guide.html
  22. http://www.inbox.com/marineaquarium
  23. http://www.inbox.com/earth
  24. http://www.inbox.com/earth
  25. mailto:R-help at r-project.org
  26. https://stat.ethz.ch/mailman/listinfo/r-help
  27. http://www.R-project.org/posting-guide.html
  28. http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
  29. http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
  30. http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
  31. http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
  32. http://www.inbox.com/marineaquarium
  33. http://www.inbox.com/marineaquarium

