[R] Newbie clustering/classification question
Sean Davis
sdavis2 at mail.nih.gov
Sun Mar 26 13:48:17 CEST 2006
Mark A. Miller wrote:
> My laboratory is measuring the abundance of various proteins in the
> blood from either healthy individuals or from individuals with various
> diseases. I would like to determine which proteins, if any, have
> significantly different abundances between the healthy and diseased
> individuals. Currently, one of my colleagues is performing an ANOVA on
> each protein with MS Excel. I would like to analyze the data sets with
> a scriptable tool, like R. I could use another tool, but I am trying
> to stick to open source. I have basic procedural programming skills (I
> do a lot of PHP/MySQL), but I'm not very good with anything that
> requires thinking in vectors and matrices.
> One approach I'm imagining is looping through all of the columns and
> doing an ANOVA, like my colleague is doing manually. I have heard
> other people in my field talking about other tests for this kind of
> data. Would a Kruskal-Wallis test, hierarchical data clustering,
> principal component analysis, or random forests be appropriate for the
> question I am asking? If so, how would I write a reusable script for
> the test? The data table will always have the same basic structure,
> but the number of proteins could vary, as could the number of
> conditions or the number of repeats within each condition.
> I especially want to export the results of this test in a format
> roughly like the example below. (I'd like the mean of each protein's
> abundance for each condition, some measure of variability within each
> condition, and a measure of significance for whether the protein
> abundances are different between conditions.) I have gotten to the
> point of doing an ANOVA on a single protein in R and viewing the results
> interactively, but I have no idea how to analyze the differences for
> all of the proteins (in a loop, or all at once) or how to save the
> results to a file.
> Any suggestions?
>
> Example input (tab delimited)
> condition protA protB protC protD protE protF protG protH
> healthy1 11111 22222 33333 70681 61735 66666 77777 88888
> healthy1 12121 21111 32132 57230 69715 67890 87878 98989
> healthy1 10101 20202 30303 67223 51967 65656 78900 111111
> healthy2 12345 23111 32100 65931 67650 60001 80001 101010
> healthy2 13333 21231 34111 58761 54086 60002 80002 122222
> healthy2 13232 20101 30009 68752 70360 60003 80003 91919
> asthma 32132 19889 30733 59959 71783 60237 65603 20374
> asthma 34344 20483 31182 70531 59630 40445 56370 98404
> asthma 39999 20464 29793 58395 66976 50577 39908 65367
> diabetes 10000 20102 29486 51260 68447 42960 50875 216227
> diabetes 10111 19143 31275 52573 55459 71337 53090 151505
> diabetes 10001 21790 31470 54222 57318 64058 44166 207427
> diabetes 15555 20123 30131 59882 71191 46203 44633 197430
> acne 12222 31221 51381 64431 55016 43463 60388 74243
> acne 12221 30535 49199 61419 65096 71551 41811 104317
> acne 10001 30649 49199 56731 69871 61816 44321 125068
>
>
> Desired output
> condition protA protB protC protD protE protF protG protH
> healthy1.mean
> healthy1.sd
> healthy1.pval
> healthy2.mean
> healthy2.sd
> healthy2.pval
> asthma.mean
> asthma.sd
> asthma.pval
> diabetes.mean
> diabetes.sd
> diabetes.pval
> acne.mean
> acne.sd
> acne.pval
>
Hi, Mark. With data like these, you will want to look at the
Bioconductor project (http://www.bioconductor.org). If you transpose
your matrix so that individuals are in columns and proteins are in rows,
your data are in the same form as microarray data, so most of the
Bioconductor tools will apply. There are also tools designed
specifically for mass-spec data. For your question directly, look at
the limma package; it fits a linear model to each protein, which gives
you a protein-by-protein ANOVA. The package comes with an extensive
user guide.
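
As a starting point before moving to limma, here is a minimal base-R sketch of the loop you describe, using the example table from your message. It is only an illustration under some assumptions: the output file name and all object names (dat, proteins, result, out_file) are made up for the example, and the file is read as whitespace-delimited. Note that a one-way ANOVA gives one p-value per protein, not one per condition, so the p-values go in their own rows at the bottom. The Kruskal-Wallis p-value and a Benjamini-Hochberg adjustment are included as optional extras.

```r
# Example input copied from the question. In practice you would use
# something like read.delim("yourfile.txt") instead (file name hypothetical).
input <- "condition protA protB protC protD protE protF protG protH
healthy1 11111 22222 33333 70681 61735 66666 77777 88888
healthy1 12121 21111 32132 57230 69715 67890 87878 98989
healthy1 10101 20202 30303 67223 51967 65656 78900 111111
healthy2 12345 23111 32100 65931 67650 60001 80001 101010
healthy2 13333 21231 34111 58761 54086 60002 80002 122222
healthy2 13232 20101 30009 68752 70360 60003 80003 91919
asthma 32132 19889 30733 59959 71783 60237 65603 20374
asthma 34344 20483 31182 70531 59630 40445 56370 98404
asthma 39999 20464 29793 58395 66976 50577 39908 65367
diabetes 10000 20102 29486 51260 68447 42960 50875 216227
diabetes 10111 19143 31275 52573 55459 71337 53090 151505
diabetes 10001 21790 31470 54222 57318 64058 44166 207427
diabetes 15555 20123 30131 59882 71191 46203 44633 197430
acne 12222 31221 51381 64431 55016 43463 60388 74243
acne 12221 30535 49199 61419 65096 71551 41811 104317
acne 10001 30649 49199 56731 69871 61816 44321 125068"

dat <- read.table(text = input, header = TRUE, stringsAsFactors = FALSE)

# Keep the conditions in the order they appear in the file.
dat$condition <- factor(dat$condition, levels = unique(dat$condition))

# Every column except 'condition' is treated as a protein, so the
# number of proteins, conditions, and repeats can all vary.
proteins <- setdiff(names(dat), "condition")

# Mean and SD of each protein within each condition.
stats <- do.call(rbind, lapply(levels(dat$condition), function(cond) {
  x <- dat[dat$condition == cond, proteins, drop = FALSE]
  out <- rbind(colMeans(x), sapply(x, sd))
  rownames(out) <- paste(cond, c("mean", "sd"), sep = ".")
  out
}))

# One-way ANOVA per protein: one p-value per protein.
anova_p <- sapply(proteins, function(p) {
  fit <- aov(dat[[p]] ~ dat$condition)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Rank-based alternative (Kruskal-Wallis) per protein.
kw_p <- sapply(proteins, function(p) {
  kruskal.test(dat[[p]], dat$condition)$p.value
})

# Testing many proteins at once calls for a multiple-testing adjustment.
anova_padj <- p.adjust(anova_p, method = "BH")

# Assemble the output: statistics in rows, proteins in columns.
result <- rbind(stats,
                anova.pval   = anova_p,
                kruskal.pval = kw_p,
                anova.padj   = anova_padj)

# Save as a tab-delimited file (file name is hypothetical).
out_file <- file.path(tempdir(), "protein_summary.txt")
write.table(data.frame(statistic = rownames(result), result),
            file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# The transposed layout mentioned above: proteins in rows,
# individuals in columns, as used for microarray-style data.
expr <- t(as.matrix(dat[proteins]))
colnames(expr) <- make.unique(as.character(dat$condition))
```

Because the protein columns are picked up by name rather than position, the same script should run unchanged on tables with more proteins or more conditions, provided the first column is still called condition.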
Sean