[R] Newbie clustering/classification question
Mark A. Miller
mamillerpa at yahoo.com
Sun Mar 26 05:27:23 CEST 2006
My laboratory is measuring the abundance of various proteins in
blood from healthy individuals and from individuals with various
diseases. I would like to determine which proteins, if any, have
significantly different abundances between the healthy and diseased
individuals. Currently, one of my colleagues is performing an ANOVA on
each protein with MS Excel. I would like to analyze the data sets with
a scriptable tool, like R. I could use another tool, but I am trying
to stick to open source. I have basic procedural programming skills (I
do a lot of PHP/MySQL), but I'm not very good with anything that
requires thinking in vectors and matrices.
One approach I'm imagining is looping through all of the columns and
doing an ANOVA, like my colleague is doing manually. I have heard
other people in my field talking about other tests for this kind of
data. Would a Kruskal-Wallis test, hierarchical data clustering,
principal component analysis, or random forests be appropriate for the
question I am asking? If so, how would I write a reusable script for
the test? The data table will always have the same basic structure,
but the number of proteins could vary, as could the number of
conditions or the number of repeats within each condition.
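A minimal sketch of the per-column loop described above, offered as one possible starting point rather than a recommended analysis. It assumes the table has been read into a data frame whose first column is named "condition" and whose remaining columns are numeric protein abundances. The function name test_proteins and the file name in the usage comment are hypothetical. It uses only base R (aov, kruskal.test, sapply), so the number of proteins, conditions, and repeats can vary freely.

```r
# Sketch: one-way ANOVA and Kruskal-Wallis p-values for every protein column.
# Assumption: `dat` has a "condition" column; every other column is numeric.
test_proteins <- function(dat) {
  g <- factor(dat$condition)
  proteins <- setdiff(names(dat), "condition")
  res <- sapply(proteins, function(p) {
    y <- dat[[p]]
    c(anova.p   = summary(aov(y ~ g))[[1]][["Pr(>F)"]][1],  # overall F-test
      kruskal.p = kruskal.test(y, g)$p.value)               # rank-based test
  })
  t(res)  # one row per protein, one column per test
}

# Usage (hypothetical file name):
# dat <- read.delim("proteins.txt")
# test_proteins(dat)
```

Each p-value here answers only "does this protein differ somewhere among the conditions"; it does not say which condition differs.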
I especially want to export the results of this test in a format
roughly like the example below. (I'd like the mean of each protein's
abundance for each condition, some measure of variability within each
condition, and a measure of significance for whether the protein
abundances are different between conditions.) I have gotten to the
point of doing an ANOVA on a single protein in R and viewing the results
interactively, but I have no idea how to analyze the differences for
all of the proteins (in a loop, or all at once) or how to save the
results to a file.
Any suggestions?
Example input (tab delimited)
condition protA protB protC protD protE protF protG protH
healthy1 11111 22222 33333 70681 61735 66666 77777 88888
healthy1 12121 21111 32132 57230 69715 67890 87878 98989
healthy1 10101 20202 30303 67223 51967 65656 78900 111111
healthy2 12345 23111 32100 65931 67650 60001 80001 101010
healthy2 13333 21231 34111 58761 54086 60002 80002 122222
healthy2 13232 20101 30009 68752 70360 60003 80003 91919
asthma 32132 19889 30733 59959 71783 60237 65603 20374
asthma 34344 20483 31182 70531 59630 40445 56370 98404
asthma 39999 20464 29793 58395 66976 50577 39908 65367
diabetes 10000 20102 29486 51260 68447 42960 50875 216227
diabetes 10111 19143 31275 52573 55459 71337 53090 151505
diabetes 10001 21790 31470 54222 57318 64058 44166 207427
diabetes 15555 20123 30131 59882 71191 46203 44633 197430
acne 12222 31221 51381 64431 55016 43463 60388 74243
acne 12221 30535 49199 61419 65096 71551 41811 104317
acne 10001 30649 49199 56731 69871 61816 44321 125068
Desired output
condition protA protB protC protD protE protF protG protH
healthy1.mean
healthy1.sd
healthy1.pval
healthy2.mean
healthy2.sd
healthy2.pval
asthma.mean
asthma.sd
asthma.pval
diabetes.mean
diabetes.sd
diabetes.pval
acne.mean
acne.sd
acne.pval
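A rough sketch of how a table close to the desired output might be produced and saved, under stated assumptions. A one-way ANOVA gives one overall p-value per protein, not one per condition, so this sketch writes a mean row and an sd row for each condition and a single "anova.pval" row at the end. That row name, the function name summarise_proteins, and the file names in the usage comment are hypothetical. Only base R functions are used (aggregate, aov, write.table).

```r
# Sketch: per-condition mean and sd for every protein, plus one overall
# ANOVA p-value per protein, written to a tab-delimited file.
# Assumption: `dat` has a "condition" column; every other column is numeric.
summarise_proteins <- function(dat, outfile) {
  # Keep conditions in the order they first appear in the input
  g <- factor(dat$condition, levels = unique(dat$condition))
  proteins <- setdiff(names(dat), "condition")
  vals <- dat[proteins]

  # aggregate() returns one row per condition and one column per protein
  means <- aggregate(vals, by = list(condition = g), FUN = mean)
  sds   <- aggregate(vals, by = list(condition = g), FUN = sd)
  means$condition <- paste0(means$condition, ".mean")
  sds$condition   <- paste0(sds$condition, ".sd")

  # Interleave the rows: cond1.mean, cond1.sd, cond2.mean, cond2.sd, ...
  out <- rbind(means, sds)[order(rep(seq_len(nrow(means)), 2)), ]

  # One overall ANOVA p-value per protein
  pvals <- sapply(proteins, function(p) {
    y <- dat[[p]]
    summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]
  })
  out <- rbind(out, data.frame(condition = "anova.pval", t(pvals)))

  write.table(out, file = outfile, sep = "\t", quote = FALSE,
              row.names = FALSE)
  invisible(out)
}

# Usage (hypothetical file names):
# dat <- read.delim("proteins.txt")
# summarise_proteins(dat, "protein_summary.txt")
```

The resulting file opens directly in a spreadsheet, which may help when comparing against results already computed by hand.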
--- --- --- --- --- --- --- ---
Mark A. Miller