[R] Newbie clustering/classification question

Sun Mar 26 13:48:17 CEST 2006

Mark A. Miller wrote:
> 	My laboratory is measuring the abundance of various proteins in the
> blood from either healthy individuals or from individuals with various
> diseases.  I would like to determine which proteins, if any, have
> significantly different abundances between the healthy and diseased
> individuals.  Currently, one of my colleagues is performing an ANOVA on
> each protein with MS Excel.  I would like to analyze the data sets with
> a scriptable tool, like R.  I could use another tool, but I am trying
> to stick to open source.  I have basic procedural programming skills (I
> do a lot of PHP/MySQL), but I'm not very good with anything that
> requires thinking in vectors and matrices. 
> 	One approach I'm imagining is looping through all of the columns and
> doing an ANOVA, like my colleague is doing manually.  I have heard
> other people in my field talking about other tests for this kind of
> data.  Would a Kruskal-Wallis test, hierarchical data clustering,
> principal component analysis, or random forests be appropriate for the
> question I am asking?  If so, how would I write a reusable script for
> the test?  The data table will always have the same basic structure,
> but the number of proteins could vary, as could the number of
> conditions or the number of repeats within each condition.
> 	I especially want to export the results of this test in a format
> roughly like the example below.  (I'd like the mean of each protein's
> abundance for each condition, some measure of variability within each
> condition, and a measure of significance for whether the protein
> abundances are different between conditions.)  I have gotten to the
> point of doing an ANOVA on a single protein in R and viewing the results
> interactively, but I have no idea how to analyze the differences for
> all of the proteins (in a loop, or all at once) or how to save the
> results to a file.
> Any suggestions?
>
> Example input (tab delimited)
> condition	protA	protB	protC	protD	protE	protF	protG	protH
> healthy1	11111	22222	33333	70681	61735	66666	77777	88888
> healthy1	12121	21111	32132	57230	69715	67890	87878	98989
> healthy1	10101	20202	30303	67223	51967	65656	78900	111111
> healthy2	12345	23111	32100	65931	67650	60001	80001	101010
> healthy2	13333	21231	34111	58761	54086	60002	80002	122222
> healthy2	13232	20101	30009	68752	70360	60003	80003	91919
> asthma	32132	19889	30733	59959	71783	60237	65603	20374
> asthma	34344	20483	31182	70531	59630	40445	56370	98404
> asthma	39999	20464	29793	58395	66976	50577	39908	65367
> diabetes	10000	20102	29486	51260	68447	42960	50875	216227
> diabetes	10111	19143	31275	52573	55459	71337	53090	151505
> diabetes	10001	21790	31470	54222	57318	64058	44166	207427
> diabetes	15555	20123	30131	59882	71191	46203	44633	197430
> acne	12222	31221	51381	64431	55016	43463	60388	74243
> acne	12221	30535	49199	61419	65096	71551	41811	104317
> acne	10001	30649	49199	56731	69871	61816	44321	125068
>
>
> Desired output
> condition	protA	protB	protC	protD	protE	protF	protG	protH
> healthy1.mean								
> healthy1.sd								
> healthy1.pval								
> healthy2.mean								
> healthy2.sd								
> healthy2.pval								
> asthma.mean								
> asthma.sd								
> asthma.pval								
> diabetes.mean								
> diabetes.sd								
> diabetes.pval								
> acne.mean								
> acne.sd								
> acne.pval								
>   
Hi, Mark.  With data like these, you will want to look at the
Bioconductor (http://www.bioconductor.org) project.  If you transpose
your matrix so that individuals are in columns and proteins are in rows,
then your data have the same layout as microarray expression data, so
most of the tools in Bioconductor will apply.  In addition, there are
tools specifically designed for mass-spec data.  For your question
directly, look at the limma package; it will do a protein-by-protein
ANOVA for you.  There is an extensive user guide available.

Sean