[BioC] Classification in Mass Spectrometry Data

Wed Nov 7 07:47:20 CET 2012

Dear List,

I am new to the analysis of mass spectrometry data. In particular, I am using SELDI-TOF data.

I have used the package PROcess to analyze the ovarian cancer data from Petricoin et al. (2004),
available at http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp.

I was hoping to use MLInterfaces to classify the samples (cancer vs. control). However, I want
to use pre-determined peaks as the features for classification. Here's what I have so far.

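## I used only 20 samples from the data (10 cancer and 10 control) to cut down on computation time
## testNorm is the baseline-subtracted, normalized data matrix
library(PROcess)
peakfile = file.path(getwd(), "testpeakinfo.csv")
## bmkfile was not defined in the snippet as posted; the file name here is only a placeholder
bmkfile = file.path(getwd(), "testbmk.csv")
getPeaks(testNorm, peakfile)                     # detect peaks and write them to peakfile
testBio = pk2bmkr(peakfile, testNorm, bmkfile)   # align the peaks into proto-biomarkers
bks = getMzs(testBio)
## Gives a 20 by 7 matrix of biomarkers that should discriminate between cancer and control samples
## at least I know that these peaks are aligned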

I created an expression set from the testNorm matrix. It has phenoData giving the treatment type and
featureData giving the m/z value for each peak. The following runs successfully. I had to filter out
some features because of memory issues. The filter is pretty naive, and I have no justification for it other
than that I've used similar functions on microarray data:

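## testES is the expression set created from testNorm. It has > 11K features and 20 samples
mads = apply(exprs(testES), 1, mad)    # median absolute deviation of each feature
## keep the 300 features with the largest MAD
testFilt = testES[mads > sort(mads, decreasing=TRUE)[301], ]
## 5-fold balanced cross-validation of DLDA, keeping the top 30 features by absolute t statistic
dldMS = MLearn(treat ~ ., testFilt, dldaI, xvalSpec("LOG", 5, balKfold.xvspec(5), fs.absT(30)))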
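
The MAD filter can be checked in isolation on simulated data. This is only a minimal base-R sketch: the matrix X and its dimensions are made up for illustration and stand in for exprs(testES). Because the MAD ignores the class labels, applying this filter before cross-validation should not leak label information into the error estimate.

```r
## Simulated stand-in for exprs(testES): 1000 features (rows) by 20 samples (columns)
set.seed(42)
X <- matrix(rnorm(1000 * 20), nrow = 1000)

## Median absolute deviation of every feature
mads <- apply(X, 1, mad)

## The 301st largest MAD is the cutoff; features strictly above it are kept
cutoff <- sort(mads, decreasing = TRUE)[301]
keep <- mads > cutoff
Xfilt <- X[keep, ]

## Equivalent selection by rank, which stays at exactly 300 rows even with tied MADs
top300 <- order(mads, decreasing = TRUE)[1:300]

dim(Xfilt)
```

With continuous data the two selections agree; with heavily tied MAD values the threshold version can keep fewer than 300 features.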

What I really want to do is use the proto-biomarkers (bks above) as the features for the classifier so that I can
determine whether the suggested biomarkers do a good job of differentiating between the two groups. I would also
like to conduct a differential expression test on the normalized data and compare those results with the results
from classification via proto-biomarkers. Finally, I would like to take the peaks given in the original paper and
use those to classify the samples, again to verify (or not) what the original authors found. Eventually, I would
like to do it all on the whole data set, which has approximately 250 samples, roughly 90 of which are control.
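
One possible way to set this up is sketched below on simulated data; it is not a tested solution for the real spectra. The assumptions are: the expression set carries an m/z column in featureData (called mz here), the pre-determined peaks are supplied as a numeric vector of m/z values (targets is hypothetical and stands in for the proto-biomarker m/z values or for the peaks listed in the paper), and each target is matched to the nearest m/z feature. The classifier is then cross-validated on those features alone, without fs.absT, and per-feature Welch t-tests with Benjamini-Hochberg adjustment serve as a simple differential expression test for comparison.

```r
library(Biobase)
library(MLInterfaces)

set.seed(1)
## Simulated stand-in for the normalized spectra: 200 m/z points by 20 samples
mz <- seq(2000, 12000, length.out = 200)
X <- matrix(rnorm(200 * 20), nrow = 200,
            dimnames = list(paste0("mz", seq_along(mz)), paste0("s", 1:20)))
treat <- factor(rep(c("cancer", "control"), each = 10))
## Plant a group difference at three features so there is something to find
X[c(20, 80, 150), treat == "cancer"] <- X[c(20, 80, 150), treat == "cancer"] + 3

es <- ExpressionSet(
  assayData   = X,
  phenoData   = AnnotatedDataFrame(data.frame(treat = treat, row.names = colnames(X))),
  featureData = AnnotatedDataFrame(data.frame(mz = mz, row.names = rownames(X))))

## Hypothetical pre-determined peak locations, slightly off the m/z grid
targets <- mz[c(20, 80, 150)] + c(0.4, -0.7, 1.1)
## Match each target to the nearest m/z feature and keep only those rows
keep <- sapply(targets, function(m) which.min(abs(fData(es)$mz - m)))
esPk <- es[keep, ]

## 5-fold balanced cross-validation of DLDA on the fixed peaks only
fitPk <- MLearn(treat ~ ., esPk, dldaI, xvalSpec("LOG", 5, balKfold.xvspec(5)))
confuMat(fitPk)

## Differential expression: Welch t-test per feature, Benjamini-Hochberg adjusted
grp <- pData(es)$treat
pvals <- apply(exprs(es), 1, function(y) t.test(y ~ grp)$p.value)
padj <- p.adjust(pvals, method = "BH")
topDE <- order(padj)[seq_along(keep)]

## Features that appear both among the pre-determined peaks and the top DE features
intersect(keep, topDE)
```

For the real data, es would be replaced by the expression set built from the normalized matrix and targets by the actual m/z values. Nearest-point matching is a simplification; a tolerance window around each published m/z value may be more appropriate.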

I was hoping to assign this for homework to my graduate students in a bioinformatics class, 
but I can't do that if I can't work the problem myself :).

Thanks!
Monnie

Here's my session info:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] PROcess_1.32.0       Icens_1.28.0         survival_2.36-14     MLInterfaces_1.36.1  sfsmisc_1.0-20      
 [6] cluster_1.14.2       annotate_1.34.1      AnnotationDbi_1.18.4 rda_1.0.2-2          rpart_3.1-54        
[11] genefilter_1.38.0    MASS_7.3-21          ALL_1.4.12           Biobase_2.16.0       BiocGenerics_0.2.0  

loaded via a namespace (and not attached):
 [1] DBI_0.2-5       gdata_2.12.0    grid_2.15.1     gtools_2.7.0    IRanges_1.14.4  lattice_0.20-10 Matrix_1.0-9   
 [8] mboost_2.1-3    RSQLite_0.11.2  stats4_2.15.1   XML_3.95-0      xtable_1.7-0   

Monnie McGee, PhD
Associate Professor
Statistical Science
Southern Methodist University
Office: 214-768-2462
Fax: 214-768-4035
Website: http://faculty.smu.edu/mmcgee