[BioC] Classification in Mass Spectrometry Data
McGee, Monnie
mmcgee at mail.smu.edu
Wed Nov 7 07:47:20 CET 2012
Dear List,
I am new to the analysis of Mass Spectrometry data. In particular, I am using SELDI-TOF data.
I have used the package PROcess to analyze the ovarian cancer data from Petricoin et al. (2004),
available at http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp.
I was hoping to use MLInterfaces to classify the samples (cancer vs. control). However, I want
to use pre-determined peaks as the features for the classification. Here's what I have so far.
## I used only 20 samples from the data (10 cancer and 10 control) to cut down on computation time
## testNorm is the baseline subtracted, normalized data matrix
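peakfile = file.path(getwd(), "testpeakinfo.csv")
## output file for the biomarker matrix (placeholder name)
bmkfile = file.path(getwd(), "testbmk.csv")
getPeaks(testNorm, peakfile)
testBio = pk2bmkr(peakfile, testNorm, bmkfile)
bks = getMzs(testBio)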
## Gives a 20 by 7 matrix of biomarkers that should discriminate between cancer and control samples
## at least I know that these peaks are aligned
I created an expression set from the testNorm matrix. It has phenoData giving the treatment type, and
featureData giving the m/z value for each feature. The following runs successfully. I had to filter out
some features because of memory issues. The filter is pretty naive, and I have no justification for it other
than that I have used similar functions on microarray data:
## testES is the expression set created from testNorm. It has > 11K features and 20 samples
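mads = apply(exprs(testES), 1, mad)
## keep the 300 features with the largest median absolute deviation
testFilt = testES[mads > sort(mads, decreasing=TRUE)[301], ]
## DLDA with 5-fold balanced cross-validation, selecting the top 30 features by |t| in each fold
dldMS = MLearn(treat ~ ., testFilt, dldaI, xvalSpec("LOG", 5, balKfold.xvspec(5), fs.absT(30)))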
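To make the filtering step concrete, here is a minimal sketch of the same MAD filter on simulated data. It uses base R only; the matrix `sim` and its dimensions are made up for illustration and stand in for exprs(testES).

```r
## Simulated stand-in for exprs(testES): 1000 features (rows) by 20 samples (columns).
set.seed(1)
sim <- matrix(rnorm(1000 * 20), nrow = 1000,
              dimnames = list(paste0("mz", 1:1000), paste0("s", 1:20)))

## Median absolute deviation of each feature across the samples.
mads <- apply(sim, 1, mad)

## Keep the features whose MAD exceeds the 301st largest value,
## i.e. the 300 most variable features when there are no ties.
cutoff <- sort(mads, decreasing = TRUE)[301]
simFilt <- sim[mads > cutoff, ]
dim(simFilt)
```

The filter ignores the class labels, so it can be applied once before cross-validation without leaking label information into the test folds.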
What I really want to do is use the proto-biomarkers (bks above) as the features for the classifier, so that I can
determine whether the suggested biomarkers do a good job of differentiating between the two groups. I would also
like to conduct a differential expression test on the normalized data and compare those results with the results
from classification via proto-biomarkers. Finally, I would like to take the peaks given in the original paper and
use those to classify the samples, again to verify (or not) what the original authors found. Eventually, I would
like to do it all on the whole data set, which has approximately 250 samples, roughly 90 of which are control.
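For concreteness, the kind of analysis described above could be sketched as follows. This is a base-R illustration on simulated data, not PROcess or MLInterfaces code. The m/z axis, the target peaks, the tolerance `tol`, and the helper `nearest_feature` are all hypothetical, and a simple nearest-centroid rule with leave-one-out cross-validation stands in for whatever classifier is eventually used.

```r
set.seed(42)
n_feat <- 500
group <- factor(rep(c("cancer", "control"), each = 10))
mz <- seq(2000, 12000, length.out = n_feat)      # simulated m/z axis
x <- matrix(rnorm(n_feat * 20), nrow = n_feat)   # features x samples

## Shift three features upward in the cancer group so there is something to find.
true_idx <- c(50, 200, 350)
x[true_idx, group == "cancer"] <- x[true_idx, group == "cancer"] + 3

## Hypothetical pre-determined peaks (m/z values), e.g. from a paper or from bks.
target_mz <- mz[true_idx] + c(0.5, -1, 2)
tol <- 0.003                                     # relative m/z tolerance (assumed)

## Map a target peak to the nearest measured m/z, or NA if it is outside the tolerance.
nearest_feature <- function(target, mz, tol) {
  i <- which.min(abs(mz - target))
  if (abs(mz[i] - target) / target <= tol) i else NA_integer_
}
sel <- vapply(target_mz, nearest_feature, integer(1), mz = mz, tol = tol)
sel <- sel[!is.na(sel)]

## Leave-one-out cross-validation with a nearest-centroid rule on the selected peaks.
xs <- t(x[sel, , drop = FALSE])                  # samples x selected features
pred <- character(nrow(xs))
for (i in seq_len(nrow(xs))) {
  train <- xs[-i, , drop = FALSE]
  g <- group[-i]
  centroids <- rbind(cancer  = colMeans(train[g == "cancer", , drop = FALSE]),
                     control = colMeans(train[g == "control", , drop = FALSE]))
  held_out <- matrix(xs[i, ], nrow = 2, ncol = ncol(xs), byrow = TRUE)
  d <- rowSums((centroids - held_out)^2)         # squared distance to each centroid
  pred[i] <- names(which.min(d))
}
confusion <- table(given = group, predicted = factor(pred, levels = levels(group)))
accuracy <- mean(pred == group)

## Per-feature Welch t-tests on all features, with Benjamini-Hochberg adjustment.
pvals <- apply(x, 1, function(v) t.test(v ~ group)$p.value)
padj <- p.adjust(pvals, method = "BH")
top <- order(pvals)[1:10]

## How many of the pre-determined peaks are among the ten smallest p-values?
sum(sel %in% top)
```

Because the peaks here are fixed in advance rather than chosen from the labels, no feature selection has to be repeated inside the cross-validation loop. Comparing `sel` with `top` is one simple way to see whether the pre-determined peaks agree with the differential expression ranking.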
I was hoping to assign this for homework to my graduate students in a bioinformatics class,
but I can't do that if I can't work the problem myself :).
Thanks!
Monnie
Here's my session info:
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)
locale:
[1] en_US.UTF-8
attached base packages:
[1] splines tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] PROcess_1.32.0 Icens_1.28.0 survival_2.36-14 MLInterfaces_1.36.1 sfsmisc_1.0-20
[6] cluster_1.14.2 annotate_1.34.1 AnnotationDbi_1.18.4 rda_1.0.2-2 rpart_3.1-54
[11] genefilter_1.38.0 MASS_7.3-21 ALL_1.4.12 Biobase_2.16.0 BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] DBI_0.2-5 gdata_2.12.0 grid_2.15.1 gtools_2.7.0 IRanges_1.14.4 lattice_0.20-10 Matrix_1.0-9
[8] mboost_2.1-3 RSQLite_0.11.2 stats4_2.15.1 XML_3.95-0 xtable_1.7-0
Monnie McGee, PhD
Associate Professor
Statistical Science
Southern Methodist University
Office: 214-768-2462
Fax: 214-768-4035
Website: http://faculty.smu.edu/mmcgee