[BioC] removal of outliers in matrix
Saroj Mohapatra
smohapat at vbi.vt.edu
Wed Nov 14 16:56:40 CET 2007
Hello Johannes:
If I understand correctly, you have a data matrix with variables
(metabolites) as rows and sample replicates as columns. For example, for
two metabolites:
> my.data
        Con.1 Con.2 Con.3 Con.4 Con.5 Trt.1 Trt.2 Trt.3 Trt.4 Trt.5
Metab.1     0     0     0     0     0  5.58  4.15  4.08  0.00  4.79
Metab.2     0     0     0     0     0  5.58  0.00  4.08  4.08  4.79
The outliers are Trt.4 for Metab.1 and Trt.2 for Metab.2.
I could use a simple rule (flag any treated value that is more than 1 S.D.
below or above the mean of the treated replicates) to detect the outliers.
> apply(my.data, 1, function(y) {x=y[6:10]; which(x<(mean(x)-sd(x)) |
+ x>(mean(x)+sd(x))) } )
Metab.1 Metab.2
      4       2
This gives, for each metabolite, the position of the outlying sample among
the five treated replicates.
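To try this without the original object, here is a minimal sketch. The construction of my.data is a reconstruction from the printout above, and the helper name find.outliers and its k argument are made up for illustration; k = 1 corresponds to the 1 S.D. rule.

```r
# Rebuild the example matrix from the printout (a reconstruction, not the original object)
my.data <- rbind(
  Metab.1 = c(0, 0, 0, 0, 0, 5.58, 4.15, 4.08, 0.00, 4.79),
  Metab.2 = c(0, 0, 0, 0, 0, 5.58, 0.00, 4.08, 4.08, 4.79)
)
colnames(my.data) <- c(paste("Con", 1:5, sep = "."), paste("Trt", 1:5, sep = "."))

# Hypothetical helper: for each row, return the positions (within the chosen
# columns) of values more than k standard deviations from the mean of those columns.
# If rows have different numbers of flagged values, apply() returns a list instead
# of a vector.
find.outliers <- function(m, cols, k = 1) {
  apply(m[, cols, drop = FALSE], 1, function(x) {
    unname(which(abs(x - mean(x)) > k * sd(x)))
  })
}

out <- find.outliers(my.data, 6:10)
print(out)
```

For this example it should report position 4 for Metab.1 and position 2 for Metab.2, matching the output above.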
If you want a new matrix with the outliers removed:
> new.data = t(apply(my.data, 1, function(y) {
+   x = y[6:10]   # treated replicates
+   sel = (x > (mean(x) - sd(x))) & (x < (mean(x) + sd(x)))
+   c(y[1:5], x[sel]) }))   # controls plus the retained treated values
> new.data
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
Metab.1    0    0    0    0    0 5.58 4.15 4.08 4.79
Metab.2    0    0    0    0    0 5.58 4.08 4.08 4.79
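I have assumed that (1) there is exactly one outlier per metabolite, and
(2) the remaining replicates are tightly clustered. If rows lose different
numbers of values, apply() returns a list rather than a matrix, so the
t() step above no longer gives a rectangular result.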
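One variation on the code above (not the method used there) is to set the flagged values to NA rather than dropping them, so the matrix keeps its shape even when rows have different numbers of outliers. A rough sketch under the same 1 S.D. rule; mask.outliers and k are hypothetical names, and my.data is rebuilt from the printout:

```r
# Rebuild the example matrix from the printout (a reconstruction, not the original object)
my.data <- rbind(
  Metab.1 = c(0, 0, 0, 0, 0, 5.58, 4.15, 4.08, 0.00, 4.79),
  Metab.2 = c(0, 0, 0, 0, 0, 5.58, 0.00, 4.08, 4.08, 4.79)
)
colnames(my.data) <- c(paste("Con", 1:5, sep = "."), paste("Trt", 1:5, sep = "."))

# Hypothetical helper: within the chosen columns, replace values more than
# k standard deviations from the row mean with NA
mask.outliers <- function(m, cols, k = 1) {
  sub <- m[, cols, drop = FALSE]
  mu  <- apply(sub, 1, mean)   # per-row mean of the chosen columns
  s   <- apply(sub, 1, sd)     # per-row standard deviation
  sub[abs(sub - mu) > k * s] <- NA   # mu and s recycle down the rows
  m[, cols] <- sub
  m
}

masked <- mask.outliers(my.data, 6:10)
print(masked)

# Row means of the treated columns, ignoring the masked values
print(rowMeans(masked[, 6:10], na.rm = TRUE))
```

Downstream functions then need na.rm = TRUE or an equivalent. Note also that a 1 S.D. cutoff is aggressive and can flag good values in rows that have no real outlier.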
HTH
Saroj
Johannes Hanson wrote:
>Dear all,
>
>After some work with analysis of microarray data I am now facing my first
>metabolomics dataset.
>The first problem I encountered is that the structure of the data is
>different from what I am used to. Due to the alignment of the chromatogram I
>do have extreme outliers within the dataset. The alignment is good (and I
>don't want to manually adjust 8000 peaks). If I could easily remove the
>outliers the rest of the analysis would be easier.
>The outliers I want to remove are most often a total lack of signal, as the
>peak is missing. I have five replicates of each treatment, and I am looking
>for something that could remove only the extreme outliers (sample number
>nine in the example below).
>
>A typical outlier:
>Untreated
>0.00040016 0.001029071 0.00101226 0.000739958 0.000288475
>Treated
>5.58151787 4.146639291 4.080655391 0.00120032 4.786810001
>
>The data is structured as a matrix with one line per peak and the replicates
>as individual columns (much like microarray data).
>
>Thanks for any suggestions on how to continue
>
>Johannes
>