[BioC] removal of outliers in matrix

Wed Nov 14 16:56:40 CET 2007

Hello Johannes:

If I understand correctly, you have a matrix of data that have variables 
(metabolites) as rows and sample-replicates as columns. For example, for 
two metabolites:

 > my.data
        Con.1 Con.2 Con.3 Con.4 Con.5 Trt.1 Trt.2 Trt.3 Trt.4 Trt.5
Metab.1     0     0     0     0     0  5.58  4.15  4.08  0.00  4.79
Metab.2     0     0     0     0     0  5.58  0.00  4.08  4.08  4.79

The outliers are Trt.4 for Metab.1 and Trt.2 for Metab.2.

You could use a simple rule to detect the outliers, such as flagging any 
value more than 1 S.D. below or above the mean of the treated replicates 
(columns 6 to 10).

 > apply(my.data, 1, function(y) {x=y[6:10]; which(x<(mean(x)-sd(x)) |
 +   x>(mean(x)+sd(x))) } )
Metab.1 Metab.2
      4       2

This gives, for each metabolite, the position of the outlying sample 
among the treated replicates.
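
The same rule can be wrapped in a small helper so that the threshold and 
the columns are not hard-coded. This is only a sketch: the name 
flag_outliers and its argument k are made up for illustration, and the 
treated columns are assumed to be named with a "Trt" prefix as in the 
example above.

```r
# Rebuild the two-metabolite example from this message.
my.data <- matrix(
  c(0, 0, 0, 0, 0, 5.58, 4.15, 4.08, 0.00, 4.79,
    0, 0, 0, 0, 0, 5.58, 0.00, 4.08, 4.08, 4.79),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Metab.1", "Metab.2"),
                  c(paste0("Con.", 1:5), paste0("Trt.", 1:5))))

# Hypothetical helper (not from any package): positions of the values in x
# that lie more than k standard deviations away from the mean of x.
flag_outliers <- function(x, k = 1) which(abs(x - mean(x)) > k * sd(x))

# Pick the treated columns by name rather than by position 6:10
# (assumes the "Trt" prefix used in the example).
trt <- grep("^Trt", colnames(my.data))

# One outlier position per metabolite, relative to the treated columns.
outlier.idx <- apply(my.data[, trt], 1, flag_outliers)
outlier.idx
```

With k = 1 this flags position 4 for Metab.1 and position 2 for Metab.2, 
as in the output above.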

If you want a new matrix with the outliers removed:
 > new.data=t(apply(my.data, 1, function(y) {x=y[6:10];
 +   sel=(x>(mean(x)-sd(x)))&(x<(mean(x)+sd(x))); c(y[1:5],x[sel])}))
 > new.data
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
Metab.1    0    0    0    0    0 5.58 4.15 4.08 4.79
Metab.2    0    0    0    0    0 5.58 4.08 4.08 4.79
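
Dropping values loses the column names and only works if every row loses 
the same number of values. An alternative, sketched here under the same 
1 S.D. rule, is to set the flagged values to NA so the matrix keeps its 
shape. The object names (masked, trt) are my own, and the "Trt" column 
prefix is assumed from the example.

```r
# Rebuild the two-metabolite example from this message.
my.data <- matrix(
  c(0, 0, 0, 0, 0, 5.58, 4.15, 4.08, 0.00, 4.79,
    0, 0, 0, 0, 0, 5.58, 0.00, 4.08, 4.08, 4.79),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Metab.1", "Metab.2"),
                  c(paste0("Con.", 1:5), paste0("Trt.", 1:5))))

# Treated columns, selected by name (assumes the "Trt" prefix).
trt <- grep("^Trt", colnames(my.data))

# Replace values more than 1 S.D. from the row mean with NA.
# Every row keeps length 5, so apply() returns a matrix.
masked <- my.data
masked[, trt] <- t(apply(my.data[, trt], 1, function(x) {
  x[abs(x - mean(x)) > sd(x)] <- NA
  x
}))
masked

# Later summaries can skip the masked values.
rowMeans(masked[, trt], na.rm = TRUE)
```

Most summary functions accept na.rm = TRUE, so the masked matrix can be 
carried through the rest of the analysis without reshaping.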

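I have assumed that (1) there is exactly one outlier per metabolite, so 
that every row keeps the same number of values (otherwise apply() returns 
a list rather than a matrix), and (2) the replicates are tightly 
clustered, apart from the outlier.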
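
To see why these assumptions matter, here is a small check on made-up 
values (not from Johannes' data): five replicates with no missing peak. 
The 1 S.D. rule still flags the largest value, so it is worth applying 
only to rows where an outlier is actually expected, or with a wider 
threshold.

```r
# Made-up treated replicates with no missing peak (illustration only).
x <- c(5.58, 4.15, 4.08, 4.20, 4.79)

# Same 1 S.D. rule as above.
flagged <- which(abs(x - mean(x)) > sd(x))
flagged      # position 1 (the value 5.58) is flagged
x[flagged]
```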

HTH

Saroj

Johannes Hanson wrote:

>Dear all,
>
>After some work with analysis of microarray data I am now facing my first
>metabolomics dataset. 
>The first problem I encountered is that the structure of the data is
>different from what I am used to. Due to the alignment of the chromatogram I
>do have extreme outliers within the dataset. The alignment is good (and I
>don't want to manually adjust 8000 peaks). If I could easily remove the
>outliers the rest of the analysis would be easier. 
>The outliers I want to remove are most often a total lack of signal, as the
>peak is missing. I do have five replicates of each treatment. I am looking
>for something that could remove only the extreme outliers (sample number
>nine in the example below). 
>
>A typical outlier:
>Untreated
>0.00040016	0.001029071	0.00101226	0.000739958	0.000288475 
>Treated
>5.58151787	4.146639291	4.080655391	0.00120032	4.786810001
>
>The data are structured as a matrix with one line per peak and the replicates
>as individual columns (much like microarray data). 
>
>Thanks for any suggestions on how to continue
>
>Johannes