[BioC] outlier removal from gene chip
Sean Davis
sdavis2 at mail.nih.gov
Tue Sep 19 19:33:24 CEST 2006
On 9/19/06 1:02 PM, "Weiwei Shi" <helprhelp at gmail.com> wrote:
> dear listers:
>
> I have a question on whether bioconductor has some tool-kit to detect
> outliers and remove them.
>
> my original dataset looks like this:
>              V1       V51       V53        V55       V57
>  1  -493249600  1.459459 -3.069444  -1.300000  1.935484
>  2 -1613096495 -1.139269 -5.525281 -16.592593 -1.831978
>  3  1626196571 -3.500000 -1.011662   2.223881  3.921053
>  4 -1397009217 -3.571429  1.685714  -1.180297 -6.807692
>  5  1428659728 -1.405405 -1.469004  -4.779754 -1.033708
>  6   459853658 -2.158879 -7.510823  -1.085581 -9.382979
>  7   530182506 -1.431677 -1.336343  -3.126437  4.878788
>  8  1173842263  1.215385  1.856410  -2.059794 -6.020833
>  9       28847  2.407895 -2.048889  -1.730337 -1.178947
> 10 -1961875610  2.864159 -2.301234  -4.733264 -1.172058
>
> V1: internal probe id
> the rests are different samples. the cells are fold-change of disease/normal.
>
> summary of the sample columns( V51, ... V57) gives the following:
>       V51                 V53                V55                 V57
> Min.   :-482.000   Min.   : -55.7342   Min.   :-122.074   Min.   :-14086.750
> 1st Qu.:  -2.159   1st Qu.:  -1.7312   1st Qu.:  -2.125   1st Qu.:    -1.831
> Median :  -1.199   Median :  -1.0416   Median :  -1.200   Median :    -1.080
> Mean   :  -0.918   Mean   :   0.1662   Mean   :  -1.027   Mean   :    -1.874
> 3rd Qu.:   1.441   3rd Qu.:   1.5721   3rd Qu.:   1.419   3rd Qu.:     1.521
> Max.   : 198.434   Max.   :1478.1639   Max.   :  95.768   Max.   :   683.519
>
>
> My question is, is there any package which can detect those outliers
> (like -14086.750)and remove them and get an "average" for each gene
> (instead of each probe)?
Hi, Weiwei.
The better option, probably, is to remove questionable data points BEFORE
making a ratio, using good quality control, plots, etc. Extreme ratios may
be biologically very important, so simply removing them is probably not the
best option. I would suggest looking at the two data values that went into
each ratio that you think is in question and seeing if there is an
explanation (for example, one of the two probes failed). Simply removing
ratios because they look like outliers potentially removes your most
interesting data.
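
As a rough sketch of that kind of check (in Python, purely for illustration):
the code below assumes the ratios are signed fold changes computed from two
positive intensities per probe, which is only a guess from the values shown.
The probe ids, intensities, the `floor` cutoff, and the function names are all
made up for the example. It flags ratios whose underlying values look
unreliable rather than dropping ratios because they are extreme.

```python
# Illustrative only: intensities, probe ids, and the cutoff are invented,
# not taken from the data in this thread.

def signed_fold_change(disease, normal):
    """Assumed convention: d/n when d >= n, otherwise -(n/d)."""
    if disease >= normal:
        return disease / normal
    return -(normal / disease)


def flag_ratios(probes, floor=50.0):
    """Return one record per probe with its ratio and a low-signal flag.

    A probe is flagged when either raw intensity is below `floor`
    (a hypothetical background-level cutoff), since a ratio built on a
    near-zero value is unstable whatever the biology.
    """
    report = []
    for probe_id, disease, normal in probes:
        usable = disease > 0 and normal > 0
        report.append({
            "probe": probe_id,
            "disease": disease,
            "normal": normal,
            # No ratio is computed for non-positive intensities.
            "ratio": signed_fold_change(disease, normal) if usable else None,
            "low_signal": min(disease, normal) < floor,
        })
    return report


probes = [
    ("p1", 1200.0, 800.0),  # both values well above the floor
    ("p2", 0.5, 7000.0),    # disease value near zero, huge negative ratio
    ("p3", 9000.0, 300.0),  # large ratio, but both intensities look solid
]

for row in flag_ratios(probes):
    print(row)
```

Here p2 would be flagged for a closer look because one of its two values is
close to zero, while p3 is left alone even though its ratio is large.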
Sean