Dear Mark:
I used the function "findAllOutliers()" in the beadarray package to
detect the outliers in the sample bead level data (BLData) enclosed with
the beadarray package. However I found this function did not do what it
is supposed to do. The documentation for this function
"findAllOutliers()" says the outliers to be removed are those beads
which are 3 MAD from the *mean* values. But it seems that this function
removes the outliers which are 3 MAD from the *median* values.
Below is my R scripts:
-----------------------
# first get the list of outliers removed by the function "findAllOutliers"
> library(beadarray)
> data(BLData)
> outlier_indices <- findAllOutliers(BLData, 1)
> outlier_indices[1:10]
[1] 19 120 133 305 499 816 894 897 932 1339
# now let's examine how these outliers were detected
> x <- BLData[[1]]
> x$ProbeID[19]
[1] 2 #outlier with index 19 has the ProbeID 2
# now let's get all the beads with ProbeID 2
> bead_type <- x$G[x$ProbeID == 2]
> bead_type
[1] 1746.807 1871.963 1548.773 1492.135 1859.391 1723.123 1748.783
1657.136 1620.981 2038.145 1977.975 1694.616 1574.794 1973.933 1854.132
2152.593
[17] 1771.693 1673.885 2344.341 1681.198 1591.964
# detect the outliers which are 3 MAD from the *mean* value, we got 0 hits
> sum(bead_type > mean(bead_type) + 3*mad(bead_type))
[1] 0
> sum(bead_type < mean(bead_type) - 3*mad(bead_type))
[1] 0
# detect the outliers which are 3 MAD from the *median* value, we got 1 hit
> sum(bead_type > median(bead_type) + 3*mad(bead_type))
[1] 1
> sum(bead_type < median(bead_type) - 3*mad(bead_type))
[1] 0
# this hit gives exactly the same bead as the one detected by the
function "findAllOutliers"
> bead_type[bead_type > median(bead_type) + 3*mad(bead_type)]
[1] 2344.341
> x$G[19]
[1] 2344.341
# now let's examine the outliers detected by the function
"findAllOutliers" for another bead type
> x$ProbeID[120]
[1] 23
> x$ProbeID[133]
[1] 23
# so beads with ProbeID 23 have two outliers detected by the function
"findAllOutliers"
> bead_type <- x$G[x$ProbeID == 23]
> bead_type
[1] 1806.182 1590.193 1502.012 1934.863 1149.456 1734.869 1651.342
1541.102 1531.266 1735.293 1279.364 1642.195 1512.581 1583.358 1813.992
1624.385
[17] 1585.761 1138.922 1473.952 1467.488 1404.607 1647.807 1577.068
# 3 outliers were detected if 3 MAD from the *mean* value was used,
which is greater than the number of outliers detected by the function
"findAllOutliers"
> sum(bead_type > mean(bead_type) + 3*mad(bead_type))
[1] 1
> sum(bead_type < mean(bead_type) - 3*mad(bead_type))
[1] 2
# If we used 3 MAD from the *median* value to detect outliers, we got
the exactly same results with the function "findAllOutliers"
> sum(bead_type > median(bead_type) + 3*mad(bead_type))
[1] 0
> sum(bead_type < median(bead_type) - 3*mad(bead_type))
[1] 2
> bead_type[bead_type < median(bead_type) - 3*mad(bead_type)]
[1] 1149.456 1138.922
> x$G[c(120,133)]
[1] 1149.456 1138.922
---------------------------
I do not know it is the mistake of the documentation or the mistake
of the function implementation. I had a look at the Illumina Manuals for
the outlier removal. They only say Median Absolute Deviation method is
used to remove the outliers, but they do not say they use mean value or
median value when removing outliers. It does not seem to make sense to
use median absolute deviation together with the mean value.
I am using R version 2.5.0 and beadarray package version 1.4.0.
There is same problem with beadarray version 1.2.2. I actually
identified this problem in the version 1.2.2 and then confirmed this
problem in the latest version.
Cheers,
-------------------------
Dr. Wei Shi
Bioinformatics Division
The Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, VIC
Australia
[[alternative HTML version deleted]]