[BioC] Removing duplicate probes from expressionset

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 15 01:37:55 CEST 2012


On 04/14/2012 03:59 PM, Angela McDonald wrote:
> Hello,
>
> I am wondering how to remove duplicate probes from an expression set in Bioconductor.  I have tried to use nsFilter with no success.
>
> When I use the following:
>
> featureFilter(xenexp, require.entrez=TRUE, remove.dupEntrez=TRUE)
>
> The error I get is:
>
> Error in rowQ(exprs(imat), which) :
>    cannot calculate order statistic on object with 2 columns
>
>   The xenexp expression set includes two samples on the mgu74av2 array

Hi Angela --

featureFilter decides which probeset to keep for each duplicated ENTREZ 
id by choosing the one with the largest interquartile range. rowIQRs, 
the function it uses for this, cannot compute an interquartile range 
from only 2 samples, leading to the error above.
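
A rough illustration of why two samples are a problem. This assumes the 
interquartile range is taken from order statistics at positions 
floor(0.25 * n) and ceiling(0.75 * n), which is my reading of rowIQRs; 
check the genefilter source for your version. The helper name 
iqr_positions is made up for this sketch and is not part of any package.

```r
# Hypothetical helper: positions of the order statistics that would be
# used for the lower and upper quartile, under the assumption stated above.
iqr_positions <- function(n) {
    c(lower = floor(0.25 * n), upper = ceiling(0.75 * n))
}

iqr_positions(2)   # lower position is 0, which is not a valid order statistic
iqr_positions(8)   # both positions fall within 1..8
```

With n = 2 there is no valid lower order statistic to look up, which 
would be consistent with the complaint about an object with 2 columns.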

From looking at the source for featureFilter

 > featureFilter
function (eset, require.entrez = TRUE, require.GOBP = FALSE,
     require.GOCC = FALSE, require.GOMF = FALSE, require.CytoBand = FALSE,
     remove.dupEntrez = TRUE, feature.exclude = "^AFFX")
{

[...]

you'll see that duplicate probes are removed by the lines

     if (remove.dupEntrez) {
         uniqGenes <- findLargest(featureNames(eset), rowIQRs(eset),
             annotation(eset))
         eset <- eset[uniqGenes, ]
     }

so after consulting ?findLargest you could use some statistic other than 
rowIQRs (row interquartile range) to select which probeset to retain, 
e.g., using the 'sample.ExpressionSet' data and keeping, for each gene, 
the probeset with the largest range for subsequent analysis

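   library(Biobase)     # ExpressionSet, exprs(), featureNames()
   library(genefilter)  # findLargest()
   data(sample.ExpressionSet)
   eset <- sample.ExpressionSet
   ## per-probeset range across samples; works even with two samples
   rng <- apply(exprs(eset), 1, function(x) diff(range(x)))
   ## one probeset per Entrez id: the one with the largest range
   uniqGenes <- findLargest(featureNames(eset), rng, annotation(eset))
   eset <- eset[uniqGenes,]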
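
To make the selection step concrete, here is a minimal base R sketch of 
the same idea on a two-sample toy matrix. It is not genefilter's 
implementation; the expression values, probeset names and gene mapping 
are all made up for illustration.

```r
# Made-up expression values: four probesets (rows), two samples (columns).
x <- matrix(c(7.1, 8.4,
              5.0, 5.2,
              9.3, 6.8,
              4.4, 4.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))

# Made-up probeset-to-gene mapping; p1 and p2 share a gene.
genes <- c(p1 = "g1", p2 = "g1", p3 = "g2", p4 = "g3")

# Per-probeset range, which can be computed even with two samples.
rng <- apply(x, 1, function(v) diff(range(v)))

# For each gene, keep the name of the probeset with the largest range.
keep <- vapply(split(rng, genes[names(rng)]),
               function(s) names(which.max(s)), character(1))

# Subset to one probeset per gene.
x[keep, ]
```

With two samples the range is simply the absolute difference between 
the two columns, so this statistic favours the probeset that changes 
most between the samples.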

Note that you're asking to remove duplicate Entrez gene identifiers, 
rather than duplicate probesets. It is not uncommon to perform an 
analysis without removing duplicates, expecting that probesets from the 
same gene will convey qualitatively similar signals. Also, the small 
sample size restricts the type of analysis possible anyway, so the 
usual motivation for removing duplicates -- reducing the number of 
statistical tests -- may not be relevant.

Martin

>
> Thank you so much,
>
> Angela


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793


