[BioC] Non-Specific Filtering with "nsFilter" Question

Wed Jun 20 16:46:04 CEST 2012

Hi Zeynep,

On 6/20/2012 5:18 AM, zeynep özkeserli wrote:
> Hi All,
>
> I am trying to apply Non-Specific Filtering to Affymetrix GeneChip hgu133
> plus2 data.
>
> Since it has been shown that there are multiple probe sets mapping to the
> same gene in Affymetrix GeneChips (ref:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it was
> necessary to filter those. So I decided to use nsFilter {genefilter}.
>
> First I preprocessed the data, obtained an ExpressionSet object and then I
> set my criteria as suggested in the example for nsFilter.
>
> - used require.entrez=TRUE, which filters out features without Entrez Gene
> IDs.
> - used remove.dupEntrez=TRUE, which filters features mapping to the same
> Entrez Gene ID. (I turned off the variance filter to see how many will be
> removed because of mapping to the same Entrez Gene ID.)
>
> And,
>
> - first filter removed 13009 features
> - second filter removed 21629 features.
>
> "feature" here being genes, because this filter is in genefilter, which
> filters genes :). Am I wrong?

Well, you are sort of wrong. In this context, feature means probeset, 
and each probeset is designed to interrogate either a gene transcript or 
a putative gene transcript.

>
> And here are my questions:
>
> - If I did not perform the filtering wrongly, is it possible that there are
> this many duplicates? Or is it really too many? Because the hgu133 arrays
> data sheet says "Analyzes the relative expression level of more
> than 47,000 transcripts and variants, including more than 38,500 well
> characterized genes and UniGenes."
> (ref:
> http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf
> )

There is no telling if you did it right or wrong, as you neglected to 
show us your code. What you did and what you think you did may actually 
be different things. I can tell you this:

 > library(hgu133plus2.db)
 > length(unique(Rkeys(hgu133plus2ENTREZID)))
[1] 42094

So there are 42,094 unique Entrez Gene IDs represented on this array. 
Note carefully that Affy states '47,000 transcripts and variants', so 
they include transcript variants in that count, and these transcript 
variants will by definition have the same Entrez Gene ID.
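
To make the distinction between probesets and Entrez Gene IDs concrete, here is a small base-R sketch on a made-up mapping. The probeset names and IDs are invented, not taken from hgu133plus2.db. It only counts how many features a require.entrez-style step and a remove.dupEntrez-style step would each drop, and is not a reimplementation of nsFilter.

```r
# Hypothetical probeset -> Entrez Gene ID mapping; NA means no Entrez ID.
entrez <- c(ps1 = "100", ps2 = "100", ps3 = "200", ps4 = NA,
            ps5 = "300", ps6 = "300", ps7 = "300", ps8 = NA)

# Step 1: drop features without an Entrez Gene ID.
has_id <- !is.na(entrez)
n_removed_no_id <- sum(!has_id)

# Step 2: among the rest, keep one feature per Entrez Gene ID.
mapped <- entrez[has_id]
n_unique_genes <- length(unique(mapped))
n_removed_dup <- length(mapped) - n_unique_genes

# Number of probesets that map to each gene.
per_gene <- table(mapped)
```

On real data the same counts could be compared against the numbers reported by the filter, which is a quick way to check that what was run matches what was intended.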

>
> - Can anybody suggest a mind-map to follow while performing non-specific
> filtering? I think this must be done very carefully.

Agreed. I have never personally been fond of non-specific filtering, as 
to my mind it is a fairly blunt ax where a scalpel is required. 
Additionally, it is intended to 'fix' problems that I am not sure are 
either fixable or even exist.

For instance, removing duplicated genes assumes that any feature with 
the same Entrez Gene ID is by definition intended to measure the same 
thing. If there were no transcript variants this would be true. But 
there are transcript variants, so you end up removing features that may 
well be measuring different things. Not much of a fix IMO.
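
As an illustration of what gets thrown away, the following base-R sketch keeps one probeset per Entrez Gene ID, choosing the one with the largest IQR across samples. The genefilter documentation describes remove.dupEntrez as retaining the feature with the largest value of var.func, which is IQR by default. The expression values and IDs here are made up, and this is a toy version of the idea, not the nsFilter code.

```r
# Made-up expression values: 5 probesets x 6 samples.
expr <- rbind(
  ps1 = c(5, 5, 5, 5, 5, 5),
  ps2 = c(1, 2, 3, 4, 5, 6),
  ps3 = c(7, 7, 8, 8, 7, 7),
  ps4 = c(2, 9, 2, 9, 2, 9),
  ps5 = c(4, 4, 5, 4, 4, 5)
)

# Hypothetical probeset -> Entrez Gene ID mapping.
entrez <- c(ps1 = "100", ps2 = "100", ps3 = "200", ps4 = "300", ps5 = "300")

# Per-probeset spread across samples.
spread <- apply(expr, 1, IQR)

# For each gene, keep the probeset with the largest IQR.
by_gene <- split(names(spread), entrez)
keep <- vapply(by_gene, function(ids) ids[which.max(spread[ids])],
               character(1))

filtered <- expr[keep, , drop = FALSE]
```

Here ps1 and ps5 are dropped purely because a sibling probeset varies more. If they target a different transcript variant, that information is lost.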

In addition, one rationale for filtering genes is to reduce the number 
of multiple comparisons. This makes sense to a certain extent if you are 
simply computing a statistic of some sort and then ranking genes in a 
univariate manner. I say to a certain extent because adjustments like 
FDR are monotonic transforms of the p-values - you aren't changing the 
order, just moving the cutoff between 'interesting' and 'uninteresting'. 
That approach is sort of passé these days - instead of looking for 
individual genes, we have moved on to looking for perturbed pathways or 
gene sets, and for that I think removing data is a hindrance, not a help.
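
The point about monotonic transforms can be checked with a few made-up p-values. This sketch uses base R's p.adjust with the Benjamini-Hochberg method as one example of an FDR adjustment. The feature names and p-values are invented, and the "filter" simply drops the last four features.

```r
# Made-up p-values for 8 features.
p <- c(f1 = 0.001, f2 = 0.008, f3 = 0.020, f4 = 0.040,
       f5 = 0.300, f6 = 0.450, f7 = 0.700, f8 = 0.900)

# BH adjustment over all features.
adj_all <- p.adjust(p, method = "BH")

# Pretend a filter dropped the last four features, then adjust again.
kept <- p[1:4]
adj_kept <- p.adjust(kept, method = "BH")

# The ranking among the kept features is the same either way.
same_order <- identical(order(adj_all[names(kept)]), order(adj_kept))

# What changes is how many features pass a fixed cutoff.
n_sig_all  <- sum(adj_all  <= 0.05)
n_sig_kept <- sum(adj_kept <= 0.05)
```

The order of f1 to f4 is untouched by the filter. Only the number that falls below the 0.05 cutoff moves.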

>
> And another question regarding the filtering process.
>
> To my understanding, we should not use features mapping to the same Entrez
> Gene ID, because they represent non-specific hybridization, thus they give
> exaggerated signal intensities. So, does it affect preprocessing? If it
> does, is it meaningful to filter them out after the preprocessing step? Or
> am I doing it wrong from the first step? Should this filtering be done
> before the preprocessing?

I'm not sure where you got that idea, but I think it is wrong. Why would 
having more than one feature that purports to measure transcript from 
the same gene represent non-specific hybridization? It might represent 
duplicate measurement of the same thing, which would be bad because you 
are increasing the number of comparisons without actually comparing more 
things.

You might be talking about features that might measure more than one 
transcript, and these may well exist. In fact, the probeset IDs are 
supposed to alert you to this possibility:

http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp

The short version of that FAQ is that _a_at indicates the probeset may 
bind to multiple transcripts of the same gene, _s_at indicates that 
the probeset may bind to multiple transcripts from the same gene family, 
and _x_at indicates that the probeset may bind to multiple 
transcripts from unrelated genes.

For those you can either take these probesets with a grain of salt, or 
you might look at the MBNI (BrainArray) remapped CDFs, which attempt to 
remove probes that behave poorly.
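
A quick way to see which probesets carry these suffixes is to pattern-match on the IDs. The sketch below is base R only, and the probeset IDs are hypothetical, chosen just to show each suffix. Anything without one of the three suffixes discussed above, including plain _at probesets, is labelled "other".

```r
# Hypothetical probeset IDs, one or two per suffix type.
ids <- c("1000_at", "1001_a_at", "1002_s_at", "1003_x_at", "1004_at")

# Label each ID by its suffix; everything else is "other".
suffix <- ifelse(grepl("_a_at$", ids), "a",
          ifelse(grepl("_s_at$", ids), "s",
          ifelse(grepl("_x_at$", ids), "x", "other")))

# Tally of suffix types.
suffix_counts <- table(suffix)

# Probesets that may bind transcripts from more than one gene.
flagged <- ids[suffix %in% c("s", "x")]
```

On a real ExpressionSet the IDs would come from featureNames(), and the flagged set could be inspected rather than removed outright.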

Best,

Jim

>
> I am a little puzzled here. So any help would be appreciated.
>
> Thank you,
>
> Zeynep Ozkeserli
> Ankara University Biotechnology Institute
> Genomics Unit

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099