[BioC] Non-Specific Filtering with "nsFilter" Question
James W. MacDonald
jmacdon at uw.edu
Wed Jun 20 16:46:04 CEST 2012
Hi Zeynep,
On 6/20/2012 5:18 AM, zeynep özkeserli wrote:
> Hi All,
>
> I am trying to apply Non-Specific Filtering to Affymetrix GeneChip hgu133
> plus2 data.
>
> Since it has been shown that there are multiple probe sets mapping to the
> same gene in Affymetrix GeneChips (ref:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it is
> necessary to filter those. So I decided to use nsFilter{genefilter}.
>
> First I preprocessed the data, obtained an ExpressionSet object and then I
> set my criteria as suggested in the nsFilter example.
>
> - used require.entrez=TRUE, which filters out features without Entrez Gene
> IDs.
> - used remove.dupEntrez=TRUE, which filters features mapping to the same
> Entrez Gene ID. (I turned off the variance filter to see how many will be
> removed because of mapping to the same Entrez Gene ID.)
>
> And,
>
> - first filter removed 13009 features
> - second filter removed 21629 features.
>
> "Feature" here means gene, because this filter is in genefilter, which
> filters genes :). Am I wrong?
Well, you are sort of wrong. In this context, feature means probeset,
and each probeset is designed to interrogate either a gene transcript or
a putative gene transcript.
>
> And here are my questions:
>
> - If I did not perform the filtering wrongly, is it possible that there are
> this many duplicates? Or is it really too many? The hgu133 arrays data
> sheet says that it "Analyzes the relative expression level of more
> than 47,000 transcripts and variants, including more than 38,500 well
> characterized genes and UniGenes."
> (ref:
> http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf
> )
There is no telling if you did it right or wrong, as you neglected to
show us your code. What you did and what you think you did may actually
be different things. I can tell you this:
> length(unique(Rkeys(hgu133plus2ENTREZID)))
[1] 42094
So there are 42,094 unique Entrez Gene IDs represented on this array.
Note carefully that Affy states '47,000 transcripts and variants', so
they include transcript variants in that count, and these transcript
variants will by definition have the same Entrez Gene ID.
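A minimal sketch of the bookkeeping, assuming a made-up probeset-to-gene map (the probeset names and Entrez Gene IDs below are toy values, not taken from the hgu133plus2 annotation). Probesets with no ID are the kind require.entrez would drop, and the remaining repeats are the kind remove.dupEntrez would drop:

```r
# Toy probeset -> Entrez Gene ID map; NA means no Entrez Gene ID.
# All names and IDs here are invented for illustration.
entrez <- c(ps1 = "100", ps2 = "100", ps3 = "200", ps4 = NA,
            ps5 = "300", ps6 = "300", ps7 = "300", ps8 = NA)

n_no_id  <- sum(is.na(entrez))         # probesets lacking an Entrez Gene ID
mapped   <- entrez[!is.na(entrez)]
n_unique <- length(unique(mapped))     # distinct genes represented
n_dups   <- sum(duplicated(mapped))    # extra probesets for a gene already seen

c(no_id = n_no_id, unique_genes = n_unique, duplicates = n_dups)
```

The three counts add up to the total number of probesets, so a large duplicate count simply means that many genes are covered by more than one probeset.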
>
> - Can anybody suggest a mind-map to follow while performing non-specific
> filtering? I think this must be done very carefully.
Agreed. I have never personally been fond of non-specific filtering, as
to my mind it is a fairly blunt ax where a scalpel is required.
Additionally, it is intended to 'fix' problems that I am not sure are
fixable, or that even exist.

For instance, removing duplicated genes assumes that any features with
the same Entrez Gene ID are by definition intended to measure the same
thing. If there were no transcript variants this would be true. But
there are transcript variants, so you end up removing features that may
well be measuring different things. Not much of a fix IMO.
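As far as the nsFilter documentation describes it, remove.dupEntrez keeps, for each Entrez Gene ID, the probeset with the largest value of var.func (IQR by default). The sketch below mimics that rule in base R on invented data; it is an illustration under that assumption, not the genefilter implementation, and the probeset names and IDs are hypothetical:

```r
# Toy expression matrix: 3 probesets x 6 samples (values invented).
# psA and psB share a hypothetical Entrez Gene ID but behave differently,
# as probesets for two transcript variants might.
expr <- rbind(
  psA = c(5.0, 5.1, 4.9, 8.0, 8.2, 7.9),   # differs between sample groups
  psB = c(6.0, 6.1, 6.0, 6.1, 5.9, 6.0),   # essentially flat
  psC = c(7.0, 7.5, 6.5, 7.2, 6.8, 7.1)
)
entrez <- c(psA = "100", psB = "100", psC = "200")

# For each gene ID, keep the probeset with the largest IQR across samples.
iqr  <- apply(expr, 1, IQR)
keep <- unlist(lapply(split(names(iqr), entrez),
                      function(ps) ps[which.max(iqr[ps])]))
keep
```

Here psB is discarded purely because it shares an ID with psA, whether or not the two were measuring the same transcript.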
In addition, one rationale for filtering genes is to reduce the number
of multiple comparisons. This makes sense to a certain extent if you are
simply computing a statistic of some sort and then ranking genes in a
univariate manner. I say to a certain extent because things like FDR are
monotonic transforms of the p-values - you aren't changing the order,
just moving the cutoff between 'interesting' and 'uninteresting'. That's
sort of passé these days - instead of looking for individual genes, we
have moved on to looking for perturbed pathways or gene sets, and for
that I think removing data is a hindrance, not a help.
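The point about order versus cutoff can be checked with a rough sketch in base R. The p-values are simulated, and Benjamini-Hochberg via p.adjust is used here as one example of an FDR adjustment; both are assumptions for illustration only:

```r
set.seed(1)
# Invented p-values: 50 small ones plus 950 uniform 'null' ones.
p <- c(runif(50, 0, 0.001), runif(950))

adj_all <- p.adjust(p, method = "BH")

# Sorting by raw p-value also sorts the adjusted values: the ranking is unchanged.
same_order <- all(diff(adj_all[order(p)]) >= 0)

# Drop 475 of the null features before adjusting (in this toy we know which
# ones are null; a real filter does not). Fewer tests, so the cutoff moves.
kept     <- 1:525
adj_filt <- p.adjust(p[kept], method = "BH")

same_order
c(all = sum(adj_all < 0.05), filtered = sum(adj_filt < 0.05))
```

Filtering changes how many features fall under a given threshold, but it does not reorder the features that remain.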
>
> And another question regarding the filtering process.
>
> To my understanding, we should not use features mapping to the same Entrez
> Gene ID, because they represent non-specific hybridization, thus they give
> exaggerated signal intensities. So, does it affect preprocessing? If it
> does, is it meaningful to filter them out after the preprocessing step? Or
> am I doing it wrong from the first step? Should this filtering be done
> before the preprocessing?
I'm not sure where you got that idea, but I think it is wrong. Why would
having more than one feature that purports to measure transcript from
the same gene represent non-specific hybridization? It might represent
duplicate measurement of the same thing, which would be bad because you
are increasing the number of comparisons without actually comparing more
things.

You might be talking about features that might measure more than one
transcript, and these may well exist. In fact, the probeset IDs are
supposed to alert you to this possibility:

http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp

The short version of that FAQ is that _a_at indicates the probeset may
bind to multiple transcripts of the same gene, _s_at indicates that
the probeset may bind to multiple transcripts from the same gene family,
and _x_at indicates that the probeset may bind to multiple
transcripts from unrelated genes.
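A small sketch of how one might flag such probesets by suffix in base R. The IDs are made up for illustration, the function name suffix_class is hypothetical, and the suffix meanings simply follow the summary above:

```r
# Classify probeset IDs by suffix. The IDs below are invented examples.
ids <- c("1234_at", "1234_a_at", "5678_s_at", "9012_x_at")

suffix_class <- function(id) {
  if (grepl("_x_at$", id)) {
    "multiple transcripts, unrelated genes"
  } else if (grepl("_s_at$", id)) {
    "multiple transcripts, same gene family"
  } else if (grepl("_a_at$", id)) {
    "multiple transcripts, same gene"
  } else {
    "no warning suffix"
  }
}

flags <- vapply(ids, suffix_class, character(1))
flags
table(flags != "no warning suffix")
```

On real data the same function could be applied to the feature names of an ExpressionSet, for example featureNames(eset), where eset is a placeholder name for your own object.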
For those you can either take the probesets with a grain of salt, or
you might look at the MBNI remapped CDFs, which attempt to remove probes
that behave poorly.
Best,
Jim
>
> I am a little puzzled here. So any help would be appreciated.
>
> Thank you,
>
> Zeynep Ozkeserli
> Ankara University Biotechnology Institute
> Genomics Unit
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099