[BioC] Non-Specific Filtering with "nsFilter" Question
James W. MacDonald
jmacdon at uw.edu
Wed Jun 20 18:31:00 CEST 2012
Hi Zeynep,
On 6/20/2012 11:47 AM, zeynep özkeserli wrote:
> Hi James,
>
> Thank you for your detailed answer which covered all the black holes
> on this subject on my mind.
>
> In fact, the problem started with the control probes. The problem was
> that, when I performed limma analysis without any filters, the control
> probes were on top of the differentially expressed gene list. I
> couldn't find out why, it didn't seem to be an experimental defect (I
> concluded it from QC Reports). So while I was trying to find out a
> solution for this, I also started to think on filtering to reduce the
> number of multiple comparisons (and my misunderstandings on probe
> design suddenly popped out, sorry for some of the unnecessary
> questions.) Do you have any idea why control probes would appear to be
> significantly differentially expressed? Is it logical to just remove them?
Ugh. I hate when that happens.
So, it depends on what you mean by control probes, as there are various
types. If you are talking about the beta-actin or other 'housekeeping'
genes, then it isn't clear to me if this is a problem or not. The
general assumption is that these genes are constitutively expressed,
and never vary. But I have always wondered about that. It's sort of like
the 'no two snow flakes are alike' hypothesis - in general circulation,
but by definition untestable. So housekeeping genes make me wonder, but
don't really cause much teeth gnashing.
The same is true for the 'normalizing control set' of 100 probesets that
Affy claim are not differentially expressed in different tissues. I
think that really depends. I had one study back in the day where they
were comparing normal C. elegans to C. elegans that had some deadly
mutation, and something like 95% of the genes were differentially
expressed. It was just ridiculous. But the point to me was that you
can't know if a gene or set of genes is never affected - it is too
context dependent.
That said, I would recommend ensuring that everything is OK. I don't
know what you mean by QC Reports - perhaps you used the affyQCReport
package, or arrayQualityMetrics? I would certainly run these data
through one of those packages. I would also do things like PCA plots of
the expression values, and maybe image plots that you can generate using
the affyPLM package.
Now if you have things like the Poly-A controls or the Hybridization
controls popping up, then you may have a real problem, as those are
spiked in during the processing. This could indicate big technical
variability between batches that may not be resolvable.
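
As a rough way to see which kind of control is showing up, one could flag the top probesets by ID. This is only a sketch: the `ids` vector is made up (in practice it would be the row names of your top table), `classify_control` is a hypothetical helper, and the patterns assume the usual Affymetrix naming, where control probesets start with AFFX, hybridization controls carry BioB/BioC/BioDn/Cre in the name, and poly-A spikes carry Lys/Phe/Thr/Dap.

```r
# Hypothetical IDs; in practice, the row names of a ranked result table.
ids <- c("AFFX-BioB-5_at", "AFFX-r2-Bs-lys-3_at", "AFFX-HSAC07/X00351_3_at",
         "1007_s_at", "AFFX-CreX-3_at", "201746_at")

# Hypothetical helper: label each probeset by the kind of control it looks like.
classify_control <- function(id) {
  ifelse(!grepl("^AFFX", id), "not a control",
  ifelse(grepl("bio[bcd]|cre", id, ignore.case = TRUE), "hybridization control",
  ifelse(grepl("lys|phe|thr|dap", id, ignore.case = TRUE), "poly-A control",
         "other AFFX (e.g. housekeeping)")))
}

control_type <- classify_control(ids)
print(data.frame(id = ids, control_type))
```

If most of the hits land in the hybridization or poly-A rows, that points at processing rather than biology.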
>
> And about getting rid of the "passe" analysis pipeline; does the
> search for interesting pathways start after deciding "important" genes
> set or is it another approach which seeks those sets in the whole data
> set in a different manner? Can you please recommend me any papers
> where I could learn this approach?
Well, the general idea started with Gene Ontology analyses where you
take the 'top' genes, based on a cutoff, and try to find GO terms that
are over or under-represented in the set of significant genes. The
underlying weakness there is that you are relying on a cutoff, which can
be fairly arbitrarily set.
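
To make the cutoff-based idea concrete, here is a minimal sketch of an over-representation test for a single GO term using the hypergeometric distribution in base R. All four counts are invented for illustration; real analyses would take them from the array annotation and the chosen cutoff.

```r
# Made-up counts for one GO term (illustrative assumptions only).
n_universe <- 20000  # genes tested
n_in_term  <- 200    # genes annotated to the GO term
n_selected <- 500    # genes passing the significance cutoff
n_overlap  <- 15     # selected genes that are also in the term

# P(overlap >= n_overlap) when drawing n_selected genes at random
p_over <- phyper(n_overlap - 1, n_in_term, n_universe - n_in_term,
                 n_selected, lower.tail = FALSE)

# The same test written as a one-sided Fisher exact test on a 2x2 table
tab <- matrix(c(n_overlap, n_selected - n_overlap,
                n_in_term - n_overlap,
                n_universe - n_in_term - (n_selected - n_overlap)), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_over, fisher = p_fisher))
```

Changing `n_selected` (that is, moving the cutoff) changes `n_overlap` and therefore the p-value, which is the weakness described above.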
Another way to think about it is to just take your ranked list of genes
(all genes on the chip, ranked by some statistic), and then see if a
certain group of genes (where 'group' is defined as an existing gene set
that somebody else already found, or a set of genes in a GO category, or
what have you) is 'higher up' in the ranked list than would be expected
by chance. For this approach you really need to filter down to a set of
unique genes, but in general I don't think you filter further. I'm no
expert on the literature, but I think one of the seminal papers is by Tian:
http://www.pnas.org/content/102/38/13544.short
There are also several out of Robert Gentleman's group that I have found
helpful. Do a Google Scholar search for 'gsea gentleman', and they will be near
the top.
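
A bare-bones version of the ranked-list idea can be sketched with a Wilcoxon rank-sum test in base R. To be clear, this is a simple stand-in and not the method from the paper above; the statistics are simulated and the gene set is hypothetical.

```r
set.seed(1)
n_genes <- 2000
stat <- rnorm(n_genes)                 # stand-in for a per-gene statistic
names(stat) <- paste0("gene", seq_len(n_genes))

gene_set <- names(stat)[1:40]          # hypothetical gene set
stat[gene_set] <- stat[gene_set] + 1   # shift the set so there is something to find

# Are genes in the set ranked higher than the rest? No cutoff is involved.
in_set <- names(stat) %in% gene_set
res <- wilcox.test(stat[in_set], stat[!in_set], alternative = "greater")
print(res$p.value)
```

Note that every gene contributes to the ranking, which is why throwing data away beforehand tends to hurt here.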
Best,
Jim
>
> Thanks again for your help and comments. Very much appreciated.
>
> Zeynep
>
>
>
> On Wed, Jun 20, 2012 at 5:46 PM, James W. MacDonald <jmacdon at uw.edu> wrote:
>
> Hi Zeynep,
>
>
> On 6/20/2012 5:18 AM, zeynep özkeserli wrote:
>
> Hi All,
>
> I am trying to apply Non-Specific Filtering to Affymetrix GeneChip
> hgu133 plus2 data.
>
> Since it has been shown that there are multiple probe sets mapping to
> the same gene in Affymetrix GeneChips (ref:
> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/), I thought it was
> necessary to filter those. So I decided to use nsFilter{genefilter}.
>
> First I preprocessed the data, obtained an ExpressionSet object, and
> then set my criteria as suggested in the example for nsFilter.
>
> - used require.entrez=TRUE, which filters out features without Entrez
> Gene IDs.
> - used remove.dupEntrez=TRUE, which filters features mapping to the
> same Entrez Gene ID. (I turned off the variance filter to see how many
> will be removed because of mapping to the same Entrez Gene ID.)
>
> And,
>
> - first filter removed 13009 features
> - second filter removed 21629 features.
>
> "feature" here being genes, because this filter is under genefilter,
> which filters genes :). Am I wrong?
>
>
> Well, you are sort of wrong. In this context, feature means
> probeset, and each probeset is designed to interrogate either a
> gene transcript or a putative gene transcript.
>
>
>
> And here are my questions:
>
> - If I did not perform the filtering wrongly, is it possible that there
> are this many duplicates? Or is it really too many? Because the hgu133
> arrays data sheet says: "Analyzes the relative expression level of
> more than 47,000 transcripts and variants, including more than 38,500
> well characterized genes and UniGenes."
> (ref:
> http://media.affymetrix.com/support/technical/datasheets/hgu133arrays_datasheet.pdf)
>
>
> There is no telling if you did it right or wrong, as you neglected
> to show us your code. What you did and what you think you did may
> actually be different things. I can tell you this:
>
> > length(unique(Rkeys(hgu133plus2ENTREZID)))
> [1] 42094
>
> So there are 42,094 unique Entrez Gene IDs represented on this
> array. Note carefully that Affy states '47,000 transcripts and
> variants', so they include transcript variants in that count, and
> these transcript variants will by definition have the same Entrez
> Gene ID.
>
>
>
> - Can anybody suggest a mind-map to follow while performing non-specific
> filtering? I think this must be done very carefully.
>
>
> Agreed. I have never personally been fond of non-specific
> filtering, as to my mind it is a fairly blunt ax where a scalpel
> is required. Additionally, it is intended to 'fix' problems that I
> am not sure are either fixable or even exist.
>
> For instance, removing duplicated genes assumes that any feature
> with the same Entrez Gene is by definition intended to measure the
> same thing. If there were no transcript variants this would be
> true. But there are transcript variants, so you end up removing
> things that may well be measuring different things. Not much of a
> fix IMO.
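
To show what this kind of duplicate removal does mechanically, here is a toy base-R sketch. It does not call nsFilter; it assumes, as I understand the nsFilter documentation, that among probesets sharing an Entrez Gene ID the one with the largest variability (IQR here) is kept. The probeset names, Entrez IDs, and expression values are all made up.

```r
set.seed(2)
# Toy data: 6 probesets x 8 samples.
expr <- matrix(rnorm(48), nrow = 6,
               dimnames = list(paste0("ps", 1:6), paste0("s", 1:8)))
entrez <- c(ps1 = "100", ps2 = "100", ps3 = "200",
            ps4 = NA, ps5 = "300", ps6 = "300")

# Step 1: drop probesets without an Entrez Gene ID
has_id  <- !is.na(entrez)
expr_id <- expr[has_id, , drop = FALSE]
ids     <- entrez[has_id]

# Step 2: per Entrez Gene ID, keep only the probeset with the largest IQR
spread <- apply(expr_id, 1, IQR)
keep <- unlist(lapply(split(names(spread), ids),
                      function(ps) ps[which.max(spread[ps])]))
expr_unique <- expr_id[keep, , drop = FALSE]

print(c(start = nrow(expr), no_entrez = sum(!has_id),
        duplicates = nrow(expr_id) - nrow(expr_unique),
        kept = nrow(expr_unique)))
```

If ps1 and ps2 were measuring different transcript variants of gene 100, one of them is gone after step 2 regardless, which is the concern raised above.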
>
> In addition, one rationale for filtering genes is to reduce the
> number of multiple comparisons. This makes sense to a certain
> extent if you are simply computing a statistic of some sort and
> then ranking genes in a univariate manner. I say to a certain
> extent because things like FDR are monotonic transforms - you
> aren't changing the order, just moving the cutoff between
> 'interesting' and 'uninteresting'. That's sort of passe these days
> - instead of looking for individual genes, we have moved on to
> looking for perturbed pathways or gene sets, and for that I think
> removing data is a hindrance not a help.
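
The point about monotonic transforms can be checked on simulated p-values with base R's `p.adjust`. Everything here is an assumption for illustration: the p-values are random draws and `keep` is an arbitrary filter, not a recommended one.

```r
set.seed(3)
# 900 null p-values plus 100 small ones (simulated)
p_all <- c(runif(900), rbeta(100, 1, 50))
# Hypothetical filter that drops every second null test
keep <- c(rep(c(TRUE, FALSE), 450), rep(TRUE, 100))

fdr_all  <- p.adjust(p_all, method = "BH")
fdr_filt <- p.adjust(p_all[keep], method = "BH")

# Sorting by raw p-value also sorts the adjusted values: the order is unchanged
same_order <- all(diff(fdr_all[order(p_all)]) >= 0)
print(same_order)

# What filtering can change is how many tests fall under a given cutoff
print(c(unfiltered = sum(fdr_all < 0.05), filtered = sum(fdr_filt < 0.05)))
```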
>
>
>
> And another question regarding the filtering process.
>
> To my understanding, we should not use features mapping to the same
> Entrez Gene ID, because they represent non-specific hybridization and
> thus give exaggerated signal intensities. So, does it affect
> preprocessing? If it does, is it meaningful to filter them out after
> the preprocessing step? Or am I doing it wrong from the first step?
> Should this filtering be done before the preprocessing?
>
>
> I'm not sure where you got that idea, but I think it is wrong. Why
> would having more than one feature that purports to measure
> transcript from the same gene represent non-specific
> hybridization? It might represent duplicate measurement of the
> same thing, which would be bad because you are increasing the
> number of comparisons without actually comparing more things.
>
> You might be talking about features that might measure more than
> one transcript, and these may well exist. In fact, the probeset
> IDs are supposed to alert you to this possibility:
>
> http://www.affymetrix.com/support/help/faqs/hgu133_2/faq_7.jsp
>
> The short version of that FAQ is that _a_at indicates the probeset
> may bind to multiple transcripts of the same gene, the _s_at
> indicates that the probeset may bind to multiple transcripts from
> the same gene family, and the _x_at indicates that the probeset
> may bind to multiple transcripts from unrelated genes.
>
> For that you can either take these probesets with a grain of salt,
> or you might look at the MBNI remapped cdfs, which attempt to
> remove probes that behave poorly.
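
A quick way to see how many probesets carry each of these suffixes is to pattern-match on the IDs. This is a sketch in base R: the IDs are examples of the naming format only, `suffix_note` is a hypothetical helper, and the labels simply restate the FAQ summary above.

```r
# Example IDs; in practice use the feature names of your ExpressionSet.
ids <- c("1007_s_at", "1053_at", "1552256_a_at", "1552263_x_at", "AFFX-BioB-5_at")

# Hypothetical helper: translate the ID suffix into the FAQ's warning.
suffix_note <- function(id) {
  ifelse(grepl("_a_at$", id), "multiple transcripts, same gene",
  ifelse(grepl("_s_at$", id), "multiple transcripts, same gene family",
  ifelse(grepl("_x_at$", id), "may cross-hybridize with unrelated genes",
         "no ambiguity flag in the ID")))
}

flags <- suffix_note(ids)
print(table(flags))
```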
>
> Best,
>
> Jim
>
>
>
> I am a little puzzled here. So any help would be appreciated.
>
> Thank you,
>
> Zeynep Ozkeserli
> Ankara University Biotechnology Institute
> Genomics Unit
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099