[BioC] SAM warning
elliott harrison
e.harrison at epistem.co.uk
Thu Jan 15 12:54:02 CET 2009
Hi,
I'm trying the MBCB correction for Illumina data and then running SAM
afterwards.
I've run this successfully a few times, but for one experiment I get the
message
"Warning message:
There are 1 variables with zero variance. These variables are removed,
and their d-values are set to NA. "
Is this just referring to one gene's values having no variance (I guess
so, as it is just a warning) or to one of the experiment groups?
If it is just one gene, do I need to worry? I guess not, as it is probably
just a fluke.
Thanks
Elliott
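
One way to see which variable the warning means is to compute the per-row variance directly. This is a minimal sketch in base R, assuming the warning counts rows of the data matrix (genes), as is usual for expression data. The matrix `expr` and its gene names are made up for illustration and are not from the experiment described above.

```r
# Hypothetical expression matrix: 4 genes (rows) x 6 arrays (columns).
expr <- rbind(
  geneA = c(7.1, 7.4, 6.9, 8.2, 8.0, 8.5),
  geneB = c(5.0, 5.0, 5.0, 5.0, 5.0, 5.0),  # identical on every array
  geneC = c(9.3, 9.1, 9.6, 9.0, 9.4, 9.2),
  geneD = c(6.2, 6.8, 6.5, 7.9, 7.7, 8.1)
)

# Variance of each gene across all arrays.
gene_var <- apply(expr, 1, var)

# Genes whose values do not vary at all.
zero_var_genes <- names(gene_var)[gene_var == 0]
print(zero_var_genes)
```

If a check like this lists a single row, then under the stated assumption the warning concerns that one gene rather than a sample group.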
-----Original Message-----
From: bioconductor-bounces at stat.math.ethz.ch
[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of
bioconductor-request at stat.math.ethz.ch
Sent: Thursday, January 15, 2009 11:00 AM
To: bioconductor at stat.math.ethz.ch
Subject: Bioconductor Digest, Vol 71, Issue 11
Today's Topics:
1. Re: Filtering before differential expression analysis of
microarrays - New paper out (Steve Lianoglou)
2. Re: multiple locations for probeset in hgu133plus2CHRLOC vs.
UCSC PSL data (Robert Gentleman)
3. Re: Filtering before differential expression analysis of
microarrays - New paper out (Steve Lianoglou)
4. Re: Filtering before differential expression analysis of
microarrays - New paper out (James W. MacDonald)
5. Re: Filtering before differential expression analysis of
microarrays - New paper out (Daniel Brewer)
----------------------------------------------------------------------
Message: 1
Date: Wed, 14 Jan 2009 11:10:38 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Subject: Re: [BioC] Filtering before differential expression analysis
of microarrays - New paper out
To: Gordon Smyth <smyth at wehi.EDU.AU>
Cc: "James W. MacDonald" <jmacdon at med.umich.edu>, Bioconductor
mailing list <bioconductor at stat.math.ethz.ch>
Message-ID: <F2B2EEEF-480A-4180-BE00-FD21909D59BC at gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Hi Gordon,
As someone who has been dealing more and more with raw data, I always
appreciate detailed answers from the masters, such as the one you just
wrote. Even after reading several of the published articles regarding
these normalization practices, I always find these less formal emails
quite helpful.
That said, one point you mention isn't exactly clear to me, and I'm
wondering if you could elaborate just a bit here:
> Filtering non-expressed probes tends not to be emphasised on this list
> because users of this list are often sophisticated enough to use
> variance stabilizing normalization methods such as rma, vsn, normexp
> or vst. This means that low-expression filtering is done more for
> multiplicity issues than for variance stabilization, and therefore
> often doesn't make a huge difference. When using earlier
> normalization methods such as MAS for Affy or local background
> correction for two-color arrays, expression-filtering is absolutely
> essential, because the normalized expression values are so unstable at
> low intensity levels.
When you say "... low-expression filtering is done more for multiplicity
issues than for variance stabilization", what exactly do you mean by
"multiplicity issues"?
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology Weill
Medical College of Cornell University
http://cbio.mskcc.org/~lianos
------------------------------
Message: 2
Date: Wed, 14 Jan 2009 09:00:59 -0800
From: "Robert Gentleman" <rgentlem at fhcrc.org>
Subject: Re: [BioC] multiple locations for probeset in
hgu133plus2CHRLOC vs. UCSC PSL data
To: "Marc Carlson" <mcarlson at fhcrc.org>
Cc: "Bazeley, Peter" <Peter.Bazeley at utoledo.edu>,
bioconductor at stat.math.ethz.ch
Message-ID:
<b796582f0901140900nca4fc66x5f84eb96330a2d04 at mail.gmail.com>
Content-Type: text/plain
To follow up slightly
On Tue, Nov 18, 2008 at 9:57 AM, Marc Carlson <mcarlson at fhcrc.org>
wrote:
> Hi Peter,
>
> I think that your confusion is coming from the fact that these are the
> chromosome start locations for the genes and not the probes.
> According to Affy, that probe is supposed to be measuring that gene
> and we took their word for that. We then gave you the start positions
> for transcripts of that gene according to UCSC. We don't currently
> provide the data for where the probe aligns to the genome or to which
> transcripts in the genome the probe might stick to.
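
The distinction Marc draws can be checked against the numbers in Peter's message below. This is a rough sketch in base R using only the positions, block starts and block sizes quoted there. Treating each block as the half-open interval from start to start + size is an assumption about the PSL coordinate convention.

```r
# Positions reported by hgu133plus2CHRLOC for 201268_at (from Peter's message).
chrloc <- c(46598879, 46597889, 46598637, 46599081)

# PSL alignment on chr17 (from Peter's message): block starts and sizes.
block_start <- c(46598810, 46599186, 46600601, 46602296, 46603846)
block_size  <- c(47, 130, 102, 113, 257)
block_end   <- block_start + block_size   # end taken as start + size (assumed)

# Overall extent of the probeset alignment.
aln_start <- min(block_start)
aln_end   <- max(block_end)

# Does each CHRLOC position fall inside the alignment extent?
in_alignment <- chrloc >= aln_start & chrloc <= aln_end

# Does each CHRLOC position fall inside any ungapped block?
in_block <- sapply(chrloc, function(p) any(p >= block_start & p < block_end))

print(data.frame(chrloc, in_alignment, in_block))
```

Two of the four positions lie within the alignment extent and none falls inside an ungapped block, which is at least consistent with the CHRLOC values being transcript start positions rather than probe alignment coordinates.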
You can easily find all genomic regions using Biostrings, and this is
one of the examples in the vignette, I believe.
Finding all transcripts is harder (at least in the sense that we have
not yet developed a pipeline for it). You would need to download all
the transcripts sequences from somewhere (RefSeq?), and then basically
modify the example in the Biostrings vignette to do the matching.
These are not particularly large or hard problems, so a few hours
would deal with the first, maybe a day or two for the second.
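
The matching step could look roughly like the sketch below. It uses base R only, to stay dependency-free; Biostrings provides `matchPattern()` for this kind of search and would be the natural choice on real genome or transcript sequences. Both sequences here are toy strings invented for illustration.

```r
# Toy sequences, invented for illustration only.
reference <- "ACGTTGCAACGTACGTTGCAGGCTACGTTGCA"
probe     <- "ACGTTGCA"

# All exact match start positions (1-based) of the probe in the reference.
# fixed = TRUE treats the probe as a literal string, not a regular expression.
hits   <- gregexpr(probe, reference, fixed = TRUE)[[1]]
starts <- as.integer(hits)
starts <- starts[starts > 0]          # gregexpr() returns -1 when nothing matches
ends   <- starts + nchar(probe) - 1L

print(data.frame(start = starts, end = ends))
```

This only finds exact, non-overlapping matches on the given strand. Mismatches and the reverse complement would need extra handling.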
best wishes
Robert
>
>
>
> Marc
>
>
>
>
> Bazeley, Peter wrote:
> > Hello,
> >
> > R version: 2.8.0
> >
> > I just installed the hgu133plus2.db package, and am looking at the
> > hgu133plus2CHRLOC environment. I've noticed that some of the probeset
> > entries (e.g. "201268_at") have multiple locations compared to Affy's
> > annotation file. I'd like to figure out if these multiple locations
> > are current, in which case it is some sort of overlapping/repeating
> > duplication. For example:
> >
> >> as.list(hgu133plus2CHRLOC)$'201268_at'
> >       17       17       17       17
> > 46598879 46597889 46598637 46599081
> >
> > indicates that the probeset maps to 4 locations. Compare this to the
> > alignments info in Affy's annotation file (from 7/8/08,
> > http://www.affymetrix.com/Auth/analysis/downloads/na26/ivt/HG-U133_Plus_2.na26.annot.csv.zip):
> >
> > chr12:119204403-119205041 (+) // 91.49 // q24.31 ///
> > chr17:46598810-46604103 (+) // 96.87 // q21.33
> >
> > which suggests one location on chromosome 17 (I'm ignoring chromosome
> > 12 for now). This is a "_at" probeset, so it should map uniquely to a
> > sequence, according to Affy's "Data Analysis Fundamentals" document
> > (and speaking to a rep).
> >
> > From the information provided by "?hgu133plus2CHRLOC", I downloaded
> > ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/affyU133Plus2.txt.gz
> > from UCSC to see how this occurred, but it is not clear. Actually, the
> > file
> > http://www.affymetrix.com/Auth/analysis/downloads/psl/HG-U133_Plus_2.link.psl.zip
> > from Affy's support page has the same alignment info. Here's the
> > relevant PSL info:
> >
> > Target sequence name: chr17
> > Alignment start position in target: 46598810
> > Alignment end position in target: 46604103
> > Number of blocks in the alignment (a block contains no gaps): 5
> > Comma-separated list of sizes of each block: 47,130,102,113,257,
> > Comma-separated list of starting positions of each block in target:
> > 46598810,46599186,46600601,46602296,46603846,
> >
> > The second location provided by CHRLOC (46597889) occurs before the
> > start of the alignment in the PSL info, so perhaps this one CHRLOC
> > location corresponds to the PSL alignment? The mappings were obtained
> > from UCSC on 2006-Apr14, so perhaps additional alignments existed at
> > the time, which have since been removed.
> >
> > Thank you for any help. Hopefully I'm just missing something obvious
> > (well, non-obvious for me).
> >
> > Peter Bazeley
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
>
--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
------------------------------
Message: 3
Date: Wed, 14 Jan 2009 12:59:53 -0500
From: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Subject: Re: [BioC] Filtering before differential expression analysis
of microarrays - New paper out
To: "James W. MacDonald" <jmacdon at med.umich.edu>
Cc: Gordon Smyth <smyth at wehi.EDU.AU>, Bioconductor mailing list
<bioconductor at stat.math.ethz.ch>
Message-ID: <052D02CD-EB11-4DB6-AE65-DF00B118943F at gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Thanks, Jim!
Multiplicity as in multiple testing makes sense. I wasn't sure if he was
referring to probes appearing in multiple places, within arrays or
across arrays (and I was trying to work out how that might be relevant
here).
Cheers,
-steve
On Jan 14, 2009, at 12:50 PM, James W. MacDonald wrote:
> Hi Steve,
>
> The question wasn't really asked of me, but Gordon is likely in bed
> right now ;-D
>
> Steve Lianoglou wrote:
>> Hi Gordon,
>> As someone who has been dealing more and more with raw data, I always
>> appreciate detailed answers from the masters, such as the one you
>> just wrote. Even after reading several of the published articles
>> regarding these normalization practices, I always find these less
>> formal emails quite helpful.
>> That said, one point you mention isn't exactly clear to me, and I'm
>> wondering if you could elaborate just a bit here:
>>> Filtering non-expressed probes tends not to be emphasised on this list
>>> because users of this list are often sophisticated enough to use
>>> variance stabilizing normalization methods such as rma, vsn, normexp
>>> or vst. This means that low-expression filtering is done more for
>>> multiplicity issues than for variance stabilization, and therefore
>>> often doesn't make a huge difference. When using earlier
>>> normalization methods such as MAS for Affy or local background
>>> correction for two-color arrays, expression-filtering is absolutely
>>> essential, because the normalized expression values are so unstable
>>> at low intensity levels.
>> When you say "... low-expression filtering is done more for
>> multiplicity issues than for variance stabilization", what exactly do
>> you mean by "multiplicity issues"?
>
> By multiplicity issues Gordon was referring to the multiple
> comparisons problem. A type 1 error is when we say there is a
> difference when in fact there isn't (a false positive). If we reject
> the null hypothesis whenever the p-value is below an alpha level of
> 0.05, we are in essence accepting a 5% chance of making that error
> when there is no real difference.
>
> For one test this isn't a problem, but as you make more and more tests
> simultaneously, you expect to see more and more false positives (e.g.,
> if you do 20 tests at an alpha of 0.05, and there are really no
> differences for any of the tests, you still expect about one of them
> to appear significant even though none are).
>
> There are lots of ways to adjust for multiple comparisons, but one of
> the best things you can do is not make so many comparisons in the
> first place, by filtering out data based on one or more criteria.
>
> Best,
>
> Jim
>> Thanks,
>> -steve
>> --
>> Steve Lianoglou
>> Graduate Student: Physiology, Biophysics and Systems Biology Weill
>> Medical College of Cornell University http://cbio.mskcc.org/~lianos
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Hildebrandt Lab
> 8220D MSRB III
> 1150 W. Medical Center Drive
> Ann Arbor MI 48109-5646
> 734-936-8662
--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology Weill
Medical College of Cornell University
http://cbio.mskcc.org/~lianos
------------------------------
Message: 4
Date: Wed, 14 Jan 2009 12:50:54 -0500
From: "James W. MacDonald" <jmacdon at med.umich.edu>
Subject: Re: [BioC] Filtering before differential expression analysis
of microarrays - New paper out
To: Steve Lianoglou <mailinglist.honeypot at gmail.com>
Cc: Gordon Smyth <smyth at wehi.EDU.AU>, Bioconductor mailing list
<bioconductor at stat.math.ethz.ch>
Message-ID: <496E25FE.1020003 at med.umich.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hi Steve,
The question wasn't really asked of me, but Gordon is likely in bed
right now ;-D
Steve Lianoglou wrote:
> Hi Gordon,
>
> As someone who has been dealing more and more with raw data, I always
> appreciate detailed answers from the masters, such as the one you just
> wrote. Even after reading several of the published articles regarding
> these normalization practices, I always find these less formal emails
> quite helpful.
>
> That said, one point you mention isn't exactly clear to me, and I'm
> wondering if you could elaborate just a bit here:
>
>> Filtering non-expressed probes tends not to be emphasised on this list
>> because users of this list are often sophisticated enough to use
>> variance stabilizing normalization methods such as rma, vsn, normexp
>> or vst. This means that low-expression filtering is done more for
>> multiplicity issues than for variance stabilization, and therefore
>> often doesn't make a huge difference. When using earlier
>> normalization methods such as MAS for Affy or local background
>> correction for two-color arrays, expression-filtering is absolutely
>> essential, because the normalized expression values are so unstable
>> at low intensity levels.
>
>
> When you say "... low-expression filtering is done more for
> multiplicity issues than for variance stabilization", what exactly do
> you mean by "multiplicity issues"?
By multiplicity issues Gordon was referring to the multiple comparisons
problem. A type 1 error is when we say there is a difference when in fact
there isn't (a false positive). If we reject the null hypothesis whenever
the p-value is below an alpha level of 0.05, we are in essence accepting a
5% chance of making that error when there is no real difference.
For one test this isn't a problem, but as you make more and more tests
simultaneously, you expect to see more and more false positives (e.g., if
you do 20 tests at an alpha of 0.05, and there are really no differences
for any of the tests, you still expect about one of them to appear
significant even though none are).
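
The arithmetic behind the 20-test example can be written out as a short sketch. It assumes the tests are independent and that every null hypothesis is true; the variable names are chosen here for illustration.

```r
alpha   <- 0.05   # per-test significance level
n_tests <- 20     # number of independent tests, all with a true null

# Expected number of false positives.
expected_fp <- n_tests * alpha

# Probability that at least one test is falsely called significant.
p_any_fp <- 1 - (1 - alpha)^n_tests

print(expected_fp)
print(round(p_any_fp, 3))
```

Under these assumptions the expected count is one false positive, and the chance of seeing at least one is roughly 64%.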
There are lots of ways to adjust for multiple comparisons, but one of
the best things you can do is not make so many comparisons in the first
place, by filtering out data based on one or more criteria.
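
A small sketch of why fewer comparisons help, using `p.adjust()` from base R with the Bonferroni method as one simple adjustment among many. The p-values and the filter are hypothetical, and the filter is assumed to have been chosen without looking at the test results.

```r
# Hypothetical p-values: one small p-value among otherwise unremarkable ones.
p_all <- c(0.001, rep(0.5, 99))        # 100 probes tested

# Suppose a filter keeps 20 probes, including the first one.
p_filtered <- p_all[1:20]

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
adj_all      <- p.adjust(p_all, method = "bonferroni")
adj_filtered <- p.adjust(p_filtered, method = "bonferroni")

print(adj_all[1])       # adjusted over 100 tests
print(adj_filtered[1])  # adjusted over 20 tests
```

The same raw p-value of 0.001 is adjusted to about 0.1 over 100 tests but to about 0.02 over 20 tests, so it survives a 0.05 cutoff only after filtering.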
Best,
Jim
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology Weill
> Medical College of Cornell University
>
> http://cbio.mskcc.org/~lianos
>
--
James W. MacDonald, M.S.
Biostatistician
Hildebrandt Lab
8220D MSRB III
1150 W. Medical Center Drive
Ann Arbor MI 48109-5646
734-936-8662
------------------------------
Message: 5
Date: Thu, 15 Jan 2009 10:49:32 +0000
From: Daniel Brewer <daniel.brewer at icr.ac.uk>
Subject: Re: [BioC] Filtering before differential expression analysis
of microarrays - New paper out
To: Gordon Smyth <smyth at wehi.EDU.AU>
Cc: Bioconductor mailing list <bioconductor at stat.math.ethz.ch>
Message-ID: <496F14BC.4040601 at icr.ac.uk>
Content-Type: text/plain; charset=ISO-8859-1
Thanks for the brilliant answer. Very interesting stuff. The only
other question I would like to ask concerning this is how you define
a probe as non-expressed. Is this done by inspecting some kind of
plot (e.g. an MA plot), by taking a fixed percentage of probes, or by
some absolute value known from experience? For Affy arrays you can use
the DABG results, but I am not sure what the correct approach would be
with two-colour microarrays.
Many thanks
Dan
Gordon Smyth wrote:
> Dear Dan,
>
> It's very common practice to keep all the probes for normalization, then
> to filter control probes and consistently non-expressed probes before
> differential expression analysis. I recommend this and do it myself.
> It's such common practice that it's surprising to see a paper on it at
> this stage.
>
> It is in the spirit of normalization methods that all probes should be
> retained for normalization, except in unusual cases in which some probes
> are obviously poor quality for reasons other than expression level.
>
> At the differential expression step, probes can be usefully filtered out
> if they are not of any potential interest. This means control probes,
> or probes which appear to be non-expressed across all conditions in the
> experiment, i.e., on all arrays. I have frequently complained on this
> mailing list about the practice of filtering individual low intensity
> probes on individual arrays, which IMO is a very destructive practice.
> If you filter a probe on the basis of expression, it must be filtered on
> all arrays.
>
> Filtering non-expressed probes tends not to be emphasised on this list
> because users of this list are often sophisticated enough to use
> variance stabilizing normalization methods such as rma, vsn, normexp or
> vst. This means that low-expression filtering is done more for
> multiplicity issues than for variance stabilization, and therefore often
> doesn't make a huge difference. When using earlier normalization
> methods such as MAS for Affy or local background correction for
> two-color arrays, expression-filtering is absolutely essential, because
> the normalized expression values are so unstable at low intensity levels.
>
> To James, it is not necessary to retain all the probes on the array
> for eBayes(). The only requirement is that eBayes() sees all the probes
> which are under consideration for differential expression. So filtering
> out consistently non-expressed probes before linear modelling is
> generally a good idea. In fact, filtering often improves the eBayes()
> assumptions. eBayes() assumes that the residual variances are not
> intensity-dependent. However very lowly expressed probes often follow a
> mean-variance relationship which is somewhat different from the other
> probes, even after variance stabilization, in which case filtering will
> improve the constancy of variance assumption. This tends not to be a
> big issue with rma-Affy data, but it is an important issue with
> vst-Illumina data for example.
>
> Best wishes
> Gordon
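
Gordon's rule that a probe filtered on expression must be dropped on all arrays can be sketched as a row-wise filter. This is only an illustration in base R: the matrix, the probe names and the cutoff of 6 are hypothetical, and how to pick a sensible cutoff is exactly the open question raised above.

```r
# Hypothetical log2 expression matrix: 5 probes (rows) x 4 arrays (columns).
expr <- rbind(
  probe1 = c(9.2, 9.5, 10.1, 9.8),
  probe2 = c(5.1, 5.3, 4.9, 5.0),   # low on every array
  probe3 = c(5.2, 8.7, 5.0, 8.9),   # low on some arrays, expressed on others
  probe4 = c(4.8, 5.0, 5.2, 4.7),   # low on every array
  probe5 = c(7.4, 7.1, 7.6, 7.3)
)

cutoff <- 6   # assumed expression cutoff, chosen only for this example

# A probe counts as expressed if it exceeds the cutoff on at least one array.
expressed <- rowSums(expr > cutoff) >= 1

# Drop whole rows, so a filtered probe is removed from every array.
expr_filtered <- expr[expressed, , drop = FALSE]
print(rownames(expr_filtered))
```

Here probe3 is kept even though it is low on two arrays, because it is not consistently non-expressed. Only probe2 and probe4, which are low everywhere, are removed.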
--
**************************************************************
Daniel Brewer, Ph.D.
Institute of Cancer Research
Molecular Carcinogenesis
MUCRC
15 Cotswold Road
Sutton, Surrey SM2 5NG
United Kingdom
Tel: +44 (0) 20 8722 4109
Email: daniel.brewer at icr.ac.uk
**************************************************************