[BioC] limma

Naomi Altman naomi at stat.psu.edu
Sun Apr 10 21:59:38 CEST 2011


Dear Wolfgang and Gordon,
Thank you both for having this discussion publicly.  I have found it 
very informative about an aspect of the analysis that I had not 
previously thought much about.

--Naomi


At 10:33 AM 4/10/2011, Wolfgang Huber wrote:
>Hi Gordon
>
>I think we agree; my only reason for making this post is to point
>out that while the choice of '16' might be a good compromise overall,
>better choices can exist depending on the dataset. E.g. if there are
>many replicates, then these can compensate for the imprecision from
>using a smaller offset, and the bias can be reduced.
>
>Variance stabilising transformations are of course closely related to
>the offsetting that you propose. The "glog" in vsn is
>
>   glog(x) = log(x + sqrt(x^2 + c))
>
>which is in practice similar to the shifted log, i.e. log(x + a). (It
>even avoids problems that the shifted log has when x approaches -a. 
>See also e.g. PMID 12761059.)
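>
>For a quick numerical illustration (the values of 'c' and 'a' below are
>arbitrary, not matched to any real dataset):
>
>   glog <- function(x, c) log(x + sqrt(x^2 + c))
>   slog <- function(x, a) log(x + a)
>   ## log-ratios from the two transformations are very similar ...
>   glog(15000, 1000) - glog(10000, 1000)   # ~ log(1.5)
>   slog(15000, 16)   - slog(10000, 16)     # ~ log(1.5)
>   ## ... but the shifted log breaks down once x <= -a,
>   ## while the glog stays finite
>   slog(-20, 16)     # NaN, with a warning
>   glog(-20, 1000)   # still defined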
>
>Such transformations provide a good way of looking at microarray 
>data, and indeed at high-throughput sequencing data, for many though
>not all uses.
>
>As you say in your NAR article, there is a continuum of choices 
>along the precision - bias tradeoff axis, which can be parameterized 
>by 'a' or 'c'.
>
>Btw, a subtlety of vst that many users seem not to be aware of is
>that it uses the between-beads variance (for the same gene, in the
>same sample) rather than the more relevant (and larger)
>between-replicates variance for finding its 'c'.
>
>         Best wishes
>         Wolfgang
>
>
>PS A pet peeve of mine: how is it that people are OK with stating
>numbers such as '16' without reference to units? Ideally 'official'
>SI units, but even an ad hoc definition would do, based on a common
>agreement on reagent quantities, scanner settings and software
>settings, to make the oxymoron 'arbitrary fluorescence units' just a
>little bit less so.
>
>
>
>On Apr/9/11 3:35 AM, Gordon K Smyth wrote:
>>Hi Wolfgang,
>>
>>>Date: Thu, 07 Apr 2011 12:07:05 +0200
>>>From: Wolfgang Huber <whuber at embl.de>
>>>To: bioconductor at r-project.org
>>>Subject: Re: [BioC] limma
>>>
>>>Hi Gordon
>>>
>>>>.... "limma ensures that all probes are
>>>>assigned at least a minimum non-zero expression level on all arrays, in
>>>>order to minimize the variability of log-intensities for lowly expressed
>>>>probes. Probes that are expressed in one condition but not the other will be
>>>>assigned a large fold change for which the denominator is the minimum
>>>>expression level. This approach has the advantage that genes can be
>>>>ranked by fold change in a meaningful way, because genes with larger
>>>>expression changes will always be assigned a larger fold
>>>>change."
>>
>>This comment was in the context of genes expressed in one condition and
>>not the other (and was part of a longer post). In this context the
>>estimated fold change is essentially monotonic in the higher expression
>>level, provided the zero value is offset away from zero (with offset a,
>>the estimated ratio is roughly (x + a)/a, which increases with the
>>expression level x), so larger expression changes do translate into
>>larger fold changes. In other
>>contexts, it is a question of importance ranking, which I guess is the
>>issue that you're raising below.
>>
>>>I am not sure I follow:
>>>
>>>(i) (20 + 16) / (10 + 16) < (15000 + 16) / (10000 + 16)
>>>
>>>but
>>>
>>>(ii) 20 / 10 > 15000 / 10000
>>>
>>>You assume that measurements of 20 and 10 are less reliable (or perhaps
>>>biologically less important?) than measurements of 15000 and 10000, and
>>>thus that ranking (i) should be used
>>
>>Generally I rank probes by a combination of statistical significance and
>>fold change, not by fold change alone. However, the discussion is in the
>>context of Illumina expression data, and Illumina intensities of 10 and
>>20 are almost certain to be from non-expressed probes, hence contain no
>>biological signal. So, yes, I would generally view measurements of 15000
>>and 10000 as both statistically more precise and biologically more
>>important than 20 and 10, and I would therefore want to rank as (i)
>>rather than (ii). I'm pretty sure that you would too.
>>
>>>- but that depends on an error model (which you encode in the
>>>pseudocount parameter '16')
>>
>>I put more faith in experimental evidence than I do in statistical error
>>models. The fact that offsetting the intensities away from zero reduces
>>the FDR is an observation from considerable testing with calibration
>>data sets. The evidence doesn't rely on an error model. Much of the
>>evidence is laid out in the paper that I cited in my earlier email:
>>
>>Shi, W, Oshlack, A, and Smyth, GK (2010). Optimizing the noise versus
>>bias trade-off for Illumina Whole Genome Expression BeadChips. Nucleic
>>Acids Research 38, e204.
>>
>>>and a subjective trade-off between precision and effect size.
>>
>>The fact that the value is chosen from experience with data, rather than
>>as a parameter estimated from a mathematical model, doesn't make it
>>subjective. As I've said, I take mathematical models with a grain of salt.
>>
>>It's easy to verify experimentally that well-known preprocessing
>>algorithms, like RMA for Affy data or vst for Illumina data (you're an
>>author!), also have the effect of offsetting intensities away from zero
>>before logging them. I think it is a useful insight to observe that this
>>offsetting is a good part of why those algorithms have good statistical
>>properties. vst has an effective offset of around 200 (Shi et al, Tables
>>2 and 3). As far as I know, the offset was not designed into either of
>>the above algorithms. I suspect it was rather a fortuitous but
>>unexpected outcome. The offset that vst seems to have isn't a natural
>>outcome of the variance stabilization model, because it generally turns
>>out to be much larger than the offset that would best stabilize the
>>variance. Anyway, we find that by using more modest offsets in the range
>>16-50 for Illumina data, we can achieve an FDR as good as vst but with
>>less bias, i.e. much less contraction of the fold changes. Again, this is a
>>conclusion from testing rather than from modelling.
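>>
>>To see the bias side of the trade-off numerically (made-up intensities,
>>purely for illustration):
>>
>>   hi <- 2000; lo <- 500   # a true 4-fold change, i.e. log2FC = 2
>>   for (a in c(0, 16, 50, 200))
>>       cat("offset", a, ": log2FC =",
>>           round(log2((hi + a) / (lo + a)), 2), "\n")
>>   ## offset 0 : log2FC = 2
>>   ## offset 16 : log2FC = 1.97
>>   ## offset 50 : log2FC = 1.9
>>   ## offset 200 : log2FC = 1.65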
>>
>>I prefer to make the offset explicit, clearly visible to users, rather
>>than leaving it implicit and hidden. This approach (neqc etc.) isn't
>>the only good way to address noise, bias and variance stabilization
>>issues, but it's the one that seems to work best for me at the moment.
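>>
>>For anyone wanting to try this, a minimal limma call might look like the
>>following (assuming 'x' is an EListRaw object read in with read.ilmn(),
>>including the negative control probes):
>>
>>   library(limma)
>>   y  <- neqc(x)              # normexp background correction using the
>>                              # negative controls, quantile normalisation,
>>                              # then log2 after adding the default offset of 16
>>   y2 <- neqc(x, offset = 50) # the same, with a larger offset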
>>
>>Cheers
>>Gordon
>>
>>>I agree with you that the approach is useful, and also that it is good
>>>to provide a very simple recipe for people that either cannot deal with
>>>or do not care about the quantitative details. Still, this post is for
>>>the people that do :)
>>>
>>>Cheers
>>>Wolfgang
>>