[BioC] edgeR i get 377 significant genes where in DESeq i got 0

Thu Mar 15 01:57:49 CET 2012

Dear Simon,

On re-reading the original post, you may well be right that Pap has no 
biological replicates, in which case he or she has ignored warning 
messages from both our packages, and the discussion is not very 
interesting.  There's ambiguity in the post -- it does mention biological 
replicates and, although there are only two files, a file may contain more 
than one library.

I have been disappointed by users ignoring the warning message from edgeR 
about lack of replication, so I recently changed the developmental version 
of edgeR so that it returns NA dispersion when there is no replication, so 
a user is now forced to think more about what they are doing and take one 
of the actions outlined in the edge User's Guide.

You seem to be suggesting that edgeR has inserted a numeric dispersion 
value different from zero in the case of no replication.  If that is what 
you are suggesting, then you are incorrect.

My comment about the conservatism of DESeq stands, as it is our 
experience.  I think that you and I have a disagreement about what 
constitutes "typical data".  In my opinion, some of the datasets being 
publicly presented as examples of RNA-Seq data show obvious signs that 
there are missing explanatory variables such as batch effects.  Such data 
will produce outlier dispersions.  You may wish to design an algorithm for 
these circumstances.  My concern however is with experimental data that my 
biomedical collaborators present to me, and we take pains to avoid 
important uncontrolled factors or to identify and model them.

Best wishes
Gordon

> Date: Wed, 14 Mar 2012 08:41:12 +0100
> From: Simon Anders <anders at embl.de>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] edgeR i get 377 significant genes where in DESeq i
> 	got 0
>
> Hi
>
> On 2012-03-14 00:16, Gordon K Smyth wrote:
>> Probably you haven't made a mistake. In our experience, this is the
>> typical behaviour of the two packages.
>>
>> The DESeq people should be applauded for trying something different, but
>> commonsense would tell you that setting dispersions equal to the
>> "maximum" of two extremes is likely to be conservative, especially when
>> there are so few replicates.
>
> Of course, I need to chime in here.
>
> First of all, the poster did indeed make a mistake, see below.
>
> Second, in our experience, the use of the maximum rule, while admittedly
> looking overly conservative, costs surprisingly little power for typical
> data. We performed simulations that convinced at least us that this
> simple scheme is more robust than the WML scheme in edgeR, which, in our
> hands, failed to control type-I error when presented with simulated data
> with few replicates when the dispersion values were drawn from rather
> wide distributions. Of course, a systematic comparison is still lacking
> and should maybe be done by somebody more unbiased. In my opinion, such
> a comparison should be based on a simulation study that tests how the
> methods deal with simulated data with true dispersion values drawn from
> distributions of different shapees and widths, modeled after real data
> where available.
>
> Now, to Pap's data set:
>
>>> Hi,
>>> Assuming i have 2 files:
>>> 1's have 1,000,000 reads- one condition
>>> 2's have 3,000,000 reads- second condition
>
> Pap has two samples in total, not two replicates per condition, and so
> the whole discussion above is not applicable anyway.
>
> In this case, a proper statistical analysis is not possible. We can try
> to get at least something with workarounds, though.
>
> EdgeR used to switch to Poisson mode if presented with data without
> replication, i.e., it assumed zero biological variation, which, of
> course, leads to a large number of hits, which one cannot expect to be
> reproducible. Given that there are only 377 hits, this seems to have
> changed, and the edgeR authors will be able to comment on that.
>
> DESeq offers the method "blind" to deal with data without replicates,
> where it assumes that most genes are not differentially expressed and
> hence estimates the dispersion from a comparison across the two samples.
> Only those genes that "stick out" by showing much stronger differences
> than most genes will be reported.
>
> This method cannot be combined with the "maximum" rule, because then,
> every gene that is "sticking out" would be compared to itself.
>
> This is why this command here
>
>>> estimateDispersions(cds,method="blind",sharingMode="maximum",fitType="local")
>
> produces a warning informing the user that 'method="blind"' should only
> be used together with 'sharingMode="fit-only"'.
>
> Pap, you may have overlooked this warning. Maybe, I should maybe change
> it to an error.
>
> Please try again with 'sharingMode="fit-only"' and let us know what you get.
>
>   Simon
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}