[BioC] edgeR on microRNA data

Mon Oct 3 10:23:24 CEST 2011

Works fine for me.  My guess is that you can't install any Bioc package, 
not just edgeR.

Gordon

On Mon, 3 Oct 2011, Helena Persson wrote:

> Well, I installed the devel version of R, but if I then do:

> source("http://www.bioconductor.org/biocLite.R")
> biocLite("edgeR")

I get the error message:

Using R version 2.14.0, biocinstall version 2.8.4.
Installing Bioconductor version 2.8 packages:
[1] "edgeR"
Please wait...

Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
   package ‘edgeR’ is not available (for R Under development)

... so I guess I am doing something wrong.

Helena

________________________________________
From: Gordon K Smyth [smyth at wehi.EDU.AU]
Sent: 3 October 2011 08:53
To: Helena Persson
Cc: Bioconductor mailing list
Subject: Re: SV: edgeR on microRNA data

You need to install the devel version of R, available from CRAN.  Then you
get the devel version of edgeR and other Bioconductor packages
automatically.

Gordon

On Mon, 3 Oct 2011, Helena Persson wrote:

Dear Gordon,
Upgrading sounds like a good idea – how do I install the devel version of edgeR?

Best,
Helena

________________________________________
From: Gordon K Smyth [smyth at wehi.EDU.AU]
Sent: 3 October 2011 05:33
To: Helena Persson
Cc: Bioconductor mailing list
Subject: Re: edgeR on microRNA data

Dear Helena,

You will find it very helpful to upgrade your version of edgeR to the
current developmental version (although you will need to be using R devel
aka R 2.14 to do so).  You will find that exactTest() is now much faster
and less memory consuming.  The current release version is time consuming
when the counts are large, mainly because of a change to the way in which
the rejection region is computed that we implemented two months ago.
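
To see why large counts can be costly for an exact test, here is a minimal
base R sketch of a conditional exact test for a single gene. It is not
edgeR's implementation: the function name toy_exact_test is hypothetical,
equal library sizes and a known dispersion are assumed, and the p-value
simply sums the probabilities of all outcomes that are no more likely than
the observed one. The point is that the vector of possible outcomes has
length (total count + 1), so the work grows with the size of the counts.

```r
# Toy conditional exact test for one gene (hypothetical helper, NOT edgeR code).
# s1, s2: summed counts in group 1 and group 2
# n1, n2: number of samples in each group (equal library sizes assumed)
toy_exact_test <- function(s1, s2, n1, n2, dispersion) {
  s  <- s1 + s2
  mu <- s / (n1 + n2)              # common per-sample mean under the null
  x  <- 0:s                        # every possible group-1 total given s
  # A sum of n NB(mu, size = 1/dispersion) counts is NB(n * mu, size = n/dispersion)
  p  <- dnbinom(x,     size = n1 / dispersion, mu = n1 * mu) *
        dnbinom(s - x, size = n2 / dispersion, mu = n2 * mu)
  p  <- p / sum(p)                 # condition on the observed total
  # Sum outcomes no more probable than the observed one (small tolerance for ties).
  # For very large totals these products underflow; real code would use logs.
  sum(p[p <= p[s1 + 1] * (1 + 1e-7)])
}

p_balanced <- toy_exact_test(50, 50, n1 = 8, n2 = 8, dispersion = 0.27)
p_extreme  <- toy_exact_test(95,  5, n1 = 8, n2 = 8, dispersion = 0.27)
```

A gene with a total of 100 counts needs a vector of 101 probabilities; a
highly expressed miR with millions of counts needs millions per test.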

Fair comment about adding comments on prop.used.  We had not considered
that users would generally change this.

If you choose prior.n very small, then edgeR will simply use the genewise
dispersion estimate that depends on the data from that gene alone.  This
is not over-fitting in itself.  However it can lead to an increase in the
FDR because edgeR's significance tests do not take into account the
uncertainty with which the dispersion is estimated.
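
The effect of prior.n can be pictured with a toy calculation. edgeR itself
uses a weighted likelihood rather than the weighted average below, so this
is only a qualitative stand-in. The helper toy_shrink, the example
dispersions, and the weight prior.n / (prior.n + 1) are all assumptions made
for illustration.

```r
# Toy shrinkage of genewise dispersions towards a common value.
# NOT edgeR's algorithm: a plain weighted average used only to show the trend.
toy_shrink <- function(genewise, common, prior.n) {
  w <- prior.n / (prior.n + 1)     # assumed weight on the common estimate
  w * common + (1 - w) * genewise
}

genewise <- c(0.05, 0.25, 1.50)    # made-up genewise dispersion estimates
common   <- 0.27                   # close to the common dispersion in this thread

no_shrink   <- toy_shrink(genewise, common, prior.n = 0)   # genewise values kept
some_shrink <- toy_shrink(genewise, common, prior.n = 2)
much_shrink <- toy_shrink(genewise, common, prior.n = 10)
```

With prior.n = 0 each gene keeps its own estimate; as prior.n grows, every
estimate moves towards the common value.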

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.wehi.edu.au
http://www.statsci.org/smyth

------------ original message --------------

On Mon, 3 Oct 2011, Helena Persson wrote:

> Dear Gordon,
I guess I should start with some clarifications:

>I am concerned that you have decreased prop.used from its default
> value of 0.3.  I would tend to increase this rather than decrease it.

For the microRNA data I have few genes but a relatively large expression
range. My reason for decreasing the prop.used was that I suspected that
using 30% would bin genes that had very different means of expression. I
did not give this a lot of thought at the time and have now rerun the
analysis using 0.3. Maybe it would be good to comment a bit more on this
parameter in the R help page or the edgeR vignette?
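
As a rough illustration of this concern, the sketch below assumes that
prop.used selects, for each gene, that fraction of all genes nearest to it
in abundance (an assumption about the parameter, not something stated in
this thread). With 618 genes that is about 62 neighbours at 0.1 and about
185 at 0.3. The abundances are simulated and window_width is a hypothetical
helper.

```r
set.seed(1)
n.genes <- 618
logA <- runif(n.genes, min = 0, max = 15)   # made-up log2 abundances

# For each gene, take the k genes nearest in abundance and report how wide
# an abundance range that neighbourhood spans.
window_width <- function(logA, prop.used) {
  k <- max(1, round(length(logA) * prop.used))
  sapply(logA, function(a) {
    nearest <- logA[order(abs(logA - a))[seq_len(k)]]
    diff(range(nearest))
  })
}

w10 <- window_width(logA, 0.1)
w30 <- window_width(logA, 0.3)
summary(w10)
summary(w30)
```

The larger setting pools genes over a wider abundance range, which is the
trade-off against the more stable estimate that more neighbours provide.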

> On the other hand, you have increased prior.n from its default value, which
> for your data would be a little over 0.5.  Is this simply because it gave
> better looking results?  Anyway, increasing prior.n does not result in
> overfitting.  The risk with larger prior.n is simply that it may start to
> return differentially expressed miRs that are increased or decreased in only
> a few of the samples, rather than consistently for all samples in a group.

I decided to remove two of the samples in the control group because they
appeared to be outliers from the rest, so my smallest group is actually 8
samples. I did not put together the control group myself, but judging from the
clinical data I received, it is more heterogeneous than the patient groups.
Choosing 2 for the prior.n was a compromise (I realised I should go quite
low for my dataset, but using 0 as suggested by someone I talked to
produced very short lists of genes that did not look any better judging
from boxplots). Actually, I was wondering if setting the prior very low
(rather than high) could lead to overfitting of the variance estimate.

>How large are the common and tagwise dispersions for your data?

The common dispersion varies a little depending on how I group the
samples:

[1] 0.2681829 (three groups, 8 ctrl and 2 x 15 patients)
[1] 0.2788752 (two groups, 8 ctrl and 32 patients)

The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1,
trend=TRUE, prop.used=0.3, grid=FALSE)):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1022  0.1894  0.2534  0.2916  0.3183  2.1890
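
One way to read these numbers: under the negative binomial model the
dispersion is the squared coefficient of variation of the underlying
expression level between replicates, so its square root is easier to
interpret. A minimal base R sketch using the values quoted above (the
variable names are made up for illustration):

```r
# Common dispersions reported above, one per grouping of the samples
common <- c(three.groups = 0.2681829, two.groups = 0.2788752)

# Square root of the dispersion = coefficient of variation between replicates
bcv <- sqrt(common)
round(bcv, 2)

# Same transformation for the median and maximum of the first tagwise summary
tagwise <- c(median = 0.24550, max = 2.23000)
round(sqrt(tagwise), 2)
```

Both common dispersions correspond to a coefficient of variation a little
over 0.5, that is, replicate-to-replicate variation of roughly 50%.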

A strange thing: When I run exactTest for the miRNA data (618 genes x 38
samples) edgeR becomes extremely memory-consuming, basically using up all
of the 8 GB RAM on my laptop, and then becomes painfully slow as the system
starts swapping. When I run exactTest for CAGE data for the same samples
(15066 genes x 40 samples) it never goes above 3 GB and finishes rapidly.
I use a grid search for the CAGE data tagwise dispersion estimate and the
library sizes are smaller (around 7 million counts vs 15 million), but
otherwise the previous steps are basically the same. Any experience (or
qualified guess) of what might make the analysis use so much memory?

Thanks again,

Helena

On Sat, 1 Oct 2011, Gordon K Smyth wrote:

> Dear Helena,
>
> Compared with mRNA-Seq, you have an unusually small number of transcripts but
> a relatively large number of biological replicates.  This suggests that you
> should use a relatively small value for prior.n but a relatively large value
> for prop.used.  I am concerned that you have decreased prop.used from its default
> value of 0.3.  I would tend to increase this rather than decrease it.
>
> On the other hand, you have increased prior.n from its default value, which
> for your data would be a little over 0.5.  Is this simply because it gave
> better looking results?  Anyway, increasing prior.n does not result in
> overfitting.  The risk with larger prior.n is simply that it may start to
> return differentially expressed miRs that are increased or decreased in only
> a few of the samples, rather than consistently for all samples in a group.
>
> Your experience with prior.n is unintuitive to me.  Generally speaking,
> choosing prior.n small means that each miR gets to set its own dispersion, so
> that miRs with large variance will not appear in the topTags list.  When you
> say "variance outliers", do you mean large or small variance?
>
> Since your minimum group sample size is 10, I would have required miRs to
> satisfy your cpm requirement in >= 10 samples rather than 5.
>
> Best wishes
> Gordon
>
>> Date: Thu, 29 Sep 2011 05:25:14 +0000
>> From: Helena Persson <helena.persson at ki.se>
>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: [BioC] edgeR on microRNA data
>>
>> Hi,
>>
>> I would be grateful for some input on using edgeR for small RNA sequence
>> data. I have been testing edgeR on a set of miRNA data (3 groups with n=10,
>> 15 and 15). After removing genes that are not expressed at >= 0.2 cpm in >=
>> 5 samples I have ~600 rows left. I tried calculating the tagwise dispersion
>> estimate with:
>>
>> cds1 <- estimateTagwiseDisp(cds1, prior.n=2, trend=TRUE, prop.used=0.1,
>> grid=FALSE)
>>
>> Increasing the prior to e.g. 10 gives more differentially expressed genes
>> that do not look bad. Decreasing the prior to 0 leaves me with extremely
>> few differentially expressed genes that are mainly variance outliers. I
>> guess that miRNA data is likely to behave differently from mRNA data since
>> there are so few genes (but still a very large dynamic range). Is it
>> possible that I am over-fitting the estimate? Would you recommend changing
>> any other parameters?
>>
>> Best regards,
>> Helena
>> _________________________________
>>
>> Helena Persson, PhD
>>
>> Karolinska Institutet
>> Dept of Biosciences and Nutrition
>> Hälsovägen 7-9
>> SE-141 83 Huddinge
>> Sweden
>>
>> Helena.Persson at ki.se
>>
>> tel. +46-(0)8-52481058
