[BioC] edgeR on microRNA data

Thu Oct 13 01:06:22 CEST 2011

On Mon, Oct 3, 2011 at 12:09 AM, Helena Persson <helena.persson at ki.se> wrote:
> Well, I installed the devel version of R, but if I then do:
>
>> source("http://www.bioconductor.org/biocLite.R")
>> biocLite("edgeR")
>
> I get the error message:
>
> Using R version 2.14.0, biocinstall version 2.8.4.
> Installing Bioconductor version 2.8 packages:
> [1] "edgeR"
> Please wait...
>
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
> Warning message:
> In getDependencies(pkgs, dependencies, available, lib) :
>  package ‘edgeR’ is not available (for R Under development)
>
> ... so I guess I am doing something wrong.

My guess is that your build of R-2.14 is a little old. With an R-2.14
build newer than SVN revision 55733, you would be using the new
installation mechanism (the BiocInstaller package), and you would see
the word BiocInstaller in your output instead of "biocinstall".

Try updating to the latest R-devel, as Hervé suggests, from
http://r.research.att.com.

If you still have trouble, do send us the output of sessionInfo() as
suggested earlier.
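
A quick way to check which side of that revision an installation falls
on, using only base R. This is a minimal sketch: the threshold is the
revision number mentioned above, and the variable names are just
illustrative.

```r
# Revision quoted in this message; the variable name is illustrative.
threshold <- 55733L

# R records the SVN revision it was built from in the R.version list.
svn_rev <- suppressWarnings(as.integer(R.version[["svn rev"]]))

cat(R.version.string, "\n")
cat("SVN revision:", svn_rev, "\n")

if (!is.na(svn_rev) && svn_rev > threshold) {
  cat("This build is newer than r", threshold, ".\n", sep = "")
} else {
  cat("This build is not newer than r", threshold,
      "; consider updating R-devel.\n", sep = "")
}

# Full session details, suitable for pasting into a reply to the list.
info <- sessionInfo()
print(info)
```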

Dan

>
> Helena
>
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 3 October 2011 08:53
> To: Helena Persson
> Cc: Bioconductor mailing list
> Subject: Re: SV: edgeR on microRNA data
>
> You need to install the devel version of R, available from CRAN.  Then you
> get the devel version of edgeR and other Bioconductor packages
> automatically.
>
> Gordon
>
>
> On Mon, 3 Oct 2011, Helena Persson wrote:
>
> Dear Gordon,
> Upgrading sounds like a good idea – how do I install the devel version of edgeR?
>
> Best,
> Helena
>
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 3 October 2011 05:33
> To: Helena Persson
> Cc: Bioconductor mailing list
> Subject: Re: edgeR on microRNA data
>
> Dear Helena,
>
> You will find it very helpful to upgrade your version of edgeR to the
> current developmental version (although you will need to be using R devel
> aka R 2.14 to do so).  You will find that exactTest() is now much faster
> and less memory consuming.  The current release version is time consuming
> when the counts are large, mainly because of a change to the way in which
> the rejection region is computed that we implemented two months ago.
>
> Fair comment about adding comments on prop.used.  We had not considered
> that users would generally change this.
>
> If you choose prior.n very small, then edgeR will simply use the genewise
> dispersion estimate that depends on the data from that gene alone.  This
> is not over-fitting in itself.  However it can lead to an increase in the
> FDR because edgeR's significance tests do not take into account the
> uncertainty with which the dispersion is estimated.
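
To illustrate the last point, here is a small base-R simulation. It is
not edgeR's own estimator: it draws negative binomial counts with one
known dispersion for 8 samples per gene and estimates the dispersion
gene by gene with a simple method of moments. The true dispersion of
0.27, the mean count of 100 and the moment estimator are all assumptions
chosen for illustration.

```r
set.seed(1)

n_genes   <- 600    # roughly the number of miRs discussed in this thread
n_samples <- 8      # smallest group size mentioned in this thread
true_disp <- 0.27   # assumed; close to the common dispersion reported
mu        <- 100    # assumed mean count

# Negative binomial counts with variance mu + true_disp * mu^2
counts <- matrix(rnbinom(n_genes * n_samples, size = 1 / true_disp, mu = mu),
                 nrow = n_genes)

# Method-of-moments dispersion per gene: (var - mean) / mean^2, floored at 0
m <- rowMeans(counts)
v <- apply(counts, 1, var)
genewise_disp <- pmax((v - m) / m^2, 0)

# Spread of the per-gene estimates around the single true value
print(summary(genewise_disp))
```

Every simulated gene has the same true dispersion, so any spread in the
estimates is pure estimation noise. A test that treats each estimate as
exact will be too confident for the genes whose dispersion happens to be
underestimated.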
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> ------------ original message --------------
>
> On Mon, 3 Oct 2011, Helena Persson wrote:
>
> Dear Gordon,
> I guess I should start with some clarifications:
>
>> I am concerned that you have decreased prop.used from its default
>> value of 0.3.  I would tend to increase this rather than decrease it.
>
> For the microRNA data I have few genes but a relatively large expression
> range. My reason for decreasing the prop.used was that I suspected that
> using 30% would bin genes that had very different means of expression. I
> did not give this a lot of thought at the time and have now rerun the
> analysis using 0.3. Maybe it would be good to comment a bit more on this
> parameter in the R help page or the edgeR vignette?
>
>> On the other hand, you have increased prior.n from its default value, which
>> for your data would be a little over 0.5.  Is this simply because it gave
>> better looking results?  Anyway, increasing prior.n does not result in
>> overfitting.  The risk with larger prior.n is simply that it may start to
>> return differentially expressed miRs that are increased or decreased in only
>> a few of the samples, rather than consistently for all samples in a group.
>
> I decided to remove two of the samples in the control group because they
> appeared to be outliers from the rest, so my smallest group is actually 8
> samples. I did not put together the control group myself, but judging from the
> clinical data I received, it is more heterogeneous than the patient groups.
> Choosing 2 for the prior.n was a compromise (I realised I should go quite
> low for my dataset, but using 0 as suggested by someone I talked to
> produced very short lists of genes that did not look any better judging
> from boxplots). Actually, I was wondering if setting the prior very low
> (rather than high) could lead to overfitting of the variance estimate.
>
>> How large are the common and tagwise dispersions for your data?
>
> The common dispersion varies a little depending on how I group the
> samples:
>
> [1] 0.2681829 (three groups, 8 ctrl and 2 x 15 patients)
> [1] 0.2788752 (two groups, 8 ctrl and 32 patients)
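
For scale, the square root of a dispersion is what edgeR's documentation
calls the biological coefficient of variation (BCV). A short calculation
on the two values reported above; the variable names are illustrative.

```r
# Common dispersions reported above for the two groupings
common_disp <- c(three_groups = 0.2681829, two_groups = 0.2788752)

# BCV is the square root of the dispersion
bcv <- sqrt(common_disp)
print(round(bcv, 2))
```

Both values come out a little above 0.5, i.e. roughly 50% variation
between biological replicates.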
>
> The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1,
> trend=TRUE, prop.used=0.3, grid=FALSE)):
>
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> 0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>   0.1022  0.1894  0.2534  0.2916  0.3183  2.1890
>
> A strange thing: when I run exactTest for the miRNA data (618 genes x 38
> samples), edgeR becomes extremely memory-consuming, using up basically all
> of the 8 GB of RAM on my laptop, and then becomes painfully slow as the
> system starts swapping. When I run exactTest for CAGE data for the same samples
> (15066 genes x 40 samples) it never goes above 3 GB and finishes rapidly.
> I use a grid search for the CAGE data tagwise dispersion estimate and the
> library sizes are smaller (around 7 million counts vs 15 million), but
> otherwise the previous steps are basically the same. Any experience (or
> qualified guess) of what might make the analysis use so much memory?
>
> Thanks again,
>
> Helena
>
>
> On Sat, 1 Oct 2011, Gordon K Smyth wrote:
>
>> Dear Helena,
>>
>> Compared with mRNA-Seq, you have an unusually small number of transcripts but
>> a relatively large number of biological replicates.  This suggests that you
>> should use a relatively small value for prior.n but a relatively large value
>> for prop.used.  I am concerned that you have decreased prop.used from its default
>> value of 0.3.  I would tend to increase this rather than decrease it.
>>
>> On the other hand, you have increased prior.n from its default value, which
>> for your data would be a little over 0.5.  Is this simply because it gave
>> better looking results?  Anyway, increasing prior.n does not result in
>> overfitting.  The risk with larger prior.n is simply that it may start to
>> return differentially expressed miRs that are increased or decreased in only
>> a few of the samples, rather than consistently for all samples in a group.
>>
>> Your experience with prior.n is unintuitive to me.  Generally speaking,
>> choosing prior.n small means that each miR gets to set its own dispersion, so
>> that miRs with large variance will not appear in the topTags list.  When you
>> say "variance outliers", do you mean large or small variance?
>>
>> Since your minimum group sample size is 10, I would have required miRs to
>> satisfy your cpm requirement in >= 10 samples rather than 5.
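
A sketch of that filter in base R, run on a simulated count matrix. The
matrix, the per-feature means and the variable names are made up for
illustration; only the 0.2 cpm and 10-sample thresholds come from the
thread. edgeR's cpm() function could replace the manual calculation.

```r
set.seed(42)

n_features <- 1000
n_samples  <- 40

# Simulated per-feature means spanning a wide range (illustrative only)
mu_g <- 10^runif(n_features, min = -2, max = 3)

# Matrix is filled column by column, so row g gets mean mu_g[g]
counts <- matrix(rnbinom(n_features * n_samples, size = 2,
                         mu = rep(mu_g, times = n_samples)),
                 nrow = n_features)
rownames(counts) <- paste0("miR", seq_len(n_features))

# Counts per million, using column totals as library sizes
lib_size <- colSums(counts)
cpm_mat  <- t(t(counts) / lib_size) * 1e6

# Keep features with at least 0.2 cpm in at least 10 samples
min_cpm     <- 0.2
min_samples <- 10
keep <- rowSums(cpm_mat >= min_cpm) >= min_samples

filtered <- counts[keep, , drop = FALSE]
cat(sum(keep), "of", nrow(counts), "features kept\n")
```

Tying min_samples to the smallest group size means a feature expressed
in only one group can still pass the filter.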
>>
>> Best wishes
>> Gordon
>>
>>> Date: Thu, 29 Sep 2011 05:25:14 +0000
>>> From: Helena Persson <helena.persson at ki.se>
>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: [BioC] edgeR on microRNA data
>>>
>>> Hi,
>>
>>> I would be grateful for some input on using edgeR for small RNA sequence
>>> data. I have been testing edgeR on a set of miRNA data (3 groups with n=10,
>>> 15 and 15). After removing genes that are not expressed at >= 0.2 cpm in >=
>>> 5 samples I have ~600 rows left. I tried calculating the tagwise dispersion
>>> estimate with:
>>>
>>> cds1 <- estimateTagwiseDisp(cds1, prior.n=2, trend=TRUE, prop.used=0.1,
>>> grid=FALSE)
>>>
>>> Increasing the prior to e.g. 10 gives more differentially expressed genes
>>> that do not look bad. Decreasing the prior to 0 leaves me with extremely
>>> few differentially expressed genes that are mainly variance outliers. I
>>> guess that miRNA data is likely to behave differently from mRNA data since
>>> there are so few genes (but still a very large dynamic range). Is it
>>> possible that I am over-fitting the estimate? Would you recommend changing
>>> any other parameters?
>>>
>>> Best regards,
>>> Helena
>>> _________________________________
>>>
>>> Helena Persson, PhD
>>>
>>> Karolinska Institutet
>>> Dept of Biosciences and Nutrition
>>> Hälsovägen 7-9
>>> SE-141 83 Huddinge
>>> Sweden
>>>
>>> Helena.Persson at ki.se
>>>
>>> tel. +46-(0)8-52481058
>