[BioC] edgeR on microRNA data

Thu Oct 13 02:43:55 CEST 2011

Hi Hervé and Dan,
Thanks for your advice! Yes, I downloaded R from http://r.research.att.com/ as well. I originally solved the problem by using the Package Installer instead, which worked fine. For some reason, BiocInstaller works today with the same installation of R that I was using before (I guess they must have changed something in the Matrix...). I will still update my R 2.14 installation, though.

Thanks again,
Helena

> sessionInfo()
R Under development (unstable) (2011-10-01 r57123)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.1.28

loaded via a namespace (and not attached):
[1] tools_2.14.0

________________________________________
From: Hervé Pagès [hpages at fhcrc.org]
Sent: 12 October 2011 22:19
To: Helena Persson
Cc: Bioconductor mailing list
Subject: Re: [BioC] edgeR on microRNA data

Hi Helena,

Did you manage to solve this problem? This works for me with R 2.14
alpha:

   > source("http://bioconductor.org/biocLite.R")
   BiocInstaller version 1.1.28, ?biocLite for help

   > biocLite("edgeR")
   BioC_mirror: 'http://www.bioconductor.org'
   Using R version 2.14, BiocInstaller version 1.1.28.
   Installing package(s) 'edgeR'
   Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/leopard/contrib/2.14
   trying URL 'http://www.bioconductor.org/packages/2.9/bioc/bin/macosx/leopard/contrib/2.14/edgeR_2.3.52.tgz'
   Content type 'application/x-gzip' length 1169631 bytes (1.1 Mb)
   opened URL
   ==================================================
   downloaded 1.1 Mb

   The downloaded packages are in
        /tmp/Rtmp895H8Z/downloaded_packages
   Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/leopard/contrib/2.14

   > sessionInfo()
   R version 2.14.0 alpha (2011-10-11 r57214)
   Platform: i386-apple-darwin9.8.0/i386 (32-bit)

   locale:
   [1] C

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

   other attached packages:
   [1] BiocInstaller_1.1.28

   loaded via a namespace (and not attached):
   [1] tools_2.14.0

How did you install R 2.14 on your Mac? Mine is coming from
http://r.research.att.com/, which is the recommended place for
getting the latest R devel binary for Mac.

Please provide your sessionInfo(). Thanks!

H.

On 11-10-03 12:09 AM, Helena Persson wrote:
> Well, I installed the devel version of R, but if I then do:
>
>> source("http://www.bioconductor.org/biocLite.R")
>> biocLite("edgeR")
>
> I get the error message:
>
> Using R version 2.14.0, biocinstall version 2.8.4.
> Installing Bioconductor version 2.8 packages:
> [1] "edgeR"
> Please wait...
>
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
> Warning message:
> In getDependencies(pkgs, dependencies, available, lib) :
>    package ‘edgeR’ is not available (for R Under development)
>
> ... so I guess I am doing something wrong.
>
> Helena
>
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 3 October 2011 08:53
> To: Helena Persson
> Cc: Bioconductor mailing list
> Subject: Re: SV: edgeR on microRNA data
>
> You need to install the devel version of R, available from CRAN.  Then you
> get the devel version of edgeR and other Bioconductor packages
> automatically.
>
> Gordon
>
>
> On Mon, 3 Oct 2011, Helena Persson wrote:
>
> Dear Gordon,
> Upgrading sounds like a good idea – how do I install the devel version of edgeR?
>
> Best,
> Helena
>
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 3 October 2011 05:33
> To: Helena Persson
> Cc: Bioconductor mailing list
> Subject: Re: edgeR on microRNA data
>
> Dear Helena,
>
> You will find it very helpful to upgrade your version of edgeR to the
> current developmental version (although you will need to be using R devel
> aka R 2.14 to do so).  You will find that exactTest() is now much faster
> and less memory consuming.  The current release version is time consuming
> when the counts are large, mainly because of a change to the way in which
> the rejection region is computed that we implemented two months ago.
>
> Fair comment about adding comments on prop.used.  We had not considered
> that users would generally change this.
>
> If you choose prior.n very small, then edgeR will simply use the genewise
> dispersion estimate that depends on the data from that gene alone.  This
> is not over-fitting in itself.  However, it can lead to an increase in the
> FDR because, when doing significance tests, edgeR does not take into
> account the uncertainty with which the dispersion is estimated.
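
The effect of prior.n described above can be pictured with a toy linear shrinkage. This is only a rough analogue under stated assumptions, not edgeR's actual weighted-likelihood computation: each gene's own estimate is given weight 1, the shared (common) value is given weight prior.n, and the function name shrink_dispersion and all example values are hypothetical.

```r
# Toy analogue of squeezing genewise dispersions toward a common value.
# NOT edgeR's algorithm: edgeR maximises a weighted likelihood, whereas
# this takes a plain weighted average just to show the role of prior.n.
shrink_dispersion <- function(genewise, common, prior.n) {
  (genewise + prior.n * common) / (1 + prior.n)
}

genewise <- c(0.10, 0.25, 2.20)  # hypothetical genewise estimates
common   <- 0.27                 # hypothetical common dispersion

shrunk0  <- shrink_dispersion(genewise, common, prior.n = 0)   # no shrinkage
shrunk2  <- shrink_dispersion(genewise, common, prior.n = 2)
shrunk10 <- shrink_dispersion(genewise, common, prior.n = 10)  # near common

print(rbind(prior.n_0 = shrunk0, prior.n_2 = shrunk2, prior.n_10 = shrunk10))
```

With prior.n = 0 the genewise values come back unchanged; larger values pull every gene, including the high-variance one, toward the common dispersion.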
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> ------------ original message --------------
>
> On Mon, 3 Oct 2011, Helena Persson wrote:
>
> Dear Gordon,
> I guess I should start with some clarifications:
>
>> I am concerned that you have decreased prop.used from its default
>> value of 0.3.  I would tend to increase this rather than decrease it.
>
> For the microRNA data I have few genes but a relatively large expression
> range. My reason for decreasing the prop.used was that I suspected that
> using 30% would bin genes that had very different means of expression. I
> did not give this a lot of thought at the time and have now rerun the
> analysis using 0.3. Maybe it would be good to comment a bit more on this
> parameter in the R help page or the edgeR vignette?
>
>> On the other hand, you have increased prior.n from its default value, which
>> for your data would be a little over 0.5.  Is this simply because it gave
>> better looking results?  Anyway, increasing prior.n does not result in
>> overfitting.  The risk with larger prior.n is simply that it may start to
>> return differentially expressed miRs that are increased or decreased in only
>> a few of the samples, rather than consistently for all samples in a group.
>
> I decided to remove two of the samples in the control group because they
> appeared to be outliers from the rest, so my smallest group is actually 8
> samples. I did not put together the control samples, but judging from the
> clinical data I received, that group is more heterogeneous than the patient
> groups. Choosing 2 for the prior.n was a compromise (I realised I should go
> quite low for my dataset, but using 0 as suggested by someone I talked to
> produced very short lists of genes that did not look any better judging
> from boxplots). Actually, I was wondering if setting the prior very low
> (rather than high) could lead to overfitting of the variance estimate.
>
>> How large are the common and tagwise dispersions for your data?
>
> The common dispersion varies a little depending on how I group the
> samples:
>
> [1] 0.2681829 (three groups, 8 ctrl and 2 x 15 patients)
> [1] 0.2788752 (two groups, 8 ctrl and 32 patients)
>
> The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1,
> trend=TRUE, prop.used=0.3, grid=FALSE)):
>
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> 0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  0.1022  0.1894  0.2534  0.2916  0.3183  2.1890
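
To put these dispersions on a more familiar scale: under the negative binomial model, a count with mean mu and dispersion phi has variance mu + phi * mu^2, so sqrt(phi) behaves like a coefficient of variation between replicates once counts are large. A minimal sketch using the two common dispersion values quoted above (the variable names and the mean count of 1000 are illustrative assumptions, not taken from the data):

```r
# Common dispersions reported above for the two groupings.
common.disp <- c(three.groups = 0.2681829, two.groups = 0.2788752)

# Square root of the dispersion: approximate coefficient of variation
# between replicates when Poisson counting noise is negligible.
cv <- sqrt(common.disp)
print(round(cv, 3))

# Negative binomial variance for a hypothetical mean count of 1000.
mu <- 1000
nb.var <- mu + common.disp * mu^2
print(nb.var)
```

Both groupings give a value near 0.52, i.e. replicate-to-replicate variation of roughly 50% of the mean for well-expressed miRs.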
>
> A strange thing: When I run exactTest for the miRNA data (618 genes x 38
> samples) edgeR becomes extremely memory-consuming, basically using up all
> of the 8 GB RAM on my laptop, and then becomes painfully slow as the system
> starts swapping. When I run exactTest for CAGE data for the same samples
> (15066 genes x 40 samples) it never goes above 3 GB and finishes rapidly.
> I use a grid search for the CAGE data tagwise dispersion estimate and the
> library sizes are smaller (around 7 million counts vs 15 million), but
> otherwise the previous steps are basically the same. Any experience (or
> educated guess) as to what might make the analysis use so much memory?
>
> Thanks again,
>
> Helena
>
>
> On Sat, 1 Oct 2011, Gordon K Smyth wrote:
>
>> Dear Helena,
>>
>> Compared with mRNA-Seq, you have an unusually small number of transcripts but
>> a relatively large number of biological replicates.  This suggests that you
>> should use a relatively small value for prior.n but a relatively large value
>> for prop.used.  I am concerned that you have decreased prop.used from its
>> default value of 0.3.  I would tend to increase this rather than decrease it.
>>
>> On the other hand, you have increased prior.n from its default value, which
>> for your data would be a little over 0.5.  Is this simply because it gave
>> better looking results?  Anyway, increasing prior.n does not result in
>> overfitting.  The risk with larger prior.n is simply that it may start to
>> return differentially expressed miRs that are increased or decreased in only
>> a few of the samples, rather than consistently for all samples in a group.
>>
>> Your experience with prior.n is unintuitive to me.  Generally speaking,
>> choosing prior.n small means that each miR gets to set its own dispersion, so
>> that miRs with large variance will not appear in the topTags list.  When you
>> say "variance outliers", do you mean large or small variance?
>>
>> Since your minimum group sample size is 10, I would have required miRs to
>> satisfy your cpm requirement in >= 10 samples rather than 5.
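
The filtering rule discussed here (keep a miR only if it reaches the cpm threshold in at least as many samples as the smallest group) can be sketched in base R. The count matrix below is simulated purely for illustration and all object names are hypothetical; with real data the same two thresholds would be applied to the actual count table.

```r
set.seed(1)
n.genes <- 600
n.samples <- 40

# Simulated counts with a wide range of expression levels (hypothetical data).
gene.mean <- 10^runif(n.genes, min = -3, max = 4)
counts <- matrix(rnbinom(n.genes * n.samples, mu = gene.mean, size = 2),
                 nrow = n.genes)

# Counts per million, using the column totals as library sizes.
lib.size <- colSums(counts)
cpm <- sweep(counts, 2, lib.size, "/") * 1e6

# Keep rows with cpm >= 0.2 in at least 10 samples (the smallest group size).
min.cpm <- 0.2
min.samples <- 10
keep <- rowSums(cpm >= min.cpm) >= min.samples
filtered <- counts[keep, , drop = FALSE]
print(dim(filtered))
```

Raising min.samples from 5 to the smallest group size makes the filter stricter, so a miR detected in only a handful of samples no longer passes.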
>>
>> Best wishes
>> Gordon
>>
>>> Date: Thu, 29 Sep 2011 05:25:14 +0000
>>> From: Helena Persson <helena.persson at ki.se>
>>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: [BioC] edgeR on microRNA data
>>>
>>> Hi,
>>
>>> I would be grateful for some input on using edgeR for small RNA sequence
>>> data. I have been testing edgeR on a set of miRNA data (3 groups with n=10,
>>> 15 and 15). After removing genes that are not expressed at >= 0.2 cpm in >=
>>> 5 samples I have ~600 rows left. I tried calculating the tagwise dispersion
>>> estimate with:
>>>
>>> cds1 <- estimateTagwiseDisp(cds1, prior.n=2, trend=TRUE, prop.used=0.1,
>>> grid=FALSE)
>>>
>>> Increasing the prior to e.g. 10 gives more differentially expressed genes
>>> that do not look bad. Decreasing the prior to 0 leaves me with extremely
>>> few differentially expressed genes that are mainly variance outliers. I
>>> guess that miRNA data is likely to behave differently from mRNA data since
>>> there are so few genes (but still a very large dynamic range). Is it
>>> possible that I am over-fitting the estimate? Would you recommend changing
>>> any other parameters?
>>>
>>> Best regards,
>>> Helena
>>> _________________________________
>>>
>>> Helena Persson, PhD
>>>
>>> Karolinska Institutet
>>> Dept of Biosciences and Nutrition
>>> Hälsovägen 7-9
>>> SE-141 83 Huddinge
>>> Sweden
>>>
>>> Helena.Persson at ki.se
>>>
>>> tel. +46-(0)8-52481058
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319