[BioC] Design matrix and BCV

Fri Apr 26 08:38:21 CEST 2013

Dear Manoj,

First of all, can I please persuade you to install the latest version of 
edgeR?  You need R 3.0.0 and Bioconductor Release 2.12.

> Date: Wed, 24 Apr 2013 13:33:05 -0700
> From: Manoj Hariharan <h_manoj at yahoo.com>
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: [BioC] Design matrix and BCV
>
> Hello,
>
> I am new to RNA-seq analysis. I have worked on a few not-too-complicated 
> projects and have found edgeR to be right for my work. In this project I 
> have RNA-seq data from 18 human tissues (normal, no treatment). All 
> tissues except 5 of 18 have at least 2 replicates. The replicate tissues 
> are obtained from separate individuals (they are of different age and 
> sex). There are a few issues I need to discuss with the experts in the 
> group:
>
> 1. The BCV value is quite high (Disp = 0.36621 , BCV = 0.6052). I think 
> this is partly due to the way we have collected replicates - they are 
> from separate individuals - different age and sex. Is this really bad - 
> I had read in the User Guide that BCV of ~40% is acceptable in tumor 
> samples? Does adjusting the prior.df? help (I've attached the BCV 
> plots)? At a later stage I plan to include age and sex as "factors" and 
> re-do the analysis.

I would view this BCV as unacceptably high in my own research.  Page 69 of 
the edgeR User's Guide shows a BCV plot for unrelated individuals:

http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

and I don't think that the BCV should get much higher than this for a 
designed experiment.  Another concern is that the dispersion trend in your 
data looks a bit strange.

In your case, I'd be looking for outliers or batch effects or other 
problems.  The prior.df does not affect the common dispersion.

> 2. I am interested in the differentially expressed genes - across these 
> 18 tissues. I guess I should be using the approach explained in section 
> 3.2.5 of the User Guide (ANOVA-like test).

Yes.

> Below, is the output. The problem is that the first tissue "AD" is 
> absorbed into the intercept. I have read in other discussion threads 
> that this is normal.

Yes, this is normal.  I don't see why it should cause any problem.

> But I do need the logFC values for the AD tissue also.

The fitted model gives you logFC for AD vs each of the other tissues.

> If I use the "design <- 
> model.matrix(~0+tiss_groups, data=D$samples)", I can get the AD column 
> in the design matrix, but then, I would not be able to get the baseline 
> intercept column, and I get all genes differentially expressed. Is there 
> a work-around? How can I handle this issue?

There is no reason to do this.

> 3. How best can I decide on the prior.df? I read the threads on choosing 
> the value based on the number of libraries and groups. But I am not 
> sure. So I tried with prior.df default (20), 10 and 2 with varying 
> number of DE genes.

There is no need to set the prior.df, because the glmQLFTest() function 
estimates the prior.df for you automatically.  The idea is to use 
estimateGLMTrendedDisp() then call glmQLFTest().

Alternatively and better, please upgrade to the current version of edgeR 
and follow the case study in Section 4.6.

It is not actually correct to input tagwise dispersion estimates to 
glmQLFTest.  There was no check against in this in edgeR version 3.0.X, 
but there is in the current release.

Best wishes
Gordon

> R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
> Copyright (C) 2012 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
> ? Natural language support but running in an English locale
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
>
> Loading required package: DBI
> Loading required package: AnnotationDbi
> Loading required package: BiocGenerics
>
> Attaching package: ?BiocGenerics?
>
> The following object(s) are masked from ?package:stats?:
>
> ??? xtabs
>
> The following object(s) are masked from ?package:base?:
>
> ??? anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
> ??? get, intersect, lapply, Map, mapply, mget, order, paste, pmax,
> ??? pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
> ??? rownames, sapply, setdiff, table, tapply, union, unique
>
> Loading required package: Biobase
> Welcome to Bioconductor
>
> ??? Vignettes contain introductory material; view with
> ??? 'browseVignettes()'. To cite Bioconductor, see
> ??? 'citation("Biobase")', and for packages 'citation("pkgname")'.
>
>
> Loading Tcl/Tk interface ... done
>
> KEGG.db contains mappings based on older data because the original
> ? resource was removed from the the public domain before the most
> ? recent update was produced. This package should now be considered
> ? deprecated and future versions of Bioconductor may not have it
> ? available.? One possible alternative to consider is to look at the
> ? reactome.db package
>
> [Workspace loaded from /users/manoj/.RData]
>
>>
>>
>>
>> setwd('/Users/manoj/Data/SDEC_hg19/AllCountDataStrndd/')
> Warning message:
> package ?AnnotationDbi? was built under R version 2.15.2
>>
>> library(edgeR)
> Loading required package: limma
> Warning messages:
> 1: package ?edgeR? was built under R version 2.15.2
> 2: package ?limma? was built under R version 2.15.2
>>
>>
>> targets <- read.delim("AllCountData_AllTiss_Info" , stringsAsFactors = FALSE , header=TRUE)
>> D <- readDGE(targets)
>> keep <- rowSums(cpm(D)>1) >= 10
>> D <- D[keep,]
>> tiss_groups <- factor(c("AD","AD","AO","AO","BL","EG","EG","FT","FT","FT","GA","GA","GA","LG","LG","LI","LV","LV","OV","PA","PA","PO","PO","PO","RA","RV","RV","SB","SB","SB","SG","SG","SG","SX","SX","SX","TH"))
>> design <- model.matrix(~tiss_groups)
>>
>> design
> ?? (Intercept) tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV

...

> attr(,"assign")
> ?[1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
>
>> D <- calcNormFactors(D)
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>>
>> D <- estimateGLMTrendedDisp(D, design)
> Loading required package: splines
>>
>
>
>
>> D <- estimateGLMTagwiseDisp(D, design)
>> plotBCV(D, main="BCV Plot: default prior df")
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
>> plotBCV(D, main="BCV Plot: default prior df of 10")
>
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
>> plotBCV(D, main="BCV Plot: default prior df of 2")
>
>
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 8308
>>
>
>> glm_lrt <- glmLRT(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(glm_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 11255
>
>
> Using different parameters (prior.df) for estimateGLMTagwiseDisp:
>> D <- calcNormFactors(D)
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>> D <- estimateGLMTrendedDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 8308
>>
>> D <- estimateGLMTagwiseDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 10935
>>
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 12622
>>
>>
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 12033
>>
>
>
>
>
>
> Design matrix without intercept:
>
>> design <- model.matrix(~0+tiss_groups, data=D$samples)
>> design
> ?? tiss_groupsAD tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV
...

> attr(,"assign")
> ?[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
>>
>>
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>> D <- estimateGLMTrendedDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 20364
>>
>
>
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines?? stats???? graphics? grDevices utils???? datasets? methods?? base????
>
> other attached packages:
> [1] edgeR_3.0.7????????? limma_3.14.3???????? AnnotationDbi_1.20.3 Biobase_2.18.0?????? BiocGenerics_0.4.0?? RSQLite_0.11.2?????? DBI_0.2-5??????????
>
> loaded via a namespace (and not attached):
> ?[1] clusterProfiler_1.6.0 colorspace_1.2-0????? dichromat_1.2-4?????? digest_0.6.0????????? DO.db_2.5.0?????????? DOSE_1.4.0??????????
> ?[7] ggplot2_0.9.3???????? GO.db_2.8.0?????????? GOSemSim_1.16.1?????? grid_2.15.1?????????? gtable_0.1.2????????? igraph_0.6-3????????
> [13] IRanges_1.16.4??????? KEGG.db_2.8.0???????? labeling_0.1????????? MASS_7.3-23?????????? munsell_0.4?????????? parallel_2.15.1?????
> [19] plyr_1.8????????????? proto_0.3-10????????? qvalue_1.32.0???????? RColorBrewer_1.0-5??? reshape2_1.2.2??????? scales_0.2.3????????
> [25] stats4_2.15.1???????? stringr_0.6.2???????? tcltk_2.15.1????????? tools_2.15.1????????
>> ?
>
> Thanks,
> Manoj.
>
> ------------------------------
>
> Manoj Hariharan, Ph.D.
> Staff Researcher
> The Salk Institute for Biological Studies
> La Jolla, CA 92037
> Office: 858.453.4100 x2143
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: BCVPlot_dfDefault.png
> Type: image/png
> Size: 91573 bytes
> Desc: not available
> URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment-0003.png>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: BCVPlot_df10.png
> Type: image/png
> Size: 92661 bytes
> Desc: not available
> URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment-0004.png>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: BCVPlot_df2.png
> Type: image/png
> Size: 95313 bytes
> Desc: not available
> URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment-0005.png>
>
> ------------------------------
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}