[BioC] Design matrix and BCV
Manoj Hariharan
h_manoj at yahoo.com
Wed Apr 24 22:33:05 CEST 2013
Hello,
I am new to RNA-seq analysis. I have worked on a few not-too-complicated projects and have found edgeR to be right for my work. In this project I have RNA-seq data from 18 human tissues (normal, no treatment). All tissues except 5 of 18 have at least 2 replicates. The replicate tissues are obtained from separate individuals (they are of different age and sex). There are a few issues I need to discuss with the experts in the group:
1. The BCV value is quite high (Disp = 0.36621 , BCV = 0.6052). I think this is partly due to the way we have collected replicates - they are from separate individuals - different age and sex. Is this really bad - I had read in the User Guide that BCV of ~40% is acceptable in tumor samples? Does adjusting the prior.df help (I've attached the BCV plots)? At a later stage I plan to include age and sex as "factors" and re-do the analysis.
2. I am interested in the differentially expressed genes - across these 18 tissues. I guess I should be using the approach explained in section 3.2.5 of the User Guide (ANOVA-like test). Below, is the output. The problem is that the first tissue "AD" is absorbed into the intercept. I have read in other discussion threads that this is normal. But I do need the logFC values for the AD tissue also. If I use the "design <- model.matrix(~0+tiss_groups, data=D$samples)", I can get the AD column in the design matrix, but then, I would not be able to get the baseline intercept column, and I get all genes differentially expressed. Is there a work-around? How can I handle this issue?
3. How best can I decide on the prior.df? I read the threads on choosing the value based on the number of libraries and groups. But I am not sure. So I tried with prior.df default (20), 10 and 2 with varying number of DE genes.
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Loading required package: DBI
Loading required package: AnnotationDbi
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following object(s) are masked from ‘package:stats’:
xtabs
The following object(s) are masked from ‘package:base’:
anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
get, intersect, lapply, Map, mapply, mget, order, paste, pmax,
pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
rownames, sapply, setdiff, table, tapply, union, unique
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading Tcl/Tk interface ... done
KEGG.db contains mappings based on older data because the original
resource was removed from the the public domain before the most
recent update was produced. This package should now be considered
deprecated and future versions of Bioconductor may not have it
available. One possible alternative to consider is to look at the
reactome.db package
[Workspace loaded from /users/manoj/.RData]
>
>
>
> setwd('/Users/manoj/Data/SDEC_hg19/AllCountDataStrndd/')
Warning message:
package ‘AnnotationDbi’ was built under R version 2.15.2
>
> library(edgeR)
Loading required package: limma
Warning messages:
1: package ‘edgeR’ was built under R version 2.15.2
2: package ‘limma’ was built under R version 2.15.2
>
>
> targets <- read.delim("AllCountData_AllTiss_Info" , stringsAsFactors = FALSE , header=TRUE)
> D <- readDGE(targets)
> keep <- rowSums(cpm(D)>1) >= 10
> D <- D[keep,]
> tiss_groups <- factor(c("AD","AD","AO","AO","BL","EG","EG","FT","FT","FT","GA","GA","GA","LG","LG","LI","LV","LV","OV","PA","PA","PO","PO","PO","RA","RV","RV","SB","SB","SB","SG","SG","SG","SX","SX","SX","TH"))
> design <- model.matrix(~tiss_groups)
>
> design
(Intercept) tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV
1 1 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 0 0
5 1 0 1 0 0 0 0 0 0 0
6 1 0 0 1 0 0 0 0 0 0
7 1 0 0 1 0 0 0 0 0 0
8 1 0 0 0 1 0 0 0 0 0
9 1 0 0 0 1 0 0 0 0 0
10 1 0 0 0 1 0 0 0 0 0
11 1 0 0 0 0 1 0 0 0 0
12 1 0 0 0 0 1 0 0 0 0
13 1 0 0 0 0 1 0 0 0 0
14 1 0 0 0 0 0 1 0 0 0
15 1 0 0 0 0 0 1 0 0 0
16 1 0 0 0 0 0 0 1 0 0
17 1 0 0 0 0 0 0 0 1 0
18 1 0 0 0 0 0 0 0 1 0
19 1 0 0 0 0 0 0 0 0 1
20 1 0 0 0 0 0 0 0 0 0
21 1 0 0 0 0 0 0 0 0 0
22 1 0 0 0 0 0 0 0 0 0
23 1 0 0 0 0 0 0 0 0 0
24 1 0 0 0 0 0 0 0 0 0
25 1 0 0 0 0 0 0 0 0 0
26 1 0 0 0 0 0 0 0 0 0
27 1 0 0 0 0 0 0 0 0 0
28 1 0 0 0 0 0 0 0 0 0
29 1 0 0 0 0 0 0 0 0 0
30 1 0 0 0 0 0 0 0 0 0
31 1 0 0 0 0 0 0 0 0 0
32 1 0 0 0 0 0 0 0 0 0
33 1 0 0 0 0 0 0 0 0 0
34 1 0 0 0 0 0 0 0 0 0
35 1 0 0 0 0 0 0 0 0 0
36 1 0 0 0 0 0 0 0 0 0
37 1 0 0 0 0 0 0 0 0 0
tiss_groupsPA tiss_groupsPO tiss_groupsRA tiss_groupsRV tiss_groupsSB tiss_groupsSG tiss_groupsSX tiss_groupsTH
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0
20 1 0 0 0 0 0 0 0
21 1 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0 0
23 0 1 0 0 0 0 0 0
24 0 1 0 0 0 0 0 0
25 0 0 1 0 0 0 0 0
26 0 0 0 1 0 0 0 0
27 0 0 0 1 0 0 0 0
28 0 0 0 0 1 0 0 0
29 0 0 0 0 1 0 0 0
30 0 0 0 0 1 0 0 0
31 0 0 0 0 0 1 0 0
32 0 0 0 0 0 1 0 0
33 0 0 0 0 0 1 0 0
34 0 0 0 0 0 0 1 0
35 0 0 0 0 0 0 1 0
36 0 0 0 0 0 0 1 0
37 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$tiss_groups
[1] "contr.treatment"
> D <- calcNormFactors(D)
> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
Disp = 0.36621 , BCV = 0.6052
>
> D <- estimateGLMTrendedDisp(D, design)
Loading required package: splines
>
> D <- estimateGLMTagwiseDisp(D, design)
> plotBCV(D, main="BCV Plot: default prior df")
> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
> plotBCV(D, main="BCV Plot: default prior df of 10")
> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
> plotBCV(D, main="BCV Plot: default prior df of 2")
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 8308
>
> glm_lrt <- glmLRT(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(glm_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 11255
Using different parameters (prior.df) for estimateGLMTagwiseDisp:
> D <- calcNormFactors(D)
> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
Disp = 0.36621 , BCV = 0.6052
> D <- estimateGLMTrendedDisp(D, design)
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 8308
>
> D <- estimateGLMTagwiseDisp(D, design)
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 10935
>
> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 12622
>
>
> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 12033
>
Design matrix without intercept:
> design <- model.matrix(~0+tiss_groups, data=D$samples)
> design
tiss_groupsAD tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV
1 1 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0
5 0 0 1 0 0 0 0 0 0 0
6 0 0 0 1 0 0 0 0 0 0
7 0 0 0 1 0 0 0 0 0 0
8 0 0 0 0 1 0 0 0 0 0
9 0 0 0 0 1 0 0 0 0 0
10 0 0 0 0 1 0 0 0 0 0
11 0 0 0 0 0 1 0 0 0 0
12 0 0 0 0 0 1 0 0 0 0
13 0 0 0 0 0 1 0 0 0 0
14 0 0 0 0 0 0 1 0 0 0
15 0 0 0 0 0 0 1 0 0 0
16 0 0 0 0 0 0 0 1 0 0
17 0 0 0 0 0 0 0 0 1 0
18 0 0 0 0 0 0 0 0 1 0
19 0 0 0 0 0 0 0 0 0 1
20 0 0 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0 0
30 0 0 0 0 0 0 0 0 0 0
31 0 0 0 0 0 0 0 0 0 0
32 0 0 0 0 0 0 0 0 0 0
33 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0 0
35 0 0 0 0 0 0 0 0 0 0
36 0 0 0 0 0 0 0 0 0 0
37 0 0 0 0 0 0 0 0 0 0
tiss_groupsPA tiss_groupsPO tiss_groupsRA tiss_groupsRV tiss_groupsSB tiss_groupsSG tiss_groupsSX tiss_groupsTH
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0
20 1 0 0 0 0 0 0 0
21 1 0 0 0 0 0 0 0
22 0 1 0 0 0 0 0 0
23 0 1 0 0 0 0 0 0
24 0 1 0 0 0 0 0 0
25 0 0 1 0 0 0 0 0
26 0 0 0 1 0 0 0 0
27 0 0 0 1 0 0 0 0
28 0 0 0 0 1 0 0 0
29 0 0 0 0 1 0 0 0
30 0 0 0 0 1 0 0 0
31 0 0 0 0 0 1 0 0
32 0 0 0 0 0 1 0 0
33 0 0 0 0 0 1 0 0
34 0 0 0 0 0 0 1 0
35 0 0 0 0 0 0 1 0
36 0 0 0 0 0 0 1 0
37 0 0 0 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$tiss_groups
[1] "contr.treatment"
>
>
> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
Disp = 0.36621 , BCV = 0.6052
> D <- estimateGLMTrendedDisp(D, design)
> fit <- glmFit(D, design)
> QLF_lrt <- glmQLFTest(fit,coef=2:18)
> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> sum(FDR_Stsfd < 0.05)
[1] 20364
>
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.0.7 limma_3.14.3 AnnotationDbi_1.20.3 Biobase_2.18.0 BiocGenerics_0.4.0 RSQLite_0.11.2 DBI_0.2-5
loaded via a namespace (and not attached):
[1] clusterProfiler_1.6.0 colorspace_1.2-0 dichromat_1.2-4 digest_0.6.0 DO.db_2.5.0 DOSE_1.4.0
[7] ggplot2_0.9.3 GO.db_2.8.0 GOSemSim_1.16.1 grid_2.15.1 gtable_0.1.2 igraph_0.6-3
[13] IRanges_1.16.4 KEGG.db_2.8.0 labeling_0.1 MASS_7.3-23 munsell_0.4 parallel_2.15.1
[19] plyr_1.8 proto_0.3-10 qvalue_1.32.0 RColorBrewer_1.0-5 reshape2_1.2.2 scales_0.2.3
[25] stats4_2.15.1 stringr_0.6.2 tcltk_2.15.1 tools_2.15.1
>
Thanks,
Manoj.
------------------------------
Manoj Hariharan, Ph.D.
Staff Researcher
The Salk Institute for Biological Studies
La Jolla, CA 92037
Office: 858.453.4100 x2143
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BCVPlot_dfDefault.png
Type: image/png
Size: 91573 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BCVPlot_df10.png
Type: image/png
Size: 92661 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BCVPlot_df2.png
Type: image/png
Size: 95313 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20130424/e3360da1/attachment-0002.png>
More information about the Bioconductor
mailing list