[BioC] edgeR for RNA-seq: nested paired analysis with uneven groups
Laurence [guest]
guest at bioconductor.org
Mon May 26 17:29:30 CEST 2014
Hello,
I am new to the field and have a question regarding analysis of RNA-seq data with the edgeR package. I am looking at the difference in gene expression between two groups of patients (control and disease) after stimuli. I need to perform a paired analysis for each participant within each group, and then compare groups (different individuals between groups). My problem is that I have a different number of participants between in each group.
Hereâs a summary of my experimental design:
1) Paired analysis for control group (n=7)
control patient 1: measurement 1 -> disease stimuli -> measurement 2
control patient 2: measurement 1 -> disease stimuli -> measurement 2
â¦. Control group expression (n=7)
2) Paired analysis for disease group (n=6)
diseased patient1: measurement 1 -> disease stimuli -> measurement 2
diseased patient2: measurement 1 -> disease stimuli -> measurement 2
⦠Disease group expression (n=6)
3) Differential expression Diseases vs healthy groups
So I am basically looking for genes that demonstrate the highest variation after stimuli between the two groups.
Based on the section 3.5 of the edgeR users guide, I first came up with the following nested paired analysis:
--------------------------------------------------------------------------------------------------------
#Individual patients
> patient <- factor(c('patient1', 'patient1', 'patient2', 'patient2', 'patient3', 'patient3', 'patient4', 'patient4', 'patient5', 'patient5','patient6', 'patient6', 'patient7', 'patient7', 'patient8', 'patient8', 'patient9', 'patient9', 'patient10', 'patient10', 'patient11', 'patient11', 'patient12', 'patient12', 'patient13', 'patient13' ))
#I re-numbered each patient
> n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,1, 1,2, 2, 3, 3, 4, 4, 5, 5, 6, 6), levels=c(1,2,3,4,5,6,7))
#I assigned them to their respective group
> group <- factor(c('control', 'control','control', 'control','control','control','control','control','control','control','control','control','control','control', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease','disease', 'disease', 'disease', 'disease'), levels=c("control","disease"))
# time 1 refers to measurement before stimuli and time 2 refers to measurement after stimuli
> time <- factor(c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2), levels=c(1,2))
> data.frame(sample=colnames(y),patient,n,time,group)
sample patient n time group
1 patient1 1 1 control
2 patient1 1 2 control
3 patient2 2 1 control
4 patient2 2 2 control
5 patient3 3 1 control
6 patient3 3 2 control
7 patient4 4 1 control
8 patient4 4 2 control
9 patient5 5 1 control
10 patient5 5 2 control
11 patient6 6 1 control
12 patient6 6 2 control
13 patient7 7 1 control
14 patient7 7 2 control
15 patient8 1 1 disease
16 patient8 1 2 disease
17 patient9 2 1 disease
18 patient9 2 2 disease
19 patient10 3 1 disease
20 patient10 3 2 disease
21 patient11 4 1 disease
22 patient11 4 2 disease
23 patient12 5 1 disease
24 patient12 5 2 disease
25 patient13 6 1 disease
26 patient13 6 2 disease
> design <- model.matrix(~group+group:n+group:time)
> colnames(design)
[1] "(Intercept)" "groupdisease" "groupcontrol:n2" "groupdisease:n2" "groupcontrol:n3" "groupdisease:n3" "groupcontrol:n4" "groupdisease:n4"
[9] "groupcontrol:n5" "groupdisease:n5" "groupcontrol:n6" "groupdisease:n6" "groupcontrol:n7" "groupdisease:n7" "groupcontrol:time2" "groupdisease:time2"
#Estimate dispersion
> y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
> y$common.dispersion
> y <- estimateGLMTrendedDisp(y, design)
> names(y)
> summary(y$tagwise.dispersion)
> y <- estimateGLMTagwiseDisp(y, design)
#Differential expression
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit, contrast=c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,1))
---------------------------------------------------------------------------------------------
Now, it obviously didn't work since my groups are not of equal number and that re-numbering my sample this way and the contrast matrix make no sense. So I thought of re-numbering them this way:
>n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,8, 8,9, 9, 10, 10, 11, 11, 12, 12, 13, 13), levels=c(1,2,3,4,5,6,7,8,9,10,11,12,13))
And designing my matrix this way:
> design <- model.matrix(~group+group:n+n:time)
> colnames(design)
With the entire analysis:
---------------------------------------------------------------------------------------------
> patient <- factor(c('patient1', 'patient1', 'patient2', 'patient2', 'patient3', 'patient3', 'patient4', 'patient4', 'patient5', 'patient5','patient6', 'patient6', 'patient7', 'patient7', 'patient8', 'patient8', 'patient9', 'patient9', 'patient10', 'patient10', 'patient11', 'patient11', 'patient12', 'patient12', 'patient13', 'patient13' ))
> n <- factor(c(1, 1,2, 2, 3, 3, 4, 4,5, 5, 6, 6, 7, 7,8, 8,9, 9, 10, 10, 11, 11, 12, 12, 13, 13), levels=c(1,2,3,4,5,6,7,8,9,10,11,12,13))
> group <- factor(c('control', 'control','control', 'control','control','control','control','control','control','control','control','control','control','control', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease', 'disease','disease', 'disease', 'disease', 'disease'), levels=c("control","disease"))
> time <- factor(c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2), levels=c(1,2))
sample patient n time group
1 patient1 1 1 control
2 patient1 1 2 control
3 patient2 2 1 control
4 patient2 2 2 control
5 patient3 3 1 control
6 patient3 3 2 control
7 patient4 4 1 control
8 patient4 4 2 control
9 patient5 5 1 control
10 patient5 5 2 control
11 patient6 6 1 control
12 patient6 6 2 control
13 patient7 7 1 control
14 patient7 7 2 control
15 patient8 8 1 disease
16 patient8 8 2 disease
17 patient9 9 1 disease
18 patient9 9 2 disease
19 patient10 10 1 disease
20 patient10 10 2 disease
21 patient11 11 1 disease
22 patient11 11 2 disease
23 patient12 12 1 disease
24 patient12 12 2 disease
25 patient13 13 1 disease
26 patient13 13 2 disease
>design <- model.matrix(~group+group:n+n:time)
> colnames(design)
[1] "(Intercept)" "groupdisease" "groupcontrol:n2" "groupdisease:n2" "groupcontrol:n3" "groupdisease:n3" "groupcontrol:n4" "groupdisease:n4" "groupcontrol:n5"
[10] "groupdisease:n5" "groupcontrol:n6" "groupdisease:n6" "groupcontrol:n7" "groupdisease:n7" "groupcontrol:n8" "groupdisease:n8" "groupcontrol:n9" "groupdisease:n9"
[19] "groupcontrol:n10" "groupdisease:n10" "groupcontrol:n11" "groupdisease:n11" "groupcontrol:n12" "groupdisease:n12" "groupcontrol:n13" "groupdisease:n13" "n1:time2"
[28] "n2:time2" "n3:time2" "n4:time2" "n5:time2" "n6:time2" "n7:time2" "n8:time2" "n9:time2" "n10:time2"
[37] "n11:time2" "n12:time2" "n13:time2"
---------------------------------------------------------------------------------------------
However, now I am not sure if my matrix is right and I am confused as to what would be the best way to compare the two groups. Any ideas would be greatly appreciated.
Thank you very much,
L
-- output of sessionInfo():
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.7.1 limma_3.21.3
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list