Hi.
I have found some R/Bioconductor/Genominator code on the web (below) and it
measures differential expression of RNA-seq short read data using a general
linear model.
Can someone explain some basic questions I have?
1) What is the reason for using 2 glm's for measuring differential
expression?
2) In the function(y) there are two linear models ran; one with argument y ~
groups and the other with argument y ~ 1. Why do this?
3) What does the offset do?
4) Why use ANOVA; is to compare the linear models?
5) What can we say about results, if adjusted for multiple testing; how
would you explain a significant result?
6) Would an adjusted p-value of <= 0.05 be significant?
Basically, I don't know much about the statistics done below and any advice
or pointers to good literature for this would be a great help. Thank you.
> head(geneCountsUI)
mut_1_f mut_2_f wt_1_f wt_2_f
YAL069W 0 0 0 0
YBL049W 19 18 10 4
# Normalisation of RNA-seq lanes
notZero <- which(rowSums(geneCountsUI) != 0)
upper.quartiles <- apply(geneCountsUI[notZero, ], 2, function(x) quantile(x,
0.75))
uq.scaled <- upper.quartiles/sum(upper.quartiles) * sum(laneCounts)
# Calculating differential expression
groups <- factor(rep(c("mut", "wt"), times = c(2, 2)))
pvalues <- apply(geneCountsUI[notZero, ], 1,
function(y) {
fit <- glm(y ~ groups, family = poisson(), offset = log(uq.scaled))
fit0 <- glm(y ~ 1, family = poisson(), offset = log(uq.scaled))
anova(fit0, fit, test = "Chisq")[2, 5]
})
[[alternative HTML version deleted]]