[BioC] edgeR: finding tissue specific genes [was: Design matrix and BCV]
Gordon K Smyth
smyth at wehi.EDU.AU
Wed May 8 02:26:36 CEST 2013
Sorry, first command shoud have been
contrasts(tiss_groups) <- contr.sum(levels(tiss_groups))
Your linear model can be parametrized in terms of any set of 18
coefficients. This command says that you want the effects to "sum" to
zero, in other words the effects should be relative to the grand mean.
Best wishes
Gordon
On Tue, 7 May 2013, Manoj Hariharan wrote:
> Hi Gordon,
Actually, I had never used the glmQRT() - I've always been using the glmQLFTest().
And, as you had suggested, when I do the
contrasts(tiss_groups) <- contr.sum(tiss_groups)
I get the following error:
> contrasts(tiss_groups) <- contr.sum(tiss_groups)
Error in `contrasts<-`(`*tmp*`, value = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, :
wrong number of contrast matrix rows
I didn't really understand what the difference by using the
contrasts(tiss_groups) <- contr.sum(tiss_groups)
design <- model.matrix(~tiss_groups)
rather than specifying design without the "contrasts(tiss_groups) <- contr.sum(tiss_groups)", as below:
design <- model.matrix(~tiss_groups)
I would still have the intercept and have the following for fit$design:
attr(,"assign")
[1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$tiss_groups
[1] "contr.treatment"
Thanks,
Manoj.
________________________________
From: Gordon K Smyth <smyth at wehi.EDU.AU>
To: Manoj Hariharan <h_manoj at yahoo.com>
Cc: Bioconductor mailing list <bioconductor at r-project.org>
Sent: Monday, May 6, 2013 11:39 PM
Subject: edgeR: finding tissue specific genes [was: Design matrix and BCV]
Dear Manoj,
Why not simply find genes than are higher in one group than the average of
the other groups? edgeR can do this sort of thing easily.
Let's suppose suppose you going to using the quasi-lik approach of
glmQFTest() rather than glmQRT().
First define a design matrix for which the intercept is the overall mean:
contrasts(tiss_groups) <- contr.sum(tiss_groups)
design <- model.matrix(~tiss_groups)
Then estimate the trended dispersions:
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
Then fit the basic linear model:
fit <- glmFit(y, design)
Then you can extract all the lists you want. For example
ql <- glmQLFTest(fit, coef=2)
top1 <- topTags(ql)
will give you genes specifically up or specifically down in tissue 1, as
compared to the average of all the other groups.
ql <- glmQLFTest(fit, coef=3)
top2 <- topTags(ql)
will give you genes specifically up/down in tissue 2, and so on up to
ql <- glmQLFTest(fit, coef=18)
top17 <- topTags(de)
will give you genes specifically up/down in tissue 17. Finally, to get
genes up/down in tissue 18:
cont <- rep(-1,18)
cont[1] <- 0
ql <- glmQLFTest(fit, contrast=cont)
top18 <- topTags(de)
What you propose doesn't quite make sense to me. If you want to put genes
on the same scale (and you don't need to for the above analysis), would it
not be better to use rpkm()?
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
On Mon, 6 May 2013, Manoj Hariharan wrote:
> Thanks Gordon.
I was wondering if I could have a quantitative value for the deviance of
each group from the average, for each of the DE genes. I understand that
the F value (from the F-statistic) is a measure of how far the gene is
compared to the expression of all others across the samples.
One option, I could think of is to just get the normalized counts for each of the sample, for the set of DE genes:
de_lrt <- rownames(top_lrt[top_lrt$FDR<0.05,])
scale <- D$samples$lib.size*D$samples$norm.factors
normCounts <- round(t(t(D$counts)/scale)*mean(scale))
write.table(log(normCounts[de_lrt[1:5690],]+1), "All37_NormCounts_DEGenes", sep="\t", quote=FALSE)
Essentially, I am trying to get the list of genes that shows a more
"tissue-specific" behaviour. Most genes are not expressed strictly in one
particular tissue - there would be related tissues where its expression
would be almost similar. So I would like to rank them based on their
expression values and for that I need to have all comparable values.
Then, I could consider those samples where the expression of the gene is
higher than the 90th percentile. Hope that makes sense!
Thanks,
Manoj.
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}
More information about the Bioconductor
mailing list