[BioC] edgeR: finding tissue specific genes [was: Design matrix and BCV]

Wed May 8 02:26:36 CEST 2013

Sorry, first command shoud have been

  contrasts(tiss_groups) <- contr.sum(levels(tiss_groups))

Your linear model can be parametrized in terms of any set of 18 
coefficients.  This command says that you want the effects to "sum" to 
zero, in other words the effects should be relative to the grand mean.

Best wishes
Gordon

On Tue, 7 May 2013, Manoj Hariharan wrote:

> Hi Gordon,

Actually, I had never used the glmQRT() - I've always been using the glmQLFTest().

And, as you had suggested, when I do the
contrasts(tiss_groups) <- contr.sum(tiss_groups)

I get the following error:
> contrasts(tiss_groups) <- contr.sum(tiss_groups)
Error in `contrasts<-`(`*tmp*`, value = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  : 
  wrong number of contrast matrix rows

I didn't really understand what the difference by using the

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

rather than specifying design without the "contrasts(tiss_groups) <- contr.sum(tiss_groups)", as below:
design <- model.matrix(~tiss_groups)

I would still have the intercept and have the following for fit$design:

attr(,"assign")
 [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$tiss_groups
[1] "contr.treatment"

Thanks,
Manoj.

________________________________
  From: Gordon K Smyth <smyth at wehi.EDU.AU>
To: Manoj Hariharan <h_manoj at yahoo.com> 
Cc: Bioconductor mailing list <bioconductor at r-project.org> 
Sent: Monday, May 6, 2013 11:39 PM
Subject: edgeR: finding tissue specific genes [was: Design matrix and BCV]

Dear Manoj,

Why not simply find genes than are higher in one group than the average of 
the other groups?  edgeR can do this sort of thing easily.

Let's suppose suppose you going to using the quasi-lik approach of 
glmQFTest() rather than glmQRT().

First define a design matrix for which the intercept is the overall mean:

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

Then estimate the trended dispersions:

  y <- estimateGLMCommonDisp(y, design)
  y <- estimateGLMTrendedDisp(y, design)

Then fit the basic linear model:

  fit <- glmFit(y, design)

Then you can extract all the lists you want.  For example

  ql <- glmQLFTest(fit, coef=2)
  top1 <- topTags(ql)

will give you genes specifically up or specifically down in tissue 1, as 
compared to the average of all the other groups.

  ql <- glmQLFTest(fit, coef=3)
  top2 <- topTags(ql)

will give you genes specifically up/down in tissue 2, and so on up to

  ql <- glmQLFTest(fit, coef=18)
  top17 <- topTags(de)

will give you genes specifically up/down in tissue 17.  Finally, to get 
genes up/down in tissue 18:

  cont <- rep(-1,18)
  cont[1] <- 0
  ql <- glmQLFTest(fit, contrast=cont)
  top18 <- topTags(de)

What you propose doesn't quite make sense to me.  If you want to put genes 
on the same scale (and you don't need to for the above analysis), would it 
not be better to use rpkm()?

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Mon, 6 May 2013, Manoj Hariharan wrote:

> Thanks Gordon.

I was wondering if I could have a quantitative value for the deviance of 
each group from the average, for each of the DE genes. I understand that 
the F value (from the F-statistic) is a measure of how far the gene is 
compared to the expression of all others across the samples.

One option, I could think of is to just get the normalized counts for each of the sample, for the set of DE genes:
de_lrt <- rownames(top_lrt[top_lrt$FDR<0.05,])
scale <- D$samples$lib.size*D$samples$norm.factors
normCounts <- round(t(t(D$counts)/scale)*mean(scale))
write.table(log(normCounts[de_lrt[1:5690],]+1), "All37_NormCounts_DEGenes", sep="\t", quote=FALSE)

Essentially, I am trying to get the list of genes that shows a more 
"tissue-specific" behaviour. Most genes are not expressed strictly in one 
particular tissue - there would be related tissues where its expression 
would be almost similar. So I would like to rank them based on their 
expression values and for that I need to have all comparable values. 
Then,  I could consider those samples where the expression of the gene is 
higher than the 90th percentile. Hope that makes sense!

Thanks,
Manoj.

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}