[BioC] [bioc] correlation analysis of RNA-seq

Gordon K Smyth smyth at wehi.EDU.AU
Tue Jul 2 11:06:44 CEST 2013


Dear Makis,

We often do an analysis similar to what you have described when we have a 
particular hub geneX of interest, and we want to find other genes 
correlated with geneX.  With 300 samples, it might be more efficient to do 
this with voom rather than edgeR, just for speed.  The voom analysis would 
go something like this.

   v <- voom(counts)

Get the logCPM for GeneX:
   i <- which(rownames(v)=="GeneX")
   GeneX <- v$E[i,]

Get expression for all other genes:
   v2 <- v[-i,]

   design <- cbind(1,GeneX)
   fit <- lmFit(v2,design)
   fit <- eBayes(fit)
   topTable(fit,coef=2)

This will correlate all the other genes with GeneX, and will rank them 
from most to least correlated.  This is surely more efficient than 
correlating individual pairs of genes using glm.nb!

The edgeR analysis would be very similar, but I am not sure how fast the 
dispersion estimation step will be with 300 samples.  It would go 
something like:

   y <- DGEList(counts=counts)
   logCPM <- cpm(y,log=TRUE,prior.count)
   GeneX <- logCPM[i,]
   design <- cbind(1,GeneX)
   y2 <- y[-i,]
   y2 <- estimateDisp(y2,design)
   fit <- glmFit(y2,design)
   lrt <- glmLRT(fit)
   topTags(lrt)

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Mon, 1 Jul 2013, Makis Motakis wrote:

> Dear list,
>
> I have a large collection of RNA-seq data (300 samples from different 
> cells types, e.g. 2 samples from Basophils, 3 samples from cultured Mast 
> cells etc; the number of replicates per cell type varies from 2 to 5). I 
> would like to analyze the counts or the TPM data and find correlated 
> RefSeq genes. I have estimated Kendall correlations of the TPM data. For 
> comparison reasons, I would also like to model the association between 
> the gene counts data by the GLM model of negative binomial distribution. 
> In my data, I expect that non-parametric correlations and glm model will 
> give quite different results for several gene pairs. For example,
>
> gene1<-rnbinom(300,mu=50,size=5)
> gene2<-rnbinom(300,mu=50,size=5)
> plot(gene1,gene2)
> cor.test(gene1,gene2,method="kendall")
> summary(glm.nb(gene2 ~ gene1))
>
> I understand that the above glm model is over-simplified (for example it 
> does not take into account the replicates information) but at this point 
> I am not worried about that yet. My question is whether it is possible 
> to incorporate the offset (library size * normalization factor) to the 
> glm model e.g. the way that edgeR works. Looking at edgeR source code at 
> "mglmLS.R"  function (a similar expression also exist at 
> "mglmLevenberg.R"): mu <- exp(beta %*% t(X) + offset)
>
> In my case, would it be correct if I estimated the offset by
>
> offset <- getOffset()
>
> and then I estimated the gene / gene associations by glm using something 
> like mu <- exp(beta %*% t(X-offset) + offset) ? if yes, is there any 
> reason for this model to follow the edgeR dispersion estimation method 
> or simply glm.nb would do what I want?
>
> Thank you in advance,
> Makis

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list