[BioC] mRNA-seq cross-species analysis, is it possible?

Davis McCarthy dmccarthy at wehi.EDU.AU
Thu Aug 4 01:09:10 CEST 2011

Hi Mali

It _is_ possible and relatively straightforward to specify a gene-specific normalization factor with edgeR. The GLM methods in edgeR take an argument called "offset", which can be a matrix (with the same dimensions as your matrix of counts) providing the appropriate offset for the negative binomial generalized linear model. This allows you to apply a gene-specific normalization factor.

Say you have a regular DGEList object, d, which contains the matrix of raw read counts in d$counts, and a matrix of normalized read counts, x.norm, then the appropriate offset matrix is just as follows (we add 0.1 to the counts to deal with counts of zero):

offset <- log(d$counts + 0.1) - log(x.norm + 0.1)
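
As a minimal sketch of how x.norm and the offset might be built for the gene-length case in the original question: this assumes x.norm is an RPKM-like matrix (raw counts divided by library size in millions and by gene length in kb). The toy counts, the gene lengths, and the names gene.length and lib.size are made up for illustration and are not part of edgeR.

```r
# Toy raw counts: 3 genes x 4 samples (2 samples per species).
counts <- matrix(c(100, 120,  80,  90,
                    10,   0,  25,  30,
                   500, 450, 700, 650),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3),
                                 c("A1", "A2", "B1", "B2")))

# Hypothetical gene lengths (bp) per species, expanded to one column per sample.
len.A <- c(1000, 2000, 1500)
len.B <- c(1500, 1000, 3000)
gene.length <- cbind(len.A, len.A, len.B, len.B)

# Library sizes taken as column totals (an assumption for this sketch).
lib.size <- colSums(counts)

# RPKM-like normalized counts: per million reads, per kb of gene.
x.norm <- t(t(counts) / (lib.size / 1e6)) / (gene.length / 1000)

# Offset as in the formula above; the 0.1 guards against log(0).
offset <- log(counts + 0.1) - log(x.norm + 0.1)
```

Under this formula a zero count gives an offset of exactly 0 (log(0.1) - log(0.1)), while the other entries come out close to log(lib.size / 1e6 * gene.length / 1000), so it is worth checking how many zeros your data contain.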

This offset can be added to the DGEList object as the element "offset" (funnily enough):

d$offset <- offset

From there the software takes care of the rest, using this offset matrix as the default when estimating dispersion values and fitting NB models.
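
To make that workflow concrete, here is a hedged end-to-end sketch on simulated data. It assumes a current edgeR installation, in which estimateDisp(), glmFit() and glmLRT() are available; the function names in older releases differ in places (e.g. estimateGLMCommonDisp()). The simulated counts, the per-species gene lengths and the RPKM-like form of x.norm are illustrative assumptions, not a recommended normalization.

```r
library(edgeR)

set.seed(1)
ngenes <- 200
group <- factor(c("A", "A", "B", "B"))   # two species, two replicates each

# Simulated raw counts (negative binomial); purely illustrative.
counts <- matrix(rnbinom(ngenes * 4, mu = 50, size = 10), nrow = ngenes,
                 dimnames = list(paste0("gene", seq_len(ngenes)),
                                 c("A1", "A2", "B1", "B2")))

# Hypothetical per-species gene lengths (bp), one column per sample.
len.A <- round(runif(ngenes, 500, 3000))
len.B <- round(runif(ngenes, 500, 3000))
gene.length <- cbind(len.A, len.A, len.B, len.B)

# RPKM-like normalized counts (assumed form of x.norm) and the offset.
lib.size <- colSums(counts)
x.norm <- t(t(counts) / (lib.size / 1e6)) / (gene.length / 1000)
offset <- log(counts + 0.1) - log(x.norm + 0.1)

# Raw counts go in the DGEList; normalization enters only via the offset.
d <- DGEList(counts = counts, group = group)
d$offset <- offset

design <- model.matrix(~ group)
d <- estimateDisp(d, design)     # dispersion estimation with the offset matrix
fit <- glmFit(d, design)         # NB GLM fit with the offset matrix
lrt <- glmLRT(fit, coef = 2)     # likelihood ratio test: species B vs species A
topTags(lrt)
```

Because these offsets are not on the scale of log library sizes, abundance summaries such as logCPM in the output may not be on their usual scale; the sketch is only meant to show where the offset enters the analysis.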

Using normalization factors rather than normalized read counts is the appropriate approach for edgeR. The edgeR methods require the raw counts and any normalization should enter the model(s) as an offset. Normalized read counts should _not_ be substituted for the raw read counts in the "counts" element of a DGEList in edgeR.

We have used this gene-specific normalization factor to try out things like quantile normalization on RNA-Seq data in-house. To my knowledge, the new cqn package outputs gene-specific offsets that will plug into edgeR to normalize data for (possibly among other things) gene length and GC bias.

Best wishes


On Aug 3, 2011, at 11:48 PM, mali salmon wrote:

> Dear list
> I would like to perform mRNA-seq cross-species comparison. In that case it
> would be necessary to account for the differences in gene length.
> I already got a reply from the author of DESeq (see below) that this
> currently can't be done with DESeq.
> Is it possible to specify gene-specific normalization factor with edgeR? or
> to input read counts that have been normalized to gene length?
> Thanks
> Mali
> ---------- Forwarded message ----------
> From: Simon Anders <anders at embl.de>
> Date: Wed, Aug 3, 2011 at 10:03 AM
> Subject: Re: DESeq between two plants with different gene length
> To: mali salmon <shalmom1 at gmail.com>
> Hi Mali
> On 08/02/2011 09:54 PM, mali salmon wrote:
>> I have count data for 2 plants: one is rice, which has a genome, and the
>> other is a non-model plant with no genome. In order to find the gene
>> counts for the plant with the unknown genome, I assembled the reads and
>> aligned the contigs to the rice proteome.
>> Can I use DESeq to find DE genes between rice and the non-model plant?
>> The problem is that the gene lengths differ between the two
>> plants. Would the comparison still be valid? Would you suggest
>> normalizing to gene length before DESeq?
> First the technical point: It might be appropriate to account for gene
> length, but with the current version of DESeq, you cannot specify
> gene-specific normalization factors, even though we'll add this feature at
> some point.
> In general, I hesitate to recommend using DESeq for a cross-species
> comparison, but I also wouldn't know of any other good method. Such
> comparisons are really difficult and proper interpretation is filled with
> methodological pitfalls.
> Differences in gene length and ambiguity in assigning orthologous genes are
> the main technical ones. Another one is the question what constitutes proper
> replication here. Should you grow both species under identical conditions in
> the lab? If so, which conditions, those good for rice (e.g., lots of water),
> or those good for the other species? Maybe, you should grow both species in
> both conditions, and consider the samples from the same species but
> different conditions as replicates, as this would capture as much of the
> environmental influence as possible. Otherwise, you could not say whether
> the differences in condition may be attributed to genetics (different
> species) or environment (different growth conditions) or an interaction of
> both (different level of adaptation of the species to the chosen growth
> conditions).
> I know, there are papers that try to do such comparisons, but I haven't seen
> anything yet addressing these issues in a convincing manner.
> Simon

Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia
dmccarthy at wehi.edu.au
