[BioC] RNASeq: normalization issues

ywchen at jimmy.harvard.edu ywchen at jimmy.harvard.edu
Mon May 2 03:21:01 CEST 2011


Hi Wei and Davis,

Thank you so much for such detailed explanations! Now it is very clear.
In your case you found the benefit of using quantile normalization+GLM+LRT,
is it single factor with many libraries or multiple factor data?

Yiwen


> Hi Yiwen:
>
> 	As Davis said, the "length+quantile" method I mentioned in the previous
> correspondences is not the "quantile normalization" option in
> calcNormFactors function in edgeR. That's the reason why you didn't see
> gene length adjustment with that function.
>
> 	Adjusting read counts using gene length (total exon length) will put all
> genes on the same baseline within the sample (longer transcripts produce
> more reads), and quantile between-sample normalization will make all
> samples have the same read count distribution (and library size will
> become the same as well). This is what I mean by "length+quantile"
> normalization. The quantile normalization here is the same quantile
> normalization applied to microarray data, however it is applied to
> sequencing data in a different way (used as offsets in the general linear
> model).
>
> 	Now I elaborate how to do this normalization. Suppose you have a read
> count matrix of "x" with rows being genes and columns being samples .
> Also suppose you have a numeric vector "gene.length" which includes total
> exon length for each gene and gene order in "gene.length" is the same
> with that in "x". The following line of code yields the number of reads
> per 1000 bases for each gene:
>
> x1 <- x*1000/gene.length
>
> Now perform quantile normalization for gene length adjusted data:
>
> library(limma)
> x2 <- normalizeBetweenArrays(x1,method="quantile")
>
> Suppose x has two columns named "wt" and "ko". Create a design matrix:
>
> snames <- factor(c("wt","ko"))
> design <- model.matrix(~snames)
>
> Now get the offsets for each gene in each sample. The offsets are the
> intensity differences between raw data and normalized data.
>
> library(edgeR)
> y <- DGEList(counts=x,group=colnames(x))
> lowcounts <- rowSums(x)<5
> offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1)
> yf <- y[!lowcounts,]
>
> Fit general linear models to read count data with offsets included:
>
> y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE,
> tagwise=TRUE)
> fit <-
> glmFit(y=y.glm,design=design,dispersion=y.glm$CR.tagwise.dispersion,offset=offset)
>
> Perform likelihood ratio tests to find differentially expressed genes:
>
> DE <- glmLRT(y.glm,fit)
> dt <- decideTestsDGE(DE)
> summary(dt)
>
> Hope this will work for you!
>
> Cheers,
> wei
>
>
>
> On May 2, 2011, at 8:52 AM, Davis McCarthy wrote:
>
>> Hi Yiwen
>>
>> The "quantile normalization" option in calcNormFactors in edgeR does
>> something very different from the quantile normalization
>> (microarray-style) that Wei has been discussing.
>>
>> The quantile normalization in calcNormFactors computes an offset for
>> sequencing library depth after Bullard et al (2010) [1]. This is an
>> approach in the same vein as TMM normalization [2] or scaled median [3].
>>
>> I believe that the approach that Wei is suggesting is more similar to
>> the quantile normalization approach that has been taken with microarray
>> data, adjusting the data so that the response follows the same
>> distribution across (in this context) sequenced libraries. This will
>> typically result in non-integer data from adjusting counts, but
>> count-based methods could still be used if this quantile normalization
>> were treated as an offset for each observation in (e.g.) a generalized
>> linear model.
>>
>> Cheers
>> Davis
>>
>>
>> [1] http://www.biomedcentral.com/1471-2105/11/94
>> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
>> [3] http://genomebiology.com/2010/11/10/R106#B13
>>
>>
>>> Hi Wei,
>>>
>>> Could you elaborate on how to appropriately do gene-length-adjusted
>>> quantile normalization in edgeR? The "quantile normalization" option in
>>> calcNormFactors function does not seem to take into account the gene
>>> length.
>>>
>>> Thanks.
>>> Yiwen
>>>> Hi João:
>>>>
>>>> 	Maybe you can try different normalization methods for your data to
>>>> see
>>>> which one looks better. How to best normalize RNA-seq data is still of
>>>> much debate at this stage.
>>>>
>>>> 	You can try scaling methods like TMM, RPKM, or 75th percentile, which
>>>> as
>>>> you said normalize data within samples. Or you can try quantile
>>>> between-sample normalization (read counts should be adjusted by gene
>>>> length first), which performs normalization across samples. You can
>>>> try
>>>> all these in edgeR package.
>>>>
>>>> 	From my experience, I actually found the quantile method performed
>>>> better
>>>> for my RNA-seq data. I used general linear model and likelihood ratio
>>>> test in edgeR in my analysis.
>>>>
>>>> 	Hope this helps.
>>>>
>>>> Cheers,
>>>> Wei
>>>>
>>>> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>>
>>>>> Until now I was doing RNAseq DE analysis and to do that I understand
>>>>> that
>>>>> normalization issues only matter inside samples, because one can
>>>>> assume
>>>>> the
>>>>> length/content biases will cancel out when comparing same genes in
>>>>> different
>>>>> samples.
>>>>> Although, I'm now trying to compare correlation of different genes
>>>>> and
>>>>> so,
>>>>> this biases should be taken into account - for this is there any
>>>>> better
>>>>> method than RPKM?
>>>>>
>>>>> My main doubt is if I should also take into acount the biases inside
>>>>> samples
>>>>> and to do that is there any better approach then TMM by Robinson and
>>>>> Oshlack
>>>>> [2010]?
>>>>>
>>>>> Thank you all,
>>>>> --
>>>>> João Moura
>>>>>
>>>>> 	[[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>
>>>> ______________________________________________________________________
>>>> The information in this email is confidential and
>>>> intend...{{dropped:6}}
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>> --------------------------------------------------
>> Davis J McCarthy
>> Research Technician
>> Bioinformatics Division
>> Walter and Eliza Hall Institute of Medical Research
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> dmccarthy at wehi.edu.au
>> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:8}}



More information about the Bioconductor mailing list