[BioC] Single nucleotide based RNAseq normalization with edgeR
Mark Robinson
mrobinson at wehi.EDU.AU
Mon Feb 7 23:11:49 CET 2011
Hi Jens/Sridhara.
A few thoughts below.
On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote:
> Hi Gordon,
> First I would like to thank Jens for asking the questions that I had asked
> a few days ago.
> In addition to Jens's question, I have one more question about my RNA-seq
> data:
> 1. I would like to know if I can multiply the counts for each gene by the
> norm.factor (calculated by the "calcNormFactors( )" function)?
Sridhara, you've asked this exact question before and I answered (short answer is: NO to multiplying ... instead, divide by [library size]*[normalization factor]):
https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html
Perhaps you can clarify what you don't understand.
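To make the arithmetic concrete, here is a minimal sketch in plain Python rather than edgeR. The counts, library sizes and normalization factors are made-up numbers, and `scale_counts` is a hypothetical helper, not an edgeR function. It only shows that the raw counts are divided by [library size]*[normalization factor], not multiplied by the factor.

```python
def scale_counts(counts, lib_sizes, norm_factors, per=1e6):
    """Scale raw counts to counts per `per` reads of effective library size.

    counts:       list of rows, one row per gene, one column per library
    lib_sizes:    total number of mapped reads in each library
    norm_factors: one normalization factor per library
                  (toy values here, standing in for calcNormFactors output)
    """
    # Effective library size = library size * normalization factor.
    effective = [n * f for n, f in zip(lib_sizes, norm_factors)]
    # Divide each count by the effective library size of its column.
    return [[c * per / e for c, e in zip(row, effective)] for row in counts]


# Toy data (assumed values, for illustration only): 2 genes x 2 libraries.
counts = [[100, 50],
          [400, 260]]
lib_sizes = [2_000_000, 1_000_000]
norm_factors = [1.25, 0.8]

scaled = scale_counts(counts, lib_sizes, norm_factors)
for row in scaled:
    print(row)
```

The scaled values are comparable between libraries for the same gene. They are for inspection only and do not replace the raw counts as input to the analysis.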
> On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg <
> jens.georg at biologie.uni-freiburg.de> wrote:
>
>> Hi Gordon,
>> thank you for your reply. The resolution of our ~100 nt Solexa reads is too
>> small to detect individual processing sites, so we want to investigate every
>> single nucleotide individually ("single nucleotide based normalization").
>> That means that we count how often an individual nucleotide is covered by
>> sequence reads. Of course, this approach will effectively increase the
>> lib.size by a factor which depends on the length of the Solexa reads. As the
>> lib.size is critical for the normalization, I am not sure if I should use
>> the original read numbers for each library or the read numbers multiplied
>> by the read length to adjust for the single nucleotide investigation.
So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped. While I think this will work out OK (the normalization calculation should be fine), this coverage calculation does impose a (strong?) dependence between adjacent nucleotides, since each read adds to the count at every position it covers. One alternative would be to count only the reads that *begin* at a given nucleotide. Then your library sizes are as normal.
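The difference between the two counting schemes can be sketched in plain Python. This is a toy illustration under stated assumptions: reads are given only as 0-based start positions on one strand, all reads have the same length, and the function names `coverage_counts` and `start_counts` are hypothetical.

```python
def coverage_counts(read_starts, read_length, genome_length):
    """Per-nucleotide coverage: each read adds 1 to every position it spans."""
    cov = [0] * genome_length
    for s in read_starts:
        for pos in range(s, min(s + read_length, genome_length)):
            cov[pos] += 1
    return cov


def start_counts(read_starts, genome_length):
    """Per-nucleotide counts of reads that begin at that position."""
    starts = [0] * genome_length
    for s in read_starts:
        starts[s] += 1
    return starts


# Toy data (assumed): four reads of length 4 on a 10 nt reference.
read_starts = [0, 2, 2, 5]
cov = coverage_counts(read_starts, read_length=4, genome_length=10)
starts = start_counts(read_starts, genome_length=10)

print(sum(cov))     # total is (number of reads) * (read length)
print(sum(starts))  # total is the number of reads
```

With coverage, the column total is inflated by the read length and neighbouring positions share most of their reads. With start counts, each read contributes exactly once, so the total equals the number of mapped reads.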
>> I have two more questions regarding the normalization:
>> 1. Are the norm factors calculated by the calcNormFactors( ) function
>> automatically used for further steps like the estimateCommonDisp( )
>> function?
Yes.
>> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized
>> readcounts?
Yes, but they account only for overall depth and potential composition biases, not for length biases (or any others). They are meant for making inferences about a given gene across conditions. The inferences for differential expression are still made on the raw counts.
Hope that helps.
Mark
>>
>> Many thanks
>>
>> Jens
>>
>> Hi Jens,
>>>
>>> I don't know what you mean by single nucleotide based normalization,
>>> however the following comments may be helpful.
>>>
>>> edgeR automatically adjusts for library sizes, whether you include an
>>> explicit normalization step or not. Normalization is a separate issue, and
>>> is intended to deal with more subtle issues.
>>>
>>> Normalization, as edgeR does it, does not require replicates.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> Date: Fri, 04 Feb 2011 11:28:15 +0100
>>>> From: Jens Georg <jens.georg at biologie.uni-freiburg.de>
>>>> To: bioconductor at r-project.org
>>>> Subject: [BioC] Single nucleotide based RNAseq normalization with
>>>> edgeR?
>>>> Message-ID: <4D4BD4BF.4010009 at biologie.uni-freiburg.de>
>>>> Content-Type: text/plain; charset=ISO-8859-15; format=flowed
>>>>
>>>>
>>>>
>>>> Dear edgeR users and developers,
>>>>
>>>> we used Solexa sequencing in order to detect RNase E processing sites.
>>>> Therefore we split an RNA sample and treated one half with RNase E
>>>> prior to cDNA synthesis and sequencing. The libraries differ in size
>>>> (1,918,953 and 1,208,586 reads respectively), which clearly necessitates
>>>> a normalization step. Furthermore, we expect site-specific differences
>>>> rather than differences in the accumulation of the full-length RNAs.
>>>>
>>>> So I want to ask whether it is appropriate to do a single nucleotide based
>>>> normalization with edgeR, and whether you think a reliable basic normalization
>>>> is possible without replicates.
>>>>
>>>> Thank you for your comments.
>>>>
>>>> Best regards
>>>>
>>>> Jens
>>>>
>>>
>>> ______________________________________________________________________
>>> The information in this email is confidential and inte...{{dropped:6}}
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
>
> --
> Sridhara G Kunjeti
> PhD Candidate
> University of Delaware
> Department of Plant and Soil Science
> email- sridhara at udel.edu
> Ph: 832-566-0011
>
------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: mrobinson at wehi.edu.au
e: m.robinson at garvan.org.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------