[BioC] Normalization
Gordon K Smyth
smyth at wehi.EDU.AU
Fri Mar 1 08:52:05 CET 2013
Hi Ryan,
Everything else you say is correct, but the pseudo counts are not linearly
related to counts-per-million, even when the norm factors are all 1.
Their definition and purpose is described in Robinson and Smyth
(Biostatistics, 2008).
Pseudo counts are used internally by edgeR to estimate the dispersions and
to compute the exact tests. They do not have a simple interpretation as
normalized counts because they depend on the experimental design as well
as on the library sizes. We do not recommend that users rely on them for
other purposes.
For descriptive purposes, users should use cpm() or similar.
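A minimal sketch of that, assuming the DGEList object d from the session
quoted below:

library(edgeR)
# counts per million using the effective (normalized) library sizes;
# suitable for tables, plots and other descriptive summaries
y <- cpm(d, normalized.lib.sizes = TRUE)
# log2-CPM with a small prior count to avoid taking the log of zero
logy <- cpm(d, log = TRUE, prior.count = 2)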
Best wishes
Gordon
> Date: Wed, 27 Feb 2013 23:48:34 -0800
> From: "Ryan C. Thompson" <rct at thompsonclan.org>
> To: Vittoria Roncalli <roncalli at hawaii.edu>
> Cc: bioconductor <Bioconductor at r-project.org>
> Subject: Re: [BioC] Normalization
>
> Hi Vittoria,
>
> Please use "Reply All" so that your reply also goes to the mailing list.
>
> The normalization factors are used to adjust the library sizes (I forget
> the details, I believe they are given in the User's Guide), and then the
> pseudo counts are obtained by normalizing the counts to the adjusted
> library sizes. Since you have not used any normalization factors (i.e.
> all norm factors = 1), the pseudo counts will simply be a constant
> multiple of counts-per-million, if I'm not mistaken. If you want
> absolutely no normalization, you would have to set both the
> normalization factors and library sizes to 1, I think.
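> Roughly, and only as a sketch of the idea rather than edgeR's internal
> code (assuming a DGEList called d):
>
> # effective library size = raw library size x normalization factor
> eff.lib <- d$samples$lib.size * d$samples$norm.factors
> # counts per million computed on the effective library sizes,
> # which should agree with cpm(d, normalized.lib.sizes = TRUE)
> manual.cpm <- t(t(d$counts) / eff.lib) * 1e6
>
> With all norm factors equal to 1 this reduces to plain counts-per-million.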
>
> In any case, the pseudo counts are only for descriptive purposes. The
> statistical testing in edgeR happens using the raw integer counts.
>
> On 02/27/2013 10:12 PM, Vittoria Roncalli wrote:
>> Hi Ryan,
>>
>> thanks for your reply.
>> I obtain pseudo.counts with the following commands:
>>
>>> raw.data <- read.table("counts 2.txt", sep="\t", header=TRUE)
>>> d <- raw.data[, 2:10]
>>> d[is.na(d)] <- 0
>>> rownames(d) <- raw.data[, 1]
>>> group <- c("CONTROL","CONTROL","CONTROL","LD","LD","LD","HD","HD","HD")
>>> d <- DGEList(counts = d, group = group)
>> Calculating library sizes from column totals.
>>> keep <- rowSums(cpm(d) > 1) >= 3
>>> d <- d[keep,]
>>> dim(d)
>> [1] 28755 9
>>> d <- DGEList(counts = d, group = group)
>> Calculating library sizes from column totals.
>>> d <- estimateCommonDisp(d)
>>
>> After estimating the common dispersion, the DGEList contains:
>>
>> $counts
>> $samples
>> $common.dispersion
>> $pseudo.counts
>> $logCPM
>> $pseudo.lib.size
>>
>> Then I write a table of the pseudo.counts and I will continue with
>> those for the DGE analysis.
>>
>> Considering that I did not normalize the libraries, what are the
>> different counts in the pseudo.counts output?
>>
>>
>> Thanks so much
>>
>>
>> Vittoria
>> On Wed, Feb 27, 2013 at 7:20 PM, Ryan C. Thompson
>> <rct at thompsonclan.org> wrote:
>>
>> To answer your first question, when you first create a DGEList
>> object, all the normalization factors are initially set to 1 by
>> default. This is equivalent to no normalization. Once you use
>> calcNormFactors, the normalization factors will be set appropriately.
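>> For example (a minimal sketch, assuming a DGEList called d):
>>
>> d <- calcNormFactors(d)   # TMM normalization by default
>> d$samples$norm.factors    # now generally different from 1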
>>
>> I'm not sure about the second question. Could you provide an
>> example of how you are obtaining pseudocounts with edgeR?
>>
>>
>> On Wed 27 Feb 2013 05:12:27 PM PST, Vittoria Roncalli wrote:
>>
>> Hi, I am an edgeR user and I am a little bit confused about the
>> normalization topic.
>> I am using edgeR to find differentially expressed genes across 3
>> conditions (RNA-Seq) with 3 replicates each.
>> I am following the user guide steps:
>>
>> - load the counts file (from mapping against the reference transcriptome)
>> - filter out low-count genes (1 cpm cutoff)
>> - reassess the library sizes
>> - estimate the common dispersion
>>
>> My first question is related to the normalization. Why, after I import
>> my file, is there a norm.factors column next to the library size?
>>
>> $samples
>>                   group lib.size norm.factors
>> X48h_C_r1.sam   CONTROL 10898526            1
>> X48h_C_r2.sam   CONTROL  7176817            1
>> X48h_C_r3.sam   CONTROL  9511875            1
>> X48h_LD_r1.sam       LD 11350347            1
>> X48h_LD_r2.sam       LD 14836541            1
>> X48h_LD_r3.sam       LD 12635344            1
>> X48h_HD_r1.sam       HD 11840963            1
>> X48h_HD_r2.sam       HD 17335549            1
>> X48h_HD_r3.sam       HD 10274526            1
>>
>> Is the normalization automated? What is the difference with
>> calcNormFactors()?
>>
>> Moreover, if I do not run calcNormFactors, what is in the
>> pseudo.counts output?
>>
>>
>> I am very confused about those points.
>>
>>
>> Thanks in advance for your help.
>>
>>
>> Looking forward to hearing from you.
>>
>>
>> Vittoria
>>
>>
>>
>>
>> --
>>
>> Vittoria Roncalli
>>
>> Graduate Research Assistant
>> Békésy Laboratory of Neurobiology
>> Pacific Biosciences Research Center
>> University of Hawaii at Manoa
>> 1993 East-West Road
>> Honolulu, HI 96822 USA
>>
>> Tel: 808-4695693
>>