[BioC] EdgeR GLM
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Dec 18 06:24:48 CET 2011
Dear Shona,
These all give the same result as my suggestion. It doesn't matter how
you express the three ages, as long as they are equally spaced, you will
get the same differential expression results. Not just similar, but
exactly the same p-values. The coefficient logFC will change depending on
how you scale the age. If you multiply the ages by 2, the logFC will
divide by 2. Etc.
Best wishes
Gordon
On Fri, 16 Dec 2011, Wood, Shona wrote:
Dear Gordon,
Thank you for your suggestions I think I may have a solution though I
would like to run it by you and anyone else who uses R. I was reading a
thread on the mailing list called: "[R] Behavior of ordered factors in
glm" and it seems this person has different ages they want to do a linear
analysis on (not in edgeR just R). Looking at there solution I have come
up with the following:
> library(edgeR)
> library(limma)
> targets<-read.delim (file="coding_targets.txt")
> targets$age<-C(targets$age,poly,1)
> attr(targets$age,"contrasts")
.L
a-young -7.071068e-01
b-middle -7.850462e-17
c-old 7.071068e-01
> targets$sample<-factor(targets$sample)
> targets
X files age sample
1 p1 p1_coding_counts.txt a-young 1
2 p2 p2_coding_counts.txt a-young 2
3 p3 p3_coding_counts.txt a-young 3
4 p7 p7_coding_counts.txt b-middle 7
5 p8 p8_coding_counts.txt b-middle 8
6 p9 p9_coding_counts.txt b-middle 9
7 p4 p4_coding_counts.txt c-old 4
8 p5 p5_coding_counts.txt c-old 5
9 p6 p6_coding_counts.txt c-old 6
> d<-readDGE(targets, skip=1, comment.char='#')
> colnames(d)<-row.names(targets)
> d<- calcNormFactors(d)
> d
An object of class "DGEList"
$samples
X files age sample lib.size norm.factors
1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724
2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573
3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044
4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660
5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893
6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573
7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553
8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109
9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077
$counts
1 2 3 4 5 6 7 8 9
ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960
ENSRNOG00000000008 0 0 0 0 1 0 0 1 0
ENSRNOG00000000009 0 0 0 3 0 0 1 0 0
ENSRNOG00000000010 38 405 50 18 105 405 372 42 282
ENSRNOG00000000012 0 0 0 0 0 0 0 0 0
22932 more rows ...
> d<-d[rowSums(d$counts)>9,]
> design<- model.matrix(~ age, data = targets)
> design
(Intercept) age.L
1 1 -7.071068e-01
2 1 -7.071068e-01
3 1 -7.071068e-01
4 1 -7.850462e-17
5 1 -7.850462e-17
6 1 -7.850462e-17
7 1 7.071068e-01
8 1 7.071068e-01
9 1 7.071068e-01
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$age
.L
a-young -7.071068e-01
b-middle -7.850462e-17
c-old 7.071068e-01
> d<-estimateGLMCommonDisp(d, design)
> glmfit<- glmFit(d, design, dispersion=d$common.dispersion)
> results<- glmLRT(d, glmfit, coef=c(2))
> topTags(results)
Coefficient: age.L
logConc logFC LR P.Value FDR
ENSRNOG000000xxxxx -11.271529 -1.9336775 55.45465 9.564007e-14 1.453825e-09
ENSRNOG000000xxxxx -10.359443 1.2661223 49.11802 2.410161e-12 1.831843e-08
ENSRNOG000000xxxxx -11.317494 -1.5480925 47.35359 5.926938e-12 3.003180e-08
Please correct me if I am wrong but I think that this is the change in
gene expression with increasing age taking into account expression at
middle age. I did also have another way of doing it which involved
changing age to a number as suggested by you:
targets$agenum<-as.numeric (targets$age)
targets$agenum<-targets$agenum-1
> design<- model.matrix(~ agenum, data = targets)
> design
(Intercept) agenum
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 1 1
7 1 2
8 1 2
9 1 2
attr(,"assign")
[1] 0 1
And this gives me similar results. Please let me know if you think this is
doing what I hope it is i.e. giving my the linear change in gene
expression with increasing age whilst taking into account middle age.
Many thanks
Shona
________________________________________
From: Gordon K Smyth [smyth at wehi.EDU.AU]
Sent: 15 December 2011 22:43
To: Wood, Shona
Cc: Bioconductor mailing list
Subject: Re: EdgeR GLM
Dear Shona,
One last suggestion. Try typing
contrasts(age)
This tells you how the coefficients are determined.
I would actually suggest you simply put age in as a numericl column
(1=young,2=middle,3=old) into your design matrix, instead of defining an
ordinal factor.
Best wishes
Gordon
---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Fri, 16 Dec 2011, Gordon K Smyth wrote:
> Dear Shona,
>
> Of course linear regression would use all the data points if you had 50 ages,
> but you have in fact only three.
>
> If there are an odd number of equally spaced ages, the middle one will not
> contribute to the coefficient. That's just how the math works.
>
> Polynomial regression isn't an invention of the edgeR package, and how it
> works in R is determined by the base functions in R, not those in edgeR. If
> you have further questions about linear or quadratic regression itself, it
> might be best to read separately, or else to ask for help on a more general
> list, probably the main R help list.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Thu, 15 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me and giving me those explanations.
>
> The only thing that is still confusing me is the linear coefficient. If it
> compares young and old without regard to middle isn't it basically the same
> as a pairwise comparison? Also it seems strange that it only uses the two
> extremities - I thought linear regression should use all the data points. For
> example if we had 50 different ages would 48 data points be excluded to
> compare the first and last only?
>
> Thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 13 December 2011 23:18
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: RE: EdgeR GLM
>
> Dear Shona,
>
> I can only give you brief pointers.
>
> There are only three time points in your data. The linear coefficient in
> the model simply compares old with young, without regard to middle. The
> quadratic term compares middle with the average of young and old. So if
> expression goes 2-4-2 for a particular gene over the three times then you
> get a positive quadratic coefficient and zero linear coefficient.
>
> It is perfectly possible for the likelihood ratio test of both linear
> and quadratic together:
>
> glmLRT(d, glmfit, coef=2:3)
>
> to detect some genes not in the separate linear or quadratic lists,
> because the linear or quadratic terms might be not quite significant
> separately, but might be so when considered together. For example, the
> expression might go 2-4-3. Is this linear or quadratic? Not quite sure.
> But we might be confident that the three times are not all equal.
>
> If you only want to report genes that have increasing expression with age,
> then I would have thought you want to fit the linear term only, since the
> quadratic effect can never be guaranteed to be increasing.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Tue, 13 Dec 2011, Wood, Shona wrote:
>
> Dear Gordon,
>
> Thank you for getting back to me. I am not a statistician so forgive me if
> I am asking basic questions. I was just reading about quadratic equations.
> Does a significant hit in the quadratic term mean that the rate of change
> first increases and then decreases over time (or visa versa)? Therefore
> hits which require the quadratic term are not a linear up or down with age
> they are only "linear" after middle age?
>
> If this is the case what does the logFC for age.Q represent, is it the
> fold change between middle age and old? And how is it calculated? Does the
> gene expression at young age affect the quadratic calculation? For example
> if the expression at young age is 2 and then 4 at middle and then 1.99 at
> old would the fact that old age is basically back to the young expression
> level stop it being a significant quadratic result or would it be
> significant because of the change from middle age.
>
> I have done the quadratic term separately as suggested but I also did the
> linear term separately (glmLRT(d, glmfit, coef=2)). When I compare the
> results of both these tests to glmLRT(d, glmfit, coef=2, 3) there are some
> significant genes in the glmLRT(d, glmfit, coef=2,3), that aren't in
> either of the separate quadratic or linear tests - why would this be?
>
> Finally if i want to report changes with increasing age would it be best
> to present the results from the quadratic and linear tests separately or
> would it be better to present the combined test (glmLRT(d, glmfit, coef=2,
> 3)) and point out which ones rely on the quadratic term?
>
> Many thanks
>
> Shona
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 10 December 2011 07:10
> To: Wood, Shona
> Cc: Bioconductor mailing list
> Subject: EdgeR GLM
>
> Dear Shona,
>
> Yes, age.L is the log2-fold-change for increasing age.
>
> However your topTags table is testing for the linear and quadratic terms
> together, so it is testing the significance of any differences between the
> three age groups. I'd suggest you test whether the quadratic term is
> required using
>
> glmLRT(d, glmfit, coef=3)
>
> I'd also generally recommend you go on to estimate the trended and tagwise
> dispersions, not just the common dispersion.
>
> Best wishes
> Gordon
>
>> Date: Thu, 8 Dec 2011 17:41:07 +0000
>> From: "Wood, Shona" <Shona.Wood at liverpool.ac.uk>
>> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>> Subject: [BioC] EdgeR GLM
>>
>> Hi,
>>
>> I am trying to look at gene expression changes with increasing age. I
>> have three time points young, middle age and old. I have been using the
>> following, ordered factors and coefficient in order to get a fold change
>> and FDR for a linear change in gene expression with increasing age. Can
>> you look over my code and see if I have done this right? Is age.L the
>> fold change in gene expression with increasing age?
>>
>>> library(edgeR)
>>> library(limma)
>>> targets<-read.delim (file="coding_targets.txt")
>>> targets$age<-factor(ordered (targets$age))
>>> targets$sample<-factor(targets$sample)
>>> targets
>> X files age sample
>> 1 p1 p1_coding_counts.txt a-young 1
>> 2 p2 p2_coding_counts.txt a-young 2
>> 3 p3 p3_coding_counts.txt a-young 3
>> 4 p7 p7_coding_counts.txt b-middle 7
>> 5 p8 p8_coding_counts.txt b-middle 8
>> 6 p9 p9_coding_counts.txt b-middle 9
>> 7 p4 p4_coding_counts.txt c-old 4
>> 8 p5 p5_coding_counts.txt c-old 5
>> 9 p6 p6_coding_counts.txt c-old 6
>>
>>> d<-readDGE(targets, skip=1, comment.char='#')
>>> colnames(d)<-row.names(targets)
>>> d<- calcNormFactors(d)
>>> d
>> An object of class "DGEList"
>> $samples
>> X files age sample lib.size norm.factors
>> 1 p1 p1_coding_counts.txt a-young 1 3445622 0.9655724
>> 2 p2 p2_coding_counts.txt a-young 2 2696547 0.9902573
>> 3 p3 p3_coding_counts.txt a-young 3 3308099 0.9787044
>> 4 p7 p7_coding_counts.txt b-middle 7 2503479 1.0584660
>> 5 p8 p8_coding_counts.txt b-middle 8 2639127 1.0477893
>> 6 p9 p9_coding_counts.txt b-middle 9 2696547 0.9902573
>> 7 p4 p4_coding_counts.txt c-old 4 3037440 0.9778553
>> 8 p5 p5_coding_counts.txt c-old 5 2647915 1.0144109
>> 9 p6 p6_coding_counts.txt c-old 6 2475370 0.9809077
>>
>> $counts
>> 1 2 3 4 5 6 7 8 9
>> ENSRNOG00000000007 1287 1285 1041 788 968 1285 1092 1009 960
>> ENSRNOG00000000008 0 0 0 0 1 0 0 1 0
>> ENSRNOG00000000009 0 0 0 3 0 0 1 0 0
>> ENSRNOG00000000010 38 405 50 18 105 405 372 42 282
>> ENSRNOG00000000012 0 0 0 0 0 0 0 0 0
>> 22932 more rows ...
>>
>>> d<-d[rowSums(d$counts)>9,]
>>> design<- model.matrix(~ age, data = targets)
>>> design
>> (Intercept) age.L age.Q
>> 1 1 -7.071068e-01 0.4082483
>> 2 1 -7.071068e-01 0.4082483
>> 3 1 -7.071068e-01 0.4082483
>> 4 1 -7.850462e-17 -0.8164966
>> 5 1 -7.850462e-17 -0.8164966
>> 6 1 -7.850462e-17 -0.8164966
>> 7 1 7.071068e-01 0.4082483
>> 8 1 7.071068e-01 0.4082483
>> 9 1 7.071068e-01 0.4082483
>> attr(,"assign")
>> [1] 0 1 1
>> attr(,"contrasts")
>> attr(,"contrasts")$age
>> [1] "contr.poly"
>>
>>> d<-estimateGLMCommonDisp(d, design)
>>> names(d)
>> [1] "samples" "counts" "common.dispersion"
>>> glmfit<- glmFit(d, design, dispersion=d$common.dispersion)
>>> results<- glmLRT(d, glmfit, coef=c(2,3))
>>> topTags(results)
>> Coefficient: age.L age.Q
>> logConc age.L age.Q LR P.Value
>> ENSRNOG00000011672 -11.27122 -3.2445886 -1.9642177 109.53641 1.638591e-24
>>
>> Many Thanks
>>
>> Shona
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:9}}
More information about the Bioconductor
mailing list