[BioC] detecting oscillating genes from RNA-seq data

Gordon K Smyth smyth at wehi.EDU.AU
Sun Apr 6 12:56:02 CEST 2014


On Sun, 6 Apr 2014, mali salmon wrote:

> Thank you so much Prof. Smyth for your reply.
> I have been trying to follow your suggestions, and encountered an error
> when trying to run estimateGLMTagwiseDisp.
>
> Below is the code I used and the design matrix.
>
> Many thanks
> Mali
>
>> library(edgeR)
>> x <- read.delim("count.matrix", row.names=1, stringsAsFactors=FALSE)
>> x<-round(x) // since I have RSEM values
>> y <- DGEList(counts=x)
>> keep <- rowSums(cpm(y)>1) >= 4
>> y <- y[keep,]
>> y$samples$lib.size <- colSums(y$counts)
>> y <- calcNormFactors(y)
>> time <- rep(seq(2,22,4),4)
>> s1 <- sin(2*pi*time/24)
>> c1 <- cos(2*pi*time/24)
>> tide<-c(1,0,0.5,1,0,0.5,0.5,1,0,0.5,1,0.5,1,0,0.5,1,0.5,0.5,0.5,1,0,0.5,1,0) *//
> I used 1 for high tide, 0 for low tide, and 0.5 in-between (rising or
> falling)*

No, this isn't going to work, for several reasons.

First, you have equated low and high tide as the same value, because a 
periodic function returns to the same value at 1 as it had at 0. You need 
to do instead what I suggested, which was to use 0.5 for low tide and 0 or 
1 (makes no difference which) for high tide.

Second, you can't simply put every in-between stage to the same value. 
This is far, far too rough. The rising and falling stages are not the 
same. The successive values for 'tide' have to be at intervals of 
4hr/12.4hr = 0.323 for observations that are four hours apart.

Third, you have coded the second week as being in exactly the same tidal 
phase as for the first week at each clock time.  This cannot be true, and 
is again far too rough.  If the tidal cycle has a period of 12.4 hours, 
then there must be a gradual phase shift in the tidal cycle by about 
0.8/12.4 each day.

You have to code the tidal values more precisely, and you have to 
determine accurately the phase change of the tidal cycle between the first 
week and the second.  For the analysis to have any hope of working you 
have to be able to interpolate all the tidal values accurately to at least 
one or two decimal places.  If you can't do it as accurately as that, then 
it won't be worth doing it all.

It may also be good idea to consult a mathematician or statistician at 
your university to help you interpret the results.

Good luck
Gordon

>> tides <- sin(2*pi*tide)
>> tidec <- cos(2*pi*tide)
>> design <- model.matrix(~s1+c1+tides+tidec)
>
>> design
>   (Intercept)   s1            c1         tides tidec
> 1            1  0.5  8.660254e-01 -2.449294e-16     1
> 2            1  1.0  6.123234e-17  0.000000e+00     1
> 3            1  0.5 -8.660254e-01  1.224647e-16    -1
> 4            1 -0.5 -8.660254e-01 -2.449294e-16     1
> 5            1 -1.0 -1.836970e-16  0.000000e+00     1
> 6            1 -0.5  8.660254e-01  1.224647e-16    -1
> 7            1  0.5  8.660254e-01  1.224647e-16    -1
> 8            1  1.0  6.123234e-17 -2.449294e-16     1
> 9            1  0.5 -8.660254e-01  0.000000e+00     1
> 10           1 -0.5 -8.660254e-01  1.224647e-16    -1
> 11           1 -1.0 -1.836970e-16 -2.449294e-16     1
> 12           1 -0.5  8.660254e-01  1.224647e-16    -1
> 13           1  0.5  8.660254e-01 -2.449294e-16     1
> 14           1  1.0  6.123234e-17  0.000000e+00     1
> 15           1  0.5 -8.660254e-01  1.224647e-16    -1
> 16           1 -0.5 -8.660254e-01 -2.449294e-16     1
> 17           1 -1.0 -1.836970e-16  1.224647e-16    -1
> 18           1 -0.5  8.660254e-01  1.224647e-16    -1
> 19           1  0.5  8.660254e-01  1.224647e-16    -1
> 20           1  1.0  6.123234e-17 -2.449294e-16     1
> 21           1  0.5 -8.660254e-01  0.000000e+00     1
> 22           1 -0.5 -8.660254e-01  1.224647e-16    -1
> 23           1 -1.0 -1.836970e-16 -2.449294e-16     1
> 24           1 -0.5  8.660254e-01  0.000000e+00     1
>
>
> y <- estimateGLMCommonDisp(y,design)
> y <- estimateGLMTrendedDisp(y,design)
> y <- estimateGLMTagwiseDisp(y,design)
>
> *Error in dispCoxReidInterpolateTagwise(y, design, offset = offset,
> dispersion,  : *
> *  design matrix must be full column rank*
>
>
>> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] edgeR_3.4.2   limma_3.18.12
>
> loaded via a namespace (and not attached):
> [1] tools_3.0.2
>
>
> On Fri, Apr 4, 2014 at 2:19 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>>
>> On Thu, 3 Apr 2014, mali salmon wrote:
>>
>>  Dear Prof. Smith
>>> Thank you so much for the explanation of how to do the analysis.
>>> If you don't mind I'll describe the design of our experiment just to be
>>> sure that I'm not doing any mistake.
>>> We work on a sea animal, and we are interested with day/night oscillating
>>> genes (24 hours cycle) and with tidal genes (12.4 h cycle).
>>> In the first week we have sacrified animals every 4 hours across 48 h, and
>>> then repeated the same procedure the week after.
>>> So in total we have 24 samples, 4 "replicates" of 6 time points
>>> (2am,6am,10am,2pm,6pm,10pm)
>>> From your mail I understand that time is a vector of time points in hours,
>>> but I'm not sure how to define it, it can be:
>>> 1) time <- rep(seq(2,46,4),2)
>>> 2) time <- seq(2,94,4)
>>> 3) time <- rep(seq(2,22,4),4)
>>>
>>
>> Makes no difference.  All three choices will give exactly the same results.
>>
>>
>>  I saw this matter for the cos vector.
>>> I suppose that for detecting tidal genes I just have to repeat the
>>> analysis
>>> with 12.4 cycle.
>>>
>>
>> You need to carefully identify the tidal stage applicable to each of the
>> 24 samples.  Clearly the tidal phase will have moved in the second week.
>> For example you might code 'tide' to be between 0 and 1, with 0/1 for high
>> tide and 0.5 for low tide.  Then:
>>
>>   tides <- sin(2*pi*tide)
>>   tidec <- cos(2*pi*tide)
>>
>> Then you need to fit both day/night and tidal cycles to your data at the
>> same time:
>>
>>   ~s1+c1+tides+tidec
>>
>> It may be difficult to distinguish a tidal cycle from a 12-hour cycle. You
>> might try to check check this by:
>>
>>   ~s1+c1+s2+c2+tides+tidec
>>
>> Gordon
>>
>>
>>  Thanks again
>>> Mali
>>>
>>>
>>> On Thu, Apr 3, 2014 at 6:16 AM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>>>
>>>  Dear Mali,
>>>>
>>>> It is easy to do such an analysis directly in edgeR.
>>>>
>>>> Suppose for example that you are looking for a daily cycle, and that you
>>>> have at least half a dozen time points spanning more than a full day.
>>>>  The
>>>> method is to create basis vectors for a periodic expression pattern.
>>>> Suppose that 'time' is time in hours:
>>>>
>>>>   s1 <- sin(2*pi*time/24)
>>>>   c1 <- cos(2*pi*time/24)
>>>>   design <- model.matrix(~s1+c1)
>>>>
>>>> This will give a design matrix with three columns: one for the intercept
>>>> and two for the periodic signal.  To test for a periodic cycle:
>>>>
>>>>   fit <- glmFit(y,design)
>>>>   lrt <- glmLRT(fit,coef=2:3)
>>>>
>>>> If you have lots of time points and you want to make a more complex curve
>>>> you can add a harmonic:
>>>>
>>>>   s2 <- sin(4*pi*time/24)
>>>>   c2 <- cos(4*pi*time/24)
>>>>   design <- model.matrix(~s1+c1+s2+c2)
>>>>
>>>> There are now four coefficients parametrizing the periodic curve.  (These
>>>> can be viewed as representing amplitudes and phase shifts for the two
>>>> harmonics.)  I think you would seldom want more than two harmonics for
>>>> RNA-seq data.
>>>>
>>>> This method can easily be adapted to find cell-cycle genes and so on.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>>
>>>>  Date: Tue, 1 Apr 2014 14:33:27 +0200
>>>>
>>>>> From: mali salmon <shalmom1 at gmail.com>
>>>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>>>> Subject: [BioC] detecting oscillating genes from RNA-seq data
>>>>>
>>>>> Hello List
>>>>> I have RNA-seq data from different time points and I would like to find
>>>>> oscillating genes.
>>>>> I thought of using the "cycle" package (which is based on Fourier
>>>>> score) ,
>>>>> but I'm not sure what values to use: FPKM or DESeq/edgeR normalized
>>>>> values.
>>>>> Any suggestion what would be more appropriate?
>>>>> Thanks
>>>>> Mali
>>>>>
>>>>>
>>>> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely for the
>>>> addressee.
>>>> You must not disclose, forward, print or use it without the permission of
>>>> the sender.
>>>> ______________________________________________________________________
>>>>
>>>>
>>>
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
>>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list