[BioC] Questions on doing EdgeR Analysis of Timeseries Data
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Jun 18 01:45:24 CEST 2011
Counts per million should have been:
cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size)
Gordon
On Fri, 17 Jun 2011, Gordon K Smyth wrote:
> Dear Mark,
>
> I'd suggest that you filter out genes that fail to achieve a reasonable count
> in any library. I can't give you a firm cutoff for this, it depends on the
> sequencing depth, and the analysis should not be sensitive to this anyway.
> In my lab, we have tended to keep genes that achieve at least 1 count per
> thousand reads in at least a minimum number of libraries. If y is a DGEList
> object, you might use
>
> cpm <- t(t(y$counts)/y$samples$lib.size)
> keep <- rowSums(cpm>1)>0
> y.filtered <- y[keep,]
>
> Since you have two replicates of one lifecycle, you could base your estimate
> of variability (the biological coefficient of variation) on these replicates.
>
> design <- model.matrix(~lifecycle)
> y.filtered <- estimateGLMCommonDisp(y.filtered,design)
>
> Perhaps also
>
> y.filtered <- estimateGLMTrendedDisp(y.filtered,design)
>
> Then
>
> fit <- glmFit(y.filtered,design)
>
> Then you could test for genes that change at all through the 6 lifecycles by
>
> lrt <- glmLRT(y.filtered,fit,coef=2:6)
> topTags(lrt)
>
> This will perform a likelihood ratio test for each gene on 5 degrees of
> freedom. Then you could examine the significant genes to see what patterns
> they show across the lifecycles:
>
> is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1
> library(limma)
> plotlines(lrt$coef[is.sig,],first.column.origin=TRUE)
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote:
>
>> Hi Gordon
>>
>> Thanks for getting back to me on this. I don't mind the transfer, whatever
>> leads to answers to my questions is fine by me.
>>
>> I am sorry that I forgot to classify the lifecycles of the samples. They
>> represent 6 distinct lifecycles (the first measured lifecycle has two
>> samples). Will this affect the prior.n you suggested?
>>
>> ~ Mark
>>
>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote:
>>
>>> Hi Mark,
>>>
>>> This is an analysis question rather than a devel suggestion, so I've
>>> transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind.
>>>
>>> If you have divided your samples into two groups, and you have 7 samples,
>>> then I recommend you set prior.n=4. See
>>>
>>> https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2011-June/002042.html
>>>
>>> for an explanation.
>>>
>>> I can't answer your other questions without knowing whether you have
>>> biological replicates in your experiment. You have 7 samples. Do these
>>> samples all correspond to distinct lifecycle stages, or are some stages
>>> observed more than once? How many stages are there?
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000
>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at virginia.edu>
>>>> To: "bioc-sig-sequencing at r-project.org"
>>>> <bioc-sig-sequencing at r-project.org>
>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of
>>>> Timeseries Data
>>>>
>>>> (I apologize if this message was sent more than once)
>>>>
>>>> Dear Analysis Gurus
>>>>
>>>> I am currently performing a gene expression analysis on a plant parasite.
>>>> I have mapped Illumina read counts for various stages in this parasites
>>>> lifecycle. Of interest for us in this analysis are genes that are
>>>> differentially expressed during these lifecycles. To determine this, I
>>>> have focused on two types of differential expression: "peaks" and
>>>> "cliffs." "Peaks" occur when a gene is differentially expressed in one
>>>> time sample (either higher or lower than the remaining samples) and
>>>> "cliffs" occur when a gene is differentially expressed between two groups
>>>> of sample (for instance higher expression in the first three samples than
>>>> the last three).
>>>>
>>>> To determine these peaks and cliffs, I have been creating groups in which
>>>> the desired peak/cliff is "case" and the remaining samples are "control."
>>>> I then run common dispersion and/or tagwise dispersion and extract those
>>>> reads with an FDR of less than 0.1. So, my questions:
>>>>
>>>> 1.) How much filtering of data should I do? Right now I have a fair
>>>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems
>>>> logical that I would filter out genes that have no expression, but at
>>>> what level should it stop? Also, should there be different filtering
>>>> depending on the analysis (peak or cliff)?
>>>>
>>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I
>>>> currently have 7 samples)? Does it depend on the type of analysis?
>>>>
>>>> 3.) Should I investigate using a more advanced glm based analysis? Any
>>>> advice on crafting a design for this?
>>>>
>>>> 4.) Any other ideas on analyses to perform on a set of timeseries data
>>>> with EdgeR?
>>>>
>>>> I greatly appreciate any help/advice and thank you in advance!
>>>>
>>>>
>>>> Mark J. Lawson, Ph.D.
>>>> Bioinformatics Research Scientist
>>>> Center for Public Health Genomics, UVA
>>>> mlawson at virginia.edu
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list