[BioC] (EdgeR) statistical justification of partitioning dataset for multiple analysis

Thu Jan 30 20:03:00 CET 2014

Hi Adriaan,

If I understand correctly, you have 3 different treatments, i.e. 
control, treatment 1, and treatment 2, and you have fit the same model 
formula to the full dataset as well as to all 3 pairwise subsets of 2 
treatments, and you are getting substantially different results between 
the 3-treatment fit and the 2-treatment fits. I think the first thing 
you need to do is to look at the result of plotBCV for each analysis. It 
is possible that one of your treatments has substantially more 
biological variability across all genes than the others. edgeR assumes 
that each gene has the same BCV across all conditions, so that it can 
more robustly estimate a single dispersion value for each gene. If one 
treatment is much more variable than the others, it inflates that shared 
dispersion in the full fit, which costs power in comparisons between the 
less variable groups. So look at the plotBCV output from all your 
analyses, and see if the BCV estimates differ noticeably. If they do, 
that would likely explain what you are seeing. You may also want to 
estimate dispersions from each treatment group individually (drop 
Treatment and the Treatment:Time interaction from the model formula in 
this case). The tagwise dispersions will not be very robust in this 
case, but the trend and common dispersions can help you figure out which 
treatment has the most biological variability.
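
Something along these lines might be a starting point for that 
comparison. It is only a rough sketch: the `counts` matrix and `samples` 
table below are simulated stand-ins for your real data, `fit_disp` is a 
helper name I made up, and it assumes an edgeR version that provides 
estimateDisp (older versions would use estimateGLMCommonDisp, 
estimateGLMTrendedDisp and estimateGLMTagwiseDisp instead).

```r
library(edgeR)

# Simulated stand-in data (assumption): replace `counts` and `samples`
# with your own count matrix and sample annotation.
set.seed(1)
samples <- expand.grid(
  subject   = factor(paste0("S", 1:4)),
  Treatment = factor(c("control", "trt1", "trt2")),
  Time      = factor(c("t1", "t2"))
)
counts <- matrix(
  rnbinom(2000 * nrow(samples), mu = 100, size = 10),
  nrow = 2000,
  dimnames = list(paste0("gene", 1:2000), NULL)
)

# Hypothetical helper: estimate dispersions on a subset of samples
# (`keep` is a logical vector over columns) with the given model formula.
fit_disp <- function(keep, formula) {
  s <- droplevels(samples[keep, ])
  y <- DGEList(counts = counts[, keep])
  y <- calcNormFactors(y)
  design <- model.matrix(formula, data = s)
  estimateDisp(y, design)
}

# Full dataset with the full model from the original post.
y_all <- fit_disp(rep(TRUE, nrow(samples)),
                  ~ subject + Treatment + Time + Treatment:Time)
plotBCV(y_all)
title(main = "All samples")

# Each treatment group on its own: Treatment and its interaction are
# dropped from the formula because only one treatment level remains.
for (trt in levels(samples$Treatment)) {
  y_trt <- fit_disp(samples$Treatment == trt, ~ subject + Time)
  plotBCV(y_trt)
  title(main = trt)
  cat(trt, "common BCV:", sqrt(y_trt$common.dispersion), "\n")
}
```

With the simulated counts every group has the same underlying 
dispersion, so the plots should look alike. On real data, a group whose 
common BCV and trend sit clearly above the others would be the one to 
suspect.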

If the dispersion estimates don't explain your differing p-values, ask 
back here and maybe someone else will have another idea.

Good luck,

-Ryan

On 1/30/14, 9:43 AM, Adriaan Sticker wrote:
> Dear all,
>
> I'm analysing already-mapped reads from sequencing data for
> differential expression with edgeR. My experimental setup is as follows:
> I have samples from 4 different subjects. Material from each subject was
> treated with 2 different treatments (and a control) at 2 timepoints.
>
> I want to analyse the effect of the treatments (compared to control and
> compared to each other).
>
> In edgeR I used the following design:
> model.matrix(~ subject + Treatment + Time + Treatment:Time)
>
> I considered 2 strategies to analyse the data:
>
> Estimate parameters from the above design with all data (all samples)
> and use different contrasts to get the differentially expressed genes I want.
>
> OR
>
> Use only the samples of the two treatments I want to compare (e.g. control
> vs treatment 1, treatment 1 vs treatment 2) to fit the parameters. Repeat
> this 3 times until I have compared all 3 treatments with each other.
> So effectively 3 different analyses, each using only a subset (2/3) of the data.
>
> I noticed that I could find considerably more significantly differentially
> expressed genes between 2 treatments with the latter approach. But I wondered
> how correct this approach is. Will I, for example, have problems with
> multiple testing? (I control each analysis at FDR 5% with Benjamini-Hochberg.)
>
> Thanks in advance
> Kind regards