[BioC] (EdgeR) statistical justification of partitioning dataset for multiple analysis

Adriaan Sticker adriaan.sticker at gmail.com
Fri Jan 31 16:01:51 CET 2014

Thanks for your input. I did as you suggested.
For all treatment groups combined I got a common BCV of 0.08.

When I split my dataset into the 3 treatment groups and calculate the
BCV for each separately, I get these common BCVs:
control: 0.081   treatment1: 0.085   treatment2: 0.096

When I split the data for each analysis, I get these common BCVs:
control + treat1: 0.078     control + treat2: 0.084     treat1 + treat2:

So it seems that treatment 2 has some extra BCV compared to the others, but
these differences are not so big when you look at each analysis for
treatment comparison. I also don't think the BCVs for each analysis look
much different when you look at the BCV plots themselves (attached).
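
A minimal sketch of how common BCVs like these can be computed in edgeR for
the full dataset and for each subset. The counts are simulated stand-ins
for the real data, and all object names (samples, counts, common_bcv) and
simulation settings are assumptions made for illustration only.

```r
library(edgeR)

set.seed(1)

# Hypothetical sample sheet: 4 subjects x 3 treatments x 2 timepoints
samples <- expand.grid(
  subject   = paste0("s", 1:4),
  Treatment = c("control", "treat1", "treat2"),
  Time      = c("t1", "t2")
)

# Simulated negative binomial counts (true dispersion 0.01, i.e. BCV 0.1)
n_genes <- 1000
mu <- rexp(n_genes, rate = 1 / 200) + 20
counts <- matrix(
  rnbinom(n_genes * nrow(samples), mu = mu, size = 1 / 0.01),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

# Estimate the common BCV from the samples selected by `keep`
common_bcv <- function(keep, formula) {
  sub <- droplevels(samples[keep, ])
  y <- DGEList(counts = counts[, keep])
  y <- calcNormFactors(y)
  design <- model.matrix(formula, data = sub)
  y <- estimateDisp(y, design)
  sqrt(y$common.dispersion)  # BCV is the square root of the dispersion
}

full <- ~ subject + Treatment + Time + Treatment:Time
one  <- ~ subject + Time  # Treatment dropped for single-group subsets

bcv <- c(
  all            = common_bcv(rep(TRUE, nrow(samples)), full),
  control        = common_bcv(samples$Treatment == "control", one),
  treat1         = common_bcv(samples$Treatment == "treat1", one),
  treat2         = common_bcv(samples$Treatment == "treat2", one),
  control_treat1 = common_bcv(samples$Treatment != "treat2", full),
  control_treat2 = common_bcv(samples$Treatment != "treat1", full),
  treat1_treat2  = common_bcv(samples$Treatment != "control", full)
)
print(round(bcv, 3))
```

Calling plotBCV on the object returned by estimateDisp gives the
corresponding BCV plot for each subset.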

I have to revise my statement about finding more genes after splitting the
dataset compared to an analysis on the full dataset.
I find more genes (almost double) for treatment 1 vs control when I split
the dataset.
I find fewer genes (almost half) for treatment 2 vs control when I split the
dataset.
I find more or fewer genes (depending on which timepoint you look at) for
treatment 2 vs treatment 1 when I split the dataset.
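
A minimal sketch of how the two strategies can be compared for one
comparison (treatment 1 vs control), counting genes at a Benjamini-Hochberg
FDR of 5%. The counts are simulated without any true differential
expression, so the numbers it prints say nothing about the real data; the
object names (samples, counts, fit_model, n_sig) are assumptions made for
illustration only.

```r
library(edgeR)

set.seed(2)

# Hypothetical sample sheet: 4 subjects x 3 treatments x 2 timepoints
samples <- expand.grid(
  subject   = paste0("s", 1:4),
  Treatment = c("control", "treat1", "treat2"),
  Time      = c("t1", "t2")
)

# Simulated counts standing in for the real data
n_genes <- 1000
mu <- rexp(n_genes, rate = 1 / 200) + 20
counts <- matrix(
  rnbinom(n_genes * nrow(samples), mu = mu, size = 100),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

# Fit the same model formula to the samples selected by `keep`
fit_model <- function(keep) {
  sub <- droplevels(samples[keep, ])
  design <- model.matrix(~ subject + Treatment + Time + Treatment:Time,
                         data = sub)
  y <- calcNormFactors(DGEList(counts = counts[, keep]))
  y <- estimateDisp(y, design)
  glmFit(y, design)
}

# Number of genes passing a Benjamini-Hochberg FDR of 5%
n_sig <- function(lrt) {
  sum(p.adjust(lrt$table$PValue, method = "BH") < 0.05)
}

# Strategy 1: fit on all samples, then test the coefficient of interest
fit_full <- fit_model(rep(TRUE, nrow(samples)))
lrt_full <- glmLRT(fit_full, coef = "Treatmenttreat1")

# Strategy 2: fit only on the control and treatment 1 samples
fit_split <- fit_model(samples$Treatment %in% c("control", "treat1"))
lrt_split <- glmLRT(fit_split, coef = "Treatmenttreat1")

cat("full dataset: ", n_sig(lrt_full), "genes\n")
cat("split dataset:", n_sig(lrt_split), "genes\n")
```

Because the model contains a Treatment:Time interaction, the coefficient
Treatmenttreat1 is the treatment 1 vs control effect at the first timepoint
only. The two fits differ in the samples used to estimate the dispersions
and in the residual degrees of freedom.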

This puzzles me a bit.

But in general, when all BCVs are more or less the same, would you gain
something by splitting the dataset, or doesn't that make much sense?

Best regards

2014-01-30 Ryan <rct at thompsonclan.org>:

> Hi Adriaan,
> If I understand correctly, you have 3 different treatments, i.e. control,
> treatment 1, and treatment 2, and you have fit the same model formula to
> the full dataset as well as all 3 combinations of only 2 treatments, and
> you are getting significantly different results between the 3-treatment fit
> and the 2-treatment fits. I think the first thing you need to do is to look
> at the result of plotBCV for each analysis. It is possible that one of your
> treatments has significantly more biological variability across all genes
> than the others. edgeR assumes that each gene has the same BCV across all
> conditions, so that it can more robustly estimate a single dispersion value
> for each gene. So look at the plotBCV output from all your analyses, and
> see if the BCV estimates differ significantly. This would surely explain
> what you are seeing. You may also want to estimate dispersions from each
> treatment group individually (drop Treatment from the model formula in this
> case). The tagwise dispersions will not be very robust in this case, but
> the trend and common dispersions can help you figure out which treatment
> has the most biological variability.
> If the dispersion estimates don't explain your differing p-values, ask
> back here and maybe someone else will have another idea.
> Good luck,
> -Ryan
> On 1/30/14, 9:43 AM, Adriaan Sticker wrote:
>> Dear all,
>> I'm analysing already-mapped reads from sequencing data for differential
>> expression with edgeR. My experimental setup is as follows:
>> I have samples from 4 different subjects. Material from each subject was
>> treated with 2 different treatments (and a control) at 2 timepoints.
>> I want to analyse the effect of the treatments (compared to control and
>> compared to each other).
>> In edgeR I used the following design:
>> model.matrix(~ subject + Treatment + Time + Treatment:Time)
>> I considered 2 strategies to analyse the data:
>> Estimate the parameters of the above design from all data (all samples)
>> and use different contrasts to get the differentially expressed genes I
>> want.
>> OR
>> Fit the parameters using only the samples of the two treatments I want to
>> compare (e.g. control vs treatment 1, treatment 1 vs treatment 2). Repeat
>> this 3 times until I have compared all 3 treatments with each other.
>> So actually 3 different analyses, each using only a subset (2/3) of the
>> data.
>> I noticed that I could find considerably more significantly differentially
>> expressed genes between 2 treatments with the last approach. But I
>> wondered how correct this approach is. Will I, for example, have problems
>> with multiple testing? (I control each analysis at an FDR of 5% with
>> Benjamini-Hochberg.)
>> Thanks in advance
>> Kind regards
Attachments (BCV plots; non-text attachments scrubbed from the archive):
bcv_all.png: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment.png>
bcv_control_treat1.png: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0001.png>
bcv_control_treat2.png: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0002.png>
bcv_treat1_treat2.png: <https://stat.ethz.ch/pipermail/bioconductor/attachments/20140131/c9780c8b/attachment-0003.png>
