[BioC] (EdgeR) statistical justification of partitioning dataset for multiple analysis

Ryan rct at thompsonclan.org
Fri Jan 31 18:01:31 CET 2014

On Fri Jan 31 07:01:51 2014, Adriaan Sticker wrote:
> when all BCVs are more or less the same. Would you
> gain something by splitting the dataset or doesn't that make much
> sense statistically?

No, when all BCVs are consistent across treatments, you want to combine 
all of them into one dataset to get the most robust BCV estimates possible.
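
As a rough sketch of what the combined fit could look like (this is not code from the thread: the sample layout, the factor level names, the simulated counts and the object names are all made up for illustration), the design quoted below can be built once for all samples and the treatment comparisons pulled out with contrasts. The edgeR part only runs if edgeR is installed.

```r
# Hypothetical sample layout based on the description in this thread:
# 4 subjects x 3 treatments x 2 timepoints = 24 samples.
samples <- expand.grid(
  subject   = factor(paste0("s", 1:4)),
  Treatment = factor(c("control", "trt1", "trt2")),
  Time      = factor(c("t1", "t2"))
)

# Same model formula as in the original question, fitted to ALL samples.
design <- model.matrix(~ subject + Treatment + Time + Treatment:Time,
                       data = samples)

# Contrast for trt2 vs trt1. Because the model has a Treatment:Time
# interaction, this compares the two treatments at the reference
# timepoint ("t1") only.
trt2_vs_trt1 <- setNames(numeric(ncol(design)), colnames(design))
trt2_vs_trt1["Treatmenttrt1"] <- -1
trt2_vs_trt1["Treatmenttrt2"] <- 1

# Optional edgeR part, using simulated counts as a stand-in for real data.
if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  counts <- matrix(rnbinom(500 * nrow(samples), mu = 50, size = 10),
                   nrow = 500)
  y <- edgeR::DGEList(counts = counts)
  y <- edgeR::calcNormFactors(y)
  # One set of dispersion estimates from the full dataset.
  y <- edgeR::estimateDisp(y, design)
  if (interactive()) edgeR::plotBCV(y)  # inspect the BCV estimates
  fit <- edgeR::glmFit(y, design)
  lrt <- edgeR::glmLRT(fit, contrast = trt2_vs_trt1)
  print(edgeR::topTags(lrt, n = 5))
}
```

The other comparisons (each treatment vs control) would use the corresponding single coefficients in the same fit, so the dispersions are estimated only once from all 24 samples.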

> Best regards
> Adriaan
> 2014-01-30 Ryan <rct at thompsonclan.org>:
> Hi Adriaan,
>
> If I understand correctly, you have 3 different treatments, i.e.
> control, treatment 1, and treatment 2, and you have fit the same
> model formula to the full dataset as well as all 3 combinations of
> only 2 treatments, and you are getting significantly different
> results between the 3-treatment fit and the 2-treatment fits.
>
> I think the first thing you need to do is to look at the result of
> plotBCV for each analysis. It is possible that one of your
> treatments has significantly more biological variability across
> all genes than the others. edgeR assumes that each gene has the
> same BCV across all conditions, so that it can more robustly
> estimate a single dispersion value for each gene. So look at the
> plotBCV output from all your analyses, and see if the BCV
> estimates differ significantly. This would surely explain what you
> are seeing.
>
> You may also want to estimate dispersions from each
> treatment group individually (drop Treatment from the model
> formula in this case). The tagwise dispersions will not be very
> robust in this case, but the trend and common dispersions can help
> you figure out which treatment has the most biological variability.
>
> If the dispersion estimates don't explain your differing p-values,
> ask back here and maybe someone else will have another idea.
>
> Good luck,
> -Ryan
> On 1/30/14, 9:43 AM, Adriaan Sticker wrote:
> Dear all,
> I'm analysing already-mapped reads from sequencing data for
> differential expression with edgeR. My experimental setup is as
> follows: I have samples from 4 different subjects. Material from each
> subject was treated with 2 different treatments (and a control) at 2
> timepoints. I want to analyse the effect of the treatments (compared
> to control and compared to each other).
> In edgeR I used the following design:
> model.matrix(~ subject + Treatment + Time + Treatment:Time)
> I considered 2 strategies to analyse the data:
> Estimate parameters from the above design with all data (all
> samples) and use different contrasts to get the differentially
> expressed genes I want.
> OR
> Use only the samples of the two treatments I want to compare (e.g.
> control vs treatment 1, treatment 1 vs treatment 2) to fit the
> parameters. Repeat this 3 times until I have compared all 3
> treatments with each other. So actually 3 different analyses, each
> using only a subset (2/3) of the data.
> I noticed that I could find considerably more significant
> differentially expressed genes between 2 treatments with the latter
> approach. But I wondered how correct this approach is. Will I have,
> for example, problems with multiple testing? (I control each
> analysis at an FDR of 5% with Benjamini-Hochberg.)
> Thanks in advance,
> Kind regards