[BioC] DESeq2 with data containing 3 factors or more and outlier detection

Thu May 8 17:20:53 CEST 2014

hi Olga,

I CC the Bioconductor mailing list. Answers in line below.

On Thu, May 8, 2014 at 8:57 AM, solgakar at bi.technion.ac.il
<solgakar at bi.technion.ac.il> wrote:
>
>
> -          If I have data with three factors: condition, age and group.
>
> The levels of condition are: treated and untreated
>
> The levels of age are: young and adult
>
> The levels of group are: 1 and 2
>
> Suppose I want to receive the result for: treated-1 vs. treated-2 for age=young and the result for: treated-1 vs. treated-2 in age=adult.
>
> From what I know, there is no way to receive interaction between three factors at a time.

Yes, this is true for betaPrior=TRUE, the default. I only implemented
the shrinkage of effects for models with terms up to order two. There
is no limitation, however if you set betaPrior=FALSE.

>
> With that assumption, how can I receive these results? Is it better to separate the data into two separate sets: young and adult, and in each to receive the result   : treated-1 vs. treated-2.
>
> Or is there a better way to receive the results I want using only one full data set containing all the samples?
>
> I had an idea to combine the factor age and condition into one factor: age_condition (with levels young_treated, young_untreated, adult_treated, adult_untreated), since I am not interested in the interaction between those two.

Yes, this combining levels seems like the cleanest approach when users
have a lot of groups they are interested in comparing.

> Using there method, if I would call the results function using the "list" option in contrast with: "age_conditionyoung_treated.group1", "age_conditionyoung_treated.group2", will I receive the result I expect?

yes, this would work, but for the group 1 vs 2 effect for treated and
young, you need to also add in the main effect for group 1 vs 2:

results(dds, contrast=list(c("group1","age_conditionyoung_treated.group1"),c("group2","age_conditionyoung_treated.group2")

The contrast as you have it is for testing whether the 1 vs 2 effect
is different in young and treated samples than the main 1 vs 2 effect.

Another option, with a caveat. You might even consider combining all
three factors into, say a variable 'g', if you are interested in more
easily extracting the contrasts between groups, for example:

results(dds, contrast=c("g","young_treated_condition1","young_treated_condition2"))

The caveat is: it is easier for comparing groups, but doesn't provide
access to the interactions. Additionally, this option will give the
same contrasts of groups with betaPrior=FALSE, but the contrasts of
groups will not be exactly the same with this option and using a beta
prior. This is a modeling choice that the user must decided, whether
the fold changes for only only the interaction terms should be
moderated (using a design with interaction terms), or whether the fold
changes from each unique combination of level should be moderated
(here combining all factors into one factor).

> -          I refer to the outlier detection algorithm. is the detection of outlier Is done separately for each result or is it done once for an entire data set?

The outlier filtering is done for the entire gene, for simplicity of
implementation. However, the user can perform any filtering they chose
by setting cooksCutoff=FALSE, and then using the matrix of Cook's
distances which are stored in assays(dds)[["cooks"]]. For example:

mcols(dds)$myOutlierFlag <- rowSums( assays(dds)[["cooks"]][ , c(2,3,4) ] > 10 ]
res$pvalue[ mcols(dds)$myOutlierFlag ] <- NA
res$padj <- p.adjust( res$pvalue, method="BH" )

The above code sets a flag for genes when samples 2,3 and 4 have a
Cook's distance over 10. One can find their own cutoff by looking at
the Cook's values over the proportion that the maximum count
contributes to the total count for a row (here I use a very simple
unexported function):

plot( DESeq2:::propMaxOfTotal(counts(dds)[ , c(2,3,4) ] ),
apply(assays(dds)[["cooks"]][ , c(2,3,4) ] , 1, max) )

Mike

>
> Meaning, if one sample is set to be an outlier, will the gene receive "NA" in the pvalue and adjusted pvalue for all comparisons or only those that sample_1 is relevant to?
>
> I ask this question since if sample_1 is an outlier but I am interested in the comparison only regarding samples 2-6, I wouldn't want to lose the information for this gene in this comparison.
>
> -          Is there an easy way to detect which sample is the outlier for a specific gene in a specific comparison if I want to flag it in any way?
>
> Thank you very much,
>
> Olga Karinsky.
>
>
>
>
>
>