[BioC] Low number of replicates DESeq
Steve Lianoglou
lianoglou.steve at gene.com
Tue Feb 25 06:34:03 CET 2014
Hi,
On Mon, Feb 24, 2014 at 5:18 PM, Federico Gaiti <f.gaiti at uq.edu.au> wrote:
> Thanks Steve -- I'll have a look at the DESeq2 vignette.
>
> Here is what I've done, let me know if it's still unclear
>
> samtools view TOPHAT_STRANDED_sorted.bam | python -m HTSeq.scripts.count -s no - gtfFile > stranded_counts_noS.txt
>
> head(Counts)
>
> 1.1 1.2 1.3 1.4 2.1 2.2 2.3 2.4 3.1 3.2 3.3 3.4 4.1 4.2 4.3 4.4
>
> XXXX 9 0 24 48 30 5 1 1 21 15 8 6 28 28 27 47
>
> XXXX 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
> XXXX 16 0 0 0 19 4 0 0 40 1 2 3 78 5 5 7
>
> XXXX 0 8 0 7 5 5 19 7 14 4 4 7 9 4 12 1
> ......
>
> and then in R:
>
> data<-read.table("Counts",header=TRUE,row.names=1)
> head(data)
> Design=data.frame(
> row.names=colnames(data),
> condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4','4','4'))
> cds=newCountDataSet(data,Design)
> cds=estimateSizeFactors(cds)
> sizeFactors(cds)
> All_samples_normalized<-counts(cds,normalized=TRUE)
> table(rowSums(All_samples_normalized)==0)
> data<-All_samples_normalized[rowSums(All_samples_normalized)>0,]
> data_subset_matrix<-as.matrix(data)
> Spearman_normalized<-rcorr(data_subset_matrix,type="spearman")
> A<-Spearman_normalized$r
> write.table(A,file="Spearman_normalized_allsamples.txt")
> Pearson_normalized<-rcorr(data_subset_matrix,type="pearson")
> B<-Pearson_normalized$r
> write.table(B,file="Pearson_normalized_allsamples.txt")
> A_matrix<-as.matrix(A)
> B_matrix<-as.matrix(B)
> corrgram(B_matrix,order="PCA")
> corrgram(A_matrix,order="PCA")
> pca3=PCA(A_matrix,graph=TRUE)
> pca3=PCA(B_matrix,graph=TRUE)
>
> This gave me a nice correlation but I'd still like to to use the stranded counts for the DGE.
Thanks for pasting the code now -- without the images or results from
your output, it's hard to see what is reasonably correlated or not,
but let's continue ...
I see you're already doing PCA, and it would be interesting to see how
your results compare to what you get by following along with the
DESeq2 vignette and shooting your data through one of the variance
stabilizing transforms (incidentally, the DESeq vignette also has a
similar section, and it would have been worth while to browse that
vignette and follow the advice outlined there as well)
> The issue is that if I only use stranded data I don't have replicates. If I include the 3 unstranded replicates I need to use the option -s no for the stranded data because otherwise stranded and unstranded do not correlate.
Are you saying that if you use the stranded data but do the counting
as if it were unstranded, it looks "just fine," but counting the
stranded data as stranded data makes it look quite bad?
Interesting ....I've actually never used stranded RNAseq data before,
but that's a bit surprising ...
What if you only consider a subset of genes that are in regions that
do not have annotated genes on the opposite strands -- the data for
these genes from the stranded and unstranded should be quite similar,
no? How do these look?
> So my ideas was to use the unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values.
But why? If you're just doing "vanilla DGE" and you have data that you
feel like you can use (the unstranded data) in triplicate, and a
source of singlicate data (the stranded one), why do you want to
torture some DGE analysis by cramming the stranded data through it
assuming <something> from the unstranded.
If you really want to use the stranded data, I'd encode the "library
type" as a covariate in the linear model while doing differential
expression. Both the DESeq and DESeq2 vignettes have a section on
multi-factorial designs where they incorporate data from single and
paired-end runs to perform a DGE analysis -- the way you would combine
unstranded and stranded counts for a gene would be rather similar.
> Would it be possible? Or would it be better to stick to the -s no option for the DGE?
To be honest, if I had a set of data that had three replicates per
condition, and one set of data that had "singlicate" samples in a
condition which came from an apparently wildly different protocol, I'd
probably just use the triplicate data and forget about the outlier
data if I'm just doing "vanilla DGE".
If the stranded data is giving you wildly different results, the
simplest explanation might be that something may have gotten screwed
up in the library prep -- but you have no way to tell since you have
no replicates of those data.
As I said, you can include it in your linear model by using a
multi-factorial design if you *really* want to use it, but why bother
for a vanilla DGE analysis?
On the other hand, if you're using the stranded data to do something
more than just vanilla DGE -- perhaps you're interested in quantifying
anti-sense transcription, then I'd try to do a lot more exploratory
analysis to see how I could explain large difference in quantitation
that I'm seeing. I'd still be really pissed that I had no replicates,
but it wouldn't be the first time a graduate student is stuck with an
underpowered dataset that they are tasked to torture, and it sadly
won't be the last ... still, in this situation, I'd go the
multi-factorial design approach to combine both datasets for any
analysis I'm asked to do. In the meantime, I'd also complain to my
advisor/collaborator whatever to twist their arm enough to run
replicates for the future.
HTH,
-steve
--
Steve Lianoglou
Computational Biologist
Genentech
More information about the Bioconductor
mailing list