[BioC] Low number of replicates DESeq

Steve Lianoglou lianoglou.steve at gene.com
Tue Feb 25 01:34:21 CET 2014


Hi,

Since you are just starting your analysis and are in the world of
DESeq, you should probably switch to DESeq2.

You mention "low correlation", but it's not clear which samples or
conditions you are comparing. Instead of describing your analysis at
a high level, showing the code that you used would be more helpful.

That having been said, the first thing I would do is to perform the
steps outlined in the DESeq2 vignette under the section "Data quality
assessment by sample clustering and visualization" to see if your
replicate data cluster closely together in meaningful ways using the
heatmaps and PCA plots outlined there.
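
The idea behind that check can also be sketched outside of R. The
snippet below is only an illustration, not DESeq2's implementation:
the function names and the toy counts are made up, size factors are
computed with the median-of-ratios approach, and log2(x + 1) is used
as a crude stand-in for the transformations the vignette recommends
before clustering.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch).

    counts maps sample name -> list of raw counts, with genes in the
    same order for every sample.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        column = [counts[s][g] for s in samples]
        if min(column) == 0:
            continue  # a zero in any sample makes the geometric mean zero
        log_geo_mean = sum(math.log(c) for c in column) / len(column)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][g]) - log_geo_mean)
    # Raises statistics.StatisticsError if every gene has a zero somewhere.
    return {s: math.exp(median(r)) for s, r in log_ratios.items()}


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def sample_correlations(counts):
    """Pairwise Pearson correlation of log2(normalized count + 1)."""
    sf = size_factors(counts)
    logged = {s: [math.log2(c / sf[s] + 1) for c in counts[s]] for s in counts}
    names = list(counts)
    return {(a, b): pearson(logged[a], logged[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Hypothetical toy counts: the second sample is the first at twice the depth.
toy = {
    "STAGE1_stranded": [10, 0, 30, 40, 5],
    "STAGE1_unstranded": [20, 0, 60, 80, 10],
    "STAGE2_unstranded": [40, 5, 10, 8, 90],
}

for pair, r in sample_correlations(toy).items():
    print(pair, round(r, 3))
```

If replicates of the same stage do not correlate better with each
other than with other stages after normalization, that is worth
sorting out before any DE testing.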

HTH,
-steve

On Mon, Feb 24, 2014 at 3:31 PM, Federico Gaiti <f.gaiti at uq.edu.au> wrote:
> Hi all,
>
> I am using DESeq for a DGE analysis.
>
> I have STRANDED RNA-Seq data for 4 developmental stages with no replicates, but I know that a reliable DGE analysis needs replicates. So I got (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.
>
> So my data situation at the moment is:
>
> STAGE 1     stranded
> STAGE 1.1  unstranded
> STAGE 1.2  unstranded
> STAGE 1.3  unstranded
> STAGE 2     stranded
> STAGE 2.1  unstranded
> STAGE 2.2  unstranded
> STAGE 2.3  unstranded
> STAGE 3     stranded
> STAGE 3.1  unstranded
> STAGE 3.2  unstranded
> STAGE 3.3  unstranded
> STAGE 4     stranded
> STAGE 4.1  unstranded
> STAGE 4.2  unstranded
> STAGE 4.3  unstranded
>
> Before doing a DGE analysis, I wanted to test the correlation between these samples, just to show that similar samples "cluster" together. If they did, I planned to use the unstranded data in my DGE analysis to reach a final number of 4 replicates per stage.
>
> I mapped the raw reads to the genome using TopHat (v2.0.9) (fr-unstranded for the unstranded data and fr-secondstrand for the stranded data) and used htseq-count (HTSeq 0.5.4p5) to get the raw read counts for both data sets. For the stranded data I used the option -s yes and for the unstranded data I used -s no. I then used DESeq (v1.14.0) to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calculated the correlation, which was really low.
>
> I tried the option -s reverse for the stranded data and still got really low correlation. So I reran htseq-count on the stranded data with the option -s no, and this way I got a very similar number of total counts for the unstranded and stranded data, around 4-5M counts per stage (whereas in both earlier runs the stranded totals were about double).
>
> I included the metadata
>
>
>> Design
>             condition
> ADULT        ADULT
> ADULT1       ADULT
> ADULT2       ADULT
> ADULT3       ADULT
> JUV            JUV
> JUV1           JUV
> JUV2           JUV
> JUV3           JUV
> COMP          COMP
> COMP1         COMP
> COMP2         COMP
> COMP3         COMP
> PRECOMP    PRECOMP
> PRECOMP1   PRECOMP
> PRECOMP2   PRECOMP
> PRECOMP3   PRECOMP
>
> and estimated the new size factors, normalized, and calculated the new correlation. This time the Pearson correlation was pretty good, which both a PCA and a correlogram confirmed. So my initial idea was to do a DGE analysis "treating" the stranded data as unstranded, giving 4 replicates per stage. However, I'd still like to find a way to use the stranded counts, since I am not sure whether I am losing information by running htseq-count with -s no on the stranded data.
>
>
> What I had in mind was using the unstranded data to estimate the level of variation, to get a threshold for DE detection, while still using the stranded data as expression values. I'm not sure I can do that, though, given that one is stranded and the other is not.
>
>
> I would like to hear from you if you have any thoughts about this.
>
>
> Let me know if you need any further details to better understand the issue.
>
>
> Thanks in advance,
>
> Federico
>
> Federico Gaiti
> Ph.D. Candidate
> School of Biological Sciences
> University of Queensland
> St Lucia QLD 4072
> Australia
> f.gaiti at uq.edu.au



-- 
Steve Lianoglou
Computational Biologist
Genentech
