Hi Mike,

I am trying to make a heatmap with only a subset of differentially expressed genes. The previous email shows how I performed the DGE analysis.

I used:

select<-order(res$padj)[1:50]
heatmap.2(assay(rld)[select,],col=rev(heat.colors(25)),trace="none")

My 'res' result looks like this:

log2 fold change (MAP): condition ADULT vs PRECOMP
Wald test p-value: condition ADULT vs PRECOMP
DataFrame with 36909 rows and 6 columns
                 baseMean log2FoldChange     lfcSE
Aqu2.00001_001  3.5523018    -0.38899137 1.4906115
Aqu2.00002_001  0.2352661    -0.07056396 1.4687333
Aqu2.00003_001  1.1502763    -4.38453469 1.5809570
Aqu2.00004_001  7.3886827     0.17898271 0.6788353
Aqu2.00005_001 31.5022485     0.76676647 0.3843871
...                   ...            ...       ...
TCONS_00005299   48.85533      0.3962286 0.3862817
TCONS_00005300   18.57709      1.3389367 1.0457321
TCONS_00005301    1.42026     -3.2133525 1.5173331
TCONS_00005302   17.31190      0.5504013 0.6322061
TCONS_00005303   43.26359      0.5185803 0.5393246
                      stat      pvalue       padj
Aqu2.00001_001 -0.26096094 0.794122627 0.86811294
Aqu2.00002_001 -0.04804409 0.961681103         NA
Aqu2.00003_001 -2.77334224 0.005548374 0.01662316
Aqu2.00004_001  0.26366147 0.792040790 0.86680716
Aqu2.00005_001  1.99477700 0.046067207 0.09954042
...                    ...         ...        ...
TCONS_00005299   1.0257506  0.30500918 0.44484806
TCONS_00005300   1.2803821  0.20041078 0.32274512
TCONS_00005301  -2.1177633  0.03419512 0.07789263
TCONS_00005302   0.8706042  0.38397032 0.52699964
TCONS_00005303   0.9615364  0.33628255 0.47886499

As you can see, I have two different types of ID: Aqu2.XXXX_XXX and TCONS_XXXXXXXX

I'd like to "grep" only the top 50 TCONS_XXXXXXXX differentially expressed genes and generate the heatmap with those.

Do you have any suggestions on how to solve this?

Thanks in advance
Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Monday, 3 March 2014 10:51 PM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

hi Federico,

This is correct.
Mike

On Mon, Mar 3, 2014 at 5:15 AM, Federico Gaiti wrote:

Hi Mike,

I did the DGE with the multifactorial design combining stranded and unstranded data. Here it is:

Multifactorial (considering both stranded and unstranded data)

> head(CountTable)
            ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1        30      0     24     48  84    5    1    1   47    15     8     6      47       28       27       47
asmbl_10        0      0      0      0   0    0    0    0    0     0     0     0       0        0        0        0
asmbl_100       0      0      0      0   0    4    0    0    0     1     2     3       2        5        5        7
asmbl_1000      0      8      0      7   5    5   19    7   14     4     4     7       9        4       12        1
asmbl_10000    11     75      0     73 103   12  112   51   65    43    17    57      56       18       23       63
asmbl_10001     0      1      0      0   3   11    6    4    4     4    11     1       5        0        3        0

dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))

> resultsNames(dds)
[1] "Intercept"         "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP"     "conditionJUV"
[7] "conditionADULT"

Just to conclude, have I done anything wrong? Would you suggest any further/different analysis?

Thanks for all the help
Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Friday, 28 February 2014 12:11 PM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

Then I would recommend the multifactorial design, as it's the best you can do without stranded replicates. It will be underpowered for transcripts which are mostly made up of those positions which Simon described: "position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand".
Because for such transcripts, only the single replicate for the stranded experiments will contribute signal (if I am getting it correct this time).

On Thu, Feb 27, 2014 at 7:36 PM, Federico Gaiti wrote:

No worries at all. It's all a good exercise for me. I'm learning a lot just from this email exchange.

Back to the DGE and considering the situation, what would you recommend for the DGE on DESeq2?

Thanks again
Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Friday, 28 February 2014 10:27 AM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: RE: [BioC] Low number of replicates DESeq

Hi Federico,

Yes, Simon is right. Please ignore my previous email. Sorry for adding to the confusion.

Mike

On Feb 27, 2014 5:51 PM, "Federico Gaiti" wrote:

Hi Mike,

Thanks for the reply. I see what you mean but I'm a bit confused about htseq-count now. Please see also an open thread with Simon Anders where I'm discussing this in detail:
http://seqanswers.com/forums/showthread.php?p=133959&posted=1#post133959

Sorry for the crosspost. I just didn't know it's not a good idea. It's actually one of the first questions I have posted online so I'm still learning how all this works.

I am investigating lncRNAs, which can be intronic, intergenic, can overlap on the same strand of another gene or have anti-sense orientation. That's what I meant with "I need to have anti-sense transcription". I need the stranded data to account for this.

As for htseq-count, I thought that depending on the library preparation for stranded libraries I could select -s reverse or -s yes. And so to be sure I did a quick test on the stranded libraries using infer_experiment.py, it is indeed forward-reverse.
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
Fraction of reads explained by other combinations: 0.0000

1++,1--,2+-,2-+
read1 mapped to '+' strand indicates parental gene on '+' strand
read1 mapped to '-' strand indicates parental gene on '-' strand
read2 mapped to '+' strand indicates parental gene on '-' strand
read2 mapped to '-' strand indicates parental gene on '+' strand

Based on this I ran TOPHAT with the fr-secondstrand option and htseq-count with -s yes

As Simon said in the other thread "if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense or from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read"

So why would my stranded experiment counted with -s yes capture only sense transcription? Shouldn't my stranded experiment counted with -s yes capture both sense and anti-sense transcription based on where the reads map?

Also, if selecting this option depends on the library preparation protocol and not on the DGE design, shouldn't -s reverse be "wrong" in my case?

Thanks for clarifications and help
Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Friday, 28 February 2014 2:23 AM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

hi Federico,

The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here).
But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification."

Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti-sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense.

Note that if you are looking in particular for differences in anti-sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and perform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription.

You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this make sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription?

Mike

On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti wrote:

Hi Steve,

I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts.
Multifactorial

> head(CountTable)
            ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1        30      0     24     48  84    5    1    1   47    15     8     6      47       28       27       47
asmbl_10        0      0      0      0   0    0    0    0    0     0     0     0       0        0        0        0
asmbl_100       0      0      0      0   0    4    0    0    0     1     2     3       2        5        5        7
asmbl_1000      0      8      0      7   5    5   19    7   14     4     4     7       9        4       12        1
asmbl_10000    11     75      0     73 103   12  112   51   65    43    17    57      56       18       23       63
asmbl_10001     0      1      0      0   3   11    6    4    4     4    11     1       5        0        3        0

dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))

> resultsNames(dds)
[1] "Intercept"         "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP"     "conditionJUV"
[7] "conditionADULT"

ONE FACTOR

> head(CountTable)
            ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1          0     24     48    5    1    1    15     8     6       28       27       47
asmbl_10         0      0      0    0    0    0     0     0     0        0        0        0
asmbl_100        0      0      0    4    0    0     1     2     3        5        5        7
asmbl_1000       8      0      7    5   19    7     4     4     7        4       12        1
asmbl_10000     75      0     73   12  112   51    43    17    57       18       23       63
asmbl_10001      1      0      0   11    6    4     4    11     1        0        3        0

dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))

> resultsNames(dds)
[1] "Intercept"        "conditionPRECOMP"
[3] "conditionCOMP"    "conditionJUV"
[5] "conditionADULT"

Here is the number of DE genes at a threshold of 0.05
(padj<0.05)

                         PreComp-Comp  Comp-Juv  Juv-Adult
Shared                           1400      5541       5733
Multifactorial specific            98      1584       1304
One-factor specific              1436      1658       2480

As you can see, the analysis considering *only* unstranded data detected more DE genes, but the results seem comparable (at least to me).

Any thoughts on this? Should I rely on the multifactorial design?

Thanks for help
Fede

________________________________________
From: mailinglist.honeypot@gmail.com [mailinglist.honeypot@gmail.com] on behalf of Steve Lianoglou [lianoglou.steve@gene.com]
Sent: Wednesday, 26 February 2014 7:03 PM
To: Federico Gaiti
Cc: bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

Hi,

On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti wrote:
> Hi Steve,
>
> thanks for the reply and sorry for all the code.
> I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails.

Yeah, no problem, just pointing these things out -- keep in mind that it takes even experienced people time to wade through lots of code, so it's best to keep things short and sweet (with sufficient detail, of course ;-)

> I agree with you about the PCA plot analysis.
> Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ??
>
> So let's see if I got what you are saying.
> Are you suggesting I should try to do a DGE with the unstranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design?
>
> This would be the way I start the multifactorial analysis:
>
> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition + libType)
>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
>> design(dds)<-formula(~libType + condition)
>
> Am I getting it right?
> If so, I'll go ahead and keep you posted about the outcome

Yes, you are getting it right -- I'd put the `condition` data on the Design data.frame before you create the dds, but I'm not sure if it will matter.

Just follow closely the example in the deseq2 vignette. Read the entire vignette actually, so you understand how to get the particular results you are after out of your objects (ie. what the things are that you should pass into a call to `results` for instance).

You will be working with two models, say `dds1` which was built with *only* the unstranded data and your design is ~ condition.

dds2 will be the model with the unstranded and stranded along with the `~ libType + condition` design.

Once you have those, look at the output from `resultsNames(dds1)` and `resultsNames(dds2)` and see that you compare the same results between dds1 and dds2. This should become more clear to you as you read the deseq2 vignette (read it again if you think you already read it once) and when you look at your data.

Note that the DESeq2 folks recently posted an early version of the paper detailing the deseq2 method here:

http://biorxiv.org/content/early/2014/02/19/002832

Which would be helpful to read.

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
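
Regarding the question at the top of this thread (restricting the heatmap to the top 50 TCONS_ IDs): one possible approach is to filter the results table by row name before ordering by padj. The sketch below is not taken from any reply in the thread. It uses a small plain data.frame as a stand-in for `res`, built from six of the rows printed in that message, and the names `keep` and `res_tcons` are hypothetical. It assumes the row names of `res` and of `assay(rld)` match, which should hold when both come from the same `dds`.

```r
# Stand-in for the DESeq2 results table from the thread: six rows copied
# from the printed output; only the row names and padj matter here.
res <- data.frame(
  padj = c(0.86811294, NA, 0.01662316, 0.44484806, 0.07789263, 0.52699964),
  row.names = c("Aqu2.00001_001", "Aqu2.00002_001", "Aqu2.00003_001",
                "TCONS_00005299", "TCONS_00005301", "TCONS_00005302")
)

# Keep rows whose ID starts with "TCONS_" and whose adjusted p-value is not NA.
keep <- grepl("^TCONS_", rownames(res)) & !is.na(res$padj)
res_tcons <- res[keep, , drop = FALSE]

# Order the remaining rows by padj and take up to 50 row names.
select <- head(rownames(res_tcons)[order(res_tcons$padj)], 50)
print(select)

# With the objects from the thread (not run here, needs DESeq2 and gplots):
# heatmap.2(assay(rld)[select, ], col = rev(heat.colors(25)), trace = "none")
```

Selecting by row name rather than by position keeps the selection valid after the table has been subset, since positions in `res_tcons` no longer line up with rows of the full matrix.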