Hi Mike,

I am trying to make a heatmap with only a subset of differentially expressed genes. The previous email shows how I performed the DGE analsis.

I used:

select<-order(res$padj)[1:50]
heatmap.2(assay(rld)[select,],col=rev(heat.colors(25)),trace="none")

My 'res' result looks like this:


log2 fold change (MAP): condition ADULT vs PRECOMP
Wald test p-value: condition ADULT vs PRECOMP
DataFrame with 36909 rows and 6 columns
                 baseMean log2FoldChange     lfcSE
                <numeric>      <numeric> <numeric>
Aqu2.00001_001  3.5523018    -0.38899137 1.4906115
Aqu2.00002_001  0.2352661    -0.07056396 1.4687333
Aqu2.00003_001  1.1502763    -4.38453469 1.5809570
Aqu2.00004_001  7.3886827     0.17898271 0.6788353
Aqu2.00005_001 31.5022485     0.76676647 0.3843871
...                   ...            ...       ...
TCONS_00005299   48.85533      0.3962286 0.3862817
TCONS_00005300   18.57709      1.3389367 1.0457321
TCONS_00005301    1.42026     -3.2133525 1.5173331
TCONS_00005302   17.31190      0.5504013 0.6322061
TCONS_00005303   43.26359      0.5185803 0.5393246
                      stat      pvalue       padj
                 <numeric>   <numeric>  <numeric>
Aqu2.00001_001 -0.26096094 0.794122627 0.86811294
Aqu2.00002_001 -0.04804409 0.961681103         NA
Aqu2.00003_001 -2.77334224 0.005548374 0.01662316
Aqu2.00004_001  0.26366147 0.792040790 0.86680716
Aqu2.00005_001  1.99477700 0.046067207 0.09954042
...                    ...         ...        ...
TCONS_00005299   1.0257506  0.30500918 0.44484806
TCONS_00005300   1.2803821  0.20041078 0.32274512
TCONS_00005301  -2.1177633  0.03419512 0.07789263
TCONS_00005302   0.8706042  0.38397032 0.52699964
TCONS_00005303   0.9615364  0.33628255 0.47886499

As you can see I have two different type of ID: Aqu2.XXXX_XXX and TCONS_XXXXXXXX

I'd like to "grep" only the top 50 TCONS_XXXXXXXX differentially expressed genes and generate the heatmap with those.

Do you have any suggestions on how to solve this?

Thanks in advance
Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com]
Sent: Monday, 3 March 2014 10:51 PM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org
Subject: Re: [BioC] Low number of replicates DESeq

hi Federico,

This is correct.

Mike


On Mon, Mar 3, 2014 at 5:15 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
Hi Mike,

I did the DGE with the mmultifactorial design combining stranded and unstranded data.
Here it is:

Multifactorial (considering both stranded and unstranded data)


> head(CountTable)
            ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1        30      0     24     48  84    5    1    1   47    15     8     6      47       28       27       47
asmbl_10        0      0      0      0   0    0    0    0    0     0     0     0       0        0        0        0
asmbl_100       0      0      0      0   0    4    0    0    0     1     2     3       2        5        5        7
asmbl_1000      0      8      0      7   5    5   19    7   14     4     4     7       9        4       12        1
asmbl_10000    11     75      0     73 103   12  112   51   65    43    17    57      56       18       23       63
asmbl_10001     0      1      0      0   3   11    6    4    4     4    11     1       5        0        3        0

dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))


> resultsNames(dds)
[1] "Intercept"         "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP"     "conditionJUV"
[7] "conditionADULT"


Just to conlude, have I done anything wrong? Would you suggest any further/different analysis?

Thanks for all the help

Federico

________________________________
From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>]
Sent: Friday, 28 February 2014 12:11 PM
To: Federico Gaiti

Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] Low number of replicates DESeq

Then I would recommend the multifactorial design, as it's the best you can do without stranded replicates. it will be underpowered for transcripts which are mostly made up of those positions which Simon described: "position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand".  Because for such transcripts, only the single replicate for the stranded experiments will contribute signal (if I am getting it correct this time).



On Thu, Feb 27, 2014 at 7:36 PM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
No worries at all.

It's all a good exercise for me. I'm learning a lot just from this email exchange.

Back to the DGE and considering the situation, what would you raccomend for the DGE on DESeq2?

Thanks again
Federico


________________________________
From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>]
Sent: Friday, 28 February 2014 10:27 AM

To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: RE: [BioC] Low number of replicates DESeq


Hi Federico,

Yes, Simon is right. Please ignore my previous email. Sorry for adding to the confusion.

Mike

On Feb 27, 2014 5:51 PM, "Federico Gaiti" <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
Hi Mike,

Thanks for reply. I see what you mean but I'm a bit confused about ht-seq count now.

Please see also an open thread with Simon Anders where I'm discussing this in details: http://seqanswers.com/forums/showthread.php?p=133959&posted=1#post133959
Sorry for the crosspost I know, I just didn't know it's not a good idea. It's actually one of the first question I post online so I'm still learning how all this works.

I am investgating lncRNAs, which can be intronic, intergenic, can overlap on the same strand of another gene or have anti-sense orientation. That's what I meant with "I need to have anti-sense transcription". I need the stranded data to account for this.

As for htseq-count I thought that depending on the library preparation for stranded libraries I could select  -s reverse or -s yes. And so to be sure I did a quick test on the stranded libraries using infer_experiment.py, it is indeed forward-reverse.

This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
Fraction of reads explained by other combinations: 0.0000

1++,1–,2+-,2-+

read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand

Based on this I ran TOPHAT with fr-secondstrand option and htseq-count with -s yes

As Simon said in the other thread "if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript.
For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read"

So why would my stranded experiment counted with -s yes capture only sense trasncription? Shouldn't my stranded experiment counted with -s yes capture both sense and anti-sense transcription based on where the reads map?
Also, if selecting this option depends on the library prepration protocol and not on the DGE design, shouldn't -s reverse be "wrong" in my case?

Thanks for clarifications and help
Federico



________________________________
From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>]
Sent: Friday, 28 February 2014 2:23 AM
To: Federico Gaiti
Cc: Steve Lianoglou; bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] Low number of replicates DESeq

hi Federico,

The question of design falls on what you are looking for. The multifactorial design gets at differences which are consistent for both stranded and unstranded experiments (though the unstranded has 3 times more samples, so contributes more to a gene's likelihood of being detected DE here).

But to go back to an earlier point. You mentioned earlier: "I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification."  Your current multifactorial analysis is looking for consistent differences across developmental stages, between sense transcription (your stranded experiment counted with -s yes rather than -s reverse) and when you combine sense and anti-sense transcription (your unstranded experiments counted with -s no). Anti-sense transcription plays little role here if we assume that more reads are coming from sense than anti-sense.

Note that if you are looking in particular for differences in anti-sense transcription across developmental stages, you need to use the -s reverse option to htseq-count, and peform biological replicates. I don't see any way around requiring more replicates, as there are both technical and biological sources of variation which will be different in the stranded and unstranded experiments. Adding in the unstranded data seems not so helpful, as you are mixing a small signal of interest (anti-sense transcription) with most likely a lot more reads coming from sense transcription.

You mentioned, "I tried to use the option -s reverse for the stranded data and still got really low correlation." Wouldn't this makes sense, because you are comparing anti-sense transcription to the unstranded protocol which is likely capturing mostly sense transcription?

Mike




On Thu, Feb 27, 2014 at 3:36 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
Hi Steve,

I carefully read the DESeq2 vignette (February 19, 2014) and then did the DGE using two different models as you suggested and then performed different contrasts.

Multifactorial


> head(CountTable)
            ADULT ADULT1 ADULT2 ADULT3 JUV JUV1 JUV2 JUV3 COMP COMP1 COMP2 COMP3 PRECOMP PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1        30      0     24     48  84    5    1    1   47    15     8     6      47       28       27       47
asmbl_10        0      0      0      0   0    0    0    0    0     0     0     0       0        0        0        0
asmbl_100       0      0      0      0   0    4    0    0    0     1     2     3       2        5        5        7
asmbl_1000      0      8      0      7   5    5   19    7   14     4     4     7       9        4       12        1
asmbl_10000    11     75      0     73 103   12  112   51   65    43    17    57      56       18       23       63
asmbl_10001     0      1      0      0   3   11    6    4    4     4    11     1       5        0        3        0

dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~libType + condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~libType + condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))

> resultsNames(dds)
[1] "Intercept"         "libTypestranded"
[3] "libTypeunstranded" "conditionPRECOMP"
[5] "conditionCOMP"     "conditionJUV"
[7] "conditionADULT"


ONE FACTOR


> head(CountTable)
            ADULT1 ADULT2 ADULT3 JUV1 JUV2 JUV3 COMP1 COMP2 COMP3 PRECOMP1 PRECOMP2 PRECOMP3
asmbl_1          0     24     48    5    1    1    15     8     6       28       27       47
asmbl_10         0      0      0    0    0    0     0     0     0        0        0        0
asmbl_100        0      0      0    4    0    0     1     2     3        5        5        7
asmbl_1000       8      0      7    5   19    7     4     4     7        4       12        1
asmbl_10000     75      0     73   12  112   51    43    17    57       18       23       63
asmbl_10001      1      0      0   11    6    4     4    11     1        0        3        0



dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition)
colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
design(dds)<-formula(~condition )
dds<-DESeq(dds)
res<-results(dds)
resJUV_ADULT<-results(dds,contrast=c("condition","JUV","ADULT"))
resCOMP_JUV<-results(dds,contrast=c("condition","COMP","JUV"))
resPRECOMP_COMP<-results(dds,contrast=c("condition","PRECOMP","COMP"))

> resultsNames(dds)
[1] "Intercept"        "conditionPRECOMP"
[3] "conditionCOMP"    "conditionJUV"
[5] "conditionADULT"

Here is the number of DE genes at a threshold of 0.05 (padj<0.05)

                               PreComp-Comp    Comp-Juv     Juv-Adult
Shared                              1400             5541            5733
Multifactorial specific           98                1584            1304
One-factor specific             1436             1658             2480

As you can see considering *only* unstranded data in the analysis detected more DE genes but they seem comparable (at least to me).

Any thougths on this? Should I rely on the multifactorial design?

Thanks for help
Fede

________________________________________
From: mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com> [mailinglist.honeypot@gmail.com<mailto:mailinglist.honeypot@gmail.com>] on behalf of Steve Lianoglou [lianoglou.steve@gene.com<mailto:lianoglou.steve@gene.com>]
Sent: Wednesday, 26 February 2014 7:03 PM
To: Federico Gaiti
Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org>
Subject: Re: [BioC] Low number of replicates DESeq

Hi,

On Wed, Feb 26, 2014 at 12:50 AM, Federico Gaiti <f.gaiti@uq.edu.au<mailto:f.gaiti@uq.edu.au>> wrote:
> Hi Steve,
>
> thanks for the reply and sorry for all the code.
> I'm still a beginner in this field so I'm still learning how to correctly formulate my questions/emails.

Yeah, no problem, just pointing these things out -- keep in mind that
it takes even experienced people time to wade through lots of code, so
it's best to keep things short and sweet (with sufficient detail, of
course ;-)

> I agree with you about the PCA plot analysis.
> Could you just explain better to me what you mean with " If this is the case, then encoding the libType as a main effect in your model (as you've done) should go a long ways in dealing with this issue for you." ??
>
> So let's see if I got what you are saying.
> Are you suggesting I should try to do a DGE with the undtranded data with "condition" as the only level and then compare it to the DGE outcome using a multifactorial design?
>
> This would be the way I start the multifactorial analysis:
>
> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition + libType)
>> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
>> design(dds)<-formula(~libType + condition)
>
> Am I getting it right?
> If so, I'll go ahead and keep you posted about the outcome

Yes, you are getting it right -- I'd put the `condition` data on the
Design data.frame before you create the dds, but I'm not if it will
matter.

Just follow closely the example in the deseq2 vignette. Read the
entire vignette actually, so you understand how to get the particular
results you are after out of your objects (ie. what the things are
that you should pass into a call to `results` for instance).

You will be working with two models, say `dds1` which was built with
*only* the unstranded data and your design is ~ condition.

dds2 will be the model with the unstranded and stranded along with the
`~ libType + condition` design.

Once you have those, look at the output from `resultsNames(dds1)` and
`resultsNames(dds2)` and see that you compare the same results between
dds1 and dds2.

This should become more clear to you as you read the deseq2 vignette
(read it again if you think you already read it once) and when you
look with your data.

Note that the DESeq2 folks recently posted an early version of the
paper detaling the deseq2 method here:
http://biorxiv.org/content/early/2014/02/19/002832

Which would be helpful to read.

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

        [[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor@r-project.org<mailto:Bioconductor@r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor




	[[alternative HTML version deleted]]

