[BioC] goseq analysis

Fri Nov 2 03:21:15 CET 2012

Hi Steve,

Thank you for your reply.

On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou  
<mailinglist.honeypot at gmail.com> wrote:

[snipped]

>> Also I have some questions regarding the graph on page 9. The x-axis is
>> bias.data, which according to the vignette is usually the "gene length"  
>> or
>> "number of counts". I can understand "gene length" but I don't  
>> understand
>> what "number of counts" refers to.
>
> It is likely referring to the number of reads assigned to that gene
> when it was assessed for DE. By the structure of the data.frame, it
> seems that you might just add up the number of read counts between the
> two conditions under test and put it there. While I haven't tried
> this, I'd suspect that this has some (strong(?)) correlation with the
> gene's length.

It was confusing because for my analysis I just gave goseq a vector of all  
the genes assayed and the DE status, so there wasn't any count  
information. I guess in this context, I'll just assume that bias.data is  
just the length of the gene.

[snipped]

>> And lastly I'm working with a list of differentially expressed features
>> (CAGE tags), which can be annotated to genes based on genome mapping.
>> However a small subset of these features cannot be annotated and I have
>> discarded them from the analysis since they cannot be associated to GO
>> terms. Is this potentially disastrous?
>
> Depends on how you define disaster?

Sorry for the lack of clarity; I should avoid using vague words. I guess  
the worst case scenario is that I get results that are flawed because of  
the analysis.

> If I were you, I'd try and assign tags that are intergenic but only
> slightly 5' upstream from annotated genes to the downstream gene. I'm
> leaving the definition of "slightly" undefined, though :-)

That's almost how I annotated the CAGE tags. So I used the starting  
position of all Ensembl genes, and if a tag was 400 bp upstream or  
downstream of this position, I annotated this tag with that gene.

> Also, a disaster you might try to avert is whether or not using goseq
> is appropriate for your analysis to begin with.

I was going to use GOstats for this but thought "ah a newish package,  
let's try it out!" From the vignette it mentions that using their methods  
when there is no length bias does no harm. They also have the  
hypergeometric method implemented, which doesn't take into account the  
length bias, so I also used that.

Actually I got different results when I did the analysis accounting for  
length bias and not accounting for length bias; I got no enriched GO terms  
after pvalue adjustment when I did the hypergeometric test compared to 20  
enriched GO terms using the Wallenius approximation method.

> If you are using goseq to correct for a length bias in detecting
> differential expression, you might explore whether or not CAGE data is
> subject to this bias at all. I think the "common" understanding is
> that tag-based methods generally don't suffer from this, see:
>
> Protocol Dependence of Sequencing-Based Gene Expression Measurements
> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0019287
>
> And in the discussion of:
>
> Transcript length bias in RNA-seq data confounds systems biology
> http://www.biology-direct.com/content/4/1/14

Thanks for the papers (which I will read in more detail later) and the  
comments; it was very useful. Much appreciated.

Cheers,

-- 
Dave