[BioC] goseq analysis

Dave Tang davetingpongtang at gmail.com
Fri Nov 2 15:01:45 CET 2012


Hi Alicia,

Thank you for your reply and comments. It definitely makes much more sense  
now.

I shall redo the analysis with the tag counts as my bias.data.

Cheers,

Dave


On Fri, 02 Nov 2012 20:40:50 +0900, Alicia Oshlack  
<alicia.oshlack at mcri.edu.au> wrote:

> Hi Dave,
>
> I¹d just like to mention that I don¹t believe that using gene lengths for
> CAGE data is the right thing to do because you are not counting the  
> number
> of reads across the whole transcript, which will correlate with gene  
> length.
> Therefore I believe you should not use the automatic fetching of gene
> lengths as your bias data. It seems likely that your power to detect DE  
> for
> CAGE data will depend on the total counts for each tag so your bias.data
> would be total counts for each tag which you will have to supply.
>
> Hope this makes sense.
>
> Cheers,
> Alicia
>
>
> Hi Steve,
>
> Thank you for your reply.
>
> On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>
> [snipped]
>
>>> Also I have some questions regarding the graph on page 9. The x-axis is
>>> bias.data, which according to the vignette is usually the "gene length"
>>> or
>>> "number of counts". I can understand "gene length" but I don't
>>> understand
>>> what "number of counts" refers to.
>>
>> It is likely referring to the number of reads assigned to that gene
>> when it was assessed for DE. By the structure of the data.frame, it
>> seems that you might just add up the number of read counts between the
>> two conditions under test and put it there. While I haven't tried
>> this, I'd suspect that this has some (strong(?)) correlation with the
>> gene's length.
>
> It was confusing because for my analysis I just gave goseq a vector of  
> all
> the genes assayed and the DE status, so there wasn't any count
> information. I guess in this context, I'll just assume that bias.data is
> just the length of the gene.
>
> [snipped]
>
>>> And lastly I'm working with a list of differentially expressed features
>>> (CAGE tags), which can be annotated to genes based on genome mapping.
>>> However a small subset of these features cannot be annotated and I have
>>> discarded them from the analysis since they cannot be associated to GO
>>> terms. Is this potentially disastrous?
>>
>> Depends on how you define disaster?
>
> Sorry for the lack of clarity; I should avoid using vague words. I guess
> the worst case scenario is that I get results that are flawed because of
> the analysis.
>
>> If I were you, I'd try and assign tags that are intergenic but only
>> slightly 5' upstream from annotated genes to the downstream gene. I'm
>> leaving the definition of "slightly" undefined, though :-)
>
> That's almost how I annotated the CAGE tags. So I used the starting
> position of all Ensembl genes, and if a tag was 400 bp upstream or
> downstream of this position, I annotated this tag with that gene.
>
>> Also, a disaster you might try to avert is whether or not using goseq
>> is appropriate for your analysis to begin with.
>
> I was going to use GOstats for this but thought "ah a newish package,
> let's try it out!" From the vignette it mentions that using their methods
> when there is no length bias does no harm. They also have the
> hypergeometric method implemented, which doesn't take into account the
> length bias, so I also used that.
>
> Actually I got different results when I did the analysis accounting for
> length bias and not accounting for length bias; I got no enriched GO  
> terms
> after pvalue adjustment when I did the hypergeometric test compared to 20
> enriched GO terms using the Wallenius approximation method.
>
>> If you are using goseq to correct for a length bias in detecting
>> differential expression, you might explore whether or not CAGE data is
>> subject to this bias at all. I think the "common" understanding is
>> that tag-based methods generally don't suffer from this, see:
>>
>> Protocol Dependence of Sequencing-Based Gene Expression Measurements
>> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0019287
>>
>> And in the discussion of:
>>
>> Transcript length bias in RNA-seq data confounds systems biology
>> http://www.biology-direct.com/content/4/1/14
>
> Thanks for the papers (which I will read in more detail later) and the
> comments; it was very useful. Much appreciated.
>
> Cheers,
>
>


-- 
Dave



More information about the Bioconductor mailing list