[BioC] goseq analysis
Dave Tang
davetingpongtang at gmail.com
Fri Nov 2 03:21:15 CET 2012
Hi Steve,
Thank you for your reply.
On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
[snipped]
>> Also I have some questions regarding the graph on page 9. The x-axis is
>> bias.data, which according to the vignette is usually the "gene length"
>> or
>> "number of counts". I can understand "gene length" but I don't
>> understand
>> what "number of counts" refers to.
>
> It is likely referring to the number of reads assigned to that gene
> when it was assessed for DE. By the structure of the data.frame, it
> seems that you might just add up the number of read counts between the
> two conditions under test and put it there. While I haven't tried
> this, I'd suspect that this has some (strong(?)) correlation with the
> gene's length.
It was confusing because for my analysis I just gave goseq a vector of all
the genes assayed and the DE status, so there wasn't any count
information. I guess in this context, I'll just assume that bias.data is
just the length of the gene.
[snipped]
>> And lastly I'm working with a list of differentially expressed features
>> (CAGE tags), which can be annotated to genes based on genome mapping.
>> However a small subset of these features cannot be annotated and I have
>> discarded them from the analysis since they cannot be associated to GO
>> terms. Is this potentially disastrous?
>
> Depends on how you define disaster?
Sorry for the lack of clarity; I should avoid using vague words. I guess
the worst case scenario is that I get results that are flawed because of
the analysis.
> If I were you, I'd try and assign tags that are intergenic but only
> slightly 5' upstream from annotated genes to the downstream gene. I'm
> leaving the definition of "slightly" undefined, though :-)
That's almost how I annotated the CAGE tags. So I used the starting
position of all Ensembl genes, and if a tag was 400 bp upstream or
downstream of this position, I annotated this tag with that gene.
> Also, a disaster you might try to avert is whether or not using goseq
> is appropriate for your analysis to begin with.
I was going to use GOstats for this but thought "ah a newish package,
let's try it out!" From the vignette it mentions that using their methods
when there is no length bias does no harm. They also have the
hypergeometric method implemented, which doesn't take into account the
length bias, so I also used that.
Actually I got different results when I did the analysis accounting for
length bias and not accounting for length bias; I got no enriched GO terms
after pvalue adjustment when I did the hypergeometric test compared to 20
enriched GO terms using the Wallenius approximation method.
> If you are using goseq to correct for a length bias in detecting
> differential expression, you might explore whether or not CAGE data is
> subject to this bias at all. I think the "common" understanding is
> that tag-based methods generally don't suffer from this, see:
>
> Protocol Dependence of Sequencing-Based Gene Expression Measurements
> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0019287
>
> And in the discussion of:
>
> Transcript length bias in RNA-seq data confounds systems biology
> http://www.biology-direct.com/content/4/1/14
Thanks for the papers (which I will read in more detail later) and the
comments; it was very useful. Much appreciated.
Cheers,
--
Dave
More information about the Bioconductor
mailing list