[BioC] aggregate genes in DEXSeq

Thu Mar 6 22:20:49 CET 2014

Alejandro Reyes <alejandro.reyes at ...> writes:

> 
> Dear Julien, Dear Mar and people interested in DEXSeq ,
> 
> You recently reported some problems in DEXSeq that had to do with the 
> way the HTSeq python scripts deal with the exons that overlap with more 
> than one gene ID.
> 
> The solution that we had taken so far was that the gene IDs sharing an 
> exon were merged into an "aggregate gene" ID.  From the input of some 
> users and our own experience, we know that it was not the most 
> appropriate solution: when the merged genes were differentially 
> expressed, DEXSeq falsely calls differential usage in other exons of the 
> aggregate genes. We have included a "-r" parameter in the script 
> "prepare_annotation_dexseq.py", for the user to decide what to do with 
> these exons: either to ignore the exons associated with more than one 
> gene IDs and treat each gene separately, or to merge the genes and take 
> these exons into account.
> 
> Additionally, we have implemented the R/Bioconductor functions 
> equivalent to the python scripts. These functions were implemented using 
> code contributed by Mike Love.
> 
> All these changes are available in the last svn version (1.5.9).
> 
> Best regards,
> Alejandro Reyes
> 
> Hi Alejandro,
> Just to let you know that adding the junctions to the test of 
> differential expression of DEXSeq worked fine! The "hack" was actually 
> straightforward, I just had to modify the counts files taken as input.
> 
> On a different note, I noticed that many false positives were generated 
> because of "aggregate" gene models that were composed on different 
> overlapping genes. When these overlapping genes have different behavior 
> in different conditions, this is interpreted as differential expression 
> of some exons, while it is differential expression of genes... See the 
> attached picture, this might turn out to be easier to understand
> Did you notice this behavior of DEXSeq, and do you have any comment on 
> this?
> 
> Thanks again for your work on DEXSeq
> Julien
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at ...
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 

Hi, Alejandro

We recently used the python script dexseq_prepare_annotation.py to 
generate exon.gtf from genes.gtf. Similar to the situations in your 
previous discussion, we found some overlapping exons which share 
patial/all regions and in the meanwhile belong to different genes, which 
you can see below, the 1st and the 4th line. Even after using "-r yes" 
parameter, we still see these happening. "-r yes" should be the default. 
We thought by merging the genes and taking all these exons into account 
would generate non-overlapping exon.gtf in the end, but now the gtf still 
have some overlapping exons. We are concerned if the further count files 
generated would be biased in these regions. Do you have any lastest update 
about the solution of this problem? 

In your last message, you mentioned about the R functions equivalent to 
the python scripts which were contributed by Mike Love. Could you provide 
more details about how we can find these functions, like the name or the 
website of it? 

Thank you very much in advance!

Thanks,
Xiayu

1       dexseq_prepare_annotation.py    exonic_part     13671   
14409   .       +       .       
transcripts "ENST00000456328+ENST00000515242+ENST00000518655"; exonic_
part_number "018"; gene_id "ENSG00000223972"
1       dexseq_prepare_annotation.py    exonic_part     14410   
14412   .       +       .       transcripts "ENST00000515242"; 
exonic_part_number "019"; gene_id "ENSG
00000223972"
1       dexseq_prepare_annotation.py    aggregate_gene  14363   
29806   .       -       .       gene_id "ENSG00000227232"
1       dexseq_prepare_annotation.py    exonic_part     14363   
14403   .       -       .       
transcripts "ENST00000541675+ENST00000423562+ENST00000438504"; exonic_
part_number "001"; gene_id "ENSG00000227232"