[Bioc-devel] makeTxDbFromGFF drops genes which have multiple chromosome locations. (with iGenome GTF)

Marlin JL.M marlin- at gmx.cn
Sun Nov 13 19:07:14 CET 2016


No, I did not get that warning. But only:

> Import genomic features from the file as a GRanges object ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK


Further, I noticed that:

1. Those genes are not actually "dropped", they can be shown with
`transcriptsBy(txdb,'gene')` but can not be shown with `genes(txdb)`.

2. The behavior is not associated with multiple chromosome locations,
though they have a large intersection.

Hence, I am not clear what is the actual factor causing this somehow
weired behavior.

As I am not using the development version of bioconductor, if anyone 
can help to reproduce it? I have uploaded a small dummy gtf file at 
https://gist.github.com/Marlin-Na/1eefedc4984e40b8ef76a0e1e7612dbb .


This is what I get:

$ wget https://gist.githubusercontent.com/Marlin-Na/1eefedc4984e40b8ef76a0e1e7612dbb/raw/9977b699a9fcdf70c8860de985ff550934c2c4a7/dummy.gtf
$ R

> library(GenomicFeatures)

> txdb = makeTxDbFromGFF('dummy.gtf')
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK

> genes(txdb)
# GRanges object with 1 range and 1 metadata column:
#         seqnames               ranges strand |     gene_id
#            <Rle>            <IRanges>  <Rle> | <character>
#   TTTY5     chrY [24442945, 24445023]      - |       TTTY5
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

> transcriptsBy(txdb, 'gene')
# GRangesList object of length 3:
# $TTTY17A 
# GRanges object with 3 ranges and 2 metadata columns:
#       seqnames               ranges strand |     tx_id     tx_name
#          <Rle>            <IRanges>  <Rle> | <integer> <character>
#   [1]     chrY [24997731, 24998862]      + |         3 NR_001526_1
#   [2]     chrY [26631479, 26632610]      + |         4 NR_001526_2
#   [3]     chrY [27329790, 27330920]      - |         6   NR_001526
# 
# $TTTY5 
# GRanges object with 1 range and 2 metadata columns:
#       seqnames               ranges strand | tx_id   tx_name
#   [1]     chrY [24442945, 24445023]      - |     5 NR_001541
# 
# $XGY2 
# GRanges object with 2 ranges and 2 metadata columns:
#       seqnames             ranges strand | tx_id     tx_name
#   [1]     chrX [2670337, 2693037]      + |     1   NR_003254
#   [2]     chrY [2620337, 2643037]      + |     2 NR_003254_1
# 
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths

As you can see, 'XGY2' and 'TTTY17A' are not shown with `genes(txdb)`.



> sessionInfo()
# R version 3.3.2 (2016-10-31)
# Platform: i686-pc-linux-gnu (32-bit)
# Running under: Ubuntu 16.04.1 LTS
#
# ......
# 
# attached base packages:
# [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
# [8] methods   base     
# 
# other attached packages:
# [1] GenomicFeatures_1.24.5 AnnotationDbi_1.34.4   Biobase_2.32.0        
# [4] GenomicRanges_1.24.3   GenomeInfoDb_1.8.3     IRanges_2.6.1         
# [7] S4Vectors_0.10.3       BiocGenerics_0.18.0   



On Sun, 2016-11-13 at 08:24 -0500, Vincent Carey wrote:
> Did you not see a message like:
> 
> Import genomic features from the file as a GRanges object ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning message:
> In makeTxDbFromGRanges(gr, metadata = metadata) :
>   The following transcripts were dropped because their exon ranks
> could
>   not be inferred (either because the exons are not on the same
>   chromosome/strand or because they are not separated by introns):
>   NR_003254
> 
> On Sun, Nov 13, 2016 at 5:25 AM, Marlin JL.M <marlin- at gmx.cn> wrote:
> > Dear all,
> > 
> > 
> > When trying to import the the GTF file downloaded from iGenome
> > using
> > makeTxDbFromGFF, I figured out that many genes are dropped
> > silently,
> > probably because those genes happens to have multiple chromosome
> > locations in that GTF file. 
> > 
> > I have posted it at https://support.bioconductor.org/p/89401 with
> > no
> > reply yet. As I find it somehow a problem-causing bug, I decide to
> > send
> > the information here.
> > 
> > Is there any suggested way to deal with the case?
> > 
> > 
> > Best regards,
> > Marlin
> > 
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list