[Bioc-devel] makeTxDbFromGFF drops genes which have multiple chromosome locations. (with iGenome GTF)

Martin Morgan martin.morgan at roswellpark.org
Sun Nov 13 19:13:40 CET 2016


On 11/13/2016 01:07 PM, Marlin JL.M wrote:
> No, I did not get that warning. I only got:
>
>> Import genomic features from the file as a GRanges object ... OK
>> Prepare the 'metadata' data frame ... OK
>> Make the TxDb object ... OK
>
>
> Further, I noticed that:
>
> 1. Those genes are not actually "dropped": they appear in
> `transcriptsBy(txdb,'gene')` but not in `genes(txdb)`.
>
> 2. The behavior is not strictly tied to multiple chromosome locations,
> although the two sets of genes overlap heavily.
>
> Hence, I am not sure what actually causes this somewhat weird
> behavior.
>
> As I am not using the development version of Bioconductor, could
> anyone help to reproduce it? I have uploaded a small dummy GTF file at
> https://gist.github.com/Marlin-Na/1eefedc4984e40b8ef76a0e1e7612dbb .

In the current release version

 > packageVersion("GenomicFeatures")
[1] '1.26.0'

look up the help for ?genes and use the argument 
single.strand.genes.only=FALSE
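
As documented in ?genes, the default single.strand.genes.only=TRUE
returns a GRanges and leaves out any gene that cannot be represented by
a single range, i.e. a gene with features on both strands or on more
than one sequence. A minimal sketch of the comparison, assuming the
dummy.gtf from the gist quoted above has been saved to the working
directory (the variable names are only illustrative):

```r
library(GenomicFeatures)

# Assumption: 'dummy.gtf' from the gist is in the working directory.
txdb <- makeTxDbFromGFF("dummy.gtf")

# Default behaviour: a GRanges holding only the genes that fit in a
# single range on one sequence and one strand.
single <- genes(txdb)

# With single.strand.genes.only=FALSE the result is a GRangesList with
# one element per gene, so multi-location genes are kept.
all_genes <- genes(txdb, single.strand.genes.only = FALSE)

# Gene IDs that the default call leaves out.
setdiff(names(all_genes), names(single))
```

Because the second call returns a GRangesList rather than a GRanges,
downstream code that expects one range per gene needs to be adapted,
for example by working on unlist(all_genes) or on range(all_genes).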

Martin

>
>
> This is what I get:
>
> $ wget https://gist.githubusercontent.com/Marlin-Na/1eefedc4984e40b8ef76a0e1e7612dbb/raw/9977b699a9fcdf70c8860de985ff550934c2c4a7/dummy.gtf
> $ R
>
>> library(GenomicFeatures)
>
>> txdb = makeTxDbFromGFF('dummy.gtf')
> # Import genomic features from the file as a GRanges object ... OK
> # Prepare the 'metadata' data frame ... OK
> # Make the TxDb object ... OK
>
>> genes(txdb)
> # GRanges object with 1 range and 1 metadata column:
> #         seqnames               ranges strand |     gene_id
> #            <Rle>            <IRanges>  <Rle> | <character>
> #   TTTY5     chrY [24442945, 24445023]      - |       TTTY5
> #   -------
> #   seqinfo: 2 sequences from an unspecified genome; no seqlengths
>
>> transcriptsBy(txdb, 'gene')
> # GRangesList object of length 3:
> # $TTTY17A
> # GRanges object with 3 ranges and 2 metadata columns:
> #       seqnames               ranges strand |     tx_id     tx_name
> #          <Rle>            <IRanges>  <Rle> | <integer> <character>
> #   [1]     chrY [24997731, 24998862]      + |         3 NR_001526_1
> #   [2]     chrY [26631479, 26632610]      + |         4 NR_001526_2
> #   [3]     chrY [27329790, 27330920]      - |         6   NR_001526
> #
> # $TTTY5
> # GRanges object with 1 range and 2 metadata columns:
> #       seqnames               ranges strand | tx_id   tx_name
> #   [1]     chrY [24442945, 24445023]      - |     5 NR_001541
> #
> # $XGY2
> # GRanges object with 2 ranges and 2 metadata columns:
> #       seqnames             ranges strand | tx_id     tx_name
> #   [1]     chrX [2670337, 2693037]      + |     1   NR_003254
> #   [2]     chrY [2620337, 2643037]      + |     2 NR_003254_1
> #
> # -------
> # seqinfo: 2 sequences from an unspecified genome; no seqlengths
>
> As you can see, 'XGY2' and 'TTTY17A' are not shown with `genes(txdb)`.
>
>
>
>> sessionInfo()
> # R version 3.3.2 (2016-10-31)
> # Platform: i686-pc-linux-gnu (32-bit)
> # Running under: Ubuntu 16.04.1 LTS
> #
> # ......
> #
> # attached base packages:
> # [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> # [8] methods   base
> #
> # other attached packages:
> # [1] GenomicFeatures_1.24.5 AnnotationDbi_1.34.4   Biobase_2.32.0
> # [4] GenomicRanges_1.24.3   GenomeInfoDb_1.8.3     IRanges_2.6.1
> # [7] S4Vectors_0.10.3       BiocGenerics_0.18.0
>
>
>
> On Sun, 2016-11-13 at 08:24 -0500, Vincent Carey wrote:
>> Did you not see a message like:
>>
>> Import genomic features from the file as a GRanges object ... OK
>> Prepare the 'metadata' data frame ... OK
>> Make the TxDb object ... OK
>> Warning message:
>> In makeTxDbFromGRanges(gr, metadata = metadata) :
>>   The following transcripts were dropped because their exon ranks could
>>   not be inferred (either because the exons are not on the same
>>   chromosome/strand or because they are not separated by introns):
>>   NR_003254
>>
>> On Sun, Nov 13, 2016 at 5:25 AM, Marlin JL.M <marlin- at gmx.cn> wrote:
>>> Dear all,
>>>
>>>
>>> When trying to import the GTF file downloaded from iGenome using
>>> makeTxDbFromGFF, I found that many genes are dropped silently,
>>> probably because those genes happen to have multiple chromosome
>>> locations in that GTF file.
>>>
>>> I have posted it at https://support.bioconductor.org/p/89401 with no
>>> reply yet. As this looks like a bug that could cause problems, I
>>> decided to send the information here.
>>>
>>> Is there a recommended way to deal with this case?
>>>
>>>
>>> Best regards,
>>> Marlin
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

