[Bioc-devel] very slow to use intronsByTranscript in GenomicFeatures

Hervé Pagès hpages at fhcrc.org
Fri Dec 20 20:25:31 CET 2013


Hi Robert, Jianhong,

This could be related to some changes to the relist() and split() code
that I made a few days ago in IRanges. I didn't immediately make the
corresponding changes to GenomicRanges and GenomicAlignments so
relisting or splitting GRanges and GAlignments objects was broken for
a couple of days, which had all kinds of nasty consequences in many
places (relist() and split() are used a lot internally).

Updated versions of GenomicRanges and GenomicAlignments are now online
so please make sure you have the latest versions (1.15.17 for
GenomicRanges and 0.99.10 for GenomicAlignments).
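
A quick way to check this, as a minimal sketch in base R: the helper names
meets_minimum() and check_versions() are made up for illustration and are
not part of any Bioconductor package. Only the two minimum versions come
from this message.

```r
# Minimum versions quoted in this message.
required <- c(GenomicRanges = "1.15.17", GenomicAlignments = "0.99.10")

# Hypothetical helper: TRUE if version string `installed` is at least `minimum`.
# package_version() compares component-wise, so "0.99.10" > "0.99.9".
meets_minimum <- function(installed, minimum) {
  package_version(installed) >= package_version(minimum)
}

# Hypothetical helper: print the status of each package named in `required`.
check_versions <- function(required) {
  for (pkg in names(required)) {
    # packageVersion() signals an error when the package is not installed.
    installed <- tryCatch(as.character(utils::packageVersion(pkg)),
                          error = function(e) NA_character_)
    if (is.na(installed)) {
      status <- "not installed"
    } else if (meets_minimum(installed, required[[pkg]])) {
      status <- "OK"
    } else {
      status <- "too old"
    }
    cat(sprintf("%-18s installed: %-9s required: >= %-9s %s\n",
                pkg, installed, required[[pkg]], status))
  }
}

check_versions(required)
```

Comparing with package_version() rather than as plain strings matters here,
because a string comparison would rank "0.99.9" above "0.99.10".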

Sorry for the inconvenience and please let me know if you still run
into problems with this.

H.

On 12/20/2013 10:15 AM, Robert Castelo wrote:
> hi,
>
> i can reproduce what Jianhong says, i noticed it earlier this week but
> didn't mention because we all know devel is a moving target and so on,
> but since this has been raised now i'll report what i'm getting.
>
> so, this is for Jianhong, if you downgrade the following packages to
> these particular versions:
>
> Biostrings_2.31.3.tar.gz
> GenomicRanges_1.15.15.tar.gz
> IRanges_1.21.13.tar.gz
> XVector_0.3.2.tar.gz
>
> you'll be all fine, unless you need some functionality of later versions
> of them, here is the test with the session information:
>
> suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg19.knownGene))
> Warning messages:
> 1: multiple methods tables found for ‘rname’
> 2: multiple methods tables found for ‘rname<-’
> 3: multiple methods tables found for ‘cigar’
> 4: multiple methods tables found for ‘qwidth’
> 5: multiple methods tables found for ‘introns’
> system.time(txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
> "gene"))
>     user  system elapsed
>    2.524   0.046   2.575
>
> sessionInfo()
> R Under development (unstable) (2013-10-20 r64082)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
>   [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
>   [7] LC_PAPER=en_US.UTF8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1
>   [2] GenomicFeatures_1.15.4
>   [3] AnnotationDbi_1.25.9
>   [4] Biobase_2.23.3
>   [5] GenomicRanges_1.15.11
>   [6] XVector_0.3.2
>   [7] IRanges_1.21.13
>   [8] BiocGenerics_0.9.2
>   [9] vimcom_0.9-92
> [10] setwidth_1.0-3
> [11] colorout_1.0-1
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.19.1           Biostrings_2.31.3        bitops_1.0-6
>   [4] BSgenome_1.31.7          DBI_0.2-7                GenomicAlignments_0.99.9
>   [7] RCurl_1.95-4.1           Rsamtools_1.15.15        RSQLite_0.11.4
> [10] rtracklayer_1.23.6       stats4_3.1.0             tools_3.1.0
> [13] XML_3.98-1.1             zlibbioc_1.9.0
>
>
> however, if you go to the bleeding edge of devel BioC:
>
> suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg19.knownGene))
> system.time(txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene,
> "gene"))
>
> the previous call never ends until you press CTRL+C:
>
> ^C
> Error in unlist(lapply(c("seqnames", "ranges", "strand", "mcols"),
> checkCoreGetterReturnedLength)) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'unlist': Error in NROW(get(getter)(x)) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'NROW': Error in get(getter)(x) :
>    error in evaluating the argument 'x' in selecting a method for
> function 'ranges':
> Timing stopped at: 24.5 0.072 24.619
>
> sessionInfo()
> R Under development (unstable) (2013-10-20 r64082)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF8        LC_COLLATE=en_US.UTF8
>   [5] LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
>   [7] LC_PAPER=en_US.UTF8       LC_NAME=C
>   [9] LC_ADDRESS=C              LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1 GenomicFeatures_1.15.4
>   [3] AnnotationDbi_1.25.9                     Biobase_2.23.3
>   [5] GenomicRanges_1.15.15                    XVector_0.3.5
>   [7] IRanges_1.21.17                          BiocGenerics_0.9.2
>   [9] vimcom_0.9-92                            setwidth_1.0-3
> [11] colorout_1.0-1
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.19.1           Biostrings_2.31.5        bitops_1.0-6
>   [4] BSgenome_1.31.7          DBI_0.2-7                GenomicAlignments_0.99.9
>   [7] RCurl_1.95-4.1           Rsamtools_1.15.15        RSQLite_0.11.4
> [10] rtracklayer_1.23.6       stats4_3.1.0             tools_3.1.0
> [13] XML_3.98-1.1             zlibbioc_1.9.0
>
>
>
> cheers,
> robert.
>
>
> On 12/20/2013 06:31 PM, Ou, Jianhong wrote:
>> In my case, it looks like it never ends.
>>
>> I need to check my R first.
>>
>> Yours sincerely,
>>
>> Jianhong Ou
>>
>> LRB 670A
>> Program in Gene Function and Expression
>> 364 Plantation Street Worcester,
>> MA 01605
>>
>>
>>
>>
>> On 12/20/13 12:05 PM, "Hervé Pagès"<hpages at fhcrc.org>  wrote:
>>
>>> Hi Jianhong,
>>>
>>> According to my timings, it's a little bit slower than exonsBy() but
>>> not that much. It has to do a little bit more work too as the introns
>>> are not explicitly stored in the SQLite db (the exons are) but are
>>> inferred from the exons and transcript boundaries.
>>> So intronsByTranscript() has to retrieve all the exons + all the
>>> transcripts from the db.
>>>
>>> intronsByTranscript():
>>>
>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>    system.time(introns<-
>>> intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  9.165   0.076   9.263
>>>    system.time(introns<-
>>> intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  4.824   0.064   4.896
>>>
>>> exonsBy():
>>>
>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>    system.time(exons<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  7.720   0.072   7.812
>>>    system.time(exons<- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  4.229   0.028   4.265
>>>
>>> transcripts():
>>>
>>>    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>    system.time(tx<- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  1.424   0.008   1.436
>>>    system.time(tx<- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene))
>>>    #   user  system elapsed
>>>    #  0.776   0.012   0.790
>>>
>>> Less than 10 sec. to retrieve all the exons and transcripts from disk
>>> and compute the 659327 introns. It's actually not that bad.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> On 12/20/2013 08:25 AM, Ou, Jianhong wrote:
>>>> Dear all,
>>>>
>>>> When I try to use intronsByTranscript to get introns for hg19 known
>>>> genes, I find it unacceptably slow. Does anybody have the same
>>>> problem?
>>>>
>>>> My code:
>>>> library(GenomicFeatures)
>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>> introns<- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>
>>>>> sessionInfo()
>>>> R Under development (unstable) (2013-12-12 r64453)
>>>> Platform: x86_64-apple-darwin12.5.0 (64-bit)
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.10.1 GenomicFeatures_1.15.4
>>>> [3] AnnotationDbi_1.25.9                     Biobase_2.23.3
>>>> [5] GenomicRanges_1.15.15                    XVector_0.3.5
>>>> [7] IRanges_1.21.17                          BiocGenerics_0.9.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>>   [1] biomaRt_2.19.1           Biostrings_2.31.5        bitops_1.0-6
>>>>   [4] BSgenome_1.31.7          DBI_0.2-7                GenomicAlignments_0.99.9
>>>>   [7] RCurl_1.95-4.1           Rsamtools_1.15.15        RSQLite_0.11.4
>>>> [10] rtracklayer_1.23.6       stats4_3.1.0             tools_3.1.0
>>>> [13] XML_3.98-1.1             zlibbioc_1.9.0
>>>>
>>>> Yours sincerely,
>>>>
>>>> Jianhong Ou
>>>>
>>>> LRB 670A
>>>> Program in Gene Function and Expression
>>>> 364 Plantation Street Worcester,
>>>> MA 01605
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319