[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Mar 29 03:20:03 CEST 2012


Hi,

2012/3/28 Hervé Pagès <hpages at fhcrc.org>:
> Hi Nico, Milica,
>
>
> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
>>
>> Hi Milica,
>>
>> I do the exact same thing in my package (easyRNASeq, still in the devel
>> branch of Bioc) and it definitely does not require 20 hours to read "only"
>> 20 million reads. Are you sure you are not getting your machine to swap?
>> I.e. did you monitor the memory usage?
>>
>> It would be interesting (for me, at least) if you could try my package to
>> get your count table. You can either retrieve the annotation from biomaRt or
>> provide a GFF file. See the vignette of the package for the details and
>> maybe these two posts on that mailing list:
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
>
>
> Slightly off-topic but probably worth mentioning is that you don't
> need to use a BSgenome package in order to fetch the chromosome
> lengths of an organism. For example in the code you show in the
> 2 above posts, you could use makeTranscriptDbFromUCSC() (or
> makeTranscriptDbFromBiomart(), both from the GenomicFeatures
> package) to make a TranscriptDb object, and then to do seqlengths()
> on that object (BTW it would be nice if this worked with
> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
> for hg19 or mm9 already installed, that should be faster than
> installing them.

Also, the BAM file you are using as the source of your reads has the
seqlength info in it, eg:

R> library(Rsamtools)
R> bf <- BamFile('/path/to/your.bam')
R> seqlengths(bf)   ## hg19
     chr1      chr2      chr3      chr4      chr5      chr6      chr7      chr8
249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022
     chr9     chr10     chr11     chr12     chr13     chr14     chr15     chr16
141213431 135534747 135006516 133851895 115169878 107349540 102531392  90354753
    chr17     chr18     chr19     chr20     chr21     chr22      chrX      chrY
 81195210  78077248  59128983  63025520  48129895  51304566 155270560  59373566
     chrM
    16571

Perhaps that's a helpful piece of trivia to know, too ...

-steve


>
> Or, even faster:
>
>> library(rtracklayer)
>> session <- browserSession()
>> genome(session) <- "mm9"
>> seqlengths(session)
>        chr1  chr1_random         chr2         chr3  chr3_random   chr4
>   197195432      1231697    181748087    159599783        41899 155630120
>  chr4_random         chr5  chr5_random         chr6         chr7 chr7_random
>      160594    152537259       357350    149517037    152524553 362490
>        chr8  chr8_random         chr9  chr9_random        chr10  chr11
>   131738871       849593    124076172       449403    129993255 121843856
>       chr12        chr13 chr13_random        chr14        chr15  chr16
>   121257530    120284312       400311    125194864    103494974 98319150
> chr16_random        chr17 chr17_random        chr18        chr19  chrX
>        3994     95272651       628739     90772031     61342430 166650296
>  chrX_random         chrY  chrY_random chrUn_random         chrM
>     1785075     15902555     58682461      5900358        16299
>
> However, those alternative require internet access...
>
> Cheers,
> H.
>
>
>>
>> For addressing if from the countOverlaps / summarizedOverlaps point of
>> view, it would help if you could post your code and sessionInfo().
>>
>> HTH,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>>
>>
>>
>>
>> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
>>
>>> Hello!
>>>
>>>
>>>
>>> I am working with cat RNA Seq data and after mapping I wanted to get the
>>> count tables. So, I tried to do it using countOverlaps and
>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R,
>>> per
>>> one sorted .bam file (~20*10^6 reads), no matter which previously
>>> mentioned
>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
>>> methods I used GRangesList object downloaded directly in R from Biomart.
>>> In
>>> HTSeq I used GTF file provided on Ensembl homepage. Average  cat gene
>>> width
>>> is about 44000 in GRangesList.
>>> Does anyone know why getting count tables in R takes so long compared to
>>> HTSeq?
>>>
>>>
>>> Thank you!
>>>
>>> Best,
>>> Milica
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



More information about the Bioconductor mailing list