[BioC] running time: countOverlaps & summarizedOverlaps vs. HTSeq

Thu Mar 29 14:39:03 CEST 2012

Hi Steve,

Yes that's true, I've been using that too, but I still encounter BAM files without an header section. Combining both your input will be a way to avoid BSgenome :-)

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On 29 Mar 2012, at 03:20, Steve Lianoglou wrote:

> Hi,
> 
> 2012/3/28 Hervé Pagès <hpages at fhcrc.org>:
>> Hi Nico, Milica,
>> 
>> 
>> On 03/28/2012 04:21 AM, Nicolas Delhomme wrote:
>>> 
>>> Hi Milica,
>>> 
>>> I do the exact same thing in my package (easyRNASeq, still in the devel
>>> branch of Bioc) and it definitely does not require 20 hours to read "only"
>>> 20 million reads. Are you sure you are not getting your machine to swap?
>>> I.e. did you monitor the memory usage?
>>> 
>>> It would be interesting (for me, at least) if you could try my package to
>>> get your count table. You can either retrieve the annotation from biomaRt or
>>> provide a GFF file. See the vignette of the package for the details and
>>> maybe these two posts on that mailing list:
>>> 
>>> https://stat.ethz.ch/pipermail/bioconductor/2012-February/043478.html
>>> https://mailman.stat.ethz.ch/pipermail/bioconductor/2012-March/044124.html
>> 
>> 
>> Slightly off-topic but probably worth mentioning is that you don't
>> need to use a BSgenome package in order to fetch the chromosome
>> lengths of an organism. For example in the code you show in the
>> 2 above posts, you could use makeTranscriptDbFromUCSC() (or
>> makeTranscriptDbFromBiomart(), both from the GenomicFeatures
>> package) to make a TranscriptDb object, and then to do seqlengths()
>> on that object (BTW it would be nice if this worked with
>> makeFeatureDbFromUCSC() too). If you don't have the BSgenome packages
>> for hg19 or mm9 already installed, that should be faster than
>> installing them.
> 
> Also, the BAM file you are using as the source of your reads has the
> seqlength info in it, eg:
> 
> R> library(Rsamtools)
> R> bf <- BamFile('/path/to/your.bam')
> R> seqlengths(bf)   ## hg19
>     chr1      chr2      chr3      chr4      chr5      chr6      chr7      chr8
> 249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022
>     chr9     chr10     chr11     chr12     chr13     chr14     chr15     chr16
> 141213431 135534747 135006516 133851895 115169878 107349540 102531392  90354753
>    chr17     chr18     chr19     chr20     chr21     chr22      chrX      chrY
> 81195210  78077248  59128983  63025520  48129895  51304566 155270560  59373566
>     chrM
>    16571
> 
> Perhaps that's a helpful piece of trivia to know, too ...
> 
> -steve
> 
> 
>> 
>> Or, even faster:
>> 
>>> library(rtracklayer)
>>> session <- browserSession()
>>> genome(session) <- "mm9"
>>> seqlengths(session)
>>        chr1  chr1_random         chr2         chr3  chr3_random   chr4
>>   197195432      1231697    181748087    159599783        41899 155630120
>>  chr4_random         chr5  chr5_random         chr6         chr7 chr7_random
>>      160594    152537259       357350    149517037    152524553 362490
>>        chr8  chr8_random         chr9  chr9_random        chr10  chr11
>>   131738871       849593    124076172       449403    129993255 121843856
>>       chr12        chr13 chr13_random        chr14        chr15  chr16
>>   121257530    120284312       400311    125194864    103494974 98319150
>> chr16_random        chr17 chr17_random        chr18        chr19  chrX
>>        3994     95272651       628739     90772031     61342430 166650296
>>  chrX_random         chrY  chrY_random chrUn_random         chrM
>>     1785075     15902555     58682461      5900358        16299
>> 
>> However, those alternative require internet access...
>> 
>> Cheers,
>> H.
>> 
>> 
>>> 
>>> For addressing if from the countOverlaps / summarizedOverlaps point of
>>> view, it would help if you could post your code and sessionInfo().
>>> 
>>> HTH,
>>> 
>>> Nico
>>> 
>>> ---------------------------------------------------------------
>>> Nicolas Delhomme
>>> 
>>> Genome Biology Computational Support
>>> 
>>> European Molecular Biology Laboratory
>>> 
>>> Tel: +49 6221 387 8310
>>> Email: nicolas.delhomme at embl.de
>>> Meyerhofstrasse 1 - Postfach 10.2209
>>> 69102 Heidelberg, Germany
>>> ---------------------------------------------------------------
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 28 Mar 2012, at 13:04, Milica Krunic wrote:
>>> 
>>>> Hello!
>>>> 
>>>> 
>>>> 
>>>> I am working with cat RNA Seq data and after mapping I wanted to get the
>>>> count tables. So, I tried to do it using countOverlaps and
>>>> summarizedOverlaps in R and HTSeq in python. I've noticed that using R,
>>>> per
>>>> one sorted .bam file (~20*10^6 reads), no matter which previously
>>>> mentioned
>>>> method I used, it takes ~20 hours. In python, it takes ~15 minutes. For R
>>>> methods I used GRangesList object downloaded directly in R from Biomart.
>>>> In
>>>> HTSeq I used GTF file provided on Ensembl homepage. Average  cat gene
>>>> width
>>>> is about 44000 in GRangesList.
>>>> Does anyone know why getting count tables in R takes so long compared to
>>>> HTSeq?
>>>> 
>>>> 
>>>> Thank you!
>>>> 
>>>> Best,
>>>> Milica
>>>> 
>>>>        [[alternative HTML version deleted]]
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
>> 
>> --
>> Hervé Pagès
>> 
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>> 
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>> 
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> 
> -- 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact