[BioC] rtracklayer: import.gff seems to be very slow
Michael Dondrup
Michael.Dondrup at uni.no
Wed Oct 20 16:04:48 CEST 2010
Hi,
I just installed R 2.12.0, Bioconductor 2.7, and rtracklayer 1.10, and I can confirm that there is a major improvement
in the speed of import.gff.
Thanks a lot for this fix.
Michael
On Oct 16, 2010, at 6:39 AM, Michael Lawrence wrote:
> Wow, thanks for a serious test file. There were some bugs and some rather interesting performance issues.
>
> For example, I've discovered that gregexpr with fixed=TRUE is quadratic time with respect to string length (gets real bad up in the millions). Haven't been able to figure out why. This makes fixed=FALSE much quicker. Counterintuitive. substring() is also surprisingly slow.
>
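For anyone who wants to check the gregexpr observation on their own setup, here is a minimal timing sketch. The test string and its size are my own arbitrary choices, not the data from this thread, and the gap between the two calls may well differ (or vanish) depending on the R version and encoding.

```r
# Build one long "key=value;key=value;..." string.
# n is an arbitrary size chosen for illustration; raise it to probe scaling.
n <- 20000
x <- paste(rep("ID=gene0001;Name=abc", n), collapse = ";")

# Locate every ";" separator, once with fixed matching and once as a regex.
t_fixed <- system.time(m_fixed <- gregexpr(";", x, fixed = TRUE))
t_regex <- system.time(m_regex <- gregexpr(";", x, fixed = FALSE))

# Both calls should report the same match positions.
stopifnot(identical(as.integer(m_fixed[[1]]), as.integer(m_regex[[1]])))

# Elapsed seconds for each variant.
print(c(fixed = t_fixed[["elapsed"]], regex = t_regex[["elapsed"]]))
```

Repeating this with n doubled a few times should show whether either variant grows faster than linearly with string length.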
> Anyway, try the latest SVN. Or version 1.9.12.
>
> Still much slower than read.delim. It's the attributes in the last column (being translated to columns in R) that are so costly, and that one has them in significant quantity. I guess I could give an option to disable that parsing (or in general select the desired columns, as suggested previously), but it should be much quicker for you now.
>
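To make the two costs concrete, here is a rough sketch of reading a GFF3 file with read.delim and then expanding the attribute column into separate columns. This is not rtracklayer's implementation. The helper parse_attributes() and the two-line sample file are hypothetical, made up purely for illustration, and the sketch ignores GFF3 details such as URL-escaped characters and multi-valued attributes.

```r
# A tiny, made-up GFF3 fragment written to a temporary file.
gff_lines <- c(
  "##gff-version 3",
  "seq1\texample\tgene\t100\t900\t.\t+\t.\tID=gene0;Name=geneA;locus_tag=TAG0001",
  "seq1\texample\tCDS\t100\t900\t.\t+\t0\tID=cds0;Parent=gene0;product=hypothetical protein"
)
path <- tempfile(fileext = ".gff")
writeLines(gff_lines, path)

# Step 1 (fast): read the nine tab-separated columns, skipping "#" lines.
gff <- read.delim(path, header = FALSE, comment.char = "#", quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("seqid", "source", "type", "start", "end",
                                "score", "strand", "phase", "attributes"))

# Step 2 (costly): turn "key=value;key=value" strings into one column per key.
# parse_attributes() is a hypothetical helper, not part of any package.
parse_attributes <- function(attr_strings) {
  pairs <- strsplit(attr_strings, ";", fixed = TRUE)
  keys <- unique(unlist(lapply(pairs, function(p) sub("=.*$", "", p))))
  out <- matrix(NA_character_, nrow = length(pairs), ncol = length(keys),
                dimnames = list(NULL, keys))
  for (i in seq_along(pairs)) {
    p <- pairs[[i]]
    # Key is everything before the first "=", value is everything after it.
    out[i, sub("=.*$", "", p)] <- sub("^[^=]*=", "", p)
  }
  as.data.frame(out, stringsAsFactors = FALSE)
}

attrs <- parse_attributes(gff$attributes)
annotated <- cbind(gff[, 1:8], attrs)
print(annotated[, c("type", "start", "end", "ID", "Name", "Parent")])
```

Keys missing from a record end up as NA, so the more distinct attribute keys a file uses, the wider and sparser the result gets, which is consistent with this step dominating the run time.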
> Thanks again,
> Michael
>
> On Fri, Oct 15, 2010 at 2:40 AM, Michael Dondrup <Michael.Dondrup at uni.no> wrote:
> Hi,
>
> I am trying to read in a genome annotation from a GFF3 file from NCBI [1]
> The file is about 7.5 MB and has ~17000 non-comment lines. While I can read the file
> with read.delim in less than a second, trying
> bsub = import.gff("~/Downloads/bsubtilis.gff")
> is very slow. I would rather use a standardized function from the package
> that understands various formats, but currently I cannot use it for whole genome
> annotation. Could this be improved, or is the file format incorrect?
>
> Best
> Michael
>
>
> [1]: ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.gff
>
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.8.0 Biostrings_2.16.0 BSgenome_1.16.1
> [4] GenomicRanges_1.0.9 IRanges_1.6.6 XML_3.1-0
> >
>
Michael Dondrup
Post-doctoral researcher
Uni BCCS
Thormøhlensgate 55, N-5008 Bergen, Norway
Phone: +47 55584157 Fax: +47 55584354
Please note my new phone number