[Bioc-sig-seq] BED file parser
Ivan Gregoretti
ivangreg at gmail.com
Wed Mar 9 16:33:01 CET 2011
I find that even simple BED files are slow to import. I only use BED files
without track headers. The data come mostly from *-seq experiments, so we are
talking about several million lines per file.
The problem, as I understand it, is that the function reads one row at a
time. It could be much faster if it read, say, 1000 rows at a time.
I never get errors, and there are no bugs to fix. It is just very slow for
real-world high-throughput sequencing data. That's all.
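To make the "1000 rows at a time" idea concrete, here is a minimal sketch in
base R. It is not how import() works internally, and the names
read_bed_chunked and chunk_size are made up for illustration. It assumes a
tab-delimited BED file with no track or browser lines and the same number of
columns on every line. It has not been benchmarked against import(), so any
speed difference would need to be measured.

```r
# Sketch only: read a headerless, tab-delimited BED file in fixed-size chunks.
# read_bed_chunked() and chunk_size are hypothetical names, not rtracklayer API.
read_bed_chunked <- function(path, chunk_size = 1000L) {
  # gzfile() reads gzip-compressed files and also plain uncompressed files
  con <- gzfile(path, open = "r")
  on.exit(close(con))

  chunks <- list()
  repeat {
    # Pull up to chunk_size lines from the open connection
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break

    # Parse the whole chunk in one call instead of row by row
    chunks[[length(chunks) + 1L]] <- read.table(
      text = lines, sep = "\t", header = FALSE, quote = "",
      comment.char = "", stringsAsFactors = FALSE
    )
  }

  if (length(chunks) == 0L) return(data.frame())

  # Assumes every chunk has the same number of columns
  bed <- do.call(rbind, chunks)
  # The first three BED columns are always chrom, start (0-based), end
  names(bed)[1:3] <- c("chrom", "start", "end")
  bed
}
```

Something like read_bed_chunked("~/lit.bed.gz") would then return a plain
data frame, with start still 0-based as in the file. Turning that into a
RangedData object is left out of the sketch.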
Thanks,
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878
On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> Just to expand a little bit on Vincent's response.
>>
>> If you happen to be handling very large BED files, you probably keep
>> them compressed. The good news is that even in that case, you can load
>> them:
>>
>> lit = import("~/lit.bed.gz", "bed")
>>
>> There is still the long-standing issue of how slow the import()
>> function is but I am still hopeful.
>>
>
> This is the first I've heard of this. What sort of files are slow? Do they
> have a track line? The parsing gets complicated when there are track lines
> and multiple tracks in a file. BED is a complex format with many variants.
>
>>
>> Ivan
>>
>> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> <stvjc at channing.harvard.edu> wrote:
>> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>> >> Hi,
>> >>
>> >> Is there a BED file parser for R?
>> >
>> > I suppose it depends on what you mean by "parser". import() from the
>> > rtracklayer package imports BED and constructs and populates a
>> > RangedData object with the contents. Here we look at a small BED file
>> > in text, start R, load rtracklayer, import the data, show the result,
>> > and show the resources used.
>> >
>> > bash-3.2$ head ~/junc716_20.bed
>> > chr20 55658 64827 JUNC00000001 14 + 55658 64827 255,0,0 2 27,25 0,9144
>> > chr20 55662 64821 JUNC00000002 2 - 55662 64821 255,0,0 2 34,8 0,9151
>> > chr20 135774 147029 JUNC00000003 1 - 135774 147029 255,0,0 2 8,29 0,11226
>> > chr20 167951 172361 JUNC00000004 1 + 167951 172361 255,0,0 2 29,8 0,4402
>> > chr20 189824 192113 JUNC00000005 3 + 189824 192113 255,0,0 2 33,9 0,2280
>> > chr20 189829 192113 JUNC00000006 3 + 189829 192113 255,0,0 2 32,9 0,2275
>> > chr20 193930 199576 JUNC00000007 4 - 193930 199576 255,0,0 2 28,11 0,5635
>> > chr20 207050 207846 JUNC00000008 2 - 207050 207846 255,0,0 2 20,34 0,762
>> > chr20 218306 218925 JUNC00000009 1 - 218306 218925 255,0,0 2 11,26 0,593
>> > chr20 221160 225070 JUNC00000010 25 - 221160 225070 255,0,0 2 29,9 0,3901
>> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> > bash-3.2$ R213 --vanilla --quiet
>> >> library(rtracklayer)
>> > Loading required package: RCurl
>> > Loading required package: bitops
>> >> lit = import("~/lit.bed")
>> >> lit
>> > RangedData with 10 rows and 9 value columns across 1 space
>> > space ranges | name score strand thickStart
>> > <character> <IRanges> | <character> <numeric> <character> <integer>
>> > 1 chr20 [ 55659, 64827] | JUNC00000001 14 + 55658
>> > 2 chr20 [ 55663, 64821] | JUNC00000002 2 - 55662
>> > 3 chr20 [135775, 147029] | JUNC00000003 1 - 135774
>> > 4 chr20 [167952, 172361] | JUNC00000004 1 + 167951
>> > 5 chr20 [189825, 192113] | JUNC00000005 3 + 189824
>> > 6 chr20 [189830, 192113] | JUNC00000006 3 + 189829
>> > 7 chr20 [193931, 199576] | JUNC00000007 4 - 193930
>> > 8 chr20 [207051, 207846] | JUNC00000008 2 - 207050
>> > 9 chr20 [218307, 218925] | JUNC00000009 1 - 218306
>> > 10 chr20 [221161, 225070] | JUNC00000010 25 - 221160
>> > thickEnd itemRgb blockCount blockSizes blockStarts
>> > <integer> <character> <integer> <character> <character>
>> > 1 64827 #FF0000 2 27,25 0,9144
>> > 2 64821 #FF0000 2 34,8 0,9151
>> > 3 147029 #FF0000 2 8,29 0,11226
>> > 4 172361 #FF0000 2 29,8 0,4402
>> > 5 192113 #FF0000 2 33,9 0,2280
>> > 6 192113 #FF0000 2 32,9 0,2275
>> > 7 199576 #FF0000 2 28,11 0,5635
>> > 8 207846 #FF0000 2 20,34 0,762
>> > 9 218925 #FF0000 2 11,26 0,593
>> > 10 225070 #FF0000 2 29,9 0,3901
>> >
>> >> sessionInfo()
>> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >
>> > locale:
>> > [1] C
>> >
>> > attached base packages:
>> > [1] stats graphics grDevices utils datasets methods base
>> >
>> > other attached packages:
>> > [1] rtracklayer_1.11.11 RCurl_1.5-0 bitops_1.0-4.1
>> >
>> > loaded via a namespace (and not attached):
>> > [1] BSgenome_1.19.4 Biobase_2.11.9 Biostrings_2.19.15
>> > [4] GenomicRanges_1.3.23 IRanges_1.9.25 Matrix_0.999375-47
>> > [7] XML_3.2-0 grid_2.13.0 lattice_0.19-17
>> >
>> >
>> >>
>> >>
>> >> Thanks
>> >>
>> >> / Thiago Yukio Kikuchi Oliveira
>> >> (=\
>> >> \=) Faculdade de Medicina de Ribeirão Preto
>> >> / Laboratório de Genética Molecular e Bioinformática
>> >> /=) -----------------------------------------------------------------
>> >> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib. Preto
>> >> / Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> (=\ Ribeirão Preto - São Paulo
>> >> \=) Fone: 55 16 2101-9300 Ramal: 9603
>> >> / E-mail: stratus at lgmb.fmrp.usp.br
>> >> /=) stratust at gmail.com
>> >> (=/
>> >> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >> \=)
>> >> / -----------------------------------------------------------------
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >>