[Bioc-sig-seq] BED file parser

Ivan Gregoretti ivangreg at gmail.com
Wed Mar 9 16:33:01 CET 2011


I find simple BED files to be slow to import. I only use BED without
track headers. The data is derived mostly from *-seq so we are talking
about multiple million lines per file.

The problem as I understand it is that the function reads one row at a
time. It could be much faster if it read, say, 1000 rows at a time.

I never get errors. There are no bugs to fix. It's just very slow for
the real world of high throughput sequencing. That's all.

Thanks,

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878



On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> Just to expand a little bit Vincent's response.
>>
>> If you happen to be handling very large BED files, you probably keep
>> them compressed. The good news is that even in that case, you can load
>> them:
>>
>> lit = import("~/lit.bed.gz"."bed")
>>
>> There is still the long-standing issue of how slow the import()
>> function is but I am still hopeful.
>>
>
> This is the first I've heard of this. What sort of files are slow? Do they
> have a track line? The parsing gets complicated when there are track lines
> and multiple tracks in a file. BED is a complex format with many variants.
>
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> <stvjc at channing.harvard.edu> wrote:
>> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>> >> Hi,
>> >>
>> >> Is there a BED file parser for R?
>> >
>> > I suppose it depends on what you mean by "parser".  import() from the
>> > rtracklayer package imports BED and constructs and populates a
>> > RangedData object with the contents.  Here we look at a small bed file
>> > in text,
>> > start R, load rtracklayer, import the data, show the result, and show
>> > the resources used.
>> >
>> > bash-3.2$ head ~/junc716_20.bed
>> > chr20   55658   64827   JUNC00000001    14      +       55658   64827
>> >  255,0,0 2       27,25   0,9144
>> > chr20   55662   64821   JUNC00000002    2       -       55662   64821
>> >  255,0,0 2       34,8    0,9151
>> > chr20   135774  147029  JUNC00000003    1       -       135774  147029
>> >  255,0,0 2       8,29    0,11226
>> > chr20   167951  172361  JUNC00000004    1       +       167951  172361
>> >  255,0,0 2       29,8    0,4402
>> > chr20   189824  192113  JUNC00000005    3       +       189824  192113
>> >  255,0,0 2       33,9    0,2280
>> > chr20   189829  192113  JUNC00000006    3       +       189829  192113
>> >  255,0,0 2       32,9    0,2275
>> > chr20   193930  199576  JUNC00000007    4       -       193930  199576
>> >  255,0,0 2       28,11   0,5635
>> > chr20   207050  207846  JUNC00000008    2       -       207050  207846
>> >  255,0,0 2       20,34   0,762
>> > chr20   218306  218925  JUNC00000009    1       -       218306  218925
>> >  255,0,0 2       11,26   0,593
>> > chr20   221160  225070  JUNC00000010    25      -       221160  225070
>> >  255,0,0 2       29,9    0,3901
>> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> > bash-3.2$ R213 --vanilla --quiet
>> >> library(rtracklayer)
>> > Loading required package: RCurl
>> > Loading required package: bitops
>> >> lit = import("~/lit.bed")
>> >> lit
>> > RangedData with 10 rows and 9 value columns across 1 space
>> >         space           ranges |         name     score      strand
>> > thickStart
>> >   <character>        <IRanges> |  <character> <numeric> <character>
>> >  <integer>
>> > 1        chr20 [ 55659,  64827] | JUNC00000001        14           +
>> >  55658
>> > 2        chr20 [ 55663,  64821] | JUNC00000002         2           -
>> >  55662
>> > 3        chr20 [135775, 147029] | JUNC00000003         1           -
>> > 135774
>> > 4        chr20 [167952, 172361] | JUNC00000004         1           +
>> > 167951
>> > 5        chr20 [189825, 192113] | JUNC00000005         3           +
>> > 189824
>> > 6        chr20 [189830, 192113] | JUNC00000006         3           +
>> > 189829
>> > 7        chr20 [193931, 199576] | JUNC00000007         4           -
>> > 193930
>> > 8        chr20 [207051, 207846] | JUNC00000008         2           -
>> > 207050
>> > 9        chr20 [218307, 218925] | JUNC00000009         1           -
>> > 218306
>> > 10       chr20 [221161, 225070] | JUNC00000010        25           -
>> > 221160
>> >    thickEnd     itemRgb blockCount  blockSizes blockStarts
>> >   <integer> <character>  <integer> <character> <character>
>> > 1      64827     #FF0000          2       27,25      0,9144
>> > 2      64821     #FF0000          2        34,8      0,9151
>> > 3     147029     #FF0000          2        8,29     0,11226
>> > 4     172361     #FF0000          2        29,8      0,4402
>> > 5     192113     #FF0000          2        33,9      0,2280
>> > 6     192113     #FF0000          2        32,9      0,2275
>> > 7     199576     #FF0000          2       28,11      0,5635
>> > 8     207846     #FF0000          2       20,34       0,762
>> > 9     218925     #FF0000          2       11,26       0,593
>> > 10    225070     #FF0000          2        29,9      0,3901
>> >
>> >> sessionInfo()
>> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >
>> > locale:
>> > [1] C
>> >
>> > attached base packages:
>> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> >
>> > other attached packages:
>> > [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
>> >
>> > loaded via a namespace (and not attached):
>> > [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
>> > [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
>> > [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
>> >
>> >
>> >>
>> >>
>> >> Thanks
>> >>
>> >>     /    Thiago Yukio Kikuchi Oliveira
>> >> (=\
>> >>   \=) Faculdade de Medicina de Ribeirão Preto
>> >>    /   Laboratório de Genética Molecular e Bioinformática
>> >>   /=) -----------------------------------------------------------------
>> >> (=/   Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib. Preto
>> >>   /    Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> (=\   Ribeirão Preto - São Paulo
>> >>   \=) Fone: 55 16 2101-9300   Ramal: 9603
>> >>    /   E-mail: stratus at lgmb.fmrp.usp.br
>> >>   /=)            stratust at gmail.com
>> >> (=/
>> >>   /    Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> (=\   Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >>   \=)
>> >>    /  -----------------------------------------------------------------
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >>
>> >
>> > _______________________________________________
>> > Bioc-sig-sequencing mailing list
>> > Bioc-sig-sequencing at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>



More information about the Bioc-sig-sequencing mailing list