[Bioc-sig-seq] BED file parser

Martin Morgan mtmorgan at fhcrc.org
Wed Mar 9 18:37:40 CET 2011


On 03/09/2011 09:34 AM, Martin Morgan wrote:
> On 03/09/2011 08:18 AM, Ivan Gregoretti wrote:
>> I use BED because it uses less memory.
>>
>> BAM format contains the read names, the sequences, the quality string
>> and more information. I do not need that. I only need chromosome name,
>> start, end, and strand.
> 
> In Rsamtools, scanBam with param=ScanBamParam(what=c("qname", "pos",
> "qwidth", "strand")) and it'll be very fast.

'rname' rather than 'qname'.

> 
> see ?scanBam, ?ScanBamParam and the vignette.
> 
> Martin
> 
>>
>> So, for almost all my analyses, I start by converting my .bam to a
>> minimalistic .bed.gz outside R and then from R I load my tags into a
>> GRanges with import().
>>
>> As simple as that.
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>>
>>>
>>> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>>>
>>>> I find simple BED files to be slow to import. I only use BED without
>>>> track headers. The data is derived mostly from *-seq so we are talking
>>>> about multiple million lines per file.
>>>>
>>>> The problem as I understand it is that the function reads one row at a
>>>> time. It could be much faster if it read, say, 1000 rows at a time.
>>>>
>>>
>>> I hope it's not reading one row at a time. It just calls read.table(), in a
>>> fairly efficient way, with colClasses specified, etc. Why do you have high
>>> throughput sequencing results in BED files? BED is really for genes. Most
>>> other things fit into BAM, bedGraph (which uses the same basic parser
>>> though), WIG, etc.
>>>
>>>>
>>>> I never get errors. There are no bugs to fix. It's just very slow for
>>>> the real world of high throughput sequencing. That's all.
>>>>
>>>> Thanks,
>>>>
>>>> Ivan
>>>>
>>>>
>>>> Ivan Gregoretti, PhD
>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>> National Institutes of Health
>>>> 5 Memorial Dr, Building 5, Room 205.
>>>> Bethesda, MD 20892. USA.
>>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>>> Fax: 1-301-496-9878
>>>>
>>>>
>>>>
>>>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>>
>>>>>
>>>>> On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Just to expand a little bit Vincent's response.
>>>>>>
>>>>>> If you happen to be handling very large BED files, you probably keep
>>>>>> them compressed. The good news is that even in that case, you can load
>>>>>> them:
>>>>>>
>>>>>> lit = import("~/lit.bed.gz"."bed")
>>>>>>
>>>>>> There is still the long-standing issue of how slow the import()
>>>>>> function is but I am still hopeful.
>>>>>>
>>>>>
>>>>> This is the first I've heard of this. What sort of files are slow? Do
>>>>> they
>>>>> have a track line? The parsing gets complicated when there are track
>>>>> lines
>>>>> and multiple tracks in a file. BED is a complex format with many
>>>>> variants.
>>>>>
>>>>>>
>>>>>> Ivan
>>>>>>
>>>>>> Ivan Gregoretti, PhD
>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>> National Institutes of Health
>>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>>> Bethesda, MD 20892. USA.
>>>>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>>>>> Fax: 1-301-496-9878
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>>>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>>> 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there a BED file parser for R?
>>>>>>>
>>>>>>> I suppose it depends on what you mean by "parser".  import() from the
>>>>>>> rtracklayer package imports BED and constructs and populates a
>>>>>>> RangedData object with the contents.  Here we look at a small bed
>>>>>>> file
>>>>>>> in text,
>>>>>>> start R, load rtracklayer, import the data, show the result, and show
>>>>>>> the resources used.
>>>>>>>
>>>>>>> bash-3.2$ head ~/junc716_20.bed
>>>>>>> chr20   55658   64827   JUNC00000001    14      +       55658   64827
>>>>>>>  255,0,0 2       27,25   0,9144
>>>>>>> chr20   55662   64821   JUNC00000002    2       -       55662   64821
>>>>>>>  255,0,0 2       34,8    0,9151
>>>>>>> chr20   135774  147029  JUNC00000003    1       -       135774
>>>>>>>  147029
>>>>>>>  255,0,0 2       8,29    0,11226
>>>>>>> chr20   167951  172361  JUNC00000004    1       +       167951
>>>>>>>  172361
>>>>>>>  255,0,0 2       29,8    0,4402
>>>>>>> chr20   189824  192113  JUNC00000005    3       +       189824
>>>>>>>  192113
>>>>>>>  255,0,0 2       33,9    0,2280
>>>>>>> chr20   189829  192113  JUNC00000006    3       +       189829
>>>>>>>  192113
>>>>>>>  255,0,0 2       32,9    0,2275
>>>>>>> chr20   193930  199576  JUNC00000007    4       -       193930
>>>>>>>  199576
>>>>>>>  255,0,0 2       28,11   0,5635
>>>>>>> chr20   207050  207846  JUNC00000008    2       -       207050
>>>>>>>  207846
>>>>>>>  255,0,0 2       20,34   0,762
>>>>>>> chr20   218306  218925  JUNC00000009    1       -       218306
>>>>>>>  218925
>>>>>>>  255,0,0 2       11,26   0,593
>>>>>>> chr20   221160  225070  JUNC00000010    25      -       221160
>>>>>>>  225070
>>>>>>>  255,0,0 2       29,9    0,3901
>>>>>>> bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>>>>>>> bash-3.2$ R213 --vanilla --quiet
>>>>>>>> library(rtracklayer)
>>>>>>> Loading required package: RCurl
>>>>>>> Loading required package: bitops
>>>>>>>> lit = import("~/lit.bed")
>>>>>>>> lit
>>>>>>> RangedData with 10 rows and 9 value columns across 1 space
>>>>>>>         space           ranges |         name     score      strand
>>>>>>> thickStart
>>>>>>>   <character>        <IRanges> |  <character> <numeric> <character>
>>>>>>>  <integer>
>>>>>>> 1        chr20 [ 55659,  64827] | JUNC00000001        14           +
>>>>>>>  55658
>>>>>>> 2        chr20 [ 55663,  64821] | JUNC00000002         2           -
>>>>>>>  55662
>>>>>>> 3        chr20 [135775, 147029] | JUNC00000003         1           -
>>>>>>> 135774
>>>>>>> 4        chr20 [167952, 172361] | JUNC00000004         1           +
>>>>>>> 167951
>>>>>>> 5        chr20 [189825, 192113] | JUNC00000005         3           +
>>>>>>> 189824
>>>>>>> 6        chr20 [189830, 192113] | JUNC00000006         3           +
>>>>>>> 189829
>>>>>>> 7        chr20 [193931, 199576] | JUNC00000007         4           -
>>>>>>> 193930
>>>>>>> 8        chr20 [207051, 207846] | JUNC00000008         2           -
>>>>>>> 207050
>>>>>>> 9        chr20 [218307, 218925] | JUNC00000009         1           -
>>>>>>> 218306
>>>>>>> 10       chr20 [221161, 225070] | JUNC00000010        25           -
>>>>>>> 221160
>>>>>>>    thickEnd     itemRgb blockCount  blockSizes blockStarts
>>>>>>>   <integer> <character>  <integer> <character> <character>
>>>>>>> 1      64827     #FF0000          2       27,25      0,9144
>>>>>>> 2      64821     #FF0000          2        34,8      0,9151
>>>>>>> 3     147029     #FF0000          2        8,29     0,11226
>>>>>>> 4     172361     #FF0000          2        29,8      0,4402
>>>>>>> 5     192113     #FF0000          2        33,9      0,2280
>>>>>>> 6     192113     #FF0000          2        32,9      0,2275
>>>>>>> 7     199576     #FF0000          2       28,11      0,5635
>>>>>>> 8     207846     #FF0000          2       20,34       0,762
>>>>>>> 9     218925     #FF0000          2       11,26       0,593
>>>>>>> 10    225070     #FF0000          2        29,9      0,3901
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>>>>>>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
>>>>>>> [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
>>>>>>> [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>>     /    Thiago Yukio Kikuchi Oliveira
>>>>>>>> (=\
>>>>>>>>   \=) Faculdade de Medicina de Ribeirão Preto
>>>>>>>>    /   Laboratório de Genética Molecular e Bioinformática
>>>>>>>>   /=)
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> (=/   Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>>>>>>>> Preto
>>>>>>>>   /    Rua Tenente Catão Roxo, 2501 CEP 14151-140
>>>>>>>> (=\   Ribeirão Preto - São Paulo
>>>>>>>>   \=) Fone: 55 16 2101-9300   Ramal: 9603
>>>>>>>>    /   E-mail: stratus at lgmb.fmrp.usp.br
>>>>>>>>   /=)            stratust at gmail.com
>>>>>>>> (=/
>>>>>>>>   /    Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>>>>>>>> (=\   Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>>>>>>>>   \=)
>>>>>>>>    /
>>>>>>>> -----------------------------------------------------------------
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>>
>>>
>>>
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-sig-sequencing mailing list