[Bioc-sig-seq] BED file parser
Martin Morgan
mtmorgan at fhcrc.org
Wed Mar 9 18:34:16 CET 2011
On 03/09/2011 08:18 AM, Ivan Gregoretti wrote:
> I use BED because it uses less memory.
>
> BAM format contains the read names, the sequences, the quality string
> and more information. I do not need that. I only need chromosome name,
> start, end, and strand.
In Rsamtools, scanBam with param=ScanBamParam(what=c("qname", "pos",
"qwidth", "strand")) and it'll be very fast.
see ?scanBam, ?ScanBamParam and the vignette.
Martin
>
> So, for almost all my analyses, I start by converting my .bam to a
> minimalistic .bed.gz outside R and then from R I load my tags into a
> GRanges with import().
>
> As simple as that.
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
> <lawrence.michael at gene.com> wrote:
>>
>>
>> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>>
>>> I find simple BED files to be slow to import. I only use BED without
>>> track headers. The data is derived mostly from *-seq so we are talking
>>> about multiple million lines per file.
>>>
>>> The problem as I understand it is that the function reads one row at a
>>> time. It could be much faster if it read, say, 1000 rows at a time.
>>>
>>
>> I hope it's not reading one row at a time. It just calls read.table(), in a
>> fairly efficient way, with colClasses specified, etc. Why do you have high
>> throughput sequencing results in BED files? BED is really for genes. Most
>> other things fit into BAM, bedGraph (which uses the same basic parser
>> though), WIG, etc.
>>
>>>
>>> I never get errors. There are no bugs to fix. It's just very slow for
>>> the real world of high throughput sequencing. That's all.
>>>
>>> Thanks,
>>>
>>> Ivan
>>>
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>>
>>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>>> <lawrence.michael at gene.com> wrote:
>>>>
>>>>
>>>> On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com>
>>>> wrote:
>>>>>
>>>>> Just to expand a little bit Vincent's response.
>>>>>
>>>>> If you happen to be handling very large BED files, you probably keep
>>>>> them compressed. The good news is that even in that case, you can load
>>>>> them:
>>>>>
>>>>> lit = import("~/lit.bed.gz"."bed")
>>>>>
>>>>> There is still the long-standing issue of how slow the import()
>>>>> function is but I am still hopeful.
>>>>>
>>>>
>>>> This is the first I've heard of this. What sort of files are slow? Do
>>>> they
>>>> have a track line? The parsing gets complicated when there are track
>>>> lines
>>>> and multiple tracks in a file. BED is a complex format with many
>>>> variants.
>>>>
>>>>>
>>>>> Ivan
>>>>>
>>>>> Ivan Gregoretti, PhD
>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>> National Institutes of Health
>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>> Bethesda, MD 20892. USA.
>>>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>>>> Fax: 1-301-496-9878
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>>>>> <stvjc at channing.harvard.edu> wrote:
>>>>>> 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is there a BED file parser for R?
>>>>>>
>>>>>> I suppose it depends on what you mean by "parser". import() from the
>>>>>> rtracklayer package imports BED and constructs and populates a
>>>>>> RangedData object with the contents. Here we look at a small bed
>>>>>> file
>>>>>> in text,
>>>>>> start R, load rtracklayer, import the data, show the result, and show
>>>>>> the resources used.
>>>>>>
>>>>>> bash-3.2$ head ~/junc716_20.bed
>>>>>> chr20 55658 64827 JUNC00000001 14 + 55658 64827
>>>>>> 255,0,0 2 27,25 0,9144
>>>>>> chr20 55662 64821 JUNC00000002 2 - 55662 64821
>>>>>> 255,0,0 2 34,8 0,9151
>>>>>> chr20 135774 147029 JUNC00000003 1 - 135774
>>>>>> 147029
>>>>>> 255,0,0 2 8,29 0,11226
>>>>>> chr20 167951 172361 JUNC00000004 1 + 167951
>>>>>> 172361
>>>>>> 255,0,0 2 29,8 0,4402
>>>>>> chr20 189824 192113 JUNC00000005 3 + 189824
>>>>>> 192113
>>>>>> 255,0,0 2 33,9 0,2280
>>>>>> chr20 189829 192113 JUNC00000006 3 + 189829
>>>>>> 192113
>>>>>> 255,0,0 2 32,9 0,2275
>>>>>> chr20 193930 199576 JUNC00000007 4 - 193930
>>>>>> 199576
>>>>>> 255,0,0 2 28,11 0,5635
>>>>>> chr20 207050 207846 JUNC00000008 2 - 207050
>>>>>> 207846
>>>>>> 255,0,0 2 20,34 0,762
>>>>>> chr20 218306 218925 JUNC00000009 1 - 218306
>>>>>> 218925
>>>>>> 255,0,0 2 11,26 0,593
>>>>>> chr20 221160 225070 JUNC00000010 25 - 221160
>>>>>> 225070
>>>>>> 255,0,0 2 29,9 0,3901
>>>>>> bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>>>>>> bash-3.2$ R213 --vanilla --quiet
>>>>>>> library(rtracklayer)
>>>>>> Loading required package: RCurl
>>>>>> Loading required package: bitops
>>>>>>> lit = import("~/lit.bed")
>>>>>>> lit
>>>>>> RangedData with 10 rows and 9 value columns across 1 space
>>>>>> space ranges | name score strand
>>>>>> thickStart
>>>>>> <character> <IRanges> | <character> <numeric> <character>
>>>>>> <integer>
>>>>>> 1 chr20 [ 55659, 64827] | JUNC00000001 14 +
>>>>>> 55658
>>>>>> 2 chr20 [ 55663, 64821] | JUNC00000002 2 -
>>>>>> 55662
>>>>>> 3 chr20 [135775, 147029] | JUNC00000003 1 -
>>>>>> 135774
>>>>>> 4 chr20 [167952, 172361] | JUNC00000004 1 +
>>>>>> 167951
>>>>>> 5 chr20 [189825, 192113] | JUNC00000005 3 +
>>>>>> 189824
>>>>>> 6 chr20 [189830, 192113] | JUNC00000006 3 +
>>>>>> 189829
>>>>>> 7 chr20 [193931, 199576] | JUNC00000007 4 -
>>>>>> 193930
>>>>>> 8 chr20 [207051, 207846] | JUNC00000008 2 -
>>>>>> 207050
>>>>>> 9 chr20 [218307, 218925] | JUNC00000009 1 -
>>>>>> 218306
>>>>>> 10 chr20 [221161, 225070] | JUNC00000010 25 -
>>>>>> 221160
>>>>>> thickEnd itemRgb blockCount blockSizes blockStarts
>>>>>> <integer> <character> <integer> <character> <character>
>>>>>> 1 64827 #FF0000 2 27,25 0,9144
>>>>>> 2 64821 #FF0000 2 34,8 0,9151
>>>>>> 3 147029 #FF0000 2 8,29 0,11226
>>>>>> 4 172361 #FF0000 2 29,8 0,4402
>>>>>> 5 192113 #FF0000 2 33,9 0,2280
>>>>>> 6 192113 #FF0000 2 32,9 0,2275
>>>>>> 7 199576 #FF0000 2 28,11 0,5635
>>>>>> 8 207846 #FF0000 2 20,34 0,762
>>>>>> 9 218925 #FF0000 2 11,26 0,593
>>>>>> 10 225070 #FF0000 2 29,9 0,3901
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>>>>>> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>>>>>>
>>>>>> locale:
>>>>>> [1] C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> other attached packages:
>>>>>> [1] rtracklayer_1.11.11 RCurl_1.5-0 bitops_1.0-4.1
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] BSgenome_1.19.4 Biobase_2.11.9 Biostrings_2.19.15
>>>>>> [4] GenomicRanges_1.3.23 IRanges_1.9.25 Matrix_0.999375-47
>>>>>> [7] XML_3.2-0 grid_2.13.0 lattice_0.19-17
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> / Thiago Yukio Kikuchi Oliveira
>>>>>>> (=\
>>>>>>> \=) Faculdade de Medicina de Ribeirão Preto
>>>>>>> / Laboratório de Genética Molecular e Bioinformática
>>>>>>> /=)
>>>>>>> -----------------------------------------------------------------
>>>>>>> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>>>>>>> Preto
>>>>>>> / Rua Tenente Catão Roxo, 2501 CEP 14151-140
>>>>>>> (=\ Ribeirão Preto - São Paulo
>>>>>>> \=) Fone: 55 16 2101-9300 Ramal: 9603
>>>>>>> / E-mail: stratus at lgmb.fmrp.usp.br
>>>>>>> /=) stratust at gmail.com
>>>>>>> (=/
>>>>>>> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>>>>>>> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>>>>>>> \=)
>>>>>>> /
>>>>>>> -----------------------------------------------------------------
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>
>>>>
>>
>>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioc-sig-sequencing
mailing list