[Bioc-sig-seq] BED file parser
Ivan Gregoretti
ivangreg at gmail.com
Wed Mar 9 17:18:35 CET 2011
I use BED because it uses less memory.
BAM format contains the read names, the sequences, the quality string
and more information. I do not need that. I only need chromosome name,
start, end, and strand.
So, for almost all my analyses, I start by converting my .bam to a
minimalistic .bed.gz outside R and then from R I load my tags into a
GRanges with import().
As simple as that.
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878
On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> I find simple BED files to be slow to import. I only use BED without
>> track headers. The data is derived mostly from *-seq so we are talking
>> about multiple million lines per file.
>>
>> The problem as I understand it is that the function reads one row at a
>> time. It could be much faster if it read, say, 1000 rows at a time.
>>
>
> I hope it's not reading one row at a time. It just calls read.table(), in a
> fairly efficient way, with colClasses specified, etc. Why do you have high
> throughput sequencing results in BED files? BED is really for genes. Most
> other things fit into BAM, bedGraph (which uses the same basic parser
> though), WIG, etc.
>
>>
>> I never get errors. There are no bugs to fix. It's just very slow for
>> the real world of high throughput sequencing. That's all.
>>
>> Thanks,
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>> >
>> >
>> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com>
>> > wrote:
>> >>
>> >> Just to expand a little bit Vincent's response.
>> >>
>> >> If you happen to be handling very large BED files, you probably keep
>> >> them compressed. The good news is that even in that case, you can load
>> >> them:
>> >>
>> >> lit = import("~/lit.bed.gz"."bed")
>> >>
>> >> There is still the long-standing issue of how slow the import()
>> >> function is but I am still hopeful.
>> >>
>> >
>> > This is the first I've heard of this. What sort of files are slow? Do
>> > they
>> > have a track line? The parsing gets complicated when there are track
>> > lines
>> > and multiple tracks in a file. BED is a complex format with many
>> > variants.
>> >
>> >>
>> >> Ivan
>> >>
>> >> Ivan Gregoretti, PhD
>> >> National Institute of Diabetes and Digestive and Kidney Diseases
>> >> National Institutes of Health
>> >> 5 Memorial Dr, Building 5, Room 205.
>> >> Bethesda, MD 20892. USA.
>> >> Phone: 1-301-496-1016 and 1-301-496-1592
>> >> Fax: 1-301-496-9878
>> >>
>> >>
>> >>
>> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> >> <stvjc at channing.harvard.edu> wrote:
>> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>> >> >> Hi,
>> >> >>
>> >> >> Is there a BED file parser for R?
>> >> >
>> >> > I suppose it depends on what you mean by "parser". import() from the
>> >> > rtracklayer package imports BED and constructs and populates a
>> >> > RangedData object with the contents. Here we look at a small bed
>> >> > file
>> >> > in text,
>> >> > start R, load rtracklayer, import the data, show the result, and show
>> >> > the resources used.
>> >> >
>> >> > bash-3.2$ head ~/junc716_20.bed
>> >> > chr20 55658 64827 JUNC00000001 14 + 55658 64827
>> >> > 255,0,0 2 27,25 0,9144
>> >> > chr20 55662 64821 JUNC00000002 2 - 55662 64821
>> >> > 255,0,0 2 34,8 0,9151
>> >> > chr20 135774 147029 JUNC00000003 1 - 135774
>> >> > 147029
>> >> > 255,0,0 2 8,29 0,11226
>> >> > chr20 167951 172361 JUNC00000004 1 + 167951
>> >> > 172361
>> >> > 255,0,0 2 29,8 0,4402
>> >> > chr20 189824 192113 JUNC00000005 3 + 189824
>> >> > 192113
>> >> > 255,0,0 2 33,9 0,2280
>> >> > chr20 189829 192113 JUNC00000006 3 + 189829
>> >> > 192113
>> >> > 255,0,0 2 32,9 0,2275
>> >> > chr20 193930 199576 JUNC00000007 4 - 193930
>> >> > 199576
>> >> > 255,0,0 2 28,11 0,5635
>> >> > chr20 207050 207846 JUNC00000008 2 - 207050
>> >> > 207846
>> >> > 255,0,0 2 20,34 0,762
>> >> > chr20 218306 218925 JUNC00000009 1 - 218306
>> >> > 218925
>> >> > 255,0,0 2 11,26 0,593
>> >> > chr20 221160 225070 JUNC00000010 25 - 221160
>> >> > 225070
>> >> > 255,0,0 2 29,9 0,3901
>> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> >> > bash-3.2$ R213 --vanilla --quiet
>> >> >> library(rtracklayer)
>> >> > Loading required package: RCurl
>> >> > Loading required package: bitops
>> >> >> lit = import("~/lit.bed")
>> >> >> lit
>> >> > RangedData with 10 rows and 9 value columns across 1 space
>> >> > space ranges | name score strand
>> >> > thickStart
>> >> > <character> <IRanges> | <character> <numeric> <character>
>> >> > <integer>
>> >> > 1 chr20 [ 55659, 64827] | JUNC00000001 14 +
>> >> > 55658
>> >> > 2 chr20 [ 55663, 64821] | JUNC00000002 2 -
>> >> > 55662
>> >> > 3 chr20 [135775, 147029] | JUNC00000003 1 -
>> >> > 135774
>> >> > 4 chr20 [167952, 172361] | JUNC00000004 1 +
>> >> > 167951
>> >> > 5 chr20 [189825, 192113] | JUNC00000005 3 +
>> >> > 189824
>> >> > 6 chr20 [189830, 192113] | JUNC00000006 3 +
>> >> > 189829
>> >> > 7 chr20 [193931, 199576] | JUNC00000007 4 -
>> >> > 193930
>> >> > 8 chr20 [207051, 207846] | JUNC00000008 2 -
>> >> > 207050
>> >> > 9 chr20 [218307, 218925] | JUNC00000009 1 -
>> >> > 218306
>> >> > 10 chr20 [221161, 225070] | JUNC00000010 25 -
>> >> > 221160
>> >> > thickEnd itemRgb blockCount blockSizes blockStarts
>> >> > <integer> <character> <integer> <character> <character>
>> >> > 1 64827 #FF0000 2 27,25 0,9144
>> >> > 2 64821 #FF0000 2 34,8 0,9151
>> >> > 3 147029 #FF0000 2 8,29 0,11226
>> >> > 4 172361 #FF0000 2 29,8 0,4402
>> >> > 5 192113 #FF0000 2 33,9 0,2280
>> >> > 6 192113 #FF0000 2 32,9 0,2275
>> >> > 7 199576 #FF0000 2 28,11 0,5635
>> >> > 8 207846 #FF0000 2 20,34 0,762
>> >> > 9 218925 #FF0000 2 11,26 0,593
>> >> > 10 225070 #FF0000 2 29,9 0,3901
>> >> >
>> >> >> sessionInfo()
>> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >> >
>> >> > locale:
>> >> > [1] C
>> >> >
>> >> > attached base packages:
>> >> > [1] stats graphics grDevices utils datasets methods base
>> >> >
>> >> > other attached packages:
>> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0 bitops_1.0-4.1
>> >> >
>> >> > loaded via a namespace (and not attached):
>> >> > [1] BSgenome_1.19.4 Biobase_2.11.9 Biostrings_2.19.15
>> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25 Matrix_0.999375-47
>> >> > [7] XML_3.2-0 grid_2.13.0 lattice_0.19-17
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> / Thiago Yukio Kikuchi Oliveira
>> >> >> (=\
>> >> >> \=) Faculdade de Medicina de Ribeirão Preto
>> >> >> / Laboratório de Genética Molecular e Bioinformática
>> >> >> /=)
>> >> >> -----------------------------------------------------------------
>> >> >> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>> >> >> Preto
>> >> >> / Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> >> (=\ Ribeirão Preto - São Paulo
>> >> >> \=) Fone: 55 16 2101-9300 Ramal: 9603
>> >> >> / E-mail: stratus at lgmb.fmrp.usp.br
>> >> >> /=) stratust at gmail.com
>> >> >> (=/
>> >> >> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> >> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >> >> \=)
>> >> >> /
>> >> >> -----------------------------------------------------------------
>> >> >>
>> >> >> _______________________________________________
>> >> >> Bioc-sig-sequencing mailing list
>> >> >> Bioc-sig-sequencing at r-project.org
>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >>
>> >> >
>> >> > _______________________________________________
>> >> > Bioc-sig-sequencing mailing list
>> >> > Bioc-sig-sequencing at r-project.org
>> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> >
>
>
More information about the Bioc-sig-sequencing
mailing list