[Bioc-sig-seq] BED file parser
Jonathan Cairns
Jonathan.Cairns at cancer.org.uk
Thu Mar 10 15:08:44 CET 2011
FWIW, If you need only the chr, start, end and strand information from a BED file, the function read.bed() in the BayesPeak package does this. Moreover it is usually around twice as fast as import() because it ignores the rest of the columns. However, it cannot yet handle the compressed .bed.gz format.
Perhaps this would be a useful option in import.bed()? From my experience, use of BED files is not that uncommon for ChIP-seq reads.
Jonathan
________________________________________
From: bioc-sig-sequencing-bounces at r-project.org [bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Ivan Gregoretti [ivangreg at gmail.com]
Sent: 09 March 2011 16:18
To: Michael Lawrence
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] BED file parser
I use BED because it uses less memory.
BAM format contains the read names, the sequences, the quality string
and more information. I do not need that. I only need chromosome name,
start, end, and strand.
So, for almost all my analyses, I start by converting my .bam to a
minimalistic .bed.gz outside R and then from R I load my tags into a
GRanges with import().
As simple as that.
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878
On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> I find simple BED files to be slow to import. I only use BED without
>> track headers. The data is derived mostly from *-seq so we are talking
>> about multiple million lines per file.
>>
>> The problem as I understand it is that the function reads one row at a
>> time. It could be much faster if it read, say, 1000 rows at a time.
>>
>
> I hope it's not reading one row at a time. It just calls read.table(), in a
> fairly efficient way, with colClasses specified, etc. Why do you have high
> throughput sequencing results in BED files? BED is really for genes. Most
> other things fit into BAM, bedGraph (which uses the same basic parser
> though), WIG, etc.
>
>>
>> I never get errors. There are no bugs to fix. It's just very slow for
>> the real world of high throughput sequencing. That's all.
>>
>> Thanks,
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>> >
>> >
>> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com>
>> > wrote:
>> >>
>> >> Just to expand a little bit Vincent's response.
>> >>
>> >> If you happen to be handling very large BED files, you probably keep
>> >> them compressed. The good news is that even in that case, you can load
>> >> them:
>> >>
>> >> lit = import("~/lit.bed.gz"."bed")
>> >>
>> >> There is still the long-standing issue of how slow the import()
>> >> function is but I am still hopeful.
>> >>
>> >
>> > This is the first I've heard of this. What sort of files are slow? Do
>> > they
>> > have a track line? The parsing gets complicated when there are track
>> > lines
>> > and multiple tracks in a file. BED is a complex format with many
>> > variants.
>> >
>> >>
>> >> Ivan
>> >>
>> >> Ivan Gregoretti, PhD
>> >> National Institute of Diabetes and Digestive and Kidney Diseases
>> >> National Institutes of Health
>> >> 5 Memorial Dr, Building 5, Room 205.
>> >> Bethesda, MD 20892. USA.
>> >> Phone: 1-301-496-1016 and 1-301-496-1592
>> >> Fax: 1-301-496-9878
>> >>
>> >>
>> >>
>> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> >> <stvjc at channing.harvard.edu> wrote:
>> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>> >> >> Hi,
>> >> >>
>> >> >> Is there a BED file parser for R?
>> >> >
>> >> > I suppose it depends on what you mean by "parser". import() from the
>> >> > rtracklayer package imports BED and constructs and populates a
>> >> > RangedData object with the contents. Here we look at a small bed
>> >> > file
>> >> > in text,
>> >> > start R, load rtracklayer, import the data, show the result, and show
>> >> > the resources used.
>> >> >
>> >> > bash-3.2$ head ~/junc716_20.bed
>> >> > chr20 55658 64827 JUNC00000001 14 + 55658 64827
>> >> > 255,0,0 2 27,25 0,9144
>> >> > chr20 55662 64821 JUNC00000002 2 - 55662 64821
>> >> > 255,0,0 2 34,8 0,9151
>> >> > chr20 135774 147029 JUNC00000003 1 - 135774
>> >> > 147029
>> >> > 255,0,0 2 8,29 0,11226
>> >> > chr20 167951 172361 JUNC00000004 1 + 167951
>> >> > 172361
>> >> > 255,0,0 2 29,8 0,4402
>> >> > chr20 189824 192113 JUNC00000005 3 + 189824
>> >> > 192113
>> >> > 255,0,0 2 33,9 0,2280
>> >> > chr20 189829 192113 JUNC00000006 3 + 189829
>> >> > 192113
>> >> > 255,0,0 2 32,9 0,2275
>> >> > chr20 193930 199576 JUNC00000007 4 - 193930
>> >> > 199576
>> >> > 255,0,0 2 28,11 0,5635
>> >> > chr20 207050 207846 JUNC00000008 2 - 207050
>> >> > 207846
>> >> > 255,0,0 2 20,34 0,762
>> >> > chr20 218306 218925 JUNC00000009 1 - 218306
>> >> > 218925
>> >> > 255,0,0 2 11,26 0,593
>> >> > chr20 221160 225070 JUNC00000010 25 - 221160
>> >> > 225070
>> >> > 255,0,0 2 29,9 0,3901
>> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> >> > bash-3.2$ R213 --vanilla --quiet
>> >> >> library(rtracklayer)
>> >> > Loading required package: RCurl
>> >> > Loading required package: bitops
>> >> >> lit = import("~/lit.bed")
>> >> >> lit
>> >> > RangedData with 10 rows and 9 value columns across 1 space
>> >> > space ranges | name score strand
>> >> > thickStart
>> >> > <character> <IRanges> | <character> <numeric> <character>
>> >> > <integer>
>> >> > 1 chr20 [ 55659, 64827] | JUNC00000001 14 +
>> >> > 55658
>> >> > 2 chr20 [ 55663, 64821] | JUNC00000002 2 -
>> >> > 55662
>> >> > 3 chr20 [135775, 147029] | JUNC00000003 1 -
>> >> > 135774
>> >> > 4 chr20 [167952, 172361] | JUNC00000004 1 +
>> >> > 167951
>> >> > 5 chr20 [189825, 192113] | JUNC00000005 3 +
>> >> > 189824
>> >> > 6 chr20 [189830, 192113] | JUNC00000006 3 +
>> >> > 189829
>> >> > 7 chr20 [193931, 199576] | JUNC00000007 4 -
>> >> > 193930
>> >> > 8 chr20 [207051, 207846] | JUNC00000008 2 -
>> >> > 207050
>> >> > 9 chr20 [218307, 218925] | JUNC00000009 1 -
>> >> > 218306
>> >> > 10 chr20 [221161, 225070] | JUNC00000010 25 -
>> >> > 221160
>> >> > thickEnd itemRgb blockCount blockSizes blockStarts
>> >> > <integer> <character> <integer> <character> <character>
>> >> > 1 64827 #FF0000 2 27,25 0,9144
>> >> > 2 64821 #FF0000 2 34,8 0,9151
>> >> > 3 147029 #FF0000 2 8,29 0,11226
>> >> > 4 172361 #FF0000 2 29,8 0,4402
>> >> > 5 192113 #FF0000 2 33,9 0,2280
>> >> > 6 192113 #FF0000 2 32,9 0,2275
>> >> > 7 199576 #FF0000 2 28,11 0,5635
>> >> > 8 207846 #FF0000 2 20,34 0,762
>> >> > 9 218925 #FF0000 2 11,26 0,593
>> >> > 10 225070 #FF0000 2 29,9 0,3901
>> >> >
>> >> >> sessionInfo()
>> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >> >
>> >> > locale:
>> >> > [1] C
>> >> >
>> >> > attached base packages:
>> >> > [1] stats graphics grDevices utils datasets methods base
>> >> >
>> >> > other attached packages:
>> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0 bitops_1.0-4.1
>> >> >
>> >> > loaded via a namespace (and not attached):
>> >> > [1] BSgenome_1.19.4 Biobase_2.11.9 Biostrings_2.19.15
>> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25 Matrix_0.999375-47
>> >> > [7] XML_3.2-0 grid_2.13.0 lattice_0.19-17
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >> / Thiago Yukio Kikuchi Oliveira
>> >> >> (=\
>> >> >> \=) Faculdade de Medicina de Ribeirão Preto
>> >> >> / Laboratório de Genética Molecular e Bioinformática
>> >> >> /=)
>> >> >> -----------------------------------------------------------------
>> >> >> (=/ Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>> >> >> Preto
>> >> >> / Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> >> (=\ Ribeirão Preto - São Paulo
>> >> >> \=) Fone: 55 16 2101-9300 Ramal: 9603
>> >> >> / E-mail: stratus at lgmb.fmrp.usp.br
>> >> >> /=) stratust at gmail.com
>> >> >> (=/
>> >> >> / Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> >> (=\ Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >> >> \=)
>> >> >> /
>> >> >> -----------------------------------------------------------------
>> >> >>
>> >> >> _______________________________________________
>> >> >> Bioc-sig-sequencing mailing list
>> >> >> Bioc-sig-sequencing at r-project.org
>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >>
>> >> >
>> >> > _______________________________________________
>> >> > Bioc-sig-sequencing mailing list
>> >> > Bioc-sig-sequencing at r-project.org
>> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> >
>
>
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
This communication is from Cancer Research UK. Our website is at www.cancerresearchuk.org. We are a registered charity in England and Wales (1089464) and in Scotland (SC041666) and a company limited by guarantee registered in England and Wales under number 4325234. Our registered address is Angel Building, 407 St John Street, London, EC1V 4AD. Our central telephone number is 020 7242 0200.
This communication and any attachments contain information which is confidential and may also be privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) please note that any form of disclosure, distribution, copying or use of this communication or the information in it or in any attachments is strictly prohibited and may be unlawful. If you have received this communication in error, please notify the sender and delete the email and destroy any copies of it.
E-mail communications cannot be guaranteed to be secure or error free, as information could be intercepted, corrupted, amended, lost, destroyed, arrive late or incomplete, or contain viruses. We do not accept liability for any such matters or their consequences. Anyone who communicates with us by e-mail is taken to accept the risks in doing so.
More information about the Bioc-sig-sequencing
mailing list