[Bioc-sig-seq] BED file parser

Thu Mar 10 15:08:44 CET 2011

FWIW, If you need only the chr, start, end and strand information from a BED file, the function read.bed() in the BayesPeak package does this. Moreover it is usually around twice as fast as import() because it ignores the rest of the columns. However, it cannot yet handle the compressed .bed.gz format.

Perhaps this would be a useful option in import.bed()? From my experience, use of BED files is not that uncommon for ChIP-seq reads.

Jonathan
________________________________________
From: bioc-sig-sequencing-bounces at r-project.org [bioc-sig-sequencing-bounces at r-project.org] On Behalf Of Ivan Gregoretti [ivangreg at gmail.com]
Sent: 09 March 2011 16:18
To: Michael Lawrence
Cc: bioc-sig-sequencing at r-project.org
Subject: Re: [Bioc-sig-seq] BED file parser

I use BED because it uses less memory.

BAM format contains the read names, the sequences, the quality string
and more information. I do not need that. I only need chromosome name,
start, end, and strand.

So, for almost all my analyses, I start by converting my .bam to a
minimalistic .bed.gz outside R and then from R I load my tags into a
GRanges with import().

As simple as that.

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

On Wed, Mar 9, 2011 at 10:51 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
>
>
> On Wed, Mar 9, 2011 at 7:33 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> I find simple BED files to be slow to import. I only use BED without
>> track headers. The data is derived mostly from *-seq so we are talking
>> about multiple million lines per file.
>>
>> The problem as I understand it is that the function reads one row at a
>> time. It could be much faster if it read, say, 1000 rows at a time.
>>
>
> I hope it's not reading one row at a time. It just calls read.table(), in a
> fairly efficient way, with colClasses specified, etc. Why do you have high
> throughput sequencing results in BED files? BED is really for genes. Most
> other things fit into BAM, bedGraph (which uses the same basic parser
> though), WIG, etc.
>
>>
>> I never get errors. There are no bugs to fix. It's just very slow for
>> the real world of high throughput sequencing. That's all.
>>
>> Thanks,
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Wed, Mar 9, 2011 at 10:21 AM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>> >
>> >
>> > On Wed, Mar 9, 2011 at 6:41 AM, Ivan Gregoretti <ivangreg at gmail.com>
>> > wrote:
>> >>
>> >> Just to expand a little bit Vincent's response.
>> >>
>> >> If you happen to be handling very large BED files, you probably keep
>> >> them compressed. The good news is that even in that case, you can load
>> >> them:
>> >>
>> >> lit = import("~/lit.bed.gz"."bed")
>> >>
>> >> There is still the long-standing issue of how slow the import()
>> >> function is but I am still hopeful.
>> >>
>> >
>> > This is the first I've heard of this. What sort of files are slow? Do
>> > they
>> > have a track line? The parsing gets complicated when there are track
>> > lines
>> > and multiple tracks in a file. BED is a complex format with many
>> > variants.
>> >
>> >>
>> >> Ivan
>> >>
>> >> Ivan Gregoretti, PhD
>> >> National Institute of Diabetes and Digestive and Kidney Diseases
>> >> National Institutes of Health
>> >> 5 Memorial Dr, Building 5, Room 205.
>> >> Bethesda, MD 20892. USA.
>> >> Phone: 1-301-496-1016 and 1-301-496-1592
>> >> Fax: 1-301-496-9878
>> >>
>> >>
>> >>
>> >> On Tue, Mar 8, 2011 at 9:26 PM, Vincent Carey
>> >> <stvjc at channing.harvard.edu> wrote:
>> >> > 2011/3/8 Thiago Yukio Kikuchi Oliveira <stratust at gmail.com>:
>> >> >> Hi,
>> >> >>
>> >> >> Is there a BED file parser for R?
>> >> >
>> >> > I suppose it depends on what you mean by "parser".  import() from the
>> >> > rtracklayer package imports BED and constructs and populates a
>> >> > RangedData object with the contents.  Here we look at a small bed
>> >> > file
>> >> > in text,
>> >> > start R, load rtracklayer, import the data, show the result, and show
>> >> > the resources used.
>> >> >
>> >> > bash-3.2$ head ~/junc716_20.bed
>> >> > chr20   55658   64827   JUNC00000001    14      +       55658   64827
>> >> >  255,0,0 2       27,25   0,9144
>> >> > chr20   55662   64821   JUNC00000002    2       -       55662   64821
>> >> >  255,0,0 2       34,8    0,9151
>> >> > chr20   135774  147029  JUNC00000003    1       -       135774
>> >> >  147029
>> >> >  255,0,0 2       8,29    0,11226
>> >> > chr20   167951  172361  JUNC00000004    1       +       167951
>> >> >  172361
>> >> >  255,0,0 2       29,8    0,4402
>> >> > chr20   189824  192113  JUNC00000005    3       +       189824
>> >> >  192113
>> >> >  255,0,0 2       33,9    0,2280
>> >> > chr20   189829  192113  JUNC00000006    3       +       189829
>> >> >  192113
>> >> >  255,0,0 2       32,9    0,2275
>> >> > chr20   193930  199576  JUNC00000007    4       -       193930
>> >> >  199576
>> >> >  255,0,0 2       28,11   0,5635
>> >> > chr20   207050  207846  JUNC00000008    2       -       207050
>> >> >  207846
>> >> >  255,0,0 2       20,34   0,762
>> >> > chr20   218306  218925  JUNC00000009    1       -       218306
>> >> >  218925
>> >> >  255,0,0 2       11,26   0,593
>> >> > chr20   221160  225070  JUNC00000010    25      -       221160
>> >> >  225070
>> >> >  255,0,0 2       29,9    0,3901
>> >> > bash-3.2$ head ~/junc716_20.bed > ~/lit.bed
>> >> > bash-3.2$ R213 --vanilla --quiet
>> >> >> library(rtracklayer)
>> >> > Loading required package: RCurl
>> >> > Loading required package: bitops
>> >> >> lit = import("~/lit.bed")
>> >> >> lit
>> >> > RangedData with 10 rows and 9 value columns across 1 space
>> >> >         space           ranges |         name     score      strand
>> >> > thickStart
>> >> >   <character>        <IRanges> |  <character> <numeric> <character>
>> >> >  <integer>
>> >> > 1        chr20 [ 55659,  64827] | JUNC00000001        14           +
>> >> >  55658
>> >> > 2        chr20 [ 55663,  64821] | JUNC00000002         2           -
>> >> >  55662
>> >> > 3        chr20 [135775, 147029] | JUNC00000003         1           -
>> >> > 135774
>> >> > 4        chr20 [167952, 172361] | JUNC00000004         1           +
>> >> > 167951
>> >> > 5        chr20 [189825, 192113] | JUNC00000005         3           +
>> >> > 189824
>> >> > 6        chr20 [189830, 192113] | JUNC00000006         3           +
>> >> > 189829
>> >> > 7        chr20 [193931, 199576] | JUNC00000007         4           -
>> >> > 193930
>> >> > 8        chr20 [207051, 207846] | JUNC00000008         2           -
>> >> > 207050
>> >> > 9        chr20 [218307, 218925] | JUNC00000009         1           -
>> >> > 218306
>> >> > 10       chr20 [221161, 225070] | JUNC00000010        25           -
>> >> > 221160
>> >> >    thickEnd     itemRgb blockCount  blockSizes blockStarts
>> >> >   <integer> <character>  <integer> <character> <character>
>> >> > 1      64827     #FF0000          2       27,25      0,9144
>> >> > 2      64821     #FF0000          2        34,8      0,9151
>> >> > 3     147029     #FF0000          2        8,29     0,11226
>> >> > 4     172361     #FF0000          2        29,8      0,4402
>> >> > 5     192113     #FF0000          2        33,9      0,2280
>> >> > 6     192113     #FF0000          2        32,9      0,2275
>> >> > 7     199576     #FF0000          2       28,11      0,5635
>> >> > 8     207846     #FF0000          2       20,34       0,762
>> >> > 9     218925     #FF0000          2       11,26       0,593
>> >> > 10    225070     #FF0000          2        29,9      0,3901
>> >> >
>> >> >> sessionInfo()
>> >> > R version 2.13.0 Under development (unstable) (2011-03-01 r54628)
>> >> > Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>> >> >
>> >> > locale:
>> >> > [1] C
>> >> >
>> >> > attached base packages:
>> >> > [1] stats     graphics  grDevices utils     datasets  methods   base
>> >> >
>> >> > other attached packages:
>> >> > [1] rtracklayer_1.11.11 RCurl_1.5-0         bitops_1.0-4.1
>> >> >
>> >> > loaded via a namespace (and not attached):
>> >> > [1] BSgenome_1.19.4      Biobase_2.11.9       Biostrings_2.19.15
>> >> > [4] GenomicRanges_1.3.23 IRanges_1.9.25       Matrix_0.999375-47
>> >> > [7] XML_3.2-0            grid_2.13.0          lattice_0.19-17
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks
>> >> >>
>> >> >>     /    Thiago Yukio Kikuchi Oliveira
>> >> >> (=\
>> >> >>   \=) Faculdade de Medicina de Ribeirão Preto
>> >> >>    /   Laboratório de Genética Molecular e Bioinformática
>> >> >>   /=)
>> >> >> -----------------------------------------------------------------
>> >> >> (=/   Centro de Terapia Celular/CEPID/FAPESP - Hemocentro de Rib.
>> >> >> Preto
>> >> >>   /    Rua Tenente Catão Roxo, 2501 CEP 14151-140
>> >> >> (=\   Ribeirão Preto - São Paulo
>> >> >>   \=) Fone: 55 16 2101-9300   Ramal: 9603
>> >> >>    /   E-mail: stratus at lgmb.fmrp.usp.br
>> >> >>   /=)            stratust at gmail.com
>> >> >> (=/
>> >> >>   /    Bioinformatic Team - BiT: http://lgmb.fmrp.usp.br
>> >> >> (=\   Hemocentro de Ribeirão Preto: http://pegasus.fmrp.usp.br
>> >> >>   \=)
>> >> >>    /
>> >> >> -----------------------------------------------------------------
>> >> >>
>> >> >> _______________________________________________
>> >> >> Bioc-sig-sequencing mailing list
>> >> >> Bioc-sig-sequencing at r-project.org
>> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >>
>> >> >
>> >> > _______________________________________________
>> >> > Bioc-sig-sequencing mailing list
>> >> > Bioc-sig-sequencing at r-project.org
>> >> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >> >
>> >>
>> >> _______________________________________________
>> >> Bioc-sig-sequencing mailing list
>> >> Bioc-sig-sequencing at r-project.org
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >
>> >
>
>

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

This communication is from Cancer Research UK. Our website is at www.cancerresearchuk.org. We are a registered charity in England and Wales (1089464) and in Scotland (SC041666) and a company limited by guarantee registered in England and Wales under number 4325234. Our registered address is Angel Building, 407 St John Street, London, EC1V 4AD. Our central telephone number is 020 7242 0200.

This communication and any attachments contain information which is confidential and may also be privileged.   It is for the exclusive use of the intended recipient(s).  If you are not the intended recipient(s) please note that any form of disclosure, distribution, copying or use of this communication or the information in it or in any attachments is strictly prohibited and may be unlawful.  If you have received this communication in error, please notify the sender and delete the email and destroy any copies of it.

E-mail communications cannot be guaranteed to be secure or error free, as information could be intercepted, corrupted, amended, lost, destroyed, arrive late or incomplete, or contain viruses.  We do not accept liability for any such matters or their consequences.  Anyone who communicates with us by e-mail is taken to accept the risks in doing so.