[Bioc-sig-seq] BED to WIG format conversion

Patrick Aboyoun paboyoun at fhcrc.org
Fri Sep 18 07:21:42 CEST 2009


Michael,
This import operation has similarities with the rdapply framework. We 
should have the two types of operations sharing similar components such 
as filtering.

To everyone,
What other data reduction operations would you like to have on bed file 
import?


Patrick


Michael Lawrence wrote:
>
>
> On Thu, Sep 17, 2009 at 4:35 PM, Patrick Aboyoun <paboyoun at fhcrc.org 
> <mailto:paboyoun at fhcrc.org>> wrote:
>
>     Michael,
>     We could add that level of flexibility, but I worry that may open
>     the door too widely because the aggregator will need to take the
>     old RangedData object and the new block RangedData object as
>     arguments and produce a new RangedData object for the next
>     iteration and users may find it difficult to construct a valid
>     aggregator.
>
>
> Well just to clarify, I thought the aggregator would take the old and 
> new objects as output by the block processor. So for coverage, one 
> function would take a RangedData and return a RleList, and the 
> aggregator would take two RleLists and add them.
>
> The call for calculating the coverage would be like:
>
> cov <- import("reads.bed", blockSize = 500000, blockFun = coverage, 
> aggregateFun = `+`)
>
> Of course, high-level functionality would be nice for routine 
> operations. That will help prevent the user from making mistakes. The 
> blockFun and aggregateFun could be bound into a single class 
> BlockHandler.
>
> cov <- import("reads.bed", blockSize = 500000, blockHandler = 
> CoverageHandler())
>  
>
>     I think there may also be a need for filters to allow the user to
>     keep only those rows that satisfy their criteria. We could supply
>     chromosome, strand, score, etc. filters to meet this need.
>
>
> Yes, and the Filter class could inherit from BlockHandler, so that it 
> works incrementally. import() could be smart enough to incorporate 
> both a filter and a user block handler into a single "meta" handler.
>
> So to obtain an RleList of coverage, but only over chromosome 1, 
> processing half a million reads at a time:
>
> cov <- import("reads.bed", blockSize = 500000, blockHandler = 
> CoverageHandler(), filter = ChromosomeFilter("chr1"))
>
> Michael
>
>
>     Patrick
>
>
>     Michael Lawrence wrote:
>
>         I don't know if we want to add explicit support for scoring,
>         when there are many reasons why one might want to
>         incrementally load a track. One solution would be a design
>         based on callbacks. The user specifies the block size and a
>         callback to be invoked for each block. The result could be a
>         list, with each element holding the result for a block. The
>         user could then aggregate that. To be even more efficient, we
>         could support a second callback that aggregates at each step.
>
>         On Thu, Sep 17, 2009 at 3:47 PM, Patrick Aboyoun
>         <paboyoun at fhcrc.org <mailto:paboyoun at fhcrc.org>
>         <mailto:paboyoun at fhcrc.org <mailto:paboyoun at fhcrc.org>>> wrote:
>
>            Michael,
>            Before people start developing their own bed file parsers, it
>            looks pretty straight-forward to incorporate this operation in
>            rtracklayer now. You or I could add a scoreRanges argument,
>         which
>            takes either a logical or perhaps a character string if we
>            envision there being more than one way to score a set of
>         ranges,
>            to the import* functions and when a appropriate value is
>         supplied,
>            it will internally block process the data file and output a
>            RangedData object with a score column.
>
>
>            Patrick
>
>
>
>            Michael Lawrence wrote:
>
>                On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti
>                <ivangreg at gmail.com <mailto:ivangreg at gmail.com>
>         <mailto:ivangreg at gmail.com <mailto:ivangreg at gmail.com>>> wrote:
>
>                
>                    Hi Michael,
>
>                    True; there are no scores in this BED file. That
>         brings us
>                    to Sean's
>                    remark.
>
>                    You may have noticed that all features in the BED
>         file are
>                    36 bases
>                    long. Each feature is the position and orientation of a
>                    Solexa read.
>
>                    I want the value in each bin in the WIG file to
>         represent
>                    the number
>                    of tags contained in that bin.
>
>                    So, coverage() is close to what I am looking for but I
>                    think that
>                    pileup() is perhaps better. Then comes the problem
>         of how
>                    to place
>                    those values in an object exportable as WIG. Lets
>         not talk
>                    about
>                    efficiency for now.
>
>
>                      
>                Ignoring efficiency, it's pretty straight-forward. In the
>                development
>                version of IRanges, one can convert an Rle (or RleList for
>                multiple
>                chromosomes) as output to coverage() to a RangedData.
>         In the
>                development
>                version of rtracklayer, this is done implicitly:
>
>                
>                    export(cov, "coverage.wig")
>                      
>
>
>                
>                    I think that I can solve this problem myself but
>         given that:
>
>                    1) there is a package that can read BED and write
>         WIG files
>                    and
>                    2) BED to WIG conversion is such a common task
>
>                    I thought that there must have been an existing tool or
>                    perhaps an
>                    established efficient way of doing this with BioC
>         tools.
>                    Why is this
>                    important? Well, this myRegions.bed contains 4.3
>         million
>                    records.
>                    Loading them with import() used about 1GB of RAM.
>         My real
>                    world BED
>                    files are never less than 50 million reads, 150
>         more like.
>
>                    Conclusion: I DO care about an expert's opinion of
>         BED-> WIG
>                    conversion within R.
>
>                      
>
>                Maybe nobody does this with R. That is also acceptable.
>         What
>                tool do
>                
>                    you use then?
>
>
>                      
>                The bottleneck seems to be loading and representing
>         large BED
>                files, since
>                after the reduction to coverage the data size is usually
>                manageable.
>                rtracklayer was never designed to load BED files with
>         millions
>                of records.
>                In my experience, short reads are more commonly
>         represented in
>                more
>                efficient formats like those output by MAQ and bowtie,
>         which
>                are imported
>                with ShortRead.
>
>                There is much room for optimization in rtracklayer, but for
>                now one idea is
>                to calculate coverage on subsets of the reads, and
>         aggregate
>                the results.
>                The BED file can be loaded with simple low-level calls like
>                read.table and
>                scan.
>
>                Michael
>
>
>
>                
>                    Thank you,
>
>                    Ivan
>
>
>                    Ivan Gregoretti, PhD
>                    National Institute of Diabetes and Digestive and Kidney
>                    Diseases
>                    National Institutes of Health
>                    5 Memorial Dr, Building 5, Room 205.
>                    Bethesda, MD 20892. USA.
>                    Phone: 1-301-496-1592
>                    Fax: 1-301-496-9878
>
>
>
>                    On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis
>                    <seandavi at gmail.com <mailto:seandavi at gmail.com>
>         <mailto:seandavi at gmail.com <mailto:seandavi at gmail.com>>> wrote:
>                      
>                        On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence
>                        <mflawren at fhcrc.org <mailto:mflawren at fhcrc.org>
>         <mailto:mflawren at fhcrc.org <mailto:mflawren at fhcrc.org>>>
>
>                        wrote:
>                            
>                            On Thu, Sep 17, 2009 at 8:14 AM, Ivan
>         Gregoretti
>                            <ivangreg at gmail.com
>         <mailto:ivangreg at gmail.com> <mailto:ivangreg at gmail.com
>         <mailto:ivangreg at gmail.com>>>
>
>                            wrote:
>
>                                  
>                                Hello everybody
>
>                                How do you convert BED formatted files
>         to WIG
>                                files?
>
>
>                                I tried to do that with rtracklayer but it
>                                didn't quite succeed.
>
>                                This is the session's transcript:
>
>                                First, you can download
>                              
>          http://dl.getdropbox.com/u/2051155/myRegions.bed,
>                                which looks like
>                                this
>
>                                chr1    3002444 3002479                 +
>                                chr1    3002989 3003024                 -
>                                chr1    3017603 3017638                 +
>                                chr1    3017879 3017914                 -
>                                chr1    3018173 3018208                 +
>                                chr1    3018183 3018218                 -
>                                chr1    3018183 3018218                 -
>                                chr1    3019065 3019100                 +
>                                chr1    3019761 3019796                 -
>                                chr1    3020044 3020079                 -
>                                ...
>
>                                Now to R
>
>                                suppressMessages(library(rtracklayer))
>                                myRegions <- import('myRegions.bed')
>
>                                So far so good. Now I try:
>                                export(myRegions, 'myRegions.wig',
>         format = 'wig')
>
>                                but I get:
>                                Error in export.ucsc(object, con,
>         subformat,
>                                ...) :
>                                 Track not compatible with WIG format:
>                                Overlapping features must be
>                                on separate strands and every feature width
>                                must be positive
>
>                                I seems that the error message is a feature
>                                rather than a bug. My
>                                interpretation is that export() does
>         not like
>                                records like lines 6 and
>                                7:
>                                chr1    3018183 3018218                 -
>                                chr1    3018183 3018218                 -
>
>                                So, how do you convert BED to WIG in your
>                                everyday work?
>
>
>                                        
>                            Well, as rtracklayer said, this track is not
>                            representable by WIG, which
>                            is
>                            meant to communicate a single data value for a
>                            given genomic region (to
>                            generate the bar plots in the UCSC browser, for
>                            instance). There aren't
>                            any
>                            scores in your file, so why do you want to
>         use WIG?
>
>                                  
>                        Just guessing, here, but you may want to calculate
>                        coverage() first,
>                            
>                    Ivan?
>                      
>                        Sean
>
>                            
>                                Thank you,
>
>                                Ivan
>
>                                        
>                                    sessionInfo()
>                                              
>                                R version 2.9.2 (2009-08-24)
>                                x86_64-redhat-linux-gnu
>
>                                locale:
>
>
>
>                                        
>                  
>          LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>                      
>                                attached base packages:
>                                [1] stats     graphics  grDevices utils
>                                    datasets  methods   base
>
>                                other attached packages:
>                                [1] rtracklayer_1.4.0 RCurl_0.94-1
>
>                                loaded via a namespace (and not attached):
>                                [1] Biobase_2.4.0     Biostrings_2.12.0
>                                BSgenome_1.12.0
>                                        
>                    IRanges_1.2.0
>                      
>                                [5] tools_2.9.2       XML_2.3-0
>
>                                Ivan Gregoretti, PhD
>                                National Institute of Diabetes and
>         Digestive
>                                and Kidney Diseases
>                                National Institutes of Health
>                                5 Memorial Dr, Building 5, Room 205.
>                                Bethesda, MD 20892. USA.
>                                Phone: 1-301-496-1592
>                                Fax: 1-301-496-9878
>
>                              
>          _______________________________________________
>                                Bioc-sig-sequencing mailing list
>                                Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>
>                              
>          <mailto:Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>>
>
>                              
>          https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>                                        
>                                  [[alternative HTML version deleted]]
>
>                            _______________________________________________
>                            Bioc-sig-sequencing mailing list
>                            Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>
>                            <mailto:Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>>
>
>                          
>          https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>                                  
>                            
>                    _______________________________________________
>                    Bioc-sig-sequencing mailing list
>                    Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>
>                    <mailto:Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>>
>
>                  
>          https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>                      
>
>                       [[alternative HTML version deleted]]
>
>                _______________________________________________
>                Bioc-sig-sequencing mailing list
>                Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>
>                <mailto:Bioc-sig-sequencing at r-project.org
>         <mailto:Bioc-sig-sequencing at r-project.org>>
>
>                https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>                
>
>
>
>



More information about the Bioc-sig-sequencing mailing list