[Bioc-sig-seq] BED to WIG format conversion

Fri Sep 18 00:47:08 CEST 2009

Michael,
Before people start developing their own bed file parsers, it looks 
pretty straight-forward to incorporate this operation in rtracklayer 
now. You or I could add a scoreRanges argument, which takes either a 
logical or perhaps a character string if we envision there being more 
than one way to score a set of ranges, to the import* functions and when 
a appropriate value is supplied, it will internally block process the 
data file and output a RangedData object with a score column.

Patrick

Michael Lawrence wrote:
> On Thu, Sep 17, 2009 at 2:17 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>
>   
>> Hi Michael,
>>
>> True; there are no scores in this BED file. That brings us to Sean's
>> remark.
>>
>> You may have noticed that all features in the BED file are 36 bases
>> long. Each feature is the position and orientation of a Solexa read.
>>
>> I want the value in each bin in the WIG file to represent the number
>> of tags contained in that bin.
>>
>> So, coverage() is close to what I am looking for but I think that
>> pileup() is perhaps better. Then comes the problem of how to place
>> those values in an object exportable as WIG. Lets not talk about
>> efficiency for now.
>>
>>
>>     
> Ignoring efficiency, it's pretty straight-forward. In the development
> version of IRanges, one can convert an Rle (or RleList for multiple
> chromosomes) as output to coverage() to a RangedData. In the development
> version of rtracklayer, this is done implicitly:
>
>   
>> export(cov, "coverage.wig")
>>     
>
>
>   
>> I think that I can solve this problem myself but given that:
>>
>> 1) there is a package that can read BED and write WIG files
>> and
>> 2) BED to WIG conversion is such a common task
>>
>> I thought that there must have been an existing tool or perhaps an
>> established efficient way of doing this with BioC tools. Why is this
>> important? Well, this myRegions.bed contains 4.3 million records.
>> Loading them with import() used about 1GB of RAM. My real world BED
>> files are never less than 50 million reads, 150 more like.
>>
>> Conclusion: I DO care about an expert's opinion of BED-> WIG
>> conversion within R.
>>
>>     
>
> Maybe nobody does this with R. That is also acceptable. What tool do
>   
>> you use then?
>>
>>
>>     
> The bottleneck seems to be loading and representing large BED files, since
> after the reduction to coverage the data size is usually manageable.
> rtracklayer was never designed to load BED files with millions of records.
> In my experience, short reads are more commonly represented in more
> efficient formats like those output by MAQ and bowtie, which are imported
> with ShortRead.
>
> There is much room for optimization in rtracklayer, but for now one idea is
> to calculate coverage on subsets of the reads, and aggregate the results.
> The BED file can be loaded with simple low-level calls like read.table and
> scan.
>
> Michael
>
>
>
>   
>> Thank you,
>>
>> Ivan
>>
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Thu, Sep 17, 2009 at 4:45 PM, Sean Davis <seandavi at gmail.com> wrote:
>>     
>>> On Thu, Sep 17, 2009 at 4:43 PM, Michael Lawrence <mflawren at fhcrc.org>
>>> wrote:
>>>       
>>>> On Thu, Sep 17, 2009 at 8:14 AM, Ivan Gregoretti <ivangreg at gmail.com>
>>>> wrote:
>>>>
>>>>         
>>>>> Hello everybody
>>>>>
>>>>> How do you convert BED formatted files to WIG files?
>>>>>
>>>>>
>>>>> I tried to do that with rtracklayer but it didn't quite succeed.
>>>>>
>>>>> This is the session's transcript:
>>>>>
>>>>> First, you can download
>>>>> http://dl.getdropbox.com/u/2051155/myRegions.bed, which looks like
>>>>> this
>>>>>
>>>>> chr1    3002444 3002479                 +
>>>>> chr1    3002989 3003024                 -
>>>>> chr1    3017603 3017638                 +
>>>>> chr1    3017879 3017914                 -
>>>>> chr1    3018173 3018208                 +
>>>>> chr1    3018183 3018218                 -
>>>>> chr1    3018183 3018218                 -
>>>>> chr1    3019065 3019100                 +
>>>>> chr1    3019761 3019796                 -
>>>>> chr1    3020044 3020079                 -
>>>>> ...
>>>>>
>>>>> Now to R
>>>>>
>>>>> suppressMessages(library(rtracklayer))
>>>>> myRegions <- import('myRegions.bed')
>>>>>
>>>>> So far so good. Now I try:
>>>>> export(myRegions, 'myRegions.wig', format = 'wig')
>>>>>
>>>>> but I get:
>>>>> Error in export.ucsc(object, con, subformat, ...) :
>>>>>  Track not compatible with WIG format: Overlapping features must be
>>>>> on separate strands and every feature width must be positive
>>>>>
>>>>> I seems that the error message is a feature rather than a bug. My
>>>>> interpretation is that export() does not like records like lines 6 and
>>>>> 7:
>>>>> chr1    3018183 3018218                 -
>>>>> chr1    3018183 3018218                 -
>>>>>
>>>>> So, how do you convert BED to WIG in your everyday work?
>>>>>
>>>>>
>>>>>           
>>>> Well, as rtracklayer said, this track is not representable by WIG, which
>>>> is
>>>> meant to communicate a single data value for a given genomic region (to
>>>> generate the bar plots in the UCSC browser, for instance). There aren't
>>>> any
>>>> scores in your file, so why do you want to use WIG?
>>>>
>>>>         
>>> Just guessing, here, but you may want to calculate coverage() first,
>>>       
>> Ivan?
>>     
>>> Sean
>>>
>>>       
>>>>> Thank you,
>>>>>
>>>>> Ivan
>>>>>
>>>>>           
>>>>>> sessionInfo()
>>>>>>             
>>>>> R version 2.9.2 (2009-08-24)
>>>>> x86_64-redhat-linux-gnu
>>>>>
>>>>> locale:
>>>>>
>>>>>
>>>>>
>>>>>           
>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>     
>>>>> attached base packages:
>>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>
>>>>> other attached packages:
>>>>> [1] rtracklayer_1.4.0 RCurl_0.94-1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] Biobase_2.4.0     Biostrings_2.12.0 BSgenome_1.12.0
>>>>>           
>> IRanges_1.2.0
>>     
>>>>> [5] tools_2.9.2       XML_2.3-0
>>>>>
>>>>> Ivan Gregoretti, PhD
>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>> National Institutes of Health
>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>> Bethesda, MD 20892. USA.
>>>>> Phone: 1-301-496-1592
>>>>> Fax: 1-301-496-9878
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-sig-sequencing mailing list
>>>>> Bioc-sig-sequencing at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>
>>>>>           
>>>>        [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>         
>>>       
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>     
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>