[Bioc-devel] could bsseq::data.frame2GRanges be added to GenomicRanges
Hervé Pagès
hpages at fhcrc.org
Tue Oct 8 20:20:45 CEST 2013
Hi GRangers,
I'll add this to GenomicRanges. Hope you don't mind if I rename
it makeGRangesFromDataFrame(). This is more consistent
with the current naming scheme for "specialized constructors"
e.g.:
makeTranscriptDbFrom*()
makeGRangesListFromFeatureFragments()
readGAlignment*FromBam()
etc...
I'll add some flexibility to let the user choose the columns to
import (we sometimes see start/stop instead of start/end), but by
default it'll do what bsseq::data.frame2GRanges currently does.
'keepColumns' and 'ignoreStrand' will become 'keep.columns' and
'ignore.strand'.
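
To make the intent concrete, here is a minimal sketch of the kind of
conversion described above. It is not the GenomicRanges implementation:
the helper name df_to_granges_sketch(), its argument names, and the toy
data are all hypothetical, chosen only to illustrate picking the
start/end columns (e.g. "stop" instead of "end"), optionally ignoring
the strand, and optionally keeping the remaining columns as metadata.
It only relies on GRanges(), IRanges() and mcols(), which already exist.

```r
library(GenomicRanges)

## Hypothetical helper for illustration only; names are assumptions.
df_to_granges_sketch <- function(df,
                                 seqnames.field = "chr",
                                 start.field = "start",
                                 end.field = "end",
                                 strand.field = "strand",
                                 keep.columns = FALSE,
                                 ignore.strand = FALSE) {
  ## The three columns that define a genomic location must be present.
  core <- c(seqnames.field, start.field, end.field)
  missing_cols <- setdiff(core, names(df))
  if (length(missing_cols) > 0L)
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))

  ## Use the strand column only if it exists and is not being ignored.
  use_strand <- !ignore.strand && strand.field %in% names(df)
  strand_values <- if (use_strand) {
    as.character(df[[strand.field]])
  } else {
    rep("*", nrow(df))
  }

  gr <- GRanges(seqnames = df[[seqnames.field]],
                ranges = IRanges(start = df[[start.field]],
                                 end = df[[end.field]]),
                strand = strand_values)

  ## Optionally carry the non-location columns over as metadata columns.
  if (keep.columns) {
    extra <- setdiff(names(df), c(core, strand.field))
    mcols(gr) <- as(df[, extra, drop = FALSE], "DataFrame")
  }
  gr
}

## Toy input that uses "stop" rather than "end" (made-up values).
df <- data.frame(chr = c("chr1", "chr1", "chr2"),
                 start = c(100L, 200L, 300L),
                 stop = c(150L, 250L, 350L),
                 strand = c("+", "-", "+"),
                 score = c(0.1, 0.5, 0.9))

gr <- df_to_granges_sketch(df, end.field = "stop", keep.columns = TRUE)
```

The coordinates are taken as they are (1-based, closed intervals), so a
table that uses BED-style 0-based starts would need its start column
shifted by one before the call.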
H.
On 10/07/2013 11:01 AM, Tim Triche, Jr. wrote:
> nb. Somehow I got a typo in there:
>
> aGR <- df2GR(slow)
> mm9toHg19 <- import('mm9ToHg19.over.chain')
>
> ## was full <- liftOver(slow, mm9toHg19)
> full <- liftOver(aGR, mm9toHg19)
> length(unlist(full))/length(slow)
> ## [1] 0.549
>
> That should remind me to re-run everything.
> In my defense... well, it's slow to re-run that :-/
>
> apologies
>
>
> On Mon, Oct 7, 2013 at 10:58 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
>
>> GRs are a great data structure. But, standard bioinformatic file formats
>> (BED, WIG, BAM) don't always fit 1:1 with the "organic" beginnings of some
>> projects. The GenomicRanges infrastructure isn't on the radar of every R
>> developer, and some useful data can be found in ugly formats. Wouldn't it
>> be handy if users could easily turn those blobs of data into something that
>> export() can handle?
>>
>> To better understand Michael's point of view, and as someone who has seen
>> firsthand the nontrivial amount of work required to maintain rtracklayer as
>> a high-performance import library, I wrote a few trivial Perl scripts to
>> convert some mouse data from assorted wacky tabular formats to standard
>> BED6 files. Besides, once I have BEDs, I can Tabix them, which speeds up
>> operations.
>>
>> I noticed that, when I imported the resulting BED files, where I had
>> cloned the base position into identical "chromStart" and "chromEnd"
>> coordinates, the import.bed() function assumed that I meant for them to be
>> UCSC-style, and therefore gave everything zero widths. (On the bright
>> side, this also explained why liftOver wasn't doing anything useful with
>> the results)
>>
>> packageVersion('rtracklayer')
>> ## [1] '1.21.12'
>>
>> wacky <- import('converted.theirSillyFormat.mm9.bed.gz', genome='mm9')
>> mm9toHg19 <- import('mm9ToHg19.over.chain')
>> empty <- liftOver(wacky, mm9toHg19)
>> length(unlist(empty))/length(wacky)
>> ## [1] 0
>>
>> ## cursing ensues
>>
>> That's not so good. If I wasn't already aware of the insanity of
>> UCSC-style indexing, this could have been a problem in and of itself. (As
>> it was, I fixed it)
>>
>> slow <- read.table('theirSillyFormat.mm9.txt.gz')
>> ## time passes...
>>
>> aGR <- df2GR(slow)
>> mm9toHg19 <- import('mm9ToHg19.over.chain')
>> full <- liftOver(slow, mm9toHg19)
>> length(unlist(full))/length(slow)
>> ## [1] 0.549
>>
>> ## better late than never
>>
>> So: turning nonstandard data into standard data with "standard" (grr) UCSC
>> assumptions took longer than simply brute-forcing the issue with
>> read.table().
>>
>> After importing the files with the "appropriate" (chrom, base-1, base)
>> indexing, I then went to liftOver() a related GRanges. (Note that the
>> related GRanges was from a BED file submitted by guys at the Broad, so any
>> wacky formatting wasn't an issue in this case; here I wanted to control for
>> any other possible fubars).
>>
>> foo <- import('an.RRBS.file.mm9.bed.gz', genome='mm9')
>> mm9toHg19 <- import('mm9ToHg19.over.chain')
>> bar <- liftOver(foo, mm9toHg19)
>>
>> length(unlist(bar)) / length(foo)
>> ## [1] 0.622
>>
>> Needless to say, this was a hell of a lot faster than importing the
>> corresponding file as a table. However, for the wacky file formats, it was
>> more time & trouble to decode all the assumptions prior to liftOver() than
>> it would have been to use granges(read.table('wacky.file.format.csv.gz')).
>> Ugly and sad, but still true!
>>
>> So, in conclusion, sometimes it might just be better to import one of
>> those wacky file formats to a data.frame D or what have you, and use
>> granges(D).
>>
>> Or, since I'm one of the people that wrote a df2GR() function, I just use
>> that.
>>
>> --t
>>
>>
>>
>>
>> On Mon, Oct 7, 2013 at 9:23 AM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
>>
>>> Hi,
>>>
>>> +1 from me, too ... I've also had a similar conversion function
>>> (data.[frame|table] <--> GRanges) in my toolbelt which I found quite
>>> useful over the years.
>>>
>>> -steve
>>>
>>> On Sun, Oct 6, 2013 at 5:00 PM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>> This is a convenience function, which provably has saved tons of time for
>>>> me and others. I get lots of data from various Excel/CSV files lying
>>>> around various places, and these files _always_ have a clear path to a
>>>> GRanges. Perhaps you never have to deal with this kind of data, but we are
>>>> a few experienced users who find it extremely handy and would like it to be
>>>> in a more centralized place.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Oct 6, 2013 at 4:26 PM, Michael Lawrence
>>>> <lawrence.michael at gene.com> wrote:
>>>>
>>>>> I'm still unconvinced that there is an obvious, general path from
>>>>> data.frame -> GRanges. It's usually easy enough to just call GRanges(),
>>>>> often of the pattern with(df, GRanges(...)). Moreover, it's unusual for me
>>>>> to encounter genomic data in data.frames.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Oct 6, 2013 at 8:37 AM, Kasper Daniel Hansen <
>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>> Also, it goes without saying that I am happy to provide a patch for
>>>>>> GenomicRanges, and check that it passes R CMD check, to minimize the work
>>>>>> of the maintainer.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>
>>>>>> On Sun, Oct 6, 2013 at 9:28 AM, Kasper Daniel Hansen <
>>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>>
>>>>>>> bsseq::data.frame2GRanges does the obvious step of converting a data.frame
>>>>>>> to GRanges. It has a couple of bells and whistles where strand can be
>>>>>>> ignored and additional columns (apart from genomic location) may be ignored
>>>>>>> in the output object.
>>>>>>>
>>>>>>> I (and now quite a few other people) use this function almost every day.
>>>>>>> I have seen other implementations in other packages, suggesting this is
>>>>>>> not just something I (we) use.
>>>>>>>
>>>>>>> I suggest adding this function to GenomicRanges. I am happy to support
>>>>>>> it going forward.
>>>>>>>
>>>>>>> Using this function we could also add an as(x, "GRanges") method for
>>>>>>> x=data.frame, but I still suggest keeping the basic function for the
>>>>>>> extended functionality it provides.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kasper
>>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Bioinformatics and Computational Biology
>>> Genentech
>>>
>>>
>>
>>
>>
>> --
>> *He that would live in peace and at ease, *
>> *Must not speak all he knows, nor judge all he sees.*
>> *
>> *
>> Benjamin Franklin, Poor Richard's Almanack<http://archive.org/details/poorrichardsalma00franrich>
>>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319