[Bioc-sig-seq] rtracklayer and import()ing into GRanges

Tue Aug 10 21:16:08 CEST 2010

Please ignore my last email. I forgot to load the GenomicRanges library.

Ivan

On Tue, Aug 10, 2010 at 3:13 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hi Patrick,
>
> I tried the new GRanges functionality of import() and it works great.
>
> That leads me to a related question. How do you remove, say,  the field "score"?
>
> In the past, when I loaded my BED features into a RangedData object
> called A, removing the field "score" was as simple as:
>
> A$score <- NULL
>
> Now that A is a GRanges object, how do I do the same?
>
> Thank you,
>
> Ivan
>
>
>
>
> On Thu, Aug 5, 2010 at 8:06 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>> I just checked in a patch to rtracklayer version 1.9.6 into BioC 2.7 that
>> added asRangedData arguments to all of the import methods except import.bw
>> as well as to the GenomicData function. When asRangedData = FALSE, these
>> import methods will produce GRanges objects. I didn't optimize any of the
>> code yet, so there may be another rev tomorrow with some tweaks to the
>> import code at the R level. Here is are some examples:
>>
>>> library(rtracklayer)
>>> bedfilepath <- system.file("tests", "test.bed", package = "rtracklayer")
>>> import(bedfilepath, asRangedData = FALSE)
>> GRanges with 9 ranges and 5 elementMetadata values
>>     seqnames                 ranges strand |        name     score
>> thickStart
>>        <Rle>              <IRanges>  <Rle> | <character> <numeric>
>> <integer>
>> [1]     chr7 [127471197, 127472363]      + |        Pos1         0
>> 127471196
>> [2]     chr7 [127472364, 127473530]      + |        Pos2         0
>> 127472363
>> [3]     chr7 [127473531, 127474697]      + |        Pos3         0
>> 127473530
>> [4]     chr7 [127474698, 127475864]      + |        Pos4         0
>> 127474697
>> [5]     chr7 [127475865, 127477031]      - |        Neg1         0
>> 127475864
>> [6]     chr7 [127477032, 127478198]      - |        Neg2         0
>> 127477031
>> [7]     chr7 [127478199, 127479365]      - |        Neg3         0
>> 127478198
>> [8]     chr7 [127479366, 127480532]      + |        Pos5         0
>> 127479365
>> [9]     chr7 [127480533, 127481699]      - |        Neg4         0
>> 127480532
>>      thickEnd     itemRgb
>>     <integer> <character>
>> [1] 127472363     #FF0000
>> [2] 127473530     #FF0000
>> [3] 127474697     #FF0000
>> [4] 127475864     #FF0000
>> [5] 127477031     #0000FF
>> [6] 127478198     #0000FF
>> [7] 127479365     #0000FF
>> [8] 127480532     #FF0000
>> [9] 127481699     #0000FF
>>
>> seqlengths
>>  chr7
>>    NA
>>> bed15filepath <- system.file("tests", "test.bed15", package =
>>> "rtracklayer")
>>> import(bed15filepath, asRangedData = FALSE)
>> GRanges with 2 ranges and 13 elementMetadata values
>>     seqnames                 ranges strand |        name     score
>> thickStart
>>        <Rle>              <IRanges>  <Rle> | <character> <numeric>
>> <integer>
>> [1]     chr1 [159639973, 159640031]      - |     2440848       500
>> 159639972
>> [2]     chr1 [159640162, 159640190]      - |     2440849       500
>> 159640161
>>      thickEnd     itemRgb blockCount  blockSizes blockStarts  breast_A
>>     <integer> <character>  <integer> <character> <character> <numeric>
>> [1] 159640031          NA          1         59,          0,     0.593
>> [2] 159640190          NA          1         29,          0,    -0.906
>>      breast_B  breast_C cerebellum_A cerebellum_B
>>     <numeric> <numeric>    <numeric>    <numeric>
>> [1]     1.196    -0.190       -1.088        0.093
>> [2]    -1.247     0.111       -0.515       -0.057
>>
>> seqlengths
>>  chr1
>>    NA
>>> sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] rtracklayer_1.9.6 RCurl_1.4-3       bitops_1.0-4.1
>>
>> loaded via a namespace (and not attached):
>> [1] Biobase_2.9.0        Biostrings_2.17.27   BSgenome_1.17.6
>> [4] GenomicRanges_1.1.20 IRanges_1.7.15       tools_2.12.0
>> [7] XML_3.1-0
>>
>>
>> Patrick
>>
>>
>> On 8/5/10 1:11 PM, Michael Lawrence wrote:
>>
>> On Thu, Aug 5, 2010 at 10:45 AM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>>
>>> Michael,
>>> I just made a minor check-in to rtracklayer where I replaced use of
>>> Biobase:listLen with IRanges::elementLenghts in an effort to minimize the
>>> impact of Biobase on the sequence package stack.
>>>
>>
>> Ok. It looks like elementLengths has been optimized since the last time I
>> looked.
>>
>>
>>>
>>> Before I start the boulder rolling, how should I reconcile the UCSCData
>>> class with the GRanges class? Once I have that sorted I can make changes to
>>> import.bed and import.wig as well.
>>>
>>
>> Well, eventually we'll want to stick the track line information on to
>> GRanges. Could be done via a subclass like with UCSCData. metadata() is
>> another option. I do actually use the subclass for dispatch purposes, pretty
>> printing, etc. For right now though, the extra information could just be
>> dropped if the user requests a GRanges.
>>
>>>
>>> I originally named the argument asRangedData in the BSgenome methods to
>>> reinforce that RangedData output is not intended to be the default and
>>> conceptually the user is making an extra effort to produce a RangedData
>>> object.
>>>
>>>
>>> Patrick
>>>
>>>
>>> On 8/5/10 4:32 AM, Michael Lawrence wrote:
>>>
>>> Makes sense. But why not make it asGRanges, which is shorter? Please go
>>> ahead and check in your work so far.
>>>
>>> Thanks a lot,
>>> Michael
>>>
>>> On Thu, Aug 5, 2010 at 12:51 AM, Patrick Aboyoun <paboyoun at fhcrc.org>
>>> wrote:
>>>>
>>>> Michael,
>>>> Breaking this down to two issues:
>>>>
>>>> Filtering
>>>> Martin has been working on improving filtering in the ShortRead package
>>>> to move from a read all then filter data to a block processing based
>>>> filtering methodology. Lessons learned there can be brought to rtracklayer
>>>> for large bed files and the like.
>>>>
>>>> import() output class
>>>> Keeping the same API and just switching the import methods from producing
>>>> RangedData (or UCSCData) output to GRanges output will break backward
>>>> compatibility because the RangedData API is not wholly applicable to GRanges
>>>> objects. I would not recommend this course since a number of packages in
>>>> BioC and scripts in the wild expect the import methods to produce a
>>>> RangedData (or UCSCData) object. An additional argument is not that onerous
>>>> and can be fazed out over the course of two or three releases (1 - 1.5
>>>> years). Another alternative is to add a new import function (read.GRanges?)
>>>> to rtracklayer that shares the same infrastructure as the existing import
>>>> methods.
>>>>
>>>> I have a local copy of rtracklayer where I added a new asRangedData flag
>>>> to the GenomicData function and import.gff* methods. I'll sit on this for
>>>> now since these changes didn't take a lot of work. This is one of the
>>>> situations where the managing the life cycle of the function specs is
>>>> trickier than making the desired code changes.
>>>>
>>>>
>>>> Cheers,
>>>> Patrick
>>>>
>>>>
>>>> On 8/4/10 8:24 PM, Michael Lawrence wrote:
>>>>
>>>> This might work, but it seems like an expensive optimization in that it
>>>> changes a lot of the API. If someone cannot make a single copy of the data,
>>>> it's unlikely that they're even going to be able to get to GenomicData() or
>>>> manipulate it later. Perhaps the coercion function needs some simple tweaks?
>>>> The filter support would definitely help. I'd rather keep things simple and
>>>> return a single type, and GRanges sounds most appropriate.
>>>>
>>>> But I'm open to suggestions and further argument.
>>>>
>>>> Michael
>>>>
>>>> On Wed, Aug 4, 2010 at 2:05 PM, Patrick Aboyoun <paboyoun at fhcrc.org>
>>>> wrote:
>>>>>
>>>>> Michael,
>>>>> How integrated would you like to see the GRanges class in rtracklayer?
>>>>> The rtracklayer::GenomicData constructor is the master instantiator. I would
>>>>> like to add an asRangedData = TRUE (default) argument to the GenomicData
>>>>> function and push it all the way up through the import functions where when
>>>>> the user sets asRangedData = FALSE, the GenomicData function would create a
>>>>> GRanges object. This is what we did with the
>>>>> {matchPWM,vmatchPattern,vmatchPDict},BSgenome-methods in the BSgenome
>>>>> package and it as good a solution as any. This is a straight-forward change
>>>>> and wouldn't take too long to complete.
>>>>>
>>>>>
>>>>> Patrick
>>>>>
>>>>>
>>>>> On 8/4/10 12:56 PM, Michael Lawrence wrote:
>>>>>>
>>>>>> GRanges support is definitely on the TODO list. Filters are a good idea
>>>>>> and
>>>>>> also on the TODO list, possibly with a chunk size parameter to enable
>>>>>> chunk
>>>>>> processing.
>>>>>>
>>>>>> I'd love to have the GRanges stuff at least done by the next release.
>>>>>> Patches welcome, of course :)
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>> On Wed, Aug 4, 2010 at 8:08 AM, Ivan Gregoretti<ivangreg at gmail.com>
>>>>>>  wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hello Michael and everyone,
>>>>>>>
>>>>>>> Would you please consider adding to import() the capacity to generate
>>>>>>> a GRanges object rather than the default RangedData object?
>>>>>>>
>>>>>>> Also,
>>>>>>>
>>>>>>> Wouldn't it be great to be able to import() with filters just like
>>>>>>> with readAligned()?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Justification
>>>>>>>
>>>>>>> GRanges is a biology-aware container. When importing large BEDs into
>>>>>>> R, the current workflow involves creating RangedData first and then
>>>>>>> converting to GRanges.
>>>>>>>
>>>>>>> If the BEDs are really big, holding both objects in memory at any
>>>>>>> point in time is a hardware challenge.
>>>>>>>
>>>>>>> The capacity to filter the input would help in this case and in
>>>>>>> general it would provide an increase in efficiency.
>>>>>>>
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Ivan Gregoretti, PhD
>>>>>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>>>>>> National Institutes of Health
>>>>>>> 5 Memorial Dr, Building 5, Room 205.
>>>>>>> Bethesda, MD 20892. USA.
>>>>>>> Phone: 1-301-496-1016 and 1-301-496-1592
>>>>>>> Fax: 1-301-496-9878
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-sig-sequencing mailing list
>>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>        [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-sig-sequencing mailing list
>>>>>> Bioc-sig-sequencing at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>