[Bioc-sig-seq] RangedData versus GenomicRanges/GRanges

Mon Nov 1 20:20:56 CET 2010

On 11/01/2010 10:40 AM, Janet Young wrote:
> Thank you both - that helps.  I think for my current project I'll stick
> with RangedData (everything is coded already) but this'll help me decide
> which to use in future.

Hi Janet et al.,

Two one-cent pieces on this.

Ivan mentions

> limits of chromosomes. The most unforgiving example is mitochondrial
> DNA. Because it is circular, it naturally gets sequencing reads with
> "starts" that are numerically larger than its "ends".

GRanges is becoming circularity aware, and there are hints of that in
?GRanges and elsewhere, but this will only be fully developed in the
next release.
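
As a rough illustration only, here is a sketch of how circularity can be declared in later GenomicRanges releases (not the release current at the time of this message). It assumes the Seqinfo() constructor and the seqinfo argument of GRanges(); the chrM length of 16571 and the read coordinates are made-up values for the example.

```r
library(GenomicRanges)

# Hypothetical sequence information: one circular chromosome.
# The length 16571 is an illustrative assumption, not taken from this thread.
si <- Seqinfo(seqnames = "chrM", seqlengths = 16571, isCircular = TRUE)

# Hypothetical read whose end coordinate runs past the end of chrM.
gr <- GRanges(seqnames = "chrM",
              ranges   = IRanges(start = 16500, end = 16700),
              strand   = "+",
              seqinfo  = si)

isCircular(gr)                      # named logical vector, one entry per sequence
end(gr) > seqlengths(gr)[["chrM"]]  # the range extends beyond the stated length
```

Declaring the sequence as circular is what tells the object that a range running past the sequence end is expected rather than out of bounds.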

Michael says

>> Then there are
>> metadata columns added by the user, but these are sort of "second
>> class" compared to the RangedData design. The ranges are primary.

The 'second class' label confuses me a bit. I view a GRanges instance
gr as consisting of two parts, the ranges(gr) and the 'user data'
values(gr). I view this as a separation of information, rather than
subordinating one source of data to another.
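
A minimal sketch of that two-part view, using toy data. The column names score and gene are hypothetical, and the sketch uses mcols(), which is the accessor in later releases for what is called values(gr) above.

```r
library(GenomicRanges)

# Toy data: three features with two user-supplied columns (score, gene).
gr <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
              ranges   = IRanges(start = c(100, 200, 300), width = 50),
              strand   = c("+", "-", "*"),
              score    = c(1.5, 2.0, 0.3),
              gene     = c("a", "b", "c"))

rng <- ranges(gr)   # the ranges part: an IRanges with one range per feature
dat <- mcols(gr)    # the 'user data' part: a DataFrame with score and gene

# Subsetting the GRanges keeps the two parts in step.
high <- gr[mcols(gr)$score > 1]
```

Neither part is derived from the other; they are parallel, and subsetting or reordering the object carries both along together.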

Martin

> 
> Janet
> 
> 
> 
> On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:
> 
>> Ivan does a pretty good job. But just to summarize and fill in the gaps:
>>
>> GenomicRanges (an abstract class that is made concrete by GRanges) is
>> essentially a table of ranges, chromosomes (actually generalized to
>> sequence names), and strand. It also has a formally associated
>> "Seqinfo" object which stores the sequence lengths. Then there are
>> metadata columns added by the user, but these are sort of "second
>> class" compared to the RangedData design. The ranges are primary. A
>> GRanges can be placed into a GRangesList, which is the data structure
>> of choice for holding compound ranges, like gene structures and read
>> mappings.
>>
>> RangedData acts more like a data frame with a formal notion of ranges
>> divided into spaces (chromosomes). The API is very much more like a
>> data frame compared to GenomicRanges, where user columns are at the
>> same "level" as the ranges. It can informally hold the same
>> information as GenomicRanges, like strand and a Seqinfo in its
>> metadata. In terms of implementation, RangedData consists of two
>> parallel lists, a RangesList for the ranges and a SplitDataFrameList
>> for the rest of the columns. This means that the data must, as Ivan
>> mentioned, be sorted by chromosome. But there are advantages over
>> GenomicRanges when a RangesList cannot be flattened to a Ranges. These
>> include the ability to store an RleViewsList (preserving the coverage
>> information) or a RangesList of IntervalTree objects (allowing fast
>> interval queries) as the ranges.
>>
>> Choosing one or the other depends on the use case. For RNA-seq, where
>> one has complex read mappings to complex gene structures,
>> GRanges(Lists) are the best in my opinion. But then for ChIP-seq
>> peaks, where the strand does not matter and the ranges are simple, one
>> might prefer the data frame features of RangedData and its ability to
>> keep the coverage around.
>>
>> Michael
>>
>> On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti <ivangreg at gmail.com>
>> wrote:
>> Hello Janet,
>>
>> It is a rare pleasure to have the opportunity to enlighten somebody
>> from the Fred Hutchinson Cancer Research Center about R functionality.
>>
>> The bottom line is this: GenomicRanges is much more biology-aware
>> than the generic RangedData class.
>>
>> GenomicRanges natively stores a strand value per feature. RangedData
>> does not, unless you create it. GenomicRanges' strand values are very
>> intuitive: +, -, and *.
>>
>> GenomicRanges "rows" can be ordered by any "column" even if it ends up
>> dis-ordering the chromosomes. RangedData can only order features
>> within each space.
>>
>> GenomicRanges can store the complete list of chromosomes and their
>> corresponding sizes for your particular organism. You can create a
>> GenomicRanges instance out of a RangedData without providing
>> explicitly the list of chromosomes and their sizes. Just do
>>
>> library(GenomicRanges)
>> # my_rd is an existing RangedData object; coerce it to a GRanges
>> my_gr <- as(my_rd, "GRanges")
>>
>> The list of chromosomes is gathered on the fly from the features. The
>> list of chromosome lengths still has to be assigned manually, which is
>> fine.
>>
>> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
>>
>> Importing large BED into either GenomicRanges or RangedData is, in my
>> experience, equally slow. There is no difference there.
>>
>> Why not forget RangedData, then? The advantage over GenomicRanges
>> is, also in my experience, that it accepts features mapped beyond the
>> limits of chromosomes. The most unforgiving example is mitochondrial
>> DNA. Because it is circular, it naturally gets sequencing reads with
>> "starts" that are numerically larger than its "ends".
>>
>> In high throughput sequencing I still use RangedData when
>> 1) I do not care about relatively few misbehaving reads
>> 2) I need my script to run without errors from GenomicRanges' sanity
>> checks.
>>
>> For everyday high throughput sequencing I use GenomicRanges keeping
>> the chromosome lengths unassigned. It could be called a hybrid.
>>
>> I hope this helps.
>>
>> Ivan
>>
>> Ivan Gregoretti, PhD
>> National Institute of Diabetes and Digestive and Kidney Diseases
>> National Institutes of Health
>> 5 Memorial Dr, Building 5, Room 205.
>> Bethesda, MD 20892. USA.
>> Phone: 1-301-496-1016 and 1-301-496-1592
>> Fax: 1-301-496-9878
>>
>>
>>
>> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <jayoung at fhcrc.org> wrote:
>> > Hi,
>> >
>> > I've been on a long long vacation, so I'm a bit more out of the loop
>> > than I usually am.
>> >
>> > I've been using RangedData a lot in my code until now to represent
>> > sets of genomic regions spread over multiple chromosomes, and I've
>> > just realized that GenomicRanges has a lot of the same characteristics.
>> >
>> > I wanted to ask you all
>> > - whether RangedData and GenomicRanges are pretty much equivalent,
>> >   or if there are functions that exist for one but not the other?
>> > - whether I can use pretty much the same code and functions if I switch
>> >   everything over to use GenomicRanges?
>> > - are there subtle differences I should be careful of if I make the
>> >   switch?
>> >
>> > thanks very much,
>> >
>> > Janet Young
>> >
>> >
>> > -------------------------------------------------------------------
>> >
>> > Dr. Janet Young (Trask lab)
>> >
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Avenue N., C3-168,
>> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
>> >
>> > tel: (206) 667 1471 fax: (206) 667 6524
>> > email: jayoung  ...at...  fhcrc.org
>> >
>> > http://www.fhcrc.org/labs/trask/
>> >
>> > _______________________________________________
>> > Bioc-sig-sequencing mailing list
>> > Bioc-sig-sequencing at r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> >

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793