[Bioc-sig-seq] RangedData versus GenomicRanges/GRanges

Mon Nov 1 18:40:26 CET 2010

Thank you both - that helps.  I think for my current project I'll  
stick with RangedData (everything is coded already) but this'll help  
me decide which to use in future.

Janet

On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:

> Ivan does a pretty good job. But just to summarize and fill in the  
> gaps:
>
> GenomicRanges (an abstract class that is made concrete by GRanges)  
> is essentially a table of ranges, chromosomes (actually generalized  
> to sequence names), and strand. It also has a formally associated  
> "SeqInfo" object which stores the sequence lengths. Then there are  
> metadata columns added by the user, but these are sort of "second  
> class" compared to the RangedData design. The ranges are primary. A  
> GRanges can be placed into a GRangesList, which is the data  
> structure of choice for holding compound ranges, like gene  
> structures and read mappings.
>
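A minimal sketch of the structure described above, assuming a current GenomicRanges installation. The toy coordinates, the `score` column and the `tx1`/`tx2` names are made up for illustration, and `mcols()` is the accessor used in recent releases (the release discussed in this thread may have exposed the metadata table under a different accessor name).

```r
library(GenomicRanges)

# Hypothetical toy data: three features on two sequences.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), width = 50),
  strand   = c("+", "-", "*"),
  score    = c(3, 1, 2)            # user-supplied metadata column
)

# Sequence names, ranges and strand are the primary components ...
seqnames(gr)
ranges(gr)
strand(gr)

# ... while user columns live in a separate metadata table.
mcols(gr)

# Compound features (for example the exons of one transcript) are
# grouped into a GRangesList, one GRanges per list element.
grl <- GRangesList(tx1 = gr[1:2], tx2 = gr[3])
length(grl)
```

Each element of `grl` is itself a GRanges, so anything that works on `gr` also works on `grl[["tx1"]]`.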
> RangedData acts more like a data frame with a formal notion of  
> ranges divided into spaces (chromosomes). Its API is much more like  
> a data frame than that of GenomicRanges: user columns sit at the  
> same "level" as the ranges. It can informally hold the same  
> information as GenomicRanges, like strand and a SeqInfo in its  
> metadata. In terms of implementation, RangedData consists of two  
> parallel lists, a RangesList for the ranges and a SplitDataFrameList  
> for the rest of the columns. This means that the data must, as Ivan  
> mentioned, be sorted by chromosome. But there are advantages over  
> GenomicRanges when a RangesList cannot be flattened to a Ranges.  
> These include the ability to store an RleViewsList (preserving the  
> coverage information) or a RangesList of IntervalTree objects  
> (allowing fast interval queries) as the ranges.
>
> Choosing one or the other depends on the use case. For RNA-seq,  
> where one has complex read mappings to complex gene structures,  
> GRanges(Lists) are the best in my opinion. But then for ChIP-seq  
> peaks, where the strand does not matter and the ranges are simple, one  
> might prefer the data frame features of RangedData and its ability  
> to keep the coverage around.
>
> Michael
>
> On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti  
> <ivangreg at gmail.com> wrote:
> Hello Janet,
>
> It is a rare pleasure to have the opportunity to enlighten somebody
> from the Fred Hutchinson Cancer Research Center about R functionality.
>
> The bottom line is this: GenomicRanges is much more biology-aware
> than the generic RangedData class.
>
> GenomicRanges natively stores a strand value per feature. RangedData
> does not, unless you create it. GenomicRanges' strand values are very
> intuitive: +, -, and *.
>
> GenomicRanges "rows" can be ordered by any "column" even if it ends up
> dis-ordering the chromosomes. RangedData can only order features
> within each space.
>
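The ordering point can be sketched as follows, assuming a current GenomicRanges installation. The toy ranges and the `score` column are hypothetical. Sorting by a metadata column is free to interleave the chromosomes.

```r
library(GenomicRanges)

# Hypothetical toy data: two features on chr1, one on chr2.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 500, 200), width = 50),
  score    = c(3, 1, 2)
)

# Reorder the "rows" by the score column, ignoring chromosome grouping.
by_score <- gr[order(mcols(gr)$score)]

as.character(seqnames(by_score))
# "chr1" "chr2" "chr1"
```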
> GenomicRanges can store the complete list of chromosomes and their
> corresponding sizes for your particular organism. You can create a
> GenomicRanges instance out of a RangedData without explicitly
> providing the list of chromosomes and their sizes. Just do
>
> library(GenomicRanges)
> my_gr <- as(my_rd,"GRanges")
>
> The list of chromosomes is gathered on the fly from the features. The
> list of chromosome lengths still has to be assigned manually, which is
> fine.
>
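A small sketch of assigning the lengths by hand, assuming a current GenomicRanges installation. The ranges and the length values are illustrative only and should be replaced with the real sizes for your genome build.

```r
library(GenomicRanges)

# Hypothetical toy data on a nuclear and a mitochondrial sequence.
gr <- GRanges(
  seqnames = c("chr1", "chrM"),
  ranges   = IRanges(start = c(100, 16000), width = 50)
)

# Sequence names are collected from the features themselves ...
seqlevels(gr)

# ... but the lengths stay unset (NA) until they are assigned.
before <- seqlengths(gr)

# Assign lengths manually; names must match seqlevels(gr).
# Illustrative values only.
seqlengths(gr) <- c(chr1 = 1000000L, chrM = 16571L)
seqlengths(gr)
```

Leaving the lengths as `NA` corresponds to the "hybrid" usage described later in this message: the ranges are then not checked against any chromosome boundary.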
> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
>
> Importing large BED into either GenomicRanges or RangedData is, in my
> experience, equally slow. There is no difference there.
>
> Why not forget RangedData, then? Its advantage over GenomicRanges
> is, also in my experience, that it accepts features mapped beyond the
> limits of chromosomes. The most unforgiving example is mitochondrial
> DNA. Because it is circular, it naturally gets sequencing reads with
> "starts" that are numerically larger than their "ends".
>
> In high throughput sequencing I still use RangedData when
> 1) I do not care about relatively few misbehaving reads
> 2) I need my script to run without errors from the GenomicRanges
> sanity check.
>
> For everyday high throughput sequencing I use GenomicRanges keeping
> the chromosome lengths unassigned. It could be called a hybrid.
>
> I hope this helps.
>
> Ivan
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
>
>
>
> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <jayoung at fhcrc.org>  
> wrote:
> > Hi,
> >
> > I've been on a long long vacation, so I'm a bit more out of the loop
> > than I usually am.
> >
> > I've been using RangedData a lot in my code until now to represent sets
> > of genomic regions spread over multiple chromosomes, and I've just
> > realized that GenomicRanges has a lot of the same characteristics.
> >
> > I wanted to ask you all
> > - whether RangedData and GenomicRanges are pretty much equivalent, or if
> >   there are functions that exist for one but not the other?
> > - whether I can use pretty much the same code and functions if I switch
> >   everything over to use GenomicRanges?
> > - are there subtle differences I should be careful of if I make the
> >   switch?
> >
> > thanks very much,
> >
> > Janet Young
> >
> >
> > -------------------------------------------------------------------
> >
> > Dr. Janet Young (Trask lab)
> >
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Avenue N., C3-168,
> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
> >
> > tel: (206) 667 1471 fax: (206) 667 6524
> > email: jayoung  ...at...  fhcrc.org
> >
> > http://www.fhcrc.org/labs/trask/
> >
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> >
>