[Bioc-sig-seq] RangedData versus GenomicRanges/GRanges
jayoung at fhcrc.org
Mon Nov 1 18:40:26 CET 2010
Thank you both - that helps. I think for my current project I'll
stick with RangedData (everything is coded already) but this'll help
me decide which to use in future.
On Oct 29, 2010, at 5:22 AM, Michael Lawrence wrote:
> Ivan does a pretty good job. But just to summarize and fill in the
> GenomicRanges (an abstract class that is made concrete by GRanges)
> is essentially a table of ranges, chromosomes (actually generalized
> to sequence names), and strand. It also has a formally associated
> "SeqInfo" object which stores the sequence lengths. Then there are
> metadata columns added by the user, but these are sort of "second
> class" compared to the RangedData design. The ranges are primary. A
> GRanges can be placed into a GRangesList, which is the data
> structure of choice for holding compound ranges, like gene
> structures and read mappings.
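A minimal sketch of the structure described above, assuming a current GenomicRanges release (the `mcols()` accessor postdates this thread). The coordinates, the `score` and `gene` columns, and the sequence lengths are all made-up illustration values.

```r
library(GenomicRanges)

# Ranges, sequence names and strand form the core of the object;
# 'score' and 'gene' are user metadata columns (hypothetical values).
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 300, 200), width = 50),
  strand   = c("+", "+", "-"),
  score    = c(5, 8, 3),
  gene     = c("geneA", "geneA", "geneB")
)

# Sequence lengths are stored in the associated Seqinfo object
# (the lengths here are invented for the example).
seqlengths(gr) <- c(chr1 = 1000L, chr2 = 2000L)

# Metadata columns are reached through mcols(), apart from the ranges.
scores <- mcols(gr)$score

# Compound features: one GRanges per gene, held in a GRangesList.
grl <- split(gr, mcols(gr)$gene)
```

Here `grl[["geneA"]]` holds the two chr1 ranges and `grl[["geneB"]]` the single chr2 range, which is the shape one would use for exons grouped by gene.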
> RangedData acts more like a data frame with a formal notion of
> ranges divided into spaces (chromosomes). The API is very much more
> like a data frame compared to GenomicRanges, where user columns are
> at the same "level" as the ranges. It can informally hold the same
> information as GenomicRanges, like strand and a SeqInfo in its
> metadata. In terms of implementation, RangedData consists of two
> parallel lists, a RangesList for the ranges and a SplitDataFrameList
> for the rest of the columns. This means that the data must, as Ivan
> mentioned, be sorted by chromosome. But there are advantages over
> GenomicRanges when a RangesList cannot be flattened to a Ranges.
> These include the ability to store an RleViewsList (preserving the
> coverage information) or a RangesList of IntervalTree objects
> (allowing fast interval queries) as the ranges.
> Choosing one or the other depends on the use case. For RNA-seq,
> where one has complex read mappings to complex gene structures,
> GRanges(Lists) are the best in my opinion. But then for ChIP-seq
> peaks, where the strand does not matter and the ranges are simple, one
> might prefer the data frame features of RangedData and its ability
> to keep the coverage around.
> On Thu, Oct 28, 2010 at 8:54 PM, Ivan Gregoretti
> <ivangreg at gmail.com> wrote:
> Hello Janet,
> It is a rare pleasure to have the opportunity to enlighten somebody
> from the Fred Hutchinson Cancer Research Center about R functionality.
> The bottom line is this: GenomicRanges is much more biology-aware
> than the generic RangedData class.
> GenomicRanges natively stores a strand value per feature. RangedData
> does not, unless you create it. GenomicRanges' strand values are very
> intuitive: +, -, and *.
> GenomicRanges "rows" can be ordered by any "column" even if it ends up
> dis-ordering the chromosomes. RangedData can only order features
> within each space.
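The ordering point can be sketched as follows, assuming a current GenomicRanges release; the ranges and the `score` column are hypothetical. Sorting by a metadata column leaves the chromosomes interleaved, which a GRanges accepts.

```r
library(GenomicRanges)

# Four features on two chromosomes, one of each strand value
# plus a repeat (all values are made up for illustration).
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2", "chr2"),
  ranges   = IRanges(start = c(100, 300, 200, 400), width = 50),
  strand   = c("+", "-", "*", "+"),
  score    = c(7, 2, 5, 9)
)

# Order the "rows" by the score column, ignoring chromosome grouping.
by_score <- gr[order(mcols(gr)$score)]

# Chromosome order after sorting: no longer grouped by sequence.
chrom_order <- as.character(seqnames(by_score))
```

After sorting, `chrom_order` alternates between chr1 and chr2, so the object is no longer grouped by sequence name.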
> GenomicRanges can store the complete list of chromosomes and their
> corresponding sizes for your particular organism. You can create a
> GenomicRanges instance out of a RangedData without providing
> explicitly the list of chromosomes and their sizes. Just do
> my_gr <- as(my_rd,"GRanges")
> The list of chromosomes is gathered on the fly from the features. The
> list of chromosome lengths still has to be assigned manually, which is
> Nowadays you can rtracklayer::import() BED directly as GenomicRanges.
> Importing large BED into either GenomicRanges or RangedData is, in my
> experience, equally slow. There is no difference there.
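A small sketch of the BED import mentioned above, assuming current rtracklayer and GenomicRanges releases. The two-line BED file is written to a temporary path purely for illustration; its contents are invented.

```r
library(GenomicRanges)
library(rtracklayer)

# Write a tiny, made-up BED file (tab-separated: chrom, start, end,
# name, score, strand). BED starts are 0-based.
bed_path <- tempfile(fileext = ".bed")
writeLines(
  c("chr1\t99\t150\tpeak1\t5\t+",
    "chr2\t199\t250\tpeak2\t3\t-"),
  bed_path
)

# import() returns a GRanges; starts are converted to 1-based.
peaks <- import(bed_path, format = "BED")
```

Note the coordinate shift: the BED start of 99 becomes a GRanges start of 100, while the end of 150 is unchanged.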
> Why not forget RangedData, then? The advantage over GenomicRanges
> is, also in my experience, that it accepts features mapped beyond the
> limits of chromosomes. The most unforgiving example is mitochondrial
> DNA. Because it is circular, it naturally gets sequencing reads with
> "starts" that are numerically larger than its "ends".
> In high throughput sequencing I still use RangedData when
> 1) I do not care about relatively few misbehaving reads
> 2) I need my script to run without errors from GenomicRanges sanity checks.
> For everyday high throughput sequencing I use GenomicRanges keeping
> the chromosome lengths unassigned. It could be called a hybrid.
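The "hybrid" approach might look like the sketch below, assuming a current GenomicRanges release. The chrM coordinates are hypothetical. With no sequence lengths recorded, there is nothing for the ranges to be checked against.

```r
library(GenomicRanges)

# Two made-up reads on chrM; no seqlengths are supplied.
reads <- GRanges(
  seqnames = c("chrM", "chrM"),
  ranges   = IRanges(start = c(100, 16500), end = c(150, 16650))
)

# The length of chrM is simply unknown (NA) to this object,
# so no range can be flagged as running past the end.
chrM_length <- seqlengths(reads)[["chrM"]]
```

The trade-off is that functions relying on sequence lengths have no bounds to work with until the lengths are assigned.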
> I hope this helps.
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1016 and 1-301-496-1592
> Fax: 1-301-496-9878
> On Thu, Oct 28, 2010 at 9:25 PM, Janet Young <jayoung at fhcrc.org> wrote:
> > Hi,
> > I've been on a long long vacation, so I'm a bit more out of the loop than I
> > usually am.
> > I've been using RangedData a lot in my code until now to represent sets of
> > genomic regions spread over multiple chromosomes, and I've just
> > that GenomicRanges has a lot of the same characteristics.
> > I wanted to ask you all
> > - whether RangedData and GenomicRanges are pretty much equivalent, or if
> > there are functions that exist for one but not the other?
> > - whether I can use pretty much the same code and functions if I
> > everything over to use GenomicRanges?
> > - are there subtle differences I should be careful of if I make the switch?
> > thanks very much,
> > Janet Young
> > -------------------------------------------------------------------
> > Dr. Janet Young (Trask lab)
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Avenue N., C3-168,
> > P.O. Box 19024, Seattle, WA 98109-1024, USA.
> > tel: (206) 667 1471 fax: (206) 667 6524
> > email: jayoung ...at... fhcrc.org
> > http://www.fhcrc.org/labs/trask/
> > _______________________________________________
> > Bioc-sig-sequencing mailing list
> > Bioc-sig-sequencing at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing