On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen <
kasperdanielhansen@gmail.com> wrote:

> What follows is some random comments on the use of GRanges for storing
> data in the elementMetadata slot.  My use case is a lot of very small
> ranges (>10M ranges of width 1), which may be a special border case
> (but very important to me :).  I am particular interested in fast
> subsetting like subsetByOverlaps.
>
>
I'm not the primary developer of this package, but I have some comments:


> Comments:
>
> * I find column subsetting on the elementMetadata to be really slow,
> slower than it need to be I would say (I suspect some validity
> checking, see comment below on row subsetting).
>
>

> * There is no easy shortcut to get a specific metadata column (gr[,
> "score"] returns a GRange).  I would find it natural that gr$score
> would do that.  This is important in order to get at the data stores
> in the GRange
>
>
Others, including me, have suggested this. It's a design choice by the guys
in Seattle...


> * while there is a as.data.frame, as(from, "data.frame") does not work
> and I would also appreciate an as(from, "GRanges") method with from
> being a data.frame.  I have a quick function at the bottom of my
> email, for doing this.
>
>
I'm not such a big fan of as(from, "GRanges"), since it's going from
something that is loosely defined to something with more constraints. We did
not support this in RangedData either. Usually, it's not so tough to just
use the constructor, and it makes the translation more clear. I could be
convinced otherwise, though.


> * Since start, end, width are reserved column names for GRanges, it
> would be nice if the GRanges constructor was extended to allow
>  GRanges(seqnames = "chr1", start = 1, end = 10)
> instead of now where we need
>  GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10))
> (ok, this is pretty minor, but it is convenient)
>
> * There is no sort method implemented for GRanges; it returns an error
>
> * is.unsorted(xx) seems to always return FALSE with a warning when xx
> is an IRange or a GRange
>
> *It seems to me that row subsetting, specifically subsetByOverlaps is
> a bit too slow.  The find overlaps part is blazingly fast (when ranges
> is a single IRange); it seems like the initialization and subsetting
> part takes "more time than necessary".  This is based on comparing a
> (long) GRanges with around 8 elementMetadata columns to another
> approach where instead of using elementMetadata I use two elements of
> a list: a GRanges without elementMetadata and where the
> elementMetadata is stored in a separate data.frame - I then use
> findOverlaps to get me the indices (when I compare using a GRanges to
> using a GRanges and separate matrix, I use the "position" GRanges to
> get my indexes, which I use to subset the matrix with).
>
> Specifically here gr is a GRanges with length 25M (no elementMetadata)
> and mat is a 25M x 8 matrix
>
> grIn <- gr
> elementMetadata(grIn) <- mat
>
> ## The "classic" approach:
> system.time({
> grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges =
> IRanges(start = 1, end = 10^7)))
> })  # 4.7 secs
>
> ## Keeping gr and mat separate:
> system.time({
>    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
> ranges = IRanges(start = 1, end = 10^7))))
>    grOut2 <- list(gr[qh],  mat = mat[qh,])
> }) #  0.9 secs
>
> ## Doing the above, but forming a GRange after the subsetting adds
> just a small overhead
> system.time({
>    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
> ranges = IRanges(start = 1, end = 10^7))))
>    grOut3 = gr[qh]
>    elementMetadata(grOut3) <- mat[qh,]
> }) # 1.1 secs
>
>
> * For really large query GRanges timing results (not shown) seems to
> indicate that the line
>    query[!is.na(findOverlaps(query, subject, maxgap = maxgap,
>        minoverlap = minoverlap, type = type, select = "first"))]
> in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be
> made faster by changing it to
>    query[queryHits(findOverlaps(query, subject, maxgap = maxgap,
>        minoverlap = minoverlap, type = type))]
> (my timing is for about a query length of 30M and a subject length of 1)
>
>
Right, select = "first" will be slow for GRanges, since at the low-level it
does a findOverlaps(select="all"), and then filters for "first".  It is not
clear to me why it always does "all" here. Convenience of implementation?

The filtering is necessary though. Using queryHits() and select = "all"
could cause a query to become repeated if it hits multiple subjects.
Obviously not a problem when you have a subject of length 1. In that case
though, findOverlaps, etc is overkill.

* I find it counter-intuitive that the following code doesn't work:
> R> example(GRanges)
> R> seqnames(gr) = "chr1"
> or perhaps more intuitive
> R> gr1 = gr[seqnames(gr) == "Chrom1"]
> R> seqnames(gr1) = "chr1"
> In general I don't like that seqnames is a factor.  This is fine when
> I just consider a single GRange, but when I put together multiple
> GRanges from different sources, I sometimes get into problems.
> Specifically, in human, it is not always true that the various contigs
> (like "chr10_random") are part of the data sources.  I would have
> preferred it to be a character and then have seqlengths essentially be
> NA in case a character does not appear in the names(seqlengths).  I
> know that I have complained about this a long time ago and it is
> probably too late to change, but still :)
>
>
> Kasper
>
> data.frame2GRanges <- function(object, keepColumns = TRUE) {
>    stopifnot(class(object) == "data.frame")
>    stopifnot(all(c("chr", "start", "end") %in% names(object)))
>    if("strand" %in% names(object)) {
>        if(is.numeric(object$strand)) {
>            strand <- ifelse(object$strand == 1, "+", "*")
>            strand[object$strand == -1] <- "-"
>            object$strand <- strand
>        }
>        gr <- GRanges(seqnames = object$chr,
>                      ranges = IRanges(start = object$start, end =
> object$end),
>                      strand = object$strand)
>    } else {
>        gr <- GRanges(seqnames = object$chr,
>                      ranges = IRanges(start = object$start, end =
> object$end))
>    }
>    if(keepColumns) {
>        dt <- as(object[, setdiff(names(object), c("chr", "start",
> "end", "strand"))],
>                     "DataFrame")
>        elementMetadata(gr) <- dt
>    }
>    names(gr) <- rownames(object)
>    gr
> }
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>

	[[alternative HTML version deleted]]