[BioC] GRanges - reduce() function
Cook, Malcolm
MEC at stowers.org
Fri Nov 18 21:13:26 CET 2011
Herve,
re: does anyone object
To be honest, I've been skimming.
Are you proposing that the names of the unlisted GRanges don't get suffixed so as to be unique.
If so, then, no objections here.
Thanks!
~Malcolm
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-
> bounces at r-project.org] On Behalf Of Hervé Pagès
> Sent: Friday, November 18, 2011 1:38 PM
> To: Fahim Mohammad
> Cc: Vincent Carey; Bioconductor Newsgroup
> Subject: Re: [BioC] GRanges - reduce() function
>
> Hi Fahim and others,
>
> On 11-11-18 07:45 AM, Fahim Mohammad wrote:
> > Thanks a lot Vincent,
> > Your suggestion helped me a lot, but the problem of mangled names and
> > unsorted ranges remained. Also the 'unlist' function add sufiix in the
> > name.
> >
> >> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))GRanges with 4
> ranges and 0 elementMetadata values:
> > seqnames ranges strand
> > <Rle> <IRanges> <Rle>
> > www.1 chr2 [10, 19] *
> > xxx.2 chr1 [ 1, 9] -
> > yyy.3 chr1 [ 4, 12] +
> > zzz.4 chr2 [ 1, 9] -
> > ---
> > seqlengths:
> > chr1 chr2
> > NA NA
>
> Yes this mangling is definitely not helping. We have control over the
> "unlist" method for GRangesList objects so we could change that if
> nobody objects. I think it was originally implemented that way for 2
> reasons:
>
> (1) to mimic what base::unlist() does (even though most people
> don't like it);
>
> (2) also until recently the names of a GRanges object had to be
> unique.
>
> Now that the names of a GRanges object don't have to be unique anymore
> (since BioC 2.9), maybe we could change the behaviour of unlist().
> Does anybody object?
>
> >
> > So I ended up calling GRanges function twice. I know that this approach
> may
> > not be the best and there may be other efficient workaround.
> > Thanks again for your help.
> > Here is what I did.
> >
> >
> >> gr<- GRanges(+ seqnames = Rle(c("chr1", "chr2", "chr2"),
> c(6, 2,2)), + ranges = IRanges(c(1:6, 1:2, 10:11) , end = c(7:12, 5,9,
> 18:19 )), + strand = Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
> transcript = head(letters, 10),+ Gene = c(rep("xxx",3),
> rep("yyy",3), rep("zzz",2), rep("www",2)) + )> grGRanges with 10
> ranges and 2 elementMetadata values:
> > seqnames ranges strand | transcript Gene
> > <Rle> <IRanges> <Rle> |<character> <character>
> > [1] chr1 [ 1, 7] - | a xxx
> > [2] chr1 [ 2, 8] - | b xxx
> > [3] chr1 [ 3, 9] - | c xxx
> > [4] chr1 [ 4, 10] + | d yyy
> > [5] chr1 [ 5, 11] + | e yyy
> > [6] chr1 [ 6, 12] + | f yyy
> > [7] chr2 [ 1, 5] - | g zzz
> > [8] chr2 [ 2, 9] - | h zzz
> > [9] chr2 [10, 18] * | i www
> > [10] chr2 [11, 19] * | j www
> > ---
> > seqlengths:
> > chr1 chr2
> > NA NA> reducedRanges = reduce(split(gr,
> > elementMetadata(gr)$Gene)) #reduce the ranges according to 'Gene'
> >
> > # Convert the rr (reduced ranges) to data frame
> >
> >
> >> tmp = as.data.frame(reducedRanges)> tmp element seqnames start
> end width strand
> > 1 www chr2 10 19 10 *
> > 2 xxx chr1 1 9 9 -
> > 3 yyy chr1 4 12 9 +
> > 4 zzz chr2 1 9 9 -> rr =
> > tmp[order(tmp[,'seqnames'], tmp[,'start']),]; #sort the data frame
> > first on 'seqname' and 'start' location> rr element seqnames start
> > end width strand
> > 2 xxx chr1 1 9 9 -
> > 3 yyy chr1 4 12 9 +
> > 4 zzz chr2 1 9 9 -
> > 1 www chr2 10 19 10 *
> >
> > # And now convert the data frame again into GRanges object.
> >
> >
> >> grFinal<- GRanges(seqnames = Rle(rr[,'seqnames']),+
> ranges = IRanges (start =rr[,'start'], end = rr[,'end'] ),+ strand =
> Rle(rr[,'strand']),+ Gene = rr[,'element']+ )>
> grFinalGRanges with 4 ranges and 1 elementMetadata value:
> > seqnames ranges strand | Gene
> > <Rle> <IRanges> <Rle> |<character>
> > [1] chr1 [ 1, 9] - | xxx
> > [2] chr1 [ 4, 12] + | yyy
> > [3] chr2 [ 1, 9] - | zzz
> > [4] chr2 [10, 19] * | www
> > ---
> > seqlengths:
> > chr1 chr2
> > NA NA
>
> Alternatively you could do:
>
> grl <- reduce(split(gr, elementMetadata(gr)$Gene))
> reducedRanges <- unlist(grl, use.names=FALSE)
> elementMetadata(reducedRanges)$Gene <- rep(names(grl),
> elementLengths(grl))
>
> > reducedRanges
> GRanges with 4 ranges and 1 elementMetadata value:
> seqnames ranges strand | Gene
> <Rle> <IRanges> <Rle> | <character>
> [1] chr2 [10, 19] * | www
> [2] chr1 [ 1, 9] - | xxx
> [3] chr1 [ 4, 12] + | yyy
> [4] chr2 [ 1, 9] - | zzz
> ---
> seqlengths:
> chr1 chr2
> NA NA
>
> Then if you want to restore the original order:
>
> oo <- order(match(elementMetadata(reducedRanges)$Gene,
> unique(elementMetadata(gr)$Gene)))
> grFinal <- reducedRanges[oo]
>
> > grFinal
> GRanges with 4 ranges and 1 elementMetadata value:
> seqnames ranges strand | Gene
> <Rle> <IRanges> <Rle> | <character>
> [1] chr1 [ 1, 9] - | xxx
> [2] chr1 [ 4, 12] + | yyy
> [3] chr2 [ 1, 9] - | zzz
> [4] chr2 [10, 19] * | www
> ---
> seqlengths:
> chr1 chr2
> NA NA
>
> Or if you want to order by seqnames and start:
>
> oo2 <- order(as.vector(seqnames(reducedRanges)), start(reducedRanges))
>
> > reducedRanges[oo2]
> GRanges with 4 ranges and 1 elementMetadata value:
> seqnames ranges strand | Gene
> <Rle> <IRanges> <Rle> | <character>
> [1] chr1 [ 1, 9] - | xxx
> [2] chr1 [ 4, 12] + | yyy
> [3] chr2 [ 1, 9] - | zzz
> [4] chr2 [10, 19] * | www
> ---
> seqlengths:
> chr1 chr2
> NA NA
>
> which in your case happens to give the same result.
>
> Cheers,
> H.
>
> >
> >
> >
> > Regards.
> > Fahim
> >
> > On Thu, Nov 17, 2011 at 9:13 PM, Vincent Carey
> > <stvjc at channing.harvard.edu>wrote:
> >
> >>
> >>
> >> On Thu, Nov 17, 2011 at 6:28 PM, Fahim
> Mohammad<fahim.md at gmail.com>wrote:
> >>
> >>> Hi
> >>> In the following example, I am trying to use 'reduce()' function to reduce
> >>> the genomic intervals and find intervals corresponding to 'Gene'. It is
> >>> behaving as it is desired. However the output of the reduce just give the
> >>> intervals (I am losing the 'Gene' metadata).
> >>> Is there a way to retain 'Gene' metadata.
> >>>
> >>>
> >>>> gr<- GRanges(+ seqnames = Rle(c("chr1",
> >>> "chr2", "chr2"), c(6, 2,2)), + ranges = IRanges(c(1:6,
> >>> 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), + strand =
> >>> Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
> >>> transcript = head(letters, 10),+ Gene =
> >>> c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) +
> >>> )> grGRanges with 10 ranges and 2 elementMetadata values:
> >>>
> >>> seqnames ranges strand | transcript Gene
> >>> <Rle> <IRanges> <Rle> |<character> <character>
> >>> [1] chr1 [ 1, 7] - | a xxx
> >>> [2] chr1 [ 2, 8] - | b xxx
> >>> [3] chr1 [ 3, 9] - | c xxx
> >>> [4] chr1 [ 4, 10] + | d yyy
> >>> [5] chr1 [ 5, 11] + | e yyy
> >>> [6] chr1 [ 6, 12] + | f yyy
> >>> [7] chr2 [ 1, 5] - | g zzz
> >>> [8] chr2 [ 2, 9] - | h zzz
> >>> [9] chr2 [10, 18] * | i www
> >>> [10] chr2 [11, 19] * | j www
> >>> ---
> >>> seqlengths:
> >>> chr1 chr2
> >>> NA NA> x = reduce(gr)> xGRanges with 4 ranges and 0
> >>>
> >>> elementMetadata values:
> >>> seqnames ranges strand
> >>> <Rle> <IRanges> <Rle>
> >>> [1] chr1 [ 4, 12] +
> >>> [2] chr1 [ 1, 9] -
> >>> [3] chr2 [ 1, 9] -
> >>> [4] chr2 [10, 19] *
> >>> ---
> >>> seqlengths:
> >>> chr1 chr2
> >>> NA NA
> >>>
> >>
> >> This doesn't do exactly what you seek, but it gives a mangled names
> >> attributed that you can unmangle
> >> and reattach as metadata if you like
> >>
> >> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))
> >>
> >>
> >>>
> >>>>
> >>>
> >>>
> >>> Thanks
> >>>
> >>>
> >>> Fahim
> >>>
> >>> --
> >>> -------------------------------
> >>> Fahim Mohammad
> >>> Bioinforformatics Lab
> >>> University of Louisville
> >>> Louisville, KY, USA
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>
> >>
> >
> >
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list