[BioC] GRanges - reduce() function
Hervé Pagès
hpages at fhcrc.org
Fri Nov 18 20:37:41 CET 2011
Hi Fahim and others,
On 11-11-18 07:45 AM, Fahim Mohammad wrote:
> Thanks a lot Vincent,
> Your suggestion helped me a lot, but the problem of mangled names and
> unsorted ranges remained. Also the 'unlist' function add sufiix in the
> name.
>
>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))GRanges with 4 ranges and 0 elementMetadata values:
> seqnames ranges strand
> <Rle> <IRanges> <Rle>
> www.1 chr2 [10, 19] *
> xxx.2 chr1 [ 1, 9] -
> yyy.3 chr1 [ 4, 12] +
> zzz.4 chr2 [ 1, 9] -
> ---
> seqlengths:
> chr1 chr2
> NA NA
Yes this mangling is definitely not helping. We have control over the
"unlist" method for GRangesList objects so we could change that if
nobody objects. I think it was originally implemented that way for 2
reasons:
(1) to mimic what base::unlist() does (even though most people
don't like it);
(2) also until recently the names of a GRanges object had to be
unique.
Now that the names of a GRanges object don't have to be unique anymore
(since BioC 2.9), maybe we could change the behaviour of unlist().
Does anybody object?
>
> So I ended up calling GRanges function twice. I know that this approach may
> not be the best and there may be other efficient workaround.
> Thanks again for your help.
> Here is what I did.
>
>
>> gr<- GRanges(+ seqnames = Rle(c("chr1", "chr2", "chr2"), c(6, 2,2)), + ranges = IRanges(c(1:6, 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), + strand = Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), + transcript = head(letters, 10),+ Gene = c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) + )> grGRanges with 10 ranges and 2 elementMetadata values:
> seqnames ranges strand | transcript Gene
> <Rle> <IRanges> <Rle> |<character> <character>
> [1] chr1 [ 1, 7] - | a xxx
> [2] chr1 [ 2, 8] - | b xxx
> [3] chr1 [ 3, 9] - | c xxx
> [4] chr1 [ 4, 10] + | d yyy
> [5] chr1 [ 5, 11] + | e yyy
> [6] chr1 [ 6, 12] + | f yyy
> [7] chr2 [ 1, 5] - | g zzz
> [8] chr2 [ 2, 9] - | h zzz
> [9] chr2 [10, 18] * | i www
> [10] chr2 [11, 19] * | j www
> ---
> seqlengths:
> chr1 chr2
> NA NA> reducedRanges = reduce(split(gr,
> elementMetadata(gr)$Gene)) #reduce the ranges according to 'Gene'
>
> # Convert the rr (reduced ranges) to data frame
>
>
>> tmp = as.data.frame(reducedRanges)> tmp element seqnames start end width strand
> 1 www chr2 10 19 10 *
> 2 xxx chr1 1 9 9 -
> 3 yyy chr1 4 12 9 +
> 4 zzz chr2 1 9 9 -> rr =
> tmp[order(tmp[,'seqnames'], tmp[,'start']),]; #sort the data frame
> first on 'seqname' and 'start' location> rr element seqnames start
> end width strand
> 2 xxx chr1 1 9 9 -
> 3 yyy chr1 4 12 9 +
> 4 zzz chr2 1 9 9 -
> 1 www chr2 10 19 10 *
>
> # And now convert the data frame again into GRanges object.
>
>
>> grFinal<- GRanges(seqnames = Rle(rr[,'seqnames']),+ ranges = IRanges (start =rr[,'start'], end = rr[,'end'] ),+ strand = Rle(rr[,'strand']),+ Gene = rr[,'element']+ )> grFinalGRanges with 4 ranges and 1 elementMetadata value:
> seqnames ranges strand | Gene
> <Rle> <IRanges> <Rle> |<character>
> [1] chr1 [ 1, 9] - | xxx
> [2] chr1 [ 4, 12] + | yyy
> [3] chr2 [ 1, 9] - | zzz
> [4] chr2 [10, 19] * | www
> ---
> seqlengths:
> chr1 chr2
> NA NA
Alternatively you could do:
grl <- reduce(split(gr, elementMetadata(gr)$Gene))
reducedRanges <- unlist(grl, use.names=FALSE)
elementMetadata(reducedRanges)$Gene <- rep(names(grl),
elementLengths(grl))
> reducedRanges
GRanges with 4 ranges and 1 elementMetadata value:
seqnames ranges strand | Gene
<Rle> <IRanges> <Rle> | <character>
[1] chr2 [10, 19] * | www
[2] chr1 [ 1, 9] - | xxx
[3] chr1 [ 4, 12] + | yyy
[4] chr2 [ 1, 9] - | zzz
---
seqlengths:
chr1 chr2
NA NA
Then if you want to restore the original order:
oo <- order(match(elementMetadata(reducedRanges)$Gene,
unique(elementMetadata(gr)$Gene)))
grFinal <- reducedRanges[oo]
> grFinal
GRanges with 4 ranges and 1 elementMetadata value:
seqnames ranges strand | Gene
<Rle> <IRanges> <Rle> | <character>
[1] chr1 [ 1, 9] - | xxx
[2] chr1 [ 4, 12] + | yyy
[3] chr2 [ 1, 9] - | zzz
[4] chr2 [10, 19] * | www
---
seqlengths:
chr1 chr2
NA NA
Or if you want to order by seqnames and start:
oo2 <- order(as.vector(seqnames(reducedRanges)), start(reducedRanges))
> reducedRanges[oo2]
GRanges with 4 ranges and 1 elementMetadata value:
seqnames ranges strand | Gene
<Rle> <IRanges> <Rle> | <character>
[1] chr1 [ 1, 9] - | xxx
[2] chr1 [ 4, 12] + | yyy
[3] chr2 [ 1, 9] - | zzz
[4] chr2 [10, 19] * | www
---
seqlengths:
chr1 chr2
NA NA
which in your case happens to give the same result.
Cheers,
H.
>
>
>
> Regards.
> Fahim
>
> On Thu, Nov 17, 2011 at 9:13 PM, Vincent Carey
> <stvjc at channing.harvard.edu>wrote:
>
>>
>>
>> On Thu, Nov 17, 2011 at 6:28 PM, Fahim Mohammad<fahim.md at gmail.com>wrote:
>>
>>> Hi
>>> In the following example, I am trying to use 'reduce()' function to reduce
>>> the genomic intervals and find intervals corresponding to 'Gene'. It is
>>> behaving as it is desired. However the output of the reduce just give the
>>> intervals (I am losing the 'Gene' metadata).
>>> Is there a way to retain 'Gene' metadata.
>>>
>>>
>>>> gr<- GRanges(+ seqnames = Rle(c("chr1",
>>> "chr2", "chr2"), c(6, 2,2)), + ranges = IRanges(c(1:6,
>>> 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), + strand =
>>> Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
>>> transcript = head(letters, 10),+ Gene =
>>> c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) +
>>> )> grGRanges with 10 ranges and 2 elementMetadata values:
>>>
>>> seqnames ranges strand | transcript Gene
>>> <Rle> <IRanges> <Rle> |<character> <character>
>>> [1] chr1 [ 1, 7] - | a xxx
>>> [2] chr1 [ 2, 8] - | b xxx
>>> [3] chr1 [ 3, 9] - | c xxx
>>> [4] chr1 [ 4, 10] + | d yyy
>>> [5] chr1 [ 5, 11] + | e yyy
>>> [6] chr1 [ 6, 12] + | f yyy
>>> [7] chr2 [ 1, 5] - | g zzz
>>> [8] chr2 [ 2, 9] - | h zzz
>>> [9] chr2 [10, 18] * | i www
>>> [10] chr2 [11, 19] * | j www
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> NA NA> x = reduce(gr)> xGRanges with 4 ranges and 0
>>>
>>> elementMetadata values:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] chr1 [ 4, 12] +
>>> [2] chr1 [ 1, 9] -
>>> [3] chr2 [ 1, 9] -
>>> [4] chr2 [10, 19] *
>>> ---
>>> seqlengths:
>>> chr1 chr2
>>> NA NA
>>>
>>
>> This doesn't do exactly what you seek, but it gives a mangled names
>> attributed that you can unmangle
>> and reattach as metadata if you like
>>
>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))
>>
>>
>>>
>>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> Fahim
>>>
>>> --
>>> -------------------------------
>>> Fahim Mohammad
>>> Bioinforformatics Lab
>>> University of Louisville
>>> Louisville, KY, USA
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list