[BioC] GRanges - reduce() function

Hervé Pagès hpages at fhcrc.org
Fri Nov 18 20:37:41 CET 2011


Hi Fahim and others,

On 11-11-18 07:45 AM, Fahim Mohammad wrote:
> Thanks a lot Vincent,
> Your suggestion helped me a lot, but the problem of mangled names and
> unsorted ranges remained. Also the 'unlist' function add sufiix in the
> name.
>
>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))GRanges with 4 ranges and 0 elementMetadata values:
>          seqnames    ranges strand
>             <Rle>  <IRanges>   <Rle>
>    www.1     chr2  [10, 19]      *
>    xxx.2     chr1  [ 1,  9]      -
>    yyy.3     chr1  [ 4, 12]      +
>    zzz.4     chr2  [ 1,  9]      -
>    ---
>    seqlengths:
>     chr1 chr2
>       NA   NA

Yes this mangling is definitely not helping. We have control over the
"unlist" method for GRangesList objects so we could change that if
nobody objects. I think it was originally implemented that way for 2
reasons:

   (1) to mimic what base::unlist() does (even though most people
       don't like it);

   (2) also until recently the names of a GRanges object had to be
       unique.

Now that the names of a GRanges object don't have to be unique anymore
(since BioC 2.9), maybe we could change the behaviour of unlist().
Does anybody object?

>
> So I ended up calling GRanges function twice. I know that this approach may
> not be the best and there may be other efficient workaround.
> Thanks again for your help.
> Here is what I did.
>
>
>>        gr<- GRanges(+                     seqnames = Rle(c("chr1", "chr2", "chr2"), c(6, 2,2)), +                     ranges = IRanges(c(1:6, 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), +                     strand = Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +                     transcript  = head(letters, 10),+                     Gene = c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) +                     )>  grGRanges with 10 ranges and 2 elementMetadata values:
>         seqnames    ranges strand |  transcript        Gene
>            <Rle>  <IRanges>   <Rle>  |<character>  <character>
>     [1]     chr1  [ 1,  7]      - |           a         xxx
>     [2]     chr1  [ 2,  8]      - |           b         xxx
>     [3]     chr1  [ 3,  9]      - |           c         xxx
>     [4]     chr1  [ 4, 10]      + |           d         yyy
>     [5]     chr1  [ 5, 11]      + |           e         yyy
>     [6]     chr1  [ 6, 12]      + |           f         yyy
>     [7]     chr2  [ 1,  5]      - |           g         zzz
>     [8]     chr2  [ 2,  9]      - |           h         zzz
>     [9]     chr2  [10, 18]      * |           i         www
>    [10]     chr2  [11, 19]      * |           j         www
>    ---
>    seqlengths:
>     chr1 chr2
>       NA   NA>        reducedRanges = reduce(split(gr,
> elementMetadata(gr)$Gene)) #reduce the ranges according to 'Gene'
>
> # Convert the rr (reduced ranges) to data frame
>
>
>>        tmp = as.data.frame(reducedRanges)>  tmp  element seqnames start end width strand
> 1     www     chr2    10  19    10      *
> 2     xxx     chr1     1   9     9      -
> 3     yyy     chr1     4  12     9      +
> 4     zzz     chr2     1   9     9      ->  rr =
> tmp[order(tmp[,'seqnames'], tmp[,'start']),];   #sort the data frame
> first on 'seqname' and 'start' location>  rr  element seqnames start
> end width strand
> 2     xxx     chr1     1   9     9      -
> 3     yyy     chr1     4  12     9      +
> 4     zzz     chr2     1   9     9      -
> 1     www     chr2    10  19    10      *
>
> # And now convert the data frame again into GRanges object.
>
>
>>        grFinal<- GRanges(seqnames = Rle(rr[,'seqnames']),+                          ranges = IRanges (start =rr[,'start'], end = rr[,'end'] ),+                          strand = Rle(rr[,'strand']),+                          Gene  = rr[,'element']+                          )>        grFinalGRanges with 4 ranges and 1 elementMetadata value:
>        seqnames    ranges strand |        Gene
>           <Rle>  <IRanges>   <Rle>  |<character>
>    [1]     chr1  [ 1,  9]      - |         xxx
>    [2]     chr1  [ 4, 12]      + |         yyy
>    [3]     chr2  [ 1,  9]      - |         zzz
>    [4]     chr2  [10, 19]      * |         www
>    ---
>    seqlengths:
>     chr1 chr2
>       NA   NA

Alternatively you could do:

   grl <- reduce(split(gr, elementMetadata(gr)$Gene))
   reducedRanges <- unlist(grl, use.names=FALSE)
   elementMetadata(reducedRanges)$Gene <- rep(names(grl), 
elementLengths(grl))

 > reducedRanges
GRanges with 4 ranges and 1 elementMetadata value:
       seqnames    ranges strand |        Gene
          <Rle> <IRanges>  <Rle> | <character>
   [1]     chr2  [10, 19]      * |         www
   [2]     chr1  [ 1,  9]      - |         xxx
   [3]     chr1  [ 4, 12]      + |         yyy
   [4]     chr2  [ 1,  9]      - |         zzz
   ---
   seqlengths:
    chr1 chr2
      NA   NA

Then if you want to restore the original order:

   oo <- order(match(elementMetadata(reducedRanges)$Gene,
               unique(elementMetadata(gr)$Gene)))
   grFinal <- reducedRanges[oo]

 > grFinal
GRanges with 4 ranges and 1 elementMetadata value:
       seqnames    ranges strand |        Gene
          <Rle> <IRanges>  <Rle> | <character>
   [1]     chr1  [ 1,  9]      - |         xxx
   [2]     chr1  [ 4, 12]      + |         yyy
   [3]     chr2  [ 1,  9]      - |         zzz
   [4]     chr2  [10, 19]      * |         www
   ---
   seqlengths:
    chr1 chr2
      NA   NA

Or if you want to order by seqnames and start:

   oo2 <- order(as.vector(seqnames(reducedRanges)), start(reducedRanges))

 > reducedRanges[oo2]
GRanges with 4 ranges and 1 elementMetadata value:
       seqnames    ranges strand |        Gene
          <Rle> <IRanges>  <Rle> | <character>
   [1]     chr1  [ 1,  9]      - |         xxx
   [2]     chr1  [ 4, 12]      + |         yyy
   [3]     chr2  [ 1,  9]      - |         zzz
   [4]     chr2  [10, 19]      * |         www
   ---
   seqlengths:
    chr1 chr2
      NA   NA

which in your case happens to give the same result.

Cheers,
H.

>
>
>
> Regards.
> Fahim
>
> On Thu, Nov 17, 2011 at 9:13 PM, Vincent Carey
> <stvjc at channing.harvard.edu>wrote:
>
>>
>>
>> On Thu, Nov 17, 2011 at 6:28 PM, Fahim Mohammad<fahim.md at gmail.com>wrote:
>>
>>> Hi
>>> In the following example, I am trying to use 'reduce()' function to reduce
>>> the genomic intervals and find intervals corresponding to  'Gene'. It is
>>> behaving as it is desired. However the output of the reduce just give the
>>> intervals (I am losing the 'Gene' metadata).
>>> Is there a way to retain 'Gene' metadata.
>>>
>>>
>>>>        gr<- GRanges(+                     seqnames = Rle(c("chr1",
>>> "chr2", "chr2"), c(6, 2,2)), +                     ranges = IRanges(c(1:6,
>>> 1:2, 10:11) , end = c(7:12, 5,9, 18:19 )), +                     strand =
>>> Rle(strand(c("-", "+", "-", "*")),c(3, 3, 2,2)), +
>>> transcript  = head(letters, 10),+                     Gene =
>>> c(rep("xxx",3), rep("yyy",3), rep("zzz",2), rep("www",2)) +
>>>      )>        grGRanges with 10 ranges and 2 elementMetadata values:
>>>
>>>        seqnames    ranges strand |  transcript        Gene
>>>           <Rle>  <IRanges>   <Rle>  |<character>  <character>
>>>    [1]     chr1  [ 1,  7]      - |           a         xxx
>>>    [2]     chr1  [ 2,  8]      - |           b         xxx
>>>    [3]     chr1  [ 3,  9]      - |           c         xxx
>>>    [4]     chr1  [ 4, 10]      + |           d         yyy
>>>    [5]     chr1  [ 5, 11]      + |           e         yyy
>>>    [6]     chr1  [ 6, 12]      + |           f         yyy
>>>    [7]     chr2  [ 1,  5]      - |           g         zzz
>>>    [8]     chr2  [ 2,  9]      - |           h         zzz
>>>    [9]     chr2  [10, 18]      * |           i         www
>>>   [10]     chr2  [11, 19]      * |           j         www
>>>   ---
>>>   seqlengths:
>>>    chr1 chr2
>>>      NA   NA>        x = reduce(gr)>        xGRanges with 4 ranges and 0
>>>
>>> elementMetadata values:
>>>       seqnames    ranges strand
>>>          <Rle>  <IRanges>   <Rle>
>>>   [1]     chr1  [ 4, 12]      +
>>>   [2]     chr1  [ 1,  9]      -
>>>   [3]     chr2  [ 1,  9]      -
>>>   [4]     chr2  [10, 19]      *
>>>   ---
>>>   seqlengths:
>>>    chr1 chr2
>>>      NA   NA
>>>
>>
>> This doesn't do exactly what you seek, but it gives a mangled names
>> attributed that you can unmangle
>> and reattach as metadata if you like
>>
>> unlist(reduce(split(gr, elementMetadata(gr)$Gene)))
>>
>>
>>>
>>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>> Fahim
>>>
>>> --
>>> -------------------------------
>>> Fahim Mohammad
>>> Bioinforformatics Lab
>>> University of Louisville
>>> Louisville, KY, USA
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list