[BioC] reducing RangedData

Patrick Aboyoun paboyoun at fhcrc.org
Sat Oct 24 07:09:07 CEST 2009


Robert,
I just added a reduce method for RangedData objects to BioC 2.5 for R 
2.10, which is the current development branch and will become the 
release branch on October 28. The signature for this method is

    reduce(x, by, with.inframe.attrib = FALSE)

where by specifies the grouping variables from the values columns. So if 
you just have strand in your values columns, a simple reduce(rd) call 
will do. If you have columns in addition to your strand column that you 
don't want to use for grouping during the reduction operation, the call 
would become reduce(rd, "strand"). I also recently changed the show 
methods for many IRanges objects to display the first 10 or so values so 
the user feels more connected to their data. Here is what the code looks 
like now:

 > suppressMessages(library(IRanges))
 > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
 > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
 > str <- rep(c("+", "+", "-", "-"), 2)
 > chr <- rep(c("chr1", "chr2"), each = 4)
 > rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
 > reduce(rd)
RangedData with 4 rows and 1 value column across 2 spaces
        space    ranges |      strand
  <character> <IRanges> | <character>
1        chr1    [5, 7] |           -
2        chr1    [1, 4] |           +
3        chr2    [4, 7] |           -
4        chr2    [2, 4] |           +
 > # if strand is a factor variable, the results are
 > reduce(RangedData(IRanges(start=sta, end=end), strand=factor(str), 
space=chr))
RangedData with 4 rows and 1 value column across 2 spaces
        space    ranges |   strand
  <character> <IRanges> | <factor>
1        chr1    [5, 7] |        -
2        chr1    [1, 4] |        +
3        chr2    [4, 7] |        -
4        chr2    [2, 4] |        +
 > sessionInfo()
R version 2.10.0 RC (2009-10-18 r50160)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] IRanges_1.3.97



Patrick Aboyoun wrote:
> Robert,
> To address your case of data extraction from the original RangedData 
> object, I would recommend using "[" with i = <<RangesList>>, e.g. 
> origRD[ranges(reducedRd),].
>
> The situation becomes more involved when, for example, someone wants 
> to sum up the "score" column values for all the ranges that 
> overlapped. These data summary operations are open ended and don't 
> lend themselves to simple interfaces, although we can provide targeted 
> functions, or add arguments to the reduce function, that support 
> specific data summary operations on ancillary columns.
>
>
> Patrick
>
>
> Robert Castelo wrote:
>> hi Patrick,
>>
>> thanks a lot, the concise code snippet below is what i was looking for,
>> i'd had written at least twice as much to do the same job :)
>>
>> regarding whether it would be nice to have a reduction operation with a
>> signature that includes other stuff i think that for the particular case
>> of the strand it is definitely useful to project (reduce in RangedData
>> terminology) genomic coordinates on genomic space taking it into account
>> because many of the functional elements in the genome are stranded and
>> projecting doesn't imply always that one is not interested in the strand
>> of the projected intervals.
>>
>> with respect to whether other information contained in additional
>> columns should somehow be "reduced" i think it would suffice to enable
>> the reduction of the "names" of a RangedData (taking, for instance,
>> arbitrarily one name between two overlapping ranges) because those names
>> would allow to go back to the original (un-reduced) data and pull
>> whatever we're interested in (actually, including the strand).
>>
>> my two-cents about this, and thanks again for your quick help.
>>
>> robert.
>>
>> On Thu, 2009-10-22 at 00:58 -0700, Patrick Aboyoun wrote:
>>  
>>> Robert,
>>> I wrote you some code that you can use below. I've seen other 
>>> reduction operations where the user wanted to reduce a RangedData 
>>> type object by more than one column (e.g. strand and score), so 
>>> perhaps an appropriate signature for a reduce method for RangedData 
>>> would be
>>>
>>> reduce(x, by, ...)
>>>
>>> If this method where written, the remaining question would be should 
>>> it support reducing transformations of the remaining values columns 
>>> or should non "by" columns be dropped from the output. I'm inclined 
>>> towards the latter, since a reduce(x, by, valuesFUN) method wouldn't 
>>> be much simpler than the rdapply method given below. Would a 
>>> straight-forward reduce(x = rd, by = "strand") approach for 
>>> RangedData meet your needs?
>>>
>>>
>>>  > suppressMessages(library(IRanges))
>>>  > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
>>>  > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
>>>  > str <- rep(c("+", "+", "-", "-"), 2)
>>>  > chr <- rep(c("chr1", "chr2"), each = 4)
>>>  > rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>>  > rd
>>> RangedData: 8 ranges by 1 columns
>>> columns(1): strand
>>> sequences(2): chr1 chr2
>>>  >
>>>  > # Interesting bit starts here
>>>  > reduceStranded <- function(x) {
>>> +     strand <- x$strand
>>> +     rngs <- ranges(x)[[1]]
>>> +     ans <- IRangesList("+" = reduce(rngs[strand == "+"]),
>>> +                        "-" = reduce(rngs[strand == "-"]))
>>> +     RangedData(unlist(ans, use.names = FALSE),
>>> +                strand = rep(c("+", "-"), elementLengths(ans)),
>>>                  space = names(x))
>>> + }  > params <- RDApplyParams(rd, reduceStranded, reducerFun = 
>>> function(x) do.call(c, unname(x)))
>>>  > rdapply(params)
>>> RangedData: 4 ranges by 1 columns
>>> columns(1): strand
>>> sequences(2): chr1 chr2
>>>  > as.data.frame(rdapply(params))
>>>   space start end width strand
>>> 1  chr1     1   4     4      +
>>> 2  chr1     5   7     3      -
>>> 3  chr2     2   4     3      +
>>> 4  chr2     4   7     4      -
>>>  > sessionInfo()
>>> R version 2.9.2 Patched (2009-08-24 r49420)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> C/en_US.UTF-8/C/C/C/C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base   
>>> other attached packages:
>>> [1] IRanges_1.2.3
>>>
>>>
>>>
>>>
>>> Robert Castelo wrote:
>>>    
>>>> dear list,
>>>>
>>>> i'd like to project a set of exons on genomic space but keeping
>>>> the strand information. let's assume for simplicity that no exons
>>>> overlap in opposite strands so no conflicts need to be resolved
>>>> regarding that case. in principle this is easy without keeping the
>>>> strand using a RangeList and the reduce method. however i've been
>>>> struggling for a while without success to figure out how can i
>>>> "reduce" my coordinates without loosing the strand information.
>>>>
>>>> my guess is that to carry out the strand information i need to
>>>> use a RangedData object:
>>>>
>>>> sta <- c(1, 1, 1)
>>>> end <- c(3, 4, 5)
>>>> str <- c("+", "+", "-")
>>>> rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>>> rd
>>>> RangedData: 3 ranges by 1 columns
>>>> columns(1): strand
>>>> sequences(2): chr1 chr2
>>>>
>>>> and then use rdapply to reduce on each of the chromosomes but i
>>>> either don't know how to directly apply reduce:
>>>>
>>>> params <- RDApplyParams(rd, reduce)
>>>> rdapply(params)
>>>> Error in function (classes, fdef, mtable)  :
>>>>   unable to find an inherited method for function "reduce", for 
>>>> signature "RangedData"
>>>>
>>>>
>>>> or i loose the strand information:
>>>>
>>>> params <- RDApplyParams(rd, function(rd) reduce(ranges(rd)))
>>>> rdapply(params)
>>>> $chr1
>>>> RangesList: 1 range
>>>> 1: 1:4
>>>>
>>>> $chr2
>>>> RangesList: 1 range
>>>> 1: 1:5
>>>>
>>>> thanks for your help!
>>>> robert.
>>>>
>>>>
>>>>
>>>> sessionInfo()
>>>> R version 2.9.1 (2009-06-26)
>>>> i386-apple-darwin8.11.1
>>>>
>>>> locale:
>>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>>  [1] rtracklayer_1.4.1                     RCurl_0.98-1
>>>>  [3] bitops_1.0-4.1                        GOstats_2.10.0
>>>>  [5] graph_1.22.2                          Category_2.10.1
>>>>  [7] geneplotter_1.22.0                    lattice_0.17-25
>>>>  [9] limma_2.18.2                          bgSpJncArrayDm1.db_1.0.1
>>>> [11] org.Dm.eg.db_2.2.11                   RSQLite_0.7-1
>>>> [13] DBI_0.2-4                             
>>>> BSgenome.Dmelanogaster.UCSC.dm1_1.0.0
>>>> [15] BSgenome_1.12.3                       RColorBrewer_1.0-2
>>>> [17] Biostrings_2.12.8                     IRanges_1.2.3
>>>> [19] annotate_1.22.0                       AnnotationDbi_1.6.1
>>>> [21] Biobase_2.4.1                         xtable_1.5-5
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] genefilter_1.24.2 GO.db_2.2.11      grid_2.9.1        
>>>> GSEABase_1.6.1
>>>> [5] RBGL_1.20.0       splines_2.9.1     survival_2.35-4   tools_2.9.1
>>>> [9] XML_2.5-3
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>       
>>>     
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list