[BioC] loop over IRanges spaces

Patrick Aboyoun paboyoun at fhcrc.org
Thu Apr 29 22:56:35 CEST 2010


Yvan,
I'm not sure if I follow all the details you provided. You have a GFF 
file contained "scored" regions. (Do these regions form a partitioning 
of your sequence positions or do the regions overlap? If a 
partitioning/region is not represented is it assumed to have a score of 
0?) You would like to aggregate over neighboring regions using a moving 
window sum, thus creating new partitioning (with accompanying scores) of 
the sequence positions. (I provided code to do what I just mentioned.) 
 From that point I am less clear on what you are trying to do. Could you 
provide a simple concrete example maybe involving three regions on a 
single sequence to illustrate what you are trying to achieve? Thanks.


Patrick



On 4/29/10 4:51 AM, Yvan wrote:
> Hello Patrick,
>
> Sorry for the late answer and thank you very much for these precisions.
>
> Maybe I should try to explain what I am trying to do. I am starting 
> with a gff file produced by Nimblegen's NimbleScan software for a CGH 
> high density array. Out of the nine columns, I am interested mainly in 
> the first (seqname) and the sixth (score) even if all columns are 
> loaded into a gff object/matrix(correct term?).  The main purpose of 
> my project is to filtrate out some region of each spaces based on 
> sliding window over the score values. The same methodology used to 
> filtrate out the regions will be used in a second pass to detect the 
> region of interest. I was hoping using IRanges to perform the first 
> sliding window on the different spaces, and then create some views for 
> each spaces based on the sliding window results and finally 
> recalculate the sliding window on all the views. After these two 
> passes, a track for the ucsc genome browser should be created and also 
> a gff file in order to view the results in the SignalMap software from 
> Nimblegen.
>
> Hope I was clear enough and do you think that IRanges is the correct 
> tool?
> Thank you again for your time and help.
> Cheers,
> yvan
>
>
> On 26/04/10 21:08, Patrick Aboyoun wrote:
>> Yvan,
>> It appears to me that you are trying to perform two conflicting 
>> activities:
>>
>> 1) Calculate the running sum of a metric over an annotated sequence 
>> (as evidenced by your aggregate function call)
>> 2) Find the sum for a metric across specified intervals on the 
>> annotated sequence (as evidenced by your desire to assign the 
>> aggregated sums into an existing RangedData object)
>>
>> Taking a step back, I am guessing that you are trying to transform 
>> something akin to a UCSC bed file into something else that is UCSC 
>> bed file like. If you are using rtracklayer, this means your initial 
>> data are stored in a RangedData object. To create a RangedData object 
>> containing the running sum of a values column from an initial 
>> RangedData object, I recommend:
>>
>> 1) Creating an RleList object from the RangedData object using the 
>> coverage function. Make sure to specify the metric of interest in the 
>> weight argument to coverage.
>> 2) Using the runsum function on the RleList object to calculate your 
>> running sums.
>> 3) Creating a RangedData object from the RleList object in step 2 
>> using as(<<obj>>, "RangedData")
>>
>> Here is an example:
>>
>> > # Step 1: create an RleList representation of the metric
>> > rd <- RangedData(IRanges(start = c(5, 10, 15, 2, 4, 8), end = c(7, 
>> 14, 21, 3, 6, 9)),
>>                    score = 1:6, space = rep(c("A", "B"), each = 3))
>> > scoreRleList <- coverage(rd, weight = "score", width = list(A = 30, 
>> B = 10))
>> > scoreRleList
>> SimpleRleList of length 2
>> $A
>> 'integer' Rle of length 30 with 6 runs
>>   Lengths: 4 3 2 5 7 9
>>   Values : 0 1 0 2 3 0
>>
>> $B
>> 'integer' Rle of length 10 with 6 runs
>>   Lengths: 1 2 3 1 2 1
>>   Values : 0 4 5 0 6 0
>>
>> > # Step 2: calculate the running sums
>> > scoreRunsum <- runsum(scoreRleList, k = 3, endrule = "constant")
>> > scoreRunsum
>> SimpleRleList of length 2
>> $A
>> 'integer' Rle of length 30 with 15 runs
>>   Lengths: 3 1 1 1 1 1 1 1 3 1 1 5 1 1 8
>>   Values : 0 1 2 3 2 1 2 4 6 7 8 9 6 3 0
>>
>> $B
>> 'integer' Rle of length 10 with 7 runs
>>   Lengths:  2  1  1  1  1  1  3
>>   Values :  8 13 14 15 10 11 12
>>
>>
>> > # Step 3: Create a RangedData representation of the running sums
>> > rdRunsum <- as(scoreRunsum, "RangedData")
>> > rdRunsum
>> RangedData with 22 rows and 1 value column across 2 spaces
>>           space    ranges   |     score
>> <character> <IRanges>   | <integer>
>> 1             A  [ 1,  3]   |         0
>> 2             A  [ 4,  4]   |         1
>> 3             A  [ 5,  5]   |         2
>> 4             A  [ 6,  6]   |         3
>> 5             A  [ 7,  7]   |         2
>> 6             A  [ 8,  8]   |         1
>> 7             A  [ 9,  9]   |         2
>> 8             A  [10, 10]   |         4
>> 9             A  [11, 13]   |         6
>> ...         ...       ... ...       ...
>> 14            A  [22, 22]   |         3
>> 15            A  [23, 30]   |         0
>> 16            B  [ 1,  2]   |         8
>> 17            B  [ 3,  3]   |        13
>> 18            B  [ 4,  4]   |        14
>> 19            B  [ 5,  5]   |        15
>> 20            B  [ 6,  6]   |        10
>> 21            B  [ 7,  7]   |        11
>> 22            B  [ 8, 10]   |        12
>>
>> > sessionInfo()
>> R version 2.11.0 Patched (2010-04-24 r51820)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] IRanges_1.6.1
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.0
>>
>>
>>
>> On 4/23/10 6:26 AM, Michael Lawrence wrote:
>>> Also note that it's not really necessary to loop here, as is often 
>>> the case
>>> with IRanges:
>>>
>>> rd$windo<- unlist(runmean(RleList(values(rd)[,"score"])))
>>>
>>> On Fri, Apr 23, 2010 at 6:13 AM, Michael 
>>> Lawrence<michafla at gene.com>  wrote:
>>>
>>>>
>>>> On Fri, Apr 23, 2010 at 1:49 AM, Yvan<yvan.strahm at uni.no>  wrote:
>>>>
>>>>>   Hello,
>>>>> Thank you both of you.
>>>>>
>>>>> I could could calculate the sliding window, but not as a Rle 
>>>>> object, could
>>>>> not append values for the last w-1 position in the Rle object in 
>>>>> order to
>>>>> take care of the size problem.
>>>>>
>>>>>
>>>> Why not? Rle supports all the normal vector operations. And runsum or
>>>> runmean will output a vector of the same size as the input, using a 
>>>> choice
>>>> of two endrules. If you want 0's at the end, try something like:
>>>>
>>>> rle[(nrow(rd)-w+1):nrow(rd)]<- 0
>>>>
>>>>
>>>>
>>>>> So I did it like that:
>>>>>
>>>>> params<-RDApplyParams(rd,function(rd)
>>>>> append((diff(c(0,cumsum(rd$score)),lag=w)/w),rep(0,each=w-1),after=(length(rd$score)-w+1))) 
>>>>>
>>>>>
>>>>> But when I try to add the new values to the rangedData object I 
>>>>> got these
>>>>> error.
>>>>>
>>>>>> values(rd)[,"windo"]<-rdapply(params)
>>>>> Error in `[<-`(`*tmp*`, , j, value =<S4 object of class 
>>>>> "DataFrame">) :
>>>>>    ncol(x[j]) != ncol(value)
>>>>> In addition: Warning messages:
>>>>> 1: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>>    longer argument not a multiple of length of shorter
>>>>> 2: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>>    longer argument not a multiple of length of shorter
>>>>>
>>>>> But when I check the size, they are the same, here for one space
>>>>>
>>>>>> x<-rdapply(params)
>>>>>> length(x$SCAFFOLD_100) == length(rd["SCAFFOLD_100"]$windo)
>>>>> [1] TRUE
>>>>>
>>>>>
>>>> It may be that that type of insertion is unsupported. Why not just do
>>>> something like:
>>>>
>>>> rd$window<- unlist(x)
>>>>
>>>>
>>>>>   Maybe params miss a parameter or the way I try to update the rd 
>>>>> object is
>>>>> wrong. Anyway form the rdapply output a vector could be created 
>>>>> and so a new
>>>>> rd object with the new value column.
>>>>>
>>>>> yvan
>>>>>
>>>>>
>>>>> On 22/04/10 15:51, Michael Lawrence wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 22, 2010 at 5:49 AM, Michael 
>>>>> Dondrup<Michael.Dondrup at uni.no>wrote:
>>>>>
>>>>>> Hi,
>>>>>> how about funtion rdapply (not lapply) which is for that?
>>>>>>
>>>>>>
>>>>> lapply() should apply per-space as well, basically providing a 
>>>>> short-cut
>>>>> for the more complicated rdapply().
>>>>>
>>>>>> lapply(rd, function(x) sum(x$score))
>>>>> $chr1
>>>>> [1] 3
>>>>>
>>>>> $chr2
>>>>> [1] 0
>>>>>
>>>>> sapply() also works:
>>>>>> sapply(rd, function(x) sum(x$score))
>>>>> chr1 chr2
>>>>>     3    0
>>>>>
>>>>> Another choice is tapply:
>>>>>> tapply(rd$score, space(rd), sum)
>>>>> chr1 chr2
>>>>>     3    0
>>>>>
>>>>> Michael
>>>>>
>>>>>   The code below computes the sum score for each space in the 
>>>>> RangedData:
>>>>>> # taken from the examples mostly:
>>>>>>> ranges<- IRanges(c(1,2,3),c(4,5,6))
>>>>>>>    score<- c(2L, 0L, 1L)
>>>>>>>    rd<- RangedData(ranges, score, space = c("chr1","chr2","chr1"))
>>>>>>> rd
>>>>>> RangedData with 3 rows and 1 value column across 2 spaces
>>>>>>         space    ranges |     score
>>>>>> <character> <IRanges>  |<integer>
>>>>>> 1        chr1    [1, 4] |         2
>>>>>> 2        chr1    [3, 6] |         1
>>>>>> 3        chr2    [2, 5] |         0
>>>>>>> params<- RDApplyParams(rd, function(rd) sum(score(rd)))
>>>>>>> rdapply(params)
>>>>>> $chr1
>>>>>> [1] 3
>>>>>>
>>>>>> $chr2
>>>>>> [1] 0
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Michael
>>>>>>
>>>>>> Am Apr 22, 2010 um 1:57 PM schrieb Yvan:
>>>>>>
>>>>>>> On 21/04/10 18:43, Michael Lawrence wrote:
>>>>>>>>
>>>>>>>> On Wed, Apr 21, 2010 at 6:07 AM, Yvan<yvan.strahm at uni.no
>>>>>>>> <mailto:yvan.strahm at uni.no>>  wrote:
>>>>>>>>
>>>>>>>>     Hello List,
>>>>>>>>
>>>>>>>>     I am confused about  how to loop over a rangedData object.
>>>>>>>>     I have this rangedData
>>>>>>>>
>>>>>>>>     RangedData with 61 rows and 1 value column across 3 spaces
>>>>>>>>             space     ranges |     score
>>>>>>>> <character> <IRanges>  |<numeric>
>>>>>>>>     1   SCAFFOLD_1 [  8,   8] |  -0.09405
>>>>>>>>
>>>>>>>>     and the spaces are
>>>>>>>>
>>>>>>>>     "SCAFFOLD_1"   "SCAFFOLD_10"  "SCAFFOLD_100"
>>>>>>>>
>>>>>>>>     using aggregate it is possible to apply a function to one 
>>>>>>>> of the
>>>>>> space
>>>>>>>>     aggregate(rd["SCAFFOLD_1"]$score, start =
>>>>>>>>     1:(length(rd["SCAFFOLD_1"]$score)-w+1), width = w, FUN = sum)
>>>>>>>>
>>>>>>>>     but how can I apply the aggregate to all space without a 
>>>>>>>> for loop ?
>>>>>>>>
>>>>>>>>
>>>>>>>> It looks like you're attempting a running window sum of the score
>>>>>>>> vector. There are more efficient ways of doing this besides
>>>>>>>> aggregate(). If you convert the score into an Rle, you can use
>>>>>> runsum().
>>>>>>>> Anyway, to do this over each space individually, use lapply().
>>>>>>>>
>>>>>>>> This would come out to something like:
>>>>>>>>
>>>>>>>> values(rd)[,"smoothScore"]<- lapply(rd, function(x)
>>>>>>>> runsum(Rle(x$score), w))
>>>>>>>>
>>>>>>>> Probably not exactly right, but it gets you in the right 
>>>>>>>> direction...
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>> Hello Michael,
>>>>>>>
>>>>>>> Thanks for the answer and the tip about runsum!
>>>>>>> I try with lapply but could not get it working right, the main 
>>>>>>> problem
>>>>>>> is that the runsum is calculated on all values and not for a each
>>>>>>> specific spaces.
>>>>>>> Sorry, I should have been more precise in the problem description.
>>>>>>> The runsum should be calculated in a space specific manner, let 
>>>>>>> say w=2
>>>>>>>
>>>>>>> space        score    cumsum
>>>>>>> 1 space1    1            3
>>>>>>> 2 space1    2            4
>>>>>>> 3 space1    2            NA
>>>>>>> 4 space2    10          21
>>>>>>> 5 space2    11          22
>>>>>>> 6 space2    11          NA
>>>>>>>
>>>>>>> Is it possible to do it with lapply?
>>>>>>> Thanks again for your help
>>>>>>> cheers,
>>>>>>> yvan
>>>>>>>
>>>>>>>        [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>



More information about the Bioconductor mailing list