[BioC] loop over IRanges spaces
Patrick Aboyoun
paboyoun at fhcrc.org
Thu Apr 29 22:56:35 CEST 2010
Yvan,
I'm not sure if I follow all the details you provided. You have a GFF
file contained "scored" regions. (Do these regions form a partitioning
of your sequence positions or do the regions overlap? If a
partitioning/region is not represented is it assumed to have a score of
0?) You would like to aggregate over neighboring regions using a moving
window sum, thus creating new partitioning (with accompanying scores) of
the sequence positions. (I provided code to do what I just mentioned.)
From that point I am less clear on what you are trying to do. Could you
provide a simple concrete example maybe involving three regions on a
single sequence to illustrate what you are trying to achieve? Thanks.
Patrick
On 4/29/10 4:51 AM, Yvan wrote:
> Hello Patrick,
>
> Sorry for the late answer and thank you very much for these precisions.
>
> Maybe I should try to explain what I am trying to do. I am starting
> with a gff file produced by Nimblegen's NimbleScan software for a CGH
> high density array. Out of the nine columns, I am interested mainly in
> the first (seqname) and the sixth (score) even if all columns are
> loaded into a gff object/matrix(correct term?). The main purpose of
> my project is to filtrate out some region of each spaces based on
> sliding window over the score values. The same methodology used to
> filtrate out the regions will be used in a second pass to detect the
> region of interest. I was hoping using IRanges to perform the first
> sliding window on the different spaces, and then create some views for
> each spaces based on the sliding window results and finally
> recalculate the sliding window on all the views. After these two
> passes, a track for the ucsc genome browser should be created and also
> a gff file in order to view the results in the SignalMap software from
> Nimblegen.
>
> Hope I was clear enough and do you think that IRanges is the correct
> tool?
> Thank you again for your time and help.
> Cheers,
> yvan
>
>
> On 26/04/10 21:08, Patrick Aboyoun wrote:
>> Yvan,
>> It appears to me that you are trying to perform two conflicting
>> activities:
>>
>> 1) Calculate the running sum of a metric over an annotated sequence
>> (as evidenced by your aggregate function call)
>> 2) Find the sum for a metric across specified intervals on the
>> annotated sequence (as evidenced by your desire to assign the
>> aggregated sums into an existing RangedData object)
>>
>> Taking a step back, I am guessing that you are trying to transform
>> something akin to a UCSC bed file into something else that is UCSC
>> bed file like. If you are using rtracklayer, this means your initial
>> data are stored in a RangedData object. To create a RangedData object
>> containing the running sum of a values column from an initial
>> RangedData object, I recommend:
>>
>> 1) Creating an RleList object from the RangedData object using the
>> coverage function. Make sure to specify the metric of interest in the
>> weight argument to coverage.
>> 2) Using the runsum function on the RleList object to calculate your
>> running sums.
>> 3) Creating a RangedData object from the RleList object in step 2
>> using as(<<obj>>, "RangedData")
>>
>> Here is an example:
>>
>> > # Step 1: create an RleList representation of the metric
>> > rd <- RangedData(IRanges(start = c(5, 10, 15, 2, 4, 8), end = c(7,
>> 14, 21, 3, 6, 9)),
>> score = 1:6, space = rep(c("A", "B"), each = 3))
>> > scoreRleList <- coverage(rd, weight = "score", width = list(A = 30,
>> B = 10))
>> > scoreRleList
>> SimpleRleList of length 2
>> $A
>> 'integer' Rle of length 30 with 6 runs
>> Lengths: 4 3 2 5 7 9
>> Values : 0 1 0 2 3 0
>>
>> $B
>> 'integer' Rle of length 10 with 6 runs
>> Lengths: 1 2 3 1 2 1
>> Values : 0 4 5 0 6 0
>>
>> > # Step 2: calculate the running sums
>> > scoreRunsum <- runsum(scoreRleList, k = 3, endrule = "constant")
>> > scoreRunsum
>> SimpleRleList of length 2
>> $A
>> 'integer' Rle of length 30 with 15 runs
>> Lengths: 3 1 1 1 1 1 1 1 3 1 1 5 1 1 8
>> Values : 0 1 2 3 2 1 2 4 6 7 8 9 6 3 0
>>
>> $B
>> 'integer' Rle of length 10 with 7 runs
>> Lengths: 2 1 1 1 1 1 3
>> Values : 8 13 14 15 10 11 12
>>
>>
>> > # Step 3: Create a RangedData representation of the running sums
>> > rdRunsum <- as(scoreRunsum, "RangedData")
>> > rdRunsum
>> RangedData with 22 rows and 1 value column across 2 spaces
>> space ranges | score
>> <character> <IRanges> | <integer>
>> 1 A [ 1, 3] | 0
>> 2 A [ 4, 4] | 1
>> 3 A [ 5, 5] | 2
>> 4 A [ 6, 6] | 3
>> 5 A [ 7, 7] | 2
>> 6 A [ 8, 8] | 1
>> 7 A [ 9, 9] | 2
>> 8 A [10, 10] | 4
>> 9 A [11, 13] | 6
>> ... ... ... ... ...
>> 14 A [22, 22] | 3
>> 15 A [23, 30] | 0
>> 16 B [ 1, 2] | 8
>> 17 B [ 3, 3] | 13
>> 18 B [ 4, 4] | 14
>> 19 B [ 5, 5] | 15
>> 20 B [ 6, 6] | 10
>> 21 B [ 7, 7] | 11
>> 22 B [ 8, 10] | 12
>>
>> > sessionInfo()
>> R version 2.11.0 Patched (2010-04-24 r51820)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] IRanges_1.6.1
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.0
>>
>>
>>
>> On 4/23/10 6:26 AM, Michael Lawrence wrote:
>>> Also note that it's not really necessary to loop here, as is often
>>> the case
>>> with IRanges:
>>>
>>> rd$windo<- unlist(runmean(RleList(values(rd)[,"score"])))
>>>
>>> On Fri, Apr 23, 2010 at 6:13 AM, Michael
>>> Lawrence<michafla at gene.com> wrote:
>>>
>>>>
>>>> On Fri, Apr 23, 2010 at 1:49 AM, Yvan<yvan.strahm at uni.no> wrote:
>>>>
>>>>> Hello,
>>>>> Thank you both of you.
>>>>>
>>>>> I could could calculate the sliding window, but not as a Rle
>>>>> object, could
>>>>> not append values for the last w-1 position in the Rle object in
>>>>> order to
>>>>> take care of the size problem.
>>>>>
>>>>>
>>>> Why not? Rle supports all the normal vector operations. And runsum or
>>>> runmean will output a vector of the same size as the input, using a
>>>> choice
>>>> of two endrules. If you want 0's at the end, try something like:
>>>>
>>>> rle[(nrow(rd)-w+1):nrow(rd)]<- 0
>>>>
>>>>
>>>>
>>>>> So I did it like that:
>>>>>
>>>>> params<-RDApplyParams(rd,function(rd)
>>>>> append((diff(c(0,cumsum(rd$score)),lag=w)/w),rep(0,each=w-1),after=(length(rd$score)-w+1)))
>>>>>
>>>>>
>>>>> But when I try to add the new values to the rangedData object I
>>>>> got these
>>>>> error.
>>>>>
>>>>>> values(rd)[,"windo"]<-rdapply(params)
>>>>> Error in `[<-`(`*tmp*`, , j, value =<S4 object of class
>>>>> "DataFrame">) :
>>>>> ncol(x[j]) != ncol(value)
>>>>> In addition: Warning messages:
>>>>> 1: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>> longer argument not a multiple of length of shorter
>>>>> 2: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>> longer argument not a multiple of length of shorter
>>>>>
>>>>> But when I check the size, they are the same, here for one space
>>>>>
>>>>>> x<-rdapply(params)
>>>>>> length(x$SCAFFOLD_100) == length(rd["SCAFFOLD_100"]$windo)
>>>>> [1] TRUE
>>>>>
>>>>>
>>>> It may be that that type of insertion is unsupported. Why not just do
>>>> something like:
>>>>
>>>> rd$window<- unlist(x)
>>>>
>>>>
>>>>> Maybe params miss a parameter or the way I try to update the rd
>>>>> object is
>>>>> wrong. Anyway form the rdapply output a vector could be created
>>>>> and so a new
>>>>> rd object with the new value column.
>>>>>
>>>>> yvan
>>>>>
>>>>>
>>>>> On 22/04/10 15:51, Michael Lawrence wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 22, 2010 at 5:49 AM, Michael
>>>>> Dondrup<Michael.Dondrup at uni.no>wrote:
>>>>>
>>>>>> Hi,
>>>>>> how about funtion rdapply (not lapply) which is for that?
>>>>>>
>>>>>>
>>>>> lapply() should apply per-space as well, basically providing a
>>>>> short-cut
>>>>> for the more complicated rdapply().
>>>>>
>>>>>> lapply(rd, function(x) sum(x$score))
>>>>> $chr1
>>>>> [1] 3
>>>>>
>>>>> $chr2
>>>>> [1] 0
>>>>>
>>>>> sapply() also works:
>>>>>> sapply(rd, function(x) sum(x$score))
>>>>> chr1 chr2
>>>>> 3 0
>>>>>
>>>>> Another choice is tapply:
>>>>>> tapply(rd$score, space(rd), sum)
>>>>> chr1 chr2
>>>>> 3 0
>>>>>
>>>>> Michael
>>>>>
>>>>> The code below computes the sum score for each space in the
>>>>> RangedData:
>>>>>> # taken from the examples mostly:
>>>>>>> ranges<- IRanges(c(1,2,3),c(4,5,6))
>>>>>>> score<- c(2L, 0L, 1L)
>>>>>>> rd<- RangedData(ranges, score, space = c("chr1","chr2","chr1"))
>>>>>>> rd
>>>>>> RangedData with 3 rows and 1 value column across 2 spaces
>>>>>> space ranges | score
>>>>>> <character> <IRanges> |<integer>
>>>>>> 1 chr1 [1, 4] | 2
>>>>>> 2 chr1 [3, 6] | 1
>>>>>> 3 chr2 [2, 5] | 0
>>>>>>> params<- RDApplyParams(rd, function(rd) sum(score(rd)))
>>>>>>> rdapply(params)
>>>>>> $chr1
>>>>>> [1] 3
>>>>>>
>>>>>> $chr2
>>>>>> [1] 0
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Michael
>>>>>>
>>>>>> Am Apr 22, 2010 um 1:57 PM schrieb Yvan:
>>>>>>
>>>>>>> On 21/04/10 18:43, Michael Lawrence wrote:
>>>>>>>>
>>>>>>>> On Wed, Apr 21, 2010 at 6:07 AM, Yvan<yvan.strahm at uni.no
>>>>>>>> <mailto:yvan.strahm at uni.no>> wrote:
>>>>>>>>
>>>>>>>> Hello List,
>>>>>>>>
>>>>>>>> I am confused about how to loop over a rangedData object.
>>>>>>>> I have this rangedData
>>>>>>>>
>>>>>>>> RangedData with 61 rows and 1 value column across 3 spaces
>>>>>>>> space ranges | score
>>>>>>>> <character> <IRanges> |<numeric>
>>>>>>>> 1 SCAFFOLD_1 [ 8, 8] | -0.09405
>>>>>>>>
>>>>>>>> and the spaces are
>>>>>>>>
>>>>>>>> "SCAFFOLD_1" "SCAFFOLD_10" "SCAFFOLD_100"
>>>>>>>>
>>>>>>>> using aggregate it is possible to apply a function to one
>>>>>>>> of the
>>>>>> space
>>>>>>>> aggregate(rd["SCAFFOLD_1"]$score, start =
>>>>>>>> 1:(length(rd["SCAFFOLD_1"]$score)-w+1), width = w, FUN = sum)
>>>>>>>>
>>>>>>>> but how can I apply the aggregate to all space without a
>>>>>>>> for loop ?
>>>>>>>>
>>>>>>>>
>>>>>>>> It looks like you're attempting a running window sum of the score
>>>>>>>> vector. There are more efficient ways of doing this besides
>>>>>>>> aggregate(). If you convert the score into an Rle, you can use
>>>>>> runsum().
>>>>>>>> Anyway, to do this over each space individually, use lapply().
>>>>>>>>
>>>>>>>> This would come out to something like:
>>>>>>>>
>>>>>>>> values(rd)[,"smoothScore"]<- lapply(rd, function(x)
>>>>>>>> runsum(Rle(x$score), w))
>>>>>>>>
>>>>>>>> Probably not exactly right, but it gets you in the right
>>>>>>>> direction...
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>
>>>>>>> Hello Michael,
>>>>>>>
>>>>>>> Thanks for the answer and the tip about runsum!
>>>>>>> I try with lapply but could not get it working right, the main
>>>>>>> problem
>>>>>>> is that the runsum is calculated on all values and not for a each
>>>>>>> specific spaces.
>>>>>>> Sorry, I should have been more precise in the problem description.
>>>>>>> The runsum should be calculated in a space specific manner, let
>>>>>>> say w=2
>>>>>>>
>>>>>>> space score cumsum
>>>>>>> 1 space1 1 3
>>>>>>> 2 space1 2 4
>>>>>>> 3 space1 2 NA
>>>>>>> 4 space2 10 21
>>>>>>> 5 space2 11 22
>>>>>>> 6 space2 11 NA
>>>>>>>
>>>>>>> Is it possible to do it with lapply?
>>>>>>> Thanks again for your help
>>>>>>> cheers,
>>>>>>> yvan
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
More information about the Bioconductor
mailing list