[BioC] IRanges::Rle and missing values

Patrick Aboyoun paboyoun at fhcrc.org
Mon Aug 23 05:37:24 CEST 2010


  Kasper,
I'll look to add na.rm arguments to the run* functions by the next release.


Cheers,
Patrick


On 8/21/10 12:47 PM, Kasper Daniel Hansen wrote:
> Thanks a lot for the fix.
>
> Some background.  I have data associated with (very small) genomic
> locations, irregularly spaced, and I wanted to use the runmean
> functionality.
>
> Now, for the standard example of Rles: coverage across the genome,
> "missing" data is equal to a coverage of zero.  But in my case, zero
> is a perfectly fine data value and is quite different from NA which
> indicates no data.  So while I would like to calculate running means
> with a fixed window size (and hence different number of data points in
> each window since they are irregularly spaced) I could not use the
> runmeans function, with missing values filled in as zero.
>
> I found a solution to my specific problem which uses the fact that my
> problem with the running mean is more about using the right
> denominator.  I just create 2 Rle's, one with zeroes and data values
> and one with 0 and 1 (1 indicating that there is data) and then the
> "right" running mean is the ratio between two running sums.
>
> Since NA's are allowed I think it makes a lot of sense to support them
> in the run* suite of functions, but it is not something that is
> extremely urgent (to me) (since I found a workaround).
>
> Thanks for the help,
> Kasper
>
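
The workaround Kasper describes (a running mean computed as the ratio of two running sums) can be sketched roughly as follows. This is only an illustration under stated assumptions, not his actual code: it uses plain numeric vectors and base R instead of Rle objects, and `run_sum` is a hypothetical helper standing in for runsum.

```r
# Hypothetical helper: sum over every window of k consecutive elements,
# computed from cumulative sums. Stands in for runsum() on an Rle.
run_sum <- function(x, k) {
  cs <- c(0, cumsum(x))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

# Same example vector as in the thread; NA means "no data", not zero.
x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)
k <- 3

has_data  <- !is.na(x)
values    <- ifelse(has_data, x, 0)   # data values, with NA replaced by 0
indicator <- as.numeric(has_data)     # 1 where there is data, 0 otherwise

num <- run_sum(values, k)             # running sum of the data
den <- run_sum(indicator, k)          # number of data points per window

# Windows containing no data at all stay NA instead of becoming 0/0.
run_mean <- ifelse(den > 0, num / den, NA_real_)
print(run_mean)
```

The denominator is what separates this from filling NA with zero and calling a running mean directly: a window holding two data points and one NA is divided by 2, not by 3.
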
> On Fri, Aug 20, 2010 at 8:03 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>   Kasper,
>> I have addressed these two issues, which were caused by inappropriate
>> comparisons using NA_REAL at the C-level for 'numeric' Rle objects. As with
>> the runmed function in the stats package, I don't currently support missing
>> values in the run* methods for Rle objects. Below is the current behavior in
>> IRanges 1.6.15 (BioC 2.6, R-2.11) and IRanges 1.7.21 (BioC 2.7, R-devel). I
>> can add support for missing values. Just so I prioritize this, when do you
>> encounter missing values in your Rle vectors?
>>
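
To show why comparisons involving NA can split a run of missing values, here is a small base R sketch. It is an R-level analogy only and makes no claim about the actual C code in IRanges; the variable names are made up for the example.

```r
x <- c(1, 2, 2, NA, NA, NA, 3)

prev <- x[-length(x)]
nxt  <- x[-1]

# A plain comparison yields NA whenever either side is NA, so it cannot
# tell us that two neighbouring NAs belong to the same run.
naive_same <- prev == nxt

# NA-aware comparison: neighbours match if both are NA, or if both are
# non-NA and equal.
same <- (is.na(prev) & is.na(nxt)) |
  (!is.na(prev) & !is.na(nxt) & prev == nxt)

# Run boundaries and run lengths derived from the NA-aware comparison.
starts      <- which(c(TRUE, !same))
run_lengths <- diff(c(starts, length(x) + 1))
print(run_lengths)

# Base R's rle() documents that missing values are regarded as unequal
# to the previous value, so it reports each NA as its own run.
print(rle(x)$lengths)
```
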
>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>> tmp
>> 'numeric' Rle of length 16 with 7 runs
>>   Lengths:  1  3  1  4  1  5  1
>>   Values :  1  2  3 NA  2  3  2
>>
>>> runsum(tmp, 3)
>> Error in runsum(tmp, 3) : some values are NA, NaN, +/-Inf
>>
>>> sessionInfo()
>> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] IRanges_1.7.21
>>
>>
>>
>> Patrick
>>
>>
>> On 8/20/10 9:43 AM, Patrick Aboyoun wrote:
>>>   Kasper,
>>> I'll take a look into this. The Rle constructor issue seems to be isolated
>>> to 'numeric' and 'complex' Rles. I'll have an update out soon.
>>>
>>>
>>> Patrick
>>>
>>>
>>> On 8/20/10 8:53 AM, Kasper Daniel Hansen wrote:
>>>> Would it make sense to allow missing values in Rle objects and also to
>>>> incorporate removal of missing values in running summaries (and
>>>> possibly other functions)?
>>>>
>>>> Example:
>>>>
>>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>>> tmp
>>>> 'numeric' Rle of length 16 with 10 runs
>>>>    Lengths:  1  3  1  1  1  1  1  1  5  1
>>>>    Values :  1  2  3 NA NA NA NA  2  3  2
>>>>
>>>> Seems like the run of 4 NA's is treated differently
>>>>
>>>>> runsum(tmp, k = 2)
>>>> 'numeric' Rle of length 15 with 11 runs
>>>>    Lengths:  1  2  1  1  1  1  1  1  1  4  1
>>>>    Values :  3  4  5 NA NA NA NA NA NA NA NA
>>>>
>>>> And there is no way to do runsum(..., na.rm = TRUE) like in sum (as
>>>> far as I can see).
>>>>
>>>> Kasper
>>>>
>>>>> sessionInfo()
>>>> R version 2.12.0 Under development (unstable) (2010-08-20 r52790)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>   [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>>>   [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>>>   [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
>>>>   [7] LC_PAPER=en_US.iso885915       LC_NAME=C
>>>>   [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] grid      stats     graphics  grDevices datasets  utils     methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] multicore_0.1-3   IRanges_1.7.19    matrixStats_0.2.1 R.methodsS3_1.2.0
>>>> [5] ggplot2_0.8.8     proto_0.3-8       reshape_0.8.3     plyr_1.1
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] tools_2.12.0
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>