[BioC] IRanges::Rle and missing values

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Mon Aug 23 15:11:03 CEST 2010


Thanks Patrick, that is a sensible schedule.

Kasper

On Sun, Aug 22, 2010 at 11:37 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>  Kasper,
> I'll look to add na.rm arguments to the run* functions by the next release.
>
>
> Cheers,
> Patrick
>
>
> On 8/21/10 12:47 PM, Kasper Daniel Hansen wrote:
>>
>> Thanks a lot for the fix.
>>
>> Some background.  I have data associated with (very small) genomic
>> locations, irregularly spaced, and I wanted to use the runmean
>> functionality.
>>
>> Now, for the standard example of Rles, coverage across the genome,
>> "missing" data is equal to a coverage of zero.  But in my case, zero
>> is a perfectly fine data value and is quite different from NA, which
>> indicates no data.  So while I would like to calculate running means
>> with a fixed window size (and hence a different number of data points
>> in each window, since they are irregularly spaced), I could not use
>> the runmean function with missing values filled in as zero.
>>
>> I found a solution to my specific problem which uses the fact that my
>> problem with the running mean is more about using the right
>> denominator.  I just create 2 Rle's, one with zeroes and data values
>> and one with 0 and 1 (1 indicating that there is data) and then the
>> "right" running mean is the ratio between two running sums.
>>
>> Since NA's are allowed I think it makes a lot of sense to support them
>> in the run* suite of functions, but it is not something that is
>> extremely urgent (to me) (since I found a workaround).
>>
>> Thanks for the help,
>> Kasper
>>
>> On Fri, Aug 20, 2010 at 8:03 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>>
>>>  Kasper,
>>> I have addressed these two issues, which were caused by inappropriate
>>> comparisons using NA_REAL at the C-level for 'numeric' Rle objects. As with
>>> the runmed function in the stats package, I don't currently support missing
>>> values in the run* methods for Rle objects. Below is the current behavior in
>>> IRanges 1.6.15 (BioC 2.6, R-2.11) and IRanges 1.7.21 (BioC 2.7, R-devel). I
>>> can add support for missing values. Just so I prioritize this, when do you
>>> encounter missing values in your Rle vectors?
>>>
>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>> tmp
>>>
>>> 'numeric' Rle of length 16 with 7 runs
>>>  Lengths:  1  3  1  4  1  5  1
>>>  Values :  1  2  3 NA  2  3  2
>>>
>>>> runsum(tmp, 3)
>>>
>>> Error in runsum(tmp, 3) : some values are NA, NaN, +/-Inf
>>>
>>>> sessionInfo()
>>>
>>> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] IRanges_1.7.21
>>>
>>>
>>>
>>> Patrick
>>>
>>>
>>> On 8/20/10 9:43 AM, Patrick Aboyoun wrote:
>>>>
>>>>  Kasper,
>>>> I'll take a look into this. The Rle constructor issue seems to be isolated
>>>> to 'numeric' and 'complex' Rles. I'll have an update out soon.
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>> On 8/20/10 8:53 AM, Kasper Daniel Hansen wrote:
>>>>>
>>>>> Would it make sense to allow missing values in Rle objects and also to
>>>>> incorporate removal of missing values in running summaries (and
>>>>> possibly other functions)?
>>>>>
>>>>> Example:
>>>>>
>>>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>>>> tmp
>>>>>
>>>>> 'numeric' Rle of length 16 with 10 runs
>>>>>   Lengths:  1  3  1  1  1  1  1  1  5  1
>>>>>   Values :  1  2  3 NA NA NA NA  2  3  2
>>>>>
>>>>> Seems like the run of 4 NA's is treated differently
>>>>>
>>>>>> runsum(tmp, k = 2)
>>>>>
>>>>> 'numeric' Rle of length 15 with 11 runs
>>>>>   Lengths:  1  2  1  1  1  1  1  1  1  4  1
>>>>>   Values :  3  4  5 NA NA NA NA NA NA NA NA
>>>>>
>>>>> And there is no way to do runsum(..., na.rm = TRUE) like in sum (as
>>>>> far as I can see).
>>>>>
>>>>> Kasper
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 2.12.0 Under development (unstable) (2010-08-20 r52790)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>>  [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
>>>>>  [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
>>>>>  [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
>>>>>  [7] LC_PAPER=en_US.iso885915       LC_NAME=C
>>>>>  [9] LC_ADDRESS=C                   LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] grid      stats     graphics  grDevices datasets  utils     methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] multicore_0.1-3   IRanges_1.7.19    matrixStats_0.2.1 R.methodsS3_1.2.0
>>>>> [5] ggplot2_0.8.8     proto_0.3-8       reshape_0.8.3     plyr_1.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.12.0
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>
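
The workaround described in the thread (one Rle holding the data with gaps
set to zero, one 0/1 indicator Rle, and the ratio of their running sums) can
be sketched in base R as follows. This is only an illustration under stated
assumptions, not the code used in the thread: run_sum is a hypothetical
helper standing in for IRanges' runsum, plain numeric vectors stand in for
Rle objects, and the input is the example vector from the thread.

```r
# Example vector from the thread; NA marks positions with no data.
x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)
k <- 3  # window width (assumed for illustration)

# Hypothetical helper: running sum over complete windows of width k,
# computed from cumulative sums. Returns length(v) - k + 1 values.
run_sum <- function(v, k) {
  cs <- cumsum(c(0, v))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

vals     <- ifelse(is.na(x), 0, x)  # data values, with NA replaced by 0
has_data <- as.numeric(!is.na(x))   # 1 where a data value exists, else 0

numerator   <- run_sum(vals, k)      # sum of the observed values per window
denominator <- run_sum(has_data, k)  # number of observed values per window

# Running mean over observed values only; NaN where a window has no data.
na_aware_mean <- numerator / denominator
print(na_aware_mean)
```

With IRanges loaded, the same idea would presumably be written as
runsum(Rle(vals), k) / runsum(Rle(has_data), k). Windows containing no data
at all give 0/0, so they come out as NaN rather than as a misleading zero.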
>
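
The constructor issue in the thread (each NA ending up in its own run) is
what a plain equality test produces, because comparing anything with NA never
yields TRUE. The base R sketch below shows run detection that treats
neighbouring NAs as equal. It is an assumption-laden illustration of the
idea, not the actual C-level fix made in IRanges; the variable names are
made up for this example.

```r
# Example vector from the thread.
x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)
n <- length(x)

# TRUE where element i + 1 continues the run of element i.
# x[-1] == x[-n] is NA whenever either side is NA, so NA pairs are
# handled explicitly and any remaining NA comparisons count as "different".
same <- (x[-1] == x[-n]) | (is.na(x[-1]) & is.na(x[-n]))
same[is.na(same)] <- FALSE

run_ends    <- c(which(!same), n)   # index of the last element of each run
run_lengths <- diff(c(0, run_ends)) # number of elements in each run
run_values  <- x[run_ends]          # value carried by each run

print(run_lengths)
print(run_values)
```

On this input the four NAs collapse into a single run of length 4, matching
the 7-run output shown for IRanges 1.7.21 in the thread.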


