[BioC] IRanges::Rle and missing values
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Mon Aug 23 15:11:03 CEST 2010
Thanks Patrick, that is a sensible schedule.
Kasper
On Sun, Aug 22, 2010 at 11:37 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Kasper,
> I'll look to add na.rm arguments to the run* functions by the next release.
>
>
> Cheers,
> Patrick
>
>
> On 8/21/10 12:47 PM, Kasper Daniel Hansen wrote:
>>
>> Thanks a lot for the fix.
>>
>> Some background. I have data associated with (very small) genomic
>> locations, irregularly spaced, and I wanted to use the runmean
>> functionality.
>>
>> Now, for the standard example of Rles: coverage across the genome,
>> "missing" data is equal to a coverage of zero. But in my case, zero
>> is a perfectly fine data value and is quite different from NA which
>> indicates no data. So while I would like to calculate running means
>> with a fixed window size (and hence different number of data points in
>> each window, since they are irregularly spaced), I could not use the
>> runmean function with missing values filled in as zero.
>>
>> I found a solution to my specific problem which uses the fact that my
>> problem with the running mean is more about using the right
>> denominator. I just create 2 Rle's, one with zeroes and data values
>> and one with 0 and 1 (1 indicating that there is data) and then the
>> "right" running mean is the ratio between two running sums.
>>
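A minimal sketch of the ratio-of-two-running-sums idea described above, in base R so that it runs without IRanges. The helper running_sum() is hypothetical and only stands in for runsum(); with Rle objects the same ratio would presumably be taken between two runsum() results. The example vector is the one used elsewhere in this thread, and the window size k = 3 is an arbitrary choice.

```r
# Hypothetical helper (not part of IRanges): sum over every window of k
# consecutive values, dropping the incomplete windows at the ends.
running_sum <- function(x, k) {
  cs <- c(0, cumsum(x))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)
k <- 3

has_data <- as.numeric(!is.na(x))   # 1 where there is data, 0 where missing
filled   <- ifelse(is.na(x), 0, x)  # data values, with NA replaced by 0

# Running mean over the observed values only: the denominator counts the
# data points in each window instead of always being k.
run_mean <- running_sum(filled, k) / running_sum(has_data, k)
```

Windows that contain no data at all come out as NaN (0/0), which keeps them distinguishable from windows whose mean is a genuine zero.
>>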
>> Since NAs are allowed, I think it makes a lot of sense to support them
>> in the run* suite of functions, but it is not extremely urgent to me,
>> since I found a workaround.
>>
>> Thanks for the help,
>> Kasper
>>
>> On Fri, Aug 20, 2010 at 8:03 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
>>>
>>> Kasper,
>>> I have addressed these two issues, which were caused by inappropriate
>>> comparisons using NA_REAL at the C-level for 'numeric' Rle objects. As with
>>> the runmed function in the stats package, I don't currently support missing
>>> values in the run* methods for Rle objects. Below is the current behavior in
>>> IRanges 1.6.15 (BioC 2.6, R-2.11) and IRanges 1.7.21 (BioC 2.7, R-devel). I
>>> can add support for missing values. Just so I prioritize this, when do you
>>> encounter missing values in your Rle vectors?
>>>
>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>> tmp
>>>
>>> 'numeric' Rle of length 16 with 7 runs
>>> Lengths: 1 3 1 4 1 5 1
>>> Values : 1 2 3 NA 2 3 2
>>>
>>>> runsum(tmp, 3)
>>>
>>> Error in runsum(tmp, 3) : some values are NA, NaN, +/-Inf
>>>
>>>> sessionInfo()
>>>
>>> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] IRanges_1.7.21
>>>
>>>
>>>
>>> Patrick
>>>
>>>
>>> On 8/20/10 9:43 AM, Patrick Aboyoun wrote:
>>>>
>>>> Kasper,
>>>> I'll take a look into this. The Rle constructor issue seems to be isolated
>>>> to 'numeric' and 'complex' Rles. I'll have an update out soon.
>>>>
>>>>
>>>> Patrick
>>>>
>>>>
>>>> On 8/20/10 8:53 AM, Kasper Daniel Hansen wrote:
>>>>>
>>>>> Would it make sense to allow missing values in Rle objects and also to
>>>>> incorporate removal of missing values in running summaries (and
>>>>> possibly other functions)?
>>>>>
>>>>> Example:
>>>>>
>>>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>>>> tmp
>>>>>
>>>>> 'numeric' Rle of length 16 with 10 runs
>>>>> Lengths: 1 3 1 1 1 1 1 1 5 1
>>>>> Values : 1 2 3 NA NA NA NA 2 3 2
>>>>>
>>>>> It seems the run of 4 NAs is not collapsed: each NA becomes its own
>>>>> run of length 1.
>>>>>
>>>>>> runsum(tmp, k = 2)
>>>>>
>>>>> 'numeric' Rle of length 15 with 11 runs
>>>>> Lengths: 1 2 1 1 1 1 1 1 1 4 1
>>>>> Values : 3 4 5 NA NA NA NA NA NA NA NA
>>>>>
>>>>> And there is no way to do runsum(..., na.rm = TRUE) like in sum (as
>>>>> far as I can see).
>>>>>
>>>>> Kasper
>>>>>
>>>>>> sessionInfo()
>>>>>
>>>>> R version 2.12.0 Under development (unstable) (2010-08-20 r52790)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
>>>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.iso885915
>>>>> [7] LC_PAPER=en_US.iso885915 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] grid stats graphics grDevices datasets utils methods
>>>>> [8] base
>>>>>
>>>>> other attached packages:
>>>>> [1] multicore_0.1-3 IRanges_1.7.19 matrixStats_0.2.1 R.methodsS3_1.2.0
>>>>> [5] ggplot2_0.8.8 proto_0.3-8 reshape_0.8.3 plyr_1.1
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] tools_2.12.0
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>
>