[BioC] IRanges::Rle and missing values
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Sat Aug 21 21:47:59 CEST 2010
Thanks a lot for the fix.
Some background: I have data associated with (very small) genomic
locations, irregularly spaced, and I wanted to use the runmean
functionality.
Now, for the standard example of Rles, coverage across the genome,
"missing" data is equal to a coverage of zero. But in my case, zero
is a perfectly fine data value and is quite different from NA, which
indicates no data. I would like to calculate running means with a
fixed window size (and hence a different number of data points in
each window, since they are irregularly spaced), so I could not use
the runmean function with missing values filled in as zero: the
filled-in zeroes would be counted in the denominator.
I found a solution to my specific problem, which uses the fact that
my problem with the running mean is really about using the right
denominator. I create two Rles, one holding the data values with
zeroes where data is missing, and one holding 0 and 1 (1 indicating
that there is data). The "right" running mean is then the ratio
between the two running sums.
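A minimal sketch of this workaround, written in base R on plain
vectors so that it does not depend on any particular IRanges version.
The helper run_sum() and the variable names are made up for
illustration; with Rle objects the two sums would presumably come from
runsum() instead.

```r
# Hypothetical helper (not part of IRanges): sum over each window of k
# consecutive positions, dropping incomplete windows at the ends.
run_sum <- function(x, k) {
  cs <- cumsum(c(0, x))
  cs[(k + 1):length(cs)] - cs[1:(length(cs) - k)]
}

x <- c(1, 2, 2, 2, 3, NA, NA, NA, NA, 2, 3, 3, 3, 3, 3, 2)

filled   <- ifelse(is.na(x), 0, x)   # data values, NA replaced by 0
has_data <- as.numeric(!is.na(x))    # 1 where there is data, 0 otherwise

num <- run_sum(filled, 3)            # running sum of the data
den <- run_sum(has_data, 3)          # number of data points per window

# Windows without any data get NA rather than 0/0.
run_mean <- ifelse(den > 0, num / den, NA_real_)
```

Windows that contain only missing positions end up as NA, and every
other window is averaged over the data points it actually contains.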
Since NAs are allowed, I think it makes a lot of sense to support them
in the run* suite of functions, but it is not extremely urgent for me
since I found a workaround.
Thanks for the help,
Kasper
On Fri, Aug 20, 2010 at 8:03 PM, Patrick Aboyoun <paboyoun at fhcrc.org> wrote:
> Kasper,
> I have addressed these two issues, which were caused by inappropriate
> comparisons using NA_REAL at the C-level for 'numeric' Rle objects. As with
> the runmed function in the stats package, I don't currently support missing
> values in the run* methods for Rle objects. Below is the current behavior in
> IRanges 1.6.15 (BioC 2.6, R-2.11) and IRanges 1.7.21 (BioC 2.7, R-devel). I
> can add support for missing values. Just so I prioritize this, when do you
> encounter missing values in your Rle vectors?
>
>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>
>> tmp
> 'numeric' Rle of length 16 with 7 runs
> Lengths: 1 3 1 4 1 5 1
> Values : 1 2 3 NA 2 3 2
>
>> runsum(tmp, 3)
> Error in runsum(tmp, 3) : some values are NA, NaN, +/-Inf
>
>> sessionInfo()
> R version 2.12.0 Under development (unstable) (2010-08-01 r52659)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] IRanges_1.7.21
>
>
>
> Patrick
>
>
> On 8/20/10 9:43 AM, Patrick Aboyoun wrote:
>>
>> Kasper,
>> I'll take a look into this. The Rle constructor issue seems to be isolated
>> to 'numeric' and 'complex' Rles. I'll have an update out soon.
>>
>>
>> Patrick
>>
>>
>> On 8/20/10 8:53 AM, Kasper Daniel Hansen wrote:
>>>
>>> Would it make sense to allow missing values in Rle objects and also to
>>> incorporate removal of missing values in running summaries (and
>>> possibly other functions)?
>>>
>>> Example:
>>>
>>>> tmp = Rle(c(1,2,2,2,3,NA,NA,NA,NA,2,3,3,3,3,3,2))
>>>> tmp
>>>
>>> 'numeric' Rle of length 16 with 10 runs
>>> Lengths: 1 3 1 1 1 1 1 1 5 1
>>> Values : 1 2 3 NA NA NA NA 2 3 2
>>>
>>> Seems like the run of 4 NA's is treated differently
>>>
>>>> runsum(tmp, k = 2)
>>>
>>> 'numeric' Rle of length 15 with 11 runs
>>> Lengths: 1 2 1 1 1 1 1 1 1 4 1
>>> Values : 3 4 5 NA NA NA NA NA NA NA NA
>>>
>>> And there is no way to do runsum(..., na.rm = TRUE) like in sum (as
>>> far as I can see).
>>>
>>> Kasper
>>>
>>>> sessionInfo()
>>>
>>> R version 2.12.0 Under development (unstable) (2010-08-20 r52790)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
>>> [5] LC_MONETARY=C LC_MESSAGES=en_US.iso885915
>>> [7] LC_PAPER=en_US.iso885915 LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] grid stats graphics grDevices datasets utils methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] multicore_0.1-3 IRanges_1.7.19 matrixStats_0.2.1 R.methodsS3_1.2.0
>>> [5] ggplot2_0.8.8 proto_0.3-8 reshape_0.8.3 plyr_1.1
>>>
>>> loaded via a namespace (and not attached):
>>> [1] tools_2.12.0
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>