[Rd] NAs and rle
Gabriel Becker
g@bembecker @end|ng |rom gm@||@com
Wed Aug 26 07:57:10 CEST 2020
Hi All,
A twitter user, Mike fc (@coolbutuseless) mentioned today that he was
surprised that repeated NAs weren't treated as a run by the rle function.
Now I know why they are not. NAs represent values which could be the same
or different from each other if they were known, so from a purely conceptual
standpoint there is no way to tell whether they are the same and thus
constitute a run or not.
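
To make the current behavior concrete, here is a small illustration using only base R. The vector x is made up for the example; the results in the comments follow from how rle compares adjacent elements with !=, which yields NA whenever either side is NA.

```r
# Illustrative input: a run of 1s, three consecutive NAs, then a 2.
x <- c(1, 1, NA, NA, NA, 2)

r <- rle(x)

# Each NA ends up as its own run of length 1.
r$lengths  # 2 1 1 1 1
r$values   # 1 NA NA NA 2
```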
This conceptual strictness isn't universally observed, though, because we
get the following:
> unique(c(1, 2, 3, NA, NA, NA))
[1]  1  2  3 NA
Which means that rle(sort(x))$values is not guaranteed to be the same as
unique(x), which is a little strange (though likely of little practical
impact).
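
A quick check of that mismatch, as a sketch in base R. One detail worth keeping in mind: sort() drops NAs by default (na.last = NA), so the comparison is shown both with the default and with na.last = TRUE, which keeps the NAs at the end.

```r
x <- c(1, 2, 3, NA, NA, NA)

u <- unique(x)                            # 1 2 3 NA

# Default sort() removes the NAs before rle ever sees them.
v_default <- rle(sort(x))$values          # 1 2 3

# Keeping the NAs shows that rle treats each one as a separate run.
v_keep <- rle(sort(x, na.last = TRUE))$values   # 1 2 3 NA NA NA

identical(u, v_default)  # FALSE
identical(u, v_keep)     # FALSE
```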
Personally, it also seems to me that, from a purely data-compression
standpoint, it would be valid to collapse those missing values into a run
of missing, as it reduces size in-memory/on disk without losing any
information.
Now none of this is to say that I suggest the default behavior be changed
(that would surely disrupt some non-trivial amount of existing code) but
what do people think of a group.nas argument which defaults to FALSE
controlling the behavior?
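
As a rough sketch of what such an argument could do, here is a wrapper written outside of base R. The name rle_grouped is hypothetical and this is not a proposed patch to rle itself; it only illustrates the intended semantics under the assumption that "missing" means is.na(x), which also groups NaN with NA.

```r
# Hypothetical wrapper (not part of base R) illustrating a group.nas argument.
rle_grouped <- function(x, group.nas = FALSE) {
  n <- length(x)
  # Default and empty input: defer to the existing rle behavior.
  if (!group.nas || n == 0L) return(rle(x))

  na <- is.na(x)  # assumption: NaN counts as missing too
  # Adjacent elements start a new run if exactly one of them is missing,
  # or if both are known and unequal. The result never contains NA.
  differs <- (na[-1L] != na[-n]) |
    (!na[-1L] & !na[-n] & x[-1L] != x[-n])

  i <- c(which(differs), n)  # index of the last element of each run
  structure(list(lengths = diff(c(0L, i)), values = x[i]), class = "rle")
}

x <- c(1, 1, NA, NA, NA, 2)
g <- rle_grouped(x, group.nas = TRUE)
g$lengths  # 2 3 1
g$values   # 1 NA 2

# The grouped encoding still round-trips, so nothing is lost.
identical(inverse.rle(g), x)  # TRUE
```

With group.nas = FALSE the wrapper simply returns rle(x), so existing code would see no change.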
As a final point, there is some precedent here (though obviously not at all
binding), as Bioconductor's Rle functionality does group NAs.
Best,
~G