[Rd] [R] Why does R replace all row values with NAs

Tue Mar 3 23:09:59 CET 2015

On 3/3/15 1:26 PM, Hervé Pagès wrote:
>
>
> On 03/03/2015 02:28 AM, Martin Maechler wrote:
>> Diverted from R-help :
>> .... as it gets into musing about new R language "primitives"
>>
>>>>>>> William Dunlap <wdunlap at tibco.com>
>>>>>>>      on Fri, 27 Feb 2015 08:04:36 -0800 writes:
>>
>>      > You could define functions like
>>
>>      > is.true <- function(x) !is.na(x) & x
>>      > is.false <- function(x) !is.na(x) & !x
>>
>>      > and use them in your selections.  E.g.,
>>      >> x <- data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10))
>>      >> x[is.true(x$c >= 6), ]
>>      > a  b  c
>>      > 7   7  8  7
>>      > 10 10 11 10
>>
>>      > Bill Dunlap
>>      > TIBCO Software
>>      > wdunlap tibco.com
>>
>> Yes; the Matrix package has had these
>>
>> is0  <- function(x) !is.na(x) & x == 0
>> isN0 <- function(x)  is.na(x) | x != 0
>> is1  <- function(x) !is.na(x) & x   # also == "isTRUE componentwise"
>
> Note that using %in% to block propagation of NAs is about 2x faster:
>
>  > x <- sample(c(NA_integer_, 1:10000), 500000, replace=TRUE)
>  > microbenchmark(as.logical(x) %in% TRUE, !is.na(x) & x)
> Unit: milliseconds
>                      expr       min        lq      mean   median        uq
>   as.logical(x) %in% TRUE  6.034744  6.264382  6.999083  6.29488  6.346028
>             !is.na(x) & x 11.202808 11.402437 11.469101 11.44848 11.517576
>        max neval
>   40.36472   100
>   11.90916   100

Unfortunately %in% does not preserve matrix dimensions:

 > x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50)
 > dim(x)
[1] 50 10
 > dim(!is.na(x) & x)
[1] 50 10
 > dim(as.logical(x) %in% TRUE)
NULL

Stephanie

>
>
>
>>
>> namespace hidden for a while  [note the comment of the last one!]
>> and using them for readibility in its own code.
>>
>> Maybe we should (again) consider providing some versions of
>> these with R ?
>>
>> The Matrix package also has had fast
>>
>> allFalse <- all0 <- function(x) .Call(R_all0, x)
>> anyFalse <- any0 <- function(x) .Call(R_any0, x)
>> ##
>> ## anyFalse <- function(x) isTRUE(any(!x))         ## ~= any0
>> ## any0 <- function(x) isTRUE(any(x == 0))          ## ~= anyFalse
>>
>> namespace hidden as well, already, which probably could also be
>> brought to base R.
>>
>> One big reason to *not* go there (to internal C code) at all with R is
>> that
>> S3 and S4 dispatch for '==' ('!=', etc, the 'Compare' group generics)
>> and 'is.na() have been known and package writers have
>> programmed methods for these.
>> To ensure that S3 and S4 dispatch works "correctly" also inside
>> such new internals is much less easily achieved, and so
>> such a C-based internal function  is0() would no longer be
>> equivalent with    !is.na(x) & x == 0
>> as soon as 'x' is an "object" with a '==', 'Compare' and/or an is.na()
>> method.
>
> Excellent point. Thank you! It really makes a big difference for
> developers who maintain a complex hierarchy of S4 classes and methods,
> when functions like is.true, anyFalse, etc..., which can be expressed in
> terms of more basic operations like ==, !=, !, is.na, etc..., just work
> out-of-the-box on objects for which these basic operations are defined.
>
> There is conceptually a small set of "building blocks", at least for
> objects with a vector-like or list-like semantic, that can be used
> to formally describe the semantic of many functions in base R. This
> is what the man page for anyNA does by saying:
>
>    anyNA implements any(is.na(x))
>
> even though the actual implementation differs, but that's ok, as long
> as anyNA is equivalent to doing any(is.na(x)) on any object for which
> building block is.na() is implemented.
>
> Unfortunately there is no clearly identified set of building blocks
> in base R. For example, if I want the comparison operations to work
> on my object, I need to implement ==, >, <, !=, <=, and >= (the
> 'Compare' group generics) even though it should be enough to implement
> == and >=, because all the others can be described in terms of these
> 2 building blocks. unique/duplicated is another example (unique(x) is
> conceptually x[!duplicated(x)]). And so on...
>
> Cheers,
> H.
>
>>
>> OTOH, simple R versions such as your  'is.true',  called 'is1'
>> inside Matrix maybe optimizable a bit by the byte compiler (and
>> jit and other such tricks) and still keep the full
>> semantic including correct method dispatch.
>>
>> Martin Maechler, ETH Zurich
>>
>>
>>      > On Fri, Feb 27, 2015 at 7:27 AM, Dimitri Liakhovitski <
>>      > dimitri.liakhovitski at gmail.com> wrote:
>>
>>      >> Thank you very much, Duncan.
>>      >> All this being said:
>>      >>
>>      >> What would you say is the most elegant and most safe way to
>> solve such
>>      >> a seemingly simple task?
>>      >>
>>      >> Thank you!
>>      >>
>>      >> On Fri, Feb 27, 2015 at 10:02 AM, Duncan Murdoch
>>      >> <murdoch.duncan at gmail.com> wrote:
>>      >> > On 27/02/2015 9:49 AM, Dimitri Liakhovitski wrote:
>>      >> >> So, Duncan, do I understand you correctly:
>>      >> >>
>>      >> >> When I use x$x<6, R doesn't know if it's TRUE or FALSE, so
>> it returns
>>      >> >> a logical value of NA.
>>      >> >
>>      >> > Yes, when x$x is NA.  (Though I think you meant x$c.)
>>      >> >
>>      >> >> When this logical value is applied to a row, the R says:
>> hell, I don't
>>      >> >> know if I should keep it or not, so, just in case, I am
>> going to keep
>>      >> >> it, but I'll replace all the values in this row with NAs?
>>      >> >
>>      >> > Yes.  Indexing with a logical NA is probably a mistake, and
>> this is one
>>      >> > way to signal it without actually triggering a warning or
>> error.
>>      >> >
>>      >> > BTW, I should have mentioned that the example where you
>> indexed using
>>      >> > -which(x$c>=6) is a bad idea:  if none of the entries were 6
>> or more,
>>      >> > this would be indexing with an empty vector, and you'd get
>> nothing, not
>>      >> > everything.
>>      >> >
>>      >> > Duncan Murdoch
>>      >> >
>>      >> >
>>      >> >>
>>      >> >> On Fri, Feb 27, 2015 at 9:13 AM, Duncan Murdoch
>>      >> >> <murdoch.duncan at gmail.com> wrote:
>>      >> >>> On 27/02/2015 9:04 AM, Dimitri Liakhovitski wrote:
>>      >> >>>> I know how to get the output I need, but I would benefit
>> from an
>>      >> >>>> explanation why R behaves the way it does.
>>      >> >>>>
>>      >> >>>> # I have a data frame x:
>>      >> >>>> x = data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10))
>>      >> >>>> x
>>      >> >>>> # I want to toss rows in x that contain values >=6. But I
>> don't want
>>      >> >>>> to toss my NAs there.
>>      >> >>>>
>>      >> >>>> subset(x,c<6) # Works correctly, but removes NAs in c,
>> understand why
>>      >> >>>> x[which(x$c<6),] # Works correctly, but removes NAs in c,
>> understand
>>      >> why
>>      >> >>>> x[-which(x$c>=6),] # output I need
>>      >> >>>>
>>      >> >>>> # Here is my question: why does the following line
>> replace the values
>>      >> >>>> of all rows that contain an NA # in x$c with NAs?
>>      >> >>>>
>>      >> >>>> x[x$c<6,]  # Leaves rows with c=NA, but makes the whole
>> row an NA.
>>      >> Why???
>>      >> >>>> x[(x$c<6) | is.na(x$c),] # output I need - I have to be
>>      >> super-explicit
>>      >> >>>>
>>      >> >>>> Thank you very much!
>>      >> >>>
>>      >> >>> Most of your examples (except the ones using which()) are
>> doing logical
>>      >> >>> indexing.  In logical indexing, TRUE keeps a line, FALSE
>> drops the
>>      >> line,
>>      >> >>> and NA returns NA.  Since "x$c < 6" is NA if x$c is NA,
>> you get the
>>      >> >>> third kind of indexing.
>>      >> >>>
>>      >> >>> Your last example works because in the cases where x$c is
>> NA, it
>>      >> >>> evaluates NA | TRUE, and that evaluates to TRUE.  In the
>> cases where
>>      >> x$c
>>      >> >>> is not NA, you get x$c < 6 | FALSE, and that's the same as
>> x$c < 6,
>>      >> >>> which will be either TRUE or FALSE.
>>      >> >>>
>>      >> >>> Duncan Murdoch
>>      >> >>>
>>      >> >>
>>      >> >>
>>      >> >>
>>      >> >
>>      >>
>>      >>
>>      >>
>>      >> --
>>      >> Dimitri Liakhovitski
>>      >>
>>      >> ______________________________________________
>>      >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>      >> https://stat.ethz.ch/mailman/listinfo/r-help
>>      >> PLEASE do read the posting guide
>>      >> http://www.R-project.org/posting-guide.html
>>      >> and provide commented, minimal, self-contained, reproducible
>> code.
>>      >>
>>
>>      > [[alternative HTML version deleted]]
>>
>>      > ______________________________________________
>>      > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>      > https://stat.ethz.ch/mailman/listinfo/r-help
>>      > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>      > and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>