[Rd] [R] Why does R replace all row values with NAs
Stephanie M. Gogarten
sdmorris at u.washington.edu
Tue Mar 3 23:09:59 CET 2015
On 3/3/15 1:26 PM, Hervé Pagès wrote:
>
>
> On 03/03/2015 02:28 AM, Martin Maechler wrote:
>> Diverted from R-help :
>> .... as it gets into musing about new R language "primitives"
>>
>>>>>>> William Dunlap <wdunlap at tibco.com>
>>>>>>> on Fri, 27 Feb 2015 08:04:36 -0800 writes:
>>
>> > You could define functions like
>>
>> > is.true <- function(x) !is.na(x) & x
>> > is.false <- function(x) !is.na(x) & !x
>>
>> > and use them in your selections. E.g.,
>> >> x <- data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10))
>> >> x[is.true(x$c >= 6), ]
>> > a b c
>> > 7 7 8 7
>> > 10 10 11 10
>>
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>>
>> Yes; the Matrix package has had these
>>
>> is0 <- function(x) !is.na(x) & x == 0
>> isN0 <- function(x) is.na(x) | x != 0
>> is1 <- function(x) !is.na(x) & x # also == "isTRUE componentwise"
>
> Note that using %in% to block propagation of NAs is about 2x faster:
>
> > x <- sample(c(NA_integer_, 1:10000), 500000, replace=TRUE)
> > microbenchmark(as.logical(x) %in% TRUE, !is.na(x) & x)
> Unit: milliseconds
> expr min lq mean median uq
> as.logical(x) %in% TRUE 6.034744 6.264382 6.999083 6.29488 6.346028
> !is.na(x) & x 11.202808 11.402437 11.469101 11.44848 11.517576
> max neval
> 40.36472 100
> 11.90916 100
Unfortunately %in% does not preserve matrix dimensions:
> x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50)
> dim(x)
[1] 50 10
> dim(!is.na(x) & x)
[1] 50 10
> dim(as.logical(x) %in% TRUE)
NULL
Stephanie
>
>
>
>>
>> namespace hidden for a while [note the comment of the last one!]
>> and using them for readibility in its own code.
>>
>> Maybe we should (again) consider providing some versions of
>> these with R ?
>>
>> The Matrix package also has had fast
>>
>> allFalse <- all0 <- function(x) .Call(R_all0, x)
>> anyFalse <- any0 <- function(x) .Call(R_any0, x)
>> ##
>> ## anyFalse <- function(x) isTRUE(any(!x)) ## ~= any0
>> ## any0 <- function(x) isTRUE(any(x == 0)) ## ~= anyFalse
>>
>> namespace hidden as well, already, which probably could also be
>> brought to base R.
>>
>> One big reason to *not* go there (to internal C code) at all with R is
>> that
>> S3 and S4 dispatch for '==' ('!=', etc, the 'Compare' group generics)
>> and 'is.na() have been known and package writers have
>> programmed methods for these.
>> To ensure that S3 and S4 dispatch works "correctly" also inside
>> such new internals is much less easily achieved, and so
>> such a C-based internal function is0() would no longer be
>> equivalent with !is.na(x) & x == 0
>> as soon as 'x' is an "object" with a '==', 'Compare' and/or an is.na()
>> method.
>
> Excellent point. Thank you! It really makes a big difference for
> developers who maintain a complex hierarchy of S4 classes and methods,
> when functions like is.true, anyFalse, etc..., which can be expressed in
> terms of more basic operations like ==, !=, !, is.na, etc..., just work
> out-of-the-box on objects for which these basic operations are defined.
>
> There is conceptually a small set of "building blocks", at least for
> objects with a vector-like or list-like semantic, that can be used
> to formally describe the semantic of many functions in base R. This
> is what the man page for anyNA does by saying:
>
> anyNA implements any(is.na(x))
>
> even though the actual implementation differs, but that's ok, as long
> as anyNA is equivalent to doing any(is.na(x)) on any object for which
> building block is.na() is implemented.
>
> Unfortunately there is no clearly identified set of building blocks
> in base R. For example, if I want the comparison operations to work
> on my object, I need to implement ==, >, <, !=, <=, and >= (the
> 'Compare' group generics) even though it should be enough to implement
> == and >=, because all the others can be described in terms of these
> 2 building blocks. unique/duplicated is another example (unique(x) is
> conceptually x[!duplicated(x)]). And so on...
>
> Cheers,
> H.
>
>>
>> OTOH, simple R versions such as your 'is.true', called 'is1'
>> inside Matrix maybe optimizable a bit by the byte compiler (and
>> jit and other such tricks) and still keep the full
>> semantic including correct method dispatch.
>>
>> Martin Maechler, ETH Zurich
>>
>>
>> > On Fri, Feb 27, 2015 at 7:27 AM, Dimitri Liakhovitski <
>> > dimitri.liakhovitski at gmail.com> wrote:
>>
>> >> Thank you very much, Duncan.
>> >> All this being said:
>> >>
>> >> What would you say is the most elegant and most safe way to
>> solve such
>> >> a seemingly simple task?
>> >>
>> >> Thank you!
>> >>
>> >> On Fri, Feb 27, 2015 at 10:02 AM, Duncan Murdoch
>> >> <murdoch.duncan at gmail.com> wrote:
>> >> > On 27/02/2015 9:49 AM, Dimitri Liakhovitski wrote:
>> >> >> So, Duncan, do I understand you correctly:
>> >> >>
>> >> >> When I use x$x<6, R doesn't know if it's TRUE or FALSE, so
>> it returns
>> >> >> a logical value of NA.
>> >> >
>> >> > Yes, when x$x is NA. (Though I think you meant x$c.)
>> >> >
>> >> >> When this logical value is applied to a row, the R says:
>> hell, I don't
>> >> >> know if I should keep it or not, so, just in case, I am
>> going to keep
>> >> >> it, but I'll replace all the values in this row with NAs?
>> >> >
>> >> > Yes. Indexing with a logical NA is probably a mistake, and
>> this is one
>> >> > way to signal it without actually triggering a warning or
>> error.
>> >> >
>> >> > BTW, I should have mentioned that the example where you
>> indexed using
>> >> > -which(x$c>=6) is a bad idea: if none of the entries were 6
>> or more,
>> >> > this would be indexing with an empty vector, and you'd get
>> nothing, not
>> >> > everything.
>> >> >
>> >> > Duncan Murdoch
>> >> >
>> >> >
>> >> >>
>> >> >> On Fri, Feb 27, 2015 at 9:13 AM, Duncan Murdoch
>> >> >> <murdoch.duncan at gmail.com> wrote:
>> >> >>> On 27/02/2015 9:04 AM, Dimitri Liakhovitski wrote:
>> >> >>>> I know how to get the output I need, but I would benefit
>> from an
>> >> >>>> explanation why R behaves the way it does.
>> >> >>>>
>> >> >>>> # I have a data frame x:
>> >> >>>> x = data.frame(a=1:10,b=2:11,c=c(1,NA,3,NA,5,NA,7,NA,NA,10))
>> >> >>>> x
>> >> >>>> # I want to toss rows in x that contain values >=6. But I
>> don't want
>> >> >>>> to toss my NAs there.
>> >> >>>>
>> >> >>>> subset(x,c<6) # Works correctly, but removes NAs in c,
>> understand why
>> >> >>>> x[which(x$c<6),] # Works correctly, but removes NAs in c,
>> understand
>> >> why
>> >> >>>> x[-which(x$c>=6),] # output I need
>> >> >>>>
>> >> >>>> # Here is my question: why does the following line
>> replace the values
>> >> >>>> of all rows that contain an NA # in x$c with NAs?
>> >> >>>>
>> >> >>>> x[x$c<6,] # Leaves rows with c=NA, but makes the whole
>> row an NA.
>> >> Why???
>> >> >>>> x[(x$c<6) | is.na(x$c),] # output I need - I have to be
>> >> super-explicit
>> >> >>>>
>> >> >>>> Thank you very much!
>> >> >>>
>> >> >>> Most of your examples (except the ones using which()) are
>> doing logical
>> >> >>> indexing. In logical indexing, TRUE keeps a line, FALSE
>> drops the
>> >> line,
>> >> >>> and NA returns NA. Since "x$c < 6" is NA if x$c is NA,
>> you get the
>> >> >>> third kind of indexing.
>> >> >>>
>> >> >>> Your last example works because in the cases where x$c is
>> NA, it
>> >> >>> evaluates NA | TRUE, and that evaluates to TRUE. In the
>> cases where
>> >> x$c
>> >> >>> is not NA, you get x$c < 6 | FALSE, and that's the same as
>> x$c < 6,
>> >> >>> which will be either TRUE or FALSE.
>> >> >>>
>> >> >>> Duncan Murdoch
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Dimitri Liakhovitski
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible
>> code.
>> >>
>>
>> > [[alternative HTML version deleted]]
>>
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
More information about the R-devel
mailing list