[R-pkg-devel] Absent variables and tibble

Duncan Murdoch murdoch.duncan at gmail.com
Tue Jun 28 16:55:38 CEST 2016


On 28/06/2016 10:03 AM, William Dunlap wrote:
> Currently exists("someName", where=someDataFrame) reports whether
> "someName" is a column of the data.frame 'someDataFrame', and the
> 'where=' may be omitted.  If we have an environment we use
> exists("someName", envir=someEnvironment).  It might be nice to
> continue using exists() instead of introducing a new function has(),
> although, since we want the same syntax to work for environments,
> data.frames, tbl_dfs, data.tables, etc., we may need the new function.

One issue with exists("someName", someDataFrame) is that it's quite a 
bit slower than checking names(someDataFrame).  (I think it converts 
the data frame to an environment.) On the other hand, getting the names 
from an environment requires more work than checking for one, so 
exists("someName", someEnvironment) is faster than checking for the 
name in the obvious way.  The slow operations could be sped up, but is 
that worth the effort?
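
A rough way to check these timing claims is sketched below. The objects 
'df' and 'env' and the repetition count are assumptions made up for 
illustration, and ls() is used here as one "obvious way" to get names 
from an environment; the timings will differ by machine and R version.

```r
# Illustrative data; 'df' and 'env' are assumed names, not from the thread.
df  <- data.frame(xyz = 1, abc = 2)
env <- as.environment(df)   # a data frame is a list, so this yields an environment

# Four ways to ask "is there something called xyz?"
checks <- list(
  exists_df  = function() exists("xyz", where = df,  inherits = FALSE),
  exists_env = function() exists("xyz", envir = env, inherits = FALSE),
  names_df   = function() "xyz" %in% names(df),
  names_env  = function() "xyz" %in% ls(env)
)

# All four should agree on the answer.
answers <- vapply(checks, function(f) f(), logical(1))

# Rough elapsed times for n repetitions of each check.
n <- 10000
elapsed <- vapply(
  checks,
  function(f) system.time(for (i in seq_len(n)) f())[["elapsed"]],
  numeric(1)
)

print(answers)
print(elapsed)
```
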

The other issue with exists() is that it has a complicated definition 
and a hard-to-follow argument list (with args "where", "envir", "frame" 
that all do related things); the thing I like about hasName() is that 
it is very clear what it does.  A criticism of it is that it is hardly 
any shorter than just doing

   name %in% names(x)

so is there really any point in making a function for this?
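
For comparison, here is a minimal sketch of the exact-match test next to 
the is.null() idiom discussed in the quoted messages. The function name 
has_name_sketch and the example objects are hypothetical, chosen only 
for illustration; the body is the one-liner quoted in this thread.

```r
# Sketch of the definition under discussion; the name has_name_sketch is
# hypothetical, chosen to avoid implying this is the final base R function.
has_name_sketch <- function(x, name) name %in% names(x)

df <- data.frame(xyz = 1)

# `$` on a data frame partially matches, so "x" finds column "xyz".
dollar_says_present <- !is.null(df$x)               # TRUE
exact_says_present  <- has_name_sketch(df, "x")     # FALSE

# A list element that exists but holds NULL is also misreported by is.null().
lst <- list(a = NULL)
null_test_says_missing <- is.null(lst$a)            # TRUE
exact_says_a_present   <- has_name_sketch(lst, "a") # TRUE
```

The two idioms disagree in exactly the cases where the is.null() test 
is misleading: partial matches and elements whose value is NULL.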

Duncan Murdoch

>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Tue, Jun 28, 2016 at 4:08 AM, Duncan Murdoch 
> <murdoch.duncan at gmail.com> wrote:
>
>     On 27/06/2016 10:15 PM, Lenth, Russell V wrote:
>
>         Hadley's note on partial matching has me scared the most
>         concerning the is.null() coding. So the need for a hasName()
>         (or whatever) function seems all the more compelling, and that
>         it be in base R. Perhaps it should be generic, with a default
>         method that searches in the names attribute, potentially
>         extensible to other classes.
>
>
>     I am thinking of putting it in, but if I do the definition will be
>     equivalent to the one-liner down below.  That's already slower
>     than the is.null() test; making it generic would slow it down too
>     much.
>
>     Duncan Murdoch
>
>
>         Thanks so much, several of you, for your positive and helpful
>         responses.
>
>         Russ
>
>         -----Original Message-----
>         From: Duncan Murdoch [mailto:murdoch.duncan at gmail.com]
>         Sent: Monday, June 27, 2016 12:50 PM
>         To: Hadley Wickham <h.wickham at gmail.com>; Lenth, Russell V
>         <russell-lenth at uiowa.edu>
>         Cc: r-package-devel at r-project.org
>         Subject: Re: [R-pkg-devel] Absent variables and tibble
>
>         On 27/06/2016 1:09 PM, Hadley Wickham wrote:
>
>             The other thing you need to be aware of if you're using
>             the other approach is partial matching:
>
>             df <- data.frame(xyz = 1)
>             is.null(df$x)
>             #> [1] FALSE
>
>             Duncan - I think that argues for including a has_name()
>             (hasName()?) function in base R. Is that something you'd
>             consider?
>
>
>         Yes, I'd consider it.  I think hasName() would be more
>         consistent with other has*() functions in the R sources.
>
>         I guess the implementation should be defined to be equivalent to
>
>         hasName <- function(x, name)
>            name %in% names(x)
>
>         though it would make sense to make a faster internal
>         implementation;
>         !is.null(df$x) is quite a bit faster than "x" %in% names(df).
>
>         Duncan Murdoch
>
>
>
>             Hadley
>
>             On Mon, Jun 27, 2016 at 10:05 AM, Lenth, Russell V
>             <russell-lenth at uiowa.edu> wrote:
>
>                 Thanks, Hadley. I do understand why you'd want more
>                 careful checking.
>
>                 If you're going to provide a variable-existing
>                 function, may I suggest a short name like 'has'? I.e.,
>                 has(x, var) returns TRUE if x has var in it.
>
>                 Thanks
>
>                 Russ
>
>                     On Jun 27, 2016, at 9:47 AM, Hadley Wickham
>                     <h.wickham at gmail.com> wrote:
>
>                     On Mon, Jun 27, 2016 at 9:03 AM, Duncan Murdoch
>                     <murdoch.duncan at gmail.com> wrote:
>
>                         On 27/06/2016 9:22 AM, Lenth, Russell V wrote:
>
>
>                             My package 'lsmeans' is now suddenly
>                             broken because of a new
>                             provision in the 'tibble' package (loaded
>                             by 'dplyr' 0.5.0), whereby the "[[" and "$"
>                             methods for 'tbl_df' objects - as
>                             documented - throw an error if
>                             a variable is not found.
>
>                             The problem is that my code uses tests
>                             like this:
>
>                                    if (is.null (x$var)) {...}
>
>                             to see whether 'x' has a variable 'var'.
>                             Obviously, I can work
>                             around this using
>
>                                    if (!("var" %in% names(x))) {...}
>
>                             but (a) I like the first version better,
>                             in terms of the code
>                             being understandable; and (b) isn't there
>                             a long history whereby
>                             we can expect a NULL result when accessing
>                             an absent member of a
>                             list (and hence a data.frame)? (c) the
>                             code base for 'lsmeans'
>                             has about 50 instances of such tests.
>
>                             Anyway, I wonder if a lot of other package
>                             developers test for
>                             absent variables in that first way; if so,
>                             they too are in for a
>                             rude awakening if their users provide a
>                             tbl_df instead of a
>                             data.frame. And what is considered the
>                             best practice for testing
>                             absence of a list member? Apparently, not
>                             either of the above;
>                             and because of (c), I want to do these
>                             many tedious corrections only once.
>
>                             Thanks for any light you can shed.
>
>
>
>                         This is why CRAN asks that people test reverse
>                         dependencies.
>
>
>                     Which we did do - the problem is that this is
>                     actually caused by a
>                     recursive reverse dependency (lsmeans -> dplyr ->
>                     tibble), and we
>                     didn't correctly anticipate how much pain this
>                     would cause.
>
>                         I think the most defensive thing you can do is
>                         to write a small
>                         function
>
>                         name_missing <- function(x, name)
>                            !(name %in% names(x))
>
>                         and use name_missing(x, "var") in your tests.
>                         (Pick your own name
>                         to make your code understandable if you don't
>                         like my choice.)
>
>                         You could suggest to the tibble maintainers
>                         that they add a
>                         function like this.
>
>
>                     We're definitely going to add this.
>
>                     And I think we'll make df[["var"]] return NULL
>                     too, so at least
>                     there's one easy way to opt out.
>
>                     The motivation for this change was that returning
>                     NULL + recycling
>                     rules means it's very easy for errors to silently
>                     propagate. But I
>                     think this approach might be somewhat too
>                     aggressive - I hadn't
>                     considered that people use `is.null()` to check
>                     for missing columns.
>
>                     We'll try and get an update to tibble out soon
>                     after useR.
>                     Thoughts on what we should do are greatly appreciated.
>
>                     Hadley
>
>                     --
>                     http://hadley.nz
>
>
>
>
>
>
>


