[R] Variable passed to function not used in function in select=... in subset

Wacek Kusnierczyk Waclaw.Marcin.Kusnierczyk at idi.ntnu.no
Mon Nov 10 20:04:45 CET 2008


pardon me, but does this address in any way the legitimate complaint of
the rightfully confused user?

consider the following:

d = data.frame(a=1, b=2)
a = c("a", "b")
z = a
# that is, both a and z are c("a", "b")

subset(d, select=z)
# gives two columns, since z is a two element vector whose elements are
# valid column names

subset(d, select=a)
# gives one column, since 'a' (but not a) is a valid column name

subset(d, select=c(a,b))
# gives two columns
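
for the record, the three results above can be reproduced by hand.  as
far as i can tell, subset.data.frame evaluates the select expression in
a list that maps column names to column positions, with the caller's
frame as the fallback.  the following is a rough sketch of that lookup,
not the actual source; the names nl, here, sel_z, sel_a and sel_ab are
mine:

```r
d <- data.frame(a = 1, b = 2)
a <- c("a", "b")
z <- a

# map each column name to its position: list(a = 1L, b = 2L)
nl <- as.list(seq_along(d))
names(nl) <- names(d)

# the frame we are in, used as the fallback for names that are not columns
here <- environment()

# z is not a column name, so the lookup falls through to the variable z
sel_z <- eval(quote(z), nl, here)

# a is a column name, so the variable a is shadowed by the position of column a
sel_a <- eval(quote(a), nl, here)

# both a and b are column names, so this is a vector of two positions
sel_ab <- eval(quote(c(a, b)), nl, here)
```

so the same identifier is resolved as a column or as a variable
depending on what the data frame happens to contain.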


this is certainly what the authors intended, and they may have good
grounds for this smart design.  but this must break the expectation of a
naive (r-naive, for that matter) user, who may otherwise have excellent
experience in using a functional programming language, e.g., scheme. 
(especially scheme, where symbols and expressions are first-class
objects, yet the distinction between a symbol or an expression and their
referent is made painfully clear, perhaps except for when one hacks with
macros.)

the examples above illustrate the notorious problem with r that one can
never tell whether 'a' means "the value referred to with the identifier
'a'" or "the symbol 'a'", unless one gets ugly surprises and is forced
to study the documentation.  and even then one may not get a clear answer.

the example given by the confused user is a red flag warning.  it's a
typical abstraction where a nested sequence of operations (here print
over names over subset) is abstracted into a single procedure, which can
be called with whatever arguments are valid:

pns = function(d, g) print(names(subset(d, select=g)))

what sane person, without carefully studying the gory details of subset,
would ever expect that if the first argument happens to have a column
named 'g', only this one will be selected, while if it doesn't, subset
will select the columns named by the components of what 'g' evaluates
to?  i wonder how many users have *not* noticed that what they get is
not what they assume they get because of such tricky tricks, and in
consequence were not able to publish their analyses (or worse, have
published them). 
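
to make the difference concrete, here is a minimal sketch that runs pns
on two data frames that differ only in whether a column is named 'g',
next to a variant that indexes with the value of g directly (the same
idea as the DF[group] suggestion quoted below).  the names pns2, d1 and
d2 are made up for illustration:

```r
# the abstraction discussed above: select is resolved inside subset
pns <- function(d, g) print(names(subset(d, select = g)))

# a variant that always treats g as a value (assumes g holds column names)
pns2 <- function(d, g) print(names(d[, g, drop = FALSE]))

d1 <- data.frame(g = 1, h = 2)   # has a column named 'g'
d2 <- data.frame(x = 1, h = 2)   # has no column named 'g'

# print returns its argument invisibly, so the results can be captured
r1 <- pns(d1, c("g", "h"))    # only column 'g' is selected
r2 <- pns(d2, c("x", "h"))    # both named columns are selected
r3 <- pns2(d1, c("g", "h"))   # both named columns are selected
```

indexing with [ avoids the surprise because it never looks up g among
the column names.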

what is scary is that this may happen with just about any other function in
r, because the design is pervasive.  no one should ever use any r
function without first carefully reading the docs (which is not
guaranteed to help) or trying it first on a number of carefully crafted
test cases.  if such care is not taken, results obtained with r cannot
be taken seriously.


vQ


Gabor Grothendieck wrote:
> Forgot the name part.  Try:
>
> TestFunc2 <- function(DF, group) names(DF[group])
> TestFunc3 <- function(...) names(subset(..., subset = TRUE))
> TestFunc4 <- function(...) eval.parent(names(subset(..., subset = TRUE)))
>
> # e.g.
> df1 <- data.frame(group = "G1", visit = "V1", value = 0.9)
> TestFunc2(df1, c("group", "visit"))
> TestFunc3(df1, c("group", "visit"))
> TestFunc4(df1, c("group", "visit"))
> TestFunc4(df1, c(group, visit)) # this works too
>
> On Mon, Nov 10, 2008 at 10:43 AM, Gabor Grothendieck
> <ggrothendieck at gmail.com> wrote:
>   
>> Here are a few things to try:
>>
>> TestFunc1 <- get("[")
>>
>> TestFunc2 <- function(DF, group) DF[group]
>>
>> TestFunc3 <- function(...) subset(..., subset = TRUE)
>>
>>
>>
>> On Mon, Nov 10, 2008 at 10:18 AM, Karl Knoblick <karlknoblich at yahoo.de> wrote:
>>     
>>> Hello!
>>>
>>> I have the problem that in my function the passed variable is not used, but the variable name of the dataframe itself - difficult to explain, but an easy example:
>>>
>>> TestFunc<-function(df, group) {
>>>     print(names(subset(df, select=group)))
>>> }
>>> df1<-data.frame(group="G1", visit="V1", value=0.9)
>>> TestFunc(df1, c("group", "visit"))
>>>
>>> Result:
>>> [1] "group"
>>>
>>> But I expected and want to have [1] "group" "visit" as result! Does anybody know how to get this result?
>>>
>>> Thanks!
>>> Karl
>>>


