[Rd] problem using model.frame()

Gavin Simpson gavin.simpson at ucl.ac.uk
Thu Aug 18 00:49:26 CEST 2005


On Wed, 2005-08-17 at 20:24 +0200, Martin Maechler wrote:
> >>>>> "GS" == Gavin Simpson <gavin.simpson at ucl.ac.uk>
> >>>>>     on Tue, 16 Aug 2005 18:44:23 +0100 writes:
> 
>     GS> On Tue, 2005-08-16 at 12:35 -0400, Gabor Grothendieck
>     GS> wrote:
>     >> On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
>     >> > On Tue, 2005-08-16 at 11:25 -0400, Gabor Grothendieck wrote:
>     >> > > It can handle data frames like this:
>     >> > >
>     >> > >   model.frame(y1)
>     >> > > or
>     >> > >   model.frame(~., y1)
>     >> >
>     >> > Thanks Gabor,
>     >> >
>     >> > Yes, I know that works, but I want the function coca.formula
>     >> > to accept a formula like this y2 ~ y1, with both y1 and y2
>     >> > being data frames. It is
>     >> 
>     >> The expressions I gave work generally (i.e. lm, glm,
>     >> ...), not just in model.matrix, so would it be ok if the
>     >> user just does this?
>     >> 
>     >> yourfunction(y2 ~., y1)
> 
>     GS> Thanks again Gabor for your comments,
> 
>     GS> I'd prefer the y1 ~ y2 as data frames - as this is the
>     GS> most natural way of doing things. I'd like to have (y2
>     GS> ~., y1) as well, and (y2 ~ spp1 + spp2 + spp3, y1) also
>     GS> work - silently without any trouble.
> 
> I'm sorry, Gavin, I tend to disagree quite a bit.
> 
> The formula notation has quite a history in the S language, and
> AFAIK never was the idea to use data.frames as formula
> components, but rather as "environments" in which formula
> components are looked up --- exactly as Gabor has explained.

Hi Martin, thanks for your comments,

But then one could have a matrix of variables on the rhs of the formula
and it would work - whether this is a documented feature or an unintended
side-effect of matrices being stored as vectors with dims, I don't know.

And whilst the formula may have a long history, a number of packages
have extended the interface to implement specific features, and those
extensions don't work with standard functions like lm, glm and friends.
I don't see how what I wanted to achieve is greatly different from that,
or from using a matrix.
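
To make the matrix versus data frame difference concrete, here is a
minimal sketch using the example data from my original post. The exact
wording of the error may differ between R versions, so the sketch only
checks that an error is raised.

```r
# Example data from the original post: 5 sites by 4 species
y1 <- as.data.frame(matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
                           nrow = 5, byrow = TRUE))
rownames(y1) <- paste("site", 1:5, sep = "")
colnames(y1) <- paste("spp", 1:4, sep = "")

# A data frame used as a formula term is rejected by model.frame();
# capture the error instead of stopping
df.attempt <- tryCatch(model.frame(~ y1), error = function(e) e)
inherits(df.attempt, "error")

# The same data as a matrix is accepted as a single matrix-valued term
temp <- as.matrix(y1)
mf <- model.frame(~ temp)
dim(mf$temp)
```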

> To break with such a deeply rooted principle, 
> you should have very very good reasons, because you're breaking
> the concepts on which all other uses of formulae are based.
> And this would potentially lead to much confusion of your users,
> at least in the way they should learn to think about what
> formulae mean.

In the end I managed to treat y1 ~ y2 (both data frames) as a special
case, which allows the existing formula notation to work as well, so I
can use y1 ~ y2, y1 ~ ., data = y2, or y1 ~ var1 + var2, data = y2. This
is what I wanted all along: to extend my own interface (not to change
anything in R's formulae) while still working in the traditional sense.
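
Roughly, the special-casing looks like the sketch below. This is not
the actual coca code: parseRhs() and the toy data frames are made up
for illustration, and it assumes a two-sided formula. The idea is to
evaluate the rhs, return it directly if it is a data frame, and
otherwise hand the rhs to model.frame() as usual.

```r
# Hypothetical helper (not the actual coca code): return the rhs of a
# two-sided formula as a data frame.
parseRhs <- function(formula, data = environment(formula)) {
    rhs <- formula[[3]]
    # Special case: the rhs is a single name (not ".") that evaluates to
    # a data frame, e.g. y1 ~ y2, so return that data frame as it is
    if (is.name(rhs) && !identical(rhs, as.name("."))) {
        val <- tryCatch(eval(rhs, data, environment(formula)),
                        error = function(e) NULL)
        if (is.data.frame(val))
            return(val)
    }
    # Otherwise drop the response and let model.frame() handle the rhs,
    # which covers y1 ~ ., data = y2 and y1 ~ spp1 + spp2, data = y2
    rhs.only <- formula
    rhs.only[[2]] <- NULL
    model.frame(rhs.only, data = data, na.action = na.fail)
}

# Toy data, made up for this example
y1 <- data.frame(a = c(1, 0, 2), b = c(0, 1, 1))
y2 <- data.frame(spp1 = c(3, 0, 0), spp2 = c(1, 1, 0), spp3 = c(0, 1, 1))

names(parseRhs(y1 ~ y2))
names(parseRhs(y1 ~ ., data = y2))
names(parseRhs(y1 ~ spp1 + spp2, data = y2))
```

Only a bare name on the rhs is evaluated up front, so anything more
elaborate still goes through model.frame() untouched.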

The model I am writing code for really is modelling the relationship
between two matrices of data. In one version of the method, there is
real equivalence between both sides of the formula so it would seem odd
to treat the two sides of the formula differently. At least to me ;-)

> Martin
> 
> 
>     >> If it really is important to do it the way you describe,
>     >> are the data frames necessarily numeric? If so you could
>     >> preprocess your formula by placing as.matrix around all
>     >> the variables representing data frames using something
>     >> like this:
>     >> 
>     >> https://www.stat.math.ethz.ch/pipermail/r-help/2004-December/061485.html
> 
>     GS> Yes, they are numeric matrices (as data frames). I've
>     GS> looked at this, but I'd prefer to not have to do too
>     GS> much messing with the formula.
> 
>     >> Of course, if they are necessarily numeric maybe they can
>     >> be matrices in the first place?
> 
>     GS> Because read.table etc. produce data.frames and this is
>     GS> the natural way to work with data in R.
> 
> but it is also slightly inefficient if they are numeric.
> There are places for data frames and for matrices.

I agree - and in the code I've written, y1 and y2 quickly get coerced to
matrices before the real number crunching begins.
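
As a sketch of that coercion step (asNumericMatrix() is a made-up name,
not something in my package), the check could be as simple as:

```r
# Hypothetical helper: coerce a data frame argument to a numeric matrix,
# failing early if any column is non-numeric
asNumericMatrix <- function(x) {
    if (is.data.frame(x)) {
        if (!all(vapply(x, is.numeric, logical(1))))
            stop("all columns must be numeric")
        x <- as.matrix(x)
    }
    stopifnot(is.matrix(x), is.numeric(x))
    x
}

m <- asNumericMatrix(data.frame(a = 1:3, b = c(0.5, 1, 2)))
dim(m)
```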

However, all the other R modelling functions I have used work with
data.frames. Arguably, it could cause more confusion to write a function
that looked, walked and quacked like an R modelling function but needed
the user to apply an extra step to use - a step not usually required
under normal R usage.

All the best,

Gav

> Why should it be a problem to use 
>     M <- as.matrix(read.table(..))
> ?
> 
> For large files, it could be quite a bit more efficient,
> needing a bit more of code, to
> use scan() to read the numeric data directly :
> 
>       h1 <- scan(..., n=1) ## <read variable names>
>       nc <- length(h1)
>       a <- matrix(scan(...., what = numeric(), ...),  
>                   ncol = nc, dimnames = list(NULL, h1))
> 
> maybe this would be useful to be packaged into
> a small utility with usage
> 
>       read.matrix(...,  type = numeric(), ...)      
> 
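
A runnable sketch of such a utility might look like the following. No
read.matrix() exists in base R, so the name and arguments here are
hypothetical, and the sketch assumes a file with one header line and
only numeric data after it. It uses nlines = 1 rather than n = 1 so
that the whole header line is read, and byrow = TRUE because scan()
returns the values row by row.

```r
# Hypothetical utility along the lines suggested above; not part of base R
read.matrix <- function(file, sep = "", ...) {
    # Variable names come from the first line of the file
    h1 <- scan(file, what = "", sep = sep, nlines = 1, quiet = TRUE)
    nc <- length(h1)
    # The remaining lines are read directly as numeric values
    vals <- scan(file, what = numeric(), sep = sep, skip = 1,
                 quiet = TRUE, ...)
    matrix(vals, ncol = nc, byrow = TRUE, dimnames = list(NULL, h1))
}

# Small self-contained check using a temporary file
tf <- tempfile()
writeLines(c("spp1 spp2 spp3", "3 1 0", "0 1 1"), tf)
m <- read.matrix(tf)
m
```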
> 
>     GS> Following your suggestions, I altered my code to
>     GS> evaluate the rhs of the formula and check if it was of
>     GS> class "data.frame". If it is then I stop processing and
>     GS> return it as a data.frame at this point. If not, it
>     GS> eventually gets passed on to model.frame() for it to
>     GS> deal with it.
> 
>     GS> So far - limited testing - it seems to do what I wanted
>     GS> all along. I'm sure there's a gotcha in there somewhere
>     GS> but at least the code runs so I can check for problems
>     GS> against my examples.
> 
>     GS> Right, back to writing documentation...
> 
>     GS> G
> 
>     >> > more intuitive, to my mind at least for this particular example and
>     >> > analysis, to specify the formula with a data frame on the rhs.
>     >> >
>     >> > model.frame doesn't work with the formula "~ y1" if the object y1, in
>     >> > the environment when model.frame evaluates the formula, is a data.frame.
>     >> > It works if y1 is a matrix, however. I'd like to work around this
>     >> > problem, say by creating an environment in which y1 is modified to be a
>     >> > matrix, if possible. Can this be done?
>     >> >
>     >> > At the moment I have something working by grabbing the bits of the
>     >> > formula and then using get() to grab the named object. Of course, this
>     >> > won't work if someone wants to use R's formula interface with the
>     >> > following formula y2 ~ var1 + var2 + var3, data = y1, or to use the
>     >> > subset argument common to many formula implementations. I'd like to have
>     >> > the function work in as general a manner as possible, so I'm fishing
>     >> > around for potential solutions.
>     >> > 
>     >> > All the best,
>     >> > 
>     >> > Gav
>     >> > 
>     >> > >
>     >> > > On 8/16/05, Gavin Simpson <gavin.simpson at ucl.ac.uk> wrote:
>     >> > > > Hi I'm having a problem with model.frame, encapsulated in this example:
>     >> > > >
>     >> > > > y1 <- matrix(c(3,1,0,1,0,1,1,0,0,0,1,0,0,0,1,1,0,1,1,1),
>     >> > > >              nrow = 5, byrow = TRUE)
>     >> > > > y1 <- as.data.frame(y1)
>     >> > > > rownames(y1) <- paste("site", 1:5, sep = "")
>     >> > > > colnames(y1) <- paste("spp", 1:4, sep = "")
>     >> > > > y1
>     >> > > >
>     >> > > > model.frame(~ y1)
>     >> > > > Error in model.frame(formula, rownames, variables, varnames, extras, extranames, :
>     >> > > >         invalid variable type
>     >> > > >
>     >> > > > temp <- as.matrix(y1)
>     >> > > > model.frame(~ temp)
>     >> > > >   temp.spp1 temp.spp2 temp.spp3 temp.spp4
>     >> > > > 1         3         1         0         1
>     >> > > > 2         0         1         1         0
>     >> > > > 3         0         0         1         0
>     >> > > > 4         0         0         1         1
>     >> > > > 5         0         1         1         1
>     >> > > >
>     >> > > > Ideally the above wouldn't have names like temp.var1, temp.var2,
>     >> > > > but one could deal with that later.
>     >> > > >
>     >> > > > I have tracked down the source of the error message to line 1330 in
>     >> > > > model.c - here I'm stumped as I don't know any C, but it looks as if
>     >> > > > the code is looping over the variables in the formula and checking
>     >> > > > if they are the right "type". So a matrix of variables gets through,
>     >> > > > but a data.frame doesn't.
>     >> > > >
>     >> > > > It would be good if model.frame could cope with data.frames in
>     >> > > > formulae, but seeing as I am incapable of providing a patch, is
>     >> > > > there a way around this problem?
>     >> > > >
>     >> > > > Below is the head of the function I am currently using, including
>     >> > > > the function for parsing the formula - borrowed and hacked from
>     >> > > > ordiParseFormula() in package vegan.
>     >> > > >
>     >> > > > I can work out the class of the rhs of the formula. Is there a way
>     >> > > > to create a suitable environment for the data argument of
>     >> > > > parseFormula() such that it contains the rhs dataframe coerced to a
>     >> > > > matrix, which then should get through model.frame.default without
>     >> > > > error? How would I go about manipulating/creating such an
>     >> > > > environment? Any other ideas?
>     >> > > >
>     >> > > > Thanks in advance
>     >> > > >
>     >> > > > Gav
>     >> > > >
>     >> > > > coca.formula <- function(formula, method = c("predictive", "symmetric"),
>     >> > > >                          reg.method = c("simpls", "eigen"), weights = NULL,
>     >> > > >                          n.axes = NULL, symmetric = FALSE, data)
>     >> > > > {
>     >> > > >     parseFormula <- function (formula, data)
>     >> > > >     {
>     >> > > >         browser()
>     >> > > >         Terms <- terms(formula, "Condition", data = data)
>     >> > > >         flapart <- fla <- formula <- formula(Terms, width.cutoff = 500)
>     >> > > >         specdata <- formula[[2]]
>     >> > > >         X <- eval(specdata, data, parent.frame())
>     >> > > >         X <- as.matrix(X)
>     >> > > >         formula[[2]] <- NULL
>     >> > > >         if (formula[[2]] == "1" || formula[[2]] == "0")
>     >> > > >             Y <- NULL
>     >> > > >         else {
>     >> > > >             mf <- model.frame(formula, data, na.action = na.fail)
>     >> > > >             Y <- model.matrix(formula, mf)
>     >> > > >             if (any(colnames(Y) == "(Intercept)")) {
>     >> > > >                 xint <- which(colnames(Y) == "(Intercept)")
>     >> > > >                 Y <- Y[, -xint, drop = FALSE]
>     >> > > >             }
>     >> > > >         }
>     >> > > >         list(X = X, Y = Y)
>     >> > > >     }
>     >> > > >     if (missing(data))
>     >> > > >         data <- parent.frame()
>     >> > > >     #browser()
>     >> > > >     dat <- parseFormula(formula, data)
>     >> > > >
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. & ECRC                 [E] gavin.simpsonATNOSPAMucl.ac.uk
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London.  WC1H 0AP.
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%


