[R] [Partial Summary] A fundamental formula and dataframe question.

Mon May 12 02:42:54 CEST 2008

Two very good responses to this question, but I wonder: is there more
complete documentation on this form of model and dataframe
construction? I've been using R for ~5 years now and wasn't aware of it.

Response 1: Insert a matrix as a column of the dataframe using I().

	var<-1:10
	mat<-matrix(101:200,10)
	mydf<-data.frame(var,I(mat))
	str(mydf)
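
To make the effect of I() concrete, here is a minimal sketch that builds the
same data with and without it. The names with_I and without_I are mine, not
from either response.

```r
var <- 1:10
mat <- matrix(101:200, 10)

# With I(): the matrix is stored as one column named "mat"
with_I <- data.frame(var, I(mat))
names(with_I)      # "var" "mat"
dim(with_I$mat)    # 10 10

# Without I(): the matrix is split into one data-frame column per matrix column
without_I <- data.frame(var, mat)
ncol(without_I)    # 11
```

So I() is what keeps the whole matrix reachable under a single name.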

Response 2: An equivalent response, plus a demonstration that this model
construction technique generalizes at least to lm, which ends with a
question:

    C1 <- c(1.1,1.2,1.3,1.4)
    C2 <- c(2.1,2.2,2.3,2.4)
    M <- cbind(M1=c(11.1,11.2,11.3,11.4),
               M2=c(12.1,12.2,12.3,12.4))
    DF <- data.frame(C1=C1,C2=C2,M=M)

    "Would you have to "spell out" the interaction term[s] in additional
columns of M?"

    Hmmm, interesting! I hadn't been aware of this aspect of formula and
    dataframe construction for modelling, until you pointed it out!

This response included a very useful example, excerpted below after the
initial question.

Thanks responders,
Brent

> There is a very useful and apparently fundamental feature of R (or of 
> the package pls) which I don't understand.
> 
> For datasets with many independent (X) variables such as chemometric 
> datasets there is a convenient formula and dataframe construction that
> allows one to access the entire X matrix with a single term.
> 
> Consider the gasoline dataset available in the pls package. For the 
> model statement in the plsr function one can write: Octane ~ NIR
> 
> NIR refers to a (wide) matrix which is a portion of a dataframe. The 
> naming of the columns is of the form: 'NIR.xxxx nm'
> 
> names(gasoline) returns...
> 
> $names
> [1] "octane" "NIR"   
> 
> instead of...
> 
> $names
> [1] "octane" "NIR.1000 nm" "NIR.1001 nm" ... 
> 
> How do I construct and manipulate such dataframes and the column names
> that go with them?
> 
> Does the use of these types of formulas and dataframes generalize to 
> other modeling functions?
> 
> Some specific clues on a help search might be enough, I've tried many.
> 
> Regards,
> Brent
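
As a rough sketch of the construction asked about here, a gasoline-like
data frame can be put together in base R by assigning a matrix to a single
column. Everything in it (the object names toy and spectra, the octane
values, the random "spectra") is made up for illustration and is not the
real pls data.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
spectra <- matrix(rnorm(5 * 4), nrow = 5)    # 5 samples, 4 made-up wavelengths
colnames(spectra) <- paste(1000:1003, "nm")

# Start from the ordinary column, then assign the matrix as ONE column
toy <- data.frame(octane = c(85, 86, 87, 88, 89))   # invented values
toy$NIR <- spectra

names(toy)           # "octane" "NIR"
colnames(toy$NIR)    # "1000 nm" "1001 nm" "1002 nm" "1003 nm"

# The block is manipulated with the usual matrix tools
one_band <- toy$NIR[, "1001 nm"]                         # one wavelength, all samples
colnames(toy$NIR) <- sub(" nm", "", colnames(toy$NIR))   # rename the matrix columns
first3 <- toy[1:3, ]                                     # row subsetting keeps the matrix
dim(first3$NIR)      # 3 4
```

The column names belong to the matrix itself, so they are read and changed
through colnames(toy$NIR) rather than names(toy).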

I don't have the 'gasoline' dataset to hand, but I can produce something
to which your description applies as follows:

  C1 <- c(1.1,1.2,1.3,1.4)
  C2 <- c(2.1,2.2,2.3,2.4)
   M <- cbind(M1=c(11.1,11.2,11.3,11.4),
              M2=c(12.1,12.2,12.3,12.4))
  DF <- data.frame(C1=C1,C2=C2,M=M)
  DF
#    C1  C2 M.M1 M.M2
# 1 1.1 2.1 11.1 12.1
# 2 1.2 2.2 11.2 12.2
# 3 1.3 2.3 11.3 12.3
# 4 1.4 2.4 11.4 12.4

So the two columns C1 and C2 have gone in as named, and the matrix M
(with named columns M1 and M2) has gone in with columns M.M1, M.M2.

Now let's fuzz the numbers a bit, so that the lm() fit makes sense:

  C1 <- C1 + round(0.1*runif(4),2)
  C1 <- C1 + round(0.1*runif(4),2)
   M <- cbind(M1=c(11.1,11.2,11.3,11.4),
              M2=c(12.1,12.2,12.3,12.4)) +
        round(0.1*runif(8),2)
  DF <- data.frame(C1=C1,C2=C2,M=M)
  DF
#     C1  C2  M.M1  M.M2
# 1 1.21 2.1 11.19 12.13
# 2 1.34 2.2 11.23 12.23
# 3 1.38 2.3 11.36 12.30
# 4 1.50 2.4 11.43 12.48

  summary(lm(C1 ~ M, data=DF))
# Call:
# lm(formula = C1 ~ M, data = DF)
# Residuals:
#        1        2        3        4 
# -0.02422  0.02448  0.01309 -0.01335
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -8.28435    2.48952  -3.328    0.186
# MM1         -0.05411    0.66909  -0.081    0.949
# MM2          0.83463    0.50687   1.647    0.347
# Residual standard error: 0.03919 on 1 degrees of freedom
# Multiple R-Squared: 0.9642,     Adjusted R-squared: 0.8925 
# F-statistic: 13.46 on 2 and 1 DF,  p-value: 0.1893 

In other words, a perfectly standard LM fit, equivalent to

  summary(lm(C1 ~ M[,1]+M[,2]))

(as you can check). So all that looks straightforward.
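
The check can be sketched as follows, with more rows so the fit is less
degenerate. The simulated data, the seed and the names fit_matrix and
fit_columns are assumptions of mine. Note that this sketch wraps M in I()
when building DF: without I() the matrix is expanded into M.M1 and M.M2
(as the printout of DF above shows), so a term M in the formula is then
resolved from the workspace rather than from DF.

```r
set.seed(42)   # arbitrary seed, only so the sketch is reproducible
n  <- 20
M  <- cbind(M1 = rnorm(n), M2 = rnorm(n))
C1 <- 1 + 0.5 * M[, "M1"] - 0.3 * M[, "M2"] + rnorm(n, sd = 0.1)

# I() keeps M as a single matrix column, so data = DF really supplies it
DF <- data.frame(C1 = C1, M = I(M))
names(DF)      # "C1" "M"

fit_matrix  <- lm(C1 ~ M, data = DF)        # whole matrix as one term
fit_columns <- lm(C1 ~ M[, 1] + M[, 2])     # columns spelled out, M from the workspace

# Same estimates; only the coefficient labels differ
all.equal(unname(coef(fit_matrix)), unname(coef(fit_columns)))   # TRUE
names(coef(fit_matrix))    # "(Intercept)" "MM1" "MM2"
```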

One thing, however, is not clear to me in this scenario.
Suppose, for example, that the columns M1 and M2 of M were factors (and
that you had more rows than I've used above, so that the fit is
non-trivial).

Then, in the standard specification of an LM, you could write

  summary(lm(C1 ~ M[,1]*M[,2]))

and get the main effects and interactions. But how would you do that in
the other type of specification:

Where you used
  summary(lm(C1 ~ M, data=DF))
to get the equivalent of
  summary(lm(C1 ~ M[,1]+M[,2]))
what would you use to get the equivalent of
  summary(lm(C1 ~ M[,1]*M[,2]))??

Would you have to "spell out" the interaction term[s] in additional
columns of M?

Hmmm, interesting! I hadn't been aware of this aspect of formula and
dataframe construction for modelling, until you pointed it out!
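
One possible way to approach the interaction question, offered as a sketch
rather than a definitive answer: a matrix term contributes its columns as
main effects only, so interactions need the columns to be visible
individually. The example uses numeric columns, since as far as I know a
matrix cannot hold factors (cbind() on factors does not keep the factor
class). The simulated data and the names fit_index and fit_expand are my
own assumptions.

```r
set.seed(7)   # arbitrary seed, for reproducibility only
n  <- 30
M  <- cbind(M1 = rnorm(n), M2 = rnorm(n))
C1 <- 1 + M[, "M1"] + 2 * M[, "M2"] +
      0.5 * M[, "M1"] * M[, "M2"] + rnorm(n, sd = 0.1)

# Option 1: index the matrix columns inside the formula
fit_index <- lm(C1 ~ M[, 1] * M[, 2])

# Option 2: expand M into ordinary columns (no I()) and request all
# two-way interactions among the predictors with .^2
DF <- data.frame(C1 = C1, M)           # columns: C1, M1, M2
fit_expand <- lm(C1 ~ .^2, data = DF)

names(coef(fit_expand))    # "(Intercept)" "M1" "M2" "M1:M2"
all.equal(unname(coef(fit_index)), unname(coef(fit_expand)))   # TRUE
```

Option 2 scales to many columns without typing each term, at the cost of
giving up the single matrix column in the data frame.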