[R] Why can we combine design matrix and data-frame in R?

Sat May 12 10:31:36 CEST 2012

On May 12, 2012, at 04:40 , Michael wrote:

> Hi all,
> 
> Could you please help me?
> 
> I am trying to understand why this line works:
> 
> lm1x = lm(y~X-1, tmp)
> 
> Here it seems that I was combining the design matrix and the data frame...
> 
> And X below is not a single column, in fact, it's a bunch of columns in
> matrix form...
> 
> I don't understand why this line works...

It works because you can have a matrix on the right hand side of a model formula and it will be interpreted as a set of columns... If X is constructed via model.matrix it usually contains an all-1 column, which is why you need to remove the intercept with "-1".

A data frame can contain matrices, you just need to be careful that some functions, notably data.frame(), which will split a matrix into its constituent columns, unless protected with I(). Notice the difference between these two examples:

> d <- data.frame(a=1:3,m=matrix(4:9,,2))
> d$m
NULL
> names(d)
[1] "a"   "m.1" "m.2"
> d <- data.frame(a=1:3,m=I(matrix(4:9,,2)))
> d$m
     [,1] [,2]
[1,]    4    7
[2,]    5    8
[3,]    6    9
> names(d)
[1] "a" "m"

(d <- data.frame(a=1:3) ; d$m <- matrix(4:9,,2) is like the 2nd version)

Your example has the weakness that you do "tmp$X <- X" so that you have X both in the data frame tmp and in the global workspace, and it is really not clear that the one in the data frame is the one used. I am pretty sure that this is in fact the case, but for your peace of mind, you should try renaming one of them.

> 
> Is it just luck, i.e. if we change the data-set and/or formulas to
> something else, this will potentially fail?
> (that's something I would like to catch and avoid...)
> 
> Thank you!
> 
> -------------------------------
> 
> The data is located at:
> http://www.ling.uni-potsdam.de/~vasishth/book.html
> In the section:
> downloadable errata, code, datasets
> ERRATA
> Corrected Pages
> VasishthBroebook.R
> beauty.txt
> mathachieve.txt
> mathachschool.txt
> 
> ------------------------------
> 
> MathAchieve <- read.table("mathachieve.txt")
> 
> colnames(MathAchieve) <- c("School", "Minority", "Sex", "SES",  "MathAch",
> "MEANSES")
> 
> head(MathAchieve)
> 
> 
> MathAchSchool <- read.table("mathachschool.txt")
> colnames(MathAchSchool) <- c("School", "Size", "Sector", "PRACAD",
> "DISCLIM", "HIMINTY", "MEANSES")
> MathScores <- merge(MathAchieve, MathAchSchool, by = "School")
> 
> 
> lm1 = lm(MathAch ~ SES  + factor(Sector) , MathScores)
> 
> X=model.matrix(MathAch ~ SES+factor(Sector) , MathScores)
> y=MathScores$MathAch
> 
> tmp=MathScores
> tmp$y=y
> tmp$X=X
> 
> lm1x = lm(y~X-1, tmp)
> 
> plot(fitted(lm1), fitted(lm1x))
> 
> max(abs(fitted(lm1) - fitted(lm1x)))
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com