[R] Why can we combine design matrix and data-frame in R?
peter dalgaard
pdalgd at gmail.com
Sat May 12 10:31:36 CEST 2012
On May 12, 2012, at 04:40 , Michael wrote:
> Hi all,
>
> Could you please help me?
>
> I am trying to understand why this line works:
>
> lm1x = lm(y~X-1, tmp)
>
> Here it seems that I was combining the design matrix and the data frame...
>
> And X below is not a single column, in fact, it's a bunch of columns in
> matrix form...
>
> I don't understand why this line works...
It works because you can have a matrix on the right hand side of a model formula and it will be interpreted as a set of columns... If X is constructed via model.matrix it usually contains an all-1 column, which is why you need to remove the intercept with "-1".
A data frame can contain matrices, you just need to be careful that some functions, notably data.frame(), which will split a matrix into its constituent columns, unless protected with I(). Notice the difference between these two examples:
> d <- data.frame(a=1:3,m=matrix(4:9,,2))
> d$m
NULL
> names(d)
[1] "a" "m.1" "m.2"
> d <- data.frame(a=1:3,m=I(matrix(4:9,,2)))
> d$m
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9
> names(d)
[1] "a" "m"
(d <- data.frame(a=1:3) ; d$m <- matrix(4:9,,2) is like the 2nd version)
Your example has the weakness that you do "tmp$X <- X" so that you have X both in the data frame tmp and in the global workspace, and it is really not clear that the one in the data frame is the one used. I am pretty sure that this is in fact the case, but for your peace of mind, you should try renaming one of them.
>
> Is it just luck, i.e. if we change the data-set and/or formulas to
> something else, this will potentially fail?
> (that's something I would like to catch and avoid...)
>
> Thank you!
>
> -------------------------------
>
> The data is located at:
> http://www.ling.uni-potsdam.de/~vasishth/book.html
> In the section:
> downloadable errata, code, datasets
> ERRATA
> Corrected Pages
> VasishthBroebook.R
> beauty.txt
> mathachieve.txt
> mathachschool.txt
>
> ------------------------------
>
> MathAchieve <- read.table("mathachieve.txt")
>
> colnames(MathAchieve) <- c("School", "Minority", "Sex", "SES", "MathAch",
> "MEANSES")
>
> head(MathAchieve)
>
>
> MathAchSchool <- read.table("mathachschool.txt")
> colnames(MathAchSchool) <- c("School", "Size", "Sector", "PRACAD",
> "DISCLIM", "HIMINTY", "MEANSES")
> MathScores <- merge(MathAchieve, MathAchSchool, by = "School")
>
>
> lm1 = lm(MathAch ~ SES + factor(Sector) , MathScores)
>
> X=model.matrix(MathAch ~ SES+factor(Sector) , MathScores)
> y=MathScores$MathAch
>
> tmp=MathScores
> tmp$y=y
> tmp$X=X
>
> lm1x = lm(y~X-1, tmp)
>
> plot(fitted(lm1), fitted(lm1x))
>
> max(abs(fitted(lm1) - fitted(lm1x)))
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
More information about the R-help
mailing list