[R] problem with predict()

Thomas Lumley tlumley at u.washington.edu
Fri Jun 28 20:16:56 CEST 2002


On Fri, 28 Jun 2002, Liaw, Andy wrote:

> As Prof. Ripley guessed, the X data is less than full rank.  I'm surprised
> that lm didn't issue warning.  summary() does say:
>
> Coefficients: (17 not defined because of singularities)
>
> I'm also surprised that with such a fitted object, predict(model) works, but
> not predict(model, data), where data is the original data used to fit the
> model.  This does not seem to be user-friendly...

There's something to be said for more warnings, but I think what predict
does is reasonable.

If you have fitted a model to rank-deficient data then you can predict on
data with the same rank deficiency.  In particular, you can predict on the
original data, as predict(model) does.  However, you can't predict on new
data unless they have the same rank-deficiencies, and since this is
something that can't be tested stably it make sense to refuse to predict
on new data.


	-thomas
>
> Andy
>
> > -----Original Message-----
> > From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> > Sent: Friday, June 28, 2002 1:42 PM
> > To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch; 'Liaw, Andy'
> > Subject: RE: [R] problem with predict()
> >
> >
> > I will try.
> >
> > I tried to use lm() to have basic linear "reference".
> >
> > If this is the reason (i.e. train data set is rank-deficient) and lm
> > cannot handle such situation, what are the other packages in R
> > you could recommend which would handle this type of data ?
> > Also in such case: should not lm() report a problem in a
> > model building
> > phase ?
> >
> > So far I have used SVM approach to do regression with this
> > data set and I am
> > getting
> > rather poor r^2 (~0.25 on test set), but I do not have any numerical
> > problems with SVM.
> > I am also planning to try randomForest() to do
> > classification. This was my
> > immediate
> > motivation to turn to R.
> >
> > All the best,
> >
> > R
> >
> > Ryszard Czerminski   phone: (781)994-0479
> > ArQule, Inc.         email:ryszard at arqule.com
> > 19 Presidential Way  http://www.arqule.com
> > Woburn, MA 01801     fax: (781)994-0679
> >
> >
> > -----Original Message-----
> > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> > Sent: Friday, June 28, 2002 12:39 PM
> > To: Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> > Have you tried the R debugging tools?  If not, please make
> > use of them.
> > My guess is that you have a rank-deficient problem.
> >
> > ?debugger
> > ?recover
> > ?dump.frames
> > ...
> >
> >
> > On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:
> >
> > > This time I use the same file for train.data and test.data
> > > throwing in "names(test) <- names(train)" before predict()
> > for double
> > > protection (:-)
> > > and it still fails...
> > >
> > > Is it some weird problem with this particular data set ? or a bug ?
> > > (why "subscript out of bounds" ?)
> >
> > That's what the debugging tools are for.
> >
> > >
> > > > rm(list=ls())
> > > > train.data <- read.csv("train.csv", header = TRUE,
> > row.names = "mol",
> > > comment.char="")
> > > > test.data <- read.csv("train.csv", header = TRUE,
> > row.names = "mol",
> > > comment.char="")
> > > > yr <- train.data[,1]; xr <- train.data[,-1]
> > > > xr <- scale(xr)     # matrix <- scale(data.frame)
> > > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
> > "scaled:scale")
> > > > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > > > xr <- xr[,!mask] # rm NA's
> > > > ys <- test.data[,1]; xs <- test.data[,-1]
> > > > xs <- scale(xs, center = x.center, scale = x.scale)
> > > > xs <- xs[,!mask]
> > > > train <- data.frame(y = yr, x = xr)
> > > > test <- data.frame(y = ys, x = xs)
> > > > model <- lm(y~., train)
> > > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > > dim(train) = 164 119 ; dim(test) = 164 119
> > > > names(test) <- names(train)
> > > > length(predict(model, test))
> > > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> > >         subscript out of bounds
> > > >
> > >
> > > Ryszard Czerminski   phone: (781)994-0479
> > > ArQule, Inc.         email:ryszard at arqule.com
> > > 19 Presidential Way  http://www.arqule.com
> > > Woburn, MA 01801     fax: (781)994-0679
> > >
> > >
> > > -----Original Message-----
> > > From: Liaw, Andy [mailto:andy_liaw at merck.com]
> > > Sent: Friday, June 28, 2002 8:46 AM
> > > To: 'Czerminski, Ryszard'
> > > Cc: r-help at stat.math.ethz.ch
> > > Subject: RE: [R] problem with predict()
> > >
> > >
> > > You can try:
> > >
> > >   names(test) <- names(train)
> > >
> > > before calling predict() to make sure that the variable names match.
> > > Without your data files, it's hard to tell why your first
> > example worked.
> > >
> > > Andy
> > >
> > > > -----Original Message-----
> > > > From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> > > > Sent: Thursday, June 27, 2002 3:29 PM
> > > > To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> > > > Cc: r-help at stat.math.ethz.ch
> > > > Subject: RE: [R] problem with predict()
> > > >
> > > >
> > > >
> > > > # Yes. You are *still* using a matrix in a data frame.
> > > > Please do read more
> > > > # carefully.
> > > >
> > > > I have read some more R documentation trying to
> > understand difference
> > > > between
> > > > matrices and data frames etc... and I repeat my example this time
> > > > executing EXACTLY the same code with only difference being
> > > > that in one case
> > > > I use smaller data sets ({train,test}-small.csv) and in the
> > > > second case I
> > > > use larger
> > > > data sets ({train,test}.csv) - and I got different behaviour.
> > > >
> > > > Small case (10*4) runs fine, larger case (164*119) fails.
> > > >
> > > > Any ideas why this happens ?
> > > >
> > > > R
> > > >
> > > > > rm(list=ls())
> > > > > train.data <- read.csv("train-small.csv", header =
> > TRUE, row.names =
> > > > "mol", comment.char="")
> > > > > test.data <- read.csv("test-small.csv", header = TRUE,
> > > > row.names = "mol",
> > > > comment.char="")
> > > > > yr <- train.data[,1]; xr <- train.data[,-1]
> > > > > xr <- scale(xr)
> > > > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
> > > > "scaled:scale")
> > > > > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > > > > xr <- xr[,!mask] # rm NA's
> > > > > ys <- test.data[,1]; xs <- test.data[,-1]
> > > > > xs <- scale(xs, center = x.center, scale = x.scale)
> > > > > xs <- xs[,!mask]
> > > > > train <- data.frame(y = yr, x = xr)
> > > > > test <- data.frame(y = ys, x = xs)
> > > > > model <- lm(y~., train)
> > > > > cat("dim(train) =", dim(train), "; dim(test) =",
> > dim(test), "\n")
> > > > dim(train) = 10 4 ; dim(test) = 10 4
> > > > > length(predict(model, test))
> > > > [1] 10
> > > > > train.data <- read.csv("train.csv", header = TRUE,
> > > > row.names = "mol",
> > > > comment.char="")
> > > > > test.data <- read.csv("test.csv", header = TRUE,
> > row.names = "mol",
> > > > comment.char="")
> > > > [snip...]
> > > > > cat("dim(train) =", dim(train), "; dim(test) =",
> > dim(test), "\n")
> > > > dim(train) = 164 119 ; dim(test) = 35 119
> > > > > length(predict(model, test))
> > > > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> > > >         subscript out of bounds
> > > > >
> > > >
> > > > Ryszard Czerminski   phone: (781)994-0479
> > > > ArQule, Inc.         email:ryszard at arqule.com
> > > > 19 Presidential Way  http://www.arqule.com
> > > > Woburn, MA 01801     fax: (781)994-0679
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> > > > Sent: Friday, June 21, 2002 3:41 PM
> > > > To: Czerminski, Ryszard
> > > > Cc: r-help at stat.math.ethz.ch
> > > > Subject: RE: [R] problem with predict()
> > > >
> > > >
> > > > On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
> > > >
> > > > > --- first problem
> > > > >
> > > > > If I store 'simulated' data in data frames:
> > > > > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> > > > > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> > > > > it still works the same way i.e. the code below works fine
> > > > > for simulated data and fails for 'real' data the only
> > > > > difference being in actual numeric values stored in data
> > > > > structures of the same shape and type.
> > > > >
> > > > > Any suggestions why this happens ?
> > > >
> > > > Yes. You are *still* using a matrix in a data frame.  Please
> > > > do read more
> > > > carefully.
> > > >
> > > > > --- second problem
> > > > >
> > > > > > As Andy Liaw pointed out, xr is a matrix.  Take a look at
> > > > the names of
> > > > > > train.  Hint: they do not contain `x'.
> > > > >
> > > > > Following your hint I am guessing that the fact that names
> > > > do not contain
> > > > > 'x'
> > > > > explains why lm(y~., train) form works and lm(y~x, train) fails
> > > > > and "lm(y~., train)" means roughly: correlate column "y" to
> > > > all other
> > > > colums
> > > >
> > > > No, it means regress y on all the remaining colums in the
> > > > data argument.
> > > >
> > > > >
> > > > > Where I can find more detail specification of this syntax ?
> > > > > In help(lm) I find this paragraph:
> > > > >
> > > > >      Models for `lm' are specified symbolically.  A typical
> > > > model has
> > > > >      the form `response ~ terms' where `response' is the
> > > > (numeric)...
> > > > >
> > > > > which does not quite cover this case.
> > > >
> > > > In any good book on the subject.
> > > >
> > > > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > > > -.-.-.-.-.-.-.
> > > > -.-
> > > > r-help mailing list -- Read
> > > > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > > Send "info", "help", or "[un]subscribe"
> > > > (in the "body", not the subject !)  To:
> > > > r-help-request at stat.math.ethz.ch
> > > > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > > > _._._._._._._.
> > > > _._
> > > > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > > > -.-.-.-.-.-.-.-.-
> > > > r-help mailing list -- Read
> > > > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > > Send "info", "help", or "[un]subscribe"
> > > > (in the "body", not the subject !)  To:
> > > > r-help-request at stat.math.ethz.ch
> > > > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > > > _._._._._._._._._
> > > >
> > >
> > >
> > --------------------------------------------------------------
> > --------------
> > > --
> > > Notice: This e-mail message, together with any attachments, contains
> > > information of Merck & Co., Inc. (Whitehouse Station, New
> > Jersey, USA)
> > that
> > > may be confidential, proprietary copyrighted and/or legally
> > privileged,
> > and
> > > is intended solely for the use of the individual or entity
> > named on this
> > > message.  If you are not the intended recipient, and have
> > received this
> > > message in error, please immediately return this by e-mail
> > and then delete
> > > it.
> > >
> > >
> > ==============================================================
> > ==============
> > > ==
> > >
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > -.-.-.-.-.-.-.
> > -.-
> > > r-help mailing list -- Read
> > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > > Send "info", "help", or "[un]subscribe"
> > > (in the "body", not the subject !)  To:
> > r-help-request at stat.math.ethz.ch
> > >
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > _._._._._._._.
> > _._
> > >
> >
> > --
> > Brian D. Ripley,                  ripley at stats.ox.ac.uk
> > Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford,             Tel:  +44 1865 272861 (self)
> > 1 South Parks Road,                     +44 1865 272860 (secr)
> > Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >
>
> ------------------------------------------------------------------------------
> Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that may be confidential, proprietary copyrighted and/or legally privileged, and is intended solely for the use of the individual or entity named on this message.  If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.
>
> ==============================================================================
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
>

Thomas Lumley			Asst. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
^^^^^^^^^^^^^^^^^^^^^^^^
 NOTE NEW EMAIL ADDRESS

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._



More information about the R-help mailing list