[R] problem with predict()
ripley@stats.ox.ac.uk
Fri Jun 28 18:39:10 CEST 2002
Have you tried the R debugging tools? If not, please make use of them.
My guess is that you have a rank-deficient problem.
?debugger
?recover
?dump.frames
...
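
One way to test the rank-deficiency guess without the original data is to
compare the rank of the fitted model with the number of columns of its model
matrix. The sketch below uses made-up data (toy, x1, x2 and x3 are
hypothetical names, not columns of train.csv) in which one predictor is an
exact linear combination of two others:

```r
# Made-up data: x3 is an exact linear combination of x1 and x2,
# so the design matrix cannot have full column rank.
set.seed(1)
n  <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2
toy <- data.frame(y = rnorm(n), x1 = x1, x2 = x2, x3 = x3)

fit <- lm(y ~ ., toy)
X   <- model.matrix(fit)

# A rank smaller than the number of columns signals rank deficiency.
rank_info <- c(rank = fit$rank, columns = ncol(X))
print(rank_info)

# Coefficients that lm() could not estimate are reported as NA.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```

Running the same two checks on the real model (fit$rank against
ncol(model.matrix(fit)), and NA entries in coef(fit)) would show whether the
guess applies to the 164 x 119 case.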
On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:
> This time I use the same file for train.data and test.data
> throwing in "names(test) <- names(train)" before predict() for double
> protection (:-)
> and it still fails...
>
> Is it some weird problem with this particular data set ? or a bug ?
> (why "subscript out of bounds" ?)
That's what the debugging tools are for.
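
As a rough sketch of what those tools give you: in an interactive session,
options(error = recover) opens a frame browser at the point of failure. The
non-interactive variant below records the call stack instead. bad_index() is
a hypothetical stand-in for a failing matrix subscript, not the actual code
inside predict():

```r
# Hypothetical stand-in: asks for more columns than the matrix has.
bad_index <- function(X, piv) X[, piv, drop = FALSE]

call_stack <- NULL
msg <- tryCatch(
  withCallingHandlers(
    bad_index(matrix(1:6, nrow = 2), 1:4),   # 3 columns, 4 requested
    # The calling handler runs before the stack is unwound,
    # so sys.calls() still contains the frame that failed.
    error = function(e) call_stack <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

print(msg)
print(length(call_stack) > 0)
```

Interactively, inspecting the frames with recover() or debugger() would show
which object the out-of-range subscript refers to.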
>
> > rm(list=ls())
> > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > test.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > yr <- train.data[,1]; xr <- train.data[,-1]
> > xr <- scale(xr) # matrix <- scale(data.frame)
> > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > xr <- xr[,!mask] # rm NA's
> > ys <- test.data[,1]; xs <- test.data[,-1]
> > xs <- scale(xs, center = x.center, scale = x.scale)
> > xs <- xs[,!mask]
> > train <- data.frame(y = yr, x = xr)
> > test <- data.frame(y = ys, x = xs)
> > model <- lm(y~., train)
> > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> dim(train) = 164 119 ; dim(test) = 164 119
> > names(test) <- names(train)
> > length(predict(model, test))
> Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> subscript out of bounds
> >
>
> Ryszard Czerminski phone: (781)994-0479
> ArQule, Inc. email:ryszard at arqule.com
> 19 Presidential Way http://www.arqule.com
> Woburn, MA 01801 fax: (781)994-0679
>
>
> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com]
> Sent: Friday, June 28, 2002 8:46 AM
> To: 'Czerminski, Ryszard'
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] problem with predict()
>
>
> You can try:
>
> names(test) <- names(train)
>
> before calling predict() to make sure that the variable names match.
> Without your data files, it's hard to tell why your first example worked.
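
A minimal sketch of why the names matter, using made-up data (mw, logp, resp,
V1 and V2 are hypothetical column names): predict() looks up the model's
variables by name in newdata, so a data frame with different column names
fails until it is relabelled.

```r
# Made-up training data; mw and logp are hypothetical predictor names.
train <- data.frame(y    = c(1.0, 2.1, 2.9, 4.2, 5.1),
                    mw   = c(1, 2, 3, 4, 5),
                    logp = c(2, 1, 4, 3, 5))
model <- lm(y ~ ., train)

# Same shape, different column names: the predictors cannot be found.
test <- data.frame(resp = c(0, 0, 0),
                   V1   = c(1.5, 2.5, 3.5),
                   V2   = c(1, 2, 3))
before <- try(predict(model, test), silent = TRUE)
print(inherits(before, "try-error"))

# Relabel the columns to match the training data, then predict.
names(test) <- names(train)
after <- predict(model, test)
print(length(after))
```

Note that names(test) <- names(train) relabels columns by position, so it is
only appropriate when both data frames hold the same variables in the same
order.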
>
> Andy
>
> > -----Original Message-----
> > From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
> > Sent: Thursday, June 27, 2002 3:29 PM
> > To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> >
> > # Yes. You are *still* using a matrix in a data frame. Please do read more
> > # carefully.
> >
> > I have read some more R documentation, trying to understand the difference
> > between matrices and data frames etc., and I repeat my example, this time
> > executing EXACTLY the same code. The only difference is that in one case
> > I use smaller data sets ({train,test}-small.csv) and in the second case
> > larger data sets ({train,test}.csv), and I get different behaviour.
> >
> > Small case (10*4) runs fine, larger case (164*119) fails.
> >
> > Any ideas why this happens ?
> >
> > R
> >
> > > rm(list=ls())
> > > train.data <- read.csv("train-small.csv", header = TRUE, row.names = "mol", comment.char="")
> > > test.data <- read.csv("test-small.csv", header = TRUE, row.names = "mol", comment.char="")
> > > yr <- train.data[,1]; xr <- train.data[,-1]
> > > xr <- scale(xr)
> > > x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr, "scaled:scale")
> > > mask <- apply(xr, 2, function(x) any(is.na(x)))
> > > xr <- xr[,!mask] # rm NA's
> > > ys <- test.data[,1]; xs <- test.data[,-1]
> > > xs <- scale(xs, center = x.center, scale = x.scale)
> > > xs <- xs[,!mask]
> > > train <- data.frame(y = yr, x = xr)
> > > test <- data.frame(y = ys, x = xs)
> > > model <- lm(y~., train)
> > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > dim(train) = 10 4 ; dim(test) = 10 4
> > > length(predict(model, test))
> > [1] 10
> > > train.data <- read.csv("train.csv", header = TRUE, row.names = "mol", comment.char="")
> > > test.data <- read.csv("test.csv", header = TRUE, row.names = "mol", comment.char="")
> > [snip...]
> > > cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
> > dim(train) = 164 119 ; dim(test) = 35 119
> > > length(predict(model, test))
> > Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
> > subscript out of bounds
> > >
> >
> >
> >
> > -----Original Message-----
> > From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
> > Sent: Friday, June 21, 2002 3:41 PM
> > To: Czerminski, Ryszard
> > Cc: r-help at stat.math.ethz.ch
> > Subject: RE: [R] problem with predict()
> >
> >
> > On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
> >
> > > --- first problem
> > >
> > > If I store 'simulated' data in data frames:
> > > # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
> > > # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
> > > it still works the same way i.e. the code below works fine
> > > for simulated data and fails for 'real' data the only
> > > difference being in actual numeric values stored in data
> > > structures of the same shape and type.
> > >
> > > Any suggestions why this happens ?
> >
> > Yes. You are *still* using a matrix in a data frame. Please do read more
> > carefully.
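
To illustrate the point about a matrix in a data frame, here is a small
sketch with made-up values (raw, a and b are hypothetical names). scale()
returns a matrix, and data.frame(y = ..., x = <matrix>) spreads that matrix
into one column per matrix column, each prefixed with "x.", so no column
called x exists afterwards:

```r
# Made-up predictors; scale() turns the data frame into a numeric matrix.
raw <- data.frame(a = c(1, 2, 3, 4), b = c(2, 1, 4, 3))
xr  <- scale(raw)
yr  <- c(1.1, 1.9, 3.2, 3.8)
print(is.matrix(xr))

# The matrix is expanded into separate columns named x.a and x.b.
train <- data.frame(y = yr, x = xr)
print(names(train))

# Wrapping the matrix in I() keeps it as a single matrix column named x.
kept <- data.frame(y = yr, x = I(xr))
print(names(kept))
```

This is why a formula that refers to x cannot find it in train, while a
formula that refers to all remaining columns can still be fitted.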
> >
> > > --- second problem
> > >
> > > > As Andy Liaw pointed out, xr is a matrix. Take a look at the names of
> > > > train. Hint: they do not contain `x'.
> > >
> > > Following your hint, I am guessing that the fact that the names do not
> > > contain 'x' explains why lm(y~., train) works and lm(y~x, train) fails,
> > > and that "lm(y~., train)" means roughly: correlate column "y" to all
> > > other columns
> >
> > No, it means regress y on all the remaining columns in the data argument.
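
A short sketch of what the dot expands to, on made-up data (a and b are
hypothetical column names): terms() with a data argument replaces "." by
every column of the data frame that is not already on the left-hand side.

```r
# Made-up data frame with a response y and two other columns.
train <- data.frame(y = c(1, 2, 3, 4, 5),
                    a = c(2, 1, 4, 3, 6),
                    b = c(5, 3, 4, 1, 2))

# Expand the dot against the columns of train.
tt <- terms(y ~ ., data = train)
labels_used <- attr(tt, "term.labels")
print(labels_used)

# The same expansion written back as an explicit formula.
expanded <- formula(tt)
print(expanded)
```

So lm(y ~ ., train) fits the same model as writing out every remaining column
by hand on the right-hand side.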
> >
> > >
> > > Where I can find more detail specification of this syntax ?
> > > In help(lm) I find this paragraph:
> > >
> > > Models for `lm' are specified symbolically. A typical model has
> > > the form `response ~ terms' where `response' is the (numeric)...
> > >
> > > which does not quite cover this case.
> >
> > In any good book on the subject.
> >
> > -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
> > -.-.-.-.-.-.-.
> > -.-
> > r-help mailing list -- Read
> > http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> > Send "info", "help", or "[un]subscribe"
> > (in the "body", not the subject !) To:
> > r-help-request at stat.math.ethz.ch
> > _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
> > _._._._._._._.
> > _._
> >
>
>
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595