[R] predict: remove columns with new levels automatically
Andreas Wittmann
andreas_wittmann at gmx.de
Wed Nov 25 20:20:30 CET 2009
Thank you all for the good advice.
Now I did a quick hack which does what I was looking for; maybe someone
else will find this useful.
set.seed(0)
x <- rnorm(9)
y <- x + rnorm(9)

## stringsAsFactors=TRUE so that z1 and z2 are factors
## (this is no longer the default since R 4.0.0)
training <- data.frame(x=x, y=y,
                       z1=c(rep("A", 3), rep("B", 3), rep("C", 3)),
                       z2=c(rep("F", 4), rep("G", 5)),
                       stringsAsFactors=TRUE)
tt <- rnorm(1)
test <- data.frame(x=tt, y=tt+rnorm(1), z1="D", z2="F",
                   stringsAsFactors=TRUE)

predict.drop <- function(f, dat, newdat)
{
  if (ncol(dat) != ncol(newdat))
    stop("dat and newdat must have the same number of columns!")

  ## levels of each factor column, NULL for non-factor columns
  ## (lapply keeps the NULL entries, so positions match the columns)
  filllevs <- function(d)
    lapply(d, function(col) if (is.factor(col)) levels(col) else NULL)

  datlev <- filllevs(dat)
  newdatlev <- filllevs(newdat)

  ## mark factor columns whose levels in newdat are not all known from dat
  drop <- logical(ncol(dat))
  names(drop) <- colnames(dat)
  for (j in seq_len(ncol(dat)))
  {
    if (!is.null(datlev[[j]]) && !all(newdatlev[[j]] %in% datlev[[j]]))
      drop[j] <- TRUE
  }

  ## refit without the marked columns, then predict
  m <- lm(formula(f), data=dat[, !drop, drop=FALSE])
  p <- predict(m, newdat)
  return(list(drop=drop, p=p))
}

predict.drop(x ~ ., training, test)
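
A variant that starts from an already fitted model might look roughly like the sketch below. It reads the factor levels from the model's xlevels component (the list David points to in his reply) and refits without the offending terms. The helper name drop_new_levels is made up for illustration and is not part of R; the sketch assumes every predictor enters the formula as a plain main effect (no interactions or transformations) and that the model was fitted with the default model=TRUE, so that the model frame is available for refitting.

```r
## Sketch only: drop_new_levels() is a hypothetical helper, not a base R function.
## Assumes plain main-effect predictors and a model fitted with model=TRUE.
drop_new_levels <- function(model, newdat)
{
  xl <- model$xlevels   # factor levels seen when the model was fitted

  ## names of factor predictors that have unseen levels in newdat
  bad <- names(xl)[vapply(names(xl), function(v)
    !all(as.character(newdat[[v]]) %in% xl[[v]]), logical(1))]

  if (length(bad) > 0)
  {
    ## remove the offending terms from the expanded formula and refit
    ## on the model frame stored in the fitted object
    new_formula <- update(formula(model),
                          as.formula(paste(". ~ . -", paste(bad, collapse=" - "))))
    model <- lm(new_formula, data=model.frame(model))
  }

  list(dropped=bad, p=predict(model, newdat))
}

set.seed(0)
x <- rnorm(9)
y <- x + rnorm(9)
training <- data.frame(x=x, y=y,
                       z1=rep(c("A", "B", "C"), each=3),
                       z2=rep(c("F", "G"), times=c(4, 5)))
test <- data.frame(x=0, y=0.5, z1="D", z2="F")

fit <- lm(x ~ ., data=training)
res <- drop_new_levels(fit, test)
res$dropped   # names of the variables that were removed before predicting
```

Because xlevels is filled in by lm itself, this works whether z1 and z2 are stored as factors or as character columns. Note that dropping a variable changes the model, so predictions for different newdata may come from different fits.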
best regards
Andreas
David Winsemius wrote:
>
> On Nov 25, 2009, at 1:48 AM, Andreas Wittmann wrote:
>
>> Sorry for my bad description. I don't want to be handed a finished
>> algorithm without doing any work of my own; I only hoped to get some
>> advice on how to do this. I don't want to predict arbitrary data; I am
>> referring only to newdata whose variables are the same as in the model
>> data. But if there are factors in the data, it is possible that the
>> newdata has a level which doesn't exist in the original data.
>> So I have to compare each factor in the data and in the newdata, and
>> if the newdata has a level which is not in the original data, drop
>> this variable and compute the model and prediction again.
>> I thought this problem was quite common and that I could use an
>> algorithm somebody has already implemented.
>>
>> best regards
>>
>> Andreas
>>
> If you use str to look at the lm1 object you will find near the bottom a
> list called "xlevels" (lm1$x reaches it through partial matching of $):
>
> lm1$x will show you the factor levels that were present in the variables
> at the time of the model creation
> > lm1$x
> $z
> [1] "A" "B" "C"
>
> New testing scenario with one good level and one bad level:
>
> test <- data.frame(x=t<-rnorm(2), y=t+rnorm(2), z=c("B", "D") )
> lm1 <- lm(x ~ ., data=training)
> ## get prediction for the good level only
> predict(lm1, subset(test, z %in% lm1$x$z) )
> 1
> 0.4225204
>
>>
>>
>>
>> -------- Original Message --------
>>> Date: Wed, 25 Nov 2009 00:48:59 -0500
>>> From: David Winsemius <dwinsemius at comcast.net>
>>> To: Andreas Wittmann <andreas_wittmann at gmx.de>
>>> CC: r-help at r-project.org
>>> Subject: Re: [R] predict: remove columns with new levels automatically
>>
>>>
>>> On Nov 24, 2009, at 2:24 PM, Andreas Wittmann wrote:
>>>
>>>> Dear R-users,
>>>>
>>>> the following thread
>>>>
>>>> http://tolstoy.newcastle.edu.au/R/help/03b/3322.html
>>>>
>>>> discusses how to remove rows for predict that contain levels which
>>>> are not in the model.
>>>>
>>>> Now I am trying to do this the other way round and want to remove
>>>> columns (variables) from the model which would later be problematic
>>>> because of new levels at prediction time.
>>>>
>>>> ## example:
>>>> set.seed(0)
>>>> x <- rnorm(9)
>>>> y <- x + rnorm(9)
>>>>
>>>> training <- data.frame(x=x, y=y, z=c(rep("A", 3), rep("B", 3),
>>>> rep("C", 3)))
>>>> test <- data.frame(x=t<-rnorm(1), y=t+rnorm(1), z="D")
>>>>
>>>> lm1 <- lm(x ~ ., data=training)
>>>> ## prediction does not work because the variable z has the new
>>>> ## level "D"
>>>> predict(lm1, test)
>>>>
>>>> ## solution: the variable z is removed from the model
>>>> ## the prediction happens without using the information of variable z
>>>> lm2 <- lm(x ~ y, data=training)
>>>> predict(lm2, test)
>>>>
>>>> How can I automatically recognize this and refit the model accordingly?
>>>
>>> Let me get this straight. You want us to predict in advance (or more
>>> accurately design an algorithm that can see into the future and work
>>> around) any sort of newdata you might later construct????
>>>
>>> --
>>>
>>> David Winsemius, MD
>>> Heritage Laboratories
>>> West Hartford, CT
>>
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>