[R] any r package can handle factor levels not in the test set

HelponR suncertain at gmail.com
Tue Jan 13 17:59:48 CET 2015


Thanks for your reply, but I cannot control the data.
I am dealing with real-world streaming data. It is quite normal for the test
data (the data the model is applied to for prediction) to contain new values
that were not seen in the training data.
If I coded this myself, I would return a random guess or just the intercept
in such a situation. But it seems most R packages return an error and exit.
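
The "just an intercept" fallback mentioned above could be sketched roughly as
follows for an lm fit. This is only an illustration under assumptions: the
helper name predict_with_fallback is hypothetical, the toy data is made up,
and using the training mean of the response as the fallback is one possible
choice, not something any package does for you. The idea is to re-encode the
new data with the training levels (stored in fit$xlevels), so that unseen
values become NA instead of triggering an error, and then to fill the NA
predictions.

```r
# Toy training data: a factor predictor with levels a, b, c.
set.seed(1)
train <- data.frame(
  x = factor(rep(c("a", "b", "c"), each = 10)),
  y = rnorm(30, mean = rep(c(1, 2, 3), each = 10))
)
fit <- lm(y ~ x, data = train)

# Test data contains "z", which was never seen during training.
test <- data.frame(x = c("a", "z", "c"), stringsAsFactors = FALSE)

# Hypothetical helper (not part of any package): re-encode one factor
# variable with the training levels so unseen values become NA, predict,
# and replace the resulting NA predictions with a fallback value.
predict_with_fallback <- function(fit, newdata, var, fallback) {
  train_levels <- fit$xlevels[[var]]
  newdata[[var]] <- factor(as.character(newdata[[var]]),
                           levels = train_levels)
  # na.pass keeps rows with NA predictors; their predictions are NA.
  pred <- predict(fit, newdata = newdata, na.action = na.pass)
  pred[is.na(pred)] <- fallback
  pred
}

# Assumption: fall back to the mean response of the training data.
pred <- predict_with_fallback(fit, test, "x", fallback = mean(train$y))
print(pred)
```

Rows with known levels are predicted as usual; only the row with the unseen
value "z" receives the fallback. Whether a fallback like this is sensible
depends on the application.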

On Mon, Jan 12, 2015 at 6:08 PM, Richard M. Heiberger <rmh at temple.edu>
wrote:

> You need to define the levels of the training set to include all
> levels that you might see.
> Something like this
>
> > A <- factor(letters[1:5])
> > B <- factor(letters[c(1,3,5,7,9)])
> > A
> [1] a b c d e
> Levels: a b c d e
> > B
> [1] a c e g i
> Levels: a c e g i
> > training <- factor(A, levels=unique(c(levels(A), levels(B))))
> > training
> [1] a b c d e
> Levels: a b c d e g i
> >
>
> In the future please "provide commented, minimal, self-contained,
> reproducible code."
>
> On Mon, Jan 12, 2015 at 9:00 PM, HelponR <suncertain at gmail.com> wrote:
> > It looks like gbm and glm both have this issue.
> >
> > I wonder if any R package is immune to this?
> >
> > In reality, it is very normal for test data to contain values unseen in
> > the training data. It looks like I have to give up R?
> >
> > Thanks!
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
