[R] strange problem with mars() in package mda
natekupp
natekupp at gmail.com
Mon Aug 6 16:08:19 CEST 2007
Hello all,
So I'm doing some data analysis using MARS. I have 65 independent
variables from which I'm trying to predict 71 dependent variables.
I have 900+ data points (so roughly a 900x136 data matrix), which I randomly
split into training and validation sets of ~450 data points each.
Occasionally, this works well, and I get decent predictions. However, quite
often MARS predicts extremely wrong values for the entire matrix of
dependent variables. For example:
          y1             y2              y3             y4
   ...
  -1248145.629    1272399.812     9687.904417   -17713.04301
  -1289951.702    1234426.24     -7355.868156   -17713.00275
  -1268022.079    1245287.516    -1169.938246   -17713.32342
  -1252243.171    1304869.002    19119.56255    -17713.32218
  -1275335.038    1241681.7      -3269.268145   -17713.12027
  -1251563.638    1299513.864    17509.25065    -17712.68469
   ...
where the average value of these variables is actually more like:
y1 = ~19.89
y2 = ~33.64
y3 = ~1.51
y4 = ~1.52
I think it may be related to the distribution of my data: the vast majority
of the points (~850 of the 900+) lie very close to the average, while the
remainder are scattered widely around the measurement space, often very far
from the average.
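A quick sketch of how that spread could be quantified, assuming
measurements has already been read in as in the script below, with the
dependent variables in columns 67:137:

# Count, per response column, how many points lie more than 5 robust
# standard deviations (MADs) from the column median. The column indices
# (67:137) follow the layout assumed in the script below.
y_all <- measurements[, 67:137]
outlier_counts <- sapply(y_all, function(col) {
  sum(abs(col - median(col)) > 5 * mad(col))
})
summary(outlier_counts)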
It seems that if I limit my training set to the "good" points only, the
model performs well (<10% error). As I add "bad" points to the training set,
there is a certain number I can include beyond which MARS predicts extremely
wrong values like those in the example above.
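To illustrate, here is a sketch of that experiment (it assumes the objects
passing_Devices, failing_Devices, pass_Selection, fail_Selection, x_v and
y_v from the script below; the counts of "bad" points are just illustrative):

# Train on the "good" half of the training split plus an increasing number
# of "bad" training points, then look at the relative error on the
# validation set built in the script below.
train_good <- passing_Devices[pass_Selection, ]
bad_pool   <- failing_Devices[fail_Selection, ]
y_v_mat    <- as.matrix(y_v)
for (n_bad in c(0, 10, 25, 50, 100)) {
  bad_idx <- sample(nrow(bad_pool), min(n_bad, nrow(bad_pool)))
  train   <- rbind(train_good, bad_pool[bad_idx, ])
  fit     <- mars(train[, 2:66], train[, 67:137])
  pred    <- predict(fit, x_v)
  err     <- abs((pred - y_v_mat) / y_v_mat)
  cat(n_bad, "bad points: median relative error =",
      median(err, na.rm = TRUE), "\n")
}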
Is this a bug with the MARS implementation in R, or a limitation of MARS
itself when trained with some outlier data? My code is shown below:
library(mda)
measurements <- read.table("clean_measurements.csv", header = TRUE,
                           colClasses = "numeric", sep = ",")
#divide "good" and "bad" data points based on a 1/-1 label column
selection <- which(measurements[,138] == 1)
passing_Devices <- measurements[selection,1:138]
failing_Devices <- measurements[-selection,1:138]
#Number of passing/failing devices
num_Passing_Devices <- dim(passing_Devices)[1]
num_Failing_Devices <- dim(failing_Devices)[1]
# Use probability vectors to make vectors of indices ...
pass_Selection <- which(runif(num_Passing_Devices) > 0.5)
fail_Selection <- which(runif(num_Failing_Devices) > 0.5)
# ... which are then used to establish training and validation data sets,
# each containing ~50% of the "good" and ~50% of the "bad" data points
training_Set <- rbind(passing_Devices[pass_Selection, ],
                      failing_Devices[fail_Selection, ])
validation_Set <- rbind(passing_Devices[-pass_Selection, ],
                        failing_Devices[-fail_Selection, ])
# columns 2 to 66 are independent variables
x <- training_Set[,2:66]
# and 67 to 137 are dependent
y <- training_Set[,67:137]
model <- mars(x,y)
x_v <- validation_Set[,2:66]
y_v <- validation_Set[,67:137]
y_p <- predict(model, x_v)
percent_Error <- abs((y_p - y_v) / y_v)
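For completeness, here is a sketch of a more constrained fit that could be
compared against the default one; the degree, nk and penalty argument names
are my reading of ?mars in mda, so please treat them as assumptions and
correct me if I have them wrong:

# Additive terms only, a smaller basis and a heavier GCV penalty, to see
# whether the huge predictions go away. Argument names are assumptions
# based on my reading of ?mars in the mda package.
model_restricted <- mars(x, y, degree = 1, nk = 21, penalty = 3)
y_p_restricted   <- predict(model_restricted, x_v)
restricted_Error <- abs((y_p_restricted - as.matrix(y_v)) / as.matrix(y_v))
summary(as.vector(restricted_Error))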
Thanks in advance for any help or suggestions you might have, I appreciate
it.
~Nate