[R] strange problem with mars() in package mda
natekupp
natekupp at gmail.com
Mon Aug 6 16:08:19 CEST 2007
Hello all,
So I'm doing some data analysis using MARS. I have 65 independent
variables from which I'm trying to predict 71 dependent variables.
I have 900+ data points (so roughly a 900x136 data matrix), which I randomly
split into training and validation sets of ~450 data points each.
Occasionally, this works well, and I get decent predictions. However, quite
often MARS predicts extremely wrong values for the entire matrix of
dependent variables. For example:
          y1             y2              y3             y4
   ...
  -1248145.629    1272399.812     9687.904417   -17713.04301
  -1289951.702    1234426.24     -7355.868156   -17713.00275
  -1268022.079    1245287.516    -1169.938246   -17713.32342
  -1252243.171    1304869.002    19119.56255    -17713.32218
  -1275335.038    1241681.7      -3269.268145   -17713.12027
  -1251563.638    1299513.864    17509.25065    -17712.68469
   ...
where the average value of these variables is actually more like:
y1 = ~19.89
y2 = ~33.64
y3 = ~1.51
y4 = ~1.52
I think it may be related to the distribution of my data: the vast majority
of the points (~850 of the 900+) lie very close to the average, while the
remainder are scattered widely around the measurement space, often very far
from the average.
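A quick sketch of how that spread could be quantified, assuming
measurements has already been read in as in the script below, with the
dependent variables in columns 67:137:

# Count, per response column, how many points lie more than 5 robust
# standard deviations (MADs) from the column median. The column indices
# (67:137) follow the layout assumed in the script below.
y_all <- measurements[, 67:137]
outlier_counts <- sapply(y_all, function(col) {
  sum(abs(col - median(col)) > 5 * mad(col))
})
summary(outlier_counts)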
It seems that if I limit my training set to the "good" points only, the
model performs well (<10% error). As I add "bad" points to the training set,
there is a certain number I can include beyond which MARS predicts extremely
wrong values like those in the example above.
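To illustrate, here is a sketch of that experiment (it assumes the objects
passing_Devices, failing_Devices, pass_Selection, fail_Selection, x_v and
y_v from the script below; the counts of "bad" points are just illustrative):

# Train on the "good" half of the training split plus an increasing number
# of "bad" training points, then look at the relative error on the
# validation set built in the script below.
train_good <- passing_Devices[pass_Selection, ]
bad_pool   <- failing_Devices[fail_Selection, ]
y_v_mat    <- as.matrix(y_v)
for (n_bad in c(0, 10, 25, 50, 100)) {
  bad_idx <- sample(nrow(bad_pool), min(n_bad, nrow(bad_pool)))
  train   <- rbind(train_good, bad_pool[bad_idx, ])
  fit     <- mars(train[, 2:66], train[, 67:137])
  pred    <- predict(fit, x_v)
  err     <- abs((pred - y_v_mat) / y_v_mat)
  cat(n_bad, "bad points: median relative error =",
      median(err, na.rm = TRUE), "\n")
}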
Is this a bug with the MARS implementation in R, or a limitation of MARS
itself when trained with some outlier data? My code is shown below:
library(mda)
measurements <- read.table("clean_measurements.csv", header = TRUE,
                           colClasses = "numeric", sep = ",")
#divide "good" and "bad" data points based on a 1/-1 label column
selection <- which(measurements[,138] == 1)
passing_Devices <- measurements[selection,1:138]
failing_Devices <- measurements[-selection,1:138]
#Number of passing/failing devices
num_Passing_Devices <- dim(passing_Devices)[1]
num_Failing_Devices <- dim(failing_Devices)[1]
# Use probability vectors to make vectors of indices ...
pass_Selection <- which(runif(num_Passing_Devices) > 0.5)
fail_Selection <- which(runif(num_Failing_Devices) > 0.5)
# ... which are then used to establish training and validation data sets,
# each containing ~50% of the "good" and ~50% of the "bad" data points
training_Set <- rbind(passing_Devices[pass_Selection, ],
                      failing_Devices[fail_Selection, ])
validation_Set <- rbind(passing_Devices[-pass_Selection, ],
                        failing_Devices[-fail_Selection, ])
# columns 2 to 66 are independent variables
x <- training_Set[,2:66]
# and 67 to 137 are dependent
y <- training_Set[,67:137]
model <- mars(x,y)
x_v <- validation_Set[,2:66]
y_v <- validation_Set[,67:137]
y_p <- predict(model, x_v)
percent_Error <- abs((y_p - y_v) / y_v)
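For completeness, here is a sketch of a more constrained fit that could be
compared against the default one; the degree, nk and penalty argument names
are my reading of ?mars in mda, so please treat them as assumptions and
correct me if I have them wrong:

# Additive terms only, a smaller basis and a heavier GCV penalty, to see
# whether the huge predictions go away. Argument names are assumptions
# based on my reading of ?mars in the mda package.
model_restricted <- mars(x, y, degree = 1, nk = 21, penalty = 3)
y_p_restricted   <- predict(model_restricted, x_v)
restricted_Error <- abs((y_p_restricted - as.matrix(y_v)) / as.matrix(y_v))
summary(as.vector(restricted_Error))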
Thanks in advance for any help or suggestions you might have, I appreciate
it.
~Nate