[R] Strange column shifting with read.table
Rolf Turner
r.turner at auckland.ac.nz
Mon Aug 3 02:22:40 CEST 2009
On 3/08/2009, at 11:32 AM, Noah Silverman wrote:
> Rolf,
>
> Point taken.
>
> However, some of the variables in the experiment simply don't have
> data for some of the examples.
>
> Since I'm training an SVM that will complain about an NA, how do
> you suggest I handle this.
>
>
> Imagine a model predicting student performance/grades/whatever.
>
> One variable might be "past_gpa".
>
> If we have some students with no history, what do you put for that
> column. NA is more "correct", but won't work with an SVM.
>
> I'm always happy to learn...

I know next to nothing about support vector machines. Despite my
ignorance, I remain suspicious of the concept. I suspect that
fortune("machine learning") is relevant.

If you have a data set that contains intrinsic NAs and you wish to
apply SVM methods to these data, then you will need to understand how
SVMs work and decide what *should* be done to handle these NAs. My
vague understanding is that SVM tries to build pairs of hyperplanes, as
widely separated as possible, between classes of data. This requires
that each datum be representable as a point in n-dimensional space. A
datum one of whose entries is NA is not (really) such a point. Moreover,
it sure as hell isn't the same as the point produced by replacing that
NA by 0.

To take your example involving past_gpa: a student who has no past gpa
is very likely to be very different from a student who has previously
studied and failed everything!
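
A small base R sketch of the point above. The data frame and its row
names are made up for illustration; only past_gpa comes from the thread.
It shows that zero-filling makes the student with no history
indistinguishable from the one who failed everything.

```r
# Hypothetical toy data: three students of the same age.
students <- data.frame(
  past_gpa = c(NA, 0.0, 3.6),   # NA = no history, 0.0 = failed everything
  age      = c(19, 19, 19),
  row.names = c("no_history", "failed_all", "strong")
)

# Zero-filling, as proposed in the thread.
zero_filled <- students
zero_filled$past_gpa[is.na(zero_filled$past_gpa)] <- 0

# Euclidean distances between the rows after zero-filling.
d <- as.matrix(dist(zero_filled))
d["no_history", "failed_all"]   # 0: the two students are now identical
d["no_history", "strong"]       # 3.6
```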

What you need is a *metric* which tells you the distance between a
point with an NA in it and another point. The other point may have no
NAs amongst its coordinates, or it might have an NA in a *different*
coordinate. I.e. you need to define a distance between points, some of
whose coordinates may be missing, in a *meaningful* way.
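
One possible candidate, sketched in base R. The function na_dist() is a
hypothetical helper, not something from the thread or from any package:
it uses only the coordinates observed in both points and rescales the
result to the full dimension. Whether that is *meaningful* for a given
data set is exactly the open question.

```r
# Hypothetical helper: Euclidean distance over the coordinates that are
# observed in BOTH points, rescaled to the full dimension.
na_dist <- function(x, y) {
  stopifnot(length(x) == length(y))
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)      # no shared coordinates: undefined
  sq <- sum((x[both] - y[both])^2)
  sqrt(sq * length(x) / sum(both))      # scale up for the skipped coordinates
}

a  <- c(1, 2, NA)
b  <- c(1, NA, 5)
c0 <- c(4, 6, 5)

na_dist(a, b)    # 0: only coordinate 1 is shared, and it agrees
na_dist(a, c0)   # sqrt(37.5), about 6.12: coordinates 1 and 2 are shared
```

Note that a and b come out at distance 0 although they are not the same
point, and that their distances to c0 differ, so this is not a metric in
the strict sense. As far as I recall, the documentation of dist() in
base R describes a similar skip-and-rescale treatment of NAs.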

After doing that, you will need (!!!) to adapt the SVM software to work
with this new metric/distance instead of the Euclidean metric. This may
possibly all have been done already by someone, somewhere. I dunno.
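
One way such an adaptation is sometimes approached is to hand the SVM a
precomputed kernel matrix built from the custom distance. The sketch
below is an assumption-laden illustration, not a recipe: the data and
the value of gamma are arbitrary, and dist() stands in for whatever
NA-aware distance one settles on. It also checks whether the resulting
matrix is positive semi-definite, which a valid kernel matrix must be.

```r
# Hypothetical toy data with missing entries.
X <- rbind(c(1, 2, NA),
           c(1, NA, 5),
           c(4, 6, 5),
           c(0, 1, 1))

# dist() skips coordinates that are NA in either row and rescales the rest.
D <- as.matrix(dist(X))

# Gaussian-type similarity built from that distance; gamma is arbitrary.
gamma <- 0.1
K <- exp(-gamma * D^2)

# A valid SVM kernel matrix must be symmetric positive semi-definite.
# With an ad hoc distance this is not guaranteed, so check it.
min_eig <- min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)
is_psd  <- min_eig > -1e-8
is_psd   # FALSE here: rows 1 and 2 are at distance 0 from each other
         # but at different distances from row 3
```

So even on four points the ad hoc distance fails to give a valid
kernel. Some SVM implementations accept a user-supplied kernel matrix
(I believe ksvm() in the kernlab package does, via as.kernelMatrix()),
but the matrix still has to pass a check like the one above.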

Of course your proposed technique of replacing NAs by zeroes does
define a distance between such points. But I doubt me an it be
meaningful.

OTOH how meaningful is the Euclidean metric between points whose
entries are numeric but in completely unrelated units (gpa, age,
weight, income, ...)?
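
A base R sketch of the units problem, with made-up numbers. On the raw
scale the column with the largest numbers (income) decides which points
are "close"; after standardizing each column with scale() the ordering
changes.

```r
# Hypothetical people measured in unrelated units.
people <- data.frame(
  gpa    = c(3.0, 3.1, 1.0),
  age    = c(20, 21, 50),
  income = c(30000, 32000, 30500),
  row.names = c("A", "B", "C")
)

raw <- as.matrix(dist(people))          # dominated by income
std <- as.matrix(dist(scale(people)))   # each column: mean 0, sd 1

raw["A", "B"] > raw["A", "C"]   # TRUE:  B looks farther from A than C does
std["A", "B"] > std["A", "C"]   # FALSE: after scaling, C is the distant one
```

Standardizing is itself a choice of metric rather than a neutral fix; it
merely replaces one arbitrary weighting of the variables by another that
is easier to defend.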

I'm sure this is little-to-no help in reality. But I suspect that
little-to-no help is possible.

A thought that just occurred to me: there ***might*** be some mileage
in trying to "impute" values for the NAs in your data. However, sensible
imputation requires (so I believe) pretty stringent conditions (like
multivariate Gaussianity?) on your data, which are unlikely to be
satisfied. (Else why are you using SVM techniques in the first place?)
Frank Harrell might have something useful, or caustic (or both), to say
on this issue.
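
For concreteness, the crudest form of imputation looks like this in
base R. The data and the gpa_missing column are hypothetical, and mean
imputation is used only because it is the simplest thing to show. It
ignores any relationship between the variables and is not the "sensible"
model-based imputation alluded to above.

```r
# Hypothetical data with an intrinsically missing past_gpa.
students <- data.frame(
  past_gpa = c(3.2, NA, 2.4, NA, 3.8),
  age      = c(20, 18, 22, 19, 21)
)

# Record which values were missing BEFORE filling them in, so that
# "no history" stays distinguishable from an ordinary gpa.
students$gpa_missing <- as.integer(is.na(students$past_gpa))

# Crude single imputation: replace NA by the mean of the observed values.
obs_mean <- mean(students$past_gpa, na.rm = TRUE)
students$past_gpa[is.na(students$past_gpa)] <- obs_mean

anyNA(students)   # FALSE: no NAs left for the SVM to complain about
```

Keeping the indicator column at least avoids pretending that the filled
value was observed.
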
cheers,
Rolf Turner