[R] Strange column shifting with read.table

Rolf Turner r.turner at auckland.ac.nz
Mon Aug 3 02:22:40 CEST 2009


On 3/08/2009, at 11:32 AM, Noah Silverman wrote:

> Rolf,
>
> Point taken.
>
> However, some of the variables in the experiment simply don't have
> data for some of the examples.
>
> Since I'm training an SVM that will complain about an NA, how do
> you suggest I handle this?
>
>
> Imagine a model predicting student performance/grades/whatever.
>
> One variable might be "past_gpa".
>
> If we have some students with no history, what do you put for that
> column?  NA is more "correct", but won't work with an SVM.
>
> I'm always happy to learn...

I know next to nothing about support vector machines.  Despite my
ignorance, I remain suspicious of the concept.  I suspect that
fortune("machine learning") is relevant.

If you have a data set that contains intrinsic NAs and you wish to
apply SVM methods to these data, then you will need to understand how
SVMs work and decide what *should* be done to handle these NAs.  My
vague understanding is that an SVM tries to build pairs of hyperplanes,
as widely separated as possible, between classes of data.  This
requires that each datum be representable as a point in n-dimensional
space.  A datum one of whose entries is NA is not (really) such a
point.  Moreover it sure as hell isn't the same as the point produced
by replacing that NA by 0.

To take your example involving past_gpa: a student who has no past gpa
is very likely to be very different from a student who has previously
studied and failed everything!
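A minimal sketch of that point in base R, using made-up data (the
student labels and values are hypothetical): once the NA is replaced
by 0, the student with no history sits at distance zero from the one
who failed everything.

```r
# Hypothetical toy data: three students of the same age.
students <- data.frame(
  past_gpa = c(NA, 0.0, 3.6),   # NA = no history; 0.0 = failed everything
  age      = c(19, 19, 19),
  row.names = c("no_history", "failed_all", "strong")
)

# Replace NA by 0, as proposed in the question.
zero_filled <- students
zero_filled$past_gpa[is.na(zero_filled$past_gpa)] <- 0

# Euclidean distances between the rows after zero-filling.
d <- as.matrix(dist(zero_filled))
print(d)

# The two students are now indistinguishable to any distance-based method.
d["no_history", "failed_all"]
```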

What you need is a *metric* which tells you the distance between a
point with an NA in it and another point.  The other point may have no
NAs amongst its coordinates, or it might have an NA in a *different*
coordinate.  I.e. you need to define a distance between points, some
of whose coordinates may be missing, in a *meaningful* way.
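One possible sketch of such a distance, offered as an assumption rather
than a recommendation: compare only the coordinates observed in both
points and rescale by the fraction of coordinates used.  The function
name na_dist and the data are hypothetical.  (As far as I can tell,
base R's dist() applies a similar rescaling when rows contain NAs.)
Whether the result is *meaningful* for your data is a separate
question, and a distance built this way need not satisfy the triangle
inequality.

```r
# Distance between two numeric vectors that may contain NAs.
# Assumption of this sketch: use only coordinates observed in BOTH
# points, then scale the squared sum up by p / (number shared).
na_dist <- function(x, y) {
  stopifnot(length(x) == length(y))
  shared <- !is.na(x) & !is.na(y)
  if (!any(shared)) return(NA_real_)   # nothing in common: undefined
  sqrt(sum((x[shared] - y[shared])^2) * length(x) / sum(shared))
}

a  <- c(gpa = 3.0, age = 20, hours = 10)
b  <- c(gpa = NA,  age = 22, hours = 10)
c2 <- c(gpa = 3.5, age = NA, hours = 14)

na_dist(a, b)    # uses age and hours only
na_dist(b, c2)   # NAs in different coordinates: only hours is shared
```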

After doing that, you will need (!!!) to adapt the SVM software to
work with this new metric/distance instead of the Euclidean metric.
This may possibly all have been done already by someone, somewhere.
I dunno.
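One route that avoids rewriting the software, sketched under
assumptions: build a kernel matrix from the custom distance and hand it
to an SVM implementation that accepts precomputed kernels (I believe
kernlab's ksvm() does, via as.kernelMatrix()).  The distance function,
the data and the value of gamma_par below are all hypothetical, and a
kernel built from a non-Euclidean distance is not guaranteed to be
positive semidefinite, so the sketch checks the eigenvalues rather than
assuming they are fine.

```r
# NA-aware distance (an assumption of this sketch, not a standard).
na_dist <- function(x, y) {
  shared <- !is.na(x) & !is.na(y)
  if (!any(shared)) return(NA_real_)
  sqrt(sum((x[shared] - y[shared])^2) * length(x) / sum(shared))
}

# Hypothetical data: rows are cases; every pair shares the third column.
X <- rbind(c(3.0, 20, 10),
           c(NA,  22, 10),
           c(3.5, NA, 14),
           c(1.0, 21,  5))

# Pairwise distance matrix under the custom distance.
n <- nrow(X)
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) D[i, j] <- na_dist(X[i, ], X[j, ])
}

# Gaussian-type kernel from the distances; gamma_par is arbitrary here.
gamma_par <- 0.05
K <- exp(-gamma_par * D^2)

# An SVM kernel should be symmetric positive semidefinite.  Inspect the
# smallest eigenvalue; a clearly negative value means K is not valid.
min_eig <- min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)
min_eig
```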

Of course your proposed technique of replacing NAs by zeroes does
define a distance between such points.  But I doubt me an it be
meaningful.

OTOH how meaningful is the Euclidean metric between points whose
entries are numeric but in completely unrelated units (gpa, age,
weight, income, ...) ???
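A small illustration with made-up numbers (the people A, B and C are
hypothetical): on the raw scale, income differences swamp everything
else, and the ordering of the distances flips once each column is
standardized with scale().  Standardizing is only one conventional
choice; it removes the dependence on units but does not by itself make
the metric meaningful.

```r
# Hypothetical people measured in unrelated units.
people <- data.frame(
  gpa    = c(3.9, 3.8, 1.2),
  age    = c(20, 21, 45),
  income = c(30000, 31000, 30000),
  row.names = c("A", "B", "C")
)

# Raw Euclidean distance: dominated by the income column.
raw <- as.matrix(dist(people))

# After scale(): each column has mean 0 and standard deviation 1.
std <- as.matrix(dist(scale(people)))

# Raw scale: A looks far from B (income) and close to C.
raw["A", c("B", "C")]

# Standardized: A is closer to B, who has a similar gpa and age.
std["A", c("B", "C")]
```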

I'm sure this is little-to-no help in reality.  But I suspect that
little-to-no help is possible.

A thought that just occurred to me:  there ***might*** be some mileage
in trying to "impute" values for the NAs in your data.  However,
sensible imputation requires (so I believe) pretty stringent
conditions on your data (like multivariate Gaussianity?), which are
unlikely to be satisfied.  (Else why are you using SVM techniques in
the first place?)  Frank Harrell might have something useful, or
caustic (or both), to say on this issue.
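For concreteness, a deliberately crude sketch of single imputation in
base R, on hypothetical data.  It fills past_gpa with the observed
median and keeps an indicator column (gpa_missing, a name invented
here) so that "no history" remains distinguishable from a real GPA.
This is not the model-based imputation alluded to above, and it
carries no guarantee of being sensible for any given data set.

```r
# Hypothetical data with intrinsic NAs in past_gpa.
students <- data.frame(
  past_gpa = c(3.2, NA, 2.1, NA, 3.8),
  age      = c(20, 18, 23, 19, 21)
)

# Record which values were missing before filling them in.
students$gpa_missing <- as.integer(is.na(students$past_gpa))

# Crude single imputation: fill with the median of the observed values.
observed_median <- median(students$past_gpa, na.rm = TRUE)
students$past_gpa[is.na(students$past_gpa)] <- observed_median

students
```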

	cheers,

		Rolf Turner




