[R] classifier for histograms?

context grey mobygeek at yahoo.com
Tue Jan 31 10:30:51 CET 2006


Hi,

Apologies for this question being off-topic (OT) for R, though I
expect this list might be the best place on the net to ask it.


In brief, the question is: what classification algorithm can one use
if the features are histograms?


I have a classification problem, and believe that histograms of the
distribution of some values may be the best "feature" to use.
To make the mail shorter, here's a simpler example problem:

   Try to classify a person as e.g. drunk or not, given the
   histogram of their driving speed.

In the training phase, we have a table whose rows contain the driver,
whether they are drunk, and a sample of driving speed.

From this one can build separate histograms of driving speed for
drunk/non-drunk.
  (In my actual application, I have several such histogram features,
and they are visibly different; they are also currently ranked by
some analytic pdf-distance measures such as KL.)

Now, how to classify... 

Given a single speed, its probability can be evaluated under the two
classes, but a single speed sample is not going to be reliable in
this problem. Suppose instead that the _distribution_ of speeds is
sufficient to discriminate.
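
To make the single-sample case concrete, here is a minimal sketch in
Python (standard library only) of evaluating one speed under each
class histogram. The bin edges, the toy speed samples and the helper
names make_histogram and bin_probability are all made up for
illustration; they are not from my actual data.

```python
from bisect import bisect_right

def bin_index(edges, x, n_bins):
    """Index of the bin containing x; out-of-range values are clamped to the end bins."""
    i = bisect_right(edges, x) - 1
    return min(max(i, 0), n_bins - 1)

def make_histogram(samples, edges):
    """Normalized bin frequencies of samples over the given bin edges."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for x in samples:
        counts[bin_index(edges, x, n_bins)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def bin_probability(hist, edges, x):
    """Estimated probability of the bin that a single value x falls into."""
    return hist[bin_index(edges, x, len(hist))]

# Hypothetical toy data: speeds per class and shared bin edges.
edges = [0, 20, 40, 60, 80, 100]
sober_speeds = [45, 50, 52, 48, 55, 47, 51, 49, 53, 46]
drunk_speeds = [20, 75, 35, 90, 15, 65, 85, 30, 70, 25]

sober_hist = make_histogram(sober_speeds, edges)
drunk_hist = make_histogram(drunk_speeds, edges)

# Probability of a single observed speed under each class.
p_sober = bin_probability(sober_hist, edges, 50)
p_drunk = bin_probability(drunk_hist, edges, 50)
```

With so few samples some bins are empty, so a single speed can get
probability zero under a class, which is one more reason a single
sample is unreliable here.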

We have a driver, and a distribution of their speeds over time. A
histogram can be built. What to do with this histogram? Is there a
standard classifier that can deal with this situation?

My thoughts:
- The test histogram could be compared to each of the training
  histograms with the chi^2 measure (a sum of squared, normalized,
  approximately Gaussian deviations), then get a probability from
  this?
- Alternatively, consider training histograms with n bins as points
  in n-dimensional space, and use Euclidean closeness in this space.
  This may not generalize to more than one such histogram feature,
  though...
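
The two ideas above can be sketched as follows, in Python with the
standard library only. This is a rough illustration under
assumptions, not a standard classifier: the chi^2 form used here
(squared difference divided by the sum of the two bin values) is only
one common variant, and the class histograms, the test histogram and
the function names are hypothetical.

```python
import math

def chi2_distance(p, q):
    """Symmetric chi-square distance between two normalized histograms.

    Bins that are empty in both histograms are skipped to avoid 0/0.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total

def euclidean_distance(p, q):
    """Euclidean distance, treating n-bin histograms as points in n-space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_class(test_hist, class_hists, distance):
    """Label of the class histogram closest to test_hist under distance."""
    return min(class_hists, key=lambda label: distance(test_hist, class_hists[label]))

# Hypothetical normalized training histograms, one per class (5 bins each).
class_hists = {
    "sober": [0.0, 0.1, 0.8, 0.1, 0.0],
    "drunk": [0.1, 0.4, 0.0, 0.3, 0.2],
}

# Hypothetical histogram of one driver's speeds over time.
test_hist = [0.1, 0.3, 0.1, 0.3, 0.2]

label_chi2 = nearest_class(test_hist, class_hists, chi2_distance)
label_l2 = nearest_class(test_hist, class_hists, euclidean_distance)
```

This only yields a nearest class, not a probability. For more than
one histogram feature, one possibility (an assumption on my part, not
something I have tested) would be to sum the per-feature distances
before taking the minimum.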

Thanks for any thoughts.

(Also thanks for the replies to my recent question 
about hashtable/dictionary.)



