[R] Histogram Ranking

John Day jday at csi-inc.com
Tue Sep 10 15:45:29 CEST 2002


Dear all,

I received no response to this, which indicates that perhaps I didn't 
express myself clearly. Let me be a little more formal and try again:

Let C be a classification problem where we are trying to classify samples 
described by N features F=(f1,f2,...,fN) into a small set of M classes 
T=(t1,t2,...,tM).

We want to learn the classification rules from the distributions of the 
values of F. Suppose we have a large number of labelled samples 
SF=(sf1,sf2,...sfN,t), where the sf's are feature values and t is the 
"ground truth" for that sample.

Then we can construct sample histograms by class:
hk = (h1k, h2k, ..., hNk) for k=1..M, where hjk is the count of feature j 
for class k
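
As a rough sketch of this construction, assuming a single numeric feature 
split into equal-width bins: the snippet below uses R's built-in iris data 
purely as a stand-in for the labelled samples, and the choice of 6 bins is 
arbitrary. The names x, cls, breaks and counts are illustrative only.

```r
# Labelled samples: one numeric feature plus the "ground truth" class.
# iris is only a stand-in data set; any labelled sample would do.
x   <- iris$Sepal.Length
cls <- iris$Species

# Equal-width bins over the span of the feature (6 bins is an arbitrary choice).
nbins  <- 6
breaks <- seq(min(x), max(x), length.out = nbins + 1)

# Bin index for every sample; all.inside = TRUE keeps the maximum in the last bin.
bin <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)

# One histogram per class: rows are classes, columns are bins.
counts <- table(cls, factor(bin, levels = seq_len(nbins)))
print(counts)
```

Each row of counts is one class histogram hk in the notation above.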

If the samples are random and unbiased then the histograms constitute an 
approximation of the conditional probability distributions for each class. 
My idea is that these histograms can be viewed as points in a metric space 
and ranked by computing their Euclidean distance from each other, after 
normalizing them so that each histogram sums to 1.

For example: let h1 and h2 be two histograms of feature counts for classes 
1 and 2 for a feature set with 6 features:
 > h1<-c(1,3,5,6,7,8)
 > h2<-c(4,6,4,6,7,9)

Normalize and compute the metric:
 > n1=h1/sum(h1)
 > n2=h2/sum(h2)
 > sqrt(sum((n1-n2)^2))
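
The same calculation can be extended to more than two classes. The sketch 
below is one possible way to do it, using the base R function dist() on the 
row-normalized counts. The helper name normalize_hist and the third class 
are made up for illustration; only class1 and class2 come from the example 
above.

```r
# Normalize a count vector so that it sums to 1 (hypothetical helper name).
normalize_hist <- function(h) {
  stopifnot(is.numeric(h), all(h >= 0), sum(h) > 0)
  h / sum(h)
}

# Rows are classes, columns are bins. class3 is invented data, added only
# to show the pairwise case.
counts <- rbind(
  class1 = c(1, 3, 5, 6, 7, 8),
  class2 = c(4, 6, 4, 6, 7, 9),
  class3 = c(8, 7, 6, 5, 3, 1)
)

# apply() returns one column per class, so transpose back to rows.
probs <- t(apply(counts, 1, normalize_hist))

# Pairwise Euclidean distances between the normalized histograms.
d <- dist(probs, method = "euclidean")
print(d)
```

For class1 and class2 this gives the same value as the sqrt(sum(...)) line 
above, about 0.128. as.matrix(d) returns the full symmetric distance matrix 
if that is easier to index.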

By adjusting the bin sizes we can probe how well the class concepts can be 
generalized. That is, if the metric is non-zero only when the bin size is 1, 
then the problem cannot be generalized. The size of the metric is roughly 
the "predictive juice" contained in the conditional distributions.
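
A minimal sketch of such a bin-size probe, under stated assumptions: the two 
classes are simulated here as normal samples with different means, which is 
not real data, and hist_distance is a hypothetical helper name. Both 
histograms share the same breaks so that the bins line up.

```r
# Simulated feature values for two classes (an assumption for illustration).
set.seed(1)
x1 <- rnorm(500, mean = 0)
x2 <- rnorm(500, mean = 1)

# Euclidean distance between the normalized histograms of a and b,
# computed on nbins equal-width bins spanning both samples.
hist_distance <- function(a, b, nbins) {
  rng    <- range(c(a, b))
  breaks <- seq(rng[1], rng[2], length.out = nbins + 1)
  ca <- hist(a, breaks = breaks, plot = FALSE)$counts
  cb <- hist(b, breaks = breaks, plot = FALSE)$counts
  sqrt(sum((ca / sum(ca) - cb / sum(cb))^2))
}

# Recompute the metric for several bin counts (coarse to fine).
nbins <- c(2, 5, 10, 20, 50)
dists <- sapply(nbins, function(k) hist_distance(x1, x2, k))
print(data.frame(nbins = nbins, distance = dists))
```

Comparing the distance column across rows shows how the metric responds as 
the bins get coarser or finer for a given pair of classes.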

Yes, it's extremely simplistic, but I have found that this metric is useful 
for measuring how effectively features support a classification scheme.  Is 
this a well-known procedure? Is it sound assuming good samples? Are there 
procedures in R to support this kind of analysis?

Or is there a much better way to rank the "goodness" of a feature set?

Thanks,
John Day

At 02:30 PM 9/6/2002 -0400, I wrote:
>Hello,
>
>This is not exactly an R question, but I suspect that there is an R 
>procedure that does what I am calling (for lack of a better name) 
>"histogram ranking".
>
>I'm trying to evaluate a set of regression features by segregating by 
>target class and comparing the feature histograms. My idea is that if the 
>histograms are the same for two different classes then there is no 
>predictive power in those features. Conversely, if the histograms are 
>different then there is probably some predictive "juice" that we can 
>squeeze out of the features with regression.
>
>The histograms are computed by partitioning the features into equally 
>spaced bins over their spans and counting the sample values in each bin 
>that corresponds to that partition of feature space. This is done for each 
>target class, so the resulting histograms are the feature distributions 
>conditioned by target class.
>
>Since the histograms are numeric vectors, we can measure the "goodness" of 
>a feature set by evaluating the "distance" between histograms. The bigger 
>the better etc.
>
>Now I'm no statistics expert. Have I re-invented some "wheel" here? What 
>is the canonical name for this kind of analysis? Is this kind of analysis 
>routinely done in R? [Is there a "better" way to do all this?]
>
>Thanks,
>John Day



