[R] Histogram Ranking

John Day jday at csi-inc.com
Wed Sep 11 15:41:39 CEST 2002


Dear all,

I received no response to this; I'm not sure whether it was a glitch in the 
mailing list or whether I didn't express myself clearly. Let me be a little 
more formal and try again:

Is the algorithm described below a well-known procedure? Is it sound, 
assuming unbiased samples? Are there procedures in R to support this kind 
of analysis?

Or is there a much better way to rank the "goodness" of a feature set, 
which is really what I'm trying to do here?

Here's the "historanking" algorithm:
Let C be a classification problem where we are trying to classify samples 
described by a set of N features F=(f1,f2,...,fN) into a small set of M 
classes T=(t1,t2,...,tM).

We want to learn the classification rules from the distributions of the 
values of F. Suppose we have a large number of labelled samples 
SF=(sf1,sf2,...,sfN,tk), where the sf's are feature values and tk is the 
"ground truth" class (one of 1..M) for that sample.

Then we can construct sample histograms by class:
hk = (h1k, h2k, ..., hNk) for k=1..M, where hjk is the count for feature j in class k
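To make the construction concrete, here is a small sketch in R for a single 
feature (the data, class labels, and break points are all invented for 
illustration): common bins are built with cut(), and a vector of bin counts 
is tabulated per class.

```r
## Hypothetical data: one feature, two classes (all values invented)
feat <- c(1.2, 0.8, 2.5, 3.1, 0.4, 2.9, 1.7, 3.3)
cls  <- c(1,   1,   2,   2,   1,   2,   1,   2)

## Use the same breaks for every class so the histograms are comparable
breaks <- seq(0, 4, by = 1)
bins   <- cut(feat, breaks = breaks)

## One vector of bin counts per class: these play the role of the hk above
h1 <- table(bins[cls == 1])
h2 <- table(bins[cls == 2])
```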

If the samples are random and unbiased then they constitute an 
approximation of the conditional probability distributions for each class. 
My idea is that these histograms can be viewed as points in a metric space 
and ranked by computing their Euclidean distance from each other, after 
normalizing each histogram so that it sums to 1.

For example: let h1 and h2 be two histograms of feature counts for classes 
1 and 2 for a feature set with 6 features:
 > h1<-c(1,3,5,6,7,8)
 > h2<-c(4,6,4,6,7,9)

Normalize and compute the metric:
 > n1=h1/sum(h1)
 > n2=h2/sum(h2)
 > sqrt(sum((n1-n2)^2))
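The same computation, packaged as a small helper function (the name 
histo_dist is mine, just for illustration):

```r
## Normalize two histograms to sum to 1, then take the Euclidean distance
histo_dist <- function(h1, h2) {
  n1 <- h1 / sum(h1)
  n2 <- h2 / sum(h2)
  sqrt(sum((n1 - n2)^2))
}

histo_dist(c(1, 3, 5, 6, 7, 8), c(4, 6, 4, 6, 7, 9))  # about 0.128
histo_dist(c(1, 2, 3), c(2, 4, 6))                    # 0: same shape after normalizing
```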

By adjusting the bin sizes we can probe how well the class concepts can be 
generalized. That is, if the metric is non-zero only when the bin size is 1, 
then the problem cannot be generalized. The size of the metric is roughly 
the "predictive juice" contained in the conditional distributions.

Yes, "historanking" is somewhat simplistic and perhaps statistically 
impaired, but I have found that this metric is useful for measuring how 
effectively features support a classification scheme.

Thanks,
John Day

At 02:30 PM 9/6/2002 -0400, I wrote:
>Hello,
>
>This is not exactly an R question, but I suspect that there is an R 
>procedure that does what I am calling (for lack of a better name) 
>"histogram ranking".
>
>I'm trying to evaluate a set of regression features by segregating by 
>target class and comparing the feature histograms. My idea is that if the 
>histograms are the same for two different classes then there is no 
>predictive power in those features. Conversely, if the histograms are 
>different then there is probably some predictive "juice" that we can 
>squeeze out of the features with regression.
>
>The histograms are computed by partitioning the features into equally 
>spaced bins over their spans and counting the sample values falling in 
>each bin. This is done for each target class, so the resulting histograms 
>are the feature distributions conditioned on target class.
>
>Since the histograms are numeric vectors, we can measure the "goodness" of 
>a feature set by evaluating the "distance" between histograms. The bigger 
>the better etc.
>
>Now I'm no statistics expert. Have I re-invented some "wheel" here? What 
>is the canonical name for this kind of analysis? Is this kind of analysis 
>routinely done in R? [Is there a "better" way to do all this?]
>
>Thanks,
>John Day




-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
