[R] Histogram Ranking
John Day
jday at csi-inc.com
Fri Sep 13 02:30:30 CEST 2002
[Note: I posted this Tuesday and re-sent it Wednesday and today, but it
didn't show up in the list archive, so apparently got dropped on the floor
somehow. I'll try again]
Dear All,
I received no response to this. Not sure if it was a glitch in the mailing
list or maybe I didn't express myself clearly. Let me be a little more
formal and try again:
Is the algorithm described below a well-known procedure? Is it sound,
assuming unbiased samples? Are there procedures in R to support this kind of
analysis?
Or is there a much better way to rank the "goodness" of a feature set,
which is really what I'm trying to do here?
Here's a description of the problem I'm trying to solve and the
"historanking" algorithm:
Let C=(F,T) be a classification problem where we are trying to classify
samples described by a set of N features F=(f1,f2,...,fN) into a small set
of M classes T=(t1,t2,...,tM).
We want to learn the classification rules from the distributions of the
values of F. Suppose we have a large number of labelled samples
SF=(sf1,sf2,...,sfN,tk), where the sf's are feature values and tk is the
"ground truth" class (one of 1..M) for that sample.
Then we can construct sample histograms by class:
hk = (h1k, h2k, ..., hNk) for k=1..M, where hjk is the count for feature j in class k
If the samples are random and unbiased then they constitute an
approximation of the conditional probability distributions for each class.
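For concreteness, building the per-class histograms might look like the
following sketch (the feature values, class labels, and number of bins are
all invented here for illustration):

```r
# Hypothetical labelled samples: one numeric feature plus a class label.
set.seed(1)
samples <- data.frame(
  f1    = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  class = rep(c("t1", "t2"), each = 50)
)

# Use common breaks so both class histograms share the same bins.
breaks <- seq(min(samples$f1), max(samples$f1), length.out = 11)
bins   <- cut(samples$f1, breaks = breaks, include.lowest = TRUE)

# Rows = classes, columns = bins: counts of feature values per class.
counts <- table(samples$class, bins)

# Normalize each row so it approximates the class-conditional distribution.
h <- counts / rowSums(counts)
```

With unbiased sampling, each row of h is then an empirical estimate of
P(feature bin | class), which is what the distance metric below compares.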
My idea is that these histograms can be viewed as points in a metric space
and ranked by computing their Euclidean distance from each other, after
normalizing each histogram so that it sums to 1.
For example: let h1 and h2 be two histograms of feature counts for classes
1 and 2 for a feature set with 6 features:
> h1<-c(1,3,5,6,7,8)
> h2<-c(4,6,4,6,7,9)
Normalize and compute the metric:
> n1=h1/sum(h1)
> n2=h2/sum(h2)
> sqrt(sum((n1-n2)^2))
[1] 0.1283802
By adjusting the bin sizes we can probe how well the class concepts can be
generalized. That is, if the metric is non-zero only when the binsize is 1,
then the problem cannot be generalized. The size of the metric is roughly
the "predictive juice" contained in the conditional distributions.
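This bin-size probing can be sketched by re-binning the same counts at a
coarser resolution and recomputing the metric (the histo_dist and rebin
helper names are made up here; the counts are those from the example above):

```r
# Normalized Euclidean distance between two count histograms.
histo_dist <- function(a, b) {
  sqrt(sum((a / sum(a) - b / sum(b))^2))
}

# Merge adjacent bins in groups of k (assumes length(h) is a multiple of k).
rebin <- function(h, k) {
  tapply(h, rep(seq_len(length(h) / k), each = k), sum)
}

h1 <- c(1, 3, 5, 6, 7, 8)
h2 <- c(4, 6, 4, 6, 7, 9)

d_fine   <- histo_dist(h1, h2)                       # binsize 1
d_coarse <- histo_dist(rebin(h1, 2), rebin(h2, 2))   # binsize 2
```

Comparing d_fine against d_coarse at successively larger bin sizes shows
whether the class separation survives coarser partitions of feature space.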
Yes, "historanking" is somewhat simplistic and perhaps statistically
impaired, but I have found that this metric is useful for measuring how
effectively features support a classification scheme.
Thanks,
John Day
At 02:30 PM 9/6/2002 -0400, I wrote:
>Hello,
>
>This is not exactly an R question, but I suspect that there is an R
>procedure that does what I am calling (for lack of a better name)
>"histogram ranking".
>
>I'm trying to evaluate a set of regression features by segregating by
>target class and comparing the feature histograms. My idea is that if the
>histograms are the same for two different classes then there is no
>predictive power in those features. Conversely, if the histograms are
>different then there is probably some predictive "juice" that we can
>squeeze out of the features with regression.
>
>The histograms are computed by partitioning the features into equally
>spaced bins over their spans and counting the sample values falling in each
>bin. This is done separately for each target class, so the resulting
>histograms are the feature distributions conditioned on target class.
>
>Since the histograms are numeric vectors, we can measure the "goodness" of
>a feature set by evaluating the "distance" between histograms. The bigger
>the better etc.
>
>Now I'm no statistics expert. Have I re-invented some "wheel" here? What
>is the canonical name for this kind of analysis? Is this kind of analysis
>routinely done in R? [Is there a "better" way to do all this?]
>
>Thanks,
>John Day
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._