[R] Density Estimation

Wed Sep 15 16:07:33 CEST 2004

On 15-Sep-04 Brian Mac Namee wrote:
> Sorry if this is a rather long post. I have a simple list of single
> feature data points from which I would like to generate a probability
> that an unseen point comes from the same distribution. To do this I am
> trying to estimate the probability density of the list of points and
> use this to generate a probability for the new unseen points. I have
> managed to use the R density function to generate the density estimate
> but have not been able to do anything with this - i.e. generate a
> probability that a new point comes from the same distribution. Is
> there a function to do this, or am I way off the mark using the
> density function at all?

It's not clear what you're really after, but it looks as though you
may be wanting to sample from the distribution estimated by 'density'.

A possible approach, which you could refine, is exemplified by

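  x <- rnorm(1000)              # example data
  d <- density(x, n=4096)       # kernel density estimate on a fine grid
  # draw grid points with probability proportional to the estimated density
  # (replace=TRUE so that the draws are independent)
  y <- sample(d$x, size=1000, replace=TRUE, prob=d$y)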

Check performance with

  hist(y)

Looks OK to me! See "?density" and "?sample".
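
One possible refinement, offered only as a sketch: assuming 'density'
is used with its default Gaussian kernel, drawing from the kernel
estimate amounts to resampling the original data with replacement and
adding Gaussian noise with standard deviation d$bw. This avoids the
grid altogether. The names y2 and the seed are just for illustration.

```r
set.seed(1)                     # only to make the sketch reproducible
x  <- rnorm(1000)               # example data, as above
d  <- density(x, n = 4096)      # default kernel is gaussian; d$bw is its sd

# Resample the data, then jitter each value by the kernel
y2 <- sample(x, size = 1000, replace = TRUE) +
      rnorm(1000, mean = 0, sd = d$bw)

hist(y2, freq = FALSE)          # histogram on the density scale
lines(d)                        # overlay the estimate for comparison
```

If a different kernel is passed to 'density', the noise distribution
would have to be changed to match.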

On an alternative interpretation, perhaps you want to first estimate
the density based on data you already have, and then when you have
got further data (but these would then be "seen" and not "unseen")
come to a judgement about whether these new points are compatible
with coming from the distribution you have estimated.

A possible approach to this question (again susceptible to refinement)
would be as follows.

1. Use a fine-grained grid for 'density', i.e. a large value for "n".

2. Replace each of the points in the new data by the nearest point
   in this grid. Call these values z1, z2, ... , zk corresponding
   to index values i1, i2, ... , ik in d$x.

3. Evaluate the probability P(z1,...,zk) from the density as the
   product of d$y[i] where i<-c(i1,...,ik).
   Better still, evaluate the logarithm of this. Call the result L.

4. Now simulate a large number of draws of k values from d on the
   lines of sample(d$x,size=k,prob=d$y) as above, and evaluate L
   for each of these. Where is the value of L from (3) situated in
   the distribution of these values of L from (4)? If (say) only
   1 per cent of the simulated values of L from "d" are less than
   the value of L from (3), then you have a basis for a test that
   your new data did not come from the distribution you have estimated
   from your old data, in that the new data are from the low-density
   part of the estimated distribution.
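
Steps (1) to (4) might be sketched as follows. This is only an
illustration under assumptions of my own: the old data are simulated,
"xnew" stands for hypothetical new points, and 5000 simulations is an
arbitrary choice. New points outside the range of d$x are simply
mapped to the nearest end of the grid, which you may want to handle
differently.

```r
set.seed(1)                     # only to make the sketch reproducible
x    <- rnorm(1000)             # "old" data (simulated here)
xnew <- c(0.3, -1.2, 2.5)       # hypothetical new points, so k = 3
k    <- length(xnew)

d <- density(x, n = 4096)       # step 1: fine-grained grid

# step 2: index in d$x of the grid point nearest to each new value
nearest <- function(z) sapply(z, function(v) which.min(abs(d$x - v)))
i.new   <- nearest(xnew)

# step 3: logarithm of the product of the density values
L.obs <- sum(log(d$y[i.new]))

# step 4: the same quantity for many simulated draws of k values from d
L.sim <- replicate(5000, {
  i <- sample(length(d$x), size = k, replace = TRUE, prob = d$y)
  sum(log(d$y[i]))
})

# proportion of simulated L values at or below the observed one
p.low <- mean(L.sim <= L.obs)
p.low
```

A small value of p.low says the new points sit in the low-density
part of the estimate; hist(L.sim) with abline(v = L.obs) shows the
same thing graphically.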

There are of course alternative ways to view this question. The
value of "k" is relevant. In particular, if "k" is small (say 3
or 4) then the suggestion in (4) is probably the best way to
approach it. However, if "k" is large then you can use a test on
the lines of Kolmogorov-Smirnov with the reference distribution
estimated as the normalised cumulative sum of d$y and the distribution
being tested as the empirical cumulative distribution of your new
data.
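
A minimal sketch of the Kolmogorov-Smirnov route, again with
simulated data and with names ("xnew", "Fhat") chosen only for
illustration. The reference distribution function is built by
interpolating the normalised cumulative sum of d$y over the grid.

```r
set.seed(1)                     # only to make the sketch reproducible
x    <- rnorm(1000)             # "old" data (simulated here)
xnew <- rnorm(200)              # hypothetical new data, k = 200

d   <- density(x, n = 4096)
cdf <- cumsum(d$y) / sum(d$y)   # cumulative distribution on the grid

# Interpolate to get a function usable as a reference CDF;
# 0 below the grid and 1 above it
Fhat <- approxfun(d$x, cdf, yleft = 0, yright = 1)

res <- ks.test(xnew, Fhat)      # one-sample test against the estimate
res
```

Bear in mind that the reference distribution is itself estimated
from data, so the p-value should be taken as approximate. The
two-sample form ks.test(xnew, x) compares the two data sets directly
and does not need the density estimate at all.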

Even sharper focus is available if you are in a position to make
a parametric model for your data, but your description does not
suggest that this is the case.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 15-Sep-04                                       Time: 15:07:33
------------------------------ XFMail ------------------------------