[R] How to do a probability density based filtering in 2D?

David Winsemius dwinsemius at comcast.net
Sat Nov 20 05:24:20 CET 2010

On Nov 19, 2010, at 9:40 PM, Emmanuel Levy wrote:

> Hello David,
> I thought about this at first as well, e.g.,
>  x1.lim = quantile(x1,prob=c(0.05,0.95))
>  y2.lim = quantile(y2,prob=c(0.05,0.95))
>  x1.sub = x1[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 <
> y2.lim[2]]
>  y2.sub = y2[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 <
> y2.lim[2]]
> But this is actually does not do what I want because it is not based
> on the density of the data. What I would like is to keep only the
> points within an area where the density of points is over x.

I now see what you were asking.  My initial efforts with the example  
in the kde2d help page did not turn out the way I naively expected.  
And the multimodal nature of the geyser example makes it even less  
amenable to the approach you outlined above. The points in the kde2d  
grid are not associated with any particular point in the data,  
although I suppose one could find the closest point on the 50 x 50  
kde2d grid to each of of the data points and take the median or  
whatever percentile were desired of that set of measures. Sounds too  

But I think I came up with a potential solution that will draw  
estimated boundaries that will contain the highest density areas and  
with 50% of the points:

require(coda)  # I was worried that it might not run without BRugs but  
it didn't complain
require(emdbook)  # (lots of dependencies)
HPDregionplot(mcmc(geyser, prob=0.5))

Looking at this result made me wonder whether I had seen this before,  
and I thought of the locfit package. See page 86 of Loader's text:

fit.trim<- locfit( ~ lp(waiting, duration, nn=0.1,scale=0), data=geyser)
plot(fit.trim,  xlim=c(min(geyser$waiting),min(geyser 
$duration),max(geyser$waiting), max(geyser$duration) ) )
geyser$dens <-(fitted(fit.trim)) ; med.dens <- median(geyser$dens);  
high <- geyser$dens > med.dens
# These densities are associated with points.
points(geyser[!high, "waiting"], geyser[!high, "duration"],  
col="blue", cex=0.5 )
points(geyser[high, "waiting"], geyser[high, "duration"], col="red",  
cex=0.5 )

(I could not get the contour lines at 95% and 50% coverage using the  
code in the book so this was my best shot. I would be interested in  
knowing how to do it with locfit,  which I why I am copying Andy Liaw,  
although he probably won't see it until Monday.)
> In other words, if you imagine a contour plot, I'd like to keep all
> the points within a given contour line. So the data has to be
> interpolated or smoothed at some point. I am using the kde2d function
> of the MASS package to plot contour lines but I don't know how to
> retrieve subsets of what I plot.
> Also I wouldn't be surprised if there is a trick with quantile that
> escapes my mind.
> Thanks for your help,
> Emmanuel
> On 19 November 2010 21:25, David Winsemius <dwinsemius at comcast.net>  
> wrote:
>> On Nov 19, 2010, at 8:44 PM, Emmanuel Levy wrote:
>>> Hello,
>>> This sounds like a problem to which many solutions should exist,  
>>> but I
>>> did not manage to find one.
>>> Basically, given a list of datapoints, I'd like to keep those within
>>> the X% percentile highest density.
>>> That would be equivalent to retain only points within a given line  
>>> of
>>> a contour plot.
>>> Thanks to anybody who could let me know which function I could use!
>> What's wrong with quantile?


David Winsemius, MD
West Hartford, CT

More information about the R-help mailing list