[R] non-linear binning? power-law in R

Dan Bolser dmb at mrc-dunn.cam.ac.uk
Wed Jun 16 12:52:54 CEST 2004


First, thanks to everyone who helped me get to grips with R in (x)emacs
(I get confused easily). Special thanks to Stephen Eglen for continued
support.

My question is about non-linear binning, or density functions over
distributions governed by a power law ...

y ~ mu*x**lambda	# In one of its forms 
                        # (can't find Pareto in the online help)
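
For reference, a minimal sketch of drawing continuous power-law (Pareto) samples by inverse-transform sampling, assuming the form with CDF F(x) = 1 - (xmin/x)^alpha for x >= xmin. The function name rpareto_sketch and the parameters xmin and alpha are made up for illustration; they are not part of base R.

```r
# Inverse-transform sampling from a Pareto distribution.
# rpareto_sketch, xmin (scale) and alpha (shape) are illustrative names only.
rpareto_sketch <- function(n, xmin = 1, alpha = 1.5) {
  u <- runif(n)              # uniform on (0, 1)
  xmin * u^(-1 / alpha)      # solves 1 - F(x) = u for x
}

set.seed(1)
xp <- rpareto_sketch(10000)
range(xp)                    # every value is >= xmin
```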

Looking at the following should show my problem....

x3 <- runif(10000)**3	# Probably a better (correct) way to do this

plot( density(x3,cut=0,bw=0.1))
plot( density(x3,cut=0,bw=0.01))
plot( density(x3,cut=0,bw=0.001))

plot(density(x3,cut=0,bw=0.1),  log='xy')
plot(density(x3,cut=0,bw=0.01), log='xy')
plot(density(x3,cut=0,bw=0.001),log='xy')

The upper three plots show that bw has a big effect on the appearance of
the graph: the y axis is rescaled to fit the estimated density at low
values of x, which is very high.

The lower plots show (I think) an error in the use of linear bins to view
a non-linear trend. I would expect this curve to be linear on log-log
scales (from experience), and you can see the expected behavior in the
tails of these plots.

If you play with drawing these curves on top of each other they look OK
apart from at the beginning. However, changing the bandwidth to 0.0001 has
a radical effect on these plots, and they begin to show a different trend
(they look as if they are governed by a different exponent).

Hmmm....

x3log <- -log(x3)

plot( density(x3log,cut=0,bw=0.5),  log='y',col=1)

# lines() adds to the existing plot and inherits its log-y axis,
# so it takes no log argument
lines(density(x3log,cut=0,bw=0.2),  col=2)
lines(density(x3log,cut=0,bw=0.1),  col=3)
lines(density(x3log,cut=0,bw=0.01), col=4)

Sorry...


'Real' data of this form is usually discrete, with the value of 1 being
the most frequent (minimum) event, and higher values occurring less
frequently according to a power (power-law). This data can be easily
grouped into discrete bins, and frequency plotted on log scales. The
continuous data generated above requires some form of density estimation
or rescaling into discrete values (make the smallest value equal to 1 and
round everything else to an integer).

I see the aggregate function, but which function lets me simply count the
number of values in a class (integer bin)?
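
One possible way to do the counting, as a sketch: table() returns one count per distinct value. The data below are an assumption made only for illustration (floored Pareto-like samples, so integers >= 1 with 1 the most common), not real data.

```r
set.seed(1)
# Illustrative discrete heavy-tailed data: integers >= 1, with 1 most frequent.
xd <- floor(runif(10000)^(-1 / 1.5))

counts <- table(xd)          # one count per distinct integer value
head(counts)

# Frequency against value on log-log axes.
plot(as.integer(names(counts)), as.vector(counts),
     log = "xy", xlab = "value", ylab = "frequency")
```

tabulate(xd) is an alternative that returns counts for every integer from 1 to max(xd), including zeros for values that never occur.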

The analysis of even the discretized data is made more accurate by the use
of exponentially growing bins. This way you don't need to plot the data on
log scales, and the increasing variance associated with lower probability
events is handled by the increasing bin size (giving good accuracy of
power fitting). How can I easily (ignorantly) implement exponentially
increasing bin sizes?
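
A rough sketch of what I mean by exponentially growing bins, assuming bin edges at powers of two and using hist() only to do the counting. The data are the same illustrative floored samples as an assumption, and the log-log plot at the end is just for inspection.

```r
set.seed(1)
# Illustrative data (an assumption): integers >= 1 with a heavy tail.
xd <- floor(runif(10000)^(-1 / 1.5))

# Bin edges double each time: [1,2), [2,4), [4,8), ...
k <- ceiling(log2(max(xd) + 1))      # ensures 2^k exceeds max(xd)
breaks <- 2^(0:k)

# right = FALSE makes the bins left-closed; plot = FALSE returns counts only.
h <- hist(xd, breaks = breaks, right = FALSE, plot = FALSE)

# h$density is the count per bin normalised by sample size and bin width.
# Geometric bin centres suit a log axis better than h$mids.
centres <- sqrt(head(breaks, -1) * tail(breaks, -1))
ok <- h$density > 0                  # log axes cannot show empty bins
plot(centres[ok], h$density[ok], log = "xy",
     xlab = "value", ylab = "density")
```

Dividing by bin width matters here: raw counts in wider bins are inflated, so only the width-normalised density is comparable across bins.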

Thanks for any feedback,

Dan.



