# [R] Analysis of pre-calculated frequency distribution?

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Sun Nov 21 17:47:05 CET 2004

```On 21-Nov-04 Dan Bolser wrote:
>
> Sorry for the dumb question, but I can't work out how to do this.
>
> Quick version,
>
> How can I re-bin a given frequency distribution using new breaks
> without reference to the original data? Given distribution has
> integer valued bins.
>
>
> Long version,
>
> I am loading a frequency table into R from a file. The original
> data is very large, and it is a very simple process to get a
> frequency distribution from an SQL database, so in all this is
> a convenient method for me. Point being I don't start with 'raw' data.
>
> The data looks like this...
>
>> dat
>              COUNT FREQUENCY
> 1                1 5734
> 2                2 1625
> [...]
> 365           9442    1
> 366          12280    1
>
> [...]
>
> People typically quote the curve in the form of y ~ Cx^(-a).
> I want to use the binning method of parameter estimation given here...
>
> http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm
>
> (bin the data with exponentially increasing bin widths within the data
> range).
>
> But I can't work out how to re-bin my existing frequency data.

Hi Dan,
Your starting point can be the fact that the number of cases
with property i ("in class i") is COUNT_i * FREQUENCY_i

So if you construct a vector with these numbers in it you have
in effect reconstructed the original data.

I.e.  N[i] <- COUNT[i]*FREQUENCY[i]

which can be done in one stroke with N <- COUNT*FREQUENCY

One way (and maybe others can suggest better) to bin these
classes non-uniformly could be:

Say you have k "upper" breakpoints for your k bins,
say BP, so that e.g. if BP[1] = 2 then there are N[1]+N[2]
cases with class <= 2, and if BP[2] = 5 then there are
N[3] + N[4] + N[5] cases with class > 2 and class <= 5,
and so on. In your case BP[k] = 366.

Let

csN <- cumsum(N)

Then (if I've not overlooked something)

diff(c(0,csN[BP]))

will give you the counts in your new bins.

E.g. (just to show it should work):

> N<-rep(1,31)
> BP<-c(1,3,7,15,31)
> csN <- cumsum(N)
> diff(c(0,csN[BP]))
[1]  1  2  4  8 16

> BP<-c(2,3,5,9,17,31)
> diff(c(0,csN[BP]))
[1]  2  1  2  4  8 14

I hope this matches the sort of thing you have in mind!
Ted.


```
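The re-binning described in the reply can be sketched end to end in R. This is only an illustration under stated assumptions: the small `dat` table is made up (the real one has 366 rows), the doubling breakpoints in `upper` are one possible choice of "exponentially increasing" bins, and using `findInterval()` to turn breakpoints on the COUNT scale into row positions is an addition, since `csN[BP]` indexes rows of the table rather than COUNT values.

```r
# Hypothetical stand-in for the poster's table (not the real data).
# COUNT is the class value, FREQUENCY is how often that class occurs.
dat <- data.frame(COUNT     = c(1, 2, 3, 5, 8, 13, 40),
                  FREQUENCY = c(50, 20, 10, 6, 3, 2, 1))

# Cases per class, as defined in the reply: N <- COUNT * FREQUENCY.
N <- dat$COUNT * dat$FREQUENCY

# Assumed upper breakpoints on the COUNT scale, doubling each time:
# 1, 2, 4, 8, ... up to the first power of two >= max(COUNT).
upper <- 2^(0:ceiling(log2(max(dat$COUNT))))

# Translate each breakpoint into a row position: the number of rows
# whose COUNT is <= that breakpoint (0 if there are none).
# dat$COUNT must be sorted in increasing order for findInterval().
BP <- findInterval(upper, dat$COUNT)

# Cumulative sums, with a leading 0 so that BP == 0 is handled safely.
csN <- cumsum(N)
binned <- diff(c(0, c(0, csN)[BP + 1]))

data.frame(upper = upper, cases = binned)
```

When the table has one row for every integer class from 1 upwards, `BP` equals the breakpoints themselves and this reduces to `diff(c(0, csN[BP]))` from the reply. Empty bins come out as 0, and `sum(binned)` should equal `sum(N)` as long as the last breakpoint covers the largest COUNT.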