[R] Analysis of pre-calculated frequency distribution?

Sun Nov 21 19:12:16 CET 2004

On Sun, 21 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:

>On 21-Nov-04 Dan Bolser wrote:
>> 
>> Sorry for the dumb question, but I can't work out how to do this. 
>> 
>> Quick version, 
>> 
>> How can I re-bin a given frequency distribution using new breaks
>> without reference to the original data? Given distribution has
>> integer valued bins.
>> 
>> 
>> Long version,
>> 
>> I am loading a frequency table into R from a file. The original
>> data is very large, and it is a very simple process to get a
>> frequency distribution from an SQL database, so in all this is
>> a convenient method for me. Point being I don't start with 'raw' data.
>> 
>> The data looks like this...
>> 
>>> dat
>>              COUNT FREQUENCY
>> 1                1 5734
>> 2                2 1625
>> [...]
>> 365           9442    1
>> 366          12280    1
>> 
>> [...]
>> 
>> People typically quote the curve in the form of y ~ Cx^(-a).
>> I want to use the binning method of parameter estimation given here...
>> 
>> http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm
>> 
>> (bin the data with exponentially increasing bin widths within the data
>> range).
>> 
>> But I can't work out how to re-bin my existing frequency data.
>
>Hi Dan,
>Your starting point can be the fact that the number of cases
>with property i ("in class i") is COUNT_i * FREQUENCY_i.
>
>So if you construct a vector with these numbers in it you have
>in effect reconstructed the original data.
>
>I.e.  N[i] <- COUNT[i]*FREQUENCY[i]

Cheers for this, I was trying this, but my results looked wrong with
respect to the data shown on the webpage cited above.

Thanks to James Holtman for the other suggestion - 

My confusion was coming from thinking I had to use hist, but in fact cut +
tapply was the ticket.
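
For the archives, a minimal sketch of the cut + tapply idea. The toy table,
the powers-of-two breaks and the names n_bins, breaks, bin and binned are
all made up for illustration, and I am assuming the quantity to total per
bin is FREQUENCY; substitute COUNT*FREQUENCY if that is what you want to
tally.

```r
# Toy frequency table standing in for 'dat' (values are made up)
dat <- data.frame(COUNT     = c(1, 2, 3, 4, 5, 8, 20, 100),
                  FREQUENCY = c(50, 20, 10, 6, 4, 2, 1, 1))

# Exponentially increasing breaks: (0,1], (1,2], (2,4], (4,8], ...
n_bins <- ceiling(log2(max(dat$COUNT)))
breaks <- c(0, 2^(0:n_bins))

# Assign each COUNT class to a bin, then sum the frequencies per bin
bin    <- cut(dat$COUNT, breaks = breaks)
binned <- tapply(dat$FREQUENCY, bin, sum)
binned[is.na(binned)] <- 0   # bins containing no classes come back as NA

binned
```

A quick sanity check is that sum(binned) equals sum(dat$FREQUENCY), i.e.
nothing is lost or counted twice by the re-binning.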

Cheers,
Dan.

>
>which can be done in one stroke with N <- COUNT*FREQUENCY
>
>One way (and maybe others can suggest better) to bin these
>classes non-uniformly could be:
>
>  Say you have k "upper" breakpoints for your k bins,
>  say BP, so that e.g. if BP[1] = 2 then there are N[1]+N[2]
>  cases with class <= 2, and if BP[2] = 5 then there are
>  N[3] + N[4] + N[5] cases with class > 2 and class <= 5,
>  and so on. In your case BP[k] = 366.
>
>  Let
>
>    csN <- cumsum(N)
>
>  Then (if I've not overlooked something)
>
>    diff(c(0,csN[BP]))
>
>  will give you the counts in your new bins.
>
>E.g. (just to show it should work):
>
>  > N<-rep(1,31)
>  > BP<-c(1,3,7,15,31)
>  > csN <- cumsum(N)
>  > diff(c(0,csN[BP]))
>  [1]  1  2  4  8 16
>
>
>  > BP<-c(2,3,5,9,17,31)
>  > diff(c(0,csN[BP]))
>  [1]  2  1  2  4  8 14
>
>I hope this matches the sort of thing you have in mind!
>Ted.
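
One caveat when applying this to a table like mine: csN[BP] indexes by row
position, so it only matches the class values when the classes run 1, 2,
3, ... without gaps. A hedged sketch of one way around that, using
findInterval to turn class-valued breakpoints into row positions (the toy
values and the names idx and new_counts are my own, not from Ted's post):

```r
# Toy table; the COUNT classes are deliberately non-contiguous (made-up values)
COUNT     <- c(1, 2, 3, 4, 5, 8, 20, 100)
FREQUENCY <- c(50, 20, 10, 6, 4, 2, 1, 1)
N <- COUNT * FREQUENCY    # N as defined in the quoted message

BP  <- c(1, 2, 4, 8, 16, 32, 64, 128)   # upper breakpoints as class values
csN <- cumsum(N)

# Number of rows with COUNT <= each breakpoint (COUNT must be sorted ascending)
idx <- findInterval(BP, COUNT)

# Prepend 0 so that idx == 0 (no rows at or below the breakpoint) maps to 0
new_counts <- diff(c(0, c(0, csN)[idx + 1]))
new_counts
```

Bins that contain no classes simply come out as 0, and sum(new_counts)
should equal sum(N).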
>
>
>--------------------------------------------------------------------
>E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
>Fax-to-email: +44 (0)870 094 0861  [NB: New number!]
>Date: 21-Nov-04                                       Time: 16:47:05
>------------------------------ XFMail ------------------------------
>