[R] Analysis of pre-calculated frequency distribution?
Dan Bolser
dmb at mrc-dunn.cam.ac.uk
Sun Nov 21 15:35:07 CET 2004
Sorry for the dumb question, but I can't work out how to do this.
Quick version,
How can I re-bin a given frequency distribution using new breaks, without
reference to the original data? The given distribution has integer-valued
bins.
Long version,
I am loading a frequency table into R from a file. The original data set is
very large, and it is very simple to get a frequency distribution from an
SQL database, so overall this is a convenient method for me. The point is
that I don't start with 'raw' data.
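A minimal sketch of the loading step, assuming the exported file is
whitespace-separated with a header line. The inline text below is a
hypothetical stand-in for the real file, using the first few rows shown in
this post.

```r
# Hypothetical stand-in for the exported file: a header line, then one
# "COUNT FREQUENCY" pair per row.
txt <- "COUNT FREQUENCY
1 5734
2 1625
3 793
4 480
5 294"

# For a real file, pass its path as the first argument instead of text = txt.
dat <- read.table(text = txt, header = TRUE)
str(dat)
```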
The data looks like this...
> dat
COUNT FREQUENCY
1 1 5734
2 2 1625
3 3 793
4 4 480
5 5 294
6 6 237
7 7 205
8 8 200
9 9 123
10 10 108
11 11 90
12 12 62
13 13 60
14 14 68
15 15 64
16 16 56
17 17 68
18 18 45
19 19 38
20 20 37
21 21 29
22 22 39
23 23 35
24 24 33
25 25 36
...
148 153 5
149 156 2
150 157 3
151 158 2
152 159 2
153 162 1
154 163 3
155 164 3
156 165 2
157 166 1
158 168 2
159 169 4
160 170 1
...
354 2106 1
355 2189 1
356 2194 1
357 2217 1
358 2246 1
359 2474 1
360 2801 1
361 3697 1
362 3702 1
363 7353 1
364 8738 1
365 9442 1
366 12280 1
This is a typical 'count / frequency' distribution in biology, where low
counts of a certain property are very frequent (across genomes, proteins,
ecosystems, etc...), and high counts of a certain property are very
rare.
In the above example a certain property occurs 12280 times with a
frequency of 1, another property occurs 9442 times with the same
frequency. At the other extreme, a certain property occurs once
with a frequency of 5734, and another property occurs twice with a
frequency of 1625.
This kind of distribution is variously known as a "zipf", a "power law", a
"Pareto", "scale free", "heavy tailed" or an "80:20" distribution, or
colloquially "the dominance of the few over the many". The term I choose is
a "log linear" distribution, because that makes no assumptions about the
underlying cause of the overall shape.
People typically quote the curve in the form of y ~ Cx^(-a). I want to use
the binning method of parameter estimation given here...
http://www.ece.uc.edu/~annexste/Courses/cs690/Zipf,%20Power-law,%20Pareto%20-%20a%20ranking%20tutorial.htm
(bin the data with exponentially increasing bin widths within the data
range).
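To make the form y ~ Cx^(-a) concrete: taking logs gives
log(y) = log(C) - a*log(x), so a straight-line fit on log-log axes should
return -a as the slope. The sketch below uses synthetic, noise-free points
(C and a are made-up values, not estimates from my table) purely to show
the mechanics; it is not necessarily the exact procedure in the tutorial.

```r
# Synthetic, noise-free data following y = C * x^(-a). C and a are made-up
# values for illustration only.
C <- 1000
a <- 1.5
x <- 2^(0:9)        # points evenly spaced on a log axis
y <- C * x^(-a)

# log(y) = log(C) - a * log(x): the slope of the fitted line is -a and
# the intercept is log(C).
fit   <- lm(log(y) ~ log(x))
a_hat <- -unname(coef(fit)[2])
C_hat <- exp(unname(coef(fit)[1]))
```

With real binned data, x would be a representative value for each bin and
y the bin's frequency per unit width, and the fit would no longer be exact.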
But I can't work out how to re-bin my existing frequency data.
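One possible approach, as a sketch only: since every COUNT value is an
integer, each existing bin can be assigned to a new, wider bin with cut(),
and the frequencies summed per new bin with tapply(). The data frame below
is a stand-in holding only the first eight rows of my table, the doubling
breaks are just one choice of exponentially increasing widths, and dividing
by bin width is an assumed normalisation.

```r
# Stand-in for 'dat': only the first eight rows of the table.
dat <- data.frame(COUNT     = 1:8,
                  FREQUENCY = c(5734, 1625, 793, 480, 294, 237, 205, 200))

# Breaks that double in width: (0,1], (1,2], (2,4], (4,8], ...
n_doublings <- ceiling(log2(max(dat$COUNT)))
breaks <- c(0, 2^(0:n_doublings))

# Assign each COUNT value to a new bin, then sum the frequencies per bin.
bin    <- cut(dat$COUNT, breaks = breaks)
binned <- tapply(dat$FREQUENCY, bin, sum)
binned[is.na(binned)] <- 0    # new bins containing no COUNT values

# Frequency per unit of COUNT, so wide bins are comparable with narrow ones.
per_unit <- binned / diff(breaks)
```

No raw data is needed here, because the new breaks fall on integer
boundaries and so never split one of the original bins.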
Sorry for the long question,
all the best
Dan.