[R] Fitting pareto distribution / plotting observed & fitted dists
Stefan Sobernig
stefan.sobernig at wu.ac.at
Fri Feb 15 14:13:05 CET 2013
Some background: I have some data on structural dependencies in a base
of code artifacts. The dependency structure is reflected in terms of
relative node degrees, with each node representing some code unit (just
as an example).
This gives me real data of the following form (sorry for the longish
posting):
dat1 <-
c(0.00245098039215686, 0, 0, 0, 0, 0, 0, 0, 0.0563725490196078,
0, 0, 0, 0.071078431372549, 0, 0, 0, 0, 0, 0, 0, 0.00490196078431373,
0.00245098039215686, 0, 0, 0, 0.00245098039215686, 0.0196078431372549,
0, 0.0588235294117647, 0.00490196078431373, 0.00980392156862745,
0, 0, 0.00245098039215686, 0, 0, 0, 0.0220588235294118, 0, 0,
0.00245098039215686, 0, 0, 0.00245098039215686, 0, 0.00245098039215686,
0.00980392156862745, 0.0294117647058824, 0.0637254901960784,
0, 0, 0, 0.00245098039215686, 0.0392156862745098, 0, 0.0147058823529412,
0, 0.017156862745098, 0.00245098039215686, 0, 0.00980392156862745,
0, 0.00735294117647059, 0.00490196078431373, 0.0514705882352941,
0.00245098039215686, 0, 0, 0, 0, 0.00245098039215686, 0,
0.00245098039215686,
0, 0.00245098039215686, 0, 0, 0.0147058823529412, 0.0367647058823529,
0.0269607843137255, 0.0269607843137255, 0.00735294117647059,
0.0441176470588235, 0, 0, 0.0196078431372549, 0, 0.00490196078431373,
0.0245098039215686, 0.00490196078431373, 0.00490196078431373,
0.0196078431372549, 0, 0.0318627450980392, 0.0245098039215686,
0, 0.00245098039215686, 0, 0.0416666666666667, 0, 0, 0.00490196078431373,
0.00490196078431373, 0.00245098039215686, 0.0122549019607843,
0.00490196078431373, 0.00490196078431373, 0.071078431372549,
0, 0, 0, 0, 0.00245098039215686, 0.00245098039215686, 0.00490196078431373,
0.0343137254901961, 0.00980392156862745, 0.00245098039215686,
0.053921568627451, 0, 0.0245098039215686, 0.00245098039215686,
0.0245098039215686, 0, 0.0294117647058824, 0.00490196078431373,
0.00980392156862745, 0.0367647058823529, 0, 0, 0.017156862745098,
0, 0.0245098039215686, 0, 0.071078431372549, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00980392156862745, 0.00980392156862745,
0.00490196078431373, 0.0343137254901961, 0.0147058823529412,
0.0122549019607843, 0.00735294117647059, 0, 0.00245098039215686,
0, 0.00245098039215686, 0.110294117647059, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0.00490196078431373, 0.00735294117647059, 0, 0.00245098039215686,
0, 0, 0.00735294117647059, 0.0735294117647059, 0, 0, 0, 0, 0,
0.00490196078431373, 0, 0.00245098039215686, 0.105392156862745,
0, 0, 0, 0, 0, 0.00735294117647059, 0.00980392156862745,
0.00245098039215686,
0.0147058823529412, 0, 0, 0.00245098039215686, 0.00490196078431373,
0.00245098039215686, 0.00735294117647059, 0.0563725490196078,
0, 0, 0, 0, 0.0318627450980392, 0, 0, 0.00490196078431373,
0.0465686274509804,
0, 0.0147058823529412, 0.017156862745098, 0.00735294117647059,
0.0245098039215686, 0.017156862745098, 0, 0, 0, 0.071078431372549,
0, 0, 0, 0, 0, 0, 0, 0.00980392156862745, 0.0122549019607843,
0.00980392156862745, 0.0196078431372549, 0, 0, 0, 0.00980392156862745,
0.00490196078431373, 0.00735294117647059, 0, 0.0196078431372549,
0.0220588235294118, 0, 0, 0.00490196078431373, 0.0661764705882353,
0, 0, 0, 0, 0, 0, 0, 0, 0.00490196078431373, 0.0955882352941176,
0, 0, 0, 0, 0, 0.00490196078431373, 0.00490196078431373, 0,
0.00245098039215686,
0.0588235294117647, 0, 0, 0, 0, 0.0857843137254902, 0, 0, 0,
0, 0.017156862745098, 0, 0, 0.0294117647058824, 0, 0, 0.0122549019607843,
0, 0, 0.0147058823529412, 0, 0, 0.0318627450980392, 0, 0,
0.00245098039215686,
0.00245098039215686, 0.00980392156862745, 0.00245098039215686,
0.00245098039215686, 0.00245098039215686, 0.0784313725490196,
0, 0, 0.0784313725490196, 0, 0, 0, 0.00735294117647059,
0.00245098039215686,
0.00490196078431373, 0.00245098039215686, 0, 0.00245098039215686,
0, 0.0122549019607843, 0, 0.0857843137254902, 0, 0, 0, 0,
0.0196078431372549,
0, 0.00980392156862745, 0.00245098039215686, 0, 0, 0.00490196078431373,
0.00735294117647059, 0.00735294117647059, 0.00735294117647059,
0.00735294117647059, 0.00735294117647059, 0.00735294117647059,
0.00735294117647059, 0, 0, 0.00735294117647059, 0.00980392156862745,
0.0122549019607843, 0.0245098039215686, 0.00980392156862745,
0.00245098039215686, 0.00490196078431373, 0.00490196078431373,
0.00245098039215686, 0.00245098039215686, 0.00735294117647059,
0.00490196078431373, 0, 0.00245098039215686, 0.017156862745098,
0.00490196078431373, 0.00490196078431373, 0.00245098039215686,
0.00245098039215686, 0.0269607843137255, 0, 0.00490196078431373,
0.00490196078431373, 0.00490196078431373, 0.00980392156862745,
0.00735294117647059, 0.00980392156862745, 0.0245098039215686,
0, 0.0563725490196078, 0, 0, 0, 0, 0, 0, 0.053921568627451, 0,
0, 0.0220588235294118, 0, 0, 0, 0.00245098039215686, 0, 0, 0,
0, 0.00490196078431373, 0.00735294117647059, 0.00735294117647059,
0.0122549019607843, 0.0147058823529412, 0, 0.00980392156862745,
0, 0, 0.0196078431372549, 0, 0, 0.0122549019607843, 0, 0,
0.0147058823529412,
0, 0.00490196078431373, 0.0220588235294118, 0, 0.00980392156862745,
0.0220588235294118, 0.0269607843137255, 0)
Out of curiosity, I plotted the double-log CDF and the Pareto Q-Q plots.
This provided me with a visual cue of a potential power-law distribution
(at least, that's what I am boldly thinking).
So I went ahead and performed the procedure as suggested by Clauset,
Shalizi, and Newman (2007/09), by reusing the essentials found in
http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.r.
I obtained and tested the following fitting estimates based on dat1:
xmin <- 0.01715686
alpha <- 2.381581
I then simply wanted to print the observed and fitted dists in one plot,
so I ran:
library(ggplot2)
library(VGAM)
dat2 <- rpareto(length(dat1), location=xmin, shape=alpha)
hi1 <- hist(dat1, plot=FALSE, breaks="FD")
hi2 <- hist(dat2, plot=FALSE, breaks="FD")
y1 <- hi1$counts/sum(hi1$counts)
y2 <- hi2$counts/sum(hi2$counts)
x1 <- hi1$mids
x2 <- hi2$mids
qplot() + geom_line(aes(x=x1,y=y1)) + geom_line(aes(x=x2,y=y2), color="red")
y1.c <- rev(cumsum(rev(y1)))
y2.c <- rev(cumsum(rev(y2)))
qplot() + geom_line(aes(x=x1,y=y1.c)) + geom_line(aes(x=x2,y=y2.c),
color="red")
In the plot, the fitted dist line, while ressembling the observed, is
shifted to the right; IMO because of the xmin/location parameter, correct?
Is it appropriate to correct for xmin like so?
x2 <- x2 - xmin
Am I missing anything? Do I need to be more careful at an earlier stage
(e.g., breaks as this is binned data)?
I apologize in advance, I am a statistical autodidact, so I might just
be confusing the obvious.
I'd highly appreciate your hints :)
Stefan
--
Institute for Information Systems and New Media
Vienna University of Economics and Business
Augasse 2-6, A-1090 Vienna
`- http://nm.wu.ac.at/en/sobernig
`- stefan.sobernig at wu.ac.at
`- ss at thinkersfoot.net
`- +43-1-31336-4878 [phone]
`- +43-1-31336-746 [fax]
More information about the R-help
mailing list