[R] Histogram from frequency data in pre-made bins

Sun Aug 21 12:20:18 CEST 2011

Dear R users,
I am using UK census data on travel to work. The authorities have provided a
breakdown in each area by mode (car, bicycle etc.) and distance travelled (0
– 2 km, 2 – 5 km etc.). After processing, the data for Sheffield
look like this https://files.one.ubuntu.com/ej2VtVbJTEaelvMRlsocRg :

dshef <- read.table("distmodesheff.csv", sep=",", header=TRUE)
print(dshef)

     Dist  Tr Bici  Met  Pas  Foot   Bus   Car
1     2 >  45  571  491 2125 16644  4469 13494
2   2 – 5  80 1136 2540 4738  3659 17290 30212
3  5 – 10 217  466 2335 3994  1041 12963 35221
4 10 – 20 191   76  491 1333   332  2439 16322
5 20 – 30 168    6   25  235    41   175  3711
6 30 – 40  78    6    3  122    20    74  2179
7 40 – 60 349    6   21  261    96   333  3501
8    60 < 332   62  125  369   534   433  3276
9   Other 148   40   79  905   388   622  6481

It's interesting to look at the different distributions of different
transport modes: 

attach(dshef)
rs <- rbind(Tr,Bici,Met,Pas,Foot,Bus,Car)

barplot(rs, beside=TRUE, names.arg=Dist, col=rainbow(7), legend.text=TRUE)

http://r.789695.n4.nabble.com/file/n3758198/1.png 

This is brilliant, and creates output similar to that of OpenOffice Calc:

http://r.789695.n4.nabble.com/file/n3758198/egraphmini.jpg 

However, as you can see, the pre-made categories (0 – 2 km etc.) are
unevenly spaced bins of a continuous variable. This calls for a histogram,
in which frequency is represented by the area of each bar, not its height.
What I am looking for, for the vector Car for example, is something
like this:

n <- c(rep(1.5,Car[1]), rep(3,Car[2]), rep(7.7,Car[3]),
       rep(15,Car[4]), rep(25,Car[5]), rep(35,Car[6]),
       rep(50,Car[7]), rep(100,Car[8]))

hist(n, breaks=c(0,2,5,10,20,30,40,60,200))

http://r.789695.n4.nabble.com/file/n3758198/2.png 

This produces a histogram, but it's a tedious and ugly way of getting there.
Also, it does not allow for trend-line analysis of the likely distribution
of the continuous variable distance: lines(density(n)), for example, results
in peaks around my arbitrary values.
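
A possible shortcut, sketched here under the same assumptions as the hist()
call above (in particular the arbitrary upper limit of 200 km for the
open-ended last bin, and leaving out the "Other" row): the bar heights can be
computed straight from the counts as count / (total * bin width), so that the
areas sum to one, and then drawn with rect(). The names breaks, widths and
dens are only illustrative.

```r
# Car counts for the eight distance bins (the "Other" row is left out)
Car <- c(13494, 30212, 35221, 16322, 3711, 2179, 3501, 3276)

# Bin edges in km; the upper limit of 200 for the open "60 <" bin is an assumption
breaks <- c(0, 2, 5, 10, 20, 30, 40, 60, 200)
widths <- diff(breaks)

# Density height per bin: share of all trips divided by the bin width,
# so that area (height * width) is proportional to frequency
dens <- Car / (sum(Car) * widths)

# Draw the bars on a numeric distance axis
plot(range(breaks), c(0, max(dens)), type = "n",
     xlab = "Distance (km)", ylab = "Density", main = "Car")
rect(breaks[-length(breaks)], 0, breaks[-1], dens, col = "grey")
```

The heights should match those of hist(n, breaks=...) drawn as densities,
since they depend only on the counts and the breaks and not on the value
placed inside each bin.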

Has anyone else encountered similar issues? I've searched high and low but
can find no solution other than creating a barplot with variable widths:
http://r.789695.n4.nabble.com/Histogram-using-frequency-data-td827927.html

Any ideas about how to resolve this issue would be very greatly appreciated.
Eventually I hope to model the distribution of distances travelled in order
to estimate the mean distance within each bin.
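
One direction that might work for this last step is sketched below. It
assumes, purely for illustration, that car distances follow a log-normal
distribution, treats the last bin as open-ended (60 km and above) and ignores
the "Other" row. The parameters are fitted to the binned counts by maximum
likelihood with optim(), and the mean within each bin is then taken from the
closed-form partial expectation of the log-normal. The names negloglik,
bin_mean and est are hypothetical helpers, not part of any package.

```r
# Car counts for the eight distance bins (the "Other" row is left out)
Car <- c(13494, 30212, 35221, 16322, 3711, 2179, 3501, 3276)

# Bin edges in km; the last bin is treated as open-ended
lower <- c(0, 2, 5, 10, 20, 30, 40, 60)
upper <- c(2, 5, 10, 20, 30, 40, 60, Inf)

# Negative log-likelihood of the binned counts under a log-normal
# (the log-normal is an assumption, not something the data prescribe);
# par[2] is log(sdlog) so that sdlog stays positive
negloglik <- function(par) {
  p <- plnorm(upper, par[1], exp(par[2])) - plnorm(lower, par[1], exp(par[2]))
  -sum(Car * log(pmax(p, 1e-300)))  # guard against log(0)
}

fit <- optim(c(log(8), 0), negloglik)  # start: median 8 km, sdlog 1
meanlog <- fit$par[1]
sdlog <- exp(fit$par[2])

# Mean distance within a bin (a, b]: partial expectation of the log-normal
# over the bin, divided by the probability of falling in the bin
bin_mean <- function(a, b) {
  num <- exp(meanlog + sdlog^2 / 2) *
    (plnorm(b, meanlog + sdlog^2, sdlog) - plnorm(a, meanlog + sdlog^2, sdlog))
  num / (plnorm(b, meanlog, sdlog) - plnorm(a, meanlog, sdlog))
}
est <- mapply(bin_mean, lower, upper)
print(round(est, 2))
```

Whether a log-normal is adequate would need checking, for instance by
comparing the fitted bin probabilities with the observed shares.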

Many thanks, 

Robin
