[R] Violin plot of categorical/binned data

Brian Diggs diggsb at ohsu.edu
Tue Nov 6 23:30:16 CET 2012


On 11/3/2012 5:47 PM, Jim Lemon wrote:
> On 11/04/2012 06:27 AM, Nathan Miller wrote:
>> Hi,
>>
>> I'm trying to create a plot showing the density distribution of some
>> shipping data. I like the look of violin plots, but my data is not
>> continuous but rather binned, and I want to make sure its binned
>> nature (not smooth) is apparent in the final plot. So for example, I
>> have the number of individuals per vessel, but rather than having the
>> actual number of individuals I have data in the format of: 7 values
>> of zero, 11 values between 1-10, 6 values between 10-100, 13 values
>> between 100-1000, etc. To plot this data I generated a new dataset
>> with the first 7 values being 0, representing the 7 values of 0, the
>> next 11 values being 5.5, representing the 11 values between 1-10,
>> etc. Sample data below.
>>
>> I can make a violin plot (code below) using a log y-axis, which looks
>> alright (though I do have to deal with the zeros still), but in its
>> default format it hides the fact that these are binned data, which
>> seems a bit misleading. Is it possible to make a violin plot that
>> looks a bit more angular (more corners, less smoothing) or in some
>> way shows the distribution, but also clearly shows the true nature of
>> these data? I've tried playing with the bandwidth adjustment and the
>> kernel but haven't been able to get a figure that seems to work.
>>
>> Anyone have some thoughts on this?
>>
> Hi Nate,
> I'm not exactly sure what you are doing in the data transformation, but
> you can display this type of information as a single polygon for each
> instance (kiteChart) or separate rectangles (battleship.plot).
>
> library(plotrix)
> vessels<-matrix(c(zero=sample(1:10,5),one2ten=sample(5:20,5),
>   ten2hundred=sample(15:36,5),hundred2thousand=sample(10:16,5)),
>   ncol=4)
> battleship.plot(vessels,xlab="Number of passengers",
>   yaxlab=c("Barnacle","Maelstrom","Poopdeck","Seasick","Wallower"),
>   xaxlab=c("0","1-10","10-100","100-1000"))
> kiteChart(vessels,xlab="Number of passengers",ylab="Vessel",
>   varlabels=c("Barnacle","Maelstrom","Poopdeck","Seasick","Wallower"),
>   timelabels=c("0","1-10","10-100","100-1000"))
>
> Jim

Expanding on the idea of the battleship.plot, you can draw rectangles of 
the right width with ggplot2 if you want.

Original data:

data2 <- read.csv(text=
"count,bin
7,0
11,1-10
6,11-100
13,101-1000
7,1001-10000
3,10001-100000
2,100001-1000000")
data2$bin <- ordered(data2$bin,
                     levels=c("0","1-10","11-100","101-1000",
                              "1001-10000","10001-100000",
                              "100001-1000000"))

Define the lower and upper reaches of each bin:

data2$low <- c(0,1,11,101,1001,10001,100001)
data2$high <- c(0,10,100,1000,10000,100000,1000000)
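
If typing the bounds by hand gets tedious, they could also be derived 
from the bin labels. The following is only a sketch: bin_bounds is a 
made-up helper (not part of any package), and it assumes every label is 
either a single number or two numbers separated by a hyphen.

```r
# Hypothetical helper (not from any package): split "a-b" labels into
# numeric lower and upper bounds. A lone value such as "0" is treated
# as a bin whose lower and upper bounds coincide.
bin_bounds <- function(labels) {
  parts <- strsplit(as.character(labels), "-", fixed = TRUE)
  low  <- vapply(parts, function(p) as.numeric(p[1]), numeric(1))
  high <- vapply(parts, function(p) as.numeric(p[length(p)]), numeric(1))
  data.frame(low = low, high = high)
}

# The same labels as in the data above, repeated so this runs on its own.
labels <- c("0", "1-10", "11-100", "101-1000", "1001-10000",
            "10001-100000", "100001-1000000")
bounds <- bin_bounds(labels)
```

With data2 as defined above, data2[c("low","high")] <- bin_bounds(data2$bin) 
should give the same two columns.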

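Then make several copies, one per vessel (or whatever the grouping is), 
with the counts randomly perturbed so the vessels differ:

data3 <- rbind(data2, data2, data2, data2)
data3$vessel <- rep(c("Barnacle","Maelstrom","Poopdeck","Seasick"),
                     each=7)
# shift each count by a random integer in -5..5; abs() keeps it non-negative
data3$count <- abs(data3$count + sample(-5:5, 7*4, replace=TRUE))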

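With each bin drawn at the same height, regardless of its extent:

library(ggplot2)

ggplot(data3) +
   # geom_blank sets up the discrete y scale from the bin labels
   geom_blank(aes(x=count/2, y=bin)) +
   # rectangles are centered on x=0 and one bin tall
   geom_rect(aes(ymin=as.numeric(bin)-0.5, ymax=as.numeric(bin)+0.5,
                 xmin = -count/2, xmax = count/2)) +
   facet_grid(~vessel)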

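Alternatively, base the width (height, really) of each rectangle on the 
range of its bin. Because the scale is logarithmic and the bins grow 
exponentially, the rectangles come out nearly the same height (with 
small gaps between them, since the bins are discrete: 10 to 11, 100 to 
101, and so on). Because it is a log scale, the 0 bin is still a problem.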

ggplot(data3) +
   geom_rect(aes(ymin=low, ymax=high, xmin=-count/2, xmax=count/2)) +
   facet_grid(~vessel) +
   scale_y_log10()
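
One possible way around the 0 problem, offered only as a sketch: give 
the zero bin an artificial band below 1 and label that band "0" on the 
axis. The band limits (0.1 and 0.5) and the ymin/ymax column names are 
arbitrary choices of mine, not anything the data dictate, and a single 
vessel is used to keep it short.

```r
# Same bins as in the post, rebuilt here so the sketch runs on its own.
bins <- data.frame(
  count = c(7, 11, 6, 13, 7, 3, 2),
  low   = c(0, 1, 11, 101, 1001, 10001, 100001),
  high  = c(0, 10, 100, 1000, 10000, 100000, 1000000)
)

# Assumption: draw the zero bin as an arbitrary band between 0.1 and 0.5.
# These two values are placeholders chosen only to sit below 1.
zero_band <- c(0.1, 0.5)
is_zero <- bins$high == 0
bins$ymin <- ifelse(is_zero, zero_band[1], bins$low)
bins$ymax <- ifelse(is_zero, zero_band[2], bins$high)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bins) +
    geom_rect(aes(ymin = ymin, ymax = ymax,
                  xmin = -count / 2, xmax = count / 2)) +
    # Put a break at the geometric middle of the placeholder band and
    # label it "0" so the axis does not suggest fractional counts.
    scale_y_log10(breaks = c(sqrt(prod(zero_band)), 10^(0:6)),
                  labels = c("0", format(10^(0:6), scientific = FALSE,
                                         trim = TRUE)))
  print(p)
}
```

The zero band then sits visibly apart from the 1-10 bin, which at least 
makes it clear that it is a separate category rather than part of the 
log scale.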




-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University



