[R] Excluding "small data" from plot.
Kieran
kroberts012 at gmail.com
Wed Feb 17 10:19:47 CET 2016
To R-help users:
I want to use ggplot to plot summary statistics on the frequency of
letters from a page of text. My data frame has four columns:
(1) The line number [1 to 30]
(2) The letter [a to z]
(3) The frequency of the letter [assuming there are 80 letters per line]
(4) The factor 'type': bad or good (purely artificial factor)
I want to achieve the following plot:
(a) A bar plot with the letters on the x-axis and, on the y-axis, the sum
of each letter's frequencies over the 30 lines.
(b) Split each bar (for a letter) into two bars for 'good' and 'bad' types.
(c) Display the union of the 8 most frequently used letters of each type,
'good' and 'bad'.
By point (c) I mean: if a,e,f,h,i,t,s,r are the most frequent letters of type
'good' and a,e,f,h,i,m,l,p are the most frequent letters of type 'bad', then
I would like my plot to feature the letters a,e,f,h,i,t,s,r,m,l,p.
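To make the selection in point (c) concrete, here is a minimal sketch on toy
data. The data frame 'toy', its counts, and the helper name 'top_letters' are
made up for illustration and are not part of my real data; I use the top 2
letters per type instead of the top 8 to keep the example small.

```r
# Toy data (hypothetical counts, not the real letter frequencies)
toy <- data.frame(
  letter = c("a", "b", "c", "d", "a", "b", "c", "d"),
  type   = c("good", "good", "good", "good", "bad", "bad", "bad", "bad"),
  count  = c(10, 7, 2, 1, 9, 1, 8, 2)
)

# Sum the counts for each (letter, type) pair
totals <- aggregate(count ~ letter + type, data = toy, FUN = sum)

# Hypothetical helper: the n most frequent letters within one type
top_letters <- function(d, n) {
  d <- d[order(-d$count), ]
  head(as.character(d$letter), n)
}

# Top 2 letters per type (the real plot would use n = 8), then their union
per_type <- lapply(split(totals, totals$type), top_letters, n = 2)
keep <- sort(unique(unlist(per_type)))
keep
```

With these toy counts, 'keep' holds a, b and c: b is only in the top 2 for
'good' and c is only in the top 2 for 'bad', but both should appear in the plot.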
Here is my code:
# There will be 30 lines and we want to record the frequency of each letter
# on each line.
lines <- c(rep(1:30, each=26))
letter <- c(rep(letters, times=30))
# We have taken the letter frequencies from
# http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
freq <- c(8.12, 1.49, 2.71, 4.32, 12.02, 2.30, 2.03, 5.92, 7.31, 0.10, 0.69,
3.98, 2.61, 6.95, 7.68, 1.82, 0.11, 6.02, 6.28, 9.10, 2.88, 1.11, 2.09, 0.17,
2.11, 0.07)
freq <- freq/100
# We assume each line contains 80 letters and change the seed for each line
# for variability.
letterfreq <- integer()
for (i in 1:30) {
  set.seed(i)
  s <- data.frame(sample(letters, size = 80, replace = TRUE, prob = freq))
  names(s) <- "ltr"
  s$ltr <- factor(s$ltr, levels = letters)
  frq <- as.data.frame(table(s))
  letterfreq <- append(letterfreq, frq$Freq)
}
ltrfreq <- data.frame(lines, letter, letterfreq)
# Add an artificial factor column _type_: good/bad. So each pair
# (line, letter) has type 'good' or 'bad' with equal probability.
# Set the seed for reproducibility.
set.seed(999)
ltrfreq$type <- factor(sample(c("good","bad"), size = 780, replace = TRUE,
prob = c(0.5,0.5)))
# Here is the plot I want, but it includes all 26 letters.
library(ggplot2)
# Note: ggplot2 >= 3.3.0 names the fun.y argument `fun`.
ggplot(ltrfreq, aes(x = factor(letter), y = letterfreq, fill = type, color = type)) +
  stat_summary(fun.y = sum, position = position_dodge(), geom = "bar")
Best regards,
Kieran.