[R] Analysing Character Strings for subsequent frequency analysis
Marc Schwartz
marc_schwartz at me.com
Thu Dec 30 21:59:55 CET 2010
On Dec 30, 2010, at 12:03 PM, bob stoner wrote:
> Hi
> I'm trying to get to grips with R and establish R as a teaching medium in my secondary school. I would like to use R to analyse text so I can produce frequency analysis of the text for subsequent examination of ciphers. I can produce code in VBA but I am struggling when writing in R to examine each character. There must be a clear method using the vectorised format of R. Furthermore, how do you substr a text string and reference each letter? I can use nchar to see how many letters per string but not to select each letter. I would prefer to remain in R and not deviate to Python etc as getting R onto the school mainframe has been a long journey...
> Many thanks
>
> Bob Stoner
> Sleaford, Lincolnshire, UK
There are likely to be some text analysis packages on CRAN, but here is a basic approach to generating a frequency table of the characters in a vector:
Vec <- "The lazy brown fox"
# See ?strsplit, which returns a list
> strsplit(Vec, "")
[[1]]
[1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"
# Get the first list element
> strsplit(Vec, "")[[1]]
[1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"
# Where are the o's in the vector?
> which(strsplit(Vec, "")[[1]] == "o")
[1] 12 17
# Generate the frequency table of characters
# (the first column, which has a blank name, counts the spaces)
> table(strsplit(Vec, "")[[1]])
  a b e f h l n o r T w x y z
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1
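For cipher work you will probably want to ignore case, drop spaces and punctuation, and see all 26 letters in the table even when a letter does not occur. Here is a minimal sketch of one way to do that, assuming plain English text; the names chars and letter_counts are just ones I picked for illustration:

```r
Vec <- "The lazy brown fox"

# Lower-case the text and split it into single characters
chars <- strsplit(tolower(Vec), "")[[1]]

# Keep only a-z; 'letters' is R's built-in vector of lower-case letters
chars <- chars[chars %in% letters]

# Tabulate against all 26 letters so absent letters get a count of 0
letter_counts <- table(factor(chars, levels = letters))

# Relative frequencies, most common first
sort(prop.table(letter_counts), decreasing = TRUE)
```

Using factor() with fixed levels keeps the tables from different texts the same length, which should make them easier to compare side by side.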
Now, let's say that Vec has multiple elements, perhaps the result of using readLines() on a text file:
Vec <- c("The lazy brown fox", "jumped over the fence")
> strsplit(Vec, "")
[[1]]
[1] "T" "h" "e" " " "l" "a" "z" "y" " " "b" "r" "o" "w" "n" " " "f" "o"
[18] "x"
[[2]]
[1] "j" "u" "m" "p" "e" "d" " " "o" "v" "e" "r" " " "t" "h" "e" " " "f"
[18] "e" "n" "c" "e"
# Use lapply() to loop over each list element returned by strsplit()
# generating a frequency table for each
> lapply(strsplit(Vec, ""), table)
[[1]]
  a b e f h l n o r T w x y z
3 1 1 1 1 1 1 1 2 1 1 1 1 1 1
[[2]]
  c d e f h j m n o p r t u v
3 1 1 5 1 1 1 1 1 1 1 1 1 1 1
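If you would rather have a single table for the whole text instead of one per line, you can flatten the list with unlist() before tabulating. A short sketch, with all_chars and total_counts as illustrative names only:

```r
Vec <- c("The lazy brown fox", "jumped over the fence")

# Flatten the list returned by strsplit() into one character vector
all_chars <- unlist(strsplit(Vec, ""))

# One frequency table covering every element of Vec
total_counts <- table(all_chars)
total_counts
```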
# Get the first 4 characters of each element
# See ?substr
> substr(Vec, 1, 4)
[1] "The " "jump"
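As for referencing each letter with substr: give it the same start and stop position to pick out a single character. The related substring() is vectorised over its start and stop arguments, so it can return every character in one call. A small sketch, where x, pos and each_char are names I made up for the example:

```r
x <- "The lazy brown fox"

# The 5th character only
substr(x, 5, 5)

# Every character, one per element, using vectorised start/stop positions
pos <- seq_len(nchar(x))
each_char <- substring(x, pos, pos)
each_char

# This should give the same vector as the strsplit() approach
identical(each_char, strsplit(x, "")[[1]])
```

In practice strsplit() is usually the more convenient of the two, but substring() is handy when you already know which positions you want.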
HTH,
Marc Schwartz