[R] reading tokens

Tue Nov 3 13:51:43 CET 2009

On Tue, Nov 3, 2009 at 3:02 AM, Dieter Menne
<dieter.menne at menne-biomed.de> wrote:
>
>
>
> j daniel wrote:
>>
>> I am not familiar with processing text in R.  Can someone tell me how to
>> read each line of words as separate elements in a list?
>>
>> For example, I would like to turn:
>>
>> word1 word2 word3
>> word2 word4
>>
>> into a list of length two with three character elements in the first list
>> and two elements in the second.  I know that this should be easy, but I am
>> a little confused by the text functions.
>>
>
> You could use scan. Have a look at the gsubfn package, which has a demo
> that shows additional features you are likely to need:
>
> library(gsubfn)
> demo("gsubfn-gries")
> ....
>
> The example code is a bit overnested, but to better understand what is going
> on, unwrap it:
>
> So
>  tail(sort(table(unlist(strapply(Lines1, "\\w+", perl = TRUE)))))
>
> is:
>
> x1 = strapply(Lines1, "\\w+", perl = TRUE)
> x1
> x2 = unlist(x1)
> x2
> x3 = table(x2)
> x3
> x4 = sort(x3)
> x4
> tail(x4)
>

Just one small optimization: you don't actually need perl = TRUE here.
By default, strapply uses Tcl and the Tcl regex engine, which does
support \w and can process the input faster, although with input of
the above size the difference won't be material. If you do specify
perl = TRUE (or engine = "R"), then it will use R and R's perl engine.
There is a link to the Tcl regex page in the links box on the gsubfn
home page:
http://gsubfn.googlecode.com
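
For the original question (one list element per line, each holding that
line's words), a base R sketch that does not need gsubfn might look like
the following. It is not taken from the thread: Lines1 here is made-up
sample data standing in for the result of readLines() on your file, and
regmatches/gregexpr are used in place of strapply.

```r
# Sample input; in practice this could come from readLines("yourfile.txt")
# (the file name is only a placeholder).
Lines1 <- c("word1 word2 word3", "word2 word4")

# One list element per input line, each a character vector of tokens.
# "\\w+" matches runs of word characters, as in the strapply call above.
tokens <- regmatches(Lines1, gregexpr("\\w+", Lines1, perl = TRUE))

# Same steps as the unwrapped example: flatten, tabulate, sort, tail.
counts <- sort(table(unlist(tokens)))
tail(counts)
```

Here tokens is a list of length two, with three elements in the first
component and two in the second; counts holds the frequency of each word
across all lines.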