[R] reading and analyzing a word document

David Winsemius dwinsemius at comcast.net
Thu Oct 1 14:42:34 CEST 2009


On Oct 1, 2009, at 12:18 AM, cls59 wrote:
>
> PDXRugger wrote:
>>
>> Considering your instructions:
>>
>> # Define words to find
>> to.find <- c('the', 'is', 'are', 'dr')
>> # Read in the file...
>> file.text <- readLines('data/letter.txt')
>> # Count number of occurrences of defined words in text
>> line.matches <- unlist(lapply(to.find, grep,
>>                               x = unlist(file.text[2])))
>>
>> Result:
>>> line.matches
>> [1] 1 1 1
>>
>> This is not right, of course, because there are actually four words and,
>> secondly, because the searched words appear multiple times.
>>
>>
>
> The example I gave was only meant to identify those lines on which matches
> occurred. Using x = unlist(file.text[2]) only feeds one line of the file
> into the matching routine, so the result indicates that all the matches were
> on line 1, the only line present for searching.
>
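To illustrate the point above, here is a minimal sketch. The three-line vector is a made-up stand-in for readLines('data/letter.txt'), whose real contents are not shown in this thread. It compares grep on a single line with grep on the whole file. In this invented text 'dr' finds nothing because grep is case-sensitive by default, which may or may not be why the original result had only three entries.

```r
# Hypothetical stand-in for readLines('data/letter.txt')
file.text <- c("Dear Dr Smith,",
               "the results are in and the report is ready.",
               "There are no more tests.")
to.find <- c("the", "is", "are", "dr")

# Searching only the second line: the input has length one, so every
# pattern that matches reports index 1, and a pattern with no match
# contributes nothing to the unlisted result.
one.line <- unlist(lapply(to.find, grep, x = file.text[2]))

# Searching all lines: grep returns the indices of the lines that match.
all.lines <- lapply(to.find, grep, x = file.text)
names(all.lines) <- to.find

print(one.line)
print(all.lines)
```

Either way, grep reports which lines match, not how many times a word occurs on them.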
> If you want to count the individual occurrences of the words on each line,
> you may need to look at using a function such as gregexpr. grep only
> indicates whether a match is present in a line of text; gregexpr
> indicates at which positions those matches occur in the line.
>
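A rough sketch of the gregexpr approach described above, not necessarily what cls59 had in mind. The helper name count.matches and the sample text are invented for illustration. Wrapping each word in \\b with perl = TRUE and passing ignore.case = TRUE are my own assumptions about what "counting words" should mean here.

```r
# Hypothetical stand-in for readLines('data/letter.txt')
file.text <- c("Dear Dr Smith,",
               "the results are in and the report is ready.",
               "There are no more tests.")
to.find <- c("the", "is", "are", "dr")

# Hypothetical helper: number of matches of 'pattern' in each element of x.
count.matches <- function(pattern, x, ...) {
  hits <- gregexpr(pattern, x, ...)
  # gregexpr returns -1 for an element with no match,
  # so only positive start positions are counted.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per line of text, one column per word.
# \\b restricts matches to whole words (assumption: 'the' should not
# count inside 'There'); ignore.case lets 'dr' match 'Dr'.
counts <- sapply(to.find, function(w) {
  count.matches(paste0("\\b", w, "\\b"), file.text,
                ignore.case = TRUE, perl = TRUE)
})

# Total occurrences of each word over the whole text.
totals <- colSums(counts)
print(counts)
print(totals)
```

Dropping the \\b anchors would count substrings as well, so the choice depends on what is being measured.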
> However, you may be getting to the point with this where R is no longer an
> appropriate tool for this job. R is amazingly flexible, and it is possible
> that it can give you what you want. However, R was not designed to perform
> text processing. Perl comes to mind as a language that was explicitly
> designed to perform these sorts of operations.

Perhaps you should use the R-search facilities for such questions:

http://finzi.psych.upenn.edu/R/library/tau/html/00Index.html
http://finzi.psych.upenn.edu/views/NaturalLanguageProcessing.html
http://www.jstatsoft.org/v25/i05/

R may not have been designed for text processing, but it is rather  
amazing how much has been done.

-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



