[R] reading and analyzing a word document

spencerg spencer.graves at prodsyse.com
Thu Oct 1 16:32:13 CEST 2009


library(sos)
tm <- findFn('text mining')
tm


      This produced 15 matches, which you could also find using 
"RSiteSearch('text mining', 'function')".  The difference is that 
findFn{sos} displays the results in a table sorted to place the package 
with the most matches first.  In this case, there is actually a "Text 
Mining Package" called "tm".  "summary(tm)" says these 15 matches are in 
11 packages.  The first of the 11 is "FactoMineR". 


      Hope this helps. 
      Spencer Graves


David Winsemius wrote:
>
> On Oct 1, 2009, at 12:18 AM, cls59 wrote:
>>
>> PDXRugger wrote:
>>>
>>> Considering your instructions:
>>>
>>> #Define words to find
>>> to.find <- c( 'the', 'is', 'are' ,'dr')
>>> #Read in the file...
>>> file.text <- readLines( 'data/letter.txt' )
>>> #Count number of occurrences of defined words in text
>>> line.matches <- unlist( lapply( to.find, grep, x = unlist(file.text[2]) ) )
>>>
>>> Result:
>>>> line.matches
>>> [1] 1 1 1
>>>
>>> This is not right, of course: first, there are actually four words, and
>>> second, the searched words appear multiple times.
>>>
>>>
>>
>> The example I gave was only meant to identify those lines on which matches
>> occurred. Using x = unlist(file.text[2]) only feeds one line of the file
>> into the matching routine so the result indicates that all the matches were
>> on line 1-- the only line present for searching.
>>
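A minimal sketch of the point above, passing the whole file to grep rather
than a single line. The sample text is made up and stands in for
readLines('data/letter.txt'); ignore.case = TRUE is my assumption, not part
of the original example.

```r
# Hypothetical sample text standing in for readLines('data/letter.txt')
file.text <- c("Dear Dr Smith,",
               "The results are in and the report is ready.",
               "There is nothing more to add.")
to.find <- c("the", "is", "are", "dr")

# For each word, the indices of the lines that contain it.
# This is a case-insensitive substring match, so 'the' also hits 'There'.
lines.with.match <- lapply(to.find, grep, x = file.text, ignore.case = TRUE)
names(lines.with.match) <- to.find
lines.with.match
```

Each list element holds line numbers, not counts: a line appears once no
matter how many times the word occurs on it.
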
>> If you want to count the individual occurrences of the words on each line,
>> you may need to look at using a function such as gregexpr. grep only
>> indicates whether a line of text contains a match-- gregexpr indicates
>> at which positions those matches occur in the line.
>>
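The gregexpr suggestion could be sketched roughly as follows. The helper
name count.matches and the sample text are hypothetical, and the
case-insensitive matching is my assumption. The sketch also assumes the text
has more than one line, so that sapply returns a matrix.

```r
# Hypothetical sample text standing in for readLines('data/letter.txt')
file.text <- c("Dear Dr Smith,",
               "The results are in and the report is ready.",
               "There is nothing more to add.")
to.find <- c("the", "is", "are", "dr")

# Count non-overlapping matches of one pattern in each element of x.
# gregexpr returns -1 for an element with no match, so only positive
# match positions are counted.
count.matches <- function(pattern, x) {
  hits <- gregexpr(pattern, x, ignore.case = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Rows are lines of text, columns are search words
counts <- sapply(to.find, count.matches, x = file.text)
counts

# Total occurrences of each word over the whole text
colSums(counts)
```

These are substring matches, so 'the' is also counted inside 'There'. To
count whole words only, the patterns could be wrapped in word boundaries,
e.g. paste0("\\b", to.find, "\\b").
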
>> However, you may be getting to the point with this where R is no longer an
>> appropriate tool for this job. R is amazingly flexible, so it is possible
>> that it can give you what you want. However, R was not designed to perform
>> text processing-- Perl comes to mind as being a language that was explicitly
>> designed to perform these sorts of operations.
>
> Perhaps you should use the R-search facilities for such questions:
>
> http://finzi.psych.upenn.edu/R/library/tau/html/00Index.html
> http://finzi.psych.upenn.edu/views/NaturalLanguageProcessing.html
> http://www.jstatsoft.org/v25/i05/
>
> R may not have been designed for text processing, but it is rather 
> amazing how much has been done.
>
>>


-- 
Spencer Graves, PE, PhD
President and Chief Operating Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph:  408-655-4567



