[R] Burt table from word frequency list
Joan-Josep Vallbé
Pep.Vallbe at uab.cat
Mon Mar 30 16:06:35 CEST 2009
Thank you very much for all your comments, and sorry for the
confusion caused by my messages. My corpus is a collection of
responses to an open question from a questionnaire. My intention is
not to create groups of respondents but to treat all the responses as
a "whole discourse" on a particular issue, so that I can find
different "semantic contexts" within the text. I have all the
responses in a single document, which I then want to split into
strings of a specified number n of words. The resulting semantic
contexts would be sets of (correlated) word strings containing
particularly relevant (correlated) words.
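
For the splitting step I have something like this in mind (just a
rough sketch; "responses.txt" and n = 10 are placeholders):

words <- scan("responses.txt", what = character())  # one word per element
n <- 10                                             # chosen string length
chunks <- split(words, ceiling(seq_along(words) / n))
strings <- sapply(chunks, paste, collapse = " ")    # one n-word string each
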
I guess I must dive deeper into the "ca" and "tm" packages. Any
other ideas will be very welcome.
best,
Pep Vallbé
On Mar 30, 2009, at 2:05 PM, Alan Zaslavsky wrote:
> Maybe not terribly hard, depending on exactly what you need.
> Suppose you turn your text into a character vector 'mytext' of
> words. Then for a table of words appearing delta words apart
> (ordered), you can table mytext against itself with a lag:
>
> nwords = length(mytext)
> # cross-tabulate each word against the word delta positions earlier
> burttab = table(mytext[-(1:delta)], mytext[-(nwords + 1 - (1:delta))])
>
> Add it to its transpose and sum over delta up to your maximum distance
> apart. If you want only words appearing near each other within the
> same sentence (or some other unit), pad out the sentence break with
> at least delta instances of a dummy spacer:
>
> the cat chased the greedy rat SPACER SPACER SPACER the dog chased
> the clever cat
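>
> In code, that accumulation might look something like this (a sketch;
> maxdelta is the chosen maximum distance, and turning mytext into a
> factor keeps the tables for different lags the same size):
>
> f = factor(mytext)
> burt = 0
> for (delta in 1:maxdelta) {
>   tab = table(f[-(1:delta)], f[-(nwords + 1 - (1:delta))])
>   burt = burt + tab + t(tab)   # add the transpose: count both orders
> }
> # if SPACER padding was used, drop its row and column at the end
> burt = burt[rownames(burt) != "SPACER", colnames(burt) != "SPACER"]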
>
> This will count all pairings at distance delta; if you want to count
> only those for which this was the NEAREST co-occurrence (so
>
> the cat and the rat chased the dog
>
> would count as two at delta=3 but not one at delta=6) it will be
> trickier and I'm not sure this approach can be modified to handle it.
>
>> Date: Sun, 29 Mar 2009 22:20:15 -0400
>> From: "Murray Cooper" <myrmail at earthlink.net>
>> Subject: Re: [R] Burt table from word frequency list
>> The usual approach is to count the co-occurrence of words within so
>> many words of each other. Typical is a window of 5 words before and
>> 5 words after a given word. So for each word in the document, you
>> look for the occurrence of all other words within -5 -4 -3 -2 -1 0
>> 1 2 3 4 5 words. Depending on the language and the question being
>> asked, certain words may be excluded.
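>>
>> A rough sketch of that window count (assuming 'words' holds the
>> document as a character vector, with any excluded words already
>> removed):
>>
>> w = 5                                   # 5 words before and after
>> n = length(words)
>> lv = sort(unique(words))
>> cooc = matrix(0, length(lv), length(lv), dimnames = list(lv, lv))
>> for (i in seq_len(n)) {
>>   for (j in max(1, i - w):min(n, i + w)) {
>>     if (j != i) cooc[words[i], words[j]] = cooc[words[i], words[j]] + 1
>>   }
>> }
>>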
>> This is not a simple function! I don't know if anyone has done a
>> package for this type of analysis, but with over 2000 packages
>> floating around you might get lucky.