[R] Burt table from word frequency list
Joan-Josep Vallbé
Pep.Vallbe at uab.cat
Mon Mar 30 16:06:35 CEST 2009
Thank you very much for all your comments, and sorry for the
confusion caused by my messages. My corpus is a collection of
responses to an open question from a questionnaire. My intention is
not to create groups of respondents but to treat all the responses as
a "whole discourse" on a particular issue, so that I can find
different "semantic contexts" within the text. I have all the
responses in a single document, which I then want to split into
strings of a specified number n of words. The resulting semantic
contexts would be sets of (correlated) word strings containing
particularly relevant (correlated) words.
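
For the splitting step I have something like this in mind (just a
rough sketch; "responses.txt" and n = 10 are placeholders):

words <- scan("responses.txt", what = character())  # one word per element
n <- 10                                             # chosen string length
chunks <- split(words, ceiling(seq_along(words) / n))
strings <- sapply(chunks, paste, collapse = " ")    # one n-word string each
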
I guess I must dive deeper into the "ca" and "tm" packages. Any
other ideas will be very welcome.
best,
Pep Vallbé
On Mar 30, 2009, at 2:05 PM, Alan Zaslavsky wrote:
> Maybe not terribly hard, depending on exactly what you need.
> Suppose you turn your text into a character vector 'mytext' of
> words. Then for a table of words appearing delta words apart
> (ordered), you can table mytext against itself with a lag:
>
> nwords = length(mytext)
> # cross-tabulate each word against the word delta positions earlier
> burttab = table(mytext[-(1:delta)], mytext[-(nwords + 1 - (1:delta))])
>
> Add it to its transpose and sum over delta up to your maximum distance
> apart. If you want only words appearing near each other within the
> same sentence (or some other unit), pad out the sentence break with
> at least delta instances of a dummy spacer:
>
> the cat chased the greedy rat SPACER SPACER SPACER the dog chased
> the clever cat
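>
> In code, that accumulation might look something like this (a sketch;
> maxdelta is the chosen maximum distance, and turning mytext into a
> factor keeps the tables for different lags the same size):
>
> f = factor(mytext)
> burt = 0
> for (delta in 1:maxdelta) {
>   tab = table(f[-(1:delta)], f[-(nwords + 1 - (1:delta))])
>   burt = burt + tab + t(tab)   # add the transpose: count both orders
> }
> # if SPACER padding was used, drop its row and column at the end
> burt = burt[rownames(burt) != "SPACER", colnames(burt) != "SPACER"]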
>
> This will count all pairings at distance delta; if you want to count
> only those for which this was the NEAREST co-occurrence (so
>
> the cat and the rat chased the dog
>
> would count as two at delta=3 but not one at delta=6) it will be
> trickier and I'm not sure this approach can be modified to handle it.
>
>> Date: Sun, 29 Mar 2009 22:20:15 -0400
>> From: "Murray Cooper" <myrmail at earthlink.net>
>> Subject: Re: [R] Burt table from word frequency list
>> The usual approach is to count the co-occurrence of words within so
>> many words of each other. Typical is a window of 5 words before and
>> 5 words after a given word. So for each word in the document, you
>> look for the occurrence of all other words within -5 -4 -3 -2 -1 0
>> 1 2 3 4 5 words. Depending on the language and the question being
>> asked, certain words may be excluded.
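>>
>> A rough sketch of that window count (assuming 'words' holds the
>> document as a character vector, with any excluded words already
>> removed):
>>
>> w = 5                                   # 5 words before and after
>> n = length(words)
>> lv = sort(unique(words))
>> cooc = matrix(0, length(lv), length(lv), dimnames = list(lv, lv))
>> for (i in seq_len(n)) {
>>   for (j in max(1, i - w):min(n, i + w)) {
>>     if (j != i) cooc[words[i], words[j]] = cooc[words[i], words[j]] + 1
>>   }
>> }
>>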
>> This is not a simple function! I don't know if anyone has done a
>> package for this type of analysis, but with over 2000 packages
>> floating around you might get lucky.