[R] coding to generate a matrix to prepare for chi-sqr test for text mining

Huntsinger, Reid reid_huntsinger at merck.com
Thu Jun 16 00:27:45 CEST 2005


I would compile a table of all the words in the dataset (maybe you have it
already), then create a list where each component is an integer vector of
indices of words. That is, replace words by their positions in the table. 
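
A minimal sketch of that step, assuming the records are pipe-delimited lines
in the layout shown in the quoted message (id first, class label last). The
names "lines", "fields", "tokens" and "vocab" are illustrative only, and the
three records are shortened stand-ins for real data:

```r
# Illustrative input: a few records in the id|WORD|...|label layout.
lines <- c("1412|WINDOW|SHATTER|TORN|0",
           "2412|ACCID|REAREND|MULTI|1",
           "2415|ACCID|SINGL|VEHICL|0")

fields <- strsplit(lines, "|", fixed = TRUE)

# Drop the id (first field) and the class label (last field).
tokens <- lapply(fields, function(f) f[-c(1, length(f))])

# Class labels recoded from 0/1 to 1/2 so they can serve as array indices.
class <- as.integer(sapply(fields, function(f) f[length(f)])) + 1L

# Table of all distinct words, then each record as positions in that table.
vocab <- sort(unique(unlist(tokens)))
words <- lapply(tokens, function(w) unique(match(w, vocab)))
```

Here unique() keeps a word that occurs twice in one record from being indexed
twice, and vocab[words[[i]]] recovers the words of record i.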

From that sparse form you could create binary features to use with standard
classification methods, or for example compute the X'X matrix for linear
regression directly (you would probably want to throw out infrequently
occurring words to keep the matrix small enough to work with in memory). For
your specific question, say "words" is the list of integer vectors as above,
and "class" is the vector of class labels (1 or 2 to make it a valid index)
corresponding to a given vector. Then you can fill in the "present" (==1)
parts of the table class x presence x word via


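n <- length(words)            # number of observations
nwords <- max(unlist(words))  # number of distinct words (length of the word table)
tab <- array(0L, dim = c(2, 2, nwords))

for (i in seq_len(n)) {
  # unique() so a word repeated within one observation is counted only once
  for (word in unique(words[[i]])) tab[class[i], 1, word] <- tab[class[i], 1, word] + 1L
}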

and the "absent" (==2) parts are then easy:

tab[1,2,] <- sum(class == 1) - tab[1,1,]
tab[2,2,] <- sum(class == 2) - tab[2,1,] 

so now you can use chisq.test on each of the 2 x 2 tables tab[,,i] for i a
word index, all at once using apply() if convenient.
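
For instance, the p-values could be collected along the third dimension
roughly as follows. This is only a sketch: the small "tab" below is a stand-in
whose first slice reuses the counts from the ACCID example in the quoted
message and whose second slice is made up, and suppressWarnings() is there
because chisq.test() warns when expected counts are small.

```r
# Stand-in 2 x 2 x 2 array: class x presence (1 = present, 2 = absent) x word.
# array() fills column-major, so slice 1 is rows (10, 15) and (5, 9).
tab <- array(c(10L, 5L, 15L, 9L,
               3L, 12L, 22L, 2L), dim = c(2, 2, 2))

# One chi-square test per word; keep only the p-value.
pvals <- suppressWarnings(
  apply(tab, 3, function(t) chisq.test(t)$p.value)
)
```

A word that is present in every observation (or in none) gives a table with an
all-zero column, for which the p-value comes back as NaN, so such words are
probably best dropped beforehand.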

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
Sent: Wednesday, June 15, 2005 5:10 PM
To: R-help at stat.math.ethz.ch
Subject: [R] coding to generate a matrix to prepare for chi-sqr test for
text mining


Hi, there:
I have a dataset like the following:

1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
2414|IV|REAREND|CD|COG|LAB|ADVERS|1
2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
2422|THEFT|RECOV|TOTAL|THEFT|0
...

The first column is always id_num and the last one is the class label. I want
to run chi-square tests of the dependency between a word (or, later, a
word combination) and the class label.

for example, my goal is to build a table like the following, ready for
chi-square test
class label    ACCID (Yes)    ACCID (No)
     1              10            15
     0               5             9
 
Each number is a count of lines (observations).
Later I want to do the same for word combinations such as ACCID & WINDOW
(these combinations come from an association analysis done by another
program of mine) instead of ACCID only.

My first question is how to automatically build, in R, a data
structure (data frame) representing the table above for each word,
since I am learning R programming and I don't want to do it in
Python. (Don't worry if a word appears twice in one observation;
I have another version of the data set which lists only unique words.)

My goal is to get a p-value for each word/class label pair from a
chi-square test and use it to evaluate the significance of each feature
for later text mining. I am not sure if this is a good idea and I am
reading some papers on this.

Thanks,

--  
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III




