[R] Goodness of Fit for Word Frequency Counts
Thiemo Fetzer
tf at devmag.net
Thu Mar 25 16:08:25 CET 2010
Dear mailing list,
Sorry to bother you, but maybe you can help me out. I have been
searching and searching for an appropriate test.
I have a large dataset of loan requests, with data at the portfolio
level; the average portfolio holds 200 loans. I want to test whether
the portfolios are randomly drawn. The problem is that my data are
rather qualitative: I want to characterize whether loans were randomly
selected using word counts. For each loan I have a "sector", an
"activity", and a "use description". The "use description" contains
about 15 words; the "activity" description is usually only one or two
words.
What I have done so far is to compute the word counts over the overall
portfolio of 110,000 loans. From these, knowing the size of a team
portfolio, I can compute the expected frequency with which certain
keywords appear.
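In R, that step looks roughly like this (placeholder names and toy
counts, not my actual data):

# Toy data: activity-word counts over all 110,000 loans
overall_counts <- c(food = 31500, retail = 27000,
                    agriculture = 20000, services = 9000)

# Relative frequency of each word in the full population
p_word <- overall_counts / sum(overall_counts)

# Expected count of each word in a team portfolio of n loans
n <- 200
expected <- n * p_word
round(expected, 2)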
The "sector" variable is categorical and can take only 17 values,
whereas in the overall distribution I found 180 different words used
as activity descriptions.
I now wanted to do a type of goodness-of-fit test to see whether the
portfolios are randomly selected or not. I would expect that certain
portfolios are indeed randomly selected, whereas others aren't.
I ran a chi^2 test, a Freeman-Tukey test, and a G-test of goodness of
fit. The problem is that these tests are usually constructed for
categorical data, but the "activity" word counts need not be
categorical. So I am wondering whether they are still appropriate.
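For reference, what I ran looks roughly like this (toy vectors;
chisq.test is base R, and I computed the G statistic by hand since I
did not find it in base R):

# Toy word counts in one portfolio, and expected counts from the population
observed <- c(30, 20, 10, 5)
expected <- c(27, 21, 11, 6)

# Pearson chi^2 against the population proportions
chisq.test(observed, p = expected / sum(expected))

# G-test (likelihood ratio); zero cells contribute nothing to the sum
g <- 2 * sum(ifelse(observed > 0,
                    observed * log(observed / expected), 0))
pchisq(g, df = length(observed) - 1, lower.tail = FALSE)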
I may have a portfolio of 200 loans in which certain words never
appear. In that case, I am not sure which degrees of freedom to use.
Should I use, as prescribed, 179 degrees of freedom because I have 180
"categories", even though these aren't real categories?
An example may look as follows; the word is on the left, with observed
and expected word counts beside it:
word             observed    expected
-------------------------------------
food                   54   57.511776
retail                 46   49.044320
agriculture            39   36.557732
services               23   15.867387
clothing               13   14.126975
transportation         10    6.585193
housing                 3    4.650190
construction            2    4.317384
arts                    5    4.250096
manufacturing           1    3.017077
health                  2    1.751323
use                     0    0.552156
personal                0    0.252412
education               2    0.687435
wholesale               0    0.112412
entertainment           0    0.327435
green                   0    0.427438
I can have R calculate the chi^2 statistic from this, but should I
then use 16 degrees of freedom (17 categories minus one)? The problem
is that this is not categorical data! Or do I have to make comparisons
on a word-by-word basis, like a "Bernoulli"?
I have been looking for other goodness-of-fit tests for this kind of
data for days now, but I can't really find any!
I really appreciate your thoughts,
Best,
Thiemo
---
http://freigeist.devmag.net