[R] Goodness of Fit for Word Frequency Counts
Thiemo Fetzer
tf at devmag.net
Thu Mar 25 16:08:25 CET 2010
Dear mailing list,
Sorry to bother you, but maybe you can help me out. I have been
searching and searching for an appropriate test.
I have a large dataset of loan requests, with data at the portfolio
level; the average portfolio holds 200 loans. I want to test whether
the portfolios are randomly drawn. The problem is that my data are
rather qualitative: I want to characterize whether loans were randomly
selected using word counts. For each loan I have a "sector", an
"activity", and a "use description". The "use description" contains
about 15 words; the "activity" description is usually only one or two
words.
What I have done so far is to compute the word counts over the overall
portfolio of 110,000 loans. From these, knowing the size of a team
portfolio, I can compute the expected frequency with which certain
keywords appear.
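In R, that step looks roughly like this (placeholder names and toy
counts, not my actual data):

# Toy data: activity-word counts over all 110,000 loans
overall_counts <- c(food = 31500, retail = 27000,
                    agriculture = 20000, services = 9000)

# Relative frequency of each word in the full population
p_word <- overall_counts / sum(overall_counts)

# Expected count of each word in a team portfolio of n loans
n <- 200
expected <- n * p_word
round(expected, 2)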
The "sector" variable is categorical and can take only 17 values,
whereas in the overall distribution I found 180 different words used
as activity descriptions.
I now wanted to do a type of goodness-of-fit test to see whether the
portfolios are randomly selected or not. I would expect that certain
portfolios are indeed randomly selected, whereas others aren't.
I ran a chi^2 test, a Freeman-Tukey test, and a G-test of goodness of
fit. The problem is that these tests are usually constructed for
categorical data, but the "activity" word counts need not be
categorical. So I am wondering whether they are still appropriate.
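For reference, what I ran looks roughly like this (toy vectors;
chisq.test is base R, and I computed the G statistic by hand since I
did not find it in base R):

# Toy word counts in one portfolio, and expected counts from the population
observed <- c(30, 20, 10, 5)
expected <- c(27, 21, 11, 6)

# Pearson chi^2 against the population proportions
chisq.test(observed, p = expected / sum(expected))

# G-test (likelihood ratio); zero cells contribute nothing to the sum
g <- 2 * sum(ifelse(observed > 0,
                    observed * log(observed / expected), 0))
pchisq(g, df = length(observed) - 1, lower.tail = FALSE)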
I may have a portfolio of 200 loans in which certain words never
appear. In that case, I am not sure which degrees of freedom to use.
Should I use, as prescribed, 179 degrees of freedom because I have 180
"categories", even though these aren't real categories?
An example may look as follows; the word is on the left, with observed
and expected word counts beside it:
word             observed    expected
-------------------------------------
food                   54   57.511776
retail                 46   49.044320
agriculture            39   36.557732
services               23   15.867387
clothing               13   14.126975
transportation         10    6.585193
housing                 3    4.650190
construction            2    4.317384
arts                    5    4.250096
manufacturing           1    3.017077
health                  2    1.751323
use                     0    0.552156
personal                0    0.252412
education               2    0.687435
wholesale               0    0.112412
entertainment           0    0.327435
green                   0    0.427438
I can have R calculate the chi^2 statistic from this, but should I
then use 16 degrees of freedom (17 categories minus one)? The problem
is that this is not categorical data! Or do I have to make comparisons
on a word-by-word basis, like a "Bernoulli"?
I have been looking for other goodness-of-fit tests for this kind of
data for days now, but I can't really find any!
I really appreciate your thoughts,
Best,
Thiemo
---
http://freigeist.devmag.net