[R] Essay identification

Mon Jun 13 01:47:05 CEST 2005

On 12-Jun-05 Berton Gunter wrote:
> I assume that you know the usual procedure is to 'score'
> each essay by a vector that gives the frequency of occurrence
> of commonly used (sometimes adding subject matter specific)
> words and phrases. This multivariate response is then fed in
> as a "training set" into your favorite supervised
> learning/classification procedure. R has many of these -- trees,
> logistic regression, boosting, Random Forests, SVMs, LDA, SOMs
> (whoops -- that's an unsupervised one), ... . Try
> RSiteSearch('Classification', restrict='functions').
> 
> The devil is in the details as to what works best, I believe.
> With only 78 exemplars in 10 groups, unless there is a lot of
> separation (disparate styles that you could probably detect
> manually) it may be difficult. It also depends on how large
> each group is (balance is generally better).
> 
> Cheers,
> Bert

I would add to Berton's list such scores as numbers of different
words used, sentence lengths, relative frequencies of verbs,
nouns, adjectives, adverbs, and so on, perhaps scaled by overall
length. Length of Essay might even be a discriminant!
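
As a rough illustration of the simpler scores above, here is a
minimal base-R sketch. The function name essay_scores and the
tokenising rules (words are runs of letters or apostrophes,
sentences end at . ! or ?) are my own assumptions, not a standard
recipe, and part-of-speech frequencies are not covered since they
would need a tagger.

```r
# Crude per-essay scores, base R only.
# Assumptions: words are runs of letters/apostrophes; sentences
# end at one or more of . ! ? (abbreviations will be mis-split).
essay_scores <- function(text) {
  # Split into sentences, dropping the terminal punctuation
  sentences <- trimws(unlist(strsplit(text, "[.!?]+\\s*")))
  sentences <- sentences[nzchar(sentences)]

  # Helper: lower-case word tokens of a string
  tokens <- function(s) {
    s <- tolower(s)
    unlist(regmatches(s, gregexpr("[a-z']+", s)))
  }

  words <- tokens(text)
  # Number of words in each sentence
  per_sentence <- vapply(sentences, function(s) length(tokens(s)),
                         integer(1), USE.NAMES = FALSE)

  c(n_words              = length(words),
    n_distinct           = length(unique(words)),
    distinct_ratio       = length(unique(words)) / length(words),
    mean_sentence_length = mean(per_sentence))
}
```

Applying this to each essay and binding the results row-wise
(e.g. with rbind) would give a matrix of scores with one row
per essay, which is the shape most classifiers expect.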

You could also look at more subtle characteristics such as
"Zipf bins"[*] -- the relative numbers of different
words which occur once only, twice, three times, ... (though
I'm not sure how you would score such a thing for classification
purposes).
[*] A term I've just invented, inspired by the original instance
    of this by the linguist Zipf, which later gave rise to the
    logarithmic distribution in the historic paper by Fisher,
    Corbet & Williams on numbers of species and numbers of
    individuals in butterfly traps.
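
One possible way to score this for classification, offered only
as a sketch: pool the tail into a final "k or more" bin so that
every essay yields a vector of the same length. The name
zipf_bins and the default max_bin = 4 are hypothetical choices
of mine, and the bins are expressed as proportions of the
distinct words so that essay length matters less.

```r
# Proportion of distinct words occurring exactly 1, 2, ... times,
# with words occurring max_bin or more times pooled in the last bin.
zipf_bins <- function(words, max_bin = 4) {
  counts <- table(words)                       # occurrences per distinct word
  capped <- pmin(as.integer(counts), max_bin)  # pool the tail into max_bin
  bins   <- tabulate(capped, nbins = max_bin)  # distinct words in each bin
  names(bins) <- c(seq_len(max_bin - 1), paste0(max_bin, "+"))
  bins / sum(bins)
}

# Toy example: "a" and "e" occur once, "b" twice, "c" three times,
# "d" five times.
words <- c("a", "b", "b", "c", "c", "c", "d", "d", "d", "d", "d", "e")
zipf_bins(words)
```

Whether such proportions discriminate between authors with so
few essays is an open question; the sketch only shows that the
bins can be turned into an ordinary numeric score.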

If you really want to go to town you can try things related to
grammatical complexity, e.g. numbers of subordinate clauses
per sentence, relative clauses, the "reach" of relative pronouns
(how far from the referring pronoun is the thing referred to)
and so on.
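
Measuring these properly needs a parser, but a very crude proxy
can be had by counting subordinating words per sentence. The
sketch below is only that: the marker list and the function name
subordinators_per_sentence are illustrative assumptions, the
list is incomplete, and words such as "that" are ambiguous.

```r
# Very crude proxy for clause complexity: count of subordinating
# words in each sentence. The marker list is illustrative only.
markers <- c("which", "who", "whom", "whose", "that", "because",
             "although", "while", "if", "when", "since", "unless")

subordinators_per_sentence <- function(text, markers) {
  # Split into sentences at . ! or ? (abbreviations will be mis-split)
  sentences <- trimws(unlist(strsplit(text, "[.!?]+\\s*")))
  sentences <- sentences[nzchar(sentences)]
  vapply(sentences, function(s) {
    s <- tolower(s)
    w <- unlist(regmatches(s, gregexpr("[a-z']+", s)))
    sum(w %in% markers)            # marker words in this sentence
  }, integer(1), USE.NAMES = FALSE)
}

txt <- "The man who left was tired because he ran. It rained."
subordinators_per_sentence(txt, markers)
```

The mean of these counts over an essay could serve as one more
score alongside the others.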

There's quite an extensive literature on this sort of thing,
though it's not as fashionable as it used to be.

The real problem is that you can get carried away by "good
ideas" of things to try!

The other factor to bear in mind is that if the Essays
can be grouped by subject this is likely to influence many
of the scores (such as the above).

Hoping this helps and does not distract!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-05                                       Time: 00:43:10
------------------------------ XFMail ------------------------------