[R] Different TFIDF settings in test set prevent testing model

Fri Aug 11 17:49:35 CEST 2023

On Fri, 11 Aug 2023 10:20:27 +0000
James C Schopf <jcschopf using hotmail.com> wrote:

> > train_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))  

> > test_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately
(e.g. in order to be able to work with text that doesn't exist at the
time when the model is trained), but since the model has no
representation for tokens that it (well, the tokeniser) hasn't seen
during the training process, you have to ensure that test_text_dtm
references exactly the same tokens as train_text_dtm, with the columns
in the same order.
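
A minimal sketch of one way to get there with tm, assuming the
`dictionary` control option of DocumentTermMatrix() is used to restrict
the test matrix to the training vocabulary. The toy documents and the
helper align_to_vocab() are made up for illustration; the helper
re-aligns the columns explicitly through a dense matrix instead of
relying on the column order tm happens to produce, so it may not scale
to large corpora:

```r
library(tm)

# Toy data standing in for all_train_tokens / all_test_tokens
train_docs <- c("apple banana apple", "banana cherry")
test_docs  <- c("cherry durian", "apple apple")

train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(train_docs)))
vocab <- Terms(train_dtm)

# Only count terms that the training set knows about
test_dtm <- DocumentTermMatrix(VCorpus(VectorSource(test_docs)),
                               control = list(dictionary = vocab))

# Hypothetical helper: force exactly the columns in `vocab`, in that
# order, filling terms absent from the given documents with zeros
align_to_vocab <- function(dtm, vocab) {
  m <- as.matrix(dtm)
  out <- matrix(0, nrow = nrow(m), ncol = length(vocab),
                dimnames = list(rownames(m), vocab))
  common <- intersect(colnames(m), vocab)
  out[, common] <- m[, common, drop = FALSE]
  out
}

train_x <- align_to_vocab(train_dtm, vocab)
test_x  <- align_to_vocab(test_dtm, vocab)
```

Here "durian" never makes it into test_x because the training set has
not seen it, and both matrices end up with identical column names.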

Also, it probably makes sense to reuse the document frequencies (the
IDF part of the weights) learned on the training document set;
otherwise you may be importance-weighting different tokens than the
ones your SVM has learned as important if your test set has a
significantly different distribution from that of the training set.
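
As a rough illustration in base R, assuming "reuse" means learning the
inverse document frequencies on the training rows only and applying
them to both sets. fit_idf() and apply_tfidf() are hypothetical helpers,
not part of tm; they work on dense count matrices (e.g. the result of
as.matrix() on an unweighted document-term matrix) and use the
log2(N / df) form with length-normalised term frequencies, which I
believe is what weightTfIdf() does by default:

```r
# Hypothetical helpers operating on dense count matrices
# (rows = documents, columns = terms)

# Learn IDF weights from the training counts only
fit_idf <- function(train_counts) {
  df <- colSums(train_counts > 0)        # document frequency per term
  idf <- log2(nrow(train_counts) / df)
  idf[!is.finite(idf)] <- 0              # terms with df = 0 get weight 0
  idf
}

# Apply previously learned IDF weights to a count matrix that has the
# same columns in the same order
apply_tfidf <- function(counts, idf) {
  stopifnot(identical(colnames(counts), names(idf)))
  len <- rowSums(counts)
  tf <- counts / ifelse(len > 0, len, 1) # length-normalised term frequency
  sweep(tf, 2, idf, `*`)
}

terms <- c("apple", "banana", "cherry")
train_counts <- matrix(c(2, 1, 0,
                         0, 1, 1),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(c("d1", "d2"), terms))
test_counts <- matrix(c(0, 0, 1,
                        2, 0, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("t1", "t2"), terms))

idf     <- fit_idf(train_counts)
train_x <- apply_tfidf(train_counts, idf)
test_x  <- apply_tfidf(test_counts, idf)
```

The test documents are weighted with the training IDF, so a term that
is rare in the training set keeps its high weight even if it happens to
be common in the test set.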

Bert is probably right: with the API given by the tm package, it
seems easiest to tokenise and weight document-term matrices first, then
split them into the train and test subsets. It may be worth asking the
maintainer about applying previously "learned" transformations to new
corpora.
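
A minimal sketch of the weight-first-then-split approach, with toy
documents and an arbitrary split chosen purely for illustration:

```r
library(tm)

# Toy documents; the split indices are made up for this example
docs <- c("apple banana apple", "banana cherry",
          "cherry durian", "apple durian durian")
train_idx <- c(1, 2, 3)
test_idx  <- 4

# Tokenise and TF-IDF-weight everything at once
all_dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)),
                              control = list(weighting = weightTfIdf))

# Row subsets share the same columns and weights by construction
train_dtm <- all_dtm[train_idx, ]
test_dtm  <- all_dtm[test_idx, ]
```

Note that the IDF part of the weights then also reflects the test
documents, which is convenient but not quite the same thing as reusing
weights learned on the training set alone.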

-- 
Best regards,
Ivan