[R] Different TFIDF settings in test set prevent testing model

Fri Aug 11 17:09:28 CEST 2023

I know nothing about tf-idf, etc., but can you not simply read the whole
file into R and then split it randomly within R? The training and test sets
would then be defined by a single random sample of row subscripts: each row
is either in the sample (training) or not (test).

e.g. (simplified example -- you would be subsetting the rows of your full
dataset):

> x <- 1:10
> samp <- sort(sample(x,5))
> x[samp] ## training
[1] 3 4 6 7 8
> x[-samp] ## test
[1]  1  2  5  9 10
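
Applied to the actual problem, the idea would be to build the TF-IDF matrix
once from all documents and only then split its rows, so that the training
and test pieces necessarily share the same columns. A minimal sketch with the
tm package, assuming a toy data frame `d` that reuses the Text and M2 column
names from the quoted post (the six documents are invented for illustration):

```r
library(tm)

# Toy stand-in for the full data set; the contents are made up.
d <- data.frame(
  Text = c("the cat sat on the mat", "dogs chase cats", "birds sing at dawn",
           "cats sleep all day", "dogs bark at night", "fish swim in ponds"),
  M2 = c("yes", "no", "no", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Build ONE document-term matrix from all documents, so every row
# shares the same set of columns (terms).
dtm   <- DocumentTermMatrix(VCorpus(VectorSource(d$Text)))
tfidf <- as.matrix(weightTfIdf(dtm))

# Split by row index only after the matrix exists.
set.seed(1)
samp <- sort(sample(nrow(d), floor(0.75 * nrow(d))))

train_x <- tfidf[samp, , drop = FALSE]
train_y <- d$M2[samp]
test_x  <- tfidf[-samp, , drop = FALSE]
test_y  <- d$M2[-samp]

# Both pieces have the same columns in the same order.
identical(colnames(train_x), colnames(test_x))
```

One caveat: the IDF weights here are computed from all documents, so the
test rows influence the weighting slightly. Whether that matters depends on
how strictly the test set has to be held out.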

Apologies if my ignorance means this can't work.
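
If the two files have to stay separate, another option might be to restrict
the test matrix to the vocabulary of the training matrix, using the
`dictionary` control option of tm's DocumentTermMatrix(). The sketch below
uses invented toy texts; the IDF step is a hand-rolled approximation of
normalized TF-IDF using the training document frequencies, not the exact
output of weightTfIdf():

```r
library(tm)

# Invented example texts standing in for the two csv files.
train_text <- c("the cat sat on the mat", "dogs chase cats", "birds sing at dawn")
test_text  <- c("a cat and a dog", "owls hoot at night")

train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(train_text)))
train_m   <- as.matrix(train_dtm)

# Count only terms that occurred in the training data; test-only terms
# are dropped, and training terms absent from the test set get zero columns.
test_dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(test_text)),
  control = list(dictionary = Terms(train_dtm))
)

# Force the same column order as the training matrix.
test_m <- as.matrix(test_dtm)[, colnames(train_m), drop = FALSE]

# IDF from the TRAINING documents only (assumption: log2(N / df) weighting).
idf <- log2(nrow(train_m) / colSums(train_m > 0))

# Term frequency relative to document length, guarding against empty rows.
test_tf    <- test_m / pmax(rowSums(test_m), 1)
test_tfidf <- sweep(test_tf, 2, idf, `*`)

dim(test_tfidf)
```

The resulting test matrix has exactly as many columns as the training
matrix, which is the property the error message complains about.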

Cheers,
Bert

On Fri, Aug 11, 2023 at 7:17 AM James C Schopf <jcschopf using hotmail.com> wrote:

> Hello, I'd be very grateful for your help.
>
> I randomly split a .csv file of 1287 documents 75%/25% into two .csv
> files, one for training an algorithm and the other for testing it.
> I applied the same kind of preprocessing, including a TF-IDF
> transformation, to both sets, but R won't let me make predictions on the
> test set because its TF-IDF matrix has different dimensions.
> I get the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> I'd greatly appreciate a suggestion to overcome this problem.
> Thanks!
>
>
> Here's my R code:
>
> > library(tidyverse)
> > library(tidytext)
> > library(caret)
> > library(kernlab)
> > library(tokenizers)
> > library(tm)
> > library(e1071)
>
> ***LOAD TRAINING SET/959 rows with text in column1 and yes/no in column2
> (labelled M2)
> > url <- "D:/test/M2_75.csv"
> > d <- read_csv(url)
> ***CREATE TEXT CORPUS FROM TEXT COLUMN
> > train_text_corpus <- Corpus(VectorSource(d$Text))
> ***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> > tokenize_document <- function(doc) {
> +     doc_tokens <- unlist(tokenize_words(doc))
> +     doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
> +     doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
> +     all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
> +     return(all_tokens)
> + }
> ***APPLY TOKENS TO DOCUMENTS
> > all_train_tokens <- lapply(train_text_corpus, tokenize_document)
> ***CREATE A DTM FROM THE TOKENS
> > train_text_dtm <-
> +     DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
> ***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> > train_text_tfidf <- weightTfIdf(train_text_dtm)
> ***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> > trainData <- data.frame(M2 = d$M2)
> ***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> > trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
> ***DEFINE THE ML MODEL
> > ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
> +     classProbs = TRUE)
> ***TRAIN SVM
> > model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial",
> +     trControl = ctrl)
> ***SAVE SVM
> > saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")
>
> The R code for my test set, which fails at the last step:
>
> ***LOAD TEST SET/ 309 rows with text in column1 and yes/no in column2
> (labelled M2)
> > url <- "D:/test/M2_25.csv"
> > d <- read_csv(url)
> ***CREATE TEXT CORPUS FROM TEXT COLUMN
> > test_text_corpus <- Corpus(VectorSource(d$Text))
> ***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> > tokenize_document <- function(doc) {
> +     doc_tokens <- unlist(tokenize_words(doc))
> +     doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
> +     doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
> +     all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
> +     return(all_tokens)
> + }
> ***APPLY TOKEN TO DOCUMENTS
> > all_test_tokens <- lapply(test_text_corpus, tokenize_document)
> ***CREATE A DTM FROM THE TOKENS
> > test_text_dtm <-
> +     DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
> ***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> > test_text_tfidf <- weightTfIdf(test_text_dtm)
> ***CREATE A NEW DATA WITH M2 COLUMN FROM ORIGINAL TEST DATA
> > testData <- data.frame(M2 = d$M2)
> ***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> > testData$text_tfidf <- I(as.matrix(test_text_tfidf))
> ***LOAD OLD MODEL
> > model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
> ***MAKE PREDICTIONS
> > predictions <- predict(model_svmRadial, newdata = testData)
>
> This last line produces the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> Please help.  Thanks!
>
