[R] Preprocess and training model with text data
Neha gupta
neh@@bo|ogn@90 @end|ng |rom gm@||@com
Fri Apr 15 15:15:34 CEST 2022
Hi everyone,
I am working on text categorization (my first such project, so I am still
learning), and my dataset has several text columns (details about the data
are pasted at the bottom). I have worked with numeric data before, but my
advisor has now asked me to perform predictive modeling on this text data.
I am doing some preprocessing such as tokenizing, lowercasing, stemming, etc.
The following code is used for tokenization, then lowercasing, then stemming:
train.tokens <- tokens(train$DESCRIPTION, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_hyphens = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_wordstem(train.tokens, language = "english")
I have two questions.
(1) If we have more text features (apart from DESCRIPTION), do I need to
repeat these steps for each feature? I tried the following, but it does not
work:
train.tokens <- tokens(c(train$DESCRIPTION,,train$NAME), what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, remove_hyphens = TRUE)
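One possible approach to question (1), offered as a minimal sketch rather than a definitive answer: join the text columns row by row with paste() and tokenize the combined string once, instead of repeating the steps per column. Note that c() stacks the columns end to end, which would double the number of documents. The toy data frame and the TEXT column name are made up for illustration, and split_hyphens is used in place of remove_hyphens on the assumption that quanteda 2.0 or later is installed, where the argument was renamed.

```r
library(quanteda)

# Toy stand-in for the real `train` data frame (values are made up).
train <- data.frame(
  NAME = c("Branches should have sufficient coverage",
           "Lines should have sufficient coverage"),
  DESCRIPTION = c("An issue is created on a file.",
                  "An issue is created when coverage is low."),
  stringsAsFactors = FALSE
)

# paste() is vectorised: row i of NAME is joined with row i of DESCRIPTION,
# so the result still has one string per row of `train`.
# TEXT is a hypothetical column name used only in this sketch.
train$TEXT <- paste(train$NAME, train$DESCRIPTION, sep = " ")

# Same preprocessing steps as in the post, applied once to the combined text.
# Assumption: quanteda >= 2.0, where split_hyphens replaced remove_hyphens.
train.tokens <- tokens(train$TEXT, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_wordstem(train.tokens, language = "english")

# One tokenized document per row of the data frame.
ndoc(train.tokens) == nrow(train)
```

Pasting the columns together treats a word the same way regardless of which column it came from. If the columns should stay distinguishable, building one dfm per column and combining them afterwards would be a different design, not shown here.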
(2) After we preprocess the data and create our bag-of-words model as below,
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
train.tokens.matrix <- as.matrix(train.tokens.dfm)
are we ready to train our model and perform prediction?
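For question (2), the following is a rough sketch of what training and prediction on a dfm could look like, under several assumptions that are not in the post: the quanteda.textmodels add-on package is installed, a naive Bayes classifier is an acceptable first model, SEVERITY is the column to predict, and a separate `test` data frame exists. The toy data are made up. The main point it illustrates is that new data must be mapped onto the training vocabulary before prediction.

```r
library(quanteda)
library(quanteda.textmodels)  # assumption: provides textmodel_nb()

# Made-up toy data; using SEVERITY as the label is an assumption.
train <- data.frame(
  DESCRIPTION = c("branches need sufficient test coverage",
                  "lines need sufficient test coverage",
                  "unused variable should be removed",
                  "unused import should be removed"),
  SEVERITY = c("MAJOR", "MAJOR", "MINOR", "MINOR"),
  stringsAsFactors = FALSE
)
test <- data.frame(
  DESCRIPTION = c("unused parameter should be removed"),
  stringsAsFactors = FALSE
)

# Bag-of-words matrices for training and new data.
train.tokens.dfm <- dfm(tokens(train$DESCRIPTION), tolower = FALSE)
test.tokens.dfm  <- dfm(tokens(test$DESCRIPTION), tolower = FALSE)

# Align the test features with the training vocabulary: words unseen in
# training are dropped, and training words missing from the test text
# become zero columns.
test.tokens.dfm <- dfm_match(test.tokens.dfm, featnames(train.tokens.dfm))

# Fit a multinomial naive Bayes model on the sparse dfm and predict classes.
nb.model <- textmodel_nb(train.tokens.dfm, y = train$SEVERITY)
pred <- predict(nb.model, newdata = test.tokens.dfm)
print(pred)
```

In this sketch the sparse dfm is passed to the model directly, so the as.matrix() conversion is not needed; a dense matrix would only be required by modeling functions that do not accept sparse input.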
My data, as also mentioned in previous emails:
Rows: 1,819
Columns: 14
$ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
$ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
$ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
$ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
$ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
$ NAME                        <chr> "Branches should have sufficient coverage by t~
$ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
$ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
$ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
$ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
$ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
$ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~