## **Introduction**
The `text2emotion` package detects and classifies emotions in text data. By processing textual input, it identifies emotions such as happiness, sadness, anger, fear, and neutrality. The package leverages techniques like TF-IDF feature extraction and Random Forest classification to predict emotions. Additionally, it associates corresponding emojis with the predicted emotions, providing an engaging way to visualize sentiment in text. This vignette will guide you through the steps of using the package, from preprocessing and model training to emotion prediction and emoji mapping.
---
## **Installation**
To install the `text2emotion` package and load it together with its dependencies, you can use the following commands:

``` r
install.packages("text2emotion")
library(text2emotion)
library(stringr)
library(textclean)
#> Warning: package 'textclean' was built under R version 4.4.3
library(magrittr)
library(text2vec)
#> Warning: package 'text2vec' was built under R version 4.4.3
library(ranger)
#> Warning: package 'ranger' was built under R version 4.4.3
library(caret)
#> Warning: package 'caret' was built under R version 4.4.3
#> Loading required package: ggplot2
#> Loading required package: lattice
library(parallel)
library(stats)
```
---
## **preprocess_text**
The `preprocess_text` function is designed to clean and preprocess raw text data, preparing it for further analysis or modeling. It includes a variety of stages that handle common text preprocessing tasks such as basic cleaning, contraction expansion, slang handling, and punctuation standardization. Below is a detailed guide on how to use this function and what to consider when applying it to your data.
The function accepts the following arguments:

- `text`: A character vector of text that needs to be processed.
- `use_textclean`: A logical value (default is `TRUE`). If `TRUE`, the function replaces internet slang and emoticons using the **textclean** package. If `FALSE`, this step is skipped.
- `custom_slang`: An optional named vector of custom slang replacements. This allows you to specify additional slang terms and their full forms that are not covered by the predefined list.
The preprocessing pipeline runs through four stages:

- **Basic Cleaning**: The function first converts all text to lowercase and replaces HTML entities (such as `&nbsp;`). It also replaces non-ASCII characters with their closest ASCII equivalents (e.g., full-width characters to half-width characters).
- **Contraction Expansion**: Common English contractions like “can’t” and “won’t” are expanded to their full forms (e.g., “can’t” becomes “can not”). The function handles various contractions, with special attention given to negations (e.g., “isn’t” becomes “is not”).
- **Slang Handling**: The function can replace common internet slang terms with their full forms (e.g., “u” becomes “you”). By default, it uses the **textclean** package to handle internet slang and emoticons, but this step can be disabled by setting `use_textclean = FALSE`. Additionally, you can add your own slang terms via the `custom_slang` argument, allowing for more flexible processing.
- **Final Standardization**: The function standardizes punctuation by normalizing repeated punctuation marks (e.g., “!!!” becomes “!”) and ensures consistent spacing around punctuation marks. It also removes any excess whitespace using `stringr::str_squish()`.
``` r
# Sample text with contractions, slang, and emoticons
text <- "I'm so excited!! 2nite we go 4 a gg :)"

# Preprocess the text
cleaned_text <- preprocess_text(text)

# View the processed text
cleaned_text
#> [1] "i am so excited ! tonight we go for a good game smiley"
```
In the example above, you can see that:

- Contractions like “I’m” are expanded to “I am”.
- “2nite” is converted to “tonight”.
- “4” is replaced with “for”.
- “gg” is expanded to “good game”.
- Emoticons like “:)” are converted to words such as “smiley” (as defined by **textclean**).
If you have additional slang terms or abbreviations that are not covered by the predefined list, you can pass them via the `custom_slang` argument. Here is an example:
``` r
# Define custom slang terms
custom_slang <- c(
  "bff" = "best friend forever",
  "omg" = "oh my god"
)

# Preprocess the text with custom slang
text_with_custom_slang <- preprocess_text("omg! My bff is here!", custom_slang = custom_slang)

# View the processed text
text_with_custom_slang
#> [1] "oh my god ! my best friends , forever is here !"
```
A few considerations when applying the function:

- **Text Size**: This function is designed to handle relatively short pieces of text (e.g., tweets, reviews, messages). If you have a large corpus of text, you may want to apply this function to each document individually for better performance.
- **Internet Slang**: The function handles common internet slang by default, but it may not catch every possible slang term. For more specific needs, consider using the `custom_slang` argument to provide additional replacements.
- **Punctuation Handling**: The function standardizes repeated punctuation marks (e.g., “!!!” becomes “!”) but keeps single punctuation marks as they are. This can help preserve the emotional intensity of the text.
- **Language Support**: Currently, this function is designed for English text. If you’re working with other languages, you might need to modify the function or adjust the slang terms accordingly.
---
## **predict_emotion_with_emoji**
The `predict_emotion_with_emoji()` function allows you to analyze input text and obtain a predicted emotion, its corresponding emoji, or both. It combines preprocessing, TF-IDF feature extraction, and a trained Random Forest classifier to produce interpretable results for sentiment or emotion-aware applications.
The function accepts the following arguments:

- `text`: A character string. The input text you want to analyze.
- `output_type`: The format of the returned result. Available options:
  - `"emotion"` – returns the predicted emotion label (e.g., `"sad"`)
  - `"emoji"` – returns only the emoji (e.g., 😢)
  - `"textemoji"` – returns the original text with the emoji appended (default)
The function assumes that the following files exist in your **working directory**:

- `trained_rf_ranger_model.rds`
- `trained_tfidf_model.rds`
- `trained_vectorizer.rds`

You can train and save these using your own dataset and pipeline. Refer to the model training section below.
The preprocessing step internally calls `preprocess_text()` to normalize text before feature extraction.
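For example, a typical call looks like this (the input sentence is hypothetical, and the returned values depend on the trained models in your working directory):

``` r
# Return only the predicted emotion label
predict_emotion_with_emoji("I miss my old friends so much", output_type = "emotion")

# Return only the corresponding emoji
predict_emotion_with_emoji("I miss my old friends so much", output_type = "emoji")

# Default: return the original text with the emoji appended
predict_emotion_with_emoji("I miss my old friends so much")
```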
The emotion classes predicted include `"positive"`, `"angry"`, `"sad"`, `"fear"`, and `"neutral"`.
---
## **Modeling Pipeline**
This part of the vignette describes the development process and core logic behind the modeling pipeline in this package, which maps textual input to emotion categories and ultimately to emojis. The workflow includes text preprocessing, TF-IDF vectorization, model selection, hyperparameter tuning, and final evaluation.
Before modeling, we apply a series of preprocessing techniques to clean and standardize the input texts. This step is essential for improving downstream classification performance. A key part of our preprocessing is negation handling: when a negation word such as `"not"`, `"never"`, or `"no"` precedes a word, we combine them into a new token (e.g., `"not happy"` → `"not_happy"`), ensuring the model learns the distinct semantics of negated expressions. The function `handle_negation()` is responsible for applying this rule-based transformation to token sequences, as sketched below.
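A minimal sketch of this rule-based merging, assuming tokens arrive as a plain character vector, is shown here; the packaged `handle_negation()` may differ in its exact signature and edge-case handling:

``` r
# Sketch of rule-based negation merging (not the package's exact code):
# fuse a negation token with the word that follows it.
negation_merge <- function(tokens, negations = c("not", "never", "no")) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (tokens[i] %in% negations && i < length(tokens)) {
      out <- c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

negation_merge(c("i", "am", "not", "happy", "today"))
#> [1] "i"         "am"        "not_happy" "today"
```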
For feature extraction, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) to encode the preprocessed text. The function `train_tfidf_model()` performs this transformation with the following steps:
- **Token filtering**: Uninformative tokens are removed based on our custom stopword list.
- **Support for unigrams through trigrams**: We include both single words and short phrases to better capture emotional context (e.g., `"so happy"` or `"not good"`).
- **Vocabulary pruning**: Low-frequency and overly frequent terms are filtered using `min_df`, `max_df`, and a maximum feature size.
- **Sentiment-priority term boosting**: Words known to carry strong emotional weight (e.g., “happy”, “furious”, “disappointed”) are manually prioritized to ensure they are retained in the final vocabulary.
The resulting sparse TF-IDF matrix is used as input for the classification model.
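Since the package builds on **text2vec**, the steps inside `train_tfidf_model()` can be approximated as follows. This is a simplified sketch: the example texts, the stopword list, and the mapping of `min_df`/`max_df` onto **text2vec**'s pruning arguments are assumptions, not the function's exact implementation:

``` r
library(text2vec)

texts <- c("i am so happy today", "that movie was not_good for me")

# Iterator over tokenized documents
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)

# Unigrams through trigrams, filtered against a custom stopword list
stopword_list <- c("the", "a", "an", "was", "for")  # illustrative only
vocab <- create_vocabulary(it, ngram = c(1L, 3L), stopwords = stopword_list)

# Prune rare and overly common terms and cap the vocabulary size
vocab <- prune_vocabulary(
  vocab,
  term_count_min     = 1,    # plays the role of min_df
  doc_proportion_max = 0.95, # plays the role of max_df
  vocab_term_max     = 5000  # maximum feature size
)

# Vectorize and apply the TF-IDF weighting
vectorizer   <- vocab_vectorizer(vocab)
dtm          <- create_dtm(it, vectorizer)
tfidf        <- TfIdf$new()
tfidf_matrix <- fit_transform(dtm, tfidf)
```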
We experimented with several classification models for sentiment prediction, including Decision Trees, Random Forest, and XGBoost. After extensive evaluation, we found that Random Forest consistently outperformed both Decision Trees and XGBoost in terms of test accuracy and macro F1-score. Therefore, we adopted Random Forest as our final model of choice.
We perform a grid search to optimize the Random Forest model using the `tune_rf_model()` function, tuning two hyperparameters:

- `mtry` (number of features to consider at each split)
- `ntree` (number of trees in the forest)

For each hyperparameter combination, the model is trained and evaluated on the training set, and accuracy is recorded:
``` r
best_params <- tune_rf_model(
  train_matrix = tfidf_result$tfidf_matrix,
  train_labels = train_labels,
  mtry_grid = c(5, 10, 20),
  ntree_grid = c(100, 200, 300),
  seed = 123,
  verbose = TRUE
)
```
The best-performing parameter set is chosen based on training accuracy. A caching mechanism is used to store the training dataset and accelerate re-runs.
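One way such a cache could work is sketched here; the file name and the `prepare_training_data()` helper are hypothetical stand-ins for the package's actual internals:

``` r
# Illustrative RDS-based cache: reuse the prepared training set if present.
cache_file <- "train_cache.rds"            # hypothetical cache location
if (file.exists(cache_file)) {
  train_data <- readRDS(cache_file)
} else {
  train_data <- prepare_training_data()    # hypothetical helper
  saveRDS(train_data, cache_file)
}
```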
With the best hyperparameters obtained, the `train_rf_model()` function fits the final model using the full training data. The trained model is saved to disk as an RDS file for reuse.
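An assumed call pattern is shown below; the argument names of `train_rf_model()` are inferred from the tuning example above and may not match the function's exact signature:

``` r
# Fit the final forest with the tuned parameters (argument names assumed)
rf_model <- train_rf_model(
  train_matrix = tfidf_result$tfidf_matrix,
  train_labels = train_labels,
  mtry         = best_params$mtry,
  ntree        = best_params$ntree
)

# Persist the artifacts that predict_emotion_with_emoji() expects to find
saveRDS(rf_model, "trained_rf_ranger_model.rds")
saveRDS(tfidf_result$tfidf_model, "trained_tfidf_model.rds")
saveRDS(tfidf_result$vectorizer, "trained_vectorizer.rds")
```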
To evaluate generalization, we use `evaluate_rf_model()` to assess model performance on held-out test data. The function reports overall accuracy as well as per-class metrics:
``` r
eval_result <- evaluate_rf_model(
  rf_model = rf_model,
  test_texts = preprocessed_test_texts,
  test_labels = test_labels,
  tfidf_model = tfidf_result$tfidf_model,
  vectorizer = tfidf_result$vectorizer,
  stopwords = stopwords,
  verbose = TRUE
)
```
The accuracy of our Random Forest model remains stable above 65% across different runs. This robust evaluation framework allows us to monitor both overall and per-class performance, which is particularly important in imbalanced sentiment datasets.
---
## **Conclusion**
The `text2emotion` package offers a streamlined and efficient pipeline for emotion classification from text. It integrates robust text preprocessing techniques, TF-IDF vectorization, and machine learning models to deliver reliable emotion predictions. Unlike many black-box models, our approach is transparent, customizable, and suitable for small to medium-sized datasets. Extensive testing showed that Random Forest consistently outperformed Decision Trees and XGBoost in our use case, achieving stable accuracy above 65%. Additionally, the package maps emotions to representative emojis, enhancing its utility in applications such as sentiment tagging, chatbot responses, and social media analysis.
In conclusion, this package provides a comprehensive toolset for emotion classification in text, combining powerful preprocessing, feature extraction, and predictive modeling. By using a Random Forest classifier, it achieves stable and reliable performance, especially when compared to other models like XGBoost and Decision Trees. With the added ability to map emotions to emojis, the package offers a unique, user-friendly experience for analyzing and visualizing emotions in text, making it a valuable resource for anyone working in sentiment analysis and NLP.