## **Introduction**
The `text2emotion` package detects and classifies emotions in text data. By processing textual input, it identifies emotions such as happiness, sadness, anger, fear, and neutrality. The package leverages techniques like TF-IDF feature extraction and Random Forest classification to predict emotions. Additionally, it associates corresponding emojis with the predicted emotions, providing an engaging way to visualize sentiment in text. This vignette will guide you through the steps of using the package, from preprocessing and model training to emotion prediction and emoji mapping.
---
## **Installation**
To install the `text2emotion` package and load it together with its dependencies, you can use the following commands:

``` r
install.packages("text2emotion")
library(text2emotion)
library(stringr)
library(textclean)
#> Warning: package 'textclean' was built under R version 4.4.3
library(magrittr)
library(text2vec)
#> Warning: package 'text2vec' was built under R version 4.4.3
library(ranger)
#> Warning: package 'ranger' was built under R version 4.4.3
library(caret)
#> Warning: package 'caret' was built under R version 4.4.3
#> Loading required package: ggplot2
#> Loading required package: lattice
library(parallel)
library(stats)
```
---
## **preprocess_text**
The `preprocess_text` function is designed to clean and preprocess raw text data, preparing it for further analysis or modeling. It includes a variety of stages that handle common text preprocessing tasks such as basic cleaning, contraction expansion, slang handling, and punctuation standardization. Below is a detailed guide on how to use this function and what to consider when applying it to your data.
The function accepts the following arguments:

- `text`: A character vector of text that needs to be processed.
- `use_textclean`: A logical value (default is `TRUE`). If `TRUE`, the function replaces internet slang and emoticons using the **textclean** package. If `FALSE`, this step is skipped.
- `custom_slang`: An optional named vector of custom slang replacements. This allows you to specify additional slang terms and their full forms that are not covered by the predefined list.
The preprocessing pipeline runs through four stages:

- **Basic Cleaning**: The function first converts all text to lowercase and replaces HTML entities (such as `&nbsp;`). It also replaces non-ASCII characters with their closest ASCII equivalents (e.g., full-width characters to half-width characters).
- **Contraction Expansion**: Common English contractions like “can’t” and “won’t” are expanded to their full forms (e.g., “can’t” becomes “can not”). The function handles various contractions, with special attention given to negations (e.g., “isn’t” becomes “is not”).
- **Slang Handling**: The function can replace common internet slang terms with their full forms (e.g., “u” becomes “you”). By default, it uses the **textclean** package to handle internet slang and emoticons, but this step can be disabled by setting `use_textclean = FALSE`. Additionally, you can add your own slang terms via the `custom_slang` argument, allowing for more flexible processing.
- **Final Standardization**: The function standardizes punctuation by normalizing repeated punctuation marks (e.g., “!!!” becomes “!”) and ensures consistent spacing around punctuation marks. It also removes any excess whitespace using `stringr::str_squish()`.
``` r
# Sample text with contractions, slang, and emoticons
text <- "I'm so excited!! 2nite we go 4 a gg :)"

# Preprocess the text
cleaned_text <- preprocess_text(text)

# View the processed text
cleaned_text
#> [1] "i am so excited ! tonight we go for a good game smiley"
```
In the example above, you can see that:

- Contractions like “I’m” are expanded to “I am”.
- “2nite” is converted to “tonight”.
- “4” is replaced with “for”.
- “gg” is expanded to “good game”.
- Emoticons like “:)” are converted to words such as “smiley” (as defined by **textclean**).
If you have additional slang terms or abbreviations that are not covered by the predefined list, you can pass them via the `custom_slang` argument. Here is an example:
``` r
# Define custom slang terms
custom_slang <- c(
  "bff" = "best friend forever",
  "omg" = "oh my god"
)

# Preprocess the text with custom slang
text_with_custom_slang <- preprocess_text("omg! My bff is here!", custom_slang = custom_slang)

# View the processed text
text_with_custom_slang
#> [1] "oh my god ! my best friends , forever is here !"
```
A few considerations when applying the function:

- **Text Size**: This function is designed to handle relatively short pieces of text (e.g., tweets, reviews, messages). If you have a large corpus of text, you may want to apply this function to each document individually for better performance.
- **Internet Slang**: The function handles common internet slang by default, but it may not catch every possible slang term. For more specific needs, consider using the `custom_slang` argument to provide additional replacements.
- **Punctuation Handling**: The function standardizes repeated punctuation marks (e.g., “!!!” becomes “!”) but keeps single punctuation marks as they are. This can help preserve the emotional intensity of the text.
- **Language Support**: Currently, this function is designed for English text. If you’re working with other languages, you might need to modify the function or adjust the slang terms accordingly.
---
## **predict_emotion_with_emoji**
The `predict_emotion_with_emoji()` function allows you to analyze input text and obtain a predicted emotion, its corresponding emoji, or both. It combines preprocessing, TF-IDF feature extraction, and a trained Random Forest classifier to produce interpretable results for sentiment or emotion-aware applications.
The function accepts the following arguments:

- `text`: A character string. The input text you want to analyze.
- `output_type`: The format of the returned result. Available options:
  - `"emotion"` – returns the predicted emotion label (e.g., `"sad"`)
  - `"emoji"` – returns only the emoji (e.g., 😢)
  - `"textemoji"` – returns the original text with the emoji appended (default)
The function assumes that the following files exist in your **working directory**:

- `trained_rf_ranger_model.rds`
- `trained_tfidf_model.rds`
- `trained_vectorizer.rds`

You can train and save these using your own dataset and pipeline. Refer to the model training section below.
The preprocessing step internally calls `preprocess_text()` to normalize text before feature extraction.
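For example, a typical call looks like this (the input sentence is hypothetical, and the returned values depend on the trained models in your working directory):

``` r
# Return only the predicted emotion label
predict_emotion_with_emoji("I miss my old friends so much", output_type = "emotion")

# Return only the corresponding emoji
predict_emotion_with_emoji("I miss my old friends so much", output_type = "emoji")

# Default: return the original text with the emoji appended
predict_emotion_with_emoji("I miss my old friends so much")
```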
The emotion classes predicted include `"positive"`, `"angry"`, `"sad"`, `"fear"`, and `"neutral"`.
---
## **Modeling Pipeline**
This part of the vignette describes the development process and core logic behind the modeling pipeline in this package, which maps textual input to emotion categories and ultimately to emojis. The workflow includes text preprocessing, TF-IDF vectorization, model selection, hyperparameter tuning, and final evaluation.
Before modeling, we apply a series of preprocessing techniques to clean and standardize the input texts. This step is essential for improving downstream classification performance. A key part of our preprocessing is negation handling: when a negation word such as `"not"`, `"never"`, or `"no"` precedes a word, we combine them into a new token (e.g., `"not happy"` → `"not_happy"`), ensuring the model learns the distinct semantics of negated expressions. The function `handle_negation()` is responsible for applying this rule-based transformation to token sequences, as sketched below.
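A minimal sketch of this rule-based merging, assuming tokens arrive as a plain character vector, is shown here; the packaged `handle_negation()` may differ in its exact signature and edge-case handling:

``` r
# Sketch of rule-based negation merging (not the package's exact code):
# fuse a negation token with the word that follows it.
negation_merge <- function(tokens, negations = c("not", "never", "no")) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (tokens[i] %in% negations && i < length(tokens)) {
      out <- c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

negation_merge(c("i", "am", "not", "happy", "today"))
#> [1] "i"         "am"        "not_happy" "today"
```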
For feature extraction, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) to encode the preprocessed text. The function `train_tfidf_model()` performs this transformation with the following steps:
- **Token filtering**: Uninformative tokens are removed based on our custom stopword list.
- **Support for unigrams through trigrams**: We include both single words and short phrases to better capture emotional context (e.g., `"so happy"` or `"not good"`).
- **Vocabulary pruning**: Low-frequency and overly frequent terms are filtered using `min_df`, `max_df`, and a maximum feature size.
- **Sentiment-priority term boosting**: Words known to carry strong emotional weight (e.g., “happy”, “furious”, “disappointed”) are manually prioritized to ensure they are retained in the final vocabulary.
The resulting sparse TF-IDF matrix is used as input for the classification model.
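Since the package builds on **text2vec**, the steps inside `train_tfidf_model()` can be approximated as follows. This is a simplified sketch: the example texts, the stopword list, and the mapping of `min_df`/`max_df` onto **text2vec**'s pruning arguments are assumptions, not the function's exact implementation:

``` r
library(text2vec)

texts <- c("i am so happy today", "that movie was not_good for me")

# Iterator over tokenized documents
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)

# Unigrams through trigrams, filtered against a custom stopword list
stopword_list <- c("the", "a", "an", "was", "for")  # illustrative only
vocab <- create_vocabulary(it, ngram = c(1L, 3L), stopwords = stopword_list)

# Prune rare and overly common terms and cap the vocabulary size
vocab <- prune_vocabulary(
  vocab,
  term_count_min     = 1,    # plays the role of min_df
  doc_proportion_max = 0.95, # plays the role of max_df
  vocab_term_max     = 5000  # maximum feature size
)

# Vectorize and apply the TF-IDF weighting
vectorizer   <- vocab_vectorizer(vocab)
dtm          <- create_dtm(it, vectorizer)
tfidf        <- TfIdf$new()
tfidf_matrix <- fit_transform(dtm, tfidf)
```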
We experimented with several classification models for sentiment prediction, including Decision Trees, Random Forest, and XGBoost. After extensive evaluation, we found that Random Forest consistently outperformed both Decision Trees and XGBoost in terms of test accuracy and macro F1-score. Therefore, we adopted Random Forest as our final model of choice.
We perform a grid search to optimize the Random Forest model using the `tune_rf_model()` function, tuning two hyperparameters:

- `mtry` (number of features to consider at each split)
- `ntree` (number of trees in the forest)

For each hyperparameter combination, the model is trained and evaluated on the training set, and accuracy is recorded:
``` r
best_params <- tune_rf_model(
  train_matrix = tfidf_result$tfidf_matrix,
  train_labels = train_labels,
  mtry_grid = c(5, 10, 20),
  ntree_grid = c(100, 200, 300),
  seed = 123,
  verbose = TRUE
)
```
The best-performing parameter set is chosen based on training accuracy. A caching mechanism is used to store the training dataset and accelerate re-runs.
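One way such a cache could work is sketched here; the file name and the `prepare_training_data()` helper are hypothetical stand-ins for the package's actual internals:

``` r
# Illustrative RDS-based cache: reuse the prepared training set if present.
cache_file <- "train_cache.rds"            # hypothetical cache location
if (file.exists(cache_file)) {
  train_data <- readRDS(cache_file)
} else {
  train_data <- prepare_training_data()    # hypothetical helper
  saveRDS(train_data, cache_file)
}
```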
With the best hyperparameters obtained, the `train_rf_model()` function fits the final model using the full training data. The trained model is saved to disk as an RDS file for reuse.
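An assumed call pattern is shown below; the argument names of `train_rf_model()` are inferred from the tuning example above and may not match the function's exact signature:

``` r
# Fit the final forest with the tuned parameters (argument names assumed)
rf_model <- train_rf_model(
  train_matrix = tfidf_result$tfidf_matrix,
  train_labels = train_labels,
  mtry         = best_params$mtry,
  ntree        = best_params$ntree
)

# Persist the artifacts that predict_emotion_with_emoji() expects to find
saveRDS(rf_model, "trained_rf_ranger_model.rds")
saveRDS(tfidf_result$tfidf_model, "trained_tfidf_model.rds")
saveRDS(tfidf_result$vectorizer, "trained_vectorizer.rds")
```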
To evaluate generalization, we use `evaluate_rf_model()` to assess model performance on held-out test data. The function reports overall accuracy as well as per-class metrics:
``` r
eval_result <- evaluate_rf_model(
  rf_model = rf_model,
  test_texts = preprocessed_test_texts,
  test_labels = test_labels,
  tfidf_model = tfidf_result$tfidf_model,
  vectorizer = tfidf_result$vectorizer,
  stopwords = stopwords,
  verbose = TRUE
)
```
The accuracy of our Random Forest model remains stable above 65% across different runs. This robust evaluation framework allows us to monitor both overall and per-class performance, which is particularly important in imbalanced sentiment datasets.
---
## **Conclusion**
The `text2emotion` package offers a streamlined and efficient pipeline for emotion classification from text. It integrates robust text preprocessing techniques, TF-IDF vectorization, and machine learning models to deliver reliable emotion predictions. Unlike many black-box models, our approach is transparent, customizable, and suitable for small to medium-sized datasets. Extensive testing showed that Random Forest consistently outperformed Decision Trees and XGBoost in our use case, achieving stable accuracy above 65%. Additionally, the package maps emotions to representative emojis, enhancing its utility in applications such as sentiment tagging, chatbot responses, and social media analysis.
In conclusion, this package provides a comprehensive toolset for emotion classification in text, combining powerful preprocessing, feature extraction, and predictive modeling. By using a Random Forest classifier, it achieves stable and reliable performance, especially when compared to other models like XGBoost and Decision Trees. With the added ability to map emotions to emojis, the package offers a unique, user-friendly experience for analyzing and visualizing emotions in text, making it a valuable resource for anyone working in sentiment analysis and NLP.