This vignette describes the most basic usage of the sentopics package by estimating an LDA model and analyzing its output. Two other vignettes, describing time series and topic models with sentiment, are also available.

Topic modeling

Introduction

sentopics implements three types of topic models. The simplest, Latent Dirichlet Allocation (LDA), assumes that textual documents are produced by a generative process involving $K$ topics.

A given document $d$ consists of a list of words $d = (w_1, \dots, w_N)$, with $N$ being the document’s length. Each word $w_i$ originates from a vocabulary consisting of $V$ distinct terms. Documents are then generated by the following random process:

For each topic $k \in \{1, \dots, K\}$, a distribution $\phi_k$ over the vocabulary is drawn. This distribution represents the probability of a word appearing given that it belongs to the topic, and is drawn from a Dirichlet distribution with hyperparameter $\beta$. \[\phi_k \sim Dirichlet(\beta)\]
For each document, a mixture of the $K$ topics, $\theta_d$, assigns the probability of a word in document $d$ being generated from topic $k$. This mixture is also drawn from a Dirichlet distribution, with hyperparameter $\alpha$. \[\theta_d \sim Dirichlet(\alpha)\]
For each word position $i$ of document $d$, the following sequence of draws is executed:
1. A latent topic assignment $z_i$ is drawn from the document mixture. $z_i \sim Multinomial(\theta_d)$
2. A word $w_i$ is drawn from the topic’s vocabulary distribution. $w_i \sim Multinomial(\phi_{z_i})$
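
The generative process above can be simulated in a few lines of base R. This is a minimal sketch and not part of sentopics: the helper `rdirichlet1()`, the toy dimensions and the vocabulary `w1`, ..., `w6` are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper: one Dirichlet draw obtained by normalising Gamma draws
rdirichlet1 <- function(conc) {
  g <- rgamma(length(conc), shape = conc, rate = 1)
  g / sum(g)
}

set.seed(1)
K <- 3; V <- 6; D <- 4; N <- 8   # topics, vocabulary size, documents, words per document
alpha <- 1; beta <- 0.5          # symmetric Dirichlet hyperparameters (toy values)
vocab <- paste0("w", seq_len(V))

# phi: one distribution over the vocabulary per topic (K x V matrix)
phi <- t(replicate(K, rdirichlet1(rep(beta, V))))
# theta: one topic mixture per document (D x K matrix)
theta <- t(replicate(D, rdirichlet1(rep(alpha, K))))

# For each word position, draw a topic z_i from theta_d, then a word w_i from phi_{z_i}
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, N, replace = TRUE, prob = theta[d, ])
  w <- vapply(z, function(k) sample.int(V, 1, prob = phi[k, ]), integer(1))
  data.frame(word = vocab[w], topic = z)
})
docs[[1]]
```

Each element of `docs` pairs the generated words with the latent topic that produced them. In real data only the words are observed, which is why the topic assignments have to be estimated.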

In sentopics, the LDA model is estimated through Gibbs sampling, which iteratively samples the topic assignment $z_i$ of every word of the corpus until convergence. The topic assignments are sampled from the following distribution: \[ p(z_i = k|w,z^{-i}) \propto \frac{n_{k,v,.}^{-i} + \beta}{n_{k,.,.}^{-i} + V\beta} \frac{n_{k,.,d}^{-i} + \alpha}{n_{.,.,d}^{-i} + K\alpha},\] where $n_{k,v,d}$ is the count of words at index $v$ of the vocabulary, assigned to topic $k$ and part of document $d$. Replacing one of the indices $\{k,v,d\}$ with a dot indicates instead the count summed over all topics, all vocabulary indices or all documents. The superscript $-i$ indicates that the current word position $i$ is left out of the count variables.
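
To make the sampling distribution concrete, the sketch below runs a small collapsed Gibbs sampler in base R. It is only an illustration of the formula above, not the implementation used by sentopics; the toy corpus, the symmetric hyperparameter values and the variable names (`n_kv`, `n_kd`) are assumptions made for this example.

```r
set.seed(42)
# Hypothetical toy corpus: each document is a vector of vocabulary indices
docs <- list(c(1, 2, 1, 3), c(4, 5, 4, 6), c(1, 2, 5, 6))
K <- 2; V <- 6; alpha <- 1; beta <- 0.01

# Random initial topic assignments and the matching count tables
z <- lapply(docs, function(w) sample.int(K, length(w), replace = TRUE))
n_kv <- matrix(0, K, V)             # n_{k,v,.}: topic-word counts
n_kd <- matrix(0, K, length(docs))  # n_{k,.,d}: topic-document counts
for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  n_kv[z[[d]][i], docs[[d]][i]] <- n_kv[z[[d]][i], docs[[d]][i]] + 1
  n_kd[z[[d]][i], d] <- n_kd[z[[d]][i], d] + 1
}

for (iter in 1:50) {
  for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
    v <- docs[[d]][i]; k_old <- z[[d]][i]
    # Leave the current position out of the counts (the "-i" superscript)
    n_kv[k_old, v] <- n_kv[k_old, v] - 1
    n_kd[k_old, d] <- n_kd[k_old, d] - 1
    # Unnormalised conditional probability of each topic k for this position
    p <- (n_kv[, v] + beta) / (rowSums(n_kv) + V * beta) *
      (n_kd[, d] + alpha) / (sum(n_kd[, d]) + K * alpha)
    k_new <- sample.int(K, 1, prob = p)
    # Record the new assignment and restore the counts
    z[[d]][i] <- k_new
    n_kv[k_new, v] <- n_kv[k_new, v] + 1
    n_kd[k_new, d] <- n_kd[k_new, d] + 1
  }
}
z
```

The count tables are updated in place, so each draw conditions on the most recent assignments of all other words. This is also why additional iterations can continue from the current state rather than starting over.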

Estimating LDA models with `sentopics`

The estimation of an LDA model is easily replicated using the LDA() and fit() functions. The first function prepares the R object and initializes the assignment of the latent topics. The second function estimates the model using Gibbs sampling for a given number of iterations. Note that fit() may be called repeatedly to run additional iterations without resetting the estimation.

set.seed(123)
lda <- LDA(ECB_press_conferences_tokens)
lda
# An LDA model with 5 topics. Currently fitted by 0 Gibbs sampling iterations.
# ------------------Useful methods------------------
# fit       :Estimate the model using Gibbs sampling
# topics    :Return the most important topic of each document
# topWords  :Return a data.table with the top words of each topic/sentiment
# plot      :Plot a sunburst chart representing the estimated mixtures
# This message is displayed once per session, unless calling `print(x, extended = TRUE)`
lda <- fit(lda, iterations = 100)
lda
# An LDA model with 5 topics. Currently fitted by 100 Gibbs sampling iterations.

Internally, the lda object is stored as a list and contains the model’s parameters and outputs.

str(lda, max.level = 1, give.attr = FALSE)
# List of 10
#  $ tokens       :List of 3860
#  $ vocabulary   :Classes 'data.table' and 'data.frame':   1168 obs. of  3 variables:
#  $ K            : num 5
#  $ alpha        : num [1:5, 1] 1 1 1 1 1
#  $ beta         : num [1:5, 1:1168] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 ...
#  $ it           : num 100
#  $ za           :List of 3860
#  $ theta        : num [1:3860, 1:5] 0.0455 0.0357 0.0909 0.0667 0.0625 ...
#  $ phi          : num [1:1168, 1:5] 4.32e-07 4.32e-07 4.32e-07 6.47e-03 4.32e-07 ...
#  $ logLikelihood: num [1:100, 1] -943778 -927554 -912671 -893733 -864875 ...

tokens is the initial tokens object used to create the model.
vocabulary is a data.frame indexing the set of words.
K is the number of topics.
alpha is the hyperparameter of the document-topic mixtures.
beta is the hyperparameter of the topic-word mixtures.
it is the number of iterations of the model.
za contains the topic assignments of each word of the corpus.
theta are the estimated document-topic mixtures.
phi are the estimated topic-word mixtures.
logLikelihood is the log-likelihood of the model at each iteration.
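
As a worked illustration of how topic assignments and alpha relate to theta, the sketch below computes a document-topic mixture with the usual smoothed estimator $(n_{k,.,d} + \alpha) / (n_{.,.,d} + K\alpha)$. Whether sentopics computes theta in exactly this way is an assumption, and the counts are hypothetical: a 17-word document with 16 words assigned to the third topic and one to the fifth.

```r
K <- 5
alpha <- 1
# Hypothetical topic counts for one document (n_{k,.,d} for k = 1, ..., 5)
n_kd <- c(0, 0, 16, 0, 1)
# Smoothed estimate of the document-topic mixture
theta_d <- (n_kd + alpha) / (sum(n_kd) + K * alpha)
round(theta_d, 4)
# 0.0455 0.0455 0.7727 0.0455 0.0909
```

With alpha = 1, a topic with no assigned word still receives a probability of 1/22, so short documents never get a mixture with exact zeros.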

Estimated mixtures are easily accessible through the $ operator. The package also includes the topWords() function to extract the most probable words of each topic. topWords() supports three types of output: a long data.table/data.frame, a matrix, or a ggplot object (also accessible through the alias plot_topWords()).

head(lda$theta)
#       topic
# doc_id     topic1     topic2    topic3     topic4     topic5
#    1_1 0.04545455 0.04545455 0.7727273 0.04545455 0.09090909
#    1_2 0.03571429 0.14285714 0.7500000 0.03571429 0.03571429
#    1_3 0.09090909 0.09090909 0.6363636 0.09090909 0.09090909
#    1_4 0.06666667 0.06666667 0.7333333 0.06666667 0.06666667
#    1_5 0.06250000 0.06250000 0.7500000 0.06250000 0.06250000
#    1_6 0.05263158 0.10526316 0.5789474 0.10526316 0.15789474
topWords(lda, output = "matrix")
#       topic1           topic2       topic3              topic4     
#  [1,] "price"          "fiscal"     "governing_council" "growth"   
#  [2,] "inflation"      "euro_area"  "ecb"               "quarter"  
#  [3,] "development"    "growth"     "meeting"           "loan"     
#  [4,] "annual"         "country"    "president"         "financial"
#  [5,] "increase"       "policy"     "bank"              "euro_area"
#  [6,] "projection"     "reform"     "operation"         "rate"     
#  [7,] "hicp"           "structural" "outcome"           "sector"   
#  [8,] "oil"            "market"     "press"             "condition"
#  [9,] "euro_area"      "economic"   "vice"              "annual"   
# [10,] "inflation_rate" "measure"    "euro"              "credit"   
#       topic5           
#  [1,] "risk"           
#  [2,] "economic"       
#  [3,] "monetary"       
#  [4,] "price_stability"
#  [5,] "euro_area"      
#  [6,] "development"    
#  [7,] "interest_rate"  
#  [8,] "outlook"        
#  [9,] "monetary_policy"
# [10,] "growth"

In addition, document-level analysis is facilitated through the melt() method, which joins the estimated topical proportions to the document metadata present in the tokens input. This results in a long data.table/data.frame that can be used for plotting or easily reshaped to a wide format (for example using data.table::dcast).

melt(lda, include_docvars = TRUE)
#         topic       prob      .date    .id doc_id
#        <fctr>      <num>     <Date> <char> <char>
#     1: topic1 0.04545455 1998-06-09    1_1      1
#     2: topic1 0.03571429 1998-06-09    1_2      1
#     3: topic1 0.09090909 1998-06-09    1_3      1
#     4: topic1 0.06666667 1998-06-09    1_4      1
#     5: topic1 0.06250000 1998-06-09    1_5      1
#    ---                                           
# 19296: topic5 0.28947368 2021-12-16 260_20    260
# 19297: topic5 0.10526316 2021-12-16 260_21    260
# 19298: topic5 0.05000000 2021-12-16 260_22    260
# 19299: topic5 0.41025641 2021-12-16 260_23    260
# 19300: topic5 0.14285714 2021-12-16 260_24    260
#                                               title
#                                              <char>
#     1: ECB Press conference: Introductory statement
#     2: ECB Press conference: Introductory statement
#     3: ECB Press conference: Introductory statement
#     4: ECB Press conference: Introductory statement
#     5: ECB Press conference: Introductory statement
#    ---                                             
# 19296:                             PRESS CONFERENCE
# 19297:                             PRESS CONFERENCE
# 19298:                             PRESS CONFERENCE
# 19299:                             PRESS CONFERENCE
# 19300:                             PRESS CONFERENCE
#                                                                             section_title
#                                                                                    <char>
#     1:          Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
#     2:          Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
#     3:          Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
#     4:          Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
#     5:          Willem F. Duisenberg, President of the European Central Bank, 9 June 1998
#    ---                                                                                   
# 19296: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19297: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19298: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19299: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
# 19300: Christine Lagarde, President of the ECB,Luis de Guindos, Vice-President of the ECB
#         .sentiment
#              <num>
#     1: -0.01470588
#     2: -0.02500000
#     3:  0.00000000
#     4:  0.00000000
#     5:  0.00000000
#    ---            
# 19296: -0.01960784
# 19297:  0.00000000
# 19298:  0.05555556
# 19299: -0.01052632
# 19300:  0.00000000
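
A long table of this shape can be aggregated and reshaped to one row per document and one column per topic. The sketch below uses base R on a small hypothetical table that only mimics the topic, prob, .id and doc_id columns shown above; the values are made up, and the simple mean ignores differences in paragraph length.

```r
# Hypothetical long table mimicking the columns returned by melt()
long <- data.frame(
  topic  = factor(rep(c("topic1", "topic2"), each = 4)),
  prob   = c(0.2, 0.4, 0.7, 0.5, 0.8, 0.6, 0.3, 0.5),
  .id    = rep(c("1_1", "1_2", "2_1", "2_2"), times = 2),
  doc_id = rep(c("1", "1", "2", "2"), times = 2)
)
# Average topic proportion per parent document, one column per topic
wide <- tapply(long$prob, list(doc_id = long$doc_id, topic = long$topic), mean)
wide
```

With data.table loaded, a comparable result should be obtainable with dcast(long, doc_id ~ topic, value.var = "prob", fun.aggregate = mean) once the table is a data.table.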

To ease the analysis of results, we can rename the default topic labels using the sentopics_labels() function. All outputs of the model will then display the custom labels.

sentopics_labels(lda) <- list(
  topic = c("Inflation", "Fiscal policy", "Governing council", "Financial sector", "Uncertainty")
)
head(lda$theta)
#       topic
# doc_id  Inflation Fiscal policy Governing council Financial sector Uncertainty
#    1_1 0.04545455    0.04545455         0.7727273       0.04545455  0.09090909
#    1_2 0.03571429    0.14285714         0.7500000       0.03571429  0.03571429
#    1_3 0.09090909    0.09090909         0.6363636       0.09090909  0.09090909
#    1_4 0.06666667    0.06666667         0.7333333       0.06666667  0.06666667
#    1_5 0.06250000    0.06250000         0.7500000       0.06250000  0.06250000
#    1_6 0.05263158    0.10526316         0.5789474       0.10526316  0.15789474
plot_topWords(lda) + ggplot2::theme_grey(base_size = 9)

Besides modifying topic labels, it is also possible to merge topics into a broader theme. This is often useful when estimating a large number of topics (e.g., K > 15). The mergeTopics() function does this job and relabels topics accordingly.

merged <- mergeTopics(lda, list(
  `Big big thematic` = c(1, 3:5),
  `Fiscal policy` = 2
))
merged
# An LDA model with 2 topics. Currently fitted by 100 Gibbs sampling iterations.

Note that merging topics is only useful for presentation purposes. Calling fit() again on a model with merged topics will drastically change the results, as the current state of the model does not result from a standard estimation with the merged set of parameters.

Provided that the plotly package is installed, one can also directly use plot() on the estimated topic model to enjoy a dynamic view of topic proportions and their most probable words (presented as a screenshot hereafter to limit this vignette’s size).

plot(lda)
