```
library(dplyr) # For easier data manipulation with pipes `%>%`
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(lda) # Needed if using prep_docs() function
library(psychtm)
data(teacher_rate) # Synthetic student ratings of instructors
```

In this data set, the documents we want to model are stored as character vectors in a column of the data frame called `doc`. We can get a preview of the data structure to verify this. The documents are synthetic for this example but are generated to have similar word distributions to a real-world data set. As a result, the synthetic documents are not as easily readable.

```
glimpse(teacher_rate)
#> Rows: 3,733
#> Columns: 4
#> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, …
#> $ rating <dbl> 5, 5, 5, 5, 2, 5, 5, 5, 1, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 4, …
#> $ grade <dbl> 3, 2, 2, 1, 3, 5, 4, 1, 5, 4, 5, 3, 1, 2, 5, 2, 5, 3, 4, 2, 2, …
#> $ doc <chr> "he is an alright there to understand care to is make project w…
```

To use the topic modeling functions in `psychtm`, we need to prepare the text data by creating a vector containing the vocabulary of all unique terms used in the documents of interest, as well as a matrix of word indices that represent which term in the vocabulary was used in each document (a row) at each word position (a column). This matrix has D rows (assuming we have D documents in our data) and \(\max \{N_d\}\) columns, where \(N_d\) is the number of words in document \(d\), \(d = 1, 2, \ldots, D\); that is, the number of columns in this matrix equals the number of words in the longest document in our data. The `psychtm` package provides a function `prep_docs()` to convert a column of documents represented as character vectors in a data frame into a `documents` matrix and a `vocab` vector, both of which are needed to use this package's topic modeling functionality.

```
docs_vocab <- prep_docs(teacher_rate, "doc")
```

We can see that we have 6264 unique terms in the vocabulary. We will save the vocabulary size for use later when modeling.

```
str(docs_vocab$vocab)
#> chr [1:6264] "he" "is" "an" "alright" "there" "to" "understand" "care" ...
vocab_len <- length(docs_vocab$vocab)
```

We can also inspect the structure of the `documents` matrix created by `prep_docs()`. Notice that we have 3733 rows (the same as the number of rows in our original data frame `teacher_rate`) and 73 columns (corresponding to the number of terms or words in the longest document in our data).

```
str(docs_vocab$documents)
#> int [1:3733, 1:73] 1 18 42 64 107 18 117 24 1 117 ...
```

Shorter documents represent unused word positions with a 0. For example, the first document contains only 19 words, so the remaining elements of its row in the `documents` matrix are all set to 0. The first 25 elements are shown below.

```
print(docs_vocab$documents[1, 1:25])
#> [1] 1 2 3 4 5 6 7 8 6 2 9 10 11 12 13 14 15 16 17 0 0 0 0 0 0
```

Nonzero elements are indices into the terms stored in the `vocab` vector. To demonstrate, we can recover the original document word by word as follows. In practice, one may want to further preprocess or clean the documents before modeling (e.g., remove stop words or perform stemming). We do not demonstrate that here for simplicity, but see, for example, the ebook *Text Mining with R* for some examples using the `tidytext` R package (Silge & Robinson, 2016).

```
docs_vocab$vocab[docs_vocab$documents[1, 1:17]]
#> [1] "he" "is" "an" "alright" "there"
#> [6] "to" "understand" "care" "to" "is"
#> [11] "make" "project" "was" "between" "it"
#> [16] "necessarily" "point"
```

The Supervised Latent Dirichlet Allocation with Covariates (SLDAX) model, or its special cases Supervised Latent Dirichlet Allocation (SLDA) and Latent Dirichlet Allocation (LDA), can be fit using the `gibbs_sldax()` function. The function uses Gibbs sampling, a Markov chain Monte Carlo algorithm, so the number of iterations to run the sampling algorithm, `m`, needs to be specified. It is usually a good idea to specify a "burn-in" period of iterations, `burn`, to discard while the algorithm iterates toward a converged solution, so that pre-convergence values are not treated as draws from the posterior distribution we want to sample from. Finally, a thinning period `thin` can be specified so that only draws separated by the thinning period are kept, resulting in lower autocorrelation among the final posterior samples.

`m` can be calculated as the desired number of posterior draws `T` (e.g., 150 for speed in this tutorial; this is generally too low in practice) multiplied by `thin`, plus `burn`: `m` = `T` x `thin` + `burn`. Below, we have `m` = 150 x 1 + 300 = 450. In practice, a longer burn-in period, a larger total number of iterations, and a larger thinning period may be advisable. For any of SLDAX, SLDA, and LDA, the `documents` matrix prepared above is supplied to the `docs` argument, and the size of the vocabulary calculated earlier is supplied to the `V` argument. Finally, we specify that we are fitting an LDA model by supplying `model = "lda"`. Be patient, as the algorithm may take a few minutes to complete. Progress can optionally be displayed using the `display_progress` argument. Other options for `gibbs_sldax()`, such as prior specifications, can be found in the documentation by running `?gibbs_sldax`.
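To make the iteration arithmetic concrete, the total `m` can be computed from its parts. A minimal sketch (`T_keep` is a local name used only here for the desired number of kept draws; it is not a `gibbs_sldax()` argument):

```
# Desired number of kept posterior draws, thinning period, and burn-in
T_keep <- 150
thin <- 1
burn <- 300

# Total iterations to request: m = T x thin + burn
m <- T_keep * thin + burn
m
#> [1] 450
```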

Here, we fit an LDA model with three topics (`K = 3`), so no covariate or outcome variables need to be provided. We set a seed for reproducibility.

```
set.seed(92850827)
fit_lda <- gibbs_sldax(m = 450, burn = 300, thin = 1,
                       docs = docs_vocab$documents,
                       V = vocab_len,
                       K = 3, model = "lda", display_progress = TRUE)
```

The estimated (posterior mean or median) topic proportions for each document, \(\theta_d\), can be obtained using `est_theta()`. Here I show the estimated topic proportions for Topics 1, 2, and 3 for the first six documents. Note that each row sums to 1 across topics for each document.

```
theta_hat <- est_theta(fit_lda)
head(theta_hat)
#> [,1] [,2] [,3]
#> [1,] 0.7988248 0.11116770 0.09000745
#> [2,] 0.9080048 0.04194245 0.05005275
#> [3,] 0.8932106 0.06379452 0.04299490
#> [4,] 0.8992168 0.05135308 0.04943008
#> [5,] 0.9279868 0.03686728 0.03514594
#> [6,] 0.9024830 0.05714753 0.04036944
```
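Because each row of \(\theta\) is a probability distribution over topics, a quick sanity check (not part of the original workflow) is to confirm that the rows sum to 1:

```
# Every document's topic proportions should sum to (numerically) 1
range(rowSums(theta_hat))
```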

Similarly, we can obtain the estimated (posterior mean or median) topic–word probabilities \(\beta_k\) for each topic using `est_beta()`. Here I show the estimated topic–word probabilities for Topics 1, 2, and 3 for the first ten terms in the vocabulary. Note that each row sums to 1 across terms for each topic.

```
beta_hat <- est_beta(fit_lda)
colnames(beta_hat) <- docs_vocab$vocab
beta_hat[, 1:10]
#> he is an alright there
#> [1,] 0.0201289387 0.0307835332 0.0057009246 1.975882e-05 0.0026515015
#> [2,] 0.0007937285 0.0010155003 0.0006101854 8.618505e-05 0.0007116854
#> [3,] 0.0004666478 0.0004244107 0.0005179860 1.194118e-04 0.0007991100
#> to understand care make project
#> [1,] 0.0296174541 0.0022887459 0.0016311004 0.0055352938 0.0006269008
#> [2,] 0.0004114374 0.0002372083 0.0012341072 0.0003099339 0.0002623891
#> [3,] 0.0004771996 0.0002817537 0.0006374755 0.0003966857 0.0002771233
```
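As with \(\theta\), each row of \(\beta\) is a probability distribution, here over the full vocabulary, which we can verify directly:

```
# Each topic's word probabilities should sum to (numerically) 1
rowSums(beta_hat)
```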

To help interpret the topics, two quantities provided by the `get_topwords()` function can be useful. First, the most probable terms associated with each topic, estimated directly in \(\beta_k\), can be evaluated. Because common words such as stop words (e.g., "the", "and", "to") were not removed ahead of analysis in this demo, the topics can be difficult to interpret using the raw probabilities.

```
get_topwords(beta_ = beta_hat, nwords = 10, docs_vocab$vocab, method = "prob") %>%
print(n = 30)
#> # A tibble: 30 × 3
#> topic word prob
#> <chr> <chr> <dbl>
#> 1 1 and 0.0404
#> 2 1 the 0.0378
#> 3 1 is 0.0308
#> 4 1 to 0.0296
#> 5 1 you 0.0277
#> 6 1 a 0.0258
#> 7 1 he 0.0201
#> 8 1 i 0.0176
#> 9 1 class 0.0156
#> 10 1 of 0.0154
#> 11 2 she 0.0198
#> 12 2 professor 0.00599
#> 13 2 her 0.00475
#> 14 2 i 0.00236
#> 15 2 final 0.00227
#> 16 2 keep 0.00220
#> 17 2 offer 0.00197
#> 18 2 highly 0.00189
#> 19 2 science 0.00185
#> 20 2 now 0.00170
#> 21 3 took 0.00341
#> 22 3 her 0.00212
#> 23 3 over 0.00196
#> 24 3 look 0.00191
#> 25 3 this 0.00185
#> 26 3 she 0.00175
#> 27 3 but 0.00159
#> 28 3 dr. 0.00159
#> 29 3 can 0.00151
#> 30 3 our 0.00131
```

However, another metric, the term score, down-weights words that are highly probable in many topics (e.g., stop words like "and" or "to"). Term scores can be interpreted as a measure of how uniquely representative a term is for a topic: larger term scores denote terms that have a high probability for one topic and lower probabilities for the other topics (i.e., more "unique" to a given topic), while smaller term scores denote terms that are probable in multiple topics (i.e., more "common" to all topics). This is the default metric computed by `get_topwords()`. Inspecting the top 15 terms for each topic below using term scores, it is now clearer that Topic 1 corresponds with descriptions of course professors, whereas Topic 2 corresponds to course assessment methods such as readings and exams.

```
get_topwords(beta_hat, 15, docs_vocab$vocab, method = "termscore") %>%
print(n = 30)
#> # A tibble: 45 × 3
#> topic word termscore
#> <chr> <chr> <dbl>
#> 1 1 and 0.117
#> 2 1 the 0.114
#> 3 1 to 0.0830
#> 4 1 you 0.0803
#> 5 1 is 0.0790
#> 6 1 a 0.0673
#> 7 1 he 0.0470
#> 8 1 of 0.0362
#> 9 1 class 0.0343
#> 10 1 i 0.0271
#> 11 1 are 0.0217
#> 12 1 for 0.0195
#> 13 1 his 0.0195
#> 14 1 not 0.0189
#> 15 1 if 0.0183
#> 16 2 she 0.0182
#> 17 2 professor 0.00348
#> 18 2 now 0.00272
#> 19 2 keep 0.00227
#> 20 2 science 0.00221
#> 21 2 offer 0.00203
#> 22 2 instructor 0.00201
#> 23 2 final 0.00165
#> 24 2 found 0.00161
#> 25 2 right 0.00154
#> 26 2 day. 0.00142
#> 27 2 print 0.00120
#> 28 2 mrs. 0.00117
#> 29 2 myself 0.00113
#> 30 2 able 0.00112
#> # … with 15 more rows
```

For models with a larger number of topics than illustrated here, it can be useful to extract the most probable subset of topics for each document using the `get_toptopics()` function. For example, the two most probable topics for each document can be retrieved. Results for the first three documents are shown.

```
head(get_toptopics(theta = theta_hat, ntopics = 2))
#> # A tibble: 6 × 3
#> doc topic prob
#> <dbl> <dbl> <dbl>
#> 1 1 1 0.799
#> 2 1 2 0.111
#> 3 2 1 0.908
#> 4 2 3 0.0501
#> 5 3 1 0.893
#> 6 3 2 0.0638
```

Model fit metrics of topic coherence and topic exclusivity can be computed using `get_coherence()` and `get_exclusivity()`. This can be useful, for example, when using cross-validation to determine the optimal number of topics. By default, coherence and exclusivity are computed for each topic, but a global measure can be defined, for example, by averaging over topics if desired.

```
get_coherence(beta_ = beta_hat, docs = docs_vocab$documents)
#> [1] -23.06239 -136.66708 -99.65906
```

```
get_exclusivity(beta_ = beta_hat)
#> [1] 9.989794 7.714247 4.146112
```
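The per-topic values above can be collapsed into single model-level numbers by averaging, which is convenient when comparing fits across different numbers of topics. A minimal sketch using simple unweighted means (other summaries are possible):

```
# Global (averaged-over-topics) coherence and exclusivity
mean(get_coherence(beta_ = beta_hat, docs = docs_vocab$documents))
mean(get_exclusivity(beta_ = beta_hat))
```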

To fit an SLDAX model, where the latent topics along with covariates are used to model an outcome, we need to further specify the regression model for the covariates of interest (the topics are automatically entered into the model additively). We also need to supply a data frame `data` containing the covariates and outcome of interest. Be careful to ensure that this is the same set of observations for which documents were provided previously. Missing data imputation methods for missing documents, covariates, or outcomes are currently not implemented; only complete data can be analyzed. We also change the model to be estimated to `model = "sldax"`.

```
set.seed(44680835)
fit_sldax <- gibbs_sldax(rating ~ I(grade - 1), data = teacher_rate,
                         m = 450, burn = 300, thin = 1,
                         docs = docs_vocab$documents,
                         V = vocab_len,
                         K = 3, model = "sldax", display_progress = TRUE)
```

Topic proportions and topic–word probabilities can be summarized with the same functions demonstrated above in the section on LDA. As we saw for the LDA model, we can interpret the three topics from the SLDAX model using term scores.

```
get_topwords(est_beta(fit_sldax), 15, docs_vocab$vocab, method = "termscore") %>%
print(n = 30)
#> # A tibble: 45 × 3
#> topic word termscore
#> <chr> <chr> <dbl>
#> 1 1 and 0.114
#> 2 1 the 0.107
#> 3 1 to 0.0877
#> 4 1 is 0.0744
#> 5 1 you 0.0743
#> 6 1 a 0.0681
#> 7 1 he 0.0487
#> 8 1 class 0.0365
#> 9 1 of 0.0350
#> 10 1 i 0.0262
#> 11 1 are 0.0204
#> 12 1 not 0.0201
#> 13 1 his 0.0189
#> 14 1 if 0.0186
#> 15 1 very 0.0181
#> 16 2 less 0.00190
#> 17 2 ton 0.00161
#> 18 2 success. 0.00125
#> 19 2 guy 0.00118
#> 20 2 able 0.00110
#> 21 2 print 0.00105
#> 22 2 enjoyable 0.00105
#> 23 2 funny, 0.000992
#> 24 2 pop 0.000983
#> 25 2 instructor 0.000959
#> 26 2 right 0.000927
#> 27 2 caring, 0.000905
#> 28 2 slow 0.000898
#> 29 2 can't 0.000883
#> 30 2 spanish 0.000875
#> # … with 15 more rows
```

Here we illustrate summarization of the regression relationships in the SLDAX model. The `post_regression()` function constructs a `coda::mcmc` object that can be further manipulated by the methods in the `coda` R package (Plummer et al., 2006), such as `summary.mcmc()`, as shown below. The burn-in length and thinning period are automatically reflected in these posterior summaries.

The results of `summary()` provide the posterior mean estimates, corresponding posterior standard deviations, Bayesian credible intervals, and the standard error and autocorrelation-adjusted standard error of the posterior mean estimates. See `?coda:::summary.mcmc` for further information.

Here, we see that a 1-unit improvement in a student's grade is associated with a decrease of roughly 0.3 in their rating of the instructor, holding the topical content of their comments constant. The posterior mean estimates of the regression coefficients for the topics (e.g., `topic1`) correspond to conditional mean ratings when a student received the lowest possible grade and *only* that topic was present in their written comment. To obtain meaningful topic "effects" associated with changing the prevalence of a given topic while holding all others constant, we need to estimate the posterior distribution of a contrast of the topic regression coefficients. The `post_regression()` function calculates these contrasts as \(c_k = \eta_k^{(t)} - (K - 1)^{-1} \sum_{j \ne k} \eta_j^{(t)}\), where \(\eta_k\) and \(\eta_j\) are regression coefficients for the topics (not for any of the covariates). These topic effects are labeled `effect_t1` and so on. Here, we see using the posterior means and credible intervals that Topic 1 is positively associated with instructor ratings holding a student's grade constant, Topic 2 is negatively associated with instructor ratings holding a student's grade constant, and Topic 3 does not appear to be associated with instructor ratings holding a student's grade constant. In practice, the topic interpretations should be inspected, and longer chains should be used along with convergence checks.

```
eta_post <- post_regression(fit_sldax)
summary(eta_post)
#>
#> Iterations = 301:450
#> Thinning interval = 1
#> Number of chains = 1
#> Sample size per chain = 150
#>
#> 1. Empirical mean and standard deviation for each variable,
#> plus standard error of the mean:
#>
#> Mean SD Naive SE Time-series SE
#> I(grade - 1) -0.2624 0.007971 0.0006508 0.0006508
#> topic1 4.8660 0.032105 0.0026213 0.0030664
#> topic2 3.8448 0.237901 0.0194245 0.0337216
#> topic3 4.2241 0.251126 0.0205044 0.0280480
#> effect_t1 0.8315 0.151761 0.0123912 0.0241455
#> effect_t2 -0.7003 0.313717 0.0256149 0.0455078
#> effect_t3 -0.1312 0.320977 0.0262077 0.0449124
#> sigma2 1.1340 0.028963 0.0023649 0.0023649
#>
#> 2. Quantiles for each variable:
#>
#> 2.5% 25% 50% 75% 97.5%
#> I(grade - 1) -0.2765 -0.2680 -0.26234 -0.25762 -0.2474
#> topic1 4.8042 4.8451 4.86125 4.89006 4.9256
#> topic2 3.4282 3.6604 3.82417 4.00860 4.2798
#> topic3 3.7542 4.0594 4.20768 4.41378 4.6294
#> effect_t1 0.5422 0.7124 0.83249 0.93923 1.1372
#> effect_t2 -1.2408 -0.9318 -0.69565 -0.50744 -0.1145
#> effect_t3 -0.7336 -0.3298 -0.09543 0.07387 0.4314
#> sigma2 1.0802 1.1143 1.13087 1.15058 1.1965
```
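Because the topic-effect contrast is linear in the coefficients, the posterior mean of `effect_t1` can be recovered (up to rounding) from the posterior means of the topic coefficients reported in the summary above. This is only an arithmetic check on the contrast definition, using the rounded values from the output:

```
# c_1 = eta_1 - mean(eta_2, eta_3), using the rounded posterior means above
eta_means <- c(topic1 = 4.8660, topic2 = 3.8448, topic3 = 4.2241)
unname(eta_means["topic1"] - mean(eta_means[c("topic2", "topic3")]))
#> [1] 0.83155
```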