To infer and select the optimal number of topics for the LDA model, we start from the document-term matrix (DTM), bearing in mind that a small *K* can produce broad, heterogeneous topics, while a large *K* produces very specific ones. `LDABiplots` obtains the optimal *K* from topic coherence, a measure of topic quality from the point of view of human interpretability. It rests on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts. The best number of topics is the one with the highest coherence. The calculation is based on probability theory and consists of fitting several models with different numbers of topics and computing the coherence of each. To run this search, set the range of topics to test in *Candidate number of topics K* in the *Inference* section. In the *Parameters section Gibbs sampling* control, select the number of Gibbs sampling iterations (*Iteration*) and the number of initial samples to discard (*Burn-in*), choosing a burn-in that is large enough.
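
As a rough illustration of the coherence idea described above, the following base R sketch scores a set of top terms with a probabilistic coherence: the average, over term pairs, of P(term_j | term_i) - P(term_j), estimated from document co-occurrence. This is one common definition, assumed here for illustration; it is not necessarily the exact computation `LDABiplots` performs. The toy matrix `dtm` and the function `prob_coherence()` are hypothetical.

```r
# Toy document-term matrix (hypothetical data): 4 documents x 4 terms
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    0, 0, 1, 1,
    0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("goal", "match", "vote", "party"))
)

# Probabilistic coherence of a topic's top terms (illustrative definition):
# mean over pairs i < j of P(term_j | term_i) - P(term_j)
prob_coherence <- function(dtm, top_terms) {
  present <- (dtm[, top_terms, drop = FALSE] > 0) * 1  # 1 if term occurs in doc
  p_term  <- colMeans(present)                         # P(term)
  p_joint <- crossprod(present) / nrow(present)        # P(term_i, term_j)
  pairs   <- utils::combn(length(top_terms), 2)        # all index pairs i < j
  scores  <- apply(pairs, 2, function(ix) {
    p_joint[ix[1], ix[2]] / p_term[ix[1]] - p_term[ix[2]]
  })
  mean(scores)
}

coh_sports <- prob_coherence(dtm, c("goal", "match"))  # terms that co-occur
coh_mixed  <- prob_coherence(dtm, c("goal", "vote"))   # terms that never co-occur
```

Under this definition, a set of terms that tend to appear in the same documents scores higher than a set that does not. Fitting one model per candidate *K* and comparing the average score across topics gives the kind of selection rule described above.
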
An *Alpha* hyperparameter value must also be selected. A high alpha value means that each document is likely to contain a mixture of most of the topics rather than a single one. A low alpha value means that a document is more likely to contain only a few of the topics, or even just one. Several authors have proposed rules to determine the value of alpha: *α = (0.1, 50/K)* (Griffiths and Steyvers, 2004), *(0.1, 0.1)* (Asuncion et al., 2009), and *(1/K, 1/K)* (Řehůřek and Sojka, 2010). By default, `LDABiplots` uses a value of *0.1*. See video 6.
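
The effect of alpha can be seen by simulating topic mixtures directly. The sketch below is not part of `LDABiplots`; it draws document-topic proportions from a symmetric Dirichlet distribution using base R, with a hypothetical helper `rdirichlet_sym()`, and compares how dominant the largest topic is under a low and a high alpha.

```r
set.seed(1)  # reproducible draws

# Draw n topic mixtures from a symmetric Dirichlet(alpha) over K topics,
# using normalized Gamma variates (base R has no Dirichlet sampler)
rdirichlet_sym <- function(n, K, alpha) {
  g <- matrix(rgamma(n * K, shape = alpha), nrow = n)
  g / rowSums(g)
}

K <- 4
theta_low  <- rdirichlet_sym(1000, K, alpha = 0.1)     # low alpha
theta_high <- rdirichlet_sym(1000, K, alpha = 50 / K)  # high alpha

# Average share of the dominant topic in each simulated document
dominant_low  <- mean(apply(theta_low, 1, max))
dominant_high <- mean(apply(theta_high, 1, max))
```

With the low alpha, most simulated documents are dominated by one topic; with the high alpha, the topics share each document more evenly.
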
According to the coherence obtained (0.069), the best number of topics was inferred to be 4. With this optimal *K*, the LDA model is generated from the DTM. The parameters must be defined as in the inference step; for the example, 100 iterations, a burn-in of 5, and an alpha of 0.1 were selected. `LDABiplots` returns two matrices. The first is the theta matrix, which shows an identifier of the news items of the analyzed newspapers in the columns and the distribution of topics over the analyzed documents in the rows. The second is the phi matrix, whose rows represent the distribution of words over the topics.
Both matrices can be downloaded in the *Tabular result* section. Before downloading, you can set the number of terms in *Select number of term*, the number of labels in *Select number of label*, and the value of the assignments in *Select Assignments*, which control how many words and labels are shown and downloaded.
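
To make the structure of the two matrices concrete, here is a minimal sketch with made-up values. The orientation used (topics in the rows of `phi`, documents in the rows of `theta`) is an assumption for illustration, and the matrices exported by the package may be laid out differently. The helper `top_terms()` is hypothetical.

```r
# Hypothetical phi: one row per topic, one column per term; each row sums to 1
phi <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.40, 0.45),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("t_1", "t_2"), c("goal", "match", "vote", "party"))
)

# Hypothetical theta: one row per document, one column per topic; rows sum to 1
theta <- matrix(
  c(0.9, 0.1,
    0.2, 0.8,
    0.5, 0.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("t_1", "t_2"))
)

# Top-n terms per topic, ordered by decreasing probability
top_terms <- function(phi, n) {
  t(apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[seq_len(n)]))
}

top2 <- top_terms(phi, 2)

# theta %*% phi gives the term distribution the model implies for each document
doc_term_prob <- theta %*% phi
```

Choosing the number of terms before downloading corresponds to the `n` argument in this sketch: it only limits how many of the highest-probability words are listed for each topic.
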
In *Wordcloud*, a word cloud shows which words carry the most weight in each topic. In *Heatmap*, a heat map shows the probability of each topic in each newspaper; the color scale indicates which topics are more prevalent in each digital newspaper.
In the *Cluster* tab, you can see the grouping of the topics found. Select the grouping method in *Agglomeration method*; the methods included in `LDABiplots` are *complete, single, Ward.D, Ward.D2, average, mcquitty, median,* and *centroid*. *Ward*'s minimum variance method aims to find compact, spherical groups. The *complete* method finds similar groups. The *single* method, which is closely related to the minimal spanning tree, adopts a *friend of friends* grouping strategy. The other methods can be thought of as targeting groups with features somewhere between the single and complete methods. The *median* and *centroid* methods do not lead to a monotonic distance measure or, equivalently, the resulting dendrograms can contain inversions that are hard to interpret. In the *type of plot* section, you can select the type of graph to display: *rectangle*, which draws rectangles around the branches of a dendrogram to highlight the corresponding groups; *circular*, which draws the dendrogram in a circular layout; and *phylogenic*, which shows through a phylogenetic tree how the topics are related to each other. A slider lets you select the number of clusters to form among the topics, and the plot can be downloaded in pdf or png format. See video 7.
Video 7. LDA Model and Representations
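
The clustering step described above can be sketched with base R's `hclust()`, whose `method` argument accepts the agglomeration methods listed (spelled `"ward.D"` and `"ward.D2"` in `hclust()`). The matrix below is hypothetical, and clustering topics by Euclidean distance between their word distributions is an assumption; the package may use a different input or distance.

```r
# Hypothetical phi matrix: 4 topics (rows) x 5 terms (columns)
phi <- rbind(
  t_1 = c(0.40, 0.35, 0.15, 0.05, 0.05),
  t_2 = c(0.38, 0.32, 0.20, 0.05, 0.05),
  t_3 = c(0.05, 0.05, 0.10, 0.40, 0.40),
  t_4 = c(0.05, 0.10, 0.05, 0.45, 0.35)
)

d  <- dist(phi)                      # Euclidean distances between topics
hc <- hclust(d, method = "ward.D2")  # any hclust method name can be used here
groups <- cutree(hc, k = 2)          # cut the dendrogram into 2 clusters

# Ward linkage gives non-decreasing merge heights (no inversions)
monotone <- !is.unsorted(hc$height)

# plot(hc) would draw the dendrogram; rect.hclust(hc, k = 2) adds the rectangles
```

Here `k` plays the role of the slider for the number of clusters, and swapping `method` for `"median"` or `"centroid"` is a quick way to see the inversions mentioned above on real data.
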
## Representations Biplot
Biplots approximate the distribution of a multivariate sample in a reduced-dimension space and superimpose on it representations of the variables on which the sample is measured. The graph displays the information in the rows (represented by points, the row markers) and in the columns (represented by vectors, the column markers). `LDABiplots` displays the Biplot results both graphically and in tabular form. First select the desired Biplot. In the *JK-Biplot*, the coordinates of the rows are their coordinates on the principal components and the coordinates of the columns are the eigenvectors of the covariance or correlation matrix; the Euclidean distances between row points in the Biplot approximate the Euclidean distances between rows in the multidimensional space. In the *GH-Biplot*, the coordinates of the rows are standardized and the distance between rows approximates the Mahalanobis distance in the multidimensional space. The *HJ-Biplot* achieves a high quality of representation for both rows and columns; because both have identical goodness of fit, the row-column relationship can be interpreted.
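
The three Biplots differ only in how the singular value decomposition of the centered data is split between row and column markers. The following sketch uses the textbook factorizations on a hypothetical random matrix; the `sqrt(n - 1)` scaling in the GH-Biplot is one common convention, and `LDABiplots` may scale its coordinates differently.

```r
set.seed(42)
X  <- matrix(rnorm(5 * 3), nrow = 5)          # hypothetical 5 x 3 data matrix
Xc <- scale(X, center = TRUE, scale = FALSE)  # column-centered data
n  <- nrow(Xc)

s <- svd(Xc)                                  # Xc = U D V'
U <- s$u; D <- diag(s$d); V <- s$v

# JK-Biplot: rows in principal coordinates, columns as eigenvectors
rows_jk <- U %*% D
cols_jk <- V

# GH-Biplot: standardized rows, columns scaled so that H H' = cov(X)
rows_gh <- sqrt(n - 1) * U
cols_gh <- V %*% D / sqrt(n - 1)

# HJ-Biplot: both rows and columns in principal coordinates
rows_hj <- U %*% D
cols_hj <- V %*% D
```

A two-dimensional plot keeps only the first two columns of each marker matrix, which is where the distances and products become approximations rather than exact identities.
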
For our example, the HJ-Biplot was selected. It is interpreted according to the following rules (see Figure 2).
```{r fig2, fig.cap='Figure 2. Interpretation of the HJ-Biplot', fig.align='center', echo=FALSE, out.width='50%'}
knitr::include_graphics("../inst/img/HJ-Biplot_Ing.PNG")
```
Where:

- The distances between row markers are interpreted as an inverse function of their similarities, so that neighboring markers are more similar.
- The length of the vectors (column markers) approximates the standard deviation of the newspapers.
- The cosines of the angles between the column markers approximate the correlations between the newspapers: acute angles indicate a high positive correlation, obtuse angles indicate a negative correlation, and right angles indicate uncorrelated variables.
- The order of the orthogonal projections of the points (row markers) onto a vector (column marker) approximates the order of the row elements in that column. The greater the projection of a point on a vector, the more that row element deviates from the mean of that newspaper.
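
The rules about vector lengths and angles can be checked numerically. This sketch, on hypothetical data with made-up column names, builds the column markers from the SVD and divides them by `sqrt(n - 1)` (an assumed scaling; without it the lengths are proportional to the standard deviations rather than equal to them). The helpers `column_markers()`, `marker_length()`, and `marker_cosine()` are illustrative names, not package functions.

```r
set.seed(7)
# Hypothetical data: 6 rows (e.g. topics) x 3 columns (e.g. newspapers)
X  <- matrix(rnorm(6 * 3), nrow = 6,
             dimnames = list(NULL, c("paper_A", "paper_B", "paper_C")))
Xc <- scale(X, center = TRUE, scale = FALSE)

# Column markers on the first `dims` axes, scaled by 1 / sqrt(n - 1)
column_markers <- function(Xc, dims) {
  s <- svd(Xc)
  H <- s$v[, seq_len(dims), drop = FALSE] %*% diag(s$d[seq_len(dims)], nrow = dims)
  rownames(H) <- colnames(Xc)
  H / sqrt(nrow(Xc) - 1)
}

# Vector lengths and cosines of the angles between vectors
marker_length <- function(H) sqrt(rowSums(H^2))
marker_cosine <- function(H) {
  len <- marker_length(H)
  tcrossprod(H) / tcrossprod(len)   # H H' / (|h_i| |h_j|)
}

H_full <- column_markers(Xc, dims = 3)  # all axes: relations hold exactly
H_2d   <- column_markers(Xc, dims = 2)  # plotted plane: approximations
```

With all axes retained, `marker_length(H_full)` matches the column standard deviations and `marker_cosine(H_full)` matches `cor(X)`. On the plotted plane the same quantities are approximations whose quality depends on the variance explained by the two axes.
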
Before generating the Biplot representation, you must indicate how the data matrix will be centered and scaled. `LDABiplots` offers four options: *scale*, *center*, *center_scale*, or *none*.
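
The sketch below shows what these four options could mean in terms of column-wise operations. The mapping is an assumption made for illustration (for instance, that *scale* divides each column by its standard deviation without centering); `transform_matrix()` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper mirroring the four options; the mapping is an assumption
transform_matrix <- function(X, method = c("center_scale", "center", "scale", "none")) {
  method <- match.arg(method)
  col_mean <- colMeans(X)
  col_sd   <- apply(X, 2, sd)
  switch(method,
    center       = sweep(X, 2, col_mean),                         # subtract column means
    scale        = sweep(X, 2, col_sd, "/"),                      # divide by column sd
    center_scale = sweep(sweep(X, 2, col_mean), 2, col_sd, "/"),  # both steps
    none         = X                                              # leave data untouched
  )
}

# Small made-up matrix with columns on very different scales
X <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 60))

X_center <- transform_matrix(X, "center")
X_scale  <- transform_matrix(X, "scale")
X_both   <- transform_matrix(X, "center_scale")
X_none   <- transform_matrix(X, "none")
```

Centering and scaling matter most when the columns have very different spreads, because otherwise the column with the largest variance dominates the first axes of the Biplot.
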
Clicking *run* generates the selected Biplot and its results in tabular form, which can be downloaded in different formats. The tabular results shown are:

- *Eigenvalues*: a vector with the eigenvalues.
- *Variance explained*: a vector containing the proportion of variance explained by the first 1, 2, ..., K principal components.
- *loadings*: the loadings of the principal components.
- *Coordinates of individuals*: a matrix with the coordinates of the individuals.
- *Coordinates of variables*: a matrix with the coordinates of the variables.
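
For readers who want to relate these tables to a familiar computation, the following sketch derives analogous quantities with base R's `prcomp()` on hypothetical data. It stands in for the package's own computation, and the exact scaling of the coordinates that `LDABiplots` reports depends on the Biplot selected.

```r
set.seed(3)
# Hypothetical data: 8 individuals x 4 variables
X <- matrix(rnorm(8 * 4), nrow = 8,
            dimnames = list(paste0("row", 1:8), paste0("var", 1:4)))

pca <- prcomp(X, center = TRUE, scale. = TRUE)

eigenvalues        <- pca$sdev^2                              # variance of each component
variance_explained <- cumsum(eigenvalues) / sum(eigenvalues)  # first 1, 2, ..., K components
loadings           <- pca$rotation                            # eigenvectors (variables x PCs)
coord_individuals  <- pca$x                                   # row scores on the components
coord_variables    <- sweep(loadings, 2, pca$sdev, "*")       # loadings scaled by sdev
```

Because the data are standardized here, the eigenvalues sum to the number of variables and `coord_variables` equals the correlation between each variable and each component.
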
In the *Biplot* tab, you will find the graphical representation generated with the previously selected parameters. The plot can be customized in the *Options to Customize the Biplot* section: you can change the *theme*, the axes to display in *Axis-X* and *Axis-Y*, the color of the column markers, and the color of the row markers. You can also change the size of the markers and add labels for both sets of markers in different sizes. The representation can be downloaded in png or pdf format. See video 8.
Video 8. Representations Biplot
## Citation
If you use `LDABiplots`, please cite it in your work as:
*Pilacuan-Bonete L., Galindo-Villardón P., De la Hoz-M J., & Delgado-Álvarez F. (2022). LDABiplots: Biplot Graphical Interface for LDA Models. R package version 0.1.2*
## References
*Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.*
*Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453-467.*
*Galindo-Villardón, P. (1986). Una alternativa de representación simultánea: HJ-Biplot (An alternative of simultaneous representation: HJ-Biplot). Qüestiió, 10, 13-23.*
*Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.*