---
title: "Tutorial LDABiplots"
author: "Luis Pilacuan-Bonete, Purificación Galindo-Villardón, Javier De La Hoz-Maestre and Francisco Javier Delgado Álvarez"
#date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tutorial_LDABiplots_English}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

![LDABiplots](../inst/img/LOGOBIPLOTS.png)

# Introduction
`LDABiplots` is an extraction, analysis, and visualization tool for the exploratory analysis of news published on the web by digital newspapers. By extracting data from the web (Bradley et al. 2019), it allows fitting the Latent Dirichlet Allocation (LDA) probabilistic model (Blei, Ng and Jordan, 2003) and generating Biplot (Gabriel K.R, 1971) and HJ-Biplot (Galindo-Villardón P, 1986) visualizations of the main topics in the headlines of the news published on the web. `LDABiplots` streamlines the data extraction from the web, the LDA modeling routine, and the generation of Biplot visualizations in an interactive way for users who are not familiar with R.
# Download & installation

To install the stable version from the Comprehensive R Archive Network (CRAN):

```{r}
# install.packages("LDABiplots")
# library(LDABiplots)
```

Once the library is loaded, start the web interface by typing in the R console:

```{r}
# runLDABiplots()
```

# Import or Load Data
`LDABiplots` extracts data from the news section of the Google search engine (*www.google.com*). For users who obtain their data from a different source, `LDABiplots` also allows loading files in Excel format.
## Importing Data from File
To import data from a file, go to the *Import or Load Data* tab and choose the *Import excel file* option. Select the file to upload with *Browse*, then choose the worksheet that holds the data in *Worksheet Name*. The file must have the header and format shown in Figure 1.

```{r fig1, fig.cap='Figure 1. File format to import', fig.align='center', echo=FALSE, out.width='90%'}
knitr::include_graphics("../inst/img/Encabezado.png")
```
Video 1. Importing Data
## Data Extraction from the WEB
Before extracting news with `LDABiplots`, it is recommended to open Google's *Advanced Search* settings in your browser and set the region and the preferred search language for the news section, which gives better extraction results (Video 2).
Video 2. Browser Configuration
Data can be extracted by web scraping directly in `LDABiplots`. In the *Import or Load Data* tab, choose the *Load web data* option and enter the search keywords (use at most 4 keywords for better search performance). Then select the search language in *Choose Language* and the number of result pages to extract; by default Google shows 10 news items per page. Select *Run* to execute the search and extraction (Video 3).
Video 3. Webscraping News from the WEB
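
As a rough illustration of how keywords, language and pagination combine into a search, the sketch below builds Google News result URLs in base R. The helper `build_news_urls()` is hypothetical and is not part of `LDABiplots`; how the package issues its requests internally is not described in this tutorial. The sketch assumes the public Google query parameters `q`, `tbm=nws`, `hl` and `start`, and it only builds the URLs without downloading anything.

```r
# Hypothetical helper (not an LDABiplots function): build one search URL
# per result page, assuming 10 news items per page.
build_news_urls <- function(keywords, language = "en", pages = 1) {
  # Join the keywords into one query string and percent-encode it.
  query <- utils::URLencode(paste(keywords, collapse = " "), reserved = TRUE)
  # Offset of the first result on each page: 0, 10, 20, ...
  starts <- (seq_len(pages) - 1) * 10
  paste0(
    "https://www.google.com/search?q=", query,
    "&tbm=nws&hl=", language,
    "&start=", starts
  )
}

urls <- build_news_urls(c("covid", "coronavirus", "France"),
                        language = "en", pages = 3)
print(urls)
```

Requesting three pages in this sketch corresponds to roughly 30 headlines, which is why the pagination number directly controls the size of the corpus.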
# Application Example
To illustrate how `LDABiplots` works, we extract from the web the news related to "*covid, coronavirus, France*", as shown in Video 3. The extraction produces two tables: one lists the newspapers with their respective number of news items, and the other lists each newspaper with the headlines of its news. Both tables can be downloaded from the application in various formats. The data are then processed as follows:

- **Selection of Digital Newspapers to Analyze**: it is recommended to select the newspapers with the most news items.
- **Inclusion of n-grams**: an n-gram is a contiguous sequence of n words. Depending on your study, select *unigrams, bigrams, or trigrams*; bigrams are selected for the example.
- **Remove numbers**: removes numbers when they are not informative.
- **Select language for stopwords**: stopwords are words that have no lexical meaning and appear with high frequency in the news, such as articles or pronouns. Select the language in which the news was extracted, English in this example.
- **Add stopword words**: removes additional words that are highly frequent but add no value to the study; in the example the words *covid* and *France* were added.
- **Select Lemmatization**: reduces words to their base form. It should be used with caution; in the example, lemmatization was not selected.
- **Selecting Sparsity**: removes terms that appear in very few news items before the models are generated, which improves computational performance by discarding information that does not contribute to the model. In our case a sparsity of 0.985 (98.5%) was used, that is, the DTM is generated with the terms that appear in at least 1.5% of the news headlines.
- **Create DTM**: finally, the document-term matrix (DTM) is generated. Once the process finishes, the dimensions of the DTM are shown in a summary table (see Video 4).
Video 4. Obtaining DTM matrix
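
The preprocessing steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code that `LDABiplots` runs: the headlines, the stopword list and the helper `bigrams()` are hypothetical, and the sparsity threshold is set to 0.70 instead of 0.985 only because the toy corpus has four documents. Whether the package treats the threshold as a strict or non-strict inequality is also an assumption here.

```r
# Hypothetical headlines; not data from the tutorial.
headlines <- c(
  "france reports new coronavirus cases",
  "new coronavirus cases rise in paris",
  "vaccine rollout begins in france",
  "paris hospitals report new coronavirus cases"
)
stopwords_extra <- c("in", "france")  # illustrative stopword list

# Split a headline into words, drop stopwords, and join neighbours into bigrams.
bigrams <- function(text, stopwords) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1), sep = "_")
}

tokens <- lapply(headlines, bigrams, stopwords = stopwords_extra)
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per headline, one column per bigram.
dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(dtm) <- vocab

# Sparsity filter: keep terms present in at least (1 - sparsity) of the documents.
sparsity <- 0.70
doc_share <- colMeans(dtm > 0)
dtm_filtered <- dtm[, doc_share >= 1 - sparsity, drop = FALSE]

dim(dtm)           # dimensions before filtering
dim(dtm_filtered)  # dimensions after filtering
```

With a threshold of 0.985, the same rule keeps every term that occurs in at least 1.5% of the headlines, which is the setting used in the example of this tutorial.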
## Visualization of Terms Matrix
After processing the original data of the selected newspapers, the resulting matrix contains 402 unique terms out of the 444 in the original corpus, which improves computational performance in the following analyses.

The DTM can be viewed as a table by selecting *Document Term Matrix Visualization* in the *Data* tab, and can be downloaded in different formats, such as Excel, CSV, and pdf. This table shows the frequency of each term, the number of documents in which the term appears, and the IDF (inverse document frequency), which is a measure of the importance of the term.

The *Barplot* option generates an ordered bar graph in which the words are displayed according to their frequency. The color of the bars can be changed, and the graph can be downloaded in various formats with the *export* button.

The *Worcloud* option shows a word cloud, which can be modified by selecting the number of words to show. The graph can be downloaded in various formats with the *export* button.

*Co-occurrence* displays a word co-occurrence plot, which draws the sparse term correlations as a graph structure. It is based on the glasso procedure (Lasso Plot), which reduces the correlation matrix and keeps only the relevant correlations between terms. The *Select Number* option sets the number of terms for the correlation graph, and *Download the plot* downloads the graph in png or pdf format. See Video 5.
Video 5. DTM matrix visualization
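
The three columns of the term table (term frequency, document frequency and IDF) can be reproduced from any DTM with a few lines of base R. The matrix below is hypothetical, and the definition `idf = log(N / doc_freq)` is an assumption: it is a common convention, but the exact formula used by `LDABiplots` is not stated in this tutorial.

```r
# Hypothetical 4 x 3 document-term matrix (rows: headlines, columns: terms).
dtm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 0, 1,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("cases", "vaccine", "paris"))
)

term_summary <- data.frame(
  term = colnames(dtm),
  term_freq = colSums(dtm),     # total occurrences of each term
  doc_freq = colSums(dtm > 0),  # number of documents containing the term
  row.names = NULL
)

# Assumed definition: idf = log(N / doc_freq), with N the number of documents.
term_summary$idf <- log(nrow(dtm) / term_summary$doc_freq)

# Sort by frequency, the order in which an ordered bar plot shows the terms.
term_summary <- term_summary[order(-term_summary$term_freq), ]
print(term_summary)
```

Under this definition, a term found in few documents gets a high IDF, while a term present in almost every document gets an IDF close to zero.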
## Selection of the Optimal k
To infer the optimal number of topics for the LDA model, we start from the DTM, bearing in mind that a small *K* can produce broad, heterogeneous topics, while a large *K* produces very specific topics. `LDABiplots` obtains the optimal K from topic coherence, a measure of topic quality from the point of view of human interpretability. It rests on the distributional hypothesis, which states that words with similar meanings tend to occur in similar contexts. Several models with different numbers of topics are fitted, the coherence of each is computed, and the best number of topics is the one with the highest coherence.

To obtain this number, set up the models you want to compare in the *Inference* section. In *Candidate number of topics K*, define the range of topics to test. In the *Parameters section Gibbs sampling* control, select the number of Gibbs sampling iterations (*Iteratition*) and the number of initial samples to discard (*Burn-in*), choosing a burn-in that is large enough.

An *Alpha* hyperparameter value must also be selected. A high alpha value means that each document is likely to contain a mixture of most topics rather than a single particular topic. A low alpha value means that a document is more likely to contain a mixture of only a few, or even just one, of the topics. Several authors have proposed rules for setting alpha: *α = (0.1, 50/K) (Griffiths et al., 2004), (0.1, 0.1) (Asuncion et al., 2009) and (1/K, 1/K) (Rehůřek and Sojka, 2010)*. By default `LDABiplots` uses *0.1* (see Video 6).
Video 6. Obtaining the optimal k
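
The selection rule described above (pick the candidate K with the highest coherence, then look at the alpha suggested by each rule) can be sketched in a few lines. The coherence values below are hypothetical apart from 0.069 for K = 4, which is the value reported in the example of this tutorial. Reading the cited rules as alpha = 50/K, alpha = 0.1 and alpha = 1/K is an assumption about how the pairs are meant.

```r
# Coherence per candidate K. Only the value for K = 4 comes from the tutorial;
# the other values are hypothetical and serve only to illustrate the rule.
coherence <- c("2" = 0.041, "3" = 0.055, "4" = 0.069, "5" = 0.060, "6" = 0.048)

# The best K is the candidate with the highest coherence.
best_k <- as.integer(names(coherence)[which.max(coherence)])

# Alpha suggested by each rule for the selected K (assumed reading of the rules).
alpha_rules <- c(
  griffiths_steyvers = 50 / best_k,
  asuncion = 0.1,
  rehurek_sojka = 1 / best_k
)

print(best_k)
print(alpha_rules)
```

The spread between the rules is large for small K, so it is worth checking how sensitive the resulting topics are to the chosen alpha.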
## Obtaining LDA Model
Once the candidate models were evaluated, the highest coherence obtained was 0.069, from which the best number of topics was inferred to be 4. With this optimal K, the LDA model is generated from the DTM. The parameters are defined as in the inference step: for the example, 100 iterations, a Burn-in of 5 and an Alpha of 0.1 were selected, after evaluating the optimal K according to the rules above.

`LDABiplots` returns two matrices. The first is the Theta matrix, which shows in the columns an identifier of the news of the analyzed newspapers and in the rows the distribution of topics in the analyzed documents. The second is the phi matrix, whose rows represent the distribution of words over the topics. Both matrices can be downloaded in the *Tabular result* section. Before downloading, you can set the number of terms in *Select number of term*, the number of labels in *Select number of label*, and the value of the assignments in *Select Assignments*, to control the number of words and labels that you want to inspect and download.

In *Worcloud*, a word cloud shows which words have the greatest weight in each topic. In *Heatmap*, a heat map shows the probabilities of the topics in each newspaper; the color scale indicates which topics are most present in each digital newspaper.

The *Cluster* tab shows the grouping of the topics found. The grouping method is selected in *Agglomeration method*; the methods included in `LDABiplots` are *complete, single, Ward.D, Ward.D2, average, mcquitty, median, centroid*. *Ward*'s minimum variance method aims to find compact, spherical groups. The *complete* method finds similar groups. The *single* method, which is closely related to the minimal spanning tree, adopts a *friends of friends* grouping strategy. The other methods can be thought of as targeting groups with features somewhere between the single and complete methods. The *median* and *centroid* methods do not lead to a monotonic distance measure or, equivalently, the resulting dendrograms can contain inversions that are hard to interpret.

In the *type of plot* section, you can select the type of graph to display: *rectangle*, which draws rectangles around the branches of a dendrogram to highlight the corresponding groups; *circular*, which lays the dendrogram out in a circle; and *phylogenic*, which shows through a phylogenetic tree how the topics are related to each other. A slider selects the number of clusters to form among the topics. The plot can be downloaded in pdf or png format. See Video 7.
Video 7. LDA Model and Representations
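
To make the *Cluster* tab more concrete, the sketch below groups topics with base R's `hclust()`, whose `method` argument accepts the agglomeration names listed above (spelled `"ward.D"` and `"ward.D2"` in R). The phi matrix is hypothetical, and clustering the rows of phi with Euclidean distance is an assumption; this tutorial does not state which matrix or distance `LDABiplots` uses internally.

```r
# Hypothetical phi matrix: 4 topics (rows) x 5 terms (columns); rows sum to 1.
phi <- rbind(
  t_1 = c(0.40, 0.30, 0.10, 0.10, 0.10),
  t_2 = c(0.35, 0.35, 0.10, 0.10, 0.10),
  t_3 = c(0.05, 0.05, 0.50, 0.30, 0.10),
  t_4 = c(0.05, 0.10, 0.45, 0.30, 0.10)
)
colnames(phi) <- c("cases", "deaths", "vaccine", "dose", "paris")

# Agglomerative clustering of the topics (assumed: Euclidean distance on phi).
tree <- hclust(dist(phi), method = "ward.D2")

# Cut the dendrogram into 2 groups, like choosing the number of clusters.
groups <- cutree(tree, k = 2)
print(groups)

# Top 2 terms of each topic, to help label the groups.
top_terms <- apply(phi, 1, function(p) names(sort(p, decreasing = TRUE))[1:2])
print(top_terms)
```

Changing `method` to `"single"` or `"complete"` on the same matrix is a quick way to see how much the grouping depends on the agglomeration rule.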
## Biplot Representations
Biplot graphs approximate the distribution of a multivariate sample in a space of reduced dimension and superimpose on it representations of the variables on which the sample is measured. The graph displays the information of the rows (represented by points, the row markers) and of the columns (represented by vectors, the column markers). `LDABiplots` displays the results of the Biplots both graphically and in tabular form. Three Biplots are available:

- The *JK-Biplot*, where the coordinates of the rows are the coordinates on the principal components and the coordinates of the columns are the eigenvectors of the covariance or correlation matrix. The Euclidean distances between row points in the Biplot approximate the Euclidean distances between rows in multidimensional space.
- The *GH-Biplot*, where the coordinates of the rows are standardized and the distance between rows approximates the Mahalanobis distance in multidimensional space.
- The *HJ-Biplot*, which achieves a high quality of representation for both rows and columns. Because both have identical goodness of fit, the row-column relationship can be interpreted.

For our example, the HJ-Biplot was selected. Its interpretation follows the rules illustrated in Figure 2.
```{r fig2, fig.cap='Figure 2. Interpretation of the HJ-Biplot', fig.align='center', echo=FALSE, out.width='50%'}
knitr::include_graphics("../inst/img/HJ-Biplot_Ing.PNG")
```

Where:
- The distances between row markers are interpreted as an inverse function of their similarity, so neighboring markers are more similar.
- The length of the vectors (column markers) approximates the standard deviation of the newspapers.
- The cosines of the angles between column markers approximate the correlations between the newspapers: acute angles indicate a high positive correlation, obtuse angles a negative correlation, and right angles uncorrelated variables.
- The order of the orthogonal projections of the points (row markers) onto a vector (column marker) approximates the order of the row elements in that column. The greater the projection of a point on a vector, the more that row element deviates from the mean of that newspaper.

Before generating the Biplot representation, choose how the matrix will be centered and scaled. `LDABiplots` offers 4 options: *scale*, *center*, *center_scale*, or *none*. Clicking *run* generates the selected Biplot together with results in tabular form, which can be downloaded in different formats. The tabular results are:

- *Eigenvalues*: the vector of eigenvalues.
- *Variance explained*: a vector with the proportion of variance explained by the first 1, 2, ..., K principal components.
- *loadings*: the loadings of the principal components.
- *Coordinates of individuals*: matrix with the coordinates of the individuals.
- *Coordinates of variables*: matrix with the coordinates of the variables.

The *Biplot* tab shows the graphical representation generated with the previously selected parameters. The graphic can be customized in the *Options to Customize the Biplot* section: the *theme*, the axes to display in *Axis-X* and *Axis-Y*, and the colors of the column markers and of the row markers. You can also change the size of the markers and add labels for both kinds of markers in different sizes. The representation can be downloaded in png or pdf format. See Video 8.
Video 8. Biplot Representations
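
The three Biplots differ only in how the singular value decomposition of the centered matrix is split between row and column markers. The sketch below computes the three sets of coordinates with base R's `svd()`. The matrix `X` (topics in rows, newspapers in columns) is hypothetical, only the *center* option is illustrated, and constant scaling factors that some formulations apply to the GH-Biplot are omitted; this is not the internal code of `LDABiplots`.

```r
# Hypothetical matrix: 4 topics (rows) x 3 newspapers (columns).
X <- matrix(
  c(0.50, 0.10, 0.30,
    0.20, 0.30, 0.25,
    0.20, 0.15, 0.35,
    0.10, 0.45, 0.10),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("t_", 1:4), c("paper_A", "paper_B", "paper_C"))
)

# Column centering only, corresponding to the "center" option.
Xc <- scale(X, center = TRUE, scale = FALSE)

# Singular value decomposition: Xc = U D V'.
s <- svd(Xc)
U <- s$u
D <- diag(s$d)
V <- s$v

jk_rows <- U %*% D; jk_cols <- V        # JK-Biplot: rows are principal component scores
gh_rows <- U;       gh_cols <- V %*% D  # GH-Biplot: standardized rows (up to a constant)
hj_rows <- U %*% D; hj_cols <- V %*% D  # HJ-Biplot: both sets of markers scaled by D

# Eigenvalues and proportion of variance explained by each axis.
eigenvalues <- s$d^2
explained <- eigenvalues / sum(eigenvalues)
print(round(explained, 3))

# Length of each HJ column marker versus the standard deviation of the column.
marker_length <- sqrt(rowSums(hj_cols^2) / (nrow(X) - 1))
print(cbind(marker_length, sd = apply(X, 2, sd)))
```

The last table shows the second interpretation rule at work: when all axes are retained, the length of each column marker, divided by the square root of n - 1, equals the standard deviation of that column, and in a two-dimensional plot it approximates it.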
## Citation

If you use `LDABiplots`, please cite it in your work as:
*Pilacuan-Bonete L., Galindo-Villardón P., De la Hoz-M J., & Delgado-Álvarez F.(2022). LDABiplots: Biplot Graphical Interface for LDA Models. R package version 0.1.2*
## References
*Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.*

*Gabriel, K. R. (1971). The biplot graphic display of matrices with application to principal component analysis. Biometrika, 58(3), 453-467.*

*Galindo-Villardón, P. (1986). Una alternativa de representación simultánea: HJ-Biplot (An alternative of simultaneous representation: HJ-Biplot). Qüestiió, 10, 13-23.*

*Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228-5235.*