The liminal R package is designed to help you understand and interrogate non-linear dimension reduction results via the (grand) tour and linked graphics. It offers two main functions for exploring high-dimensional datasets, limn_tour() and limn_tour_link(). Both of these functions create interactive visualisations that are displayed either in the RStudio Viewer pane or in a web browser through shiny (Chang et al. 2021).
A tour is a dynamic visualisation displayed as a smooth animation of low-dimensional projections. A high-dimensional dataset is projected onto a sequence of lower-dimensional targets (also called bases), enabling the tour to explore the space of all lower-dimensional projections. The way that targets are generated is called the tour path. In liminal we default to the grand tour, which generates random Gaussian targets, although any tour path available in the tourr package can be used (Wickham et al. 2011; Wickham and Cook 2021). For more details, see Lee et al. (2021) for a review of tour methods and Wickham et al. (2011) for their implementation in R.
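To make the idea of a target concrete, here is a minimal base R sketch that draws a few random orthonormal bases and projects a simulated dataset onto each of them. It is an illustration under stated assumptions, not how liminal or tourr implement the grand tour: the helper `random_basis()` and the toy matrix `X` are hypothetical, and a real tour also interpolates smoothly between successive targets rather than jumping from one to the next.

```r
# Sketch only: draw random orthonormal targets and project onto them.
set.seed(2021)

# Toy data (simulated): 200 observations on 5 numeric variables, column-centred
X <- scale(matrix(rnorm(200 * 5), ncol = 5), scale = FALSE)

# Hypothetical helper: a random p x d basis with orthonormal columns,
# obtained by orthonormalising a Gaussian matrix with a QR decomposition
random_basis <- function(p, d = 2) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

# Three targets; a tour would animate the path between them
targets <- replicate(3, random_basis(ncol(X)), simplify = FALSE)

# Each projection is an n x 2 matrix that can be drawn as a scatterplot
projections <- lapply(targets, function(B) X %*% B)
```

Plotting each element of `projections` in turn gives a crude, frame-by-frame impression of what the tour view animates.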
The liminal package comes with a built-in data.frame called fake_trees, a high-dimensional, tree-structured dataset. It consists of 3000 observations over 100 numeric variables. The tree has 10 branches and can in fact be embedded in two dimensions.
First, let’s generate principal components:
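library(liminal)
data("fake_trees")
# run PCA on every column except the last one (the branch labels)
pcs <- prcomp(fake_trees[, -ncol(fake_trees)])
# cumulative share of the total standard deviation per component
# (a variance-based share would use pcs$sdev^2 instead)
head(cumsum(pcs$sdev / sum(pcs$sdev)))
#> [1] 0.05569726 0.09768259 0.13586130 0.17200716 0.20405094 0.23398570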
We can then visualise the results as a scatter plot by augmenting the original data with the principal component scores:
library(ggplot2)
fake_trees <- dplyr::bind_cols(fake_trees, as.data.frame(pcs$x))
ggplot(fake_trees, aes(x = PC1, y = PC2, color = branches)) +
  geom_point() +
  scale_color_manual(values = limn_pal_tableau10())
We see some separation of the branches in the first two principal components. However, we can’t see each of the branches clearly, or how they relate to each other.
We can tour the components that represent most of the variation in the data to get a sense of the underlying structure.
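One way to judge how many components "represent most of the variation" is the cumulative proportion of variance, which uses the squared standard deviations returned by `prcomp()`. The sketch below runs on simulated data so that it is self-contained; the matrix `X` and the 90% threshold are assumptions made for illustration, not values taken from fake_trees.

```r
set.seed(42)

# Simulated stand-in for a numeric data matrix:
# three strong latent directions plus isotropic noise, 20 variables
n <- 300
signal <- matrix(rnorm(n * 3), ncol = 3) %*% matrix(rnorm(3 * 20), nrow = 3)
X <- signal + matrix(rnorm(n * 20, sd = 0.5), ncol = 20)

pca <- prcomp(X)

# Proportion of variance uses squared standard deviations
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the (hypothetical) 90% threshold
n_keep <- which(cum_var >= 0.90)[1]
n_keep
```

The same calculation applied to `pcs` would give a variance-based criterion for choosing the range passed to `cols`.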
The following code generates a shiny application that provides an interface to the tour:
# this loads a shiny app on the first fifteen PCs
limn_tour(fake_trees, cols = PC1:PC15, color = branches)
The interface consists of the tour view, which is a dynamic scatterplot, and an axis view, which shows the magnitude and direction of the generated targets.
From the tour view, we can see that the blue branch is hidden (after highlighting it and letting the animation play) and forms the backbone of the tree.
Brushing on the tour view is activated with the shift key plus a mouse drag. It will highlight points that fall inside the brush and pause the current view.
There are several additional interactions available on this view:
* There is a play button, that when pressed will start the tour.
* There is also a text view of the half range which is the maximum squared Euclidean distance between points in the tour view. The half range is a scale factor for projections and can be thought of as a way of zooming in and out on points. It can be dynamically modified by scrolling (via a mouse-wheel). To reset double click the tour view.
* The legend can be toggled to highlight groups of points with shift+mouse-click. Multiple groups can be selected in this way. To reset double click the legend title.
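The role of the half range as a zoom factor can be sketched in a few lines of base R. This is an illustration only: the half range is computed here as the largest distance of a centred point from the origin, which is an assumption for this sketch and may differ from the definition liminal uses internally, and `rescale_view()` is a hypothetical helper.

```r
set.seed(7)

# Simulated 2-d projection of 100 points, centred on the origin
proj <- scale(matrix(rnorm(100 * 2, sd = 3), ncol = 2), scale = FALSE)

# Assumed definition for this sketch: largest distance of a point from the origin
half_range <- max(sqrt(rowSums(proj^2)))

# Hypothetical helper: divide the projection by the half range
rescale_view <- function(proj, half_range) proj / half_range

full_view <- rescale_view(proj, half_range)      # every point within the unit disc
zoomed_in <- rescale_view(proj, half_range / 2)  # halving the half range doubles the spread
```

Decreasing the half range pushes points apart on screen (zooming in), and increasing it pulls them together (zooming out).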
We can also compare this to a t-SNE embedding run with default settings:
set.seed(2099)
tsne <- Rtsne::Rtsne(dplyr::select(fake_trees, dplyr::starts_with("dim")))
tsne_df <- data.frame(tsneX = tsne$Y[, 1],
                      tsneY = tsne$Y[, 2])
ggplot(tsne_df, aes(x = tsneX, y = tsneY, color = fake_trees$branches)) +
  geom_point() +
  scale_color_manual(values = limn_pal_tableau10())
The embedding does not fully preserve the topology of the tree: the blue branch is now broken into two distinct pieces.
We can see where our embedding is different via a linked tour:
limn_tour_link(embed_data = tsne_df,
               tour_data = fake_trees,
               cols = PC1:PC10, # tour columns to select
               color = branches # variable to highlight across both views; can come from either data frame
)
This function requires two tables, which are linked together in separate views. The tour interface is the same as above, except that brushing on the tour view now also highlights the corresponding points in the scatterplot on the right-hand side. That scatterplot, which shows the embedding, is itself interactive.
Brushing on the right view is activated via click and drag movements.
By brushing on the right, we can see where t-SNE has broken up the global structure in the data and distorted the distances between points. For either interface, you can assign the result to an R object, which will be a list consisting of the selected basis (target) and the brushing bounding boxes:
res <- limn_tour_link(embed_data = tsne_df,
                      tour_data = fake_trees,
                      cols = PC1:PC10, # tour columns to select
                      color = branches # variable to highlight across both views; can come from either data frame
)
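A brushing bounding box can be turned back into a set of rows with a simple filter, and because both tables describe the same observations, the same row indices pick out the linked points in the other table. The sketch below shows the idea on simulated data. The names `brush`, `in_brush()`, `embed_df` and `tour_df`, and the layout of the bounding box, are hypothetical and are not meant to reflect the structure of the list returned by liminal.

```r
set.seed(11)
n <- 50

# Two simulated tables describing the same observations, in the same row order
embed_df <- data.frame(x = runif(n), y = runif(n))
tour_df  <- data.frame(PC1 = rnorm(n), PC2 = rnorm(n), PC3 = rnorm(n))

# Hypothetical brush: a bounding box in the coordinates of the embedding view
brush <- c(xmin = 0.2, xmax = 0.6, ymin = 0.1, ymax = 0.5)

# Hypothetical helper: logical vector marking rows that fall inside the box
in_brush <- function(df, box) {
  df$x >= box[["xmin"]] & df$x <= box[["xmax"]] &
    df$y >= box[["ymin"]] & df$y <= box[["ymax"]]
}

selected <- in_brush(embed_df, brush)

# The same row indices identify the linked points in the other table
linked_rows <- tour_df[selected, , drop = FALSE]
```

Inspect the object returned by your own session (for example with `str(res)`) to see how the basis and bounding boxes are actually stored before adapting this.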