---
title: "Basic functionality: Visualization, domain detection, and spatial heterogeneity"
date: '`r format(Sys.Date())`'
output: rmarkdown::html_vignette
bibliography: vignette_references.bib
vignette: >
  %\VignetteIndexEntry{spatialge_tutorial_main}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r vig_settings, include=F}
knitr::opts_chunk$set(
  collapse=TRUE,
  comment="#>",
  fig.width=8, fig.height=5
)
options(rmarkdown.html_vignette.check_title=F)
```

The package ***`spatialGE`*** provides a collection of tools for the visualization of gene expression from spatially-resolved transcriptomic experiments. The data input methods have been designed so that any data can be analyzed as long as it contains gene expression counts per spot or cell, and the spatial coordinates of those spots or cells. Specialized data input options are available to allow interoperability with workflows such as [*Space Ranger*](https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/what-is-space-ranger).

This tutorial covers basic analysis possible within spatialGE. Tutorials explaining more advanced spatial analyses and other data input types can be found in the documentation site of the [spatialGE GitHub repository](https://fridleylab.github.io/spatialGE/). A point-and-click, user-friendly interface is also available at https://spatialge.moffitt.org.

## Installation

The `spatialGE` repository is available at the Comprehensive R Archive Network (CRAN) repository and can be installed via `install.packages`:

```{r cran_install}
# install.packages('spatialGE')
```

A development version is available at GitHub and can be installed via `devtools`:
```{r install_devtools}
# if("devtools" %in% rownames(installed.packages()) == FALSE){
#   install.packages("devtools")
# }

#devtools::install_github("fridleylab/spatialGE")
```

To use `spatialGE`, load the package.

```{r load_spatialGE, message=F}
library('spatialGE')
```

## Spatially-resolved expression of triple negative breast cancer tumor biopsies

To show the utility of some of the functions in `spatialGE`, we use the spatial transcriptomics data set generated with the platform Visium by @bassiouni_2023. This data set includes triple negative breast cancer biopsies from 22 patients, with two tissue slices per patient. The Visium platform allows gene expression quantitation in approximately 5000 capture areas (i.e., "spots") separated each other by 100μM. The spots are 55μM in diameter, which corresponds to 1-10 cells approximately, depending on cell size.

Data for all the tissue slices are available at the Gene Expression Omnibus (GEO). For the purpose of this tutorial, we will use eight samples from four patients. 

The GEO repositories can be accessed using the following links:

-   [sample_117d](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433599)
-   [sample_117e](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433600)
-   [sample_120d](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433611)
-   [sample_120e](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433612)
-   [sample_092a](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433585)
-   [sample_092b](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433586)
-   [sample_094c](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433593)
-   [sample_094d](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6433594)

For convenience, the data is also available at our [spatialGE_Data GitHub repository](https://github.com/FridleyLab/spatialGE_Data). During this tutorial, we will pull the data from this repository. Nonetheless, users are eoncouraged to download the data and explore its contents in order to familiarize with the input formats accepted by `spatialGE`. For each sample directory, the file `filtered_feature_bc_matrix.h5` contains the gene expression counts. The `spatial` sub-directory contains the `tissue_positions_list.csv`, `tissue_hires_image.png`, and `scalefactors_json.json` files. This file structure, corresponds roughly to the file structure in outputs generated by Space Ranger, the software provided by 10X Genomics to process raw Visium data.


```{r extrenal_figs1, echo=FALSE, fig.align='center'}
knitr::include_graphics("img/dir_structure.png")
```


In this tutorial, data will be deposited in a temporary directory. The data is downloaded from this link:

```{r data_download1}
lk='https://github.com/FridleyLab/spatialGE_Data/raw/refs/heads/main/tnbc_bassiouni.zip?download='
```

The following code block creates a temporary directory:

```{r temp_dir}
visium_tmp = tempdir()
unlink(visium_tmp, recursive=TRUE)
dir.create(visium_tmp)
```

The data will be downloaded to the temporary directory. The data is compressed in a 
zip file, which will be unzipped into a sub-directory called `tnbc_bassiouni`:

```{r data_download2}
tryCatch({ # In case data is not available from network
  download.file(lk, destfile=paste0(visium_tmp, '/', 'tnbc_bassiouni.zip'), mode='wb')
  zip_tmp = list.files(visium_tmp, pattern='tnbc_bassiouni.zip$', full.names=TRUE)
  unzip(zipfile=zip_tmp, exdir=thrane_tmp)
  net_check <<- TRUE
}, error = function(e) {
  message("Could not download data. Are you connected to the internet?")
  net_check <<- FALSE
})
```

## Creating an STList (Spatial Transcriptomics List)

In `spatialGE`, raw and processed data are stored in an ***STlist*** (S4 class object). As previously mentioned, an STlist can be created with the function `STlist`, from a variety of data formats (see [here](https://fridleylab.github.io/spatialGE/reference/STlist.html) for more info or type `?STlist` in the R console). In this tutorial we will provide the file paths to the folders downloaded in the previous step.

Additionally, we will input meta data associated with each sample. The meta data is provided in the form of a comma-delimited file. We have extracted some of the clinical meta data for the eight samples from the original publication and saved it as part of the `spatialGE` package. The user is encouraged to look at the structure of this file by downloading it from the [spatialGE_Data GitHub repository](https://github.com/FridleyLab/spatialGE_Data). The most important aspect when constructing this file is that the sample names are in the first column, and they match the names of the folders containing the data:


```{r extrenal_figs2, echo=FALSE, fig.align='center', eval=net_check}
knitr::include_graphics("img/metadata_file.png")
```


From the temporary directory, we can use R to generate the file paths to be passed to the `STlist` function:

```{r data_fpaths, eval=net_check}
visium_folders <- list.dirs(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE)
```

The meta data can be accessed directly from the `spatialGE` package installed in the computer like so:

```{r data_fpaths2, eval=net_check}
clin_file <- list.files(paste0(visium_tmp, '/tnbc_bassiouni'), full.names=TRUE, recursive=FALSE, pattern='clinical')
```

We can load the files into an STlist using this command:

```{r create_stlist, warning=F, eval=net_check}
tnbc <- STlist(rnacounts=visium_folders, samples=clin_file, cores=2)
```

The `tnbc` object is an STlist that contains the count data, spot coordinates, and clinical meta data.

```{r call_stlist, eval=net_check}
tnbc
```

As observed, by calling the `tnbc` object, information on the number of spots and genes per sample is displayed. For count statistics, the `summarize_STlist` function can be used:

```{r count_stats, eval=net_check}
summarize_STlist(tnbc)
```

The minimum number of counts per spot is 42, which seems low. We can look at the distribution of counts and genes per spot using the `distribution_plots` function:

```{r count_dstr, eval=net_check}
cp <- distribution_plots(tnbc, plot_type='violin', plot_meta='total_counts')
cp[['total_counts']]
```

Now, let us remove spots with low counts by keeping only those spots with at least 5000 counts. We will also restrict the data set to spots expressing at least 1000 genes. We also will remove a few spots that have abnormally large number of counts, as in sample_094c. These criteria are not a rule, and samples in each study have to be carefully examined. For example, this criteria may be not enough to reduce the differences in counts, especially for sample_120d and samples_120e.

We run this filter with the `filter_data` function:

```{r filter_chunk, eval=net_check}
tnbc <- filter_data(tnbc, spot_minreads=5000, spot_mingenes=1000, spot_maxreads=150000)

cp2 <- distribution_plots(tnbc, plot_type='violin', plot_meta='total_counts')
cp2[['total_counts']]
```

## Exploring variation between spatial arrays

The functions `pseudobulk_samples` and `pseudobulk_pca_plot` are the initial steps to obtain a quick snapshot of the variation in gene expression among samples. The function `pseudobulk_samples` creates (pseudo) "bulk" RNAseq data sets by combining all counts from each sample. Then, it log transforms the "pseudo bulk" RNAseq counts and runs a Principal Component Analysis (PCA). Note that the spatial coordinate information is not considered here, which is intended only as an exploratory analysis analysis. The `max_var_genes` argument is used to specify the maximum number of genes used for computation of PCA. The genes are selected based on their standard deviation across samples.

```{r pseudobulk, eval=net_check}
tnbc <- pseudobulk_samples(tnbc, max_var_genes=5000)
```

In this case, we apply the function to look for agreement between samples from the same patient: It is expected that tissue slices from the same patient are more similar among them than tissue slices from other patients. The `pseudobulk_pca` allows to map a sample-level variable to the points in the PCA by including the name of the column from the sample metadata (`patient_id` in this example).

```{r pca_chunk, fig.align='center', eval=net_check}
pseudobulk_dim_plot(tnbc, plot_meta='patient_id')
```

Users can also generate a heatmap from pseudobulk counts by calling the `pseudobulk_heatmap` function, which also requires prior use of the `pseudobulk_samples` function. The number of variable genes to show can be controlled via the `hm_display_genes` argument.

```{r heatmap_chunk, fig.align='center', eval=net_check}
hm_p <- pseudobulk_heatmap(tnbc, plot_meta='patient_id', hm_display_genes=30)
```

## Transformation of spatially-resolved transcriptomics data

Many transformation methods are available for RNAseq count data. In `spatialGE`, the function `transform_data` applies log-transformation to the counts, after library size normalization performed on each sample separately. Similar to [Seurat](https://satijalab.org/seurat/reference/normalizedata), it applies a scaling factor (`scale_f=10000` by default).

```{r norm_chunk, eval=net_check}
tnbc <- transform_data(tnbc, method='log', cores=2)
```

Users also can apply variance-stabilizing transformation (SCT; @hafemeister_2019), which is another method increasingly used in single-cell and spatial transcriptomics studies. See [here](https://fridleylab.github.io/spatialGE/reference/transform_data.html) for details.

## Visualization of gene expression from spatially-resolved transcriptomics data

After data transformation, expression of specific genes can be visualized using "quilt" plots. The function `STplot` shows the transformed expression of one or several genes. We have adopted the color palettes from the packages `khroma` and `RColorBrewer`. The name of a color palette can be passed using the argument `color_pal`. The default behavior of the function produces plots for all samples within the STlist, but we can pass specific samples to be plotted using the argument `samples`.

Let's produce a quilt plot for the genes *NDRG1* and *IGKC* (hypoxia and B-cell markers, respectively), for sample number 1 of patient 14 (`samples=sample_094c`).

```{r genequilt_chunk, eval=net_check}
quilts1 <- STplot(tnbc, 
                  genes=c('NDRG1', 'IGKC'), 
                  samples='sample_094c', 
                  color_pal='BuRd', 
                  ptsize=0.8)
```

Because `spatialGE` functions output lists of ggplot objects, we can plot the results side-by-side using functions such as `ggarrange()`:

```{r genequilt_chunk2, fig.align='center', eval=net_check}
ggpubr::ggarrange(plotlist=quilts1, nrow=1, ncol=2, legend='bottom')
```

We can see that gene expression patterns of both genes are non-overlapping: *IGKC* is expressed in the upper right portion of the tissue, whereas *NDRG1* is expressed in the right bottom portion (although relatively lower compared to *IGKC* as indicated by the white-colored spots). The location of gene expression of those two genes may be indicative of an immune-infiltrated area and a tumor area. With the help of spatial interpolation, visualization of these regions can be easier as will be showed next.

### Spatial interpolation of gene expression

We can predict a smooth gene expression surface for each sample. In `spatialGE`, this prediction is achieved by using a spatial interpolation method very popular in spatial statistics. The method known as 'kriging' allows the estimation of gene expression values in the un-sampled areas between spots, or cells/spots that were filtered during data quality control. Estimating a transcriptomic surface via kriging assumes that gene expression of two given points is correlated to the spatial distance between them.

The function `gene_interpolation` performs kriging of gene expression via the `fields` package. We specify that kriging will be performed for two of the spatial samples (`samples=c('sample_094c', 'sample_117e')`):

```{r genekrige_chunk, eval=net_check}
tnbc <- gene_interpolation(tnbc, 
                           genes=c('NDRG1', 'IGKC'),
                           samples=c('sample_094c', 'sample_117e'), cores=2)
```

Generating gene expression surfaces can be time consuming. The finer resolution to which the surface is to be predicted (`ngrid` argument), the longer the time it takes. The execution time also depends on the number of spots/cells. The surfaces can be visualized using the `STplot_interpolation()` function:

```{r plotkrige_chunk1, fig.align='center', eval=net_check}
kriges1 <- STplot_interpolation(tnbc,
                                genes=c('NDRG1', 'IGKC'),
                                samples='sample_094c')
#ggpubr::ggarrange(plotlist=kriges1, nrow=1, ncol=2, common.legend=TRUE, legend='bottom')
```

By looking at the transcriptomic surfaces of the two genes, it is easier to detect where "pockets" of high and low expression are located within the tissue. It is now more evident that expression of *NDRG1* is higher in the lower right region of the tumor slice, as well in a smaller area to the left (potentially another hypoxic region).

If tissue images were uploaded to the STlist (`tissue_hires_image.png.gz`), it can be displayed for comparison using the `plot_image` function:

```{r plot_tissue, eval=net_check}
ip = plot_image(tnbc, samples='sample_094c')

ip[['image_sample_094c']]
```

## Unsupervised spatially-informed clustering (*STclust*)

Detecting tissue compartments or niches is an important part of the study of the tissue architecture. We can do this by applying ***`STclust`***, a spatially-informed clustering method implemented in `spatialGE`. The `STclust` method uses weighted average matrices to capture the transcriptomic differences among the cells/spots. As a first step in `STclust`, top variable genes are identified via Seurat's `FindVariableFeatures`, and transcriptomic scaled distances are calculated using only those genes. Next, scaled euclidean distances are computed from the spatial coordinates of the spots/cells. The user defines a weight (`ws`) from 0 to 1, to apply to the physical distances. The higher the weight, the less biologically meaningful the clustering solution is, given that the clusters would only reflect the physical distances between the spots/cells and less information on the transcriptomic profiles will be used. After many tests, we have found that weights between 0.025 - 0.25 are enough to capture the tissue heterogeneity. By default, `STclust` uses dynamic tree cuts [@langfelder_2008] to define the number of clusters. But users can also test a series of k values (`ks`). For a more detailed description of the method, please refer to the paper describing `spatialGE` and `STclust` [@ospina_2022].

We'll try several weights to see it's effect on the cluster assignments:

```{r clustespots_chunk, warning=F, message=F, eval=net_check}
tnbc <- STclust(tnbc, 
                ks='dtc', 
                ws=c(0, 0.025, 0.05, 0.2), cores=2)
```

Results of clustering can be plotted via the `STplot` function:

```{r plotclustspots_chunk, fig.align='center', eval=net_check}
cluster_p <- STplot(tnbc, 
                    samples='sample_094c', 
                    ws=c(0, 0.025, 0.05, 0.2),
                    color_pal='highcontrast')
ggpubr::ggarrange(plotlist=cluster_p, nrow=2, ncol=2, legend='right')
```

We can see that from `w=0` and `w=0.05`, we can only detect two tissue niches. At `w=0.025`, we gain higher resolution as one of the clusters is split and a third ('yellow') cluster appears, potentially indicating that an different transcriptional profile is present there. At `w=0.2`, the clusters seem too compact, indicating that the weight of spatial information is probably too high. We have used here the dynamic tree cuts (`dtc`) to automatically select the number of clusters, resulting in very coarse resolution tissue niches, however, users can define their own range of k to be evaluated, allowing further detection of tissue compartments.

## Association between spatial heterogeneity and sample-level variables

To explore the relationship between a clinical (sample-level) variable of interest and the level of gene expression spatial uniformity within a sample, we can use the `SThet()` function:

```{r sthet_chunk, eval=net_check}
tnbc <- SThet(tnbc, 
              genes=c('NDRG1', 'IGKC'),
              method='moran', cores=2)
```

The `SThet` function calculates the Moran's I statsitic (or Geary's C) to measure the level of spatial heterogeneity in the expression of the genes ( *NDRG1*, *IGKC*). The estimates can be compared across samples using the function `compare_SThet()`

```{r clinplot_chunk, fig.align='center', eval=net_check}
p <- compare_SThet(tnbc, 
                   samplemeta='overall_survival_days', 
                   color_by='patient_id',
                   gene=c('NDRG1', 'IGKC'), 
                   color_pal='muted',
                   ptsize=3)

p
```

The calculation of spatial statistics with `SThet` and and multi-sample comparison with `compare_SThet` provides and easy way to identify samples and genes exhibiting spatial patterns. The previous figure shows that expression of *NDRG1* is more spatially uniform (lower Moran's I) across the tissues in samples from patients 2 and 8 compared to patents 9 and 14. The samples with higher spatial uniformity in the expression of *NDRG1* also tended to have higher overall survival. Trends are less clear for *IGKC*, however, it looks like samples where expression of *IGKC* was spatially aggregated in "pockets" (higher Moran's I) tended to have lower survival (but note that patient 9 does not follow this trend). As studies using spatial transcriptomics become larger, more samples will provide more insightful patterns into the association of gene expression spatial distribution and non-spatial traits associated with the tissues.

The computed statistics are stored in the STlist for additional analysis/plotting that the user may want to complete. The statistics value can be accessed as a data frame using the `get_gene_meta` function:

```{r stats_slot, eval=net_check}
get_gene_meta(tnbc, sthet_only=TRUE)
```


<details>

<summary>**How can the statistics generated by `SThet` can be interpreted? **</summary>

See the table below for a simplistic interpretation of the spatial autocorrelation statistics calculated in `spatialGE`:

```{r sphet_table, echo=F, eval=net_check}
library('magrittr') # Use of pipe operator
sphet_info <- tibble::tibble(
  "Statistic"=c('Moran’s I', 'Geary’s C'),
  "Clustered expression"=c('Closer to 1', 'Closer to 0'),
  "No expression pattern"=c('Closer to 0', 'Closer to 1'),
  "Uniform expression"=c('Closer to -1', 'Closer to 2')
  )

kbl_t = kableExtra::kbl(sphet_info, align='c', centering=TRUE) %>%
  kableExtra::kable_styling(position="center", full_width=FALSE)
kableExtra::add_footnote(kbl_t, 
                         c("Note: The boundaries indicated for each statistic are reached when the number of spots/cells is very large."),
                         notation="none")
```

To better understand how the Moran's I and Geary's C statistics quantify spatial heterogeneity, tissue can be simulated using the `scSpatialSIM` and `spatstat` packages. Also `tidyverse` and `janitor` for some data manipulation.

```{r load_simpkg, message=F, eval=net_check}
library('rpart')
library('spatstat')
library('scSpatialSIM')
library('tidyverse')
library('janitor')
```

The `sc.SpatialSIM` package uses spatial point processes to simulate the locations of spots/cells within a tissue. To facilitate interpretability of the Moran's I and Geary's C statistics, first tissue will be simulated, and then gene expression values will be simulated. 

The first step is to create a `spatstat` observation window:

```{r tissue_sim_1, warning=F, message=F, eval=net_check}
wdw <- owin(xrange=c(0, 3), yrange=c(0, 3))
sim_visium <- CreateSimulationObject(sims=1, cell_types=1, window=wdw)
```

Next, the `sc.SpatialSIM` is used to generate the spatial point process (positions of the spots) within the observation window. Then, assignments of spots to tissue domains are simulated and visualized:

```{r tissue_sim_2, warning=F, fig.height=4, fig.width=6, eval=net_check}
# Generate point process
# Then, simulate tissue compartments
set.seed(12345)
sim_visium <- GenerateSpatialPattern(sim_visium, gridded=TRUE, overwrite=TRUE) %>% 
  GenerateTissue(k=1, xmin=1, xmax=2, ymin=1, ymax=2, sdmin=1, sdmax=2, overwrite=TRUE)

PlotSimulation(sim_visium, which=1, what="tissue points") + 
  scale_shape_manual(values=c(19, 1))
```

Now, the simulated tissue domain assignments are extracted from the `SpatSimObj` object. Gene counts will be simulated in such a way that:

- Expression of "gene_1" is exclusive to "Tissue_1". If a spot was assigned to "Tissue 1", expression of "gene_1" is 1 (high expression). If assigned to "Tissue 2", expression of "gene_1" is 0.1 (low expression). Given "Tissue 1" spots are aggregated towards the center of the tissue, it is expected that Moran's I > 0 and Geary's C < 1 for "gene_1".  
- Spots with high expression of "gene_2" (expression = 1) are equally separated from spots with low expression (expression = 0.1). This pattern is highly unlikely in a biological tissue. It is expected this pattern yields a Moran's I < 0 and a Geary's > 1.
- Expression of "gene_3" results from randomly assigning expression of "gene_2" across the entire tissue. This pattern should result in Moran's I closer to 0 and Geary's C closer to 1.

```{r tissue_sim_3, warning=F, eval=net_check}
# Extract tissue assignments from the `SpatSimObj` object
# Simulate expression of 'gene_1'
sim_visium_df <- sim_visium@`Spatial Files`[[1]] %>%
  clean_names() %>% 
  mutate(gene_1=case_when(tissue_assignment == 'Tissue 1' ~ 1, TRUE ~ 0.1))

# Generate expression patter of "gene_2"
for(i in 1:nrow(sim_visium_df)){
  if(i%%2 == 0){
    sim_visium_df[i, 'gene_2'] = 1
  } else{
    sim_visium_df[i, 'gene_2'] = 0.1
  }
}

# Generate expression of "gene_3"
# Set seed for resproducibility
set.seed(12345)
sim_visium_df[['gene_3']] <- sample(sim_visium_df[['gene_2']])
```

To visualize the simulated expression and run `SThet`, an STlist is created:

```{r tissue_sim_4, warning=F, fig.height=4, eval=net_check}
# Extract simulated expression data
sim_expr <- sim_visium_df %>% 
  add_column(libname=paste0('spot', seq(1:nrow(.)))) %>%
  select(c('libname', 'gene_1', 'gene_2', 'gene_3')) %>%
  column_to_rownames('libname') %>% t() %>% 
  as.data.frame() %>% rownames_to_column('genename')

# Extract simulated spot locations
sim_xy <- sim_visium_df %>% 
  add_column(libname=paste0('spot', seq(1:nrow(.)))) %>%
  select(c('libname', 'y', 'x'))

# The `STlist` function can take a list of data frames
simulated <- STlist(rnacounts=list(sim_visium=sim_expr), 
                    spotcoords=list(sim_visium=sim_xy), cores=2)

# Plot expression
ps <- STplot(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), data_type='raw', color_pal='sunset', ptsize=1)
ggpubr::ggarrange(plotlist=ps, ncol=3)
```

As expected, simulated expression of "gene_1" is aggregated (i.e., clustered, red spots concentrated towards the center of the tissue). This result was intently obtained by setting the simulation parameters to  `xmin=2`, `xmax=3`, `ymin=2`, `ymax=3` in `sc.SpatialSIM`.

```{r tissue_sim_5, warning=F, message=F, eval=net_check}
# The `SThet` function requires normalized data
simulated <- transform_data(simulated, cores=2)

# Run `SThet`
simulated <- SThet(simulated, genes=c('gene_1', 'gene_2', 'gene_3'), method=c('moran', 'geary'))

# Extract results
get_gene_meta(simulated, sthet_only=TRUE)
```

The results for each of the metrics are as expected: "gene_1" shows aggregation/clustering, indicative of "hot-spot" expression. "gene_2" and "gene_3" show uniform and random expression respectively. Notice that none of these values are very far from the "random" expectation (Moran's I = 0 and Geary's C = 1). One of reasons for this result is the effect of library-size normalization. In addition, obtaining extreme values of I and C require extreme spatial patterns, which are unlikely to be observed even in the way data has been simulated here.

</details>


## References

<div id="refs"></div>


<details>

<summary>**Session Info**</summary>

```{r}
sessionInfo()
```

</details>