---
title: 'Supervised Learning-based Receptor Abundance Estimation using STREAK: An Application to the 10X Genomics human extranodal marginal zone B-cell tumor/mucosa-associated lymphoid tissue (MALT) dataset'
author: "Azka Javaid and H. Robert Frost"
output:
  pdf_document: default
  html_document:
    df_print: paged
#output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{STREAKvignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Load the STREAK package 

STREAK is a supervised receptor abundance estimation method that depends on functionalities from the Seurat [@seuratCell2021; @seuratCell2019; @seuratNature2018; @seuratNature2015], SPECK [@SPECK-R], VAM [@VAM-R] and Ckmeans.1d.dp [@Wang2011Ckmeans; @song2020wuc] packages.

```{r message=FALSE, warning=FALSE}
library(STREAK)
```
## Receptor gene set construction using a subset of joint scRNA-seq/CITE-seq training data

STREAK performs receptor abundance estimation by leveraging expression associations learned from joint scRNA-seq/CITE-seq training data. These associations can either be manually specified using pre-existing ground truth or can be built using a subset of joint transcriptomics and proteomics data. Below, we use a subset of 1000 cells from the 10X Genomics human extranodal marginal zone B-cell tumor/mucosa-associated lymphoid tissue (MALT) scRNA-seq/CITE-seq joint dataset to build a gene set weights membership matrix for the CD3, CD4, CD8a, CD14 and CD15 receptors. Given a $m \times n$ training scRNA-seq counts matrix and a $m \times h$ CITE-seq matrix, the `receptorGeneSetConstruction()` function is utilized to learn associations between each CITE-seq ADT transcript and all scRNA-seq transcripts. The resulting gene weights membership matrix is $n \times h$. 

```{r message=FALSE, warning=FALSE}
data("train.malt.rna.mat")
data("train.malt.adt.mat")
receptor.geneset.matrix.out <- receptorGeneSetConstruction(train.rnaseq = 
                                                  train.malt.rna.mat, 
                                                train.citeseq = 
                                                  train.malt.adt.mat[,1:5], 
                                                rank.range.end = 100, 
                                                min.consec.diff = 0.01, 
                                                rep.consec.diff = 2,
                                                manual.rank = NULL, 
                                                seed.rsvd = 1)
dim(receptor.geneset.matrix.out)
head(receptor.geneset.matrix.out)
```

## Receptor abundance estimation for target scRNA-seq data 

Following the development of weighted gene sets, the `receptorAbundanceEstimation()` function is used to perform receptor abundance estimation. A subset of 1100 cells from the 10X Genomics MALT scRNA-seq data is used for estimation. Given a $m \times n$ target scRNA-seq counts matrix and a $n \times h$ gene set weights membership matrix, target scRNA-seq expression from top most weighted genes with each ADT transcript is used for gene set scoring and subsequent thresholding. The resulting estimated receptor abundance matrix is $m \times h$. 

```{r message=FALSE, warning=FALSE}
data("target.malt.rna.mat")
receptor.abundance.estimates.out <- 
  receptorAbundanceEstimation(target.rnaseq = target.malt.rna.mat,
                              receptor.geneset.matrix = 
                                receptor.geneset.matrix.out,
                              num.genes = 10, rank.range.end = 100, 
                              min.consec.diff = 0.01, rep.consec.diff = 2,
                              manual.rank = NULL, seed.rsvd = 1, 
                              max.num.clusters = 4, seed.ckmeans = 2)
dim(receptor.abundance.estimates.out)
head(receptor.abundance.estimates.out)
```


## References