Cancer cells accumulate DNA mutations as result of DNA damage and DNA repair processes. mutSignatures is a computational framework aimed at deciphering DNA mutational signatures operating in cancer.

Tutorials and How-tos

More vignettes discussing how to extract and analyze mutational signatures from different sources using mutSignatures can be found at the following URLs:

Installation

The package is available from CRAN. The official version of mutSignatures (version 2.1.1) can be installed from any CRAN mirror.

install.packages("mutSignatures")

The latest (most recently updated, beta) version of the package is available on GitHub. It is possible to install mutSignatures using devtools.

devtools::install_github("dami82/mutSignatures", force = TRUE, 
                         build_opts = NULL, build_vignettes = TRUE)

Note on software Dependencies

One of the pre-processing steps in the mutSignatures pipeline relies on a list of Bioconductor (https://www.bioconductor.org/) libraries. These libraries are optional, since they are only required for the attachContext() step (these libs are not automatically downloaded when installing the package). The bioconductor libraries are: i) IRanges; ii) GenomicRanges; iii) BSgenome; iv) GenomeInfoDb. You can install them as shown below.

# Get BiocManager
if (!"BiocManager" %in% rownames(utils::installed.packages()))
  install.packages("BiocManager")

# Get Bioc libs
BiocManager::install(pkgs = c("IRanges", "GenomicRanges", "BSgenome", "GenomeInfoDb"))

To complete the attachContext() step, a BSgenome package including full genome sequences of the organism of interest has to be installed as well. For example, human hg19 genome sequences can be installed as shown below.

BiocManager::install(pkgs = "BSgenome.Hsapiens.UCSC.hg19")

Pipeline

The typical mutSignatures analytic pipeline includes three steps:

Alternative pipeline

If the set of mutational signatures is already known, it is possible to use mutSignatures to estimate the contribution of each mutational pattern in a collection of samples. In this case, the pipeline includes the following steps:

  • Data Import and pre-processing: see above.

  • Estimate exposures to known mutational signatures: Use a Fast Combinatorial Nonnegative Least-Square approach to compute exposures when the mutation signatures are pre-defined.

  • Visualization, downstream analysis, data export: see above.

Load library, set-up environment

For running the demos described in this vignette, the following R libraries will be used: dplyr, reshape2, ggplot2, kableExtra, BSgenome.Hsapiens.UCSC.hg19, and mutSignatures. Before proceeding, we assign the hg19 BSgenome object to a variable named hg19, and we load the mutSigData dataset, which is provided together with the mutSignatures package.

# Required libs
library(dplyr)
library(reshape2)
library(kableExtra)
library(ggplot2)
library(gridExtra)
library(BSgenome.Hsapiens.UCSC.hg19)

# Load mutSignatures
library(mutSignatures)

# prep hg19
hg19 <- BSgenome.Hsapiens.UCSC.hg19

# load data
data("mutSigData")

Example 1 - De novo extraction of Mutational Signatures from BLCA samples

The first demo shows how to extract mutational signatures from a dataset including DNA mutations from bladder cancer samples.

Data Import and pre-processing

Here, the input has a VCF-like structure, and is decorated with an extra column (namely, SAMPLEID) that includes a unique identifier for the biological samples.

# Import data (VCF-like format)
x <- mutSigData$inputC

# Filter non SNV
x <- filterSNV(dataSet = x,  seq_colNames = c("REF", "ALT"))

# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XTR1 SAMPLEID
chr15 73624589 . G T 29.084 . . GT:PL 0/1 BLCA.001103810
chr22 35684398 . G A 21.995 . . GT:PL 0/1 BLCA.001103810
chr3 183689484 . G A 27.296 . . GT:PL 0/1 BLCA.001103810
chr16 89261395 . G A 24.398 . . GT:PL 0/1 BLCA.001103810
chr19 44662014 . G C 21.626 . . GT:PL 0/1 BLCA.001103810
chr16 30724634 . G C 38.741 . . GT:PL 0/1 BLCA.001103810
# Attach context
x <- attachContext(mutData = x,
                   chr_colName = "CHROM",
                   start_colName = "POS",
                   end_colName = "POS",
                   nucl_contextN = 3,
                   BSGenomeDb = hg19)
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XTR1 SAMPLEID context
chr15 73624589 . G T 29.084 . . GT:PL 0/1 BLCA.001103810 CGA
chr22 35684398 . G A 21.995 . . GT:PL 0/1 BLCA.001103810 TGA
chr3 183689484 . G A 27.296 . . GT:PL 0/1 BLCA.001103810 TGC
chr16 89261395 . G A 24.398 . . GT:PL 0/1 BLCA.001103810 AGG
chr19 44662014 . G C 21.626 . . GT:PL 0/1 BLCA.001103810 AGA
chr16 30724634 . G C 38.741 . . GT:PL 0/1 BLCA.001103810 GGA
# Remove mismatches
x <- removeMismatchMut(mutData = x,                  # input data.frame
                       refMut_colName = "REF",       # column name for ref base
                       context_colName = "context",  # column name for context
                       refMut_format = "N")    
# Compute mutType
x <- attachMutType(mutData = x,                      # as above
                   ref_colName = "REF",              # column name for ref base
                   var_colName = "ALT",              # column name for mut base
                   context_colName = "context") 
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT XTR1 SAMPLEID context mutType
chr15 73624589 . G T 29.084 . . GT:PL 0/1 BLCA.001103810 CGA T[C>A]G
chr22 35684398 . G A 21.995 . . GT:PL 0/1 BLCA.001103810 TGA T[C>T]A
chr3 183689484 . G A 27.296 . . GT:PL 0/1 BLCA.001103810 TGC G[C>T]A
chr16 89261395 . G A 24.398 . . GT:PL 0/1 BLCA.001103810 AGG C[C>T]T
chr19 44662014 . G C 21.626 . . GT:PL 0/1 BLCA.001103810 AGA T[C>G]T
chr16 30724634 . G C 38.741 . . GT:PL 0/1 BLCA.001103810 GGA T[C>G]C
# Count
blca.counts <- countMutTypes(mutTable = x,
                             mutType_colName = "mutType",
                             sample_colName = "SAMPLEID")
# Mutation Counts
print(blca.counts)
##  Mutation Counts object - mutSignatures
## 
##  Total num of MutTypes: 96
##  MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
## 
##  Total num of Samples: 50
##  Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...

Signature Extraction

After multi-nucleotide mutation count data have been prepared, it is possible to proceed with the de-novo signature extraction. Settings guiding the analysis are defined as a list of parameters that can be tuned via the setMutClusterParams() function. Next, the NMF analysis is executed by the decipherMutationalProcesses() function. At the end of the analysis, a silhouette plot is returned. This plot can be very helpful to understand if the obtained signatures are relatively weak or robust.

# how many signatures should we extract? 
num.sign <- 4

# Define parameters for the non-negative matrix factorization procedure.
# you should parallelize if possible
blca.params <- 
  mutSignatures::setMutClusterParams( 
    num_processesToExtract = num.sign,    # num signatures to extract
    num_totIterations = 20,               # bootstrapping: usually 500-1000
    num_parallelCores = 4)                # total num of cores to use (parallelization)
# Extract new signatures - may take a while
blca.analysis <- 
  decipherMutationalProcesses(input = blca.counts,
                              params = blca.params)

Downstream analyses and visualization

Profiles of de-novo extracted mutational signatures can be visualized (by barplots). Also it is possible to plot the estimated exposures of each sample to the identified mutational patterns. It is also possible to compare de-novo extracted signatures with known signatures (such as signatures from previous analyses, or COSMIC signatures). To cast mutSignature objects as data.frames, you can use the mutSignatures::as.data.frame() method.

Note: you can use the mutSignatures::getCosmicSignatures() function to download a copy of the COSMIC signatures.

# Retrieve signatures (results)
blca.sig <- blca.analysis$Results$signatures

# Retrieve exposures (results)
blca.exp <- blca.analysis$Results$exposures

# Plot signature 1 (standard barplot, you can pass extra args such as ylim)
msigPlot(blca.sig, signature = 1, ylim = c(0, 0.10))

# Plot exposures (ggplot2 object, you can customize as any other ggplot2 object)
msigPlot(blca.exp, main = "BLCA samples") + 
  scale_fill_manual(values = c("#1f78b4", "#cab2d6", "#ff7f00", "#a6cee3"))

# Export Signatures as data.frame
xprt <- coerceObj(x = blca.sig, to = "data.frame") 
head(xprt) %>% kable() %>% kable_styling(bootstrap_options = "striped")
Sign.01 Sign.02 Sign.03 Sign.04
A[C>A]A 0.0022120 0.0040737 0.0057090 0.0062098
A[C>A]C 0.0007129 0.0088800 0.0016303 0.0145214
A[C>A]G 0.0002019 0.0039400 0.0007855 0.0120000
A[C>A]T 0.0005315 0.0022517 0.0110999 0.0075955
A[C>G]A 0.0009370 0.0048720 0.0024274 0.0098213
A[C>G]C 0.0013582 0.0001527 0.0054321 0.0046073
# Get signatures from data (imported as data.frame) 
# and then convert it to mutSignatures object
cosmixSigs <- mutSigData$blcaSIGS %>% 
  dplyr::select(starts_with("COSMIC")) %>% 
  as.mutation.signatures()

blcaKnwnSigs <- mutSigData$blcaSIGS %>% 
  dplyr::select(starts_with("BLCA")) %>% 
  as.mutation.signatures()

# Compare de-novo signatures with selected COSMIC signatures
msig1 <- matchSignatures(mutSign = blca.sig, reference = cosmixSigs, 
                         threshold = 0.45, plot = TRUE) 
msig2 <- matchSignatures(mutSign = blca.sig, reference = blcaKnwnSigs, 
                         threshold = 0.45, plot = TRUE)
# Visualize match
# signature 1 is similar to COSMIC ; 
# signatures 2 and 3 are similar to COSMIC
# Here, we should probably extract only 2 mutational signatures
hm1 <- msig1$plot + ggtitle("Match to COSMIC signs.")
hm2 <- msig2$plot + ggtitle("Match to known BLCA signs.")

# Show
grid.arrange(hm1, hm2, ncol = 2)

Example 2 - Estimate the contribution of a panel of known mutational signatures

The second demo illustrates a case where the signatures are known. We will use the COSMIC signatures 1, 2, 5, and 13 in the analysis, as well as a set of custom signatures previously identified in bladder cancer tumor samples. We will estimate their contributions to a collection of blca mutations that are included in the mutSigData dataset.

Data Import and pre-processing

Data import and pre-processing can be conducted as shown above. Note that if a data.frame with counts of mutation types across samples is available, it is sufficient to use the as.mutation.counts() method to convert it to the desired object.

# Retrieve a mutation.counts data.frame
x <- mutSigData$blcaMUTS

# Visualize header
x[1:10, 1:5] %>% kable() %>% kable_styling(bootstrap_options = "striped")
BLCA.001103810 BLCA.001289938 BLCA.001362782 BLCA.001427035 BLCA.001575414
A[C>A]A 1 0 0 0 0
A[C>A]C 0 0 0 0 0
A[C>A]G 0 0 0 1 0
A[C>A]T 1 0 0 1 0
A[C>G]A 1 0 0 2 0
A[C>G]C 0 0 0 0 0
A[C>G]G 1 0 0 0 0
A[C>G]T 0 0 2 0 0
A[C>T]A 1 0 1 3 0
A[C>T]C 2 1 1 1 0
# Convert it
xx <- as.mutation.counts(x)
# Print to screen
print(xx)
##  Mutation Counts object - mutSignatures
## 
##  Total num of MutTypes: 96
##  MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
## 
##  Total num of Samples: 50
##  Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...

Estimate Exposures to known Mutational Signatures

After preparing the mutational signatures as a mutationSignatures-class object, it is sufficient to run the resolveMutSignatures() function. It is not recommended to use a large number of signatures when running this analysis (for example, refrain from using all COSMIC signatures). Likewise, you don’t want to run this step using a very low number of signatures. The algorithm is way more accurate if ONLY AND ALL the relevant signatures are used (i.e., the signatures we are reasonably expecting to be operative in the tumor samples being analyzed). This analysis is typically very fast!

# Obtain 4 COSMIC signatures
cosmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("COSMIC"))
cosmx <- as.mutation.signatures(cosmx)

# Obtain 4 BLCA signatures
blcmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("BLCA"))
blcmx <- as.mutation.signatures(blcmx)
# Visualize cosmx
print(cosmx)
##  Mutation Signatures object - mutSignatures
## 
##  Total num of Signatures: 4
##  Total num of MutTypes: 96
## 
##     Sign.1   Sign.2   Sign.3   Sign.4
##     ------   ------   ------   ------
##   + 0.0111   0.0007   0.0149   0.0003  +  A[C>A]A
##   + 0.0091   0.0006   0.0090   0.0006  +  A[C>A]C
##   + 0.0015   0.0001   0.0022   0.0000  +  A[C>A]G
##   + 0.0062   0.0003   0.0092   0.0008  +  A[C>A]T
##   + 0.0018   0.0003   0.0117   0.0038  +  A[C>G]A
##   + 0.0026   0.0003   0.0073   0.0009  +  A[C>G]C
##   + 0.0006   0.0002   0.0023   0.0000  +  A[C>G]G
##   + 0.0030   0.0006   0.0117   0.0039  +  A[C>G]T
##   + 0.0295   0.0074   0.0218   0.0015  +  A[C>T]A
##   + 0.0143   0.0027   0.0128   0.0000  +  A[C>T]C
##     ......   ......   ......   ......
# Visualize cosmx
print(blcmx)
##  Mutation Signatures object - mutSignatures
## 
##  Total num of Signatures: 4
##  Total num of MutTypes: 96
## 
##     Sign.1   Sign.2   Sign.3   Sign.4
##     ------   ------   ------   ------
##   + 0.0019   0.0020   0.0072   0.0059  +  A[C>A]A
##   + 0.0003   0.0073   0.0111   0.0105  +  A[C>A]C
##   + 0.0001   0.0064   0.0049   0.0069  +  A[C>A]G
##   + 0.0006   0.0038   0.0071   0.0061  +  A[C>A]T
##   + 0.0007   0.0028   0.0055   0.0118  +  A[C>G]A
##   + 0.0017   0.0002   0.0021   0.0060  +  A[C>G]C
##   + 0.0017   0.0028   0.0062   0.0029  +  A[C>G]G
##   + 0.0024   0.0031   0.0015   0.0089  +  A[C>G]T
##   + 0.0019   0.0082   0.0135   0.0155  +  A[C>T]A
##   + 0.0028   0.0119   0.0171   0.0100  +  A[C>T]C
##     ......   ......   ......   ......
# Run analysis
blca.expo1 <- resolveMutSignatures(mutCountData = xx, 
                                   signFreqData = cosmx)

blca.expo2 <- resolveMutSignatures(mutCountData = xx, 
                                   signFreqData = blcmx)

Downstream analysis

As discussed above, exposures to mutational signatures can be plotted (barplots), and results can be exported using the mutSignatures::data.frame method.

# Retrieve exposures (results)
blca.exp.1x <- blca.expo1$results$count.result
blca.exp.2x <- blca.expo2$results$count.result

# Plot exposures
bp1 <- msigPlot(blca.exp.1x, main = "BLCA | COSMIC sigs.") + 
  scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))

bp2 <- msigPlot(blca.exp.2x, main = "BLCA | pre-blca sigs.") + 
  scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))

# Visualize
grid.arrange(bp1, bp2, ncol = 2)

# Compare sigs

# Export Exposures as data.frame
xprt <- as.data.frame(blca.exp.1x, transpose = TRUE)
head(xprt) %>% round() %>% kable() %>% 
  kable_styling(bootstrap_options = "striped")
COSMIC.1 COSMIC.2 COSMIC.5 COSMIC.13
BLCA.001103810 32 29 23 30
BLCA.001289938 42 24 0 32
BLCA.001362782 50 119 24 228
BLCA.001427035 79 35 14 74
BLCA.001575414 49 49 0 40
BLCA.001698493 34 6 53 0

More info and examples

More info and other examples can be found at the following URLs:

SessionInfo

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mutSignatures_2.1.1 foreach_1.5.0       gridExtra_2.3      
## [4] ggplot2_3.3.2       kableExtra_1.1.0    reshape2_1.4.4     
## [7] dplyr_1.0.0        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5        pracma_2.2.9      highr_0.8         pillar_1.4.6     
##  [5] compiler_4.0.2    plyr_1.8.6        iterators_1.0.12  tools_4.0.2      
##  [9] digest_0.6.25     gtable_0.3.0      evaluate_0.14     lifecycle_0.2.0  
## [13] tibble_3.0.3      viridisLite_0.3.0 pkgconfig_2.0.3   rlang_0.4.7      
## [17] rstudioapi_0.11   parallel_4.0.2    yaml_2.2.1        xfun_0.16        
## [21] cluster_2.1.0     withr_2.2.0       stringr_1.4.0     httr_1.4.2       
## [25] knitr_1.29        xml2_1.3.2        generics_0.0.2    vctrs_0.3.2      
## [29] hms_0.5.3         webshot_0.5.2     grid_4.0.2        tidyselect_1.1.0 
## [33] glue_1.4.1        R6_2.4.1          rmarkdown_2.3     farver_2.0.3     
## [37] purrr_0.3.4       readr_1.3.1       magrittr_1.5      codetools_0.2-16 
## [41] scales_1.1.1      ellipsis_0.3.1    htmltools_0.5.0   rvest_0.3.6      
## [45] colorspace_1.4-1  labeling_0.3      proxy_0.4-24      stringi_1.4.6    
## [49] doParallel_1.0.15 munsell_0.5.0     crayon_1.3.4

Thanks for your interest in our software. If you run into any issue, bug, or you have a question about mutSignatures, feel free to email . At this time, mutSignatures is mainly supported by DF in his free time (so, please, be patient upon your inquires). C-2020 (November-01, 2020).