Analysing and visualising pathway enrichment in multi-omics data using ActivePathways

Mykhaylo Slobodyanyuk, Jonathan Barenboim and Juri Reimand

2023-10-31

Multi-omics pathway enrichment analysis

Introduction

ActivePathways is a tool for multivariate pathway enrichment analysis that identifies gene sets, such as pathways or Gene Ontology terms, that are over-represented in a list or matrix of genes. ActivePathways uses a data fusion method to combine multiple omics datasets, prioritizes genes based on the significance and direction of signals from the omics datasets, and performs pathway enrichment analysis of these prioritized genes. We can find pathways and genes supported by single or multiple omics datasets, as well as additional genes and pathways that are only apparent through data integration and remain undetected in any single dataset alone.

The new version of ActivePathways is described in our recent preprint.

Mykhaylo Slobodyanyuk^, Alexander T. Bahcheli^, Zoe P. Klein, Masroor Bayati, Lisa J. Strug, Jüri Reimand. Directional integration and pathway enrichment analysis for multi-omics data. bioRxiv (2023-09-24) (^ - co-first authors) https://www.biorxiv.org/content/10.1101/2023.09.23.559116v1

The first version of ActivePathways was published in Nature Communications with the PCAWG Pan-Cancer project.

Marta Paczkowska^, Jonathan Barenboim^, Nardnisa Sintupisut, Natalie S. Fox, Helen Zhu, Diala Abd-Rabbo, Miles W. Mee, Paul C. Boutros, PCAWG Drivers and Functional Interpretation Working Group, PCAWG Consortium, Juri Reimand. Integrative pathway enrichment analysis of multivariate omics data. Nature Communications 11 735 (2020) (^ - co-first authors) https://www.nature.com/articles/s41467-019-13983-9 https://pubmed.ncbi.nlm.nih.gov/32024846/

Pathway enrichment analysis using the ranked hypergeometric test

From a matrix of p-values, ActivePathways creates a ranked gene list where genes are prioritised based on their combined significance. The combined significance of each gene is determined by performing statistical data fusion on a series of omics datasets provided in the input matrix. The ranked gene list includes the most significant genes first. ActivePathways then performs a ranked hypergeometric test to determine if a pathway (i.e., a gene set with a common functional annotation) is enriched in the ranked gene list, by performing a series of hypergeometric tests (also known as Fisher’s exact tests). In each such test, a larger set of genes from the top of the ranked gene list is considered. At the end of the series, the ranked hypergeometric test returns the top, most significant p-value from the series, corresponding to the point in the ranked gene list where the pathway enrichment reached the greatest significance of enrichment. This approach is useful when the genes in our ranked gene list have varying signals of biological importance in the input omics datasets, as the test identifies the top subset of genes that are the most relevant to the enrichment of the pathway.

Using the package

A basic example of using ActivePathways is shown below.

We will analyse cancer driver gene predictions for a collection of cancer genomes. Each dataset (i.e., column in the matrix) contains a statistical significance score (P-value) where genes with small P-values are considered stronger candidates of cancer drivers based on the distribution of mutations in the genes. For each gene (i.e., row in the matrix), we have several predictions representing genomic elements of the gene, such as coding sequences (CDS), untranslated regions (UTR), and core promoters (promCore).

To analyse these driver genes using existing knowledge of gene function, we will use gene sets corresponding to known molecular pathways from the Reactome database. These gene sets are commonly distributed in text files in the GMT format (Gene Matrix Transposed) file.

Let’s start by reading the data from the files embedded in the R package. For the p-value matrix, ActivePathways expects an object of the matrix class so the table has to be cast to the correct class after reading the file.

scores <- read.table(
system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways'), 
header = TRUE, sep = '\t', row.names = 'Gene')
scores <- as.matrix(scores)
scores
##                      X3UTR        X5UTR          CDS     promCore
## A2M                     NA 3.339676e-01 9.051708e-01 4.499201e-01
## AAAS                    NA 4.250601e-01 7.047723e-01 7.257641e-01
## ABAT          0.9664125635 4.202735e-02 7.600985e-01 1.903789e-01
## ABCC1         0.9383431571 4.599443e-01 2.599319e-01 2.980455e-01
## ABCC5         0.8021028187 4.478902e-01 4.788846e-01 8.447995e-01
## ABCC8                   NA 4.699356e-01 6.890250e-01 9.226617e-01
## ABHD6         0.4019418029 4.488881e-01 7.587176e-01 3.062514e-01
## ABI1          0.2894847305 1.977600e-01 4.815032e-01 3.486056e-01
##  [ reached getOption("max.print") -- omitted 2406 rows ]

ActivePathways does not allow missing (NA) values in the matrix of P-values and these need to be removed. One conservative option is to re-assign all missing values as ones, indicating our confidence that the missing values are not indicative of cancer drivers. Alternatively, one may consider removing genes with NA values.

scores[is.na(scores)] <- 1

Basic use

The basic use of ActivePathways requires only two input parameters, the matrix of P-values with genes in rows and datasets in columns, as prepared above, and the path to the GMT file in the file system. Importantly, the gene IDs (symbols, accession numbers, etc) in the P-value matrix need to match those in the GMT file.

Here we use a GMT file provided with the package. This GMT file is heavily filtered and outdated; thus users must provide their own GMT file when using the package. These GMT files can be acquired from multiple sources such as Gene Ontology, Reactome and others. For better accuracy and statistical power these pathway databases should be combined. Acquiring an up-to-date GMT file is essential to avoid using unreliable outdated annotations (see this paper).

library(ActivePathways)
gmt_file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
ActivePathways(scores, gmt_file)
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   4.625918e-04       555
##  3:  REAC:177929             Signaling by EGFR   6.197504e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 31 rows ]

ActivePathways 2.0 Directional integration of multi-omics data

ActivePathways 2.0 extends our integrative pathway analysis framework significantly. Users can now provide directional assumptions of input omics datasets for more accurate analyses. This allows us to prioritise genes and pathways where certain directional assumptions are met and penalise those where the assumptions are violated.

For example, fold-change in protein expression would be expected to associate positively with mRNA fold-change of the corresponding gene, while negative associations would be unexpected and indicate more-complex situations or potential false positives. We can instruct the pathway analysis to prioritise positively-associated protein/mRNA pairs and penalise negative associations (or vice versa).

Two additional inputs are included in ActivePathways that allow diverse multi-omics analyses. These inputs are optional.

The scores_direction and constraints_vector parameters are provided in the merge_p_values() and ActivePathways() functions to incorporate this directional penalty into the data fusion and pathway enrichment analyses.

The parameter constraints_vector is a vector that allows the user to represent the expected relationship between the input omics datasets. The vector size is n_datasets. Values include +1, -1, and 0.

The parameter scores_direction is a matrix that reflects the directions that the genes/transcripts/protein show in the data. The matrix size is n_genes * n_datasets, that is the same size as the P-value matrix. This is a numeric matrix, but only the signs of the values are accounted for.

Directional data integration at the gene level

Load a dataset of P-values and fold-changes for mRNA and protein levels. This dataset is embedded in the package. Examine a few example genes.

fname_data_matrix <- system.file('extdata', 
        'Differential_expression_rna_protein.tsv',
        package = 'ActivePathways')

pvals_FCs <- read.table(fname_data_matrix, header = TRUE, sep = '\t')

example_genes <- c('ACTN4','PIK3R4','PPIL1','NELFE','LUZP1','ITGB2')
pvals_FCs[pvals_FCs$gene %in% example_genes,]
##        gene     rna_pval rna_log2fc protein_pval protein_log2fc
## 73   PIK3R4 1.266285e-03  1.1557077 2.791135e-03     -0.8344799
## 74    PPIL1 1.276838e-03 -1.1694221 1.199303e-04     -1.1193605
## 606   NELFE 1.447553e-02 -0.9120687 1.615592e-05     -1.2630114
## 4048  LUZP1 3.253382e-05  1.5830796 4.129125e-02      0.5791377
## 4050  ITGB2 4.584450e-05  1.6472117 1.327997e-01      0.4221579
## 4052  ACTN4 5.725503e-05  1.5531533 8.238317e-07      1.4279158

Create a matrix of gene/protein P-values. Where the columns are different omics datasets (mRNA, protein) and the rows are genes. Examine a few genes in the P-value matrix, and convert missing values to P = 1.

pval_matrix <- data.frame(
        row.names = pvals_FCs$gene, 
        rna = pvals_FCs$rna_pval, 
        protein = pvals_FCs$protein_pval)

pval_matrix <- as.matrix(pval_matrix)
pval_matrix[is.na(pval_matrix)] <- 1

pval_matrix[example_genes,]
##                 rna      protein
## ACTN4  5.725503e-05 8.238317e-07
## PIK3R4 1.266285e-03 2.791135e-03
## PPIL1  1.276838e-03 1.199303e-04
## NELFE  1.447553e-02 1.615592e-05
## LUZP1  3.253382e-05 4.129125e-02
## ITGB2  4.584450e-05 1.327997e-01

Create a matrix of gene/protein directions similarly to the P-value matrix (i.e., scores_direction). ActivePathways only uses the signs of the direction values (ie +1 or -1). If directions are missing (NA), we recommend setting the values to zero. Lets examine a few genes in the direction matrix.

dir_matrix <- data.frame(
        row.names = pvals_FCs$gene, 
        rna = pvals_FCs$rna_log2fc, 
        protein = pvals_FCs$protein_log2fc)

dir_matrix <- as.matrix(dir_matrix)
dir_matrix <- sign(dir_matrix)
dir_matrix[is.na(dir_matrix)] <- 0

dir_matrix[example_genes,]
##        rna protein
## ACTN4    1       1
## PIK3R4   1      -1
## PPIL1   -1      -1
## NELFE   -1      -1
## LUZP1    1       1
## ITGB2    1       1

This matrix has to be accompanied by a vector that provides the expected relationship between the different datasets. Here, mRNA levels and protein levels are expected to have consistent directions: either both positive or both negative (eg log fold-change). Alternatively, we can use another vector to prioritise genes/proteins where the directions are the opposite.

constraints_vector <- c(1,1)
constraints_vector
## [1] 1 1
# constraints_vector <- c(1,-1)

Now we merge the P-values of the two datasets using directional assumptions and compare these with the plain non-directional merging. The top 5 scoring genes differ if we penalise genes where this directional logic is violated: While 4 of 5 genes retain significance, the gene PIK3R4 is penalised. Interestingly, as a consequence of penalizing PIK3R4, other genes such as ITGB2 move up in rank.

directional_merged_pvals <- merge_p_values(pval_matrix, 
        method = "DPM", dir_matrix, constraints_vector)

merged_pvals <- merge_p_values(pval_matrix, method = "Brown")

sort(merged_pvals)[1:5]
##        ACTN4        PPIL1        NELFE        LUZP1       PIK3R4 
## 1.168708e-09 2.556067e-06 3.804646e-06 1.950607e-05 4.790125e-05
sort(directional_merged_pvals)[1:5]
##        ACTN4        PPIL1        NELFE        LUZP1        ITGB2 
## 1.168708e-09 2.556067e-06 3.804646e-06 1.950607e-05 7.920157e-05

PIK3R4 is penalised because the fold-changes of its mRNA and protein levels are significant and have the opposite signs:

pvals_FCs[pvals_FCs$gene == "PIK3R4",]
##      gene    rna_pval rna_log2fc protein_pval protein_log2fc
## 73 PIK3R4 0.001266285   1.155708  0.002791135     -0.8344799
pval_matrix["PIK3R4",]
##         rna     protein 
## 0.001266285 0.002791135
dir_matrix["PIK3R4",]
##     rna protein 
##       1      -1
merged_pvals["PIK3R4"]
##       PIK3R4 
## 4.790125e-05
directional_merged_pvals["PIK3R4"]
##    PIK3R4 
## 0.8122527

To assess the impact of the directional penalty on gene merged P-value signals we create a plot showing directional results on the y axis and non-directional results on the x. Blue dots are prioritised hits, red dots are penalised.

lineplot_df <- data.frame(original = -log10(merged_pvals),
              modified = -log10(directional_merged_pvals))

ggplot(lineplot_df) +
    geom_point(size = 2.4, shape = 19,
        aes(original, modified,
            color = ifelse(original <= -log10(0.05),"gray",
                                    ifelse(modified > -log10(0.05),"#1F449C","#F05039")))) +
    labs(title = "",
         x ="Merged -log10(P)",
         y = "Directional Merged -log10(P)") + 
            geom_hline(yintercept = 1.301, linetype = "dashed",
               col = 'black', size = 0.5) +
            geom_vline(xintercept = 1.301, linetype = "dashed",
               col = "black", size = 0.5) + 
            geom_abline(size = 0.5, slope = 1,intercept = 0) +
        scale_color_identity()

Constraints vector intuition

The constraints_vector parameter is provided in the merge_p_values() and ActivePathways() functions to incorporate directional penalty into the data fusion and pathway enrichment analyses. The constraints_vector parameter is a vector of size n_datasets that allows the user to represent the expected relationship between the input omics datasets, and includes the values +1, -1, and 0.

The constraints_vector should reflect the expected relative directional relationship between datasets. For example, the two constraints_vector values shown below are functionally identical.

constraints_vector <- c(-1,1)

constraints_vector <- c(1,-1)

We use different constraints_vector values depending on the type of data we are analyzing and the specific question we are asking. The simplest directional use case and the default for the ActivePathways package is to assume that the datasets have no directional constraints. This is useful when merging datasets where the relative impact of gene perturbations is not clear. For example, gene-level mutational burden and chromatin-immunoprecipitation sequencing (ChIP-seq) experiments can provide gene-level information, however, we cannot know from these datatypes whether gene function is increased or decreased. Therefore, we would set the constraints_vector for gene mutational burden and ChIP-seq datatypes to 0.

constraints_vector <- c(0,0)

When combining datasets that have directional information, we provide the expected relative directions between datasets in the constraints_vector. This is useful when measuring different data types or different conditions.

To investigate the transcriptomic differences following gene knockdown or overexpression, we would provide constraints_vector values that reflect the expected opposing relationship between gene knockdown and gene overexpression.

constraints_vector <- c(1,-1)

The constraints_vector is also useful when merging different data types such as gene and protein expression, promoter methylation, and chromatin accessibility. Importantly, the expected relative direction between each datatype must be closely considered given the experimental conditions. For example, when combining gene expression, protein expression, and promoter methylation data measured in the same biological condition, we would provide a constraints_vector to reflect the expected agreement between gene and protein expression and disagreement with promoter methylation.

constraints_vector <- c(1,1,-1)

The intuition for merging these datasets is that direction of change in gene expression and protein expression tend to associate with each other according to the central dogma of biology while the direction of change of promoter methylation has the opposite effect as methylation insulates genes from transcription.

Pathway-level insight

To explore how changes on the individual gene level impact biological pathways, we can compare results before and after incorporating a directional penalty.

Use the example GMT file embedded in the package.

fname_GMT2 <- system.file("extdata", "hsapiens_REAC_subset2.gmt", 
        package = "ActivePathways")

First perform integrative pathway enrichment analysis with no directionality. Then perform directional integration and pathway enrichment analysis. For this analysis the directional coefficients and constraints_vector are from the gene-based analysis described above.

enriched_pathways <- ActivePathways(
        pval_matrix, gmt = fname_GMT2, cytoscape_file_tag = "Original_")

constraints_vector <- c(1,1)
constraints_vector
## [1] 1 1
dir_matrix[example_genes,]
##        rna protein
## ACTN4    1       1
## PIK3R4   1      -1
## PPIL1   -1      -1
## NELFE   -1      -1
## LUZP1    1       1
## ITGB2    1       1
enriched_pathways_directional <- ActivePathways(
        pval_matrix, gmt = fname_GMT2, cytoscape_file_tag = "Directional_", merge_method = "DPM",
        scores_direction = dir_matrix, constraints_vector = constraints_vector)

Examine the pathways that are lost when directional information is incorporated in the data integration.

pathways_lost_in_directional_integration = 
        setdiff(enriched_pathways$term_id, enriched_pathways_directional$term_id)

pathways_lost_in_directional_integration
## [1] "REAC:R-HSA-3858494" "REAC:R-HSA-69206"   "REAC:R-HSA-69242"  
## [4] "REAC:R-HSA-9013149"
enriched_pathways[enriched_pathways$term_id %in% pathways_lost_in_directional_integration,] 
##               term_id                              term_name adjusted_p_val
## 1: REAC:R-HSA-3858494 Beta-catenin independent WNT signaling    0.013437464
## 2:   REAC:R-HSA-69206                        G1/S Transition    0.026263457
## 3:   REAC:R-HSA-69242                                S Phase    0.009478766
## 4: REAC:R-HSA-9013149                      RAC1 GTPase cycle    0.047568911
##    term_size                                  overlap    evidence
## 1:       143 PSMA5,PSMB4,PSMC5,PSMD11,PSMA8,GNG13,... rna,protein
## 2:       130  PSMA5,PSMB4,CDK4,PSMC5,PSMD11,CCNB1,...     protein
## 3:       162   PSMA5,PSMB4,RFC3,CDK4,PSMC5,PSMD11,...    combined
## 4:       184 SRGAP1,TIAM1,BAIAP2,FMNL1,DOCK9,PAK3,...         rna
##                                       Genes_rna
## 1:       GNG13,PSMC1,PSMA5,PSMB4,ITPR3,DVL1,...
## 2:                                           NA
## 3:                                           NA
## 4: SRGAP1,TIAM1,FMNL1,ARHGAP30,FARP2,DOCK10,...
##                                Genes_protein
## 1: PSMA8,PSMD11,PSMA5,PRKG1,PSMD10,PSMB4,...
## 2:   PSMD11,PSMA5,PSMD10,PSMB4,CDK7,ORC2,...
## 3:                                        NA
## 4:                                        NA

An example of a lost pathway is Beta-catenin independent WNT signaling. Out of the 32 genes that contribute to this pathway enrichment, 10 genes are in directional conflict. The enrichment is no longer identified when these genes are penalised due to the conflicting log2 fold-change directions.

wnt_pathway_id <- "REAC:R-HSA-3858494"

enriched_pathway_genes <- unlist(
        enriched_pathways[enriched_pathways$term_id == wnt_pathway_id,]$overlap)
enriched_pathway_genes
##  [1] "PSMA5"  "PSMB4"  "PSMC5"  "PSMD11" "PSMA8"  "GNG13"  "SMURF1" "PSMC1" 
##  [9] "PSMA4"  "PLCB2"  "PRKG1"  "PSMD4"  "PSMD1"  "PSMD10" "PSMA6"  "PSMA2" 
## [17] "PSMA1"  "PRKCA"  "PSMC6"  "RHOA"   "PSMB3"  "PSMB1"  "PSME3"  "ITPR3" 
## [25] "AGO4"   "DVL3"   "PSMA3"  "PPP3R1" "DVL1"   "CLTA"   "PSME2"  "CALM1" 
## [33] "PSMD6"  "PSMB6"

Examine the pathway genes that have directional disagreement and contribute to the lack of pathway enrichment in the directional analysis

pathway_gene_pvals = pval_matrix[enriched_pathway_genes,]
pathway_gene_directions = dir_matrix[enriched_pathway_genes,]

directional_conflict_genes = names(which(
        pathway_gene_directions[,1] != pathway_gene_directions[,2] &
        pathway_gene_directions[,1] != 0 & pathway_gene_directions[,2] != 0))

pathway_gene_pvals[directional_conflict_genes,]
##               rna     protein
## PSMD11 0.34121101 0.002094310
## PSMA8  0.55510836 0.001415197
## SMURF1 0.03353629 0.042995333
## PSMD1  0.04650877 0.100178048
## RHOA   0.01786687 0.474628084
## PSME3  0.07148904 0.130184883
## ITPR3  0.01660850 0.589929787
## DVL3   0.46381447 0.022535743
## PSME2  0.03274707 0.514351089
## PSMB6  0.02863259 0.677224905
pathway_gene_directions[directional_conflict_genes,]
##        rna protein
## PSMD11   1      -1
## PSMA8    1      -1
## SMURF1   1      -1
## PSMD1    1      -1
## RHOA    -1       1
## PSME3    1      -1
## ITPR3    1      -1
## DVL3     1      -1
## PSME2   -1       1
## PSMB6   -1       1
length(directional_conflict_genes)
## [1] 10

To visualise differences in biological pathways between ActivePathways analyses with or without a directional penalty, we combine both outputs into a single enrichment map for plotting.

Significance threshold and returning all results

A pathway is considered to be significantly enriched if it has adjusted_p_val <= significant. The parameter significant represents the maximum adjusted P-value for a resulting pathway to be considered statistically significant. Only the significant pathways are returned. P-values from pathway enrichment analysis are adjusted for multiple testing correction to provide a more conservative analysis (see below).

nrow(ActivePathways(scores, gmt_file, significant = 0.05))
## [1] 33
nrow(ActivePathways(scores, gmt_file, significant = 0.1))
## [1] 36

GMT objects

In the most common use case, a GMT file is downloaded from a database and provided directly to ActivePathways as a location in the file system. In some cases, it may be useful to load a GMT file separately for preprocessing. The ActivePathways package includes an interface for working with GMT objects. The GMT object can be read from a file using the read.GMT function. The GMT is structured as a list of terms (e.g., molecular pathways, biological processes, etc.). In the GMT object, each term is a list containing an id, a name, and the list of genes associated with this term.

gmt <- read.GMT(gmt_file)
names(gmt[[1]])
## [1] "id"    "name"  "genes"
# Pretty-print the GMT
gmt[1:3]
## REAC:1630316 - Glycosaminoglycan metabolism 
##  ST3GAL1, CHP1, B4GALT5, ST3GAL4, SLC9A1, CHST11, B3GAT3, SLC26A1, ARSB, ABCC5, LUM, PAPSS1, SDC4, NAGLU, AC022400.6, HYAL1, NDST3, SLC35B2, B3GAT1, CHST2, DCN, CEMIP, IDS, IDUA, HS3ST3B1, B3GNT7, CHST12, CHST14, BCAN, B4GALT3, HS3ST1, B4GALT1, CSPG4, CHPF2, HEXB, UST, XYLT1, NDST2, AL050331.3, NAT6, CHSY3, NCAN, FMOD, B3GNT2, HPSE, ACAN, LYVE1, GPC2, GPC1, B4GALT2, SLC35B3, KERA, GPC4, GUSB, HYAL3, HS6ST3, HAS3, CHPF, OMD, SDC2, B3GNT3, CSGALNACT2, CSPG5, BGN, CHST6, HS6ST2, SDC3, HS2ST1, CD44, AGRN, B3GALT6, PRELP, GLB1, GPC6, B3GNT4, PAPSS2, HS3ST4, CHST9, GNS, CHST1, CSGALNACT1, HS6ST1, CHST7, AL590542.1, HYAL2, HAS1, CHSY1, OGN, B4GAT1, B4GALT4, DSEL, EXT1, SDC1, SLC35D2, EXT2, ST3GAL2, ST3GAL6, CHST3, CHST5, HSPG2, SLC26A2, HEXA, XYLT2, HS3ST6, HS3ST5, GLCE, AC244197.3, B4GALT7, HAS2, NDST4, CHST15, B4GALT6, B3GAT2, STAB2, HMMR, ST3GAL3, VCAN, NDST1, HS3ST2, GPC5, HS3ST3A1, HPSE2, GLB1L, SGSH, GPC3, CHST13, DSE 
## 
##  REAC:5633007 - Regulation of TP53 Activity 
##  PPP1R13L, RFC3, SGK1, HDAC1, SETD9, RPA3, CSNK2A2, EHMT1, PDPK1, CCNG1, TAF6, ING5, TP53INP1, BRCA1, TAF1, TAF9, EHMT2, RBBP8, CHD4, USP2, BRPF1, BANP, PPP2R5C, SSRP1, TAF4, TMEM55B, PPP1R13B, TAF10, PRR5, ATM, ZNF385A, MRE11, PIP4K2C, SMYD2, BARD1, MBD3, TP53BP2, AURKA, RPA1, KAT6A, CHD3, ATR, EXO1, PIN1, RMI2, PRKAB2, AKT2, PPP2CA, TAF13, RAD9B, PPP2CB, MDM4, POU4F1, TAF4B, MDM2, BRD1, CDK2, CSNK2B, PIP4K2A, RAD50, PML, AKT3, RAD1, USP7, PRDM1, RFC4, MTOR, POU4F2, RFC2, KMT5A, DNA2, NUAK1, PIP4K2B, MAPKAP1, PPP2R1A, RBBP7, MTA2, TP53RK, TAF1L, GATAD2B, JMY, TAF12, CDK5R1, RICTOR, PHF20, AC134772.2, STK11, TAF3, BLM, ATRIP, TAF7L, TP73, MAP2K6, RPA2, PRKAG1, TOPBP1, PRKAB1, TOP3A, CCNA1, RBBP4, PPP2R1B, DAXX, TTC5, CHEK2, CDK5, RMI1, TAF2, NOC2L, TP63, CSNK2A1, UBB, PRKAG2, ING2, RNF34, MLST8, HIPK2, TBP, L3MBTL1, EP300, MAPK11, BRIP1, PRMT5, PRKAG3, RFC5, NBN, TPX2, HDAC2, CDK1, MAPKAPK5, CHEK1, TAF5, BRPF3, RPS27A, AKT1, GATAD2A, RHNO1, CCNA2, HUS1, MEAF6, PRKAA1, CDKN2A, RAD17, KAT5, SUPT16H, TAF7, PLK3, UBA52, WRN, MAPK14, TAF9B, UBC, HIPK1, TP53, DYRK2, RFFL, TAF11, BRD7, RAD9A, AURKB, PRKAA2 
## 
##  REAC:5358346 - Hedgehog ligand biogenesis 
##  PSMB11, PSMD8, PSMB6, PSMB8, PSMD13, PSMA2, PSMB3, PSMB7, PSME2, PSMD1, PSMA8, PSMA1, PSMD2, PSMD14, PSMC1, IHH, SEL1L, SCUBE2, PSMB10, PSMD10, PSMA5, PSMD3, PSMC2, PSMC6, PSME1, PSME4, PSMC4, AC010132.3, PSMD5, UBB, GPC5, PSMA7, PSMB2, PSMD9, SEM1, PSMA3, DISP2, PSMB9, PSMD6, OS9, PSMC3, PSMA6, RPS27A, ADAM17, PSMB1, DHH, PSMF1, HHAT, UBA52, UBC, DERL2, PSMD4, VCP, ERLEC1, PSME3, PSMB4, P4HB, PSMA4, PSMD7, PSMD11, PSMC5, NOTUM, PSMD12, PSMB5, SYVN1, SHH
# Look at the genes annotated to the first term
gmt[[1]]$genes
##  [1] "ST3GAL1"    "CHP1"       "B4GALT5"    "ST3GAL4"    "SLC9A1"    
##  [6] "CHST11"     "B3GAT3"     "SLC26A1"    "ARSB"       "ABCC5"     
## [11] "LUM"        "PAPSS1"     "SDC4"       "NAGLU"      "AC022400.6"
## [16] "HYAL1"      "NDST3"      "SLC35B2"    "B3GAT1"     "CHST2"     
## [21] "DCN"        "CEMIP"      "IDS"        "IDUA"       "HS3ST3B1"  
## [26] "B3GNT7"     "CHST12"     "CHST14"     "BCAN"       "B4GALT3"   
## [31] "HS3ST1"     "B4GALT1"    "CSPG4"      "CHPF2"      "HEXB"      
##  [ reached getOption("max.print") -- omitted 92 entries ]
# Get the full name of Reactome pathway 2424491
gmt$`REAC:2424491`$name
## [1] "DAP12 signaling"

The most common processing step for GMT files is the removal of gene sets that are too large or small. Here we remove pathways (gene sets) that have less than 10 or more than 500 annotated genes.

gmt <- Filter(function(term) length(term$genes) >= 10, gmt)
gmt <- Filter(function(term) length(term$genes) <= 500, gmt)

The new GMT object can now be used for analysis with ActivePathways

ActivePathways(scores, gmt)
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   0.0002890847       358
##  2:  REAC:177929             Signaling by EGFR   0.0015162274       366
##  3: REAC:2559583           Cellular Senescence   0.0001560920       196
##                                              overlap  evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...       CDS
##  2:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...       CDS
##  3:           TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,...       CDS
##                            Genes_X3UTR Genes_X5UTR
##  1:                                 NA          NA
##  2:                                 NA          NA
##  3:                                 NA          NA
##                                      Genes_CDS Genes_promCore
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  2:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  3:      TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,...             NA
##  [ reached getOption("max.print") -- omitted 28 rows ]

This filtering can also be done automatically using the geneset_filter option to the ActivePathways function. By default, ActivePathways removes gene sets with less than five or more than a thousand genes from the GMT prior to analysis. In general, gene sets that are too large are likely not specific and less useful in interpreting the data and may also cause statistical inflation of enrichment scores in the analysis. Gene sets that are too small are likely too specific for most analyses and make the multiple testing corrections more stringent, potentially causing deflation of results. A stricter filter can be applied by running ActivePathways with the parameter geneset_filter = c(10, 500).

ActivePathways(scores, gmt_file, geneset_filter = c(10, 500))
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   3.660807e-05       358
##  2:  REAC:177929             Signaling by EGFR   5.039994e-04       366
##  3: REAC:2559583           Cellular Senescence   5.366637e-05       196
##                                              overlap  evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...       CDS
##  2:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...       CDS
##  3:           TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,...       CDS
##                            Genes_X3UTR Genes_X5UTR
##  1:                                 NA          NA
##  2:                                 NA          NA
##  3:                                 NA          NA
##                                      Genes_CDS Genes_promCore
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  2:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...             NA
##  3:      TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,...             NA
##  [ reached getOption("max.print") -- omitted 28 rows ]

This GMT object can be saved to a file

write.GMT(gmt, 'hsapiens_REAC_subset_filtered.gmt')

Background gene set for statistical analysis

To perform pathway enrichment analysis, a global set of genes needs to be defined as a statistical background set. This represents the universe of all genes in the organism that the analysis can potentially consider. By default, this background gene set includes every gene that is found in the GMT file in any of the biological processes and pathways. Another option is to provide the full set of all protein-coding genes, however, this may cause statistical inflation of the results since a sizable fraction of all protein-coding genes still lack any known function.

Sometimes the statistical background set needs to be considerably narrower than the GMT file or the full set of genes. Genes need to be excluded from the background if the analysis or experiment specifically excluded these genes initially. An example would be a targeted screen or sequencing panel that only considered a specific class of genes or proteins (e.g., kinases). In analysing such data, all non-kinase genes need to be excluded from the background set to avoid statistical inflation of all gene sets related to kinase signalling, phosphorylation and similar functions.

To alter the background gene set in ActivePathways, one can provide a character vector of gene names that make up the statistical background set. In this example, we start from the original list of genes in the entire GMT and remove one gene, the tumor suppressor TP53. The new background set is then used for the ActivePathways analysis.

background <- makeBackground(gmt)
background <- background[background != 'TP53']
ActivePathways(scores, gmt_file, background = background)
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   0.0022333168       358
##  2:  REAC:422475                 Axon guidance   0.0001161168       555
##  3:  REAC:177929             Signaling by EGFR   0.0043366915       366
##                                              overlap  evidence
##  1:               PIK3CA,KRAS,PTEN,BRAF,NRAS,B2M,...       CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,...     X3UTR
##  3:             PIK3CA,KRAS,PTEN,BRAF,NRAS,CALM2,...       CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                     Genes_CDS Genes_promCore
##  1:     PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...             NA
##  2:                                        NA             NA
##  3:     PTEN,KRAS,PIK3CA,BRAF,NRAS,CDKN1A,...             NA
##  [ reached getOption("max.print") -- omitted 28 rows ]

Note that only the genes found in the background set are used for testing enrichment. Any genes in the input data that are not in the background set will be automatically removed by ActivePathways.

Merging p-values

A key feature of ActivePathways is the integration of multiple complementary omics datasets to prioritise genes for the pathway analysis. In this approach, genes with significant scores in multiple datasets will get the highest priority, and certain genes with weak scores in multiple datasets may be ranked higher, highlighting functions that would be missed when only single datasets were analysed. ActivePathways accomplishes this by merging the series of p-values in the columns of the scores matrix for each gene into a single combined P-value. The four methods to merge P-values are Fisher’s method (the default), Brown’s method (extension of Fisher’s), Stouffer’s method and Strube’s method (extension of Stouffer’s). Each of these methods have been extended to account for the directional activity of genes across multi-omics datasets with Fisher_directional, DPM, Stouffer_directional, and Strube_directional methods. The Brown’s and Strube’s methods are more conservative in the case when the input datasets show some large-scale similarities (i.e., covariation), since they will take that into account when prioritising genes across similar datasets. The Brown’s or Strube’s method are recommended for most cases since omics datasets are often not statistically independent of each other and genes with high scores in one dataset are more likely to have high scores in another dataset just by chance.

The following example compares the merged P-values of the first few genes between the four methods. Fisher’s and Stouffer’s method are two alternative strategies to merge p-values and as a result the top scoring genes and p-values may differ. The genes with the top scores for Brown’s method are the same as Fisher’s, but their P-values are more conservative. This is the case for Strube’s method as well, in which the top scoring genes are the same as Stouffer’s method, but the P-values are more conservative.

sort(merge_p_values(scores, 'Fisher'))[1:5]
##         TP53          VHL       PIK3CA         KRAS         PTEN 
## 2.275933e-32 4.677780e-31 9.878802e-30 5.169163e-29 5.813021e-29
sort(merge_p_values(scores, 'Brown'))[1:5]
##         TP53          VHL       PIK3CA         KRAS         PTEN 
## 2.850936e-21 1.933490e-20 1.334809e-19 3.808817e-19 4.102937e-19
sort(merge_p_values(scores, 'Stouffer'))[1:5]
##         TP53          VHL       PIK3CA         KRAS         PTEN 
## 5.889756e-18 1.228308e-15 3.067765e-15 1.396875e-13 2.275649e-13
sort(merge_p_values(scores, 'Strube'))[1:5]
##         TP53          VHL       PIK3CA         KRAS         PTEN 
## 1.170932e-11 3.249661e-10 5.746455e-10 6.206504e-09 8.414282e-09

This function can be used to combine some of the data before the analysis for any follow-up analysis or visualisation. For example, we can merge the columns X5UTR, X3UTR, and promCore into a single non_coding column (these correspond to predictions of driver mutations in 5’UTRs, 3’UTRs, and core promoters of genes, respectively). This will consider the three non-coding regions as a single column, rather than giving them all equal weight to the CDS column.

scores2 <- cbind(scores[, 'CDS'], merge_p_values(scores[, c('X3UTR', 'X5UTR', 'promCore')], 'Brown'))
colnames(scores2) <- c('CDS', 'non_coding')
scores[c(2179, 1760),]
##          X3UTR       X5UTR          CDS  promCore
## TP53 0.8331532 0.005613463 1.842404e-33 0.0254239
## PTEN 0.1627596 0.310132725 2.180301e-32 0.6867016
scores2[c(2179, 1760),]
##               CDS non_coding
## TP53 1.842404e-33  0.0189964
## PTEN 2.180301e-32  0.3437851
ActivePathways(scores, gmt_file)
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   4.625918e-04       555
##  3:  REAC:177929             Signaling by EGFR   6.197504e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 31 rows ]
ActivePathways(scores2, gmt_file)
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   0.0003694968       358
##  2:  REAC:422475                 Axon guidance   0.0051873493       555
##  3:  REAC:177929             Signaling by EGFR   0.0014545468       366
##  4: REAC:2559583           Cellular Senescence   0.0005822291       196
##                                     overlap evidence
##  1:     TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,...      CDS
##  2: PIK3CA,KRAS,BRAF,NRAS,RPS6KA3,CALM2,... combined
##  3:     TP53,PIK3CA,PTEN,KRAS,BRAF,NRAS,...      CDS
##  4:  TP53,RB1,ATM,MAP2K7,CDKN1A,RPS6KA3,...      CDS
##                                      Genes_CDS Genes_non_coding
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...               NA
##  2:                                         NA               NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...               NA
##  4:      TP53,RB1,ATM,MAP2K7,CDKN1A,CDKN1B,...               NA
##  [ reached getOption("max.print") -- omitted 26 rows ]

Cutoff for filtering the ranked gene list for pathway enrichment analysis

To perform pathway enrichment of the ranked gene list of merged P-values, ActivePathways defines a P-value cutoff to filter genes that have little or no significance in the series of omics datasets. This threshold represents the maximum p-value for a gene to be considered of interest in our analysis. The threshold is 0.1 by default but can be changed using the cutoff option. The default option considers raw P-values that have not been adjusted for multiple-testing correction. Therefore, the default option provides a relatively lenient approach to filtering the input data. This is useful for finding additional genes with weaker signals that associate with well-annotated and strongly significant genes in the pathway and functional context.

nrow(ActivePathways(scores, gmt_file))
## [1] 33
nrow(ActivePathways(scores, gmt_file, cutoff = 0.01))
## [1] 18

Adjusting P-values using multiple testing correction

Multiple testing correction is essential in the analysis of omics data since each analysis routinely considers thousands of hypotheses and apparently significant P-values will occur by chance alone. ActivePathways uses multiple testing correction at the level of pathways as P-values from the ranked hypergeometric test are adjusted for multiple testing (note that the ranked gene list provided to the ranked hypergeometric test remain unadjusted for multiple testing by design).

The package uses the p.adjust function of base R to run multiple testing corrections and all methods in this function are available. By default, ‘holm’ correction is used. The option correction_method = 'none' can be used to override P-value adjustment (not recommended in most cases).

nrow(ActivePathways(scores, gmt_file))
## [1] 33
nrow(ActivePathways(scores, gmt_file, correction_method = 'none'))
## [1] 88

The results table of ActivePathways

Consider the results object from the basic use case of ActivePathways

res <- ActivePathways(scores, gmt_file)
res
##          term_id                     term_name adjusted_p_val term_size
##  1: REAC:2424491               DAP12 signaling   4.491268e-05       358
##  2:  REAC:422475                 Axon guidance   4.625918e-04       555
##  3:  REAC:177929             Signaling by EGFR   6.197504e-04       366
##                                              overlap       evidence
##  1:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##  2:          PIK3CA,KRAS,BRAF,NRAS,CALM2,RPS6KA3,... X3UTR,promCore
##  3:              TP53,PIK3CA,KRAS,PTEN,BRAF,NRAS,...            CDS
##                              Genes_X3UTR Genes_X5UTR
##  1:                                   NA          NA
##  2: CALM2,ARPC2,RHOA,NUMB,CALM1,ACTB,...          NA
##  3:                                   NA          NA
##                                      Genes_CDS
##  1:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##  2:                                         NA
##  3:        TP53,PTEN,KRAS,PIK3CA,BRAF,NRAS,...
##                                  Genes_promCore
##  1:                                          NA
##  2: EFNA1,IQGAP1,COL4A1,SCN2B,RPS6KA3,CALM2,...
##  3:                                          NA
##  [ reached getOption("max.print") -- omitted 31 rows ]

The columns term_id, term_name, and term_size give information about each pathway detected in the enrichment analysis. The adjusted_p_val column with the adjusted P-value indicates the confidence that the pathway is enriched after multiple testing correction.

The overlap column provides the set of genes from the integrated gene list that occur in the given enriched gene set (i.e., molecular pathway or biological process). These genes were quantified across multiple input omics datasets and prioritized based on their joint significance in the input data. Note that the genes with the strongest scores across the multiple datasets are listed first.

res$overlap[1:3]
## [[1]]
##  [1] "TP53"   "PIK3CA" "KRAS"   "PTEN"   "BRAF"   "NRAS"   "B2M"    "CALM2" 
##  [9] "CDKN1A" "CDKN1B"
## 
## [[2]]
##  [1] "PIK3CA"  "KRAS"    "BRAF"    "NRAS"    "CALM2"   "RPS6KA3" "ACTB"   
##  [8] "EFNA1"   "SCN2B"   "EPHA2"   "GAP43"   "COL4A1"  "RASAL2"  "CLTC"   
## [15] "IQGAP1"  "NF1"     "FGF9"    "ADAM10"  "PTPRC"   "ITGA10"  "PDGFB"  
## [22] "COL4A2"  "RGMB"    "RASA1"   "FGF6"    "CALM1"   "PSMB7"   "PPP2R5D"
## [29] "PPP2R1A" "RAP1A"   "CNTNAP1" "GRIN2B"  "GRB2"    "DLG1"    "MET"    
##  [ reached getOption("max.print") -- omitted 6 entries ]
## 
## [[3]]
## [1] "TP53"   "PIK3CA" "KRAS"   "PTEN"   "BRAF"   "NRAS"   "CALM2"  "CDKN1A"
## [9] "CDKN1B"

This column is useful for further data analysis, allowing the researcher to go from the space of enriched pathways back to the space of individual genes and proteins involved in pathways and their input omics datasets.

The evidence column provides insights to which of the input omics datasets (i.e., columns in the scores matrix) contributed to the discovery of this pathway or process in the integrated enrichment analysis. To achieve this level of detail, ActivePathways also analyses the gene lists ranked by the individual columns of the input matrix to detect enriched pathways. The evidence column lists the name of a given column of the input matrix if the given pathway is detected both in the integrated analysis and the analysis of the individual column. For example, in this analysis the majority of the detected pathways have only ‘CDS’ as their evidence, since these pathways were found to be enriched in data fusion through P-value merging and also by analysing the gene scores in the column CDS (for reference, CDS corresponds to protein-coding sequence where the majority of known driver mutations have been found). As a counter-example, the record for the pathway REAC:422475 in our results lists as evidence list('X3UTR', 'promCore'), meaning that the pathway was found to be enriched when considering either the X3UTR column, the promCore column, or the combined omics datasets.

unlist(res[res$term_id == "REAC:422475","evidence"])
##  evidence1  evidence2 
##    "X3UTR" "promCore"

Finally, if a pathway is found to be enriched only with the combined data and not in any individual column, ‘combined’ will be listed as the evidence. This subset of results may be particularly interesting since it highlights complementary aspects of the analysis that would remain hidden in the analysis of any input omics dataset separately.

The following columns named as Genes_{column} help interpret how each pathway was detected in the multi-omics data integration, as listed in the column evidence. These columns show the genes present in the pathway and any of the input omics datasets. If the given pathway was not identified using the scores of the given column of the input scores matrix, an NA value is shown. Again, the genes are ranked by the significance of their scores in the input data, to facilitate identification of the most relevant genes in the analysis.

Writing results to a CSV file

The results are returned as a data.table object due to some additional data structures needed to store lists of gene IDs and supporting evidence. The usual R functions write.table and write.csv will struggle with exporting the data unless the gene and evidence lists are manually transformed as strings. Fortunately, the fwrite function of data.table can be used to write the file directly and the ActivePathways package includes the function export_as_CSV as a shortcut that uses the vertical bar symbol to concatenate gene lists.

result_file <- paste('ActivePathways_results.csv', sep = '/')
export_as_CSV (res, result_file) # remove comment to run
read.csv(result_file, stringsAsFactors = F)[1:3,]

The fwrite can be called directly for customised output.

result_file <- paste('ActivePathways_results2.txt', sep = '/')
data.table::fwrite(res, result_file, sep = '\t', sep2 = c('', ',', ''))
cat(paste(readLines(result_file)[1:2], collapse = '\n'))

Visualising pathway enrichment results using enrichment maps in Cytoscape

The Cytoscape software and the EnrichmentMap app provide powerful tools to visualise the enriched pathways from ActivePathways as a network (i.e., an Enrichment Map). To facilitate this visualisation step, ActivePathways provides the files needed for building enrichment maps. To create these files, a file prefix must be supplied to ActivePathways using the argument cytoscape_file_tag. The prefix can be a path to an existing writable directory.

res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__")

Four files are written using the prefix:

Creating enrichment maps using the ActivePathways results

The following sections will discuss how to create a pathway enrichment map using the results from ActivePathways. The datasets analysed earlier in the vignette will be used. To follow the steps, save the required files from ActivePathways in an accessible location.

Required software

  1. Cytoscape, see https://cytoscape.org/download.html
  2. EnrichmentMap app of Cytoscape, see menu Apps>App manager or https://apps.cytoscape.org/apps/enrichmentmap
  3. EnhancedGraphics app of Cytoscape, see menu Apps>App manager or https://apps.cytoscape.org/apps/enhancedGraphics

Required files

ActivePathways writes four files that are used to build enrichment maps in Cytoscape.

files <- c(system.file('extdata', 'enrichmentMap__pathways.txt', package='ActivePathways'),
           system.file('extdata', 'enrichmentMap__subgroups.txt', package='ActivePathways'),
           system.file('extdata', 'enrichmentMap__pathways.gmt', package='ActivePathways'),
           system.file('extdata', 'enrichmentMap__legend.pdf', package='ActivePathways'))

The following commands will perform the basic analysis again and write output files required for generating enrichment maps into the current working directory of the R session. All file names use the prefix enrichmentMap__. The generated files are also available in the ActivePathways R package as shown above.

gmt_file <- system.file('extdata', 'hsapiens_REAC_subset.gmt', package = 'ActivePathways')
scores_file <- system.file('extdata', 'Adenocarcinoma_scores_subset.tsv', package = 'ActivePathways')

scores <- read.table(scores_file, header = TRUE, sep = '\t', row.names = 'Gene')
scores <- as.matrix(scores)
scores[is.na(scores)] <- 1

res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__")

The four files written are:

The following code will examine a few lines of the files generated by ActivePathways.

cat(paste(readLines(files[1])[1:5], collapse='\n'))
## term_id  term_name   adjusted_p_val
## REAC:2424491 DAP12 signaling 4.49126833230489e-05
## REAC:422475  Axon guidance   0.00046259184602835
## REAC:177929  Signaling by EGFR   0.000619750411866923
## REAC:2559583 Cellular Senescence 6.59544666946083e-05
cat(paste(readLines(files[2])[1:5], collapse='\n'))
## term_id  X3UTR   X5UTR   CDS promCore    combined    instruct
## REAC:2424491 0   0   1   0   0   piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:422475  1   0   0   1   0   piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:177929  0   0   1   0   0   piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
## REAC:2559583 0   0   1   0   0   piechart: attributelist="X3UTR,X5UTR,CDS,promCore,combined" colorlist="#FF0000,#CCFF00,#00FF66,#0066FF,#FFFFF0" showlabels=FALSE
cat(paste(readLines(files[3])[18:19], collapse='\n'))
## REAC:1257604 PIP3 activates AKT signaling    MTOR    ERBB4   NR4A1   MET TSC2    AGO1    AKT3    PIP4K2A TNRC6B  TRIB3   FOXO3   GSK3A   MDM2    FGF1    PIP5K1A PHLPP1  PPP2CB  KLB PPP2CA  FGF8    CREB1   AKT2    CDKN1B  BAD FGF9    PPP2R5E PIK3AP1 FOXO4   RPS6KB2 AKT1S1  AL358075.4  FGF2    ICOS    BTC CDKN1A  KITLG   FGF17   PIK3CB  PIK3R2  FGF4    FGF18   PIP4K2C FGF7    PDGFA   PRR5    EREG    PPP2R5B FGF19   FGFR1   VAV1    ERBB3   HBEGF   FGF3    PPP2R5C PDGFRB  PDGFB   TNRC6C  CD80    FGFR3   MIR26A2 TRAT1   FGF23   AGO3    PTPN11  PIP5K1B GAB1    MAPK1   LCK NRG1    TNRC6A  MIR26A1 PDPK1   FGF22   CHUK    FOXO1   NRG2    PIK3R1  FGF20   FGFR2   PPP2R5D SRC PIK3CA  TP53    NRG4    MOV10   KIT PIP5K1C NRG3    PTEN    CD19    INS HGF AGO2    PHLPP2  IRS2    PIK3CD  FGFR4   PPP2R5A AKT1    AL672043.1  KL  PIK3R3  EGFR    FRS2    IER3    INSR    MLST8   CD28    FYN AGO4    FGF16   THEM4   FGF5    IRS1    PPP2R1B FGF10   EGF MAPK3   GSK3B   PDGFRA  RICTOR  ERBB2   CASP9   GRB2    PPP2R1A FGF6    MAPKAP1 PIP4K2B INS-IGF2    CD86
## REAC:212436  Generic Transcription Pathway   ZNF776  ZNF736  ZNF700  CHM ZNF658B TP53AIP1    CASP10  RARG    ZNF235  HUS1    ZNF200  ZNF17   ZNF664  RRM2B   TFAP2C  ZNF786  PSMB1   ZNF737  TAF5    NDUFA4  COX6B1  TOM1L1  ZNF493  ZNF649  NBN ZNF557  RFC5    POLR2I  ZNF213  PRMT5   ZNF600  ZNF418  ZNF671  MAPK11  ZNF426  SRC TAF11   GTF2H4  BTG2    ZNF655  KIT ERCC2   ZNF677  ZNF709  ZNF234  ATAD2   SCO1    TAF9B   ZNF28   ZNF691  MAPK14  ZNF584  ZNF317  CTGF    ZNF445  ZNF492  ZNF771  KAT5    ZNF558  BCL6    TEAD3   TBL1X   ZIM3    TAF7L   ZNF157  ZNF136  ZNF415  ZNF792  BLM PSMD1   STK11   RICTOR  NELFCD  DDB2    ZNF446  SP1 ZNF727  NOTCH2  RUNX3   TRAPPC2 LAMTOR5 ZNF460  ZNF331  PSMB6   NKX2-5  ZKSCAN5 PPP2R1A MAPKAP1 CITED1  ZNF554  DNA2    MED1    SMAD3   ZNF681  CASP2   JUN ZNF20   ZNF839  ZNF14   CTNNB1  EP300   PSMC4   ZNF189  L3MBTL1 NDRG1   MED15   SPP1    ZNF567  RORB    ZNF483  PRKAG2  KDM5B   ZNF12   CCNT1   NOC2L   CGA TNKS1BP1    NR2F1   RXRB    RMI1    RABGGTB ZNF705E MT-CO3  CCNA1   ZNF697  GLS WWTR1   ESRRA   BCL2L11 KRBOX4  PPP2CB  CCND1   ZNF112  ZNF740  ZNF10   NR2E1   ZNF486  SMARCD3 PSMB9   PIN1    PSMF1   ELOC    GTF2H5  NR2C2AP PSMC3   MYBL2   ZNF223  PSMA6   ZNF41   RABGGTA POLR2B  ZNF225  CHD3    ZNF75A  KAT6A   MED26   HNF4A   CNOT4   CENPJ   RUNX2   DEK ZNF23   TCEA1   MBD3    ZNF540  BARD1   ZNF595  NR5A1   ZNF343  TXN PRDX1   ZNF33A  WWOX    MED24   BRD2    TSC2    POLR2D  ZNF440  RAD1    GATA4   CTDP1   ZNF479  ZNF347  ZNF619  ZNF749  CSNK2B  ZNF211  CDK13   CDK2    PPM1A   ZNF256  FOXO3   ZNF433  SMURF2  ZNF37A  COX11   ZNF227  TFAP2D  ZNF33B  ZNF133  PSMD14  BRPF1   ESRRB   ZNF577  USP2    ITGAL   PCNA    UBE2I   CITED2  YWHAE   IGFBP3  ZNF224  ZNF772  BCL2L14 ZNF555  TAF1    BRCA1   NR3C2   NOP2    ZNF547  ZNF729  RBL2    TNRC6A  ZNF706  CASP6   PLAGL1  CSNK2A2 SKIL    HDAC1   RXRG    CDKN2B  PSME2   PSMB8   SMURF1  PSMA2   UBE2D3  PSMD13  AC010132.3  AC003002.1  CNOT11  ZFP1    PMS2    ATM ZNF385A ZNF304  ERCC3   ZNF519  TAF10   ZNF597  POLR2J  PSME1   SUPT4H1 THRB    TAF4    MNAT1   PARP1   ZNF160  ZNF230  PLK2    ZNF333  NR2E3   MAML3   ZNF684  BANP    CNOT7   BNIP3L  ZNF773  ZNF202  ZNF425  CDK9    TCF7L2  ZNF692  ZNF710  ZSCAN25 ZNF180  ZNF670  ZNF510  GATAD2A MT-CO1  ZNF552  COX14   RPS27A  BRPF3   NRBP1   MAPKAPK5    HDAC2   CDK1    ZNF568  ZNF221  GTF2F1  COX5B   BRIP1   AURKB   ZNF208  TFDP1   ZNF667  COX7A2L RGCC    RAD9A   BRD7    ZNF717  PSMD12  ZNF571  POLR2F  ZNF614  MYC UBA52   PTEN    MED6    SUPT16H PSMD7   NR1H3   PRDX2   PSME3   CDKN2A  PRKAA1  ZNF155  COX16   ATRIP   SMAD2   ZFP14   NR5A2   ZNF468  ZNF607  GATAD2B RBBP7   PRDX5   RRAGA   ZNF680  NUAK1   SMAD4   VEGFA   ZNF471  NCOR1   ZNF79   TCF7    RXRA    COX19   NPPA    GTF2H3  RNF34   ZNF613  ING2    UBB AGO4    CSNK2A1 MED10   CITED4  ZNF778  ZNF282  PPARA   CDK5    ZNF429  ZKSCAN1 PRKAB1  ELOA    ZNF138  TMEM219 ZNF70   ZNF431  RPA2    ZNF26   SCO2    CCNH    TRIM28  RARA    ZKSCAN8 RUNX1   GLS2    ZNF735  PPP2CA  ZNF543  RAD51D  ZNF582  APAF1   ZNF720  ZNF782  EXO1    LAMTOR4 NR1H4   KCTD15  ZFP28   CNOT2   CCNE2   PSMA7   TGS1    RPA1    ZNF19   NR3C1   PSMA3   ZNF337  TP53BP2 ZNF586  ZSCAN32 SURF1   MTOR    COX18   POLR2G  SESN3   CNOT6   GTF2H1  PSMC5   ZNF714  NR4A1   AGO1    ZNF716  CNOT6L  ZNF154  ZNF92   THRA    ZNF775  RAD50   AR  PSMD4   TBX5    BRD1    ZNF490  ZNF324B ZNF263  MAML2   ZNF212  PSMA4   E2F7    MED17   TAF4B   MED27   POU4F1  ZNF254  PSMB4   ZNF18   ZIM2    ZIK1    ZNF790  ZNF641  TGIF1   ELOA3D  ELOA3B  CHD4    CRADD   HELZ2   ZNF222  COX8A   AGO3    RBBP8   ZNF561  ELL ZNF302  MDC1    MEN1    RORC    HKR1    PDPK1   KAT2B   PMAIP1  ZNF589  ZNF585A MED7    PRELID1 E2F5    NOTCH3  ZNF544  ZNF257  SMYD2   ZNF248  NR1H2   TFDP2   PIP4K2C ZNF267  ELOB    TNFRSF10C   PRMT1   ELOA2   CCNT2   ZNF253  YY1 TMEM55B ZNF573  ZNF559  TP53I3  ZNF724  NEDD4L  ZFHX3   ZNF702P NRBF2   ZNF436  JAG1    ZNF443  ESRRG   NR2F6   ZNF761  JUNB    FANCC   MEAF6   LEF1    ZNF514  NLRC4   CCNA2   ESR2    ZNF34   RBL1    ZNF606  CCNB1   BBC3    AKT1    TNFRSF10B   POLR2E  VDR ITGA4   ZNF668  MED13   SMAD7   ZNF682  ZNF774  CDK12   PSMD9   PSMB2   ZNF688  EGFR    ZNF675  NR2C2   ZNF662  PIDD1   ZNF141  FOS ZNF777  PRKAA2  ZNF564  MED20   ZNF3    ZNF484  KRAS    MED25   ZNF99   SUPT5H  ZNF430  ZNF268  ZNF570  RFFL    NR6A1   DYRK2   ZNF135  CASP1   UBC BAX WRN PLK3    LAMTOR3 TAF7    ZNF214  PITX2   ATP1B4  RAD17   ZNF708  CYCS    NOTCH4  TAF3    ZNF705A COX20   POLR2A  PHF20   CDK5R1  TAF12   ZNF383  PSMD10  ZNF703  ZNF705D ZKSCAN4 ZNF611  ZNF419  CGB8    ZNF197  HDAC4   TFAP2A  CNOT8   POLR2L  TRIM33  COX7C   ZNF620  ZNF665  ZNF454  ZNF799  KMT5A   AC002310.5  PSMB7   POU4F2  TGIF2   TBL1XR1 RPTOR   TBP ZNF354C RARB    CREBBP  MLST8   ZNF77   ZNF566  RRAGD   LAMTOR2 TACO1   TP63    FANCI   CNOT3   ZNF626  PSMC2   COX6A1  CHEK2   CBFB    ZFP37   ZNF287  ZNF718  DAXX    RBBP4   CCNE1   NCOA1   TRIAP1  ZNF615  PRKAG1  PRDM7   NR1D1   GADD45A RAD9B   TAF13   ZNF730  ZNF610  E2F4    ZNF721  CDC25C  ZNF517  HSPD1   CDKN1B  MED16   CCNK    RHEB    NCOA6   ATR ZNF300  ZNF770  MT-CO2  CGB5    ZNF696  MAMLD1  ZNF750  NR4A3   ZNF658  CDKN1A  ZNF528  AURKA   YAP1    MAML1   ZNF195  TEAD4   ZNF689  NOTCH1  CNOT1   RBPJ    PERP    ZNF599  ZNF556  NR1I3   ZNF205  ZNF439  CNOT10  AIFM2   ZNF506  ZNF669  PRDM1   ZNF266  NR0B2   ZNF100  COX4I1  YWHAZ   TNRC6B  ZNF746  YWHAH   POLR2K  ZNF624  ZNF184  ZNF548  ESR1    PRELID3A    TCF7L1  ZFP69B  PSMC1   GPI TNRC6C  STEAP3  ZNF676  TSC1    MED14   EHMT2   YWHAQ   ZNF114  ING5    TP53INP1    CCNG1   TIGAR   ZNF530  GSR SETD9   ZNF804B ZNF747  ZNF354B ZNF678  PSMB3   YEATS4  PPP1R13L    ZNF273  MRE11   ZNF587  G6PD    ZNF585B PSMD5   NELFB   ZNF75D  NR1D2   ZNF101  ZNF699  ZNF473  TEAD1   ZNF660  ZNF767P PPP1R13B    ZNF398  ZNF442  ZNF565  ZNF563  SERPINE1    PSMD3   CCNC    SSRP1   ZNF350  COX7B   ZNF711  ZNF285  PPP2R5C ZNF286A ZNF704  ZNF500  ZNF461  FANCD2  PPARG   TNFRSF10D   PSMD6   ZNF860  KCTD1   ZNF416  RHNO1   CARM1   SNW1    ZNF562  NR1I2   ZNF394  SESN1   ZNF583  CHEK1   LRPPRC  ZNF264  CDK7    TPX2    ELOA3   NCOR2   PPARD   PRKAG3  BID ZNF549  ZNF785  RRAGB   KRBA1   TEAD2   MSH2    COX6C   ZNF840P HES1    ZNF71   PSMB5   TP53    ZNF334  MOV10   HIPK1   ZNF529  ZNF354A TFAP2B  ZNF485  MED4    HNF4G   PSMD11  ZNF726  AGO2    ZNF43   MAP2K6  PSMA1   TP73    TXNRD1  LAMTOR1 AC134772.2  BIRC5   POLR2H  COX5A   ERBB2   JMY ZNF124  TAF1L   TP53RK  MTA2    ZNF791  MED30   NPM1    ZNF605  PIP4K2B ZFP30   ZNF496  ZNF226  RFC2    NELFE   ZNF480  MED23   HIPK2   YWHAB   MLH1    SLC38A9 PSMC6   TAF2    ZKSCAN7 ZNF30   TTC5    PCBP4   GTF2H2  PPP2R1B ZFP69   TOP3A   TOPBP1  ZNF233  CDK8    USP9X   ZNF546  ZNF701  CHD9    ZNF324  ZKSCAN3 AKT2    ZNF215  ARID3A  PRKAB2  MED8    RMI2    ZNF793  ZNF732  ZNF74   ZNF738  MED31   SEM1    NELFA   ZNF560  ZNF627  ZNF25   ZFP90   ZNF551  TNFRSF10A   SUMO1   ZNF596  NR0B1   GPX2    ZNF311  APOE    RFC4    NCOA2   ZNF705G E2F8    USP7    GTF2F2  ZNF250  AKT3    NR2C1   MED12   PML RRAGC   ZNF713  PIP4K2A ZNF616  ZNF274  RORA    ZNF569  CNOT9   RNF111  TGFB1   ZNF2    MDM2    ATF2    SFN MDM4    ZNF432  PSMD2   ZFP2    FAS MIR26A2 PSMB10  POLR2C  E2F1    TAF9    ZNF621  TAF6    MIR26A1 CGB3    ZNF420  EHMT1   TGFA    PSMD8   RPA3    ZNF625  PGR SGK1    KAT2A   ZNF764  ZNF382  ZNF707  ZNF45   RFC3    TFAP2E  SESN2   NR4A2   ZNF320  PRR5    ZNF470  ZNF140  ZNF441  SKI UBE2D1  ZNF175  DDIT4   ZNF679  PSMA5   YWHAG   ZNF169  ZNF417  ZNF550

Creating the enrichment map

Colour the nodes of the network to visualise supporting omics datasets

To color nodes in the network (i.e., molecular pathways, biological processes) according to the omics datasets supporting the enrichments, the third file enrichmentMap__subgroups.txt needs to be imported to Cytoscape directly. To import the file, activate the menu option File -> Import -> Table from File and select the file enrichmentMap__subgroups.txt. In the following dialogue, select To a Network Collection in the dropdown menu Where to Import Table Data. Click OK to proceed.

Next, Cytoscape needs to use the imported information to color nodes using a pie chart visualisation. To enable this, click the Style tab in the left control panel and select the Image/Chart1 Property in a series of dropdown menus (Properties -> Paint -> Custom Paint 1 -> Image/Chart 1).

The image/Chart 1 property now appears in the Style control panel. Click the triangle on the right, then set the Column to instruct and the Mapping Type to Passthrough Mapping.

This step colours the nodes corresponding to the enriched pathways according to the supporting omics datasets, based on the scores matrix initially analysed in ActivePathways.

To allow better interpretation of the enrichment map, ActivePathways generates a color legend in the file enrichmentMap__legend.pdf that shows which colors correspond to which omics datasets.

Note that one of the colors corresponds to a subset of enriched pathways with combined evidence that were only detected through data fusion and P-value merging and not when any of the input datasets were detected separately. This exemplifies the added value of integrative multi-omics pathway enrichment analysis.

Visualizing directional impact with node borders

From the drop-down Properties menu, select Border Line Type.

Set Column to directional impact and Mapping Type to Discrete Mapping. Now we can compare findings between a non-directional and a directional method. We highlight pathways that were shared (0), lost (1), and gained (2) between the approaches. Here, we have solid lines for the shared pathways, dots for the lost pathways, and vertical lines for the gained pathways. Border widths can be adjusted in the Border Width property, again with discrete mapping.

This step changes node borders in the aggregated enrichment map, depicting the additional information provided by directional impact.

Alternative node coloring

For a more diverse range of colors, ActivePathways supports any color palette from RColorBrewer. The color_palette parameter must be provided.

res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__", color_palette = "Pastel1")

Instead, to manually input the color of each dataset the custom_colors parameter must be specified as a vector. This vector should contain the same number of colors as columns in the scores matrix.

res <- ActivePathways(scores, gmt_file, cytoscape_file_tag = "enrichmentMap__", custom_colors = c("violet","green","orange","red"))

To change the color of the combined contribution, a color must be provided to the color_integrated_only parameter.

Tip: if the coloring of nodes did not work in Cytoscape after setting the options in the Style panel, check that the EnhancedGraphics Cytoscape app is installed.

References