WayFindR: Creating Graph Structures From WikiPathways Files

Polina Bombina and Kevin R. Coombes

2024-04-09

Introduction

Comprehending biological pathways stands as a pivotal endeavor in the realm of life sciences. However, merely possessing visual representations such as PNG or SVG files is insufficient for comprehensive analysis. The WayFindR package is a robust tool meticulously crafted for both biologists and bioinformaticians. Its primary objective is streamlining the intricate task of dissecting cellular pathways. Through seamless integration of WikiPathways (a well-known repository for biological pathways) and R’s igraph package, WayFindR bestows users with the capability to transform complex pathway architectures into actionable insights.

From a mathematical perspective, control theory asserts that negative feedback is essential for effective regulation and control of dynamic systems. Negative feedback mechanisms serve as stabilizing forces, ensuring that deviations from desired states are minimized and the system returns to equilibrium. In the context of biological pathways, where intricate networks of molecular interactions govern cellular processes, the concept of negative feedback is paramount.

Biologists rely on understanding the dynamics of biological pathways to elucidate mechanisms underlying disease, identify potential drug targets, and uncover regulatory networks. Negative feedback loops within these pathways play crucial roles in maintaining cellular homeostasis, modulating responses to external stimuli, and preventing excessive or aberrant signaling.

By enabling the conversion of pathway structures into computationally tractable formats and providing tools for analysis, WayFindR facilitates the identification and characterization of negative feedback loops within pathways. This capability is invaluable for deciphering the regulatory logic governing cellular processes and designing interventions to modulate pathway activity.

In essence, by bridging the gap between biological knowledge and mathematical principles, WayFindR empowers researchers to navigate the intricate landscapes of cellular pathways with precision and insight, ultimately advancing our understanding of complex biological systems and paving the way for innovative therapeutic strategies.

Installation

The WayFindR package is available from CRAN:

#install.packages("WayFindR")
#install.packages("WayFindR", repos="http://R-Forge.R-project.org")
library(WayFindR)

Retrieving data structure from GPML file

We will explore the functionality of WayFindR using one example file. You can obtain a GPML file by visiting https://www.wikipathways.org/ and selecting a pathway that piques your interest. We previously downloaded the pathway named “Factors and pathways influencing insulin-like growth factor (IGF1)-Akt signaling (WP3850),” which can be accessed directly at https://www.wikipathways.org/pathways/WP3850.html. We included this GPML file in the package, and you can find it as a system file.

gpmlFile <- system.file("pathways/WP3850.gpml", package = "WayFindR")

Functions in WayFindR rely on the XML R package to parse GPML files. (GPML is a “dialect” of the extensible markup language, XML.) You can simly supply the file name to the functions, but they also work is you explicitly parse them yourself:

xmlfile <- XML::xmlParseDoc(gpmlFile)

We are going to illustrate the use of functions that allow users to explore the various types of entities that are present in GPML files. These are useful for understanding both how GPML files are defined and how WayFindR works. You do not need to use them separately on every file; if you just want to get to the resulting igraph, you can skip ahead to the section on “Converting pathways into igraph objects”.

Edges

A user can generate a matrix with information about edges from the provided GPML file:

edges <- collectEdges(xmlfile)
class(edges)
## [1] "matrix" "array"
dim(edges)
## [1] 46  3

The matrix comprises 46 rows (representing edges) and 3 columns: Source, Target, and MIM (Molecular Interaction Maps). Each edge is identified by a unique ID, such as id2257fd8. These edges are directional, indicating they originate from one node and terminate at another. The Source and Target columns contain labels corresponding to these nodes. The MIM column illustrates the type of interaction occurring between the nodes.

head(edges)
##            Source  Target  MIM             
## id2257fd8  "c0520" "b2305" "mim-inhibition"
## id25882942 "a531f" "b3910" "Arrow"         
## id25c739ca "f5259" "ceb96" "Arrow"         
## id2848101a "cc7e6" "fea9d" "Arrow"         
## id295ae5fe "a8bb6" "b687e" "Arrow"         
## id2e105012 "fea9d" "cd260" "Arrow"
tail(edges)
##            Source  Target  MIM             
## idf7bb43f7 "f2f0d" "a2d31" "mim-inhibition"
## idfc5a5200 "f58ce" "d3cde" "Arrow"         
## idfea9f3db "ceb96" "a2d31" "Arrow"         
## idfed646d2 "ba0fb" "fca28" "Arrow"         
## id2d5bc3d8 "b0088" "b2d78" "mim-inhibition"
## id70c7e487 "b62b2" "b3356" "mim-inhibition"

Nodes

Likewise, users can acquire a matrix of nodes in a similar manner.

nodes <- collectNodes(xmlfile)
class(nodes)
## [1] "matrix" "array"
dim(nodes)
## [1] 52  3

In the selected pathway, there are a total of 52 nodes. The resulting matrix consists of three columns: GraphId, label, and Type. GraphId correlates to a distinct node ID extracted from the GPML file, while label corresponds to the name of a gene, protein, group, complex or biological process. The Type column contains information regarding the type of node.

head(nodes)
##       GraphId label    Type         
## a169c "a169c" "GSK3B"  "GeneProduct"
## a2d31 "a2d31" "PI3K"   "GeneProduct"
## a4cc9 "a4cc9" "SMAD3"  "GeneProduct"
## a531f "a531f" "4EPB"   "GeneProduct"
## a5550 "a5550" "EIF2B2" "GeneProduct"
## a8bb6 "a8bb6" "N-WASP" "GeneProduct"
tail(nodes)
##       GraphId label               Type         
## b3910 "b3910" "Protein synthesis" "Pathway"    
## c2a5c "c2a5c" "MTOR"              "GeneProduct"
## f5051 "f5051" "MLST8"             "GeneProduct"
## e1ccf "e1ccf" "MAPKAP1"           "GeneProduct"
## c6e5f "c6e5f" "RICTOR"            "GeneProduct"
## a3b41 "a3b41" "MTOR"              "GeneProduct"

Groups

We’ve already extracted all edges (Interactions) and vertices (DataNodes) from a GPML file. Another significant element within a GPML file is referred to as a “Group.” In some instances, Groups denote a complex, often associated to a separate DataNode labeled with the attribute Type="Complex". In other cases, Groups seem to represent genes with interchangeable roles within the pathway. For instance, within the IGF1-AKT pathway, there is an anonymous group featuring SMAD2 and SMAD3.

groups <- collectGroups(xmlfile)
class(groups)
## [1] "list"
names(groups)
## [1] "nodes" "edges"
groups$nodes
##       GraphId          label    Type
## b5c6b   b5c6b mTORC2 complex Complex
## b3356   b3356 mTORC1 complex Complex
## d51fd   d51fd         Group1 Complex
## dae36   dae36         Group2   Group
groups$edges
##      Source Target       MIM
## ec1   a4cc9  d51fd contained
## ec2   b3463  d51fd contained
## ec3   bb118  b3356 contained
## ec4   bd3fb  dae36 contained
## ec5   ca5fb  b3356 contained
## ec6   efb74  dae36 contained
## ec7   f8fb6  b3356 contained
## ec8   fad48  b3356 contained
## ec9   c2a5c  b3356 contained
## ec10  f5051  b5c6b contained
## ec11  e1ccf  b5c6b contained
## ec12  c6e5f  b5c6b contained
## ec13  a3b41  b5c6b contained

The entry “contained” in the MIM column does not come from the MIM standard, the WayFindR package has invented this edge type to be able to relate a group to its constituent parts.

Anchors

The Anchor, which is the fourth and final “semantic” structure outlined in the GPML specification, serves a specific purpose. It is used to designate edges (or interactions, or arrows) that act as targets for other edges, instead of the typical nodes (or DataNodes, or vertices). To handle this scenario, WayFindR modifies the target arrow, originally represented as A -> B, into a two-step arrow, A -> EDGE -> B. Subsequently, we transform the unconventional arrow into a simpler, more standardized representation that now points to the newly introduced EDGE node.

links <- collectAnchors(xmlfile)
class(links)
## [1] "list"
links$nodes
##   GraphId   label Type
## 1   b2d78 Anchor1 EDGE
links$edges
##   Source Target    MIM
## 1  b2305  b2d78 Source
## 2  b2d78  cd260  Arrow

Converting Pathways into igraph objects

The core feature of WayFindR is its ability to convert pathways from WikiPathways GPML format into igraph objects. In the preocess, we also define “standard” colors, line types, and shapes to display edges and nodes.

data(edgeColors)
data(edgeTypes)
data(nodeColors)
data(nodeShapes)
if (requireNamespace("Polychrome")) {
  opar <- par(mfrow = c(2,1))
  Polychrome::swatch(edgeColors, main = "Edge Types")
  Polychrome::swatch(nodeColors, main = "Node Types")
} else {
  opar <- par(mfrow = c(1,2))
  plot(0,0, type = "n", xlab="", ylab = "", main = "Edges")
  legend("center", legend = names(edgeColors),  lwd = 3,
         col = edgeColors,  lty = edgeTypes)
  num <- c(rectangle = 15, circle = 16)
  plot(0,0, type = "n", xlab="", ylab = "", main = "Nodes")
  legend("center", legend = names(nodeColors),  cex = 1.5,
         col = nodeColors,  pch = num[nodeShapes])
}
## Loading required namespace: Polychrome

par(opar)

Bundling the Process

The main function collects all the entities from the GPML file, adds colors and styles, and produces an igraph.

G <- GPMLtoIgraph(xmlfile)

To visualize and analyze our graph, we can apply the graphopt (or any of the many other) layout algorithm(s) from the igraph package:

set.seed(13579)
L <- igraph::layout_with_graphopt
plot(G, layout=L)
nodeLegend("topleft", G)
edgeLegend("bottomright", G)

Finding cycles

Once the pathway is converted, WayFindR provides a suite of tools for in-depth analysis. Let’s explore how to find cycles within the pathway graph (if they exist).

cyc <- findCycles(G)
length(cyc)
## [1] 5
cyc
## [[1]]
##       dbe54 c0520 b2305 c7b3c b3356 ef7f5 ceb96 a2d31 
##     2    27    15     8    18    52    30    23     2 
## 
## [[2]]
##       dbe54 c0520 b2305 b3356 ef7f5 ceb96 a2d31 
##     2    27    15     8    52    30    23     2 
## 
## [[3]]
##       c7b3c b2305 
##     8    18     8 
## 
## [[4]]
##       c7b3c b3356 d51fd c0520 b2305 
##     8    18    52    53    15     8 
## 
## [[5]]
##       b3356 d51fd c0520 b2305 
##     8    52    53    15     8

The output consists of a list of cycles extracted from a directed graph. Each cycle is represented as a sequence of vertices, where each vertex is indicated by its numerical position within the graph’s vertex list. These cycles denote closed loops of vertices in the graph, signifying paths that both start and end at the same vertex. The WP3850 pathway has 5 cycles.

Furthermore, within WayFindR, there exists a function named “interpretCycle” designed to interpret a cycle within the context of the associated graph. This function takes a cycle and the graph it belongs to as inputs. Its purpose is to extract pertinent details from the cycle, including edge ids, arrow types (indicating the edge direction), and labels associated with genes.

lapply(cyc, interpretCycle, graph = G)
## [[1]]
##            genes          arrows
## 1           PI3K mim-stimulation
## 2           PDK1 mim-stimulation
## 3           AKT1  mim-inhibition
## 4          FoxO  mim-stimulation
## 5          MAFbx  mim-inhibition
## 6 mTORC1 complex mim-stimulation
## 7            S6K  mim-inhibition
## 8           IRS1 mim-stimulation
## 
## [[2]]
##            genes          arrows
## 1           PI3K mim-stimulation
## 2           PDK1 mim-stimulation
## 3           AKT1  mim-inhibition
## 4          FoxO   mim-inhibition
## 5 mTORC1 complex mim-stimulation
## 6            S6K  mim-inhibition
## 7           IRS1 mim-stimulation
## 
## [[3]]
##   genes          arrows
## 1 FoxO  mim-stimulation
## 2 MAFbx mim-stimulation
## 
## [[4]]
##            genes          arrows
## 1          FoxO  mim-stimulation
## 2          MAFbx  mim-inhibition
## 3 mTORC1 complex  mim-inhibition
## 4         Group1  mim-inhibition
## 5           AKT1  mim-inhibition
## 
## [[5]]
##            genes         arrows
## 1          FoxO  mim-inhibition
## 2 mTORC1 complex mim-inhibition
## 3         Group1 mim-inhibition
## 4           AKT1 mim-inhibition

Notice that the cycles in the chosen pathway include only inhibition and stimulation processes.

Finally, we can generate a subgraph from the original graph, containing only the vertices that participate in the specified cycles.

S <- cycleSubgraph(G, cyc)

We depict the section of the graph that impacts cycles within the IGF1-AKT wikiPathway.

set.seed(93217)
plot(S)
nodeLegend("topleft", S)
edgeLegend("bottomright", S)

Here, Group1 corresponds to the the pair SMAD2, SMAD3.