--- title: "netcutter" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{netcutter} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction This is an R implementation of the NetCutter method described in [Identification and Analysis of Co-Occurrence Networks with NetCutter](https://doi.org/10.1371/journal.pone.0003178) by Heiko Müller and Francesco Mancuso. Co-occurrence analysis is used to find groups of terms that occur together in more (or fewer) containers than you would expect by chance. For instance, the terms could be words and the containers could be articles: two words would co-occur if they tend to occur together in many articles. The approach used by NetCutter to evaluate the statistical significance of co-occurrences is to apply a randomisation procedure to the occurrence matrix to obtain the occurrence probabilities under the null distribution, then use the [Poisson-Binomial distribution](https://en.wikipedia.org/wiki/Poisson_binomial_distribution) to compute a p-value for the actual number of occurrences. The randomisation procedure, based on edge-swapping, approximates the more rigorous approach, which requires generating all the edge permutations of the occurrence graph and is therefore computationally intractable. ### NOTE I have made reasonable efforts to contact the original authors of the paper, but haven't heard from them so far. They are also listed as authors of this R package because they came up with the original idea (all the credit goes to them). I just liked the paper and, not finding any other implementation, decided to write an R package (all mistakes and bugs are due to me, not to the original authors). ## Usage There are two main functions in the package: `nc_occ_probs()` and `nc_eval()`. Let's start by loading the package. ```{r} library(netcutter) ``` ### Data preparation The ultimate input for the package is an occurrence matrix, *i.e.* a matrix with as many rows as containers, as many columns as terms, and Boolean entries indicating whether a term is found in a container. The matrix can of course be interpreted as a graph. Throughout this vignette we will use a sample matrix with three rows and nine columns. Each container represents a scientific article and each term represents a gene name; the aim is to find which genes co-occur across articles. The same matrix is used as an example in the original NetCutter paper. ```{r} occ_matrix <- matrix(F, nrow = 3, ncol = 9, dimnames = list(paste0("PMID", 1:3), paste0("gene", 1:9))) occ_matrix[1, 1:3] <- occ_matrix[2, c(1:2, 4:5)] <- occ_matrix[3, c(1, 6:9)] <- T print(occ_matrix) ``` ### `nc_occ_probs()` As a first step, we need to compute the occurrence probabilities under the null distribution. This is usually the most time-consuming step, but it can be parallelised on Linux and Mac.^[Internally, `nc_occ_probs()` uses `mclapply()`.] It entails randomising the occurrence matrix `R` times, each time performing `S` edge swap operations. `S` should be significantly higher than the number of edges to make sure that we do a proper randomisation. Of note, we have to use a special random number generator that is particularly suitable for parallel computations, the "L'Ecuyer" generator.^[Check the documentation of `RNGkind()` and `parallel::mcparallel()` for the reasons why.] For convenience, netcutter relies on another package, [rlecuyer](https://cran.r-project.org/package=rlecuyer), to generate random streams of numbers. If we want to set a seed to make results reproducible, we need to do it through the `rlecuyer` package, as shown below. ```{r} # Set a seed with rlecuyer. The function expects a vector of 6 integers. # NOTE: The usual set.seed() won't work! rlecuyer::.lec.SetPackageSeed(c(19, 42, 54, 17, 7, 7, 2)) n_edges <- sum(occ_matrix) occ_probs <- nc_occ_probs(occ_matrix, R = 100, S = n_edges * 20) print(occ_probs) ``` Use the `mc.cores` argument to select the number of parallel computations. Note that, in the current implementation, each parallel computation necessarily makes a copy of the occurrence matrix, so memory could become a concern for bigger matrices. Making copies can speed things up a little, but can be memory-expensive. For this reason, we also provide the `n_batches` argument, to split the computations into smaller chunks. If your RAM explodes, try increasing `n_batches`. If you have infinite resources, set it to 1 for maximum speed and efficiency. ### `nc_eval()` Once we have the occurrence probabilities, we can evaluate the p-values. The function is `nc_eval()`, and it can be called several times with different parameters but re-using the same `occ_probs` matrix. It has additional arguments to specify which modules to consider for co-occurrence analysis. `module_size` specifies how many terms are part of the same co-occurrence groups. For example, with `module_size = 2`, we analyse co-occurrence of pairs, with `module_size = 3`, we focus on triplets, and so on. `min_occurrences` is the minimum number of containers that any term should be part of, otherwise it's not considered. For example, the term "gene4" occurs in 1 container, while "gene2" occurs in 2; with a `min_occurrences = 2`, no module could contain "gene4". If you are only interested in the co-occurrences of a subset of terms, you can list them in the `terms_of_interest` argument: only modules containing those terms will be considered. `min_support` is the minimum number of containers that any valid module should be part of as a whole. For example, the pair "gene5-gene6" does not co-occur in any container, so its support is 0. Note that negative co-occurrence, *i.e.* occurring together in fewer containers than expected by chance, can also be interesting, in which case setting a `min_support` would be detrimental. At any rate, `nc_eval()` will return a `data.frame` with one row for each module, and corresponding columns for the support of that module and the p-value. Modules below the `min_support` threshold will have `NA` as p-value. As with `nc_occ_probs()`, the `mc.cores` argument is also supported; setting it to a high value could speed-up computations for larger problems. ```{r} nc <- nc_eval(occ_matrix, occ_probs, module_size = 2) print(nc) ``` Note that `nc_eval()` will return, in addition to the support `k`, two probabilities. `Pr(==k)` is the probability that this module be found in exactly `k` containers under the null hypothesis; `Pr(>=k)` is the one-sides p-value, or the probability that this module be found in `k` or more containers under the null hypothesis. Negative co-occurrence, that is, when a module occurs less often than expected, indicating that the terms are somewhat "mutually exclusive", can be evaluated by computing `Pr(<=k)` as `1 - Pr(>=k) + Pr(==k)`. Don't forget to adjust the p-values for multiple testing.