This is an R implementation of the NetCutter method described in Identification and Analysis of Co-Occurrence Networks with NetCutter by Heiko Müller and Francesco Mancuso. Co-occurrence analysis is used to find groups of terms that occur together in more (or fewer) containers than you would expect by chance. For instance, the terms could be words and the containers could be articles: two words would co-occur if they tend to occur together in many articles. The approach used by NetCutter to evaluate the statistical significance of co-occurrences is to apply a randomisation procedure to the occurrence matrix to obtain the occurrence probabilities under the null distribution, then use the Poisson-Binomial distribution to compute a p-value for the actual number of occurrences. The randomisation procedure, based on edge-swapping, approximates the more rigorous approach, which requires generating all the edge permutations of the occurrence graph and is therefore computationally intractable.
I have made reasonable efforts to contact the original authors of the paper, but haven’t heard from them so far. They are also listed as authors of this R package because they came up with the original idea (all the credit goes to them). I just liked the paper and, not finding any other implementation, decided to write an R package (all mistakes and bugs are due to me, not to the original authors).
There are two main functions in the package: nc_occ_probs() and nc_eval(). Let’s start by loading the package.
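Assuming the package is already installed, we load it as usual:

library(netcutter)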
The ultimate input for the package is an occurrence matrix, i.e. a matrix with as many rows as containers, as many columns as terms, and Boolean entries indicating whether a term is found in a container. The matrix can of course be interpreted as a bipartite graph, with containers and terms as nodes and occurrences as edges. Throughout this vignette we will use a sample matrix with three rows and nine columns. Each container represents a scientific article and each term represents a gene name; the aim is to find which genes co-occur across articles. The same matrix is used as an example in the original NetCutter paper.
# Example occurrence matrix: 3 articles (containers) by 9 genes (terms).
occ_matrix <- matrix(FALSE, nrow = 3, ncol = 9,
                     dimnames = list(paste0("PMID", 1:3), paste0("gene", 1:9)))
occ_matrix[1, 1:3] <- occ_matrix[2, c(1:2, 4:5)] <- occ_matrix[3, c(1, 6:9)] <- TRUE
print(occ_matrix)
#>       gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
#> PMID1  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> PMID2  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
#> PMID3  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
nc_occ_probs()
As a first step, we need to compute the occurrence probabilities under the null distribution. This is usually the most time-consuming step, but it can be parallelised on Linux and macOS. It entails randomising the occurrence matrix R times, each time performing S edge-swap operations. S should be substantially larger than the number of edges to make sure the matrix is properly randomised. Of note, we have to use a special random number generator that is particularly suitable for parallel computations, the “L’Ecuyer” generator. For convenience, netcutter relies on another package, rlecuyer, to generate the random number streams. If we want to set a seed to make results reproducible, we need to do it through the rlecuyer package, as shown below.
# Set a seed with rlecuyer. The function expects a vector of 6 integers.
# NOTE: The usual set.seed() won't work!
rlecuyer::.lec.SetPackageSeed(c(19, 42, 54, 17, 7, 7))
#> [1] 19 42 54 17 7 7
# Number of edges in the occurrence graph, i.e. the number of TRUE entries.
n_edges <- sum(occ_matrix)
# Randomise the matrix R = 100 times, each time performing 20 edge swaps per edge.
occ_probs <- nc_occ_probs(occ_matrix, R = 100, S = n_edges * 20)
print(occ_probs)
#>       gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
#> PMID1     1  0.52  0.22  0.24  0.21  0.18  0.21  0.21  0.21
#> PMID2     1  0.71  0.42  0.28  0.30  0.32  0.33  0.27  0.37
#> PMID3     1  0.77  0.36  0.48  0.49  0.50  0.46  0.52  0.42
Use the mc.cores argument to select the number of parallel computations. Note that, in the current implementation, each parallel computation necessarily makes a copy of the occurrence matrix, so memory could become a concern for bigger matrices. Making copies can speed things up a little, but it can be memory-expensive. For this reason, we also provide the n_batches argument, which splits the computations into smaller chunks. If you run out of memory, try increasing n_batches; if memory is not a concern, set it to 1 for maximum speed.
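For example, a parallel run split into batches might look like the sketch below; the core and batch counts here are arbitrary illustrations, not recommendations, so adjust them to your machine.

# Illustrative only: same randomisation as above, but spread over 4 cores and
# split into 2 batches to limit peak memory usage.
occ_probs_parallel <- nc_occ_probs(occ_matrix, R = 100, S = n_edges * 20,
                                   mc.cores = 4, n_batches = 2)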
nc_eval()
Once we have the occurrence probabilities, we can evaluate the p-values. The function is nc_eval(), and it can be called several times with different parameters while re-using the same occ_probs matrix. It has additional arguments to specify which modules to consider for co-occurrence analysis. module_size specifies how many terms are part of the same co-occurrence group: with module_size = 2 we analyse co-occurrence of pairs, with module_size = 3 we focus on triplets, and so on. min_occurrences is the minimum number of containers a term must occur in to be considered. For example, the term “gene4” occurs in 1 container, while “gene2” occurs in 2; with min_occurrences = 2, no module could contain “gene4”. If you are only interested in the co-occurrences of a subset of terms, you can list them in the terms_of_interest argument: only modules containing those terms will be considered.

min_support is the minimum number of containers that a valid module, as a whole, must occur in. For example, the pair “gene5-gene6” does not co-occur in any container, so its support is 0. Note that negative co-occurrence, i.e. occurring together in fewer containers than expected by chance, can also be interesting, in which case setting a min_support would be detrimental. At any rate, nc_eval() will return a data.frame with one row for each module, and corresponding columns for the support of that module and the p-value. Modules below the min_support threshold will have NA as p-value. As with nc_occ_probs(), the mc.cores argument is also supported; setting it to a high value could speed up computations for larger problems.
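As an illustrative sketch (not evaluated here; the argument names are those described above, while the specific values are arbitrary), a more restrictive call might look like this:

# Hypothetical, stricter analysis: triplets only, ignoring terms that occur in
# fewer than 2 containers, and keeping only modules that contain gene1 and
# co-occur in at least 1 container.
nc_strict <- nc_eval(occ_matrix, occ_probs,
                     module_size = 3, min_occurrences = 2,
                     terms_of_interest = "gene1", min_support = 1)

For the rest of this vignette we keep things simple and analyse all pairs: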
nc <- nc_eval(occ_matrix, occ_probs, module_size = 2)
print(nc)
#> Term1 Term2 k Pr(==k) Pr(>=k)
#> 1 gene1 gene2 2 0.4634480 0.7477320
#> 2 gene1 gene3 1 0.4541920 0.7104640
#> 3 gene1 gene4 1 0.4631680 0.7154560
#> 4 gene1 gene5 1 0.4668100 0.7179700
#> 5 gene1 gene6 1 0.4712000 0.7212000
#> 6 gene1 gene7 1 0.4602340 0.7141780
#> 7 gene1 gene8 1 0.4758520 0.7231840
#> 8 gene1 gene9 1 0.4553020 0.7113340
#> 9 gene2 gene3 1 0.4211957 0.5507696
#> 10 gene2 gene4 1 0.4318840 0.5579571
#> 11 gene2 gene5 1 0.4361762 0.5634502
#> 12 gene2 gene6 0 0.4307865 1.0000000
#> 13 gene2 gene7 0 0.4404909 1.0000000
#> 14 gene2 gene8 0 0.4317322 1.0000000
#> 15 gene2 gene9 0 0.4443820 1.0000000
#> 16 gene3 gene4 0 0.6913814 1.0000000
#> 17 gene3 gene5 0 0.6865704 1.0000000
#> 18 gene3 gene6 0 0.6816842 1.0000000
#> 19 gene3 gene7 0 0.6855458 1.0000000
#> 20 gene3 gene8 0 0.6873354 1.0000000
#> 21 gene3 gene9 0 0.6837759 1.0000000
#> 22 gene4 gene5 1 0.3008983 0.3347513
#> 23 gene4 gene6 0 0.6620137 1.0000000
#> 24 gene4 gene7 0 0.6715589 1.0000000
#> 25 gene4 gene8 0 0.6587088 1.0000000
#> 26 gene4 gene9 0 0.6796152 1.0000000
#> 27 gene5 gene6 0 0.6567207 1.0000000
#> 28 gene5 gene7 0 0.6671366 1.0000000
#> 29 gene5 gene8 0 0.6546374 1.0000000
#> 30 gene5 gene9 0 0.6749073 1.0000000
#> 31 gene6 gene7 1 0.3022069 0.3373444
#> 32 gene6 gene8 1 0.3156316 0.3494912
#> 33 gene6 gene9 1 0.2944645 0.3298623
#> 34 gene7 gene8 1 0.3036382 0.3375491
#> 35 gene7 gene9 1 0.2875320 0.3229459
#> 36 gene8 gene9 1 0.2935762 0.3275070
Note that nc_eval() returns, in addition to the support k, two probabilities. Pr(==k) is the probability that the module is found in exactly k containers under the null hypothesis; Pr(>=k) is the one-sided p-value, i.e. the probability that the module is found in k or more containers under the null hypothesis. Negative co-occurrence, that is, when a module occurs less often than expected, indicating that its terms are somewhat “mutually exclusive”, can be evaluated by computing Pr(<=k) as 1 - Pr(>=k) + Pr(==k). Don’t forget to adjust the p-values for multiple testing.
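As a small sketch, assuming the result keeps the column names printed above, the negative co-occurrence probability and multiplicity-adjusted p-values could be computed with base R as follows:

# Probability of co-occurring in k or fewer containers (negative co-occurrence).
nc$`Pr(<=k)` <- 1 - nc$`Pr(>=k)` + nc$`Pr(==k)`
# Adjust the one-sided p-values for multiple testing, e.g. with Benjamini-Hochberg.
nc$`adj Pr(>=k)` <- p.adjust(nc$`Pr(>=k)`, method = "BH")
head(nc)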