This is an R implementation of the NetCutter method described in Identification and Analysis of Co-Occurrence Networks with NetCutter by Heiko Müller and Francesco Mancuso. Co-occurrence analysis is used to find groups of terms that occur together in more (or fewer) containers than you would expect by chance. For instance, the terms could be words and the containers could be articles: two words would co-occur if they tend to occur together in many articles. The approach used by NetCutter to evaluate the statistical significance of co-occurrences is to apply a randomisation procedure to the occurrence matrix to obtain the occurrence probabilities under the null distribution, then use the Poisson-Binomial distribution to compute a p-value for the actual number of occurrences. The randomisation procedure, based on edge-swapping, approximates the more rigorous approach, which requires generating all the edge permutations of the occurrence graph and is therefore computationally intractable.
I have made reasonable efforts to contact the original authors of the paper, but haven’t heard from them so far. They are also listed as authors of this R package because they came up with the original idea (all the credit goes to them). I just liked the paper and, not finding any other implementation, decided to write an R package (all mistakes and bugs are due to me, not to the original authors).
There are two main functions in the package: nc_occ_probs() and nc_eval(). Let’s start by loading the package.
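Assuming the package is already installed, we load it as usual:

library(netcutter)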
The ultimate input for the package is an occurrence matrix, i.e. a matrix with as many rows as containers, as many columns as terms, and Boolean entries indicating whether a term is found in a container. The matrix can of course be interpreted as a bipartite graph, with containers and terms as nodes and occurrences as edges. Throughout this vignette we will use a sample matrix with three rows and nine columns. Each container represents a scientific article and each term represents a gene name; the aim is to find which genes co-occur across articles. The same matrix is used as an example in the original NetCutter paper.
# Example occurrence matrix: 3 articles (containers) by 9 genes (terms).
occ_matrix <- matrix(FALSE, nrow = 3, ncol = 9,
                     dimnames = list(paste0("PMID", 1:3), paste0("gene", 1:9)))
occ_matrix[1, 1:3] <- occ_matrix[2, c(1:2, 4:5)] <- occ_matrix[3, c(1, 6:9)] <- TRUE
print(occ_matrix)
#>       gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
#> PMID1  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> PMID2  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
#> PMID3  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
nc_occ_probs()
As a first step, we need to compute the occurrence probabilities under the null distribution. This is usually the most time-consuming step, but it can be parallelised on Linux and macOS. It entails randomising the occurrence matrix R times, each time performing S edge-swap operations. S should be substantially larger than the number of edges to make sure the matrix is properly randomised. Of note, we have to use a special random number generator that is particularly suitable for parallel computations, the “L’Ecuyer” generator. For convenience, netcutter relies on another package, rlecuyer, to generate the random number streams. If we want to set a seed to make results reproducible, we need to do it through the rlecuyer package, as shown below.
# Set a seed with rlecuyer. The function expects a vector of 6 integers.
# NOTE: The usual set.seed() won't work!
rlecuyer::.lec.SetPackageSeed(c(19, 42, 54, 17, 7, 7))
#> [1] 19 42 54 17 7 7
# Number of edges in the occurrence graph, i.e. the number of TRUE entries.
n_edges <- sum(occ_matrix)
# Randomise the matrix R = 100 times, each time performing 20 edge swaps per edge.
occ_probs <- nc_occ_probs(occ_matrix, R = 100, S = n_edges * 20)
print(occ_probs)
#>       gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9
#> PMID1     1  0.52  0.22  0.24  0.21  0.18  0.21  0.21  0.21
#> PMID2     1  0.71  0.42  0.28  0.30  0.32  0.33  0.27  0.37
#> PMID3     1  0.77  0.36  0.48  0.49  0.50  0.46  0.52  0.42
Use the mc.cores argument to select the number of parallel computations. Note that, in the current implementation, each parallel computation necessarily makes a copy of the occurrence matrix, so memory could become a concern for bigger matrices. Making copies can speed things up a little, but it can be memory-expensive. For this reason, we also provide the n_batches argument, which splits the computations into smaller chunks. If you run out of memory, try increasing n_batches; if memory is not a concern, set it to 1 for maximum speed.
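For example, a parallel run split into batches might look like the sketch below; the core and batch counts here are arbitrary illustrations, not recommendations, so adjust them to your machine.

# Illustrative only: same randomisation as above, but spread over 4 cores and
# split into 2 batches to limit peak memory usage.
occ_probs_parallel <- nc_occ_probs(occ_matrix, R = 100, S = n_edges * 20,
                                   mc.cores = 4, n_batches = 2)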
nc_eval()
Once we have the occurrence probabilities, we can evaluate the p-values. The function is nc_eval(), and it can be called several times with different parameters while re-using the same occ_probs matrix. It has additional arguments to specify which modules to consider for co-occurrence analysis. module_size specifies how many terms are part of the same co-occurrence group: with module_size = 2 we analyse co-occurrence of pairs, with module_size = 3 we focus on triplets, and so on. min_occurrences is the minimum number of containers a term must occur in to be considered. For example, the term “gene4” occurs in 1 container, while “gene2” occurs in 2; with min_occurrences = 2, no module could contain “gene4”. If you are only interested in the co-occurrences of a subset of terms, you can list them in the terms_of_interest argument: only modules containing those terms will be considered.

min_support is the minimum number of containers that a valid module, as a whole, must occur in. For example, the pair “gene5-gene6” does not co-occur in any container, so its support is 0. Note that negative co-occurrence, i.e. occurring together in fewer containers than expected by chance, can also be interesting, in which case setting a min_support would be detrimental. At any rate, nc_eval() will return a data.frame with one row for each module, and corresponding columns for the support of that module and the p-value. Modules below the min_support threshold will have NA as p-value. As with nc_occ_probs(), the mc.cores argument is also supported; setting it to a high value could speed up computations for larger problems.
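As an illustrative sketch (not evaluated here; the argument names are those described above, while the specific values are arbitrary), a more restrictive call might look like this:

# Hypothetical, stricter analysis: triplets only, ignoring terms that occur in
# fewer than 2 containers, and keeping only modules that contain gene1 and
# co-occur in at least 1 container.
nc_strict <- nc_eval(occ_matrix, occ_probs,
                     module_size = 3, min_occurrences = 2,
                     terms_of_interest = "gene1", min_support = 1)

For the rest of this vignette we keep things simple and analyse all pairs: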
nc <- nc_eval(occ_matrix, occ_probs, module_size = 2)
print(nc)
#> Term1 Term2 k Pr(==k) Pr(>=k)
#> 1 gene1 gene2 2 0.4634480 0.7477320
#> 2 gene1 gene3 1 0.4541920 0.7104640
#> 3 gene1 gene4 1 0.4631680 0.7154560
#> 4 gene1 gene5 1 0.4668100 0.7179700
#> 5 gene1 gene6 1 0.4712000 0.7212000
#> 6 gene1 gene7 1 0.4602340 0.7141780
#> 7 gene1 gene8 1 0.4758520 0.7231840
#> 8 gene1 gene9 1 0.4553020 0.7113340
#> 9 gene2 gene3 1 0.4211957 0.5507696
#> 10 gene2 gene4 1 0.4318840 0.5579571
#> 11 gene2 gene5 1 0.4361762 0.5634502
#> 12 gene2 gene6 0 0.4307865 1.0000000
#> 13 gene2 gene7 0 0.4404909 1.0000000
#> 14 gene2 gene8 0 0.4317322 1.0000000
#> 15 gene2 gene9 0 0.4443820 1.0000000
#> 16 gene3 gene4 0 0.6913814 1.0000000
#> 17 gene3 gene5 0 0.6865704 1.0000000
#> 18 gene3 gene6 0 0.6816842 1.0000000
#> 19 gene3 gene7 0 0.6855458 1.0000000
#> 20 gene3 gene8 0 0.6873354 1.0000000
#> 21 gene3 gene9 0 0.6837759 1.0000000
#> 22 gene4 gene5 1 0.3008983 0.3347513
#> 23 gene4 gene6 0 0.6620137 1.0000000
#> 24 gene4 gene7 0 0.6715589 1.0000000
#> 25 gene4 gene8 0 0.6587088 1.0000000
#> 26 gene4 gene9 0 0.6796152 1.0000000
#> 27 gene5 gene6 0 0.6567207 1.0000000
#> 28 gene5 gene7 0 0.6671366 1.0000000
#> 29 gene5 gene8 0 0.6546374 1.0000000
#> 30 gene5 gene9 0 0.6749073 1.0000000
#> 31 gene6 gene7 1 0.3022069 0.3373444
#> 32 gene6 gene8 1 0.3156316 0.3494912
#> 33 gene6 gene9 1 0.2944645 0.3298623
#> 34 gene7 gene8 1 0.3036382 0.3375491
#> 35 gene7 gene9 1 0.2875320 0.3229459
#> 36 gene8 gene9 1 0.2935762 0.3275070
Note that nc_eval() returns, in addition to the support k, two probabilities. Pr(==k) is the probability that the module is found in exactly k containers under the null hypothesis; Pr(>=k) is the one-sided p-value, i.e. the probability that the module is found in k or more containers under the null hypothesis. Negative co-occurrence, that is, when a module occurs less often than expected, indicating that its terms are somewhat “mutually exclusive”, can be evaluated by computing Pr(<=k) as 1 - Pr(>=k) + Pr(==k). Don’t forget to adjust the p-values for multiple testing.
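As a small sketch, assuming the result keeps the column names printed above, the negative co-occurrence probability and multiplicity-adjusted p-values could be computed with base R as follows:

# Probability of co-occurring in k or fewer containers (negative co-occurrence).
nc$`Pr(<=k)` <- 1 - nc$`Pr(>=k)` + nc$`Pr(==k)`
# Adjust the one-sided p-values for multiple testing, e.g. with Benjamini-Hochberg.
nc$`adj Pr(>=k)` <- p.adjust(nc$`Pr(>=k)`, method = "BH")
head(nc)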