library(CommKern)
## 
## Attaching package: 'CommKern'
## The following object is masked from 'package:stats':
## 
##     kernelIn the past decade, network data has become a popular method for analyzing data from a variety of fields, including genomics, metabolomics, ecology, social networks, and neuroimaging. Brain networks are a complex system of structural-functional couplings within a hierarchy of mesoscopic interactions. Because of the unique aspects to brain connectivity, many of the popular graph theoretic measures fail to capture its complex dynamics. The CommKern package was created to address the limitations in analysis of network community structures through two main aspects. The first is a hierarchical, multimodal spinglass algorithm that is computationally efficient and able to robustly identify communities using multiple sources of information. The flexibility of the multimodal and hierarchical aspects to this algorithm set it apart from other existing methods which may only allow for one source of information or lack the hierarchical option. The second is a semiparametric kernel test of association, which allows for the use of a kernel to model a variety of cluster evaluation metrics without making any assumption as to the parametric form of its association with the outcome. As well, because the model is semi-parametric, potential confounders can be controlled for within a parametric framework, allowing for ease of parameter estimate interpretation, should they be desired.
This vignette is designed to provide an overview of CommKern’s functionality using the simulated datasets provided within the package. While focusing on applications to neuroimaging, the methodologies within this package are flexible enough to be applied to any type of network data, which will be pointed out in the vignette when applicable.
Several simulated datasets have been provided within the package.
SBM_netSBM_net is a dataset containing multimodal network
information simulated to emulate functional and structural connectivity
within a nested hierarchical structure.
matrix_plot(SBM_net)The SBM_net object is the result of calling
matrix_to_df.
net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUENetwork data used within the package’s community detection algorithm
takes a specific form. Using functional and structural adjacency
matrices as inputs, the matrix_to_df function creates a
list of five dataframes:
func_edges, which describes the pairwise functional
edge weights between nodes.str_edges, which describes the pairwise structural edge
weights between nodes.vertexes, which describes nodal characteristics,
including functional degree, structural degree, node label (which can
contain additional information about the node), and community assignment
(for later use).func_matrix, which is the functional adjacency
matrix.str_matrix, which is the structural adjacency
matrix.str(SBM_net)
## List of 5
##  $ func_edges :'data.frame': 1233 obs. of  3 variables:
##   ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
##   ..$ func_end_node  : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ func_weight    : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
##  $ str_edges  :'data.frame': 453 obs. of  3 variables:
##   ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
##   ..$ str_end_node  : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ str_weight    : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
##  $ vertexes   :'data.frame': 80 obs. of  5 variables:
##   ..$ node_id    : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
##   ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
##   ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
##   ..$ community  : logi [1:80] NA NA NA NA NA NA ...
##  $ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##  $ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##  - attr(*, "class")= chr "spinglass_net"simasd_covarsThis dataframe is a simulated dataset of demographics based on summary statistics for a random subset of the ABIDE pre-processed database.
str(simasd_covars)
## 'data.frame':    49 obs. of  8 variables:
##  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dx_group    : num  1 1 1 1 0 0 0 0 0 1 ...
##  $ sex         : int  0 0 1 0 1 1 0 0 0 1 ...
##  $ age         : num  25.2 18.9 16.1 27.6 37.6 20.6 7.7 21.6 19 25.5 ...
##  $ handedness  : num  2 0 0 0 1 0 1 0 0 0 ...
##  $ fullscale_IQ: num  107 123 100 109 114 130 118 120 101 106 ...
##  $ verbal_IQ   : num  105 125 96 107 118 136 122 124 103 103 ...
##  $ nonverbal_IQ: num  112 120 109 113 113 124 116 117 104 112 ...simasd_comm_dfThis is a dataset of partitions of nodes to communities from
simulated group-level networks with community structures. This dataset
is complementary to the simasd_covars dataset and is
derived from the simasd_array dataset.
str(simasd_comm_df, max.level = 0)
## 'data.frame':    80 obs. of  49 variables:simasd_hamil_dfThis is a dataset of Hamiltonian values from simulated group-level
networks with community structure. This dataset is complementary to the
simasd_covars dataset and is derived from the
simasd_array dataset.
str(simasd_hamil_df)
## 'data.frame':    49 obs. of  2 variables:
##  $ id   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ hamil: num  -342 -347 -327 -361 -363 ...simasd_arrayThis is a dataset containing an array of simulated adjacency
matrices. The dimensions of each matrix is 80 x 80, for a total of 49
simulated networks. This simulated array is the basis of the
simasd_hamil_df and simasd_comm_df datasets
and is complementary to the simasd_covars dataframe.
str(simasd_array)
##  num [1:49, 1:80, 1:80] 0 0 0 0 0 0 0 0 0 0 ...The hierarchical multimodal spinglass (HMS) algorithm was created as a way to address the unique characteristics of brain connectivity, specifically focusing on the hierarchical aspect to brain network segregation and integration and the structure-function relationship.
First introduced by Reichardt and Bornholdt in 2006, the spinglass algorithm relies on the analogy between the Potts spinglass model - a popular statistical mechanics model - and community detection, where any community partitioning is likened to a corresponding spin configuration and the problem of community detection falls into finding the ground state of the Potts spinglass model. Building on the idea of a quality function which, when optimized, represents the ideal partitioning of a network, they determined that a quality function should embody four components: it should (a) reward internal edges between nodes of the same group/spin state; (b) penalize missing edges (non-links) between nodes in the same group; (c) penalize existing edges between different groups/spin states; and (d) reward non-links between different groups. Using these requirements, they built the following function: \[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}a_{ij}\underbrace{W_{ij} \delta \left(C_i,C_j\right)}_{\text{internal links}} + \sum_{i \neq j}b_{ij}\underbrace{\left(1-W_{ij}\right) \delta \left(C_i,C_j\right)}_{\text{internal non-links}} \\ & + \sum_{i \neq j}c_{ij}\underbrace{W_{ij} \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external links}} - \sum_{i \neq j}d_{ij}\underbrace{\left(1-W_{ij}\right) \left(1-\delta \left(C_i,C_j\right)\right)}_{\text{external non-links}}, \end{split} \end{equation}\] where \(W_{ij}\) denotes the weighted adjacency matrix with \(W_{ij} \geq 0\) if an edge is present and zero otherwise, \(C_{i} \in \left\{1,2,...,q\right\}\) denotes the group index (or spin state) of node \(i\) in the graph, \(\delta\) is the Kronecker product equaling one if \(C_i\) and \(C_j\) belong to the same community and zero otherwise, and \(a_{ij}, b_{ij}, c_{ij}, d_{ij}\) denote the weights of the individual contributions, respectively. One of the appeals of the spinglass algorithm is that the number of groups \(q\) only determines the maximum number of groups allowed to exist and, in principle, can be as large as the total number of nodes in the network, \(N\). All group indices specified by \(q\) have to be used during the optimal assignment of nodes to communities, though some may be unpopulated in the ground state.
Generally, links and non-links are weighted equally, \(a_{ij}=c_{ij}\) and \(b_{ij}=d_{ij}\), which simplifies the algorithm to only need to consider internal links and non-links. A convenient choice for \(a_{ij}\) and \(b_{ij}\) is \(a_{ij} = 1-\gamma p_{ij}\) and \(b_{ij} = \gamma p_{ij}\), where \(p_{ij}\) is the probability that a non-zero edge exists between nodes \(i\) and \(j\), normalized such that \(\sum_{i \neq j}p_{ij} = 2M\), where \(M\) is the total weight (or sum of all edge weights) of the graph. Adopting all of these settings, we can simplify the above equation to
\[\begin{equation} \begin{split} \mathcal{H}\left(\left\{\mathbf{C}\right\}\right) = & -\sum_{i \neq j}\left(W_{ij} - \gamma p_{ij}\right) \delta \left(C_{i},C_{j}\right). \end{split} \end{equation}\]
Expanding on the original algorithm of Reichardt and Bornholdt, in 2012, Eaton and Mansbach introduced the concept of semi-supervised community detection to the spinglass model. When knowledge can be used to inform the community search, either through knowledge of a specific community of interest or partial knowledge of community memberships, its incorporation can be used to help compensate for potential noise in the network. As the Hamiltonian captures the total energy of the Potts model, external knowledge can be incorporated into community detection by penalizing for community structures that violate guidance. Generally, let the disagreement of the communities to a given guidance be specified by a function \(U: C \rightarrow \mathbb{R}\), where \(U(C)\) is smaller when the communities \(C\) agree with the guidance and larger when they disagree, with the following form:
\[\begin{equation} U(C) = \sum_{i \neq j} \left( u_{ij} \left( 1-\delta\left(C_{i},C_{j}\right) \right) \pm \bar{u}_{ij} \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(u_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to the same community and \(\bar{u}_{ij}\) is the penalty for violating guidance that \(v_{i}\) and \(v_{j}\) belong to different communities. When incorporating the guidance into the Hamiltonian,
\[\begin{equation} \mathcal{H}'\left( C \right) = \mathcal{H}\left( C \right) + \mu \sum_{i \neq j} \left( u_{ij} - \left(u_{ij}-\bar{u}_{ij}\right) \delta \left(C_{i},C_{j}\right)\right), \end{equation}\] where \(\mu \geq 0\) controls the balance between the inherent community structure and the external guidance. Rewriting \(\mathcal{H}'\) to be more conducive to optimization, we can decompose it as
\[\begin{equation} \mathcal{H}'\left( C \right) = -\sum_{i \neq j}M_{ij}\delta\left(C_{i},C_{j}\right) - \mu \sum_{i \neq j}\Delta U_{ij} \delta\left(C_{i},C_{j}\right)+K, \end{equation}\] where \(M_{ij}=A_{ij}-\gamma P_{ij}\), \(\Delta U_{ij}=u_{ij} - \bar{u}_{ij}\), and all constant terms are contained within \(K = \mu \sum_{i \neq j}u_{ij}\), which can ultimately be dropped from the optimization problem.
Adapting Eaton and Mansbach’s method for multimodal community detection in the human brain, our modified Hamiltonian takes the following form,
\[\begin{equation} \mathcal{H}'\left( C \right) = - \sum_{i \neq j} M_{ij}\delta\left(C_{i},C_{j}\right) - \alpha \sum_{i \neq j} S_{ij} \delta\left(C_{i},C_{j}\right), \end{equation}\] where \(M_{ij}\) is the modularity matrix associated with functional connectivity, \(S_{ij}\) is the structural connectivity matrix, and \(\alpha \geq 0\) controls the balance between the community structure determined solely by functional connectivity and being informed by the structural connectivity via white matter fiber tract information. In contrast to Eaton and Mansbach’s approach, where penalties are applied when violating guidance for nodes belonging both to the same community or different communities, our method only decreases the total energy of the model when white matter fiber tracts exist between regions \(v_i\) and \(v_j\); otherwise, the guidance term is set to zero, with the algorithm deciding the community assignments for \(v_i\) and \(v_j\) based solely on their functional connectivity.
To then study the hierarchical community structure of these networks, we implement an iterative method. Specifically, we create a multilayer network, where communities from layer \(k\) each become subnetworks in layer \(k+1\) wherein the multimodal spinglass algorithm is applied. The number of iterations (or layers) is pre-specified in the algorithm to avoid the need for stopping criteria. The number of communities \(q\) is held constant across each layer of the network. An additional parameter, \(\tau\), is also specified a priori, controlling the acceptance rate of the algorithm.
Before implementing the algorithm within the CommKern package, we start from a common brain atlas (e.g., Automated Anatomical Labeling (AAL), Harvard-Oxford (HO), Craddock 200 (CC200), etc.), where adjacency matrices representing functional and structural connectivity are derived from the correlation between resting state functional MRI (rs-FMRI) blood oxygen level dependent (BOLD) time series or the number of white matter fiber tracks computed from diffusion tensor MRI, respectively.
Once we have structural and functional adjacency matrices, we must
create the network dataframe used within the hms algorithm
function. This is done using the matrix_to_df function
(Note: the SBM_net dataset is already in this form for the
end user). It should be noted that both a functional and structural
matrix need to be specified for the matrix_to_df function
to work. If there is no structural (or guidance) matrix for use, the end
user can simply create a matrix of \(0\)s of the same dimensions for input into
the function. This will not affect the end results of the
hms function.
net <- matrix_to_df(func_mat = SBM_net$func_mat, str_mat = SBM_net$str_mat)
identical(net, SBM_net)
## [1] TRUEFrom here, we can implement the hms function. There are
several input parameters to discuss.
str(hms)
## function (input_net, spins, alpha, coolfact, tol, max_layers)The spins parameter denotes the maximum number
of spins, or communities, the algorithm can utilize during optimization.
The whole number integer value of spins must lie within
\(\left[2,\texttt{ # of network
nodes}\right]\). However, the higher the number of spins, the
more complex the optimization process, leading to longer computation
times. If the number of communities is unknown, we recommend an
exploratory analysis to determine the number of communities the
algorithm tends to decide on using less restrictive parameters.
The alpha parameter is a numeric parameter \(\left(\geq 0\right)\) that balances the use
of the guidance matrix in the algorithm’s optimization. When \(\alpha=0\), the algorithm ignores the
guidance matrix completely and when \(\alpha=1\), the algorithm weights the
functional and structural inputs equally.
The coolfact parameter indicates how quickly (or slowly)
to cool the heatbath algorithm, which is used to find the optimal
partitioning of nodes to communities. This parameter must be in \(\left(0,1\right)\), but is usually set to
be \(0.95-0.99\).
The tol parameter indicates the tolerance level of
accepting the proposed changes within a temperature; at the end of each
sweep, the number of proposed changes to the partition is assessed to
see if it exceeds a threshold determined as a function of tol and spins,
typically set to be 0.01-0.05.
Finally, max_layers is the parameter controlling the
number of layers of communities within the network. At least one layer
must be specified in the algorithm. End users can specify the number of
layers to be any positive whole number but several considerations should
be made. First, as the number of layers increases, so does the
computational time. Second, over-specification of the community layers
can create singleton issues (a single node in a community), leading to
the algorithm throwing an error. Finally, the number of layers specified
should be data driven; therefore, we cannot provide a universally
optimal number of layers to specify.
Using the SBM_net dataset as an example, below we have
run the hms algorithm to find the first layer of
bipartitioning, suppressing the guidance matrix in the optimization.
The output from the hms function call has two components
within a list structure:
comm_layers_tree, a dataframe whose first column is
the node ID (pulled from the input network object) and all other columns
denote the partitioning of nodes to communities across layers.
best_hamiltonian, a vector of the optimized
Hamiltonian values for each run of the HMS algorithm.
hms_object <-
  hms(
      input_net = SBM_net,
      spins = 2,
      alpha = 0,
      coolfact = 0.99,
      tol = 0.01,
      max_layers = 1)
str(hms_object)
## List of 3
##  $ comm_layers_tree:'data.frame':    80 obs. of  2 variables:
##   ..$ node_id: int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ layer_1: int [1:80] 2 2 2 2 2 2 2 2 2 2 ...
##  $ net             :List of 5
##   ..$ func_edges :'data.frame':  1233 obs. of  3 variables:
##   .. ..$ func_start_node: int [1:1233] 2 3 4 5 6 7 8 9 10 11 ...
##   .. ..$ func_end_node  : int [1:1233] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ func_weight    : num [1:1233] 0.715 0.734 0.739 0.719 0.769 ...
##   ..$ str_edges  :'data.frame':  453 obs. of  3 variables:
##   .. ..$ str_start_node: int [1:453] 2 3 5 6 7 8 9 10 13 15 ...
##   .. ..$ str_end_node  : int [1:453] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ str_weight    : num [1:453] 0.611 0.598 0.601 0.617 0.594 ...
##   ..$ vertexes   :'data.frame':  80 obs. of  5 variables:
##   .. ..$ node_id    : int [1:80] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ node_label : logi [1:80] NA NA NA NA NA NA ...
##   .. ..$ func_degree: num [1:80] 11.2 11.3 11 11.5 11.1 ...
##   .. ..$ str_degree : num [1:80] 5.81 5.65 6.27 4.03 3.37 ...
##   .. ..$ community  : logi [1:80] NA NA NA NA NA NA ...
##   ..$ func_matrix: num [1:80, 1:80] 0 0.715 0.734 0.739 0.719 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   ..$ str_matrix : num [1:80, 1:80] 0 0.611 0.598 0 0.601 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:80] "1" "2" "3" "4" ...
##   ..- attr(*, "class")= chr "spinglass_net"
##  $ best_hamiltonian: num -286
##  - attr(*, "class")= chr "spinglass_hms"The HMS algorithm is flexible enough that either the multimodal or hierarchical aspects can be suppressed. If the multimodal aspect would like to be suppressed, set \(\alpha=0\) and if the hierarchical aspect would like to be suppressed, set \(\text{max_layers}=1\). If both are suppressed, the algorithm simplifies to the spinglass algorithm originally proposed by Reichardt and Bornholdt.
Once the algorithm has finished running, you can visualize the
results in a plot using community_plot. Using the output
from the hms run of the SBM_net dataset, we
can produce the following plot:
community_plot(hms_object)If the HMS algorithm is used on network data for which ground truth is unknown (like in the case of brain connectivity), the CommKern package has provided several functions to understand the robustness and stability of the partitioning of nodes to communities across multiple runs.
First is the community_allegiance function.For node
\(i\), the stability of its allegiance
to community \(A\) is calculated as the
number of times where node \(i\)
belongs to community \(A\), divided by
the total number of runs. This measure is bounded in \(\left[0,1\right]\), where higher values of
stability indicate that a node belong to a single community across a
greater number of runs, and can only work on a single layer of the
hierarchy at a time. Higher proportions of nodes with high levels of
community allegiance indicate a robust community detection
algorithm.
Below is a simple example of how community allegiance is calculated and visualized.
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)
xyz_comms <- data.frame(id=seq(1:length(x)),x_comm=x,y_comm=y,z_comm=z)
xyz_alleg <- community_allegiance(xyz_comms)
xyz_melt <- reshape2::melt(xyz_alleg)
print(xyz_comms)
##    id x_comm y_comm z_comm
## 1   1      3      3      2
## 2   2      3      3      2
## 3   3      2      2      3
## 4   4      3      2      2
## 5   5      1      2      3
## 6   6      2      1      1
## 7   7      3      1      3
## 8   8      2      3      3
## 9   9      1      3      1
## 10 10      2      1      2
## 11 11      1      2      1
## 12 12      1      1      1
ggplot2::ggplot(data = xyz_melt) +
  ggplot2::theme_minimal() +
  ggplot2::aes(x = as.factor(Var1), y = as.factor(Var2), fill = value) +
  ggplot2::geom_tile() +
  ggplot2::xlab('Node') + ggplot2::ylab('Node') +
  ggplot2::ggtitle('Community Allegiance Example') +
  ggplot2::scale_fill_gradient2(
    low  = 'navy',
    high = 'goldenrod1',
    mid  = 'darkturquoise',
    midpoint = 0.5,
    limit = c(0, 1),
    space = 'Lab',
    name='')The average community allegiance value within this simple example is 0.3333333, meaning that, on average, two nodes appear in the same community 33% of the time.
If multiple runs of the HMS algorithm have been performed on the same
dataset and a single, representative partition is desired, end users can
utilize the consensus_similarity function. This function
can be used to identify a single, representative partition from a set of
partitions that is most similar to all others. Similarity is taken to be
the z-score of the adjusted Rand coefficient. This definition of
consensus similarity was first defined by Doron et al. in a 2012
PNAS publication. Pairwise z-scores of the adjusted Rand
coefficient are calculated for all partitions, then the average pairwise
similarity is calculated for each partition. The partition with the
highest average pairwise similarity is then chosen to be the consensus
partition.
Below is a simple example of how the
consensus_similarity function works.
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
z <- sample(x = rep(1:3, 4), 12)
xyz_comms_mat <- matrix(c(x,y,z),nrow=length(x),ncol=3)
consensus_similarity(xyz_comms_mat)
##  [1] 3 3 2 3 1 2 3 2 1 2 1 1End users can simulate group-level networks with community structure
using two main functions within the CommKern package:
group_network_perturb and
group_adj_perturb.
str(group_network_perturb)
## function (n_nodes, n_comm, n_nets, perturb_prop, wcr, bcr, bfcr = NA, fuzzy_comms = NA)
str(group_adj_perturb)
## function (group_network_list, n_nets, n_nodes)First, group_network_perturb creates a list of simulated
networks, of which each network is in a data.frame format, which
describes the community assignment for each node in the network and
simulates the edge weights based on whether the node dyad is (a) within
the same community; (b) between different communities; or (c) between
different communities but designated as “fuzzy” in their distinction
from one another. This function takes several inputs:
n_nodes, the number of nodes in each
network.
n_comm, the number of communities to simulate in
each network.
n_nets, the number of networks to simulate.
perturb_prop, the proportion of network nodes to
randomly alter their community assignment within each network; this
helps to simulate individual variability.
wcr, the the within community edge weights, sampled
from a Beta distribution.
bcr, the between community edge weights, sample from
a Beta distribution.
bfcr, fuzzy community edge weights, sample from a
Beta distribution
fuzzy_comms, the communities for which their
distinction is “fuzzy” or not as distinct; fuzzy communities tend to
have higher between-community edge weights than non-fuzzy
communities.
Once the group_network_perturb function is run, its
output is used as one of the inputs for the
group-adj_perturb function, which produces an array of
adjacency matrices. These adjacency matrices can then be used as input
to any community detection algorithm (such as the HMS algorithm).
Below is an example of how to simulate a group of networks with similar community structure, with a visualization of one resulting simulated adjacency matrix.
sim_nofuzzy <-
   group_network_perturb(
     n_nodes = 50,
     n_comm = 4,
     n_nets = 3,
     perturb_prop = 0.1,
     wcr = c(8, 8),
     bcr = c(1.5, 8)
   )
nofuzzy_adj <-
  group_adj_perturb(sim_nofuzzy, n_nets = 3, n_nodes = 50)
if (require(pheatmap)) {
 pheatmap::pheatmap(
   nofuzzy_adj[1,,],
   treeheight_row = FALSE,
   treeheight_col = FALSE
 )
}
## Loading required package: pheatmapAs well, the following is an example of group-level networks simulated with fuzzy network structure.
sim_fuzzy <-
   group_network_perturb(
     n_nodes = 50,
     n_comm = 4,
     n_nets = 3,
     perturb_prop = 0.1,
     wcr = c(8, 8),
     bcr = c(1.5, 8),
     bfcr = c(3.5, 8),
     fuzzy_comms = c('comm_b', 'comm_c')
   )
fuzzy_adj <-
  group_adj_perturb(sim_fuzzy, n_nets = 3, n_nodes = 50)
if (require(pheatmap)) {
 pheatmap::pheatmap(
   fuzzy_adj[1,,],
   treeheight_row = FALSE,
   treeheight_col = FALSE
 )
}Within community detection, the methods for evaluating the performance of a particular algorithm can be classified as either (1) extrinsic, requiring ground truth labels or (2) intrinsic not requiring ground truth labels.
All external clustering measures rely on a \(r \times k\) contingency matrix \(\mathbs{N}\) that is induced by a system-clustering \(\mathcal{C}\) and a ground truth partitioning \(\mathcal{T}\), where the rows represent the system-clusters \(\mathcal{C}_i \in \mathcal{C}\) for \(i=1,\dots,r\) and the columns represent the ground truth partitions \(\mathcal{T}_j \in \mathcal{T}\) for \(j=1,\dots,k\). A cell at the intersection of row \(i\) and column \(j\) contains the count \(n_{ij}\) of points that are common to cluster \(\mathcal{C}_i\) and ground truth partition \(\mathcal{T}_j\):
\[\begin{equation} N\left(i,j\right)=n_{ij}=\vert\mathcal{C}_i \cap \mathcal{T}_j \vert \end{equation}\] However, in many cases a ground truth labeling does not exist, such as brain connectivity. In this case, a two-way matching can be performed, where the matching is performed twice, each time with one of the clusterings being the ground truth and the results combined using a harmonic mean to produce a final matching measure.
There are several options for external cluster evaluation metrics. For all, assume \(\Omega=\left\{\omega_1,\omega_2,\dots,\omega_k\right\}\) and \(\mathbb{C}=\left\{c_1,c_2,\dots,c_j\right\}\) are two sets of clusters.
set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
purity(x,y)
## [1] 0.5set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
NMI(x,y)
## [1] 0.05360537set.seed(7183)
x <- sample(x = rep(1:3, 4), 12)
y <- sample(x = rep(1:3, 4), 12)
adj_RI(x,y)
## [1] -0.1458333To create a matrix of distances, the function
ext_distance can be used. This function uses the community
output values from any community detection algorithm, such as the HMS
algorithm, as an input. Because extrinsic cluster evaluation metrics use
the underlying idea of similarity, distance is calculated as \(\left(1-\text{similarity}\right)\). The use
of distance ensures that the distance matrix will be positive and
semi-definite, a requirement for the semiparametric kernel methods
detailed in the next section.
x <- c(2,2,3,1,3,1,3,3,2,2,1,1)
y <- c(3,3,2,1,1,1,1,2,2,3,2,3)
z <- c(1,1,2,3,2,3,2,1,1,2,3,3)
xyz_comms <- data.frame(x_comm = x, y_comm = y, z_comm = z)
ext_distance(xyz_comms, variant = 'NMI')
##              1         2         3
## [1,] 0.0000000 0.6963946 0.3412397
## [2,] 0.6963946 0.0000000 0.8412397
## [3,] 0.3412397 0.8412397 0.0000000
ext_distance(xyz_comms, variant = 'adj_RI')
##              1         2         3
## [1,] 0.0000000 0.9166667 0.4583333
## [2,] 0.9166667 0.0000000 1.0694444
## [3,] 0.4583333 1.0694444 0.0000000
ext_distance(xyz_comms, variant = 'purity')
##              1         2         3
## [1,] 0.0000000 0.4166667 0.1666667
## [2,] 0.4166667 0.0000000 0.5000000
## [3,] 0.1666667 0.5000000 0.0000000Rather than use extrinsic metrics, which rely either on a ground truth labeling or on a harmonic mean between two clusterings, intrinsic evaluation metrics do not rely on the existence of a ground truth. One of the most commonly used intrinsic metrics is the optimized value of the modularity function; this is based on the fact that networks with very similar community structure should have very similar modularity values. Within the HMS algorithm, the optimized value of the Hamiltonian van be used in the same manner as the modularity function. Because the optimized Hamiltonian values from the HMS algorithm are not bounded in the same manner as the extrinsic cluster evaluation metrics, a p-norm of the differences between the Hamiltonians for two clusterings was used such that
\[\begin{equation} d\left(\mathcal{H}_{\Omega},\mathcal{H}_{\mathbb{C}}\right)=\Vert \mathcal{H}_{\Omega} - \mathcal{H}_{\mathbb{C}} \Vert_{p} \end{equation}\] for \(1 \leq p < \infty.\)
To create a matrix of distances, the function
ham_distance can be used. This function uses the
Hamiltonian output values from a community detection algorithm that
implements a Hamiltonian value, such as the HMS algorithm. To ensure a
positive, semi-definite matrix (as required for the kernel machine
methods), the absolute difference between Hamiltonian values is
calculated.
hamil_df <- data.frame(id  = seq(1:8),
                       ham = c(-160.5375, -167.8426, -121.7128,
                               -155.7245, -113.9834, -112.5262,
                               -117.9724, -171.374))
ham_distance(hamil_df)
##            1       2       3       4       5       6       7       8
## [1,]  0.0000  7.3051 38.8247  4.8130 46.5541 48.0113 42.5651 10.8365
## [2,]  7.3051  0.0000 46.1298 12.1181 53.8592 55.3164 49.8702  3.5314
## [3,] 38.8247 46.1298  0.0000 34.0117  7.7294  9.1866  3.7404 49.6612
## [4,]  4.8130 12.1181 34.0117  0.0000 41.7411 43.1983 37.7521 15.6495
## [5,] 46.5541 53.8592  7.7294 41.7411  0.0000  1.4572  3.9890 57.3906
## [6,] 48.0113 55.3164  9.1866 43.1983  1.4572  0.0000  5.4462 58.8478
## [7,] 42.5651 49.8702  3.7404 37.7521  3.9890  5.4462  0.0000 53.4016
## [8,] 10.8365  3.5314 49.6612 15.6495 57.3906 58.8478 53.4016  0.0000A kernel is a function that takes as its inputs vectors in some original space and returns the dot product of vectors in the feature space. More formally, if we have \(\mathbf{x},\mathbf{z} \in \mathcal{X}\) and a map \(\phi: \mathcal{X} \rightarrow \mathbb{R}^N\), then \(k\left(\mathbf{x},\mathbf{z}\right)=\langle\phi\left(\mathbf{x}\right),\phi\left(\mathbf{z}\right)\rangle\) is a kernel function. An important concept relating to kernel methods is the reproducing kernel Hilbert space (RKHS). Because all kernels are positive definite, there exists one or more feature spaces for which the kernel defines the inner product, without having to explicitly define such feature space. Using the Moore-Aronszajn theorem, it can be shown that for each kernel \(k\) on a set \(\mathcal{X}\), there is a unique space of functions (known as the Hilbert space) on \(\mathcal{X}\) for which \(k\) is a reproducing kernel.
Letting \(Z\) be a multidimensional array of variables and \(i,j\) be two subjects, then \(k\left(Z_i,Z_j\right)\) can be used as a measure of similarity between the pair of subjects \(i,j\) since \(k\left(\cdot,\cdot\right)\) is positive definite. This similarity measure can then be incorporated into a statistical inference framework to test what extend variation in \(Z\) between individuals can explain variation in some outcome of interest \(Y\). A range of kernel functions are used in statistics, where the choice of kernel determines the function space used to approximate the relationship between two variables. A distance-based kernel is denoted as
\[\begin{equation} K_d\left(x_1,x_2\right)=exp\left\{\dfrac{-d^2(x_1,x_2)}{\rho}\right\}, \end{equation}\] where \(d^2(x_1,x_2)\) is a distance function and \(\rho\) an unknown bandwidth or scaling parameter.
While this package provides several options for cluster evaluation, many more exist in the literature. End users can utilize their own custom metrics, so long as the resulting distance matrix is symmetric, positive, and semi-definite, as these are the requirements for incorporation into the kernel machine methods detailed in the next section.
Suppose a dataset consists of \(n\) subjects, where for subject \(i=\left(1,\dots, n\right)\), \(y_i\) is an outcome variable (either binary or continuous), \(\mathbf{x}_i\) is a \(q \times 1\) vector of clinical covariates and \(\mathbf{x}_i\) is a \(p \times 1\) vector of cluster evaluation metrics. The outcome \(y_i\) depends on \(\mathbf{x}_i\) and \(\mathbf{z}_i\) through the following semiparametric linear model
\[\begin{equation} y_i = \mathbf{x}_{i}^{T}\mathbf{\beta} + h\left(\mathbf{z}_i\right) + e_i, \end{equation}\] where \(\mathbf{\beta}\) is a \(q \times 1\) vector of regression coefficients, \(h\left(\mathbf{z}_i\right)\) is an unknown, centered and smooth function, and \(e_i\) is an independent and identically distributed error term following \(N\left(0,\sigma^2\right)\).
This model allows for covariate effects to be modeled parametrically and the brain connectivity metric either parametrically or non-parametrically. There are several special cases of this model: when \(h\left(\cdot\right)=0\), the model reduces too a standard logistic regression model and when \(\mathbf{x}_i=1\), the model reduces to a least squares kernel machine regression. A hypothesis test can be conducted to determine whether the multidimensional variable set \(\mathbf{z}_i\) is associated with \(y_i\), controlling for \(\mathbf{x}_i\), of the form
\[\begin{equation} H_0: h\left( \cdot \right) = 0\\ H_A: h\left( \cdot \right) \neq 0\\ \end{equation}\]
Details of the derivation of the score test will be left out of this vignette. If end users are interested, please refer to Jensen et al. (2019); Liu, Lin, and Ghosh (2007); and Liu, Ghosh, and Lin (2008).
Using the datasets provided in the CommKern package,
we can implement the various kernel machine types. In all cases, a
parameter grid_gran is preset to 5000; this variable
relates to the grid search length for the nuisance \(\rho\) parameter.
If the outcome is binary and no other covariates are of interest,
then use the score_log_nonparam function:
str(score_log_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)Using the simasd_covars demographics dataset, where
dx_group will be used as the outcome of interest, we will
use simasd_hamil_df as the basis for our distance
matrix.
simasd_ham_mat <- ham_distance(simasd_hamil_df)
score_log_nonparam(outcome=simasd_covars$dx_group,
                   dist_mat=simasd_ham_mat)
## [1] 0.5703748If we wish to incorporate additional covariates of interest, like
age, sex, or handedness, this can easily be done via the
score_log_semiparam function.
str(score_log_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)For our purposes, we will still use the Hamiltonian as the basis for our distance matrix. Because handedness is a categorical variable, we will convert it to a factor before adding the covariates dataframe to the function.
simasd_ham_mat <- ham_distance(simasd_hamil_df)
simasd_confound <- simasd_covars[,3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)
score_log_semiparam(outcome=simasd_covars$dx_group,
                    covars=simasd_confound,
                    dist_mat=simasd_ham_mat)
## [1] 0.5368395In a similar manner, if the outcome of interest is continuous and no
other covariates are of interest, then use the
score_cont_nonparam function:
str(score_cont_nonparam)
## function (outcome, dist_mat, grid_gran = 5000)Still using the simasd_covars demographics dataset, we
will now use verbal_IQ as the outcome of interest and we
will now use simasd_comm_df as the basis for our distance
matrix, with the NMI as the extrinsic cluster evaluation metric
simasd_NMI_mat <- ext_distance(comm_df=simasd_comm_df,
                               variant=c("NMI"))
score_cont_nonparam(outcome=simasd_covars$verbal_IQ,
                   dist_mat=simasd_NMI_mat)
## [1] 0.5619891If we wish to incorporate additional covariates of interest, like
age, sex, or handedness, this can easily be done via the
score_cont_semiparam function.
str(score_cont_semiparam)
## function (outcome, covars, dist_mat, grid_gran = 5000)For our purposes, we will now use purity as the basis for our distance matrix. Because handedness is a categorical variable, we will convert it to a factor before adding the covariates dataframe to the function.
simasd_pur_mat <- ext_distance(comm_df=simasd_comm_df,
                               variant=c("purity"))
simasd_confound <- simasd_covars[,3:5]
simasd_confound$handedness <- as.factor(simasd_confound$handedness)
score_cont_semiparam(outcome=simasd_covars$verbal_IQ,
                    covars=simasd_confound,
                    dist_mat=simasd_pur_mat)
## [1] 0.3368465