The package provides a tool to statistically assess presence of trajectory in data. The function, test.trajectory()
, implements the tree dimension test (TDT) (Tenha and Song 2022). It takes as input a matrix with rows as observations and columns as features. The output from the function is a list containing the tree dimension test measure (TDT), tree dimension test effect, \(S\) statistic and the p.value for TDT.
The example below illustrates the application of TDT to test presence of trajectory in simulated single-cell RNA-seq gene expression data that has a trajectory. The dataset is stored in matrix input
(Cannoodt et al. 2018), where rows are cells and columns are genes. TDT is able to recognize the presence of trajectory as depicted by the significant TDT \(p\)-value.
require(TreeDimensionTest)
=system.file('extdata', 'bifurcating_7.rds', package = 'TreeDimensionTest')
data_path
= readRDS(data_path) input
Next we run the test.trajectory function on matrix input
, with the other parameters set to default.
= test.trajectory(input) res
The test returns a list containing tree dimension measure (tdt_measure
), effect (tdt_effect
), test statistic, \(p\)-value, number vertices that are leaves, and tree diameter. Here, the \(p\)-value is significant and tdt_effect
is strong, depicting presence of trajectory.
Test results contained in res |
|
---|---|
tdt_measure | 2.6666667 |
statistic | 74.8848818 |
tdt_effect | 0.7488488 |
leaves | 26.0000000 |
diameter | 41.0000000 |
p.value | 0.0046336 |
original_dimension | 7848.0000000 |
pca_components | 3.0000000 |
We visualize the MST and the scatterplot using principal components.
plot(res, node.size=12, node.col="mediumseagreen",
main = paste0("Trajectory presence\np-value = ",
format.pval(res$p.value, digit=2)))
=prcomp(input)
pcplot(pc$x[,c(1:2)], xlab="PC1", ylab="PC2", cex=1,
sub="Principal components", col="mediumseagreen",
main = paste0("Trajectory presence\np-value = ",
format.pval(res$p.value, digit=2),
"\n(n = ", nrow(input), ")")
)
Here we demonstrate using test.trajectory on the input
matrix with modified parameter values. dim.reduction="pca"
means dimensionality reduction will be performed first using principal component analysis. The number of principal components is selected using the Scree test. One can set dim.reduction="none"
if dimensionality reduction is unnecessary. MST="exact"
; the exact Euclidean MST (EMST) is used. The alternative is the approximate and fast dual-tree Boruvka algorithm by setting MST="boruvka"
to obtain an approximate EMST.
= test.trajectory(input, dim.reduction = "pca", MST = "exact") res
Test results contained in res |
|
---|---|
tdt_measure | 2.6666667 |
statistic | 74.8848818 |
tdt_effect | 0.7488488 |
leaves | 26.0000000 |
diameter | 41.0000000 |
p.value | 0.0046336 |
original_dimension | 7848.0000000 |
pca_components | 3.0000000 |
It can be seen from the results that the exact MST and boruvka MST are equivalent for small sample size.
We simulated spherical bivariate normal data, which are isotropic and contain no trajectory. The \(p\)-value for the test is insignificant.
= 100
n = cbind(rnorm(n), rnorm(n))
mat = test.trajectory(mat, dim.reduction = "none") res
Test statistics and \(p\)-value are contained in res
:
Test results contained in res |
|
---|---|
tdt_measure | 3.4242424 |
statistic | 68.4820747 |
tdt_effect | 0.6848207 |
leaves | 27.0000000 |
diameter | 32.0000000 |
p.value | 0.8482481 |
We visualize the data which show no sign of trajectory presence.
plot(res, node.size=12,
main = paste0("Trajectory presence\np-value = ",
format.pval(res$p.value, digit=2)))
plot(mat, col="orange", pch=2, cex=0.5, xlab="Dim 1", ylab="Dim 2",
main=paste0("Trajectory presence\np-value = ",
format.pval(res$p.value, digit=2),
"\n(n = ", n, ")")
)
TDT statistics can be calculated relatively quickly to learn if there is any effect in the data before launching the more time consuming step of calculating the \(p\)-value.
Function compute.stats()
computes tree dimension measure, tree dimension effect, number of leafs and diameter of EMST for a given input data without calculating the \(p\)-value.
= compute.stats(mat, MST="boruvka", dim.reduction = "none") res
Results contained in res |
|
---|---|
tdt_measure | 3.4242424 |
tdt_effect | 0.6848207 |
leaves | 27.0000000 |
diameter | 32.0000000 |
Function separability()
computes heterogeneity of observations of the same type in a given data (Tenha and Song 2022). Observations of the same type have the same label. The function takes a matrix as input with rows as observations and columns as features. The function also takes a vector of labels for the observations. The function returns separability values for each label type and the overall separability value. The separability values range from 0 to 1, with 1 being the highest separability.
In the examples below, there are three types of observations labeled L1, L2 and L3. An instance of real application is in single-cell data, where the labels could be cell types.
#Random data
= cbind(rnorm(200), rnorm(200))
mat
#Labels for the samples in the data
= c(rep("L1", 93), rep("L2",78), rep("L3",29))
labels
#Color vector of samples, each unique color correspods with unique label
= c(rep("orange", 93), rep("mediumseagreen", 78), rep("purple", 29))
cols
#Compute separability of samples in mat
= separability(mat, labels) res
The result is a list containing separability values for each label and, overall separability on the data. The overall separability is relatively low, implying samples with same labels are mixed.
::kable(res$label_separability, col.names = "Label separability") knitr
Label separability | |
---|---|
L1 | 0.5535714 |
L2 | 0.4698795 |
L3 | 0.2710280 |
#Plots an MST of the data, with samples of the same label highlighted by same color
# plotTree(mat, labels, node.size = 12, node.col = cols,
# main = paste("Low seperability", format(res$overall_separability, digits = 3)),
# legend.cord=c(-2.1,0.9)
# )
plot(res,node.size = 12, node.col = cols,
main = paste("Low seperability", format(res$overall_separability, digits = 3)), legend.cord=c(-2.1,0.9))
An example where samples with the same label are close together.
= rbind(
mat cbind(rnorm(93,mean=20), rnorm(93, mean=20)),
cbind(rnorm(78,mean=5), rnorm(78,mean=5)),
cbind(rnorm(29, mean=50), rnorm(29, mean=50)))
= c(rep("L1", 93), rep("L2",78), rep("L3",29))
labels = separability(mat, labels) res
The overall separability is 1, implying samples of different labels are perfectly separated.
::kable(res$label_separability, col.names = " Label separability") knitr
Label separability | |
---|---|
L1 | 1 |
L2 | 1 |
L3 | 1 |
#Color vector of samples corresponding to labels
= c(rep("orange", 93), rep("mediumseagreen", 78), rep("purple", 29))
cols
# plotTree(
# mat,labels, node.size=12, node.col = cols,
# main = paste("High seperability", format(res$overall_separability, digits = 3)),
# legend.cord=c(-1.9,0.9))
plot(res,node.size = 12, node.col = cols,
main = paste("High seperability", format(res$overall_separability, digits = 3)), legend.cord=c(-2.1,0.9))
We now illustrate the use of separability to compute tissue specificity for calcium signaling and ribosome pathways on developing mouse data (Cardoso-Moreira et al. 2019) with samples of different tissue types.
#Loading calcium signaling pathway data from Mouse
#This is Mouse development RNA-seq data spanned by the geneset fo calcium signaling pathway
#Rows are genes and columns are samples.
=system.file('extdata', 'calcium_pathway_data.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#loading color vector of samples by label type; mouse_cols
=system.file('extdata', 'mouse_cols.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#Labels of the samples are the column names of the data, which are names of tissue types.
= colnames(calcium_pathway_data)
labels
= separability(t(calcium_pathway_data), labels)
res
#Separabiltiy for each tissue type as well as the overall separability. High separability depicts high tissue specificity.
# plotTree(
# t(calcium_pathway_data), labels, node.col=mouse_cols,node.size=12,
# main = paste("Calcium signaling pathway\nTissue specificity",
# format(res$overall_separability, digits = 3)),
# legend.cord=c(-1.9,-1.3))
plot(res,node.size = 12, node.col=mouse_cols,
main = paste("Calcium signaling pathway\nTissue specificity",
format(res$overall_separability, digits = 3)), legend.cord=c(-1.9,-1.3))
Calcium signaling pathway has high tissue specificity as depicted by the high separability value. Tissues of the same type are closer together as shown in the plot.
::kable(res$label_separability, col.names = "Calcium signaling pathway tissue specificity") knitr
Calcium signaling pathway tissue specificity | |
---|---|
Brain | 0.7971014 |
Cerebellum | 1.0000000 |
Heart | 1.0000000 |
Kidney | 1.0000000 |
Liver | 0.9047619 |
Ovary | 0.9032258 |
Testis | 0.8709677 |
#Loading ribosome pathway data from Mouse
=system.file('extdata', 'ribosome_pathway_data.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
#loading color vector of samples by label type; mouse_cols
=system.file('extdata', 'mouse_cols.rdata', package = 'TreeDimensionTest')
file.rdataload(file=file.rdata)
# ribsome_pathway_data is RNA-seq data with rows as genes and columns as samples
= colnames(ribosome_pathway_data)
labels = separability(t(ribosome_pathway_data), labels)
res plot(
node.col= mouse_cols,
res, node.size=12,
main = paste("Ribosome pathway\nTissue specificity",
format(res$overall_separability, digits = 3)),
legend.cord=c(-1.9,-1.3))
The ribosome pathway has relatively lower tissue specificity as depicted by the lower separability value. Tissues of the same type are mixed as shown in the plot.
::kable(res$label_separability, col.names = "Ribosome pathway tissue specificity") knitr
Ribosome pathway tissue specificity | |
---|---|
Brain | 0.5140187 |
Cerebellum | 0.7818182 |
Heart | 0.6179775 |
Kidney | 0.6219512 |
Liver | 0.5428571 |
Ovary | 0.4057971 |
Testis | 0.3913043 |