An Introduction to SOHPIE

Seungjun Ahn

October 20th, 2023

We introduce a Statistical Approach via Pseudo-value Information and Estimation (SOHPIE; pronounced as “Sofie”). This is a regression modeling method for differential network (DN) analysis that can include covariate information in analyzing microbiome data. SOHPIE is a software extension of our methodology work [1].

Requirements

Please install these R packages before using SOHPIE.

# library(robustbase) # To fit a robust regression.
# library(parallel) # To use mclapply() when reestimating the association matrix.
# library(dplyr)  # For the convenience of tabulating p-values, coefficients, and q-values.
# library(fdrtool) # For false discovery rate control.
# library(gtools) # To estimate an association matrix via SparCC.
library(SOHPIE)
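
Before loading the package, it can help to check which of the dependencies listed above are already installed. The following is a minimal sketch in base R; the helper name missing_pkgs is hypothetical and is not part of SOHPIE.

```r
# Hypothetical helper (not part of SOHPIE): return the packages that are not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# Dependencies named in the commented library() calls above.
deps <- c("robustbase", "parallel", "dplyr", "fdrtool", "gtools")
to_install <- missing_pkgs(deps)
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

An empty character vector means that every listed dependency is already available.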

Load the study data from the SOHPIE R package:

Two sample datasets are available in this package. One (combinedamgut) is from the American Gut Project [2] and contains 138 taxa and 268 subjects. In this vignette, the first 30 of the 138 taxa are used for a simple demonstration. The other (combineddietswap) is from the geographical epidemiology study of a diet swap intervention [3] and includes 112 taxa with 37 subjects (20 African Americans from Pittsburgh and 17 rural South Africans). The full data for each study are available in the SpiecEasi and microbiome R packages, respectively.

Example I: American Gut Project Data

set.seed(20050505)
data(combinedamgut) # The complete dataset, with columns for taxa and clinical covariates.

Data processing for the toy example using the sample dataset from the American Gut Project:

The main grouping variable is the indicator for living with a dog. After data processing, the subject indices are available for each group: ‘Not living with a dog (Group A)’ and ‘Living with a dog (Group B)’. We need these indices to estimate the group-specific \(p \times p\) association matrices (and, later, to re-estimate the association matrices for the pseudo-value calculations).

# Note: Again, we will use a toy example with the first 30 out of 138 taxa.
OTUtab = combinedamgut[ , 8:37]

# Clinical/demographic covariates (phenotypic data):
# Note: All of the covariates in phenodat below will be included in the regression 
#       when you use the SOHPIE_DNA function later. Please make sure that
#       phenodat contains only the variables to be analyzed.
phenodat = combinedamgut[, 1:7] # first column is ID, so not using it.
# Obtain indices of each grouping factor.
# In this example, a variable indicating the status of living with a dog was chosen (i.e. bin_dog).
# Accordingly, Groups A and B imply living without and with a dog, respectively.
newindex_grpA = which(combinedamgut$bin_dog == 0)
newindex_grpB = which(combinedamgut$bin_dog == 1)
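
Because the two index vectors drive every later step, a quick sanity check is worthwhile. The sketch below uses a made-up data frame (only the column name bin_dog mirrors the example above) to confirm that the groups are disjoint and together cover all subjects. Note that which() silently drops missing values, so a subject with a missing group indicator would end up in neither group.

```r
# Toy stand-in for the metadata; the values are invented for illustration.
toy <- data.frame(id = paste0("s", 1:6), bin_dog = c(0, 1, 0, 0, 1, 1))

idx_A <- which(toy$bin_dog == 0)  # Group A: not living with a dog
idx_B <- which(toy$bin_dog == 1)  # Group B: living with a dog

# The groups should not overlap, and together they should cover every row.
stopifnot(length(intersect(idx_A, idx_B)) == 0)
stopifnot(setequal(c(idx_A, idx_B), seq_len(nrow(toy))))

# Group sizes at a glance.
table(toy$bin_dog)
```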

Fit a pseudo-value regression via the SOHPIE_DNA() function:

Once the data processing step above is complete, we can fit a pseudo-value regression using the SOHPIE_DNA function. An important note: the OTU table and the clinical/demographic data (i.e. metadata) must be passed to the function as separate objects. In addition, you must supply the index vectors for each group of the binary indicator variable that serves as the main predictor (e.g. living with a dog vs. without a dog). Lastly, you must enter a trimming proportion c, which ranges from 0.5 to 1.

SOHPIEres <- SOHPIE_DNA(OTUdat = OTUtab, clindat = phenodat, 
                        groupA = newindex_grpA, groupB = newindex_grpB, c = 0.5)
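
For intuition about the trimming proportion c: a least trimmed squares fit minimizes the sum of only the smallest squared residuals, so a fraction of potentially outlying observations is ignored. The sketch below illustrates that objective in base R. The helper lts_objective, its argument prop, and the rule h = ceiling(prop * n) are simplifying assumptions for illustration; the exact subset size used inside SOHPIE_DNA() may differ.

```r
# Hypothetical helper: sum of the h smallest squared residuals,
# where h = ceiling(prop * n) is a simplified choice of subset size.
lts_objective <- function(resid, prop = 0.5) {
  stopifnot(prop >= 0.5, prop <= 1)
  n <- length(resid)
  h <- ceiling(prop * n)
  sum(sort(resid^2)[seq_len(h)])
}

resid <- c(0.1, -0.2, 0.3, -0.1, 8)  # the last residual is an outlier

lts_objective(resid, prop = 1)    # 64.15: all residuals, dominated by the outlier
lts_objective(resid, prop = 0.5)  # 0.06: only the 3 smallest squared residuals
```

A value near 0.5 gives the most protection against outliers, while a value of 1 keeps every observation.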

Additional features available in the SOHPIE package:

SOHPIE also provides convenience functions for use after fitting a pseudo-value regression. These let you quickly extract the names of taxa that are significantly differentially connected (DC; DCtaxa_tab), as well as p-values (pval and pval_specific_var), adjusted p-values (q-values; qval and qval_specific_var), coefficient estimates (coeff and coeff_specific_var), and standard errors (stderrs and stderrs_specific_var), either for all variables in the regression or for a specific variable.

# qval() function will get you a table with q-values.
qval(SOHPIEres)
#>        bin_dog          age       sex bin_floss bin_exercise cat_alcohol1
#> 326792       1 0.3335066191 0.3891881 0.6085508 0.7029038265   0.58249028
#> 348374       1 0.4959918786 0.4553286 0.6563176 0.0007978104   0.67793327
#> 181016       1 0.1034221221 0.3437233 0.4613268 0.7906962897   0.91583955
#> 191687       1 0.1061624991 0.4121733 0.7620091 0.6509072393   0.63046991
#> 305760       1 0.0623960296 0.3443704 0.3380110 0.2961143241   0.89650834
#> 326977       1 0.0816294220 0.4511363 0.8534582 0.7191285464   0.85698753
#> 194648       1 0.1738269828 0.2574068 0.3337861 0.6925366825   0.88023775
#> 28186        1 0.0007831936 0.2574068 0.7885174 0.8453179274   0.89192131
#> 541301       1 0.5657343541 0.2574068 0.6908599 0.7506372978   0.90966387
#> 198941       1 0.1013064347 0.2574068 0.7776566 0.7026636239   0.01754703
#> 353985       1 0.3223166965 0.3746422 0.6595031 0.6871464142   0.81861042
#> 187524       1 0.2946551331 0.3593480 0.8264901 0.8042761312   0.85349403
#> 182054       1 0.4039503606 0.3070761 0.8275860 0.1490472935   0.73146229
#> 175537       1 0.4490265917 0.2711891 0.5726527 0.8033774242   0.91236812
#> 9753         1 0.4910740372 0.3911474 0.2141110 0.8025621972   0.90363709
#> 194211       1 0.5115365048 0.2574068 0.7307118 0.8373385238   0.87655047
#> 188518       1 0.0151224058 0.3347717 0.7828179 0.3053262763   0.57247901
#> 189396       1 0.4212166815 0.3494026 0.3350469 0.6958245850   0.62345510
#> 90487        1 0.0155838264 0.2574068 0.5957072 0.6334803310   0.92405775
#> 203708       1 0.4307922980 0.3876823 0.8196181 0.4291603609   0.90042477
#> 173965       1 0.5767127939 0.3928536 0.3326346 0.3775387180   0.87458491
#> 194661       1 0.2655818775 0.4633322 0.2382683 0.8290773188   0.48876124
#> 512309       1 0.1044861801 0.3375792 0.7154385 0.7790596913   0.88219494
#> 170124       1 0.5607160782 0.5135918 0.3142403 0.7070289268   0.85300729
#> 216862       1 0.1033116429 0.2574068 0.8304994 0.6485739027   0.89802021
#> 352304       1 0.4001572982 0.4448885 0.6941110 0.7547647075   0.91968832
#> 191306       1 0.2730470071 0.3820290 0.4569466 0.7873666361   0.88495526
#> 191541       1 0.3346195307 0.3882961 0.8461190 0.7847328718   0.62178911
#> 191547       1 0.0931888307 0.2574068 0.8288977 0.8324249858   0.89687588
#> 195493       1 0.3150635107 0.3739925 0.8218145 0.7061505869   0.91477808
#>        cat_alcohol2 bin_migraine
#> 326792    0.4470969   0.43366083
#> 348374    0.6094692   0.34296707
#> 181016    0.7118308   0.51037157
#> 191687    0.5331404   0.22406386
#> 305760    0.7603810   0.47840625
#> 326977    0.7106559   0.63133361
#> 194648    0.6460261   0.60245176
#> 28186     0.5022682   0.44397328
#> 541301    0.5325693   0.38364557
#> 198941    0.5323451   0.32301604
#> 353985    0.5984587   0.42326881
#> 187524    0.7177693   0.44374322
#> 182054    0.7264787   0.64520003
#> 175537    0.5161692   0.43985159
#> 9753      0.6307723   0.62995026
#> 194211    0.7641725   0.37619117
#> 188518    0.4829608   0.23422186
#> 189396    0.3389945   0.20953056
#> 90487     0.4802509   0.50174676
#> 203708    0.7630714   0.59075030
#> 173965    0.7567017   0.44288581
#> 194661    0.4980916   0.40289035
#> 512309    0.5018482   0.41683800
#> 170124    0.4920228   0.43786678
#> 216862    0.7201564   0.04967498
#> 352304    0.7465225   0.44993928
#> 191306    0.5246934   0.40835019
#> 191541    0.5935170   0.36978953
#> 191547    0.7767038   0.53092184
#> 195493    0.4978749   0.56935890

The qval_specific_var function retrieves the q-values of a single variable (bin_dog in this example).

# Create an object to keep the table with q-values.
qvaltab <- qval(SOHPIEres)
# Retrieve a vector of q-values for a single variable of interest.
qval_specific_var(qvaltab = qvaltab, varname = "bin_dog")
#>        bin_dog
#> 326792       1
#> 348374       1
#> 181016       1
#> 191687       1
#> 305760       1
#> 326977       1
#> 194648       1
#> 28186        1
#> 541301       1
#> 198941       1
#> 353985       1
#> 187524       1
#> 182054       1
#> 175537       1
#> 9753         1
#> 194211       1
#> 188518       1
#> 189396       1
#> 90487        1
#> 203708       1
#> 173965       1
#> 194661       1
#> 512309       1
#> 170124       1
#> 216862       1
#> 352304       1
#> 191306       1
#> 191541       1
#> 191547       1
#> 195493       1

DCtaxa_tab returns a list consisting of (1) the names and q-values of taxa that are significantly DC between two biological conditions and (2) the names of the DC taxa only.

# Do not forget to provide the name of the grouping variable (groupvar)
# and the level of significance (alpha; 0.3 in this example).
# The result is stored under a new name so that it does not mask the function.
DCtaxa_res <- DCtaxa_tab(qvaltab = qvaltab, groupvar = "bin_dog", alpha = 0.3)
DCtaxa_res
#> $DCtaxa_complete_tab
#> [1] bin_dog
#> <0 rows> (or 0-length row.names)
#> 
#> $DCtaxa_names_only
#> character(0)
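
To make the two-part output described above concrete, here is a minimal base-R sketch of the same kind of filtering on a made-up q-value table. The helper dc_taxa, the taxon names, and the strict inequality q < alpha are assumptions for illustration and are not taken from the SOHPIE source.

```r
# Toy q-value table; taxa names and values are invented.
qtab <- data.frame(
  groupvar = c(0.02, 0.40, 0.04, 0.90),
  age      = c(0.50, 0.01, 0.70, 0.30),
  row.names = c("taxon1", "taxon2", "taxon3", "taxon4")
)

# Hypothetical helper: keep the taxa whose q-value for one variable is below alpha.
dc_taxa <- function(qtab, groupvar, alpha = 0.05) {
  keep <- !is.na(qtab[[groupvar]]) & qtab[[groupvar]] < alpha
  list(complete_tab = qtab[keep, groupvar, drop = FALSE],  # names and q-values
       names_only   = rownames(qtab)[keep])                # names only
}

res <- dc_taxa(qtab, groupvar = "groupvar", alpha = 0.05)
res$complete_tab
res$names_only
```

When no q-value falls below the threshold, both components are empty, which matches the shape of the result shown for bin_dog above.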

Example II: Diet Exchange Study Data

data(combineddietswap)

OTUtab = combineddietswap[ , 5:ncol(combineddietswap)]
phenodat = combineddietswap[ , 1:4] # first column is ID, so not using it.
# Obtain indices for each group 
# (i.e. African-Americans from Pittsburgh (AAM) vs. Africans from rural South Africa (AFR))
# at baseline (time = 1) and at 29-days (time = 6)

# Group A1 for AAM at baseline.
newindex_A1 = which(combineddietswap$timepoint == 1 & combineddietswap$nationality == "AAM")
# Group A6 for AAM at 29-days.
newindex_A6 = which(combineddietswap$timepoint == 6 & combineddietswap$nationality == "AAM")
# Group B1 for AFR at baseline.
newindex_B1 = which(combineddietswap$timepoint == 1 & combineddietswap$nationality == "AFR")
# Group B6 for AFR at 29-days.
newindex_B6 = which(combineddietswap$timepoint == 6 & combineddietswap$nationality == "AFR")

We are done loading the data and obtaining the indices for each group at each time point. We now move on to the analysis step.

The first step is to estimate and re-estimate association matrices for each of groups A1, A6, B1, and B6 defined above. Note that asso_mat returns a list with components assomat (the estimated association matrix) and reest.assomat (the re-estimated association matrices).

est_asso_matA1 = asso_mat(OTUdat=OTUtab, group=newindex_A1)
est_asso_matA6 = asso_mat(OTUdat=OTUtab, group=newindex_A6)
est_asso_matB1 = asso_mat(OTUdat=OTUtab, group=newindex_B1)
est_asso_matB6 = asso_mat(OTUdat=OTUtab, group=newindex_B6)

## For each group, we take the difference of estimated 
## association matrices between time points (29-days minus baseline).
asso_mat_diffA61 = est_asso_matA6$assomat - est_asso_matA1$assomat
asso_mat_diffB61 = est_asso_matB6$assomat - est_asso_matB1$assomat

## Similarly, for each group, we take the difference of re-estimated 
## association matrices between time points (29-days minus baseline).
asso_mat_drop_diffA61 = mapply('-', est_asso_matA6$reest.assomat,
                               est_asso_matA1$reest.assomat, SIMPLIFY=FALSE)
asso_mat_drop_diffB61 = mapply('-', est_asso_matB6$reest.assomat,
                               est_asso_matB1$reest.assomat, SIMPLIFY=FALSE)
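
The estimate, re-estimate, and difference pattern above can be sketched with toy data as follows. Pearson correlation is used here purely as a stand-in for the association estimator (the package itself uses SparCC), and the helper names est_full and est_loo are hypothetical. The sketch also assumes the same subjects, in the same order, at both time points, so that the two leave-one-out lists line up element by element.

```r
set.seed(1)
n <- 5  # subjects
p <- 3  # taxa

# Toy data for the same n subjects at two time points (rows = subjects, columns = taxa).
X_t1 <- matrix(rnorm(n * p), n, p)
X_t6 <- matrix(rnorm(n * p), n, p)

# Stand-in estimators: full-sample correlation and leave-one-out correlations.
est_full <- function(X) cor(X)
est_loo  <- function(X) {
  lapply(seq_len(nrow(X)), function(i) cor(X[-i, , drop = FALSE]))
}

# Difference between time points: one p x p matrix for the full sample...
diff_full <- est_full(X_t6) - est_full(X_t1)
# ...and a list of n matrices for the leave-one-out samples.
diff_loo <- mapply(`-`, est_loo(X_t6), est_loo(X_t1), SIMPLIFY = FALSE)

length(diff_loo)    # one matrix per left-out subject
dim(diff_loo[[1]])  # each is p x p
```

SIMPLIFY = FALSE matters here: without it, mapply would collapse the matrices into a single array instead of keeping one matrix per left-out subject.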

Then, for each group, we calculate the change in network centrality of each taxon, first between the association matrices estimated from the whole data and then between the association matrices re-estimated from each leave-one-out sample.

## For changes in network centrality between association matrices estimated from the whole data.
thetahat_grpA = thetahats(asso_mat_diffA61)
thetahat_grpB = thetahats(asso_mat_diffB61)
## For changes in network centrality between association matrices 
## re-estimated from the leave-one-out sample.
thetahat_drop_grpA = sapply(asso_mat_drop_diffA61, thetahats)
thetahat_drop_grpB = sapply(asso_mat_drop_diffB61, thetahats)
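
As a rough illustration of what a per-taxon centrality summary can look like, the sketch below computes the average absolute off-diagonal entry in each row of a difference matrix. This particular measure and the helper name centrality are assumptions for illustration; they may not match what thetahats computes internally.

```r
# Hypothetical centrality: mean absolute off-diagonal entry per row (taxon).
centrality <- function(M) {
  M <- as.matrix(M)
  diag(M) <- 0
  rowSums(abs(M)) / (ncol(M) - 1)
}

# Toy 3 x 3 difference matrix with invented values.
taxa <- paste0("taxon", 1:3)
D <- matrix(c( 0.0, 0.5, -0.1,
               0.5, 0.0,  0.3,
              -0.1, 0.3,  0.0),
            nrow = 3, byrow = TRUE, dimnames = list(taxa, taxa))

centrality(D)  # taxon1 = 0.3, taxon2 = 0.4, taxon3 = 0.2

# Applied over a list of leave-one-out matrices, sapply() returns a p x n matrix
# with one column per left-out subject.
loo <- list(D, 2 * D)
theta_drop <- sapply(loo, centrality)
dim(theta_drop)
```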

Next, we calculate jackknife pseudo-values.

# Sample sizes for each group.
n_A <- length(newindex_A1) 
n_B <- length(newindex_B1)

# Jackknife pseudo-values.
thetatilde_grpA = thetatildefun(thetahat_grpA, thetahat_drop_grpA, n_A)
thetatilde_grpB = thetatildefun(thetahat_grpB, thetahat_drop_grpB, n_B)

thetatilde = rbind(thetatilde_grpA, thetatilde_grpB)
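
For reference, the standard jackknife pseudo-value for subject \(i\) and taxon \(k\) is \(\tilde{\theta}_{ik} = n\hat{\theta}_k - (n-1)\hat{\theta}_{k(-i)}\), where \(\hat{\theta}_{k(-i)}\) is the estimate with subject \(i\) left out. The sketch below implements this formula in base R; the helper jackknife_pseudo is hypothetical, and it is an assumption that thetatildefun follows the same formula and orientation. The sample mean serves as a check, because its pseudo-values reproduce the original observations exactly.

```r
# Hypothetical helper: theta_full is a length-p vector, theta_drop is a p x n matrix
# (one column per left-out subject). Returns an n x p matrix of pseudo-values.
jackknife_pseudo <- function(theta_full, theta_drop, n) {
  t(n * theta_full - (n - 1) * theta_drop)
}

# Toy data: 4 subjects (rows), 2 variables (columns).
X <- matrix(c(1, 2, 4, 7,
              3, 5, 6, 10), ncol = 2)
n <- nrow(X)

theta_full <- colMeans(X)
theta_drop <- sapply(seq_len(n), function(i) colMeans(X[-i, , drop = FALSE]))

pseudo <- jackknife_pseudo(theta_full, theta_drop, n)
all.equal(pseudo, X)  # TRUE: pseudo-values of the mean are the observations themselves
```

Each row of the result corresponds to one subject, which is why the group-specific pseudo-value matrices can be stacked with rbind before the regression.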

Fit a robust regression of the pseudo-values on the covariates. Please note that the metadata/phenotype data should contain only the predictors to be included in the model, so choose them carefully. Additionally, the trimming proportion c of the least trimmed squares estimator must be between 0.5 and 1.

fitmod = pseudoreg(pseudoval=thetatilde, clindat=phenodat, c=0.5)

Lastly, we can extract summary results from the fitted model stored in the fitmod object above. A list of data.frame objects for coefficient estimates, p-values, and q-values will be available.

summary.result = pseudoreg.summary(pseudo.reg.res=fitmod, 
                                   taxanames=colnames(OTUtab))

## In this study, the main grouping variable was nationality. 
# We can use qval_specific_var to see only the q-values
# of nationality. Alternatively, we can use DCtaxa_tab to see
# the DC taxa with their q-values.
qval_specific_var(summary.result$q_values, varname = "nationalityAFR")
#>                                       nationalityAFR
#> Akkermansia                              0.656021088
#> Alcaligenes faecalis et rel.             0.697952103
#> Allistipes et rel.                       0.757496542
#> Anaerostipes caccae et rel.              0.419192540
#> Anaerotruncus colihominis et rel.        0.668645846
#> Anaerovorax odorimutans et rel.          0.264856929
#> Aquabacterium                            0.195376727
#> Atopobium                                0.073176926
#> Bacillus                                 0.673758100
#> Bacteroides fragilis et rel.             0.035519751
#> Bacteroides intestinalis et rel.         0.621518988
#> Bacteroides ovatus et rel.               0.044822294
#> Bacteroides plebeius et rel.             0.595063296
#> Bacteroides splachnicus et rel.          0.672181067
#> Bacteroides stercoris et rel.            0.650922141
#> Bacteroides uniformis et rel.            0.138742256
#> Bacteroides vulgatus et rel.             0.034013932
#> Bifidobacterium                          0.618351854
#> Bilophila et rel.                        0.083458213
#> Brachyspira                              0.712180734
#> Bryantella formatexigens et rel.         0.509315681
#> Bulleidia moorei et rel.                 0.371973569
#> Burkholderia                             0.145675949
#> Butyrivibrio crossotus et rel.           0.668484659
#> Campylobacter                            0.711069109
#> Catenibacterium mitsuokai et rel.        0.582119077
#> Clostridium (sensu stricto)              0.391640539
#> Clostridium cellulosi et rel.            0.361169562
#> Clostridium colinum et rel.              0.314471156
#> Clostridium difficile et rel.            0.638529824
#> Clostridium leptum et rel.               0.459619387
#> Clostridium nexile et rel.               0.517564673
#> Clostridium orbiscindens et rel.         0.694621918
#> Clostridium ramosum et rel.              0.204245652
#> Clostridium sphenoides et rel.           0.728222226
#> Clostridium stercorarium et rel.         0.761438204
#> Clostridium symbiosum et rel.            0.531366952
#> Collinsella                              0.234087001
#> Coprobacillus catenaformis et rel.       0.001459571
#> Coprococcus eutactus et rel.             0.299058393
#> Corynebacterium                          0.237937836
#> Desulfovibrio et rel.                    0.659353237
#> Dialister                                0.721195606
#> Dorea formicigenerans et rel.            0.759582869
#> Eggerthella lenta et rel.                0.078400654
#> Enterobacter aerogenes et rel.           0.349241669
#> Enterococcus                             0.709904370
#> Escherichia coli et rel.                 0.524142125
#> Eubacterium biforme et rel.              0.750869548
#> Eubacterium cylindroides et rel.         0.633016505
#> Eubacterium hallii et rel.               0.661554759
#> Eubacterium limosum et rel.              0.663801438
#> Eubacterium rectale et rel.              0.001048128
#> Eubacterium siraeum et rel.              0.001048128
#> Eubacterium ventriosum et rel.           0.128788309
#> Faecalibacterium prausnitzii et rel.     0.644096129
#> Fusobacteria                             0.555685439
#> Haemophilus                              0.622367506
#> Helicobacter                             0.314897458
#> Klebisiella pneumoniae et rel.           0.753363113
#> Lachnobacillus bovis et rel.             0.531631713
#> Lachnospira pectinoschiza et rel.        0.413071378
#> Lactobacillus catenaformis et rel.       0.739650350
#> Lactobacillus gasseri et rel.            0.730725491
#> Lactobacillus plantarum et rel.          0.185560537
#> Lactobacillus salivarius et rel.         0.202934379
#> Lactococcus                              0.001048128
#> Leminorella                              0.230251123
#> Megamonas hypermegale et rel.            0.694628028
#> Megasphaera elsdenii et rel.             0.211992016
#> Mitsuokella multiacida et rel.           0.729784685
#> Moraxellaceae                            0.291006041
#> Oceanospirillum                          0.734683936
#> Oscillospira guillermondii et rel.       0.683147265
#> Outgrouping clostridium cluster XIVa     0.465352252
#> Oxalobacter formigenes et rel.           0.295900369
#> Papillibacter cinnamivorans et rel.      0.452280674
#> Parabacteroides distasonis et rel.       0.444258228
#> Peptococcus niger et rel.                0.748523402
#> Peptostreptococcus micros et rel.        0.510876670
#> Phascolarctobacterium faecium et rel.    0.046609160
#> Prevotella melaninogenica et rel.        0.469565314
#> Prevotella oralis et rel.                0.477391811
#> Prevotella ruminicola et rel.            0.586679550
#> Prevotella tannerae et rel.              0.455649683
#> Propionibacterium                        0.440568538
#> Proteus et rel.                          0.267804676
#> Roseburia intestinalis et rel.           0.711504977
#> Ruminococcus bromii et rel.              0.226119652
#> Ruminococcus callidus et rel.            0.733204120
#> Ruminococcus gnavus et rel.              0.016099651
#> Ruminococcus lactaris et rel.            0.485267601
#> Ruminococcus obeum et rel.               0.044620773
#> Serratia                                 0.497287487
#> Sporobacter termitidis et rel.           0.503412730
#> Staphylococcus                           0.050978693
#> Streptococcus bovis et rel.              0.233413109
#> Streptococcus intermedius et rel.        0.506043136
#> Streptococcus mitis et rel.              0.731204594
#> Subdoligranulum variable at rel.         0.239512504
#> Sutterella wadsworthia et rel.           0.628461198
#> Tannerella et rel.                       0.520336541
#> Uncultured Bacteroidetes                 0.415039103
#> Uncultured Clostridiales I               0.753587735
#> Uncultured Clostridiales II              0.202319965
#> Uncultured Mollicutes                    0.525998562
#> Uncultured Selenomonadaceae              0.083965037
#> Veillonella                              0.065112613
#> Vibrio                                   0.196090749
#> Weissella et rel.                        0.684806123
#> Xanthomonadaceae                         0.738193936
#> Yersinia et rel.                         0.106486396
DCtaxa_tab(summary.result$q_values, groupvar = "nationalityAFR", alpha=0.05)
#> $DCtaxa_complete_tab
#>                                       nationalityAFR
#> Bacteroides fragilis et rel.             0.035519751
#> Bacteroides ovatus et rel.               0.044822294
#> Bacteroides vulgatus et rel.             0.034013932
#> Coprobacillus catenaformis et rel.       0.001459571
#> Eubacterium rectale et rel.              0.001048128
#> Eubacterium siraeum et rel.              0.001048128
#> Lactococcus                              0.001048128
#> Phascolarctobacterium faecium et rel.    0.046609160
#> Ruminococcus gnavus et rel.              0.016099651
#> Ruminococcus obeum et rel.               0.044620773
#> 
#> $DCtaxa_names_only
#>  [1] "Bacteroides fragilis et rel."         
#>  [2] "Bacteroides ovatus et rel."           
#>  [3] "Bacteroides vulgatus et rel."         
#>  [4] "Coprobacillus catenaformis et rel."   
#>  [5] "Eubacterium rectale et rel."          
#>  [6] "Eubacterium siraeum et rel."          
#>  [7] "Lactococcus"                          
#>  [8] "Phascolarctobacterium faecium et rel."
#>  [9] "Ruminococcus gnavus et rel."          
#> [10] "Ruminococcus obeum et rel."
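
The variable name nationalityAFR used above (and cat_alcohol1 and cat_alcohol2 in Example I) follows R's usual naming for dummy-coded factors: the factor name followed by the non-reference level. The sketch below shows this with made-up data and base R's model.matrix. It assumes default treatment contrasts; whether pseudoreg builds its design matrix in exactly this way is an assumption.

```r
# Toy metadata with two categorical predictors; the values are invented.
toy <- data.frame(
  nationality = factor(c("AAM", "AFR", "AAM", "AFR")),
  cat_alcohol = factor(c(0, 1, 2, 1))
)

# Column names of the design matrix under R's default treatment contrasts.
# The first level of each factor (AAM, 0) is the reference and gets no column.
design_names <- colnames(model.matrix(~ nationality + cat_alcohol, data = toy))
design_names
```

When passing varname or groupvar to the helper functions, use the dummy-coded name as it appears in the q-value table rather than the original column name.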

References

[1] Ahn S, Datta S. (2023). Differential Co-Abundance Network Analyses for Microbiome Data Adjusted for Clinical Covariates Using Jackknife Pseudo-Values. Under Review at \(\textit{BMC Bioinformatics}\).

[2] McDonald D. et al. (2018). American Gut: an Open Platform for Citizen Science Microbiome Research. \(\textit{mSystems}\). 3(3), e00031–18

[3] O’Keefe SJ. et al. (2015). Fat, fibre and cancer risk in African Americans and rural Africans. \(\textit{Nat Commun}\). 6, 6342