| Type: | Package |
| Title: | VCG Sampling using Energy-Based Covariate Balancing |
| Version: | 0.9.2 |
| Description: | Provides a principled framework for sampling Virtual Control Group (VCG) using energy distance-based covariate balancing. The package offers visualization tools to assess covariate balance and includes a permutation test to evaluate the statistical significance of observed deviations. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Depends: | ggplot2 |
| Imports: | ggforce, osqp, patchwork |
| Suggests: | knitr, rmarkdown |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-12-16 15:18:45 UTC; I0559285 |
| Author: | Andreas Schulz [aut, cre] (Created in 2025), Sanofi [cph, fnd] |
| Maintainer: | Andreas Schulz <andreas.schulz2@sanofi.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-19 20:10:17 UTC |
The function attempts to find the optimal size for VCG.
Description
The function tries out different sizes of VCG and searches for the smallest distance.
Usage
BestVCGsize(formula, data = data, plot = TRUE)
Arguments
formula |
A formula specifying the treated and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'. The treated variable must be binary (0=pool, 1=treated) |
data |
A data frame containing the variables specified in the formula. |
plot |
Logical. If 'TRUE', returns a ggplot2 plot. Default: TRUE |
Details
It is only intended for exploratory purposes, as the VCG size is normally given. But it can be used to see how well the given size fits. The recommendation for VCG size is based solely on distance and does not take into account other aspects such as power or validity.
Value
If 'plot = TRUE', returns a list with:
optimal_n |
The estimated optimal VCG size (integer). |
plot |
A ggplot2 object visualizing the energy distance curve and plateau. |
Examples
set.seed(2342)
dat <- data.frame(
treat = rep(0:1, c(50, 30)),
cov1 = c(rnorm(50, 11, 2), rnorm(30, 10, 1)),
cov2 = c(rnorm(50, 12, 2), rnorm(30, 10, 1)),
cov3 = c(rnorm(50, 9, 2), rnorm(30, 10, 1))
)
BestVCGsize(treat ~ cov1 + cov2 + cov3, data=dat)
VCG Sampler for Energy Distance Balancing
Description
This function performs energy distance based balancing and selects a subset from pool based on energy distance to approximate a randomized control trial. Optionally, it visualizes the balancing results.
Usage
VCG_sampler(formula, data, n, c_w = NULL, random = FALSE, plot = TRUE)
Arguments
formula |
A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'. The treated variable must be binary (0=pool, 1=treated) |
data |
A data frame containing the variables specified in the formula. |
n |
Integer. Number of observations to sample from the pool, or a vector of n for each stratum |
c_w |
Optional: Vector of positive weights for covariates, reflecting the relative importance of the covariates for balancing. |
random |
Logical. If 'TRUE', the distance is used as the probability for selecting the observation; otherwise, the nearest observations are used (deterministic). Default: FALSE |
plot |
Logical. If 'TRUE', returns a visualization of the balancing effect. |
Details
If random is set to FALSE, the function selects the top 'n' units from the pool with the lowest energy distance and assigns them to the VCG group. If random is set to TRUE, the function samples 'n' units from pool with sampling probability inversely proportional to energy distance. The quality of covariate balancing is visualized using differences in medians and median absolute deviations (MADs). Permutation ellipses are generated by randomly permuting the pool and treated groups to estimate usual (random) variability. Only the X and Y axes are computed directly; the ellipse is interpolated between the axes. This method is intended as a visual approximation rather than a precise statistical test.
Value
If 'plot = TRUE', returns a list with:
A data frame with added columns:
'VCG': Indicator for selected pool units. VCG==1 indicates the VCG selected.
'e_weights': Energy weights used for selection
'<treated>_balanced': A factor indicating balanced treated assignment.
A ggplot2 object showing the median and MAD differences before and after balancing, with a 95
If 'plot = FALSE', returns only the modified data frame.
Examples
dat <- data.frame(
cov1 = rnorm(50, 10, 1),
cov2 = rnorm(50, 7, 1),
cov3 = rnorm(50, 5, 1),
treated = rep(c(0, 1), c(35, 15))
)
VCG_sampler(treated ~ cov1 + cov2 + cov3, data=dat, n=5)
Combine data from pool and treated groups
Description
If your data is stored in separate files, you can use this function to merge them.
Usage
combine_data(POOL_data, TG_data, indicator_name = "treated")
Arguments
POOL_data |
Data frame with POOL data, where you want to sample from. |
TG_data |
Data frame with TG (treated groups) data, all treated groups together! |
indicator_name |
Name of the variable that is created for further use in the package, Default: 'treated' |
Value
Data frame with all covariates that were present in both files and with new indicator factor POOL vs TG
Examples
pool_data <- data.frame(
cov1 = rnorm(100, 11, 2),
cov2 = rnorm(100, 11, 2),
cov3 = rnorm(100, 11, 2),
sex = rbinom(100, 1, 0.5))
tg_data <- data.frame(
cov2 = rnorm(20, 12, 1),
cov3 = rnorm(20, 12, 1),
cov4 = rnorm(20, 12, 1),
sex = rbinom(20, 1, 0.5))
dx <- combine_data(pool_data, tg_data)
str(dx)
Compute Energy Distance Between Two Groups
Description
Calculates the energy distance between two groups.
Usage
energy_distance(formula, data, standardized = TRUE)
Arguments
formula |
A formula specifying the treated and covariates, e.g., 'treated ~ cov1 + cov2'. The treated variable must be binary (0=pool, 1=treated) |
data |
A data frame containing the variables specified in the formula. |
standardized |
If TRUE, the standardized energy distance that lies in the range 0 to 1 is returned, the so-called E-coefficient. If FALSE, not scaled energy distance is returned that can be >1. |
Details
Energy distance is a non-parametric measure of distributional difference. It is sensitive to differences in location, scale, and shape between groups. Before calculation, the covariates are scaled to a mean value of 0 and a standard deviation of 1.
Value
A numeric value representing the energy distance between the two groups.
Examples
dat <- data.frame(
treated = rep(0:1, c(50, 30)),
age = c(rnorm(50, 5, 2), rnorm(30, 5, 1)),
weight = c(rnorm(50, 11, 2), rnorm(30, 10, 1)),
class = c(rbinom(50, 3, 0.6), rbinom(30, 3, 0.4))
)
energy_distance(treated ~ age + weight + class, data=dat)
Permutation Energy Test for Covariate Imbalance
Description
Performs a permutation-based energy distance test to assess whether two groups (defined by a binary treated variable) are balanced across a set of covariates. Optionally, it visualizes the distribution of permuted energy distances and highlights the observed test statistic and critical value.
Usage
energy_test(formula, data, alpha = 0.05, R = 2000, plot = TRUE)
Arguments
formula |
A formula specifying the treated and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'. |
data |
A data frame containing the variables specified in the formula. |
alpha |
Significance level for the test (default is 0.05). |
R |
Number of permutations to perform (default is 2000). |
plot |
Logical. If 'TRUE', returns a ggplot2 visualization of the permutation distribution. |
Details
The energy distance is a non-parametric measure of distributional difference. This test evaluates whether the covariate distributions between two groups are statistically distinguishable. A small p-value indicates imbalance between groups. A one-sided test is used because the energy distance is strictly positive; only values greater than the observed statistic in the permutation distribution are relevant.
Value
If 'plot = TRUE', returns a list with:
A list of class '"htest"' containing:
'p.value': The permutation p-value.
'estimate': The observed energy distance.
'critical.value': The critical value at the specified alpha level.
'alternative': The alternative hypothesis ("one.sided").
'method': Description of the test.
'n.permutations': Number of permutations performed.
'permutations': Vector of permuted energy distances.
A ggplot2 object showing the histogram of permuted distances, with vertical lines for the observed statistic and critical value.
If 'plot = FALSE', returns only the '"htest"' result list.
See Also
Examples
dat <- data.frame(
treated = rep(0:1, c(50, 30)),
age = c(rnorm(50, 5, 2), rnorm(30, 5, 1)),
weight = c(rnorm(50, 11, 2), rnorm(30, 10, 1)),
class = c(rbinom(50, 3, 0.6), rbinom(30, 3, 0.4))
)
energy_test(treated ~ age + weight + class, data=dat, R = 500)
Multi-Sample VCG Generator and Overlap Visualization
Description
Repeatedly samples VCGs (via 'VCG_sampler' and 'random=TRUE') from the pool, optionally plots the overlap of VCGs.
Usage
multiSampler(formula, data, n, c_w = NULL, Nsamples = 20, plot = TRUE)
Arguments
formula |
A formula specifying the treated and covariates, e.g., 'treated ~ cov1 + cov2'. The treated variable must be binary (0=pool, 1=treated) |
data |
A data frame containing the variables specified in the formula. |
n |
Integer. Number of observations to sample from the pool. Or a vector of n for each stratum. |
c_w |
Optional: Vector of positive weights for covariates, reflecting the relative importance of the covariates for the balancing. |
Nsamples |
Number of VCGs to generate (default is 20). |
plot |
Logical; if 'TRUE', returns a ggplot2 plot showing the overlap of VCGs (default is 'TRUE'). |
Details
The function repeatedly calls 'VCG_sampler' with 'random' set to TRUE to generate multiple VCG groups. It calculates the frequency of selection for each observation and computes the average percentage of overlapping observations. This function should only be used if you really need multiple VCG, e.g. for PoC studies. It is not intended for selecting one VCG from them afterwards! In this case, the VCG_sampler function should be used directly and only one VCG should be generated.
Value
If 'plot = TRUE', returns a list with:
- data
The original data frame with additional VCG columns ('VCG_1', ..., 'VCG_Nsamples').
- p
A 'ggplot2' object showing the number of times each observation was selected across VCG samples.
If 'plot = FALSE', returns the modified data frame only.
Examples
dat <- data.frame(
treat = rep(0:1, c(50, 30)),
cov_1 = c(rnorm(50, 5, 2), rnorm(30, 5, 1)),
cov_2 = c(rnorm(50, 11, 2), rnorm(30, 10, 1))
)
result <- multiSampler(treat ~ cov_1 + cov_2, data = dat, n = 10, Nsamples = 10)
Visualize Covariate Distribution Across TG, VCG, and POOL
Description
Creates a plot to compare the distribution of a selected variable across three groups: TG (treated groups), VCG (virtual control group), and POOL (data pool).
Usage
plot_var(data, what = NULL, stratum = "in_stratum", group = "VCG", title = "")
Arguments
data |
A balanced data frame (output of the VCG_sampler function) |
what |
A string specifying the name of the variable to be visualized. |
stratum |
A string specifying the name of the stratum variable (default is '"in_stratum"') |
group |
A string specifying the column name used to define group membership (default is '"VCG"'). |
title |
Optional title for the plot. |
Details
The function uses energy distance to quantify distributional differences between groups. For continuous variables, it overlays dashed lines for TG group statistics (mean, min, max) and displays sample sizes. For categorical variables, it uses color-coded bars and cumulative proportion lines to highlight imbalance.
Value
A ggplot2 object showing either:
A boxplot for continuous variables (more than 4 unique values).
A proportional bar chart for categorical variables (2–4 unique values).
Examples
dat <- data.frame(
cov1 = rnorm(50, 10, 1),
cov2 = rnorm(50, 7, 1),
cov3 = rnorm(50, 5, 1),
treated = rep(c(0, 1), c(35, 15))
)
out <- VCG_sampler(treated ~ cov1 + cov2 + cov3, data=dat, n=5, plot=FALSE)
plot_var(out, what='cov1', group='VCG')
plot_var(out, what='cov2', group='VCG')
Robust Scaling of Numeric and Categorical Variables
Description
Applies robust scaling to numeric and categorical variables. For numeric variables, the function centers by the median and scales by the MAD. For categorical variables with 2–4 unique levels, it applies a custom transformation to map them to numeric values.
Usage
robust_scale(x, group)
Arguments
x |
A numeric vector, factor, matrix, or data frame. If a matrix or data frame is provided, scaling is applied column-wise. |
group |
vector indicating which group is the TG to scale to |
Details
This function is designed to make numeric and categorical variables comparable. This is an internal function that should not be used by package users.
Value
A scaled numeric vector or a data frame with scaled columns.
Examples
dat<-data.frame(x=rnorm(100, 10, 3), sex=factor(rbinom(100, 1, 0.5), labels=c("M","F")))
x<- robust_scale(dat$x, dat$sex)
round(median(x), 2)
round(mad(x), 2)