The KSPM package was implemented to fit the single and multiple kernel semi-parametric models for continuous outcome. The package includes the most popular kernel functions, allows kernel interactions and test of variance for each kernel part. The summary(), plot() and predict() commands are available and are similar to those included in standard regression packages. We added a graphical tool for the interpretation of individual effect of variables based on pointwise derivatives. Finally, a variable selection procedure is implemented for the single kernel semi-parametric model.
Some key theoretical details are given in the first section. Then, two examples illustrate how to use the package. The first example is about prediction of movie ratings according to conventional and social media features. The second example is a temporal set of data on energy consumption and meteorology.
Note that the KSPM package depends on CompQuadForm, expm and DEoptim packages that should be installed before the importation of KSPM via library() command.
Let \(Y = (Y_1, ..., Y_n)^T\) be a \(n \times 1\) vector of continuous outcomes where \(Y_i\) is the value for subject \(i = 1, ..., n\), and \(X\) and \(Z\) are \(n \times p\) and \(n \times q\) matrices of predictors, such that \(Y\) depends on \(X\) and \(Z\) as follows:
\[Y = X\beta + h(Z) + e\]
where \(\beta\) is a \(p \times 1\) vector of regression coefficients for the linear parametric part, \(h(.)\) is an unknown smooth function and \(e\) is an \(n \times 1\) vector of independent and homoscedastic errors such as \(e_i \sim \mathcal{N}(0, \sigma^2)\). \(X\beta\) is referred to as the linear part of the model and we assume that \(X\) contains an intercept. \(h(Z)\) is referred to as the kernel part. The function \(h(.)\) lies in \(\mathcal{H}_k\), the function space generated by a kernel function \(k(.,.)\). \(K\) is the \(n \times n\) Gram matrix of \(k(.,.)\) such that \(K_{ij} = k(Z_i, Z_j)\) represents the similarity between subjects \(i\) and \(j\) according to \(Z\). In the estimation process, \(h()\) is expressed as a linear combination of \(k(.,.)\) such that:
\[h = K \alpha\]
Now suppose there are \(L\) matrices \(Z_1, ..., Z_L\) of dimension \(n \times q_1\), …, \(n \times q_L\), respectively. Keeping a similar notation, \(Y\) can be assumed to depend on \(X\) and \(Z_1\), …, \(Z_L\) as follows:
\[Y = X\beta + \sum\limits_{\ell = 1}^L h_{\ell}(Z_{\ell}) + e\]
where \(\forall \ell \in \left\lbrace 1, ..., L \right\rbrace\), each \(h_{\ell}\) is an unknown smooth function from \(\mathcal{H}_{k_{\ell}}\), the function space generated by a kernel function \(k_{\ell}(.,.)\) whose Gram matrix is noted \(K_{\ell}\).
Suppose there exists an interaction between two sets of variables \(Z_1\) and \(Z_2\), \(n \times q_1\) and \(n \times q_2\) matrices. We assume that \(Y\) depends on \(X\), \(Z_1\) and \(Z_2\) as follows:
\[Y = X\beta + h_1(Z_1) + h_2(Z_2) + h_{12}(Z_1, Z_2) + e\]
where \(h_1(.)\), \(h_2(.)\) and \(h_{12}(.,.)\) are unknown smooth functions from \(\mathcal{H}_{k_1}\), \(\mathcal{H}_{k_2}\) and \(\mathcal{H}_{k_{12}}\) respectively. The \(h_{12}(.,.)\) function represents the interaction term. Its associated kernel function \(k_{12}(.,.)\) is defined as the product of kernel functions \(k_1(.,.)\) and \(k_2(.,.)\). Obviously, this 2-way interaction model could be generalized to higher order interaction terms.
Below is the list of all kernel functions suported in this package, where \((i,j)\) represents a pair of subjects with covariates \(Z_i\) and \(Z_j\), respectively:
Of note, users can define their own kernel functions by providing the corresponding kernel Gram matrices.
Kernel functions involve tuning parameters, noted \(\rho\) and \(\gamma\) above. In practice, these tuning parameters are chosen by the user or estimated by the package.
The parameter estimates are obtained by maximizing the penalized log-likelihood associated with the kernel semi-parametric model. Penalisation and tuning parameters are estimated by minimizing the leave-one-out error. Estimated parameters are:
For either a single kernel or a multiple kernel semi-parametric model, a standard test of interest is \(H_0: h_{\ell}(.) = 0\), ie. there is no effect, singly or jointly, of a set of variables on the outcome. If interest lies in a global test for multiple kernel semi-parametric models, the null hypothesis is \(H_0: h_1(.) = ... = h_L(.) = 0\). Both tests are available in the package. They consist in score test for variance component in the equivalent linear mixed effect model.
A kernel represents similarity between subjects through combining a set of variables in a possible complex way, that may include their possible interactions. Hence, the general effect of a kernel on an outcome is hard to see and difficult to interpret. Nevertheless, the marginal effect of each variable in the kernel may be illustrated using a partial derivative function. Indeed, the derivative measures how the change in a variable of interest impacts the outcome in an intuitively understandable way.
The conventional and social media movies (csm) dataset contains data on 187 movies from 2014 and 2015. To predict ratings, movies are described using a set of 6 variables called conventional features (genre, sequel, budget in USD, gross income in USD, number of screens in USA) and another set of 6 variables called social media features (aggregate actor followers on Twitter, number of views, likes, dislikes, comments of movie trailer on Youtube, sentiment score). The dataset may be loaded from the package using the data() command. This data set is a subset of the one available on the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and fully described in Ahmed et al. (2015).
The model is fitted using the kspm() command. Its principal arguments are:
The kernel argument is defined by a formula of Kernel() objects. In our example, we assume a Gaussian kernel function to fit the joint effect of the conventional features on ratings. By leaving rho = NULL, we do not provide any value for the \(\rho\) parameter and it will be estimated at the same time as the penalization parameter by minimizing the leave-one-out error. Alternatively, \(\rho\) can fixed by the user using the argument rho=.
While the model is running, the kspm() command prints some indications about the kernel(s) included in the model. Standard functions such as summary() can also be used to print results.
The summary output gives information about the sample size used in the model, the residual distribution, the coefficient for the linear part (similarly to other regression R packages), and the penalization parameter and score test associated with the kernel part. In such a complex model, the number of free parameters, i.e. the standard definition of the degrees of freedom of a model is undefined, and we use instead the effective degrees of freedom of the model, which is not necessary a whole number. However, our definition for effective degrees of freedom is similar to the one used in linear regression models and depends on the trace of the hat matrix.
The plot() command displays standard diagnostic plots for regression models, such as residuals versus fitted values, residuals versus leverage or residuals versus cook’s distance, using the argument which=. Note that outliers are anotated.
par(mfrow = c(2,2), mar = c(5, 5, 5, 2))
plot(csm.fit1, which = c(1, 3, 5), cex.lab = 1.5, cex = 1.3)
hist(csm$Ratings, cex.lab = 1.5, main = "Histogram of Ratings", xlab = "Ratings")
Of note, the lower tail of residuals distribution is not as expected under normality. The histogram of ratings shows that could be due to the skewed distribution of the response variable.
Like \(\beta\) coefficients in linear regression models, the derivative function may help to interpret the effect of each variable individually on the outcome. The derivatives() command computes partial derivatives of the estimated kernel function according to each variable included in the kernel. Therefore, it illustrates their individual effects on the outcome at each point of the dataset.
The sign of the derivatives corresponds to the direction of variation of the function capturing the effect of variable on the outcome. The plot() command applied to a derivatives object gives a graphical interpretation of these effects.
par(mfrow = c(1,2), mar = c(5,5,5,2))
plot(derivatives(csm.fit1), subset = "Gross", main = "Pointwise derivatives according to Gross Income", cex.main = 0.8)
plot(derivatives(csm.fit1), subset = "Screens", col = csm$Sequel, pch = 16, main = "Pointwise derivatives according to \n Number of Screens and Sequel", cex.main = 0.8, ylim = c(-0.4, 0.8))
legend("topleft", fill = palette()[1:7], legend = 1:7, title = "Sequel", horiz = TRUE, cex = 0.7)
By default, the X-axis label gives the name of the variable and, in brackets, the kernel in which it is included. When only one kernel is included in the kernel, it is named Ker1. Because genre and sequel variables, even if they are numeric, refer to categorical variables, it is possible to easily highlight some pattern of interaction between these variables and other covariates. Derivatives according to Gross income are mostly positive meaning that higher Gross income is associated with higher ratings. However, the slope of this effect decreases as the gross income grows. However it is difficult to interpret the derivatives for Gross income higher than 2.5e+08 since this category includes only a few movies. According to the second graphic, whether a movie has sequels seems to interact with the effect of the number of screens on the ratings. Indeed when the movie is the first or second release (Sequel = 1 or 2), the derivatives are always negative meaning that higher the number of screens on which the movie is projected, lower the ratings, but the effect seems to be stronger for first release than second. No clear pattern can be observed for subsequent sequels. It is difficult to know if this reveals an absence of effect or whether there are simply too few movies past sequel 2 in these data.
We fit a second model assuming a polynomial kernel function with fixed tuning parameters (\(\rho=1\), \(\gamma=1\) and \(d=2\)).
This model can be compared to the previous one using some information criteria such as the AIC or BIC. By default the extractAIC() command gives the AIC. A better model is a model with a lower AIC value.
Of note, in this section kspm() function is ask to estimate tuning parameters at the same time than penalisation parameter such that it could be time-consuming.
We assume a single kernel semi-parametric model with a gaussian kernel function. The kernel part contains the set of social media features. We want to select the relevant variables to be included in the kernel. We performed a stepwise variables selection procedure based on AIC, by letting \(\rho\) to vary at each iteration. To do so, we first fitted the full model including all features:
# NOT RUN
csm.fit4 <- kspm(response = "Ratings", linear = NULL, kernel = ~Kernel(~ Sentiment + Views + Likes + Dislikes + Comments + Aggregate.Followers, kernel.function = "gaussian"), data = csm, level = 0, control = kspmControl(parallel = TRUE))
Then, we apply the stepKSPM() command on the full model as follows:
# NOT RUN
stepKSPM(csm.fit4, kernel.lower = ~1, kernel.upper = ~ Sentiment + Views + Likes + Dislikes + Comments + Aggregate.Followers, direction = "both", k = 2, kernel.param = "change", data = csm)
The stepKSPM() command prints the details of each iteration. At the end, the model excluded the variables Sentiment and Views.
Our second example is a set of data on energy consumed each hour on Ouessant island (France) from September the 13th, 2015 to October the 4th, 2015. The data set also contains corresponding meteorologic data such as temperature (Celsius degrees), pressure (Pa) and humidity rate (g/m\(^3\)). These measures were derived from those that were made publicly available by Electricite de France on https://iles-ponant-edf-sei.opendatasoft.com and Meteo France on https://www.infoclimat.fr. In total the energy data set contains 504 measurements.
These data demonstrate strong periodicity depending on time of day. The following code has been generated to display the data.
par(mfrow = c(1,2), mar = c(5,5,2,2))
# energy among all the measurements
plot(energy$power, type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n")
axis(1, at = 1 + 24 * (0:21), labels = unique(energy$date))
# examples from three days
plot(c(NA,energy[1:26, "power"]), type = "b", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", col = "darkgreen", lwd = 2, cex = 0.8, xlim = c(-1, 30))
lines(energy[24:50, "power"], type = "b", col = "blue", lwd = 2, cex = 0.8, pch = 0)
lines(energy[48:74, "power"], type = "b", col = "red", lwd = 2, cex = 1, pch = 17)
axis(1, at = c(1, 7, 13, 19, 25), labels = c("0h00", "6h00", "12h00", "18h00", "0h00"))
legend("topleft", col = c("darkgreen", "blue", "red"), legend = c("Sep 13, 2015", "Sep 14, 2015", "Sep 15, 2015"), lwd = 2, pch = c(1, 0, 17))
abline(v = 24.9, lty = 2)
text(25.5, 730, "next day", adj = 0)
To predict the energy consumption, we will consider the 408 first measurements (17 days) as the training set. The other 96 days will be used as a test set.
We fit a single kernel semi-parametric model to the training data. We assumed that energy depends linearly on temperature, and therefore this variable is included in the linear part of the model. The other meteorologic data as well as hours in 24-hour numeric format are included in the kernel part of the model. We used a gaussian kernel.
We recomputed the model with different values of the tuning parameter \(\rho\) to explore sensitivity of the results to this key kernel paramater. The actual value is:
Other values were chosen ten fold smaller and larger than the estimated value of \(0.70\).
energy.fit2 <- kspm(response = "power", linear = ~Temperature, kernel = ~Kernel(~hour.num + P + HR, kernel.function = "gaussian", rho = 0.07) , data = energy_train_, level = 0)
energy.fit3 <- kspm(response = "power", linear = ~Temperature, kernel = ~Kernel(~hour.num + P + HR, kernel.function = "gaussian", rho = 7) , data = energy_train_, level = 0)
We display the predictions obtained on both the training and test data sets, as well as the derivatives, as a function of the hours variable,for the three models energy.fit1, energy.fit2 and energy.fit3 that differ only on the value of the tuning parameter \(\rho\).
### parameters for figures panel
par(oma = c(1, 4, 6, 1))
par(mfrow = c(4,3), mar = c(5,5,1,1))
### kspm.fit1 (rho = 0.7)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit1, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit1, newdata.linear = energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit1), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))
### kspm.fit3 (rho = 0.07)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit2, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit2, newdata.linear = energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit2), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))
### kspm.fit2 (rho = 7)
# predictions with confidence intervals on train_
plot(energy_train_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_train_$date)[1:3])
pred_train_ <- predict(energy.fit3, interval = "confidence")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# predictions with prediction intervals on test_
plot(energy_test_[1:72, "power"], type = "l", xlab = "Time", ylab = "Power", cex.lab = 1.5, xaxt = "n", lwd = 2, ylim = c(300, 750))
axis(1, at = 1 + 24 * (0:2), labels = unique(energy_test_$date)[1:3])
pred_train_ <- predict(energy.fit3, newdata.linear = energy_test_, newdata.kernel = list(Ker1 = energy_test_), interval = "prediction")
lines(pred_train_$fit, col = "red")
lines(pred_train_$lwr, col = "blue", lty = 2)
lines(pred_train_$upr, col = "blue", lty = 2)
# derivatives
plot(derivatives(energy.fit3), subset = "hour.num", xaxt = "n", ylab = "Derivatives", cex.lab = 1.5, ylim = c(-1000,1000))
axis(1, at = c(0, 6, 12, 18), labels = c("0h00", "6h00", "12h00", "18h00"))
# Legends
plot.new()
legend("topleft", lty = c(1,1,2), col = c("black", "red", "blue"), legend = c("True data", "Predictions", "Confidence intervals"), cex = 2, bty = "n")
plot.new()
legend("topleft", lty = c(1,1,2), col = c("black", "red", "blue"), legend = c("True data", "Predictions", "Prediction intervals"), cex = 2, bty = "n")
plot.new()
legend("topleft", pch = 1, col = c("black"), legend = c("Pointwise derivatives \n 1 point = 1 measure"), cex = 2, bty = "n")
### legends on the left
par(fig = c(0,0.05,0,0.25), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.8, "Legend", srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.26,0.5), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 7")), srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.5,0.72), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 0.07")), srt = 90, cex = 2, adj = 0.5)
par(fig = c(0,0.05,0.72,0.97), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.1, 0.5, expression(paste(rho, " = 0.7")), srt = 90, cex = 2, adj = 0.5)
### legends on the top
par(fig = c(0.05,0.36,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Predictions on training data set", cex = 2, adj = 0.5)
par(fig = c(0.37,0.68,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Predictions on test data set", cex = 2, adj = 0.5)
par(fig = c(0.69,1,0.92,1), oma = c(0,0,0,0), mar = c(0,0,0,0),new = TRUE)
plot.new()
text(0.5, 0.5, "Derivatives for hour.num", cex = 2, adj = 0.5)
The first row corresponds to the model with the value of \(\rho\) estimated by the model. Predictions fit well for both the training and the test sets. The derivative graph shows that between 6h00 and 12h00 and between 18h00 and 23h00 the derivatives are positive. It is coherent with the increase of energy consumption during these times of the day. Inversely, between 0h00 and 6h00 and between 12h00 and 18h00, the energy consumption decreases as showed by negative values of derivatives. One might have expected that derivative values should be close between 0h00 and 23h00 but the consumption peak observed at 23h00 everyday may explain a sudden change in derivative values that is difficult to smooth.
Ahmed M, Jahangir M, Afzal H, Majeed A, Siddiqi I (2015). “Using Crowd-Source Based Features from Social Media and Conventional Features to Predict the Movies Popularity.” In IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pp 273-278. IEEE.