Vignette 3: Alternative approaches

In this vignette, we consider a simple scenario with a continuous dependent variable and a set of continuous predictors. First, we load the required packages and store the example dataset GSPCRexdata (see the help file ?GSPCRexdata for details) in two separate objects:

# Load R packages
library(gspcr) # this package!
library(superpc) # alternative comparison package
library(patchwork) # combining ggplots

# Load data
X <- GSPCRexdata$X$cont
y <- GSPCRexdata$y$cont
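
Before comparing anything, we can sanity-check the objects we just loaded (a quick look; both ship with the package):

# Inspect the dimensions of the predictor set and the length of the outcome
dim(X)
length(y)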

# Define names of measures to compare
fit_measure_vec <- c("LRT", "PR2", "MSE", "F", "AIC", "BIC")

1 Compare results with superpc

This package extends the superpc R package by introducing more fit measures and threshold types, which allows a wider range of variable types to be considered. Here we show that the results obtained by superpc can be replicated by gspcr when using the same fit measure and threshold type.

To use superpc, we need to prepare the data in a different format: a list containing the standardized predictor matrix transposed so that features are on the rows, the outcome, and the feature names.

# Prepare data for superpc
data.train <- list(
    x = t(as.matrix(scale(X))),
    y = y,
    featurenames = colnames(X)
)

We can then train the model with the superpc::superpc.train() function, cross-validate it with superpc::superpc.cv(), and inspect the cross-validation solution paths.

# Train the model (computes the scores for each feature)
train.obj <- superpc.train(
    data = data.train,
    type = "regression"
)

# Cross-validate the model
cv.obj <- superpc.cv(
    fit = train.obj,
    data = data.train,
    min.features = 1,
    max.features = nrow(data.train$x),
    n.fold = 10,
    n.threshold = 20,
    n.components = 5
)

# Cross-validation solution paths
cv.obj_plot <- superpc.plotcv(cv.obj)
Figure 1: Solution paths obtained with superpc.

Finally, we can specify and train gspcr in the same way and compare the cross-validation solution paths.

# Train gspcr with the same specification as superpc
gspcr_superpc <- cv_gspcr(
    dv = y,
    ivs = X,
    fit_measure = "F",
    thrs = "normalized",
    nthrs = 20,
    npcs_range = 1:5,
    K = 10,
    min_features = 1,
    max_features = ncol(X)
)

# Plot the cross-validation curves
plot(
    gspcr_superpc,
    errorBars = TRUE,
    discretize = FALSE
)
Figure 2: Solution paths obtained with gspcr.

We can also see that the computed threshold values are exactly the same:

# Report the threshold values
data.frame(
    superpc = round(cv.obj$thresholds, 3),
    gspcr = round(gspcr_superpc$thr, 3),
    diff = round(cv.obj$thresholds - gspcr_superpc$thr, 3)
)
   superpc gspcr diff
1    0.051 0.051    0
2    0.398 0.398    0
3    0.745 0.745    0
4    1.093 1.093    0
5    1.440 1.440    0
6    1.787 1.787    0
7    2.135 2.135    0
8    2.482 2.482    0
9    2.829 2.829    0
10   3.177 3.177    0
11   3.524 3.524    0
12   3.871 3.871    0
13   4.219 4.219    0
14   4.566 4.566    0
15   4.913 4.913    0
16   5.260 5.260    0
17   5.608 5.608    0
18   5.955 5.955    0
19   6.302 6.302    0
20   6.650 6.650    0
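
Rather than eyeballing the table, we can also verify the match programmatically using the two objects created above:

# Confirm that the two threshold grids are numerically identical
all.equal(as.numeric(cv.obj$thresholds), as.numeric(gspcr_superpc$thr))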

2 Is K-fold cross-validation working?

In this section, we want to showcase the effectiveness of using cross-validation to select the number of PCs for GSPCR. We can estimate the same solution paths for GSPCR with and without K-fold cross-validation by changing the number of folds used. If we set K = 1 in the cv_gspcr() call, all of the data is assigned to a single fold, and the fit measures are evaluated on the same data the model is trained on.

First, let’s estimate the solution paths with K-fold cross-validation. To keep the comparison simple, we specify two options for the number of PCs (5 and 20), but any other set of values could be used. We train the model with the six available fit measures stored in the object fit_measure_vec.

# Train the GSPCR model with two options for the number of PCs
out_fit_meas_cv <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        K = 10,
        npcs_range = c(5, 20),
        fit_measure = i,
        thrs = "normalized"
    )
})

# Plot solution paths
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas_cv[[i]],
        y = fit_measure_vec[[i]],
        labels = TRUE,
        y_reverse = rev,
        errorBars = FALSE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
Figure 3: Solution paths for different fit measures using K-fold CV.

Then, we repeat the same procedure but set K = 1 to forgo K-fold cross-validation.

# Train the GSPCR model without K-fold cross-validation (K = 1)
out_fit_meas_no_CV <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        K = 1,
        npcs_range = c(5, 20),
        fit_measure = i,
        thrs = "normalized"
    )
})

# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas_no_CV[[i]],
        y = fit_measure_vec[[i]],
        labels = TRUE,
        y_reverse = rev,
        errorBars = TRUE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
Figure 4: Solution paths for different fit measures without using K-fold CV.

From the plots you can already see that, when using LRT, MSE, and PR2 as fit measures without cross-validation, we would end up selecting the largest number of PCs provided (20 in this case). With K-fold cross-validation, the solution paths lead us to choose 5 PCs instead.

We can also look at the solution tables to confirm our reading of the plots. The solution we would have found using K-fold CV is:

# Standard solutions
res_CV <- sapply(
    seq_along(out_fit_meas_cv),
    function(meth) {
        as.numeric(out_fit_meas_cv[[meth]]$sol_table["standard", ])
    }
)

# Give meaningful names
dimnames(res_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_CV), 3)
    thr_value thr_number Q
LRT     1.517          3 5
PR2     1.517          3 5
MSE     1.517          3 5
F       1.517          3 5
AIC     1.517          3 5
BIC     0.784          2 5

The solutions we would have obtained without using K-fold CV are:

# Standard solutions
res_no_CV <- sapply(
    seq_along(out_fit_meas_no_CV),
    function(meth) {
        as.numeric(out_fit_meas_no_CV[[meth]]$sol_table["standard", ])
    }
)

# Give meaningful names
dimnames(res_no_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_no_CV), 3)
    thr_value thr_number  Q
LRT     1.517          3 20
PR2     1.517          3 20
MSE     1.517          3 20
F       1.517          3  5
AIC     1.517          3  5
BIC     1.517          3  5

As you can see, using CV we find the same solution no matter which fit measure we use, while without CV only AIC, BIC, and F select a low number of PCs.
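
For a side-by-side view, we can bind the Q rows of the two result matrices computed above:

# Compare the selected number of PCs (Q) with and without K-fold CV
rbind(CV = res_CV["Q", ], no_CV = res_no_CV["Q", ])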

3 1SE solutions

The results for all fit measures except F struggle to account for model complexity. A simple 1-standard-error rule helps obviate this problem: instead of picking the candidate with the best cross-validated fit, we pick the most parsimonious candidate whose fit is within one standard error of the best one.
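
To make the rule concrete, here is a toy illustration (a minimal sketch, not gspcr's internal code; cv_means and cv_ses are made-up mean CV scores and their standard errors, ordered from the simplest to the most complex candidate):

# Hypothetical mean CV fit (higher = better) and its standard errors
cv_means <- c(0.60, 0.72, 0.75, 0.76)
cv_ses <- c(0.02, 0.02, 0.03, 0.03)

# Best candidate and the 1-SE band below its score
best <- which.max(cv_means)
cutoff <- cv_means[best] - cv_ses[best]

# Simplest candidate whose score falls within the band
min(which(cv_means >= cutoff))

First, fit the models with several possible numbers of components: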

# Train the GSPCR model with several candidate numbers of components
out_fit_meas <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        fam = "gaussian",
        nthrs = 10,
        npcs_range = 1:10,
        K = 10,
        fit_measure = i,
        thrs = "normalized",
        min_features = 1,
        max_features = ncol(X),
        oneSE = TRUE
    )
})

# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas[[i]],
        y = fit_measure_vec[[i]],
        labels = TRUE,
        y_reverse = rev,
        errorBars = TRUE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
Figure 5: Solution paths for different fit measures when using the 1-standard-error rule.

Then, extract the solutions obtained by each fit measure:

# Standard solutions
res <- sapply(
    seq_along(out_fit_meas),
    function(meth) {
        as.numeric(out_fit_meas[[meth]]$sol_table["standard", ])
    }
)

# Give meaningful names
dimnames(res) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res), 3)
    thr_value thr_number Q
LRT     0.784          2 4
PR2     0.784          2 4
MSE     1.517          3 5
F       2.250          4 1
AIC     1.517          3 3
BIC     2.250          4 1

Finally, you can check which solutions the 1-standard-error rule would choose:

# 1se solutions
res_1se <- sapply(
    seq_along(out_fit_meas),
    function(meth) {
        as.numeric(out_fit_meas[[meth]]$sol_table["oneSE", ])
    }
)

# Give meaningful names
dimnames(res_1se) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_1se), 3)
    thr_value thr_number Q
LRT     1.517          3 3
PR2     1.517          3 3
MSE     1.517          3 3
F       1.517          3 1
AIC     2.250          4 1
BIC     2.250          4 2
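
To contrast the two decision rules directly, we can line up the number of PCs each one selects, using the res and res_1se objects created above:

# Compare the selected number of PCs (Q) under the two rules
cbind(standard = res["Q", ], oneSE = res_1se["Q", ])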

4 Alternatives to CV

To speed up the model-fitting process, it can be a good idea to use model-building strategies that are less time-consuming than CV. For example, you can use an information criterion without CV to select the appropriate threshold value and number of components. To do so, set the number of folds to 1 and the fit measure to AIC or BIC.

# Define vector of measures to be used
fit_measure_vec <- c("LRT", "AIC", "BIC")

# Train the GSPCR model with each of these fit measures
out_fit_meas <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        fam = "gaussian",
        nthrs = 10,
        npcs_range = c(1, 2, 5, 20),
        K = 1,
        fit_measure = i,
        thrs = "normalized",
        min_features = 1,
        max_features = ncol(X),
        oneSE = TRUE
    )
})

# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas[[i]],
        y = fit_measure_vec[[i]],
        labels = TRUE,
        y_reverse = rev,
        errorBars = FALSE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
plots[[1]] + plots[[2]] + plots[[3]]
Figure 6: Solution paths obtained using fit measures that do not need K-fold CV.

You can also look at the solutions:

# Put solutions together
rbind(
    LRT = out_fit_meas[[1]]$sol_table["standard", ],
    AIC = out_fit_meas[[2]]$sol_table["standard", ],
    BIC = out_fit_meas[[3]]$sol_table["standard", ]
)
    thr_value thr_number  Q
LRT  1.517276          3 20
AIC  1.517276          3  5
BIC  1.517276          3  5