CUPED: Controlled Experiments Utilizing Pre-Experiment Data

CUPED is a variance reduction method widely implemented by major online experimentation platforms (Deng et al. 2013). The idea was recently extended to a general augmentation framework (Deng et al. 2023), which is a special case of

Chen, Y.-H., and Chen, H. (2000), A Unified Approach to Regression Analysis Under Double Sampling Design, Journal of the Royal Statistical Society, Series B, 62, 449–460.

The extended CUPED is a good use case for multipleOutcomes. Instead of following Deng’s new paper, I use Chen’s approach to elaborate on the details of extended CUPED, as it better explains the underlying assumptions and failure conditions.

Consider a two-arm randomized controlled experiment with a targeted metric \(Y\). Let \(A\) be the arm assignment. If randomization is successful, the treatment effect can be assessed through \(\hat{\Delta}\), the estimate of \(\Delta\) in the linear regression model

\[Y = \alpha + \Delta A + \text{error},\]

where the error term need not be normally distributed. Note that a logistic regression model may be more appropriate if \(Y\) is binary.

Obviously, \(\hat{\Delta}\) is consistent and asymptotically unbiased under good randomization. We are interested in alternative estimates of \(\Delta\) with variance lower than that of \(\hat{\Delta}\). Consider an estimate

\[\tilde{\Delta} = \hat{\Delta} + c \times \hat{\delta},\] where \(\text{E}(\hat{\delta}) = 0\). Thus, \(\tilde{\Delta}\) is also consistent and asymptotically unbiased for the treatment effect \(\Delta\), and with a properly chosen \(c\) its variance is no greater than that of \(\hat{\Delta}\) (the worst case, \(c=0\), simply recovers \(\hat\Delta\)). As long as \(\hat\Delta\) and \(\hat\delta\) are correlated, we can find a \(c\) yielding a more efficient estimate of \(\Delta\). This simple idea can be further extended to more than one \(\hat\delta\):

\[\tilde{\Delta} = \hat{\Delta} + c_1 \times \hat\delta_1 + c_2 \times \hat\delta_2 + \cdots + c_K \times \hat\delta_K.\]
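
For a single \(\hat\delta\), the optimal \(c\) follows from minimizing the variance of \(\tilde\Delta\),

\[\text{Var}(\tilde{\Delta}) = \text{Var}(\hat{\Delta}) + 2c\,\text{Cov}(\hat{\Delta}, \hat{\delta}) + c^2\,\text{Var}(\hat{\delta}),\]

which is minimized at \(c^\ast = -\text{Cov}(\hat{\Delta}, \hat{\delta}) / \text{Var}(\hat{\delta})\) and yields \(\text{Var}(\tilde{\Delta}) = (1 - \rho^2)\,\text{Var}(\hat{\Delta})\), where \(\rho\) is the correlation between \(\hat{\Delta}\) and \(\hat{\delta}\).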

To construct a \(\hat\delta\) with zero expectation, one may collect additional pre- or in-experiment metrics \(X_1, X_2, \cdots, X_K\) for each sample. If the \(X\)’s are well balanced between the two arms, we have \(\delta_1 = \delta_2 = \cdots = \delta_K = 0\), where \(\delta_k\) is the slope of the working model

\[X_k = \alpha_k + \delta_k A + \text{error}.\] Here, a working model means that the relationship between the variables specified by the model is not necessarily true; we simply choose to fit it as is. As a result, the error term can be anything, and even an unreasonable link function may be adopted without affecting the validity of extended CUPED. For example, one can fit a linear regression against a binary \(X\); the slope \(\delta = 0\) as long as \(X\) is balanced. Furthermore, since for any transformation \(h(\cdot)\), \(h(X)\) is balanced whenever \(X\) is balanced, one can fit the following working model as well

\[h_k(X_k) = \alpha_k + \delta_k^h A + \text{error}.\] For a binary \(X_k\), a more natural choice, the logistic regression model, also works:

\[\text{logit}\ \text{Pr}(X_k = 1\mid A) = \alpha_k + \delta_k^\text{logit}A.\] Interestingly, multiple \(\hat\delta\) corresponding to the same \(X_k\) can be integrated into the extended CUPED estimate simultaneously. In the example above, \(\tilde\Delta\) can be defined as

\[\tilde\Delta = \hat\Delta + c_k \hat{\delta}_k + c_k^h \hat{\delta}_k^h + c_k^\text{logit}\hat{\delta}_k^\text{logit} + (\hat\delta\text{'s based on other } X\text{'s}).\] The variance of this estimate is no greater than that of an extended CUPED estimate using fewer \(\hat\delta\) from the same metric \(X_k\) (e.g., \(\hat\Delta + c_k \hat\delta_k\)).
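
As a quick standalone check (simulated data only, not using multipleOutcomes), the working-model slope for a balanced metric is approximately zero regardless of the transformation or the link:

## sketch: a balanced binary metric yields delta ~ 0 under any working model
set.seed(42)
n <- 10000
a <- rbinom(n, 1, .5)                      # 1:1 randomization
x <- rbinom(n, 1, .4)                      # binary metric, independent of the arm
coef(lm(x ~ a))['a']                       # linear working model: slope ~ 0
coef(lm(exp(x) ~ a))['a']                  # transformed working model: slope ~ 0
coef(glm(x ~ a, family = binomial))['a']   # logistic working model: slope ~ 0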

Conclusions

In summary, the extended CUPED is valid (consistent and asymptotically unbiased) when the additional metrics \(X\) are balanced between the two arms, so that each \(\hat\delta\) has zero expectation; successful randomization guarantees this for pre-experiment metrics.

The extended CUPED is more efficient when the \(\hat\delta\)’s are correlated with \(\hat\Delta\), i.e., when the additional metrics are correlated with the targeted metric \(Y\).

The optimal weights \(c\) depend on the covariance matrix of \((\hat\Delta, \hat\delta)\) rather than the original observations of \((Y, X)\); thus, tuning \(c\) does not inflate the type I error of the inference. multipleOutcomes can return the covariance matrix needed for extended CUPED.
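
Concretely, writing \(\hat{\boldsymbol{\delta}} = (\hat\delta_1, \ldots, \hat\delta_K)^\top\) and \(\mathbf{c} = (c_1, \ldots, c_K)^\top\), the variance-minimizing weights are

\[\mathbf{c}^\ast = -\text{Var}(\hat{\boldsymbol{\delta}})^{-1}\,\text{Cov}(\hat{\boldsymbol{\delta}}, \hat{\Delta}),\]

which is exactly what the cuped() helper defined in the Implementation section computes from the covariance matrix returned by vcov().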

The extended CUPED remains valid even if the working models are mis-specified (e.g., a linear model fitted to a binary metric) or some of the additional metrics are pure noise uncorrelated with the targeted metric; in the latter case there is simply no efficiency gain.

In fact, adding multiple \(\hat\delta\) based on different transformations of the same \(X_k\) to the extended CUPED estimate may lower its variance further in some cases. A variance-stabilizing transformation may be considered for metrics from certain parametric families of distributions or for highly skewed metrics.

Implementation

## generating data
library(multipleOutcomes)
library(mvtnorm)
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

genData <- function(seed = NULL){
  set.seed(seed)
  n <- 500
  sigma <- matrix(.7, 5, 5)
  diag(sigma) <- 1.
  
  x <- rmvnorm(n, sigma = sigma)
  # x[, 1]: pre-experiment data of the targeted metric
  # x[, 2]: continuous metric
  x[, 3] <- log(abs(x[, 3])) # continuous and skewed metric
  x[, 4] <- ifelse(x[, 4] > .2, 1, 0) # binary metric
  a <- rbinom(n, 1, .5) # 1:1 randomization
  
  # y is post-experiment data of the targeted metric
  y <- x[, 5]
  y[a == 0] <- y[a == 0] + rnorm(sum(a == 0), sd = .1)
  y[a == 1] <- y[a == 1] + rnorm(sum(a == 1), mean = .1, sd = .1) # Delta = .1
  
  ## x5 and x6 are metrics (random noise) uncorrelated with the targeted metric
  data.frame(
    y, a = a, 
    x1 = x[, 1], x2 = x[, 2], x3 = x[, 3], x4 = x[, 4], 
    x5 = rnorm(n), x6 = rbinom(n, 1, .5))
  
}

dat <- genData(1)
head(dat) %>% print()
#>             y a         x1         x2         x3 x4         x5 x6
#> 1  0.40263988 0 -0.1619335  0.2817750 -1.2855327  1  0.6414685  0
#> 2  0.07511838 0 -0.2599780  0.4563869 -0.5211990  1  1.6552190  1
#> 3  0.63327183 0  0.8814712  0.2669605 -1.2488553  0 -0.8596355  0
#> 4  1.04650293 1  0.6195346  0.6352780  0.1493731  1  0.6173840  0
#> 5  0.52152316 0  0.6172003  0.5422494 -1.8662895  0  1.3358366  1
#> 6 -0.23932112 0 -0.5193212 -0.5739109  0.2578483  0 -0.5340088  1
xtabs(~x4, dat) %>% print()
#> x4
#>   0   1 
#> 299 201

plot(
  dat %>% 
    select(-x4, -x6, -a) %>% 
    pivot_longer(everything(), names_to = "indicator", values_to = "value") %>% 
    ggplot(aes(value)) + 
    geom_histogram() + facet_wrap(~indicator)
)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


## function to compute CUPED estimate
## ...: formulas for each of the metrics
## family: a vector of family for each of the formulas
## data: a data frame of experiment data
cuped <- function(..., family, data){
  fit <- multipleOutcomes(..., family = family, data = data, data_index = NULL)
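  # indices of the arm coefficient 'a' in each fitted model; the first formula
  # gives Delta hat for the targeted metric, the remaining ones give the delta hats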
  id <- sapply(fit$id_map, function(x){x['a']})
  id1 <- id[1]
  id2 <- id[-1]
  Delta <- coef(fit)[id1]
  delta <- coef(fit)[id2]
  mcov <- vcov(fit)
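  # optimal weights: c* = -Var(delta hat)^{-1} Cov(delta hat, Delta hat)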
  opt_c <- as.vector(- solve(mcov[id2, id2, drop = FALSE]) %*% mcov[id2, id1, drop = FALSE])
  
  estimate <- Delta + sum(delta * opt_c)
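  # standard error: sqrt(Var(Delta) + 2 c'Cov(delta, Delta) + c'Var(delta) c)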
  stderr <- (
    mcov[id1, id1] + 
    2 * as.vector(t(opt_c) %*% mcov[id2, id1, drop = FALSE]) + 
    as.vector(t(opt_c) %*% mcov[id2, id2, drop = FALSE] %*% opt_c)
  ) %>% sqrt()
  
  list(estimate = estimate, stderr = stderr)
}

An Example

## An example
## y ~ a: treatment effect of post-experiment data for the targeted metric
## x1 ~ a: delta (= 0) using pre-experiment data for the targeted metric
## x2 ~ a: delta (= 0) using a continuous pre-experiment metric
## x3 ~ a: delta (= 0) using a skewed pre-experiment metric
## abs(x3) ~ a: delta (= 0) using a transformation of a pre-experiment metric
## x4 ~ a: delta (= 0) using a binary pre-experiment metric
## x5 ~ a: delta (= 0) using a metric that is a random noise
## family: linear regression model is fitted for all working models (even if x4 is binary)
cuped_fit <- 
  cuped(
    y ~ a, 
    x1 ~ a, 
    x2 ~ a, 
    x3 ~ a, 
    abs(x3) ~ a, 
    x4 ~ a, 
    x5 ~ a, 
    family = 'gaussian', 
    data = dat)

## CUPED estimate and its standard error
print(cuped_fit)
#> $estimate
#> [1] 0.06196927
#> 
#> $stderr
#> [1] 0.05886572

## treatment effect on targeted metric (post-experiment data only)
summary(lm(y ~ a, dat))$coef
#>                Estimate Std. Error    t value  Pr(>|t|)
#> (Intercept)  0.11394931 0.06572757  1.7336608 0.0835972
#> a           -0.03789196 0.09221800 -0.4108955 0.6813259

The example above illustrates a complex case of CUPED using multipleOutcomes. The standard error of the estimated treatment effect is substantially reduced (0.059 vs. 0.092).
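
Since cuped() returns both the estimate and its standard error, Wald-type inference follows directly; a minimal sketch using the cuped_fit object above:

## Wald-type inference from the CUPED estimate (sketch)
z <- cuped_fit$estimate / cuped_fit$stderr
2 * pnorm(-abs(z))                                               # two-sided p-value
cuped_fit$estimate + c(-1, 1) * qnorm(.975) * cuped_fit$stderr   # 95% confidence interval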

Simulation

We carry out simulations to verify the key points in Conclusions. Specifically, we check that (1) the extended CUPED estimates are unbiased for the true \(\Delta = 0.1\); (2) efficiency improves as more correlated metrics are added; (3) adding metrics uncorrelated with the targeted metric (pure noise) neither biases the estimate nor improves efficiency; and (4) the theoretical standard errors agree with the empirical standard deviations of the estimates.

n_simu <- 1000
all_estimate <- NULL
all_stderr <- NULL
for(i in 1:n_simu){
  dat <- genData(i)
  
  # use post- data only
  post_only <- summary(lm(y ~ a, dat))$coef
  
  # add pre- data of targeted metric
  pre_target <- cuped(y ~ a, x1 ~ a, family = 'gaussian', data = dat)
  
  # add more pre- data
  pre_more <- cuped(y ~ a, x1 ~ a, x3 ~ a, x4 ~ a, family = 'gaussian', data = dat)
  
  # multiple delta (transformation) for the same metric
  # transform_delta <- cuped(y ~ a, x1 ~ a, x3 ~ a, exp(x3) ~ a, x4 ~ a, x4 ~ a, family = c('gaussian', 'gaussian', 'gaussian', 'gaussian', 'gaussian', 'binomial'), data = dat)
  
  # add all correlated pre- data
  all_corr <- cuped(y ~ a, x1 ~ a, x2 ~ a, x3 ~ a, x4 ~ a, family = 'gaussian', data = dat)
  
  # add noise data
  add_noise <- cuped(y ~ a, x1 ~ a, x2 ~ a, x3 ~ a, x4 ~ a, x5 ~ a, x6 ~ a, family = 'gaussian', data = dat)
  
  # noise only
  noise_only <- cuped(y ~ a, x5 ~ a, x6 ~ a, family = c('gaussian', 'gaussian', 'binomial'), data = dat)
  
  extractEstimate <- function(mod){
    mod$estimate
  }
  
  extractStderr <- function(mod){
    mod$stderr
  }
  
  estimate <- 
    data.frame(
      post_only = post_only['a', 'Estimate'], 
      pre_target = extractEstimate(pre_target), 
      pre_more = extractEstimate(pre_more), 
      # transform_delta = extractEstimate(transform_delta), 
      all_corr = extractEstimate(all_corr), 
      add_noise = extractEstimate(add_noise), 
      noise_only = extractEstimate(noise_only)
    )
  
  stderr <- 
    data.frame(
      post_only = post_only['a', 'Std. Error'], 
      pre_target = extractStderr(pre_target), 
      pre_more = extractStderr(pre_more), 
      # transform_delta = extractStderr(transform_delta), 
      all_corr = extractStderr(all_corr), 
      add_noise = extractStderr(add_noise), 
      noise_only = extractStderr(noise_only)
    )
  
  all_estimate <- rbind(all_estimate, estimate)
  all_stderr <- rbind(all_stderr, stderr)

}

## mean of estimates of simulation (true Delta = .1)
apply(all_estimate, 2, mean) %>% print(digits = 3)
#>  post_only pre_target   pre_more   all_corr  add_noise noise_only 
#>      0.101      0.100      0.101      0.101      0.101      0.102

## empirical standard deviation
apply(all_estimate, 2, sd) %>% print(digits = 3)
#>  post_only pre_target   pre_more   all_corr  add_noise noise_only 
#>     0.0916     0.0654     0.0627     0.0585     0.0585     0.0918

## theoretical standard error
apply(all_stderr, 2, mean) %>% print(digits = 3)
#>  post_only pre_target   pre_more   all_corr  add_noise noise_only 
#>     0.0898     0.0643     0.0617     0.0576     0.0575     0.0896