Problems with P-values

A case study

Norman Matloff

9/28/2023

In 2016, the American Statistical Association released its first-ever position paper, to warn of the problems of significance testing and “p-values.” Though the issues had been well known for many years, it was “significant” that the ASA finally took a stand. Let’s use the lsa data in this package to illustrate.

Law School Admissions Data

According to the Kaggle entry, this is a

…Law School Admissions dataset from the Law School Admissions Council (LSAC). From 1991 through 1997, LSAC tracked some twenty-seven thousand law students through law school, graduation, and sittings for bar exams. …The dataset was originally collected for a study called ‘LSAC National Longitudinal Bar Passage Study’ by Linda Wightman in 1998.

Here is an overview of the variables:

data(lsa)
names(lsa)
#>  [1] "age"      "decile1"  "decile3"  "fam_inc"  "lsat"     "ugpa"    
#>  [7] "gender"   "race1"    "cluster"  "fulltime" "bar"

Most of the names are self-explanatory, but we’ll note that the ‘age’ variable is apparently birth year, with e.g. 67 meaning 1967; that the two decile scores are class standing in the first and third years of law school; and that ‘cluster’ refers to the reputed quality of the law school. Two variables of particular interest might be the student’s score on the Law School Admission Test (LSAT) and a logical variable indicating whether the person passed the bar examination.
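To check how the variables are actually coded, say whether ‘fam_inc’ is numeric, one can always inspect the data frame directly, e.g. via R’s str function:

str(lsa)            # type and first few values of each variable
table(lsa$fam_inc)  # fam_inc is coded in quintiles (discussed below)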

Wealth Bias in the LSAT?

There is concern that the LSAT and other similar tests may be heavily influenced by family income, and thus unfair, especially to underrepresented minorities. To investigate this, let’s consider the estimated coefficients in a linear model for the LSAT score:

w <- lm(lsat ~ .,lsa)  # predict lsat from all other variables
summary(w)
#> 
#> Call:
#> lm(formula = lsat ~ ., data = lsa)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -19.290  -2.829   0.120   2.888  16.556 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 31.985789   0.448435  71.328  < 2e-16 ***
#> age          0.020825   0.005842   3.565 0.000365 ***
#> decile1      0.127548   0.020947   6.089 1.15e-09 ***
#> decile3      0.214950   0.020919  10.275  < 2e-16 ***
#> fam_inc      0.300858   0.035953   8.368  < 2e-16 ***
#> ugpa        -0.278173   0.080431  -3.459 0.000544 ***
#> gendermale   0.513774   0.060037   8.558  < 2e-16 ***
#> race1black  -4.748263   0.198088 -23.970  < 2e-16 ***
#> race1hisp   -2.001460   0.203504  -9.835  < 2e-16 ***
#> race1other  -0.868031   0.262529  -3.306 0.000947 ***
#> race1white   1.247088   0.154627   8.065 7.71e-16 ***
#> cluster2    -5.106684   0.119798 -42.627  < 2e-16 ***
#> cluster3    -2.436137   0.074744 -32.593  < 2e-16 ***
#> cluster4     1.210946   0.088478  13.686  < 2e-16 ***
#> cluster5     3.794275   0.124477  30.482  < 2e-16 ***
#> cluster6    -5.532161   0.210751 -26.250  < 2e-16 ***
#> fulltime2   -1.388821   0.116213 -11.951  < 2e-16 ***
#> barTRUE      1.749733   0.102819  17.018  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.197 on 20782 degrees of freedom
#> Multiple R-squared:  0.3934, Adjusted R-squared:  0.3929 
#> F-statistic: 792.9 on 17 and 20782 DF,  p-value: < 2.2e-16

There are definitely some salient racial aspects here, but, staying with the income issue, look at the coefficient for family income, 0.3009. The p-value is essentially 0, which in an academic research journal would classically be heralded with much fanfare, termed “very highly significant,” with a 3-star insignia. Indeed, the latter is seen in the output above. But actually, the impact of family income is not significant in practical terms. Here’s why:

Family income in this dataset is measured by quintiles. So this estimated coefficient says that, for example, if we compare people who grew up in the bottom 20% of income with those who were raised in the next 20%, the mean LSAT score rises by only about 1/3 of 1 point, on a test where scores are typically in the 20s, 30s and 40s. The 95% confidence interval (CI), (0.2304, 0.3714), again indicates that the effect size here is very small. Mathematically, testing for a 0 effect is equivalent to checking whether the CI contains 0. But this misses the point of the CI, which is to (a) give us an idea of the effect size and (b) indicate how accurate our estimate of that size is. Aspect (a) is given by the location of the center of the interval, while (b) is seen from the CI’s radius.
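As a side note, that interval can be obtained directly from the fitted model, via R’s confint function:

confint(w,'fam_inc')  # 95% CI for the family income coefficient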

So family income is not an important factor after all, and the significance test was highly misleading.

But Aren’t There Settings in Which Significance Testing Is of Value?

Some who read the above may object, “Sure, there sometimes may be a difference between statistical significance and practical significance. But I just want to check whether my model fits the data.” Actually, it’s the same problem.

“I just want to check whether my model fits the data”

For instance, suppose we are considering adding an interaction term between race and undergraduate GPA to our above model. Let’s fit this more elaborate model, then compare.

w1 <- lm(lsat ~ .+race1:ugpa,lsa)  # add interaction 
summary(w1)
#> 
#> Call:
#> lm(formula = lsat ~ . + race1:ugpa, data = lsa)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -19.1783  -2.8065   0.1219   2.8879  16.0633 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     26.574993   1.219611  21.790  < 2e-16 ***
#> age              0.020612   0.005837   3.531 0.000415 ***
#> decile1          0.127585   0.020926   6.097 1.10e-09 ***
#> decile3          0.213918   0.020902  10.234  < 2e-16 ***
#> fam_inc          0.295042   0.035939   8.210 2.35e-16 ***
#> ugpa             1.417659   0.363389   3.901 9.60e-05 ***
#> gendermale       0.513686   0.059986   8.563  < 2e-16 ***
#> race1black       4.121631   1.439354   2.864 0.004194 ** 
#> race1hisp        1.378504   1.570833   0.878 0.380191    
#> race1other       2.212299   1.976702   1.119 0.263073    
#> race1white       6.838251   1.201559   5.691 1.28e-08 ***
#> cluster2        -5.105703   0.119879 -42.590  < 2e-16 ***
#> cluster3        -2.427800   0.074862 -32.430  < 2e-16 ***
#> cluster4         1.208794   0.088453  13.666  < 2e-16 ***
#> cluster5         3.777611   0.124422  30.361  < 2e-16 ***
#> cluster6        -5.565130   0.210945 -26.382  < 2e-16 ***
#> fulltime2       -1.406151   0.116132 -12.108  < 2e-16 ***
#> barTRUE          1.743800   0.102855  16.954  < 2e-16 ***
#> ugpa:race1black -2.876555   0.460281  -6.250 4.20e-10 ***
#> ugpa:race1hisp  -1.022786   0.494210  -2.070 0.038508 *  
#> ugpa:race1other -0.941852   0.617940  -1.524 0.127479    
#> ugpa:race1white -1.737553   0.370283  -4.693 2.72e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.193 on 20778 degrees of freedom
#> Multiple R-squared:  0.3948, Adjusted R-squared:  0.3942 
#> F-statistic: 645.4 on 21 and 20778 DF,  p-value: < 2.2e-16

Indeed, the Black and white interaction terms with undergraduate GPA are “very highly significant.” But does that mean we should use the more complex model? Let’s check the actual impact of including the interaction terms, by comparing the two models’ predictions at an example X value:

typx <- lsa[1,-5]  # use the first case as an example, deleting column 5 (lsat)
predict(w,typx)  # no-interaction model
#>       2 
#> 40.2294
predict(w1,typx)  # with-interaction model
#>       2 
#> 40.2056

We see here that adding the interaction terms changed the prediction, and thus the estimated value of the regression function, by only about 0.02 out of a 40.23 baseline. So, while the test has validated our with-interaction model, we may well prefer the simpler, no-interaction model.
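Of course, a single example case may not be representative. A quick sketch of a broader check, comparing the two models’ fitted values over all cases in the dataset:

# distribution of discrepancies between the two models' fitted values
summary(abs(fitted(w) - fitted(w1)))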

“I just want to know whether the effect is positive or negative”

Here we have a different problem: bias. Our linear model is just that, a model, and its imperfection will induce a bias. This could change the estimated effect from positive to negative or vice versa, even with an infinite amount of data. As the dataset size n grows, the variance of the estimated parameters goes to 0, but the bias won’t go away.
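A minimal simulation, using made-up data rather than the lsa dataset, illustrates the point. Suppose the true regression function is quadratic but we fit a purely linear model; the estimated slope then settles down as n grows, but to a biased value:

set.seed(9999)
fitLinear <- function(n) {
   x <- runif(n)
   y <- x + 0.5*x^2 + rnorm(n,sd=0.1)  # true coefficient of x is 1
   coef(lm(y ~ x))[2]  # slope from the misspecified linear fit
}
sapply(c(1e3,1e4,1e5,1e6),fitLinear)  # stabilizes near 1.5, not 1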

Bottom line:

We must not take small p-values literally.

What Is the Underlying Problem, and Its Implications?

The central issue in the above examples, and essentially in any other testing situation, is that a significance test is not answering the question of interest to us.

We wish to know whether family income plays a substantial role in the LSAT, not whether there is any relation at all, no matter how meaningless. Similarly, we wish to know whether the interaction between race and GPA is substantial enough to warrant inclusion in our model, not whether there is any interaction at all, no matter how tiny.

The question at hand in research studies is rarely, if ever, whether a quantity is 0.000… to infinitely many decimal places. And as noted, our measuring instrument is not this accurate in the first place; there will always be systematic bias, in our model, our dataset and so on.

Thus in almost all cases, significance tests don’t address the issue of interest, which is whether some population quantity is substantial enough to be considered important. Analysts should not be misled by words like “significant.” Modern statistical practice places reduced value, or in the view of many, no value at all, on significance testing.