Proportional Odds Ordinal Logistic Regression

By Charles Holbert

April 17, 2022

Introduction

The Kruskal-Wallis test for nonparametric analysis of variance (ANOVA) cannot be directly extended to the case where there are two or more factors as is done with parametric two-way ANOVA. Instead, a nonparametric approximate approach is to transform the data values to their ranks, and then compute a parametric two-way ANOVA on the data ranks. Another approach is to use ordinal logistic regression (OLR) to provide general contrasts on the log odds ratio scale (Harrell 2015). OLR is a semiparametric method that only uses the ranks of the response variable Y. It handles continuous Y by creating k-1 intercepts where k is the number of unique Y values (Liu 2017; Harrell 2015). Proportional odds (PO) OLR is a generalization of the Wilcoxon and Kruskal-Wallis tests that extends to multiple covariates and interactions (Harrell 2022a; Harrell 2022b). The score test from the PO OLR model for the global null hypothesis (k-1 degrees of freedom for comparing k groups) is almost exactly the Kruskal-Wallis test statistic. For the case where k equals 2 (Wilcoxon test), the numerator of the score test is exactly the numerator of the Wilcoxon-Mann-Whitney U test.

In this post, we will use OLR to evaluate whether sample depth and/or site location affect arsenic concentrations measured in soil. We will use Frank Harrell’s rms R package to perform the modelling.

Data

Arsenic concentrations were measured in soil from two depth intervals, shallow (SS) and deep (SB), at five sites (RA1 through RA5). We would like to determine whether arsenic concentrations in soil not only exhibit spatial variability between site locations but also may be affected by sample depth. This constitutes a two-way ANOVA problem, with one factor as location (represented by the five sites) and the second factor as matrix (represented by the two soil sample depths).

The data contain 266 observations and 11 variables.

# Load libraries
library(dplyr)
library(rms)

# Load data
dat <- read.csv('data/Soil_Arsenic.csv', header = T)
dat$LOGDATE <- as.POSIXct(dat$LOGDATE, format = '%m/%d/%Y')

str(dat)

## 'data.frame':	266 obs. of  11 variables:
##  $ LOCATION: chr  "RA3" "RA3" "RA3" "RA3" ...
##  $ SACODE  : chr  "N" "N" "N" "N" ...
##  $ LOGDATE : POSIXct, format: "2019-08-15" "2019-08-15" ...
##  $ MATRIX  : chr  "SS" "SS" "SS" "SS" ...
##  $ MEDIA   : chr  "Soil" "Soil" "Soil" "Soil" ...
##  $ DILUTION: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ PARAM   : chr  "Arsenic" "Arsenic" "Arsenic" "Arsenic" ...
##  $ DETECTED: chr  "Y" "Y" "Y" "Y" ...
##  $ RESULT  : num  2.95 2.83 2.96 2.55 4.02 3.7 3.52 3.03 3.27 3.52 ...
##  $ MDL     : num  1 0.8 0.8 0.8 0.8 1 0.8 0.8 0.8 0.8 ...
##  $ UNITS   : chr  "ppb" "ppb" "ppb" "ppb" ...

Let’s select only the columns needed for the modelling.

# Select only columns needed
d <- dat[, c('MATRIX', 'LOCATION', 'RESULT')]

The datadist() function computes statistical summaries of predictors to automate estimation and plotting of effects. The user generally supplies the final data frame to the datadist() function and set the data distribution using the options function. If the function is called before a model fit and the resulting object pointed to with options(datadist = “name”), the data characteristics will be stored with the fit so later predictions and summaries of the fit will not need to access the original data. It is generally recommended to run datadist() once before any models are fitted, storing the distribution summaries for all potential variables. Adjustment values are 0 for binary variables, the most frequent category (or optionally the first category level) for categorical (factor) variables, the middle level for ordered factor variables, and medians for continuous variables. The default is “mode”, indicating that the most frequent category for categorical (factor) variables is the adjust-to setting.

dd <- datadist(d); options(datadist = 'dd')
dd

##                 MATRIX LOCATION    RESULT
## Low:effect        <NA>     <NA>  4.062500
## Adjust to           SB      RA5  6.090000
## High:effect       <NA>     <NA>  7.150000
## Low:prediction      SB      RA1  1.967744
## High:prediction     SS      RA5 10.107519
## Low                 SB      RA1  0.930000
## High                SS      RA5 11.500000
## 
## Values:
## 
## MATRIX : SB SS 
## LOCATION : RA1 RA2 RA3 RA4 RA5

The orm() function fits ordinal cumulative probability models for continuous or ordinal response variables. The default being the logistic (i.e., the proportional odds model). It is especially made for continuous Y, with fast run times for up to 6,000 intercepts. The ordinal cumulative probability models are stated in terms of exceedance probabilities (Prob[Y = y | X]) so that as with ordinary least squares regression, larger predicted values are associated with larger Y.

To test for differences among the two soil depths (MATRIX) and five site locations (LOCATION), soil depth type SB and site location type RA1 are the reference groups.

f <- orm(RESULT ~ LOCATION * MATRIX, data = d)
f

## Logistic (Proportional Odds) Ordinal Regression Model
##  
##  orm(formula = RESULT ~ LOCATION * MATRIX, data = d)
##  
##                         Model Likelihood               Discrimination    Rank Discrim.    
##                               Ratio Test                      Indexes          Indexes    
##  Obs           266    LR chi2     193.53    R2                  0.517    rho     0.666    
##  Distinct Y    212    d.f.             9    g                   2.090                     
##  Median Y     6.08    Pr(> chi2) <0.0001    gr                  8.082                     
##  max |deriv| 1e-06    Score chi2  211.82    |Pr(Y>=median)-0.5| 0.267                     
##                       Pr(> chi2) <0.0001                                                  
##  
##                           Coef    S.E.   Wald Z Pr(>|Z|)
##  LOCATION=RA2              0.0221 0.9108  0.02  0.9806  
##  LOCATION=RA3             -1.5747 0.6539 -2.41  0.0160  
##  LOCATION=RA4             -0.2248 0.8628 -0.26  0.7945  
##  LOCATION=RA5              2.3085 0.6250  3.69  0.0002  
##  MATRIX=SS                 2.3165 0.6997  3.31  0.0009  
##  LOCATION=RA2 * MATRIX=SS  2.6835 1.0580  2.54  0.0112  
##  LOCATION=RA3 * MATRIX=SS -1.7777 0.8238 -2.16  0.0309  
##  LOCATION=RA4 * MATRIX=SS  1.1788 1.0280  1.15  0.2515  
##  LOCATION=RA5 * MATRIX=SS -1.9581 0.7872 -2.49  0.0129  
##

The p-values for both the likelihood ratio (LR) test \(\chi^2\) and the score test from this model are <0.0001, indicating that the global null hypothesis of no difference in groups can be rejected. Soil depth (MATRIX), site location (LOCATION), and the interaction between depth and location (MATRIX:LOCATION) are statistically significant.

We can inspect a summary of the results using the depigner R library. The tidy_summary() function converts a summary() object from the rms package to a tidy data frame.

library(depigner)
data.frame(tidy_summary(summary(f)))

##    X.nbsp.   Diff.         HR Lower.95..CI Upper.95..CI
## 1 LOCATION RA1:RA5 0.09941039   0.02920485   0.33838310
## 2 LOCATION RA2:RA5 0.10163620   0.02337105   0.44199618
## 3 LOCATION RA3:RA5 0.02058573   0.00829962   0.05105922
## 4 LOCATION RA4:RA5 0.07939697   0.02003986   0.31456702
## 5   MATRIX   SS:SB 1.43104080   0.69886458   2.93029268

Let’s perform ANOVA on the model fit. Because the relationship between all pairs of groups is the same, there is only one set of coefficients.

anova(f)

##                 Wald Statistics          Response: RESULT 
## 
##  Factor                                           Chi-Square d.f. P     
##  LOCATION  (Factor+Higher Order Factors)          139.09     8    <.0001
##   All Interactions                                 36.51     4    <.0001
##  MATRIX  (Factor+Higher Order Factors)             60.09     5    <.0001
##   All Interactions                                 36.51     4    <.0001
##  LOCATION * MATRIX  (Factor+Higher Order Factors)  36.51     4    <.0001
##  TOTAL                                            151.09     9    <.0001

The Wald \(\chi^2\) statistic is 139 for LOCATION, 60.1 for MATRIX, and 36.5 for the interaction between depth and location (MATRIX:LOCATION). Let’s create a dot chart depicting the importance of variables in the model as measured by the Wald statistic and p-value.

plot(anova(f))

Let’s show the intercepts as a function of y to inspect the underlying conditional distribution. Let’s also inspect the distribution of residuals.

alphas <- coef(f)[1 : num.intercepts(f)]
yunique <- f$yunique[-1]
par(mfrow = c(1,2))
plot(yunique, alphas)
plot(ecdf(resid(ols(RESULT ~ LOCATION * MATRIX, data = d))), main = '')

Whereas confidence intervals for means are approximate, the confidence intervals for odds ratios or exceedance probabilities are correct for ordinal models.

M <- Mean(f)
Predict(f, LOCATION, MATRIX, fun = M)

##    LOCATION MATRIX     yhat    lower    upper
## 1       RA1     SB 4.423752 3.498615 5.348890
## 2       RA2     SB 4.442087 3.292227 5.591948
## 3       RA3     SB 3.199625 2.752734 3.646516
## 4       RA4     SB 4.238769 3.205413 5.272125
## 5       RA5     SB 6.311153 6.020567 6.601740
## 6       RA1     SS 6.317480 5.741308 6.893652
## 7       RA2     SS 8.593215 7.988621 9.197809
## 8       RA3     SS 3.597675 3.108999 4.086350
## 9       RA4     SS 7.077969 6.426724 7.729214
## 10      RA5     SS 6.594971 6.093381 7.096561
## 
## Response variable (y):  
## 
## Limits are 0.95 confidence limits

Let’s plot the predicted mean values.

plot(Predict(f, fun = M), ylab = 'Predicted Mean')

Let’s estimate the sample means with t-confidence limits.

with(d, summarize(RESULT, llist(LOCATION, MATRIX), smean.cl.normal))

##    LOCATION MATRIX   RESULT    Lower    Upper
## 1       RA1     SB 4.816087 3.514020 6.118154
## 2       RA1     SS 6.372400 5.724396 7.020404
## 3       RA2     SB 4.580909 2.680446 6.481373
## 4       RA2     SS 8.544000 7.943056 9.144944
## 5       RA3     SB 3.100000 2.849411 3.350589
## 6       RA3     SS 3.594800 3.175911 4.013689
## 7       RA4     SB 4.571429 1.759652 7.383205
## 8       RA4     SS 7.206800 6.364318 8.049282
## 9       RA5     SB 6.334933 6.187271 6.482596
## 10      RA5     SS 6.598000 6.269077 6.926923

Let’s compute the contrasts of the estimated regression coefficients obtained from the OLR model fit, along with standard errors, confidence limits, Z-statistics, and p-values.

contrast(
  f,
  list(
    LOCATION = c('RA1', 'RA2', 'RA3', 'RA4', 'RA5'),
    MATRIX = c('SS', 'SB')
  )
)

##      LOCATION MATRIX    Contrast      S.E.      Lower      Upper     Z Pr(>|z|)
##  1        RA1     SS  2.31648976 0.6997159  0.9450719  3.6879076  3.31   0.0009
##  2        RA2     SS  5.02211664 0.7236544  3.6037800  6.4404533  6.94   0.0000
##  3        RA3     SS -1.03589081 0.6556802 -2.3210004  0.2492188 -1.58   0.1141
##  4        RA4     SS  3.27049995 0.7314778  1.8368298  4.7041701  4.47   0.0000
##  5        RA5     SS  2.66690061 0.6810062  1.3321530  4.0016482  3.92   0.0001
##  6*       RA1     SB  0.00000000 0.0000000  0.0000000  0.0000000   NaN      NaN
##  7        RA2     SB  0.02214307 0.9108468 -1.7630838  1.8073700  0.02   0.9806
##  8        RA3     SB -1.57465875 0.6539419 -2.8563613 -0.2929562 -2.41   0.0160
##  9        RA4     SB -0.22479643 0.8628462 -1.9159440  1.4663511 -0.26   0.7945
## 10        RA5     SB  2.30849860 0.6249717  1.0835766  3.5334206  3.69   0.0002
## 
## Redundant contrasts are denoted by *
## 
## Confidence intervals are 0.95 individual intervals

One of the assumptions underlying OLR is that the relationship between each pair of outcome groups is the same. In other words, OLR assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories. This is called the proportional odds assumption.

Let’s plot the logit transformation of the group-stratified empirical cumulative distribution functions for the response variable. First, let’s create the plot stratified by MATRIX using the Ecdf() function from the Hmisc R package.

Ecdf( ~ RESULT, group = MATRIX, data = d, fun = qlogis,
     xlab = 'log(Result)', ylab = expression(logit~(F[n](x))))

Now, let’s create the plot of the logit transformation of the empirical cumulative distribution functions stratified by LOCATION.

Ecdf( ~ RESULT, group = LOCATION, data = d, fun = qlogis,
     xlab = 'log(Result)', ylab = expression(logit~(F[n](x))))

Non-parallelism is readily seen, indicating non-proportional odds. If PO doesn’t hold, an OLR model may still be better than the alternatives (note that the Wilcoxon and Kruskal-Wallis tests also assume PO to have optimal power). When PO does not hold, the odds ratio from the proportional odds model represents a kind of average odds ratio, and there is an almost one-to-one relationship between the odds ratio and the concordance probability, which is a simple translation of the Wilcoxon statistic (Harrell 2022c).

References

Harrell, F.E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd Edition. Springer, Switzerland.

Harrell, F.E. 2022a. If You Like the Wilcoxon Test You Must Like the Proportional Odds Model. https://www.fharrell.com/post/wpo/. April.

Harrell, F.E. 2022b. Equivalence of Wilcoxon Statistic and Proportional Odds Model. https://www.fharrell.com/post/powilcoxon. April.

Harrell, F.E. 2022c. Violation of Proportional Odds is Not Fatal. https://www.fharrell.com/post/po/. April.

Liu, Q., B.E. Shepherd, C. Li, and F.E. Harrell. 2017. Modeling continuous response variables using ordinal regression. Stat Med. 36, 4316-4335.

Posted on:: April 17, 2022

Length:: 10 minute read, 1948 words

See Also: