How to Calculate Summary Statistics for Left-Censored Data

By Charles Holbert

March 28, 2022

Introduction

Environmental data are frequently “left-censored”, indicating that some values are less than the limit of detection for the analytical methods employed. These data are problematic because censored (nondetect) values are known only to range between zero and the censoring limit. This complicates analysis of the data, including estimating statistical parameters, characterizing data distributions, and conducting inferential statistics.

Replacing nondetects with an arbitrary value, such as one-half the detection or reporting limit, artificially changes the computed variability of the data, resulting in statistics that likely are inaccurate and irreproducible. Calculated values may be far from their true values, and the amount and type of deviation is unknown. Because probability values, confidence intervals, and other statistics are inherently linked to the estimate of variance, these substitutions make it statistically difficult to detect true environmental concerns. The United States Environmental Protection Agency (USEPA) (2015) discourages substitution of nondetects with an arbitrary value. Worse still is excluding nondetects altogether, which produces a strong upward bias in all subsequent measures of location (e.g., means and medians) and removes the primary signal used in hypothesis tests when comparing groups of data (Helsel 2012). Fortunately, methods are available for analyzing data containing a mixture of detects and nondetects that make few or no assumptions about the data and that do not substitute arbitrary values for the nondetects.

Statistical Methods

For data sets containing nondetects with multiple censoring limits, several estimation methods are available, including maximum likelihood estimation (MLE), the Kaplan-Meier (KM) product-limit estimator (Kaplan and Meier 1958), and robust regression on order statistics (ROS).

Maximum Likelihood Estimation (MLE; parametric)
The MLE method assumes that the data follow a particular distribution. The method fits a distribution to the data that matches both the values of the detected observations and the proportion of observations falling below each detection limit. The information contained in nondetects is captured by the proportion of data falling below each detection limit. Because the model uses the parameters of a probability distribution, MLE is a fully parametric approach and all inference is based on the assumed distribution. A limitation of MLE is that the appropriate probability distribution must be specified, and the sample size should be large enough to verify that the assumed distribution is reasonable. MLE methods generally do not work well for small data sets (fewer than 30 to 50 detected values), in which one or two outliers can throw off the estimation, or where there is insufficient evidence to know whether the assumed distribution fits the data well (Helsel 2012).
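As a minimal sketch of the call, the EnvStats package (introduced below) provides censored-data MLE estimators. The data here are purely hypothetical and intentionally small, so the caution above about small data sets still applies; cens is TRUE where a result is a nondetect reported at its detection limit.

# Hypothetical left-censored data: nondetects are reported at their
# detection limits and flagged as censored
library(EnvStats)

x    <- c(1, 3, 5, 5, 7, 12, 18, 25)
cens <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# MLE of the mean and CV, assuming a lognormal distribution
elnormAltCensored(x, cens, method = 'mle')$parameters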

Kaplan-Meier (KM; nonparametric)
The KM method is a standard nonparametric method for computing descriptive statistics of censored data that does not require specification of an assumed distribution. A percentile is assigned to each detected observation, starting at the largest detected value, and working down the data set, based on the number of detects and nondetects above and below each observation. Percentiles are not assigned to nondetects, but nondetects affect the percentiles calculated for detected observations. The survival curve, a step function plot of the cumulative distribution function that jumps up at each uncensored value and is flat between uncensored values, gives the shape of the data. Estimates of the mean and standard deviation are computed from the estimated cumulative distribution function and these estimates then can be used in parametric statistical tests. The KM method has been shown in simulation studies to be one of the best methods for computing the mean and confidence intervals for left-censored environmental concentration data (Singh et al. 2006).
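As a sketch with the same hypothetical data, KM estimates of the mean and standard deviation can be obtained with the enparCensored() function, which is also used in the example calculations below.

# KM estimates of the mean and standard deviation for hypothetical
# left-censored data (nondetects flagged as censored)
library(EnvStats)

x    <- c(1, 3, 5, 5, 7, 12, 18, 25)
cens <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

enparCensored(x, cens)$parameters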

Robust Regression on Order Statistics (ROS; semi-parametric)
The ROS method is semi-parametric in the sense that part of the inference is based on an assumed distribution and part is based on the observed values, without any distributional assumption. A regression of the detected concentrations versus the normal quantiles of their plotting positions is computed, and values for the unknown nondetects are predicted from the regression equation based on their normal scores (Helsel 2012). The predicted concentrations for the nondetects are combined with the actual detected concentrations to obtain a “full” data set, which is used to compute summary statistics. Because any necessary back-transformations are performed on the individual imputed values rather than on the mean, the transformation bias that affects fully parametric ROS is avoided.
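Again as a sketch with the same hypothetical data, robust ROS estimates are available through the enormCensored() function with method = 'rROS', the same call used in the example calculations below; lb.impute = 0 prevents negative imputed concentrations.

# Robust ROS estimates of the mean and standard deviation for
# hypothetical left-censored data
library(EnvStats)

x    <- c(1, 3, 5, 5, 7, 12, 18, 25)
cens <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

enormCensored(x, cens, method = 'rROS', lb.impute = 0)$parameters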

Which procedure to use depends on the sample size and the amount of censoring. When the percentage of censored data is less than 50% and the sample size is small, either KM or ROS works reasonably well. With a larger sample size, but a similar censoring percentage, MLE generally works better. Details of these estimation methods can be found in USEPA (2015), Helsel (2012), and Singh et al. (2006).

Example Calculations

For this example, we will calculate summary statistics using the KM method and ROS. These methods are recommended by the USEPA (2009; 2015) for environmental data containing censored values. Calculations will be performed using EnvStats, an R package for environmental statistics (Millard 2013).

The data used in this example come from an example data set (EPA.09.Table.9.1.TCE.df) contained in the EnvStats package. These data consist of trichloroethene (TCE) concentrations, given in milligrams per liter (mg/L), measured over time at two different groundwater monitoring wells, as presented in Table 9.1 of USEPA (2009). A few observations are annotated with a data qualifier of U, representing a nondetect, or J, denoting an estimated detected concentration. For the purpose of this example, the estimated concentrations will be treated as unqualified detected concentrations.

After loading the data, we see there are 30 observations of 5 variables.

# Load libraries
library(dplyr)
library(EnvStats)

# Load data
datin <- EPA.09.Table.9.1.TCE.df
str(datin)
## 'data.frame':	30 obs. of  5 variables:
##  $ Date.Collected: Factor w/ 15 levels "1/15/2007","1/2/2005",..: 2 12 14 6 3 10 13 5 4 1 ...
##  $ Date          : Date, format: "2005-01-02" "2005-04-07" ...
##  $ Well          : Factor w/ 2 levels "Well.1","Well.2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ TCE.mg.per.L  : num  0.005 0.005 0.004 0.006 0.004 0.009 0.017 0.045 0.05 0.07 ...
##  $ Data.Qualifier: Factor w/ 3 levels "","J","U": 3 3 2 1 3 1 1 1 1 1 ...

Let’s check the data for missing results.

datin[is.na(datin$TCE.mg.per.L),]
##    Date.Collected       Date   Well TCE.mg.per.L Data.Qualifier
## 13      10/5/2007 2007-10-05 Well.1           NA               
## 24     10/17/2006 2006-10-17 Well.2           NA               
## 29     10/29/2007 2007-10-29 Well.2           NA

There are three missing results. Let’s remove the samples with missing results and prep the data for statistical analysis. We will convert the concentrations from milligrams per liter to micrograms per liter (µg/L) and add a few columns to the data, including Detected, Censored, and Label columns. The Label column is simply the combination of the measured result and data qualifier, if present. This column is used to print the last measured result with the appropriate data qualifier.

# Remove missing values from data set
dat <- datin %>%
  filter(!is.na(TCE.mg.per.L)) %>%
  rename(Qualifier = Data.Qualifier) %>%
  mutate(
    Result = 1000 * TCE.mg.per.L,
    Units = 'ug/L',
    Param = 'TCE',
    Detected = ifelse(Qualifier == 'U', 'N', 'Y'),
    Censored = ifelse(Qualifier == 'U', TRUE, FALSE),
    Label = ifelse(Qualifier == '', as.character(Result),
                   paste(Result, Qualifier))
  ) %>%
  select(
    Well, Date, Param, Detected, Result, Units, Qualifier, Label, Censored
  ) %>%
  droplevels()
str(dat)
## 'data.frame':	27 obs. of  9 variables:
##  $ Well     : Factor w/ 2 levels "Well.1","Well.2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date     : Date, format: "2005-01-02" "2005-04-07" ...
##  $ Param    : chr  "TCE" "TCE" "TCE" "TCE" ...
##  $ Detected : chr  "N" "N" "Y" "Y" ...
##  $ Result   : num  5 5 4 6 4 9 17 45 50 70 ...
##  $ Units    : chr  "ug/L" "ug/L" "ug/L" "ug/L" ...
##  $ Qualifier: Factor w/ 3 levels "","J","U": 3 3 2 1 3 1 1 1 1 1 ...
##  $ Label    : chr  "5 U" "5 U" "4 J" "6" ...
##  $ Censored : logi  TRUE TRUE FALSE FALSE TRUE FALSE ...

Now, let’s create a new data set for a fictitious well (Well.3) in which more than 50% of the values are nondetects. After creating the data, we will add it to the previous data set.

# Create new data set with greater than 50% nondetects
newdat <- dat %>%
  filter(Well == 'Well.1') %>%
  mutate(
    Well = 'Well.3',
    Result = 0.1 * Result,
    Detected = ifelse(Result < 5, 'N', 'Y'),
    Censored = ifelse(Result < 5, TRUE, FALSE),
    Qualifier = ifelse(Result < 5, 'U', ''),
    Label = ifelse(Qualifier == '', as.character(Result),
                   paste(Result, Qualifier))
  )

# Combine datasets and sort
dat <- dat %>%
  bind_rows(newdat) %>%
  arrange(Param, Well, Date)

Now, let’s calculate summary statistics for TCE in our three groundwater monitoring wells.

# Calculate summary statistics
cdat <- dat %>%
  group_by(Param, Well) %>%
  summarize(
    UNITS   = unique(Units),
    COUNT   = length(Result),
    DET     = length(Result[Detected == 'Y']),
    CEN     = length(Result[Detected == 'N']),
    PER.DET = round(DET*100/COUNT, 2),
    MIN.CEN = tryCatch(min(Result[Detected == 'N'], na.rm = T),
                       warning = function(e) NA),
    MAX.CEN = tryCatch(max(Result[Detected == 'N'], na.rm = T),
                       warning = function(e) NA),
    MIN.DET = tryCatch(min(Result[Detected == 'Y'], na.rm = T),
                       warning = function(e) NA),
    MAX.DET = tryCatch(max(Result[Detected == 'Y'], na.rm = T),
                       warning = function(e) NA),
    LASTVAL  = Label[which.max(Date)],
    LASTDATE = max(Date),
    MEAN.KM = tryCatch(enparCensored(Result, Censored)$parameters['mean'],
                       error = function(e) mean(Result, na.rm = T)),
    MEDIAN.KM = tryCatch(median(ppointsCensored(Result, Censored,
                                prob.method = 'kaplan-meier')$Order.Statistics),
                         error = function(e) median(Result, na.rm = T)),
    SD.KM = tryCatch(enparCensored(Result, Censored)$parameters['sd'],
                     error = function(e) sd(Result, na.rm = T)),
    CV.KM = tryCatch(SD.KM/MEAN.KM, error = function(e) NA),
    MEAN.ROS = tryCatch(enormCensored(Result, Censored, method = 'rROS',
                                      lb.impute = 0)$parameters['mean'],
                        error = function(e) mean(Result, na.rm = T)),
    SD.ROS = tryCatch(enormCensored(Result, Censored, method = 'rROS',
                                    lb.impute = 0)$parameters['sd'],
                      error = function(e) sd(Result, na.rm = T)),
    SKEWNESS = tryCatch(skewness(Result, na.rm = TRUE, method = 'fisher'),
                        error = function(e) NA),
    KURTOSIS = tryCatch(kurtosis(Result, na.rm = TRUE, method = 'fisher', excess = TRUE),
                        error = function(e) NA),
    SW.PVAL = tryCatch(shapiro.test(Result[Detected == 'Y'])$p.value,
                       error = function(e) NA)
  ) %>%
  ungroup() %>%
  data.frame()

# View results
str(cdat)
## 'data.frame':	3 obs. of  22 variables:
##  $ Param    : chr  "TCE" "TCE" "TCE"
##  $ Well     : chr  "Well.1" "Well.2" "Well.3"
##  $ UNITS    : chr  "ug/L" "ug/L" "ug/L"
##  $ COUNT    : int  14 13 14
##  $ DET      : int  11 10 6
##  $ CEN      : int  3 3 8
##  $ PER.DET  : num  78.6 76.9 42.9
##  $ MIN.CEN  : num  4 99 0.4
##  $ MAX.CEN  : num  5 100 4.5
##  $ MIN.DET  : num  4 107 5
##  $ MAX.DET  : num  250 170 25
##  $ LASTVAL  : chr  "250" "110" "25"
##  $ LASTDATE : Date, format: "2007-12-30" "2007-12-30" ...
##  $ MEAN.KM  : num  63.07 117.92 5.87
##  $ MEDIAN.KM: num  31 110 3.1
##  $ SD.KM    : num  76.11 19.23 7.85
##  $ CV.KM    : num  1.207 0.163 1.336
##  $ MEAN.ROS : num  62.21 114.07 5.64
##  $ SD.ROS   : num  79.69 24.91 8.31
##  $ SKEWNESS : num  1.47 1.64 1.47
##  $ KURTOSIS : num  1.38 3.07 1.38
##  $ SW.PVAL  : num  0.0503 0.0211 0.5004

The summary statistics include the total number of samples, the number of detects and nondetects, and the frequency of detection; the minimum and maximum concentrations for detects and nondetects; the mean, median, standard deviation, and coefficient of variation (CV); and distribution characteristics of each data set. The distributional statistics include skewness, kurtosis, and the probability that the detected results came from a normal distribution as calculated by the Shapiro-Wilk test (Shapiro and Wilk 1965). The output also includes the last result and sample date for each well.

If a data set is a mixture of detects and nondetects, but the nondetect fraction is no more than 50%, a censored estimation method such as the KM method or ROS can be used to compute adjusted estimates of the summary statistics. Because parameter estimation can suffer for data sets with low detection frequencies, USEPA (2009) recommends that these methods not be used when more than 50% of the data are nondetects. The code snippet shown below revises the summary statistics to account for those groups of data containing greater than 50% nondetects.

# Change values if less than 50% detections
cdat2 <- cdat %>%
  mutate(
    MEAN.KM = ifelse(PER.DET >= 50, MEAN.KM, NA),
    MEDIAN.KM = ifelse(PER.DET >= 50, MEDIAN.KM, NA),
    SD.KM = ifelse(PER.DET >= 50, SD.KM, NA),
    CV.KM = ifelse(PER.DET >= 50, CV.KM, NA),
    MEAN.ROS = ifelse(PER.DET >= 50, MEAN.ROS, NA),
    SD.ROS = ifelse(PER.DET >= 50, SD.ROS, NA),
    SKEWNESS = ifelse(PER.DET >= 50, SKEWNESS, NA),
    KURTOSIS = ifelse(PER.DET >= 50, KURTOSIS, NA),
    SW.PVAL = ifelse(PER.DET >= 50, SW.PVAL, NA)
  )

Now that we have our summary statistics, we can save the results to a comma-delimited file. Although our example data contained only one analyte and three wells, the calculations can be applied to a much larger data set containing a long time history for any number of analytes and wells.
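For example, a single call writes the revised statistics to disk (the file name here is arbitrary):

# Save summary statistics to a comma-delimited file
write.csv(cdat2, 'tce_summary_stats.csv', row.names = FALSE)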

References

Helsel, D.R. 2012. Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley and Sons, NJ.

Kaplan, E.L. and P. Meier. 1958. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association 53, 457-481.

Millard, S.P. 2013. EnvStats: An R Package for Environmental Statistics. Springer-Verlag, NY.

Shapiro, S.S. and M.B. Wilk. 1965. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.

Singh, A., R. Maichle, and S. Lee. 2006. On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March.

United States Environmental Protection Agency (USEPA). 2009. Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities: Unified Guidance. EPA 530/R-09-007. Office of Resource Conservation and Recovery, Washington, DC, March.

USEPA. 2015. ProUCL Version 5.1 Technical Guide, Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. EPA/600/R-07/041. Office of Research and Development, Washington, DC, October.
