How to Analyze Data Containing Non-detects
By Charles Holbert, Jacobs
May 6, 2018
Why You Should Care
A wide range of management decisions are affected by left-censored observations because they affect not only the estimation of statistical parameters but also the characterization of data distributions and inferential statistics (e.g., comparing the means of two or more populations, estimating correlation coefficients, and constructing regression models). When non-detects are assigned a fraction of their detection limits, the resulting statistics are inaccurate and irreproducible; calculated values may be far from their true values, and the amount and direction of the deviation are unknown. Collecting and measuring environmental chemistry data requires a significant investment of time, effort, and money. Fortunately, statistically robust procedures are available for analyzing censored data that do not rely on arbitrary substitution values.
How Not to Analyze Censored Data
Ignore (or Drop) Them
The worst approach is excluding or deleting non-detects, as this discards information the original data contain. Excluding non-detects produces a strong upward bias in all subsequent measures of location (e.g., means and medians) and removes the primary signal used in hypothesis tests when comparing groups of data.
Substitute an Arbitrary Value
The most common procedure for managing non-detects continues to be substitution of some fraction of the detection limit. Although substitution is an easy method, it has no theoretical basis, it has been shown to give poor results in simulation studies, and the resulting statistics are biased and inconsistent (Helsel and Cohn 1988; Singh and Nocerino 2002; Leith et al. 2010; Shoari et al. 2016). Contrary to the general rule of thumb, substituting one-half the detection limit does not perform well even at low censoring levels. If an arbitrary value is substituted for non-detects, there is no variability among the replaced values, and the computed variability of the data is artificially changed. Because probability values, confidence intervals, and other statistics are inherently linked to the variance estimate, these substitutions make it statistically difficult to detect true environmental concerns. This may result in incorrect decisions about whether regulatory enforcement or remediation actions are justified.
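A small illustration with hypothetical data: substituting a constant for non-detects injects zero variability into the replaced values, and both the mean and the standard deviation shift with the analyst's choice of constant.

```python
# Illustration (hypothetical data): how substituting a constant for
# non-detects distorts summary statistics. The replaced values carry no
# variability, and results depend entirely on the substituted constant.
import statistics

detects = [2.1, 3.4, 4.8, 7.9, 12.5]   # measured concentrations
n_nd = 5                               # five non-detects, all at DL = 5
DL = 5.0

results = {}
for label, sub in [("0", 0.0), ("DL/2", DL / 2), ("DL", DL)]:
    full = detects + [sub] * n_nd      # every non-detect gets the same value
    results[label] = (statistics.mean(full), statistics.stdev(full))
    print(f"substitute {label:>4}: mean = {results[label][0]:.2f}, "
          f"sd = {results[label][1]:.2f}")
```

Three substitution choices, three different answers, with no way to tell which deviation from the truth is smaller.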
How to Correctly Analyze Censored Data
Maximum Likelihood Estimation (MLE; parametric)
The MLE method assumes that the data follow a particular distribution. The method fits a distribution to the data that matches both the values of the detected observations and the proportion of observations falling below each detection limit. The information contained in non-detects is captured efficiently by the proportion of data falling below each detection limit. Because the model uses the parameters of a probability distribution, MLE is a fully parametric approach, and all inference is based on the assumed distribution. A limitation of MLE is that the appropriate probability distribution must be specified, and the sample size should be large enough to verify that the assumed distribution is reasonable. MLE methods generally do not work well for small data sets (fewer than 30 to 50 detected values), where one or two outliers can throw off the estimation, or where there is insufficient evidence to judge whether the assumed distribution fits the data well (Helsel 2005).
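As a sketch of how censored MLE works (hypothetical data, assuming a lognormal distribution): each detected value contributes its log-density to the likelihood, while each non-detect contributes the log-probability of falling below its detection limit.

```python
# Censored MLE sketch for left-censored, lognormally distributed data.
# Detects contribute the log-density; non-detects contribute log P(X < DL).
# Data and detection limits are hypothetical.
import numpy as np
from scipy import optimize, stats

detects = np.array([2.1, 3.4, 4.8, 7.9, 12.5, 1.7, 6.2, 9.8])
dls = np.array([1.0, 1.0, 5.0, 5.0])   # detection limits of the non-detects

def neg_loglik(theta):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)          # optimize log(sigma) to keep sigma > 0
    # Fit on the log scale: normal likelihood of log-transformed data
    ll = stats.norm.logpdf(np.log(detects), mu, sigma).sum()
    ll += stats.norm.logcdf(np.log(dls), mu, sigma).sum()   # censored part
    return -ll

res = optimize.minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
# Mean of a lognormal from its log-scale parameters
mle_mean = np.exp(mu_hat + sigma_hat**2 / 2)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}, mean = {mle_mean:.2f}")
```

Note how the non-detects enter only through their censoring limits; no value is invented for them.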
Kaplan-Meier (KM; nonparametric)
The KM product-limit estimator (Kaplan and Meier 1958) is a nonparametric method designed to incorporate data with multiple censoring levels and does not require an assumed distribution. The method was originally developed for right-censored survival data, where the KM estimator estimates the survival curve, the complement of the cumulative distribution function. For left-censored data, the KM method estimates a cumulative distribution function that gives the probability of an observation being at or below a reported concentration. The estimated cumulative distribution function is a step function that jumps at each uncensored value and is flat between uncensored values. Simulation studies have shown KM to be one of the best methods for computing the mean and confidence intervals for left-censored environmental concentration data (Singh et al. 2006).
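A minimal sketch of the computation (hypothetical values): "flipping" every observation about a constant larger than the maximum converts left-censoring into the right-censoring the classic estimator handles, after which the mean is the area under the survival curve, flipped back.

```python
# Left-censored KM via the "flip" trick: subtract each value from a
# constant larger than the maximum, turning non-detects into
# right-censored observations. Data are hypothetical.
import numpy as np

obs = np.array([2.0, 4.0, 6.0, 9.0, 3.0, 5.0])          # reported values
det = np.array([True, True, True, True, False, False])  # False = non-detect (<3, <5)

M = obs.max() + 1.0
t = M - obs                          # flipped "survival times"
order = np.argsort(t)
t, d = t[order], det[order]

n = len(t)
surv, area, prev = 1.0, 0.0, 0.0
for i in range(n):
    if d[i]:                         # the curve only drops at detect times
        area += surv * (t[i] - prev)       # area under the step function
        surv *= (n - i - 1) / (n - i)      # n - i observations still at risk
        prev = t[i]

km_mean = M - area                   # flip the mean back to the original scale
print(f"KM mean = {km_mean:.3f}")
```

Censored observations never trigger a step down; they simply leave the risk set, which is how the estimator uses the information in each censoring limit.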
Robust Regression on Order Statistics (rROS; semi-parametric)
The rROS method is semi-parametric in the sense that part of the inference is based on an assumed distribution and part of the inference is based on the observed values, without any distributional assumption. A regression of the plotting positions of the detected contaminant concentrations versus normal quantiles is computed, and values for the unknown non-detects are predicted from the regression equation based on their scores (Helsel 2012). The predicted concentrations for the non-detects are combined with the actual detected concentrations to obtain a “full” data set, which can then be used to compute summary statistics. By performing any necessary back-transformations on individual values rather than the mean, the transformation bias that affects parametric ROS is avoided, which is one reason this variant of ROS is described as “robust.”
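A simplified sketch of the procedure, assuming a single detection limit below all detected values (the full method uses Hirsch-Stedinger plotting positions to handle multiple limits); data are hypothetical:

```python
# Simplified rROS sketch for a single detection limit. With one limit
# below every detect, the non-detects occupy the lowest ranks. Data are
# hypothetical; full rROS handles multiple limits via Hirsch-Stedinger
# plotting positions.
import numpy as np
from scipy import stats

detects = np.sort(np.array([2.1, 3.4, 4.8, 7.9, 12.5]))
n_nd = 3                       # three non-detects, all reported as < 2.0
n = len(detects) + n_nd

pp = np.arange(1, n + 1) / (n + 1)    # Weibull plotting positions
z = stats.norm.ppf(pp)                # corresponding normal quantiles

# Regress log concentration on normal quantiles using detects only
slope, intercept, *_ = stats.linregress(z[n_nd:], np.log(detects))

# Predict the non-detects, back-transforming each value individually
imputed = np.exp(intercept + slope * z[:n_nd])
full = np.concatenate([imputed, detects])
print(f"rROS mean = {full.mean():.2f}, sd = {full.std(ddof=1):.2f}")
```

Because each imputed value is back-transformed individually before the mean is computed, the transformation bias of fully parametric ROS is avoided.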
Recommended Guidelines
Which procedure to use depends on the sample size and the amount of censoring. When the censoring percentage is less than 50% and the sample size is small, either KM or rROS works reasonably well. With a larger sample size but a similar censoring percentage, the maximum likelihood method generally works better. General recommendations on which method to use are as follows:
- KM - small to moderate censoring (e.g., <50%) with multiple censoring limits
- rROS - moderate to large censoring (e.g., 50% - 80%) and small sample sizes (e.g., <50)
- MLE - moderate to large censoring (e.g., 50% - 80%) and large sample sizes (e.g., >50)
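These rules of thumb can be encoded in a small helper (an illustrative function, not part of any standard library):

```python
# Hypothetical helper encoding the rule-of-thumb guidelines above.
def suggest_method(pct_censored, n):
    """Suggest an estimator from the censoring percentage and sample size."""
    if pct_censored > 80:
        return "report proportions"   # too censored for reliable statistics
    if pct_censored < 50:
        return "KM"
    return "MLE" if n > 50 else "rROS"

print(suggest_method(26.2, 65))  # prints "KM" for the copper example below
```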
For very large amounts of censoring (e.g., >80%), report the proportions of data below or above the maximum censoring limit, rather than estimating statistics that are unreliable. If all observations are censored, it’s senseless to compute point estimates. The best approach for such data is to report the median, calculated as the median of the censoring levels (Helsel 2012). Inferences also can be made concerning probabilities of exceeding the censoring limit(s), based on binomial probabilities.
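As a sketch of such a binomial inference (hypothetical counts): an exact Clopper-Pearson interval for the probability of exceeding the censoring limit can be computed from the beta distribution.

```python
# Sketch: with heavy censoring, infer the probability of exceeding the
# censoring limit from the detect/non-detect counts (binomial model).
# Clopper-Pearson 95% interval via the beta distribution; counts are
# hypothetical.
from scipy import stats

n, k = 40, 5                     # 40 samples, 5 detected above the limit
p_hat = k / n
lower = stats.beta.ppf(0.025, k, n - k + 1)
upper = stats.beta.ppf(0.975, k + 1, n - k)
print(f"P(exceed) = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

This yields a defensible statement about exceedance probability without estimating a mean that the data cannot support.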
The recommendation to use the KM method for multiply censored data with up to 50% censoring follows its predominant use in other disciplines and well-developed theory. However, the KM method should not be used for data with a single censoring limit that is less than all detected observations. In this situation, the KM method is substitution in disguise. The advantages and disadvantages of each method are summarized below.
Method | Advantage | Disadvantage |
---|---|---|
MLE | Supported by consistency, efficiency, and asymptotic normality | Sensitive to distributional assumption; properties are valid for large sample sizes |
KM | Does not require data transformation or distributional assumption about data | Relies exclusively on quality of data; not applicable when non-detects are >50% |
rROS | Robust against distributional mis-specification and variations in skewness | Predicted observations are treated as actual observations |
Computing Summary Statistics
Applying the different methods for analyzing censored data to an example data set yields the summary statistics shown in the table below. These data represent copper concentrations measured in 65 shallow groundwater samples collected from an Alluvial Fan zone in the San Joaquin Valley, California (Millard 2013). There are 17 non-detect results (26.2%) reported at four different censoring limits (1, 5, 10, and 20 mg/L). Results from different substitution methods diverge considerably and values are significantly affected by choice of the substituted value. Results from the KM and rROS methods are similar while those based on MLE are moderately different from the other two methods.
Method | Mean | Median | Std. Dev. |
---|---|---|---|
Substitute with 0 | 3.06 | 2.00 | 3.89 |
Substitute with 1/2 DL | 3.94 | 2.50 | 3.74 |
Substitute with DL | 4.81 | 3.00 | 4.64 |
MLE | 3.34 | 2.57 | 4.06 |
KM | 3.61 | 2.00 | 3.62 |
rROS | 3.58 | 2.00 | 3.74 |
Summing Non-Detects
Properly incorporating non-detects in reported total concentrations or quantities faces the same constraints as calculating summary statistics. The censored values should not be ignored or assumed equal to some arbitrary value, especially when computing weighting factors such as toxicity equivalence factors (Helsel 2009). Summing censored values is simply the mean calculation in reverse. Because the mean is the sum of all values divided by the number of values summed, the sum of data containing censored values is computed by estimating the mean with one of the statistically robust procedures and multiplying that estimate by the number of observations. No substitutions or other guesses are necessary.
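A sketch of the calculation, using the KM mean from the copper example above as the robust estimate:

```python
# Summing censored data: estimate the mean with a robust method, then
# multiply by the number of observations. Values are the KM mean and
# sample size from the copper example above.
n_obs = 65
km_mean = 3.61                 # KM estimate of the mean
total = km_mean * n_obs        # the sum, with no substituted values
print(f"estimated total = {total:.2f}")
```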
References
Helsel, D.R. 2005. More than obvious: Better methods for interpreting nondetect data. Environ. Sci. Technol. 39:419A-423A.
Helsel, D.R. 2009. Summing nondetects: Incorporating low-level contaminants in risk assessment. Integrated Environmental Assessment and Management 6:361-366.
Helsel, D.R. 2012. Statistics for Censored Environmental Data Using Minitab and R, 2nd Edition. John Wiley and Sons, New Jersey.
Helsel, D.R. and T.A. Cohn. 1988. Estimation of descriptive statistics for multiply censored water quality data. Water Resources Research 24:1997-2004.
Kaplan, E.L. and P. Meier. 1958. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53:457-481.
Leith, K.F., W.W. Bowerman, M.R. Wierda, D.A. Best, T.G. Grubb, and J.G. Sikarske. 2010. A comparison of techniques for assessing central tendency in left-censored data using PCB and p,p’DDE contaminant concentrations from Michigan’s Bald Eagle Biosentinel Program. Chemosphere 80:7-12.
Millard, S.P. 2013. EnvStats: An R Package for Environmental Statistics. Springer, NY.
Shoari, N., J.S. Dube, and S. Chenouri. 2016. On the use of the substitution method in left-censored environmental data. Human and Ecological Risk Assessment 22:435-446.
Shoari, N. and J.S. Dube. 2017. Toward improved analysis of concentration data: Embracing nondetects. Environmental Toxicology and Chemistry 37:643-656.
Singh, A. and J. Nocerino. 2002. Robust estimation of mean and variance using environmental data sets with below detection limit observations. Chemometrics and Intelligent Laboratory Systems 60:69-86.
Singh A., R. Maichle, and S.E. Lee. 2006. On The Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets With Below Detection Limit Observations. Washington, DC: U.S. Environmental Protection Agency EPA/600/R-06/022.