Blog

Data analytics, statistics, and more

Feature Selection Methods for Machine Learning

Feature selection is a core concept in machine (statistical) learning that can have a significant impact on model performance. This post examines various methods for identifying the most important predictor variables, those that explain the variance of the response variable.

February 14, 2020

Creating Static Maps Using R

Use the functionality of R and R packages to create both simple maps and complex maps containing many different layers.

September 18, 2019

Testing Group Differences with Data Containing Non-detects

Often data from more than two groups needs to be evaluated, usually on the basis of a representative value from each group. This post examines the use of survival analysis techniques to test whether surface water samples containing a high frequency of censored (non-detect) values differ in dissolved lead concentration between various watersheds.

September 13, 2019

Outlier Detection Using Machine Learning

There is no precise way to define and identify outliers in general because of the specifics of each dataset. This post evaluates three methods for multivariate outlier detection, including Mahalanobis distance (a multivariate extension to standard univariate tests) and two machine learning (clustering) techniques.

September 9, 2019

Introduction to Statistical Intervals

The issue of uncertainty in estimating population parameters from data samples is often addressed using statistical intervals. The three types of statistical interval differ in their definitions as well as their typical applications. It is important to fully understand the assumptions and limitations underlying the use, interpretation, and calculation of statistical intervals before applying them.

August 6, 2019

Power of the Mann-Kendall Test

An important objective of many environmental monitoring programs is to detect changes or trends in constituent concentrations over time. The Mann-Kendall test is one of the most popular nonparametric tests for detecting temporal trends. This post uses Monte Carlo simulation to evaluate the power of the Mann-Kendall test to identify a trend for various sample sizes and levels of variability in the data.

July 21, 2019

Fitting Distributions with Censored Data

Many statistical analyses depend on the type of data distribution. This post explores goodness-of-fit methods for the lognormal, gamma, and normal distributions when data contain censored (non-detect) values.

June 26, 2019

Censored Regression

Regression performed using censored data can be challenging. Common practices for handling censored data include deleting the censored observations or substituting non-detects with arbitrary constants, generally some fraction of the detection limit. These approaches tend to be biased and cause a loss of information. Censored regression methods produce more accurate and robust estimates than these bias-prone methods.

March 3, 2019

Robust Regression

Ordinary least squares regression is optimal when all regression assumptions are valid. When some of these assumptions are invalid, least squares regression can perform poorly. Robust regression is an alternative to least squares regression when data contain outliers or influential observations.

July 30, 2018

Space Shuttle Challenger Disaster

On January 28, 1986, the space shuttle Challenger disintegrated 73 seconds after liftoff from Kennedy Space Center. The most disturbing part of the disaster was that the O-ring failure had been foreseen by the manufacturer’s engineers, who were unable to convince managers to delay the launch. A better analysis and visualization of the data could have improved the decision-making process and potentially built a stronger case for the engineers about the effect of cold weather on O-ring functionality.

June 25, 2018