Quality Control
Before performing statistical analysis on the measured or modeled data for the site location, it’s important to make sure the data has been cleaned and screened for unrealistic abnormalities.
While the modeled data usually doesn’t need much cleaning due to the hindcast’s extensive validation, the measured data sources from buoys and stations are much more vulnerable to inconsistencies.
All QA procedures are implemented using functions from Pecos that are exposed through MHKiT.
Pecos reference:
- K.A. Klise and J.S. Stein (2016), Performance Monitoring using Pecos, Technical Report SAND2016-3583, Sandia National Laboratories, Albuquerque, NM.
Quality Control Parameters
The Pecos documentation for the functions used can be found in the pecos.monitoring module.
Corrupt Data
Uses the check_corrupt Pecos function.
Drops values equal to the user-specified corrupt values from the dataset. Some sources, such as NDBC, have known corrupt values, which are removed before the data is presented in the DLC tool.
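The idea can be illustrated with a minimal plain-Python sketch. This is not the Pecos implementation: the function name `drop_corrupt`, the sample record, and the placeholder value 99.0 are assumptions chosen for illustration, and flagged points are masked as NaN rather than deleted.

```python
import math

def drop_corrupt(values, corrupt_values):
    """Mask every entry that equals a known corrupt value with NaN."""
    corrupt = set(corrupt_values)
    return [math.nan if v in corrupt else v for v in values]

# Hypothetical wave height record (m) where 99.0 marks a bad reading
hs = [1.2, 99.0, 1.4, 99.0, 1.1]
cleaned = drop_corrupt(hs, corrupt_values=[99.0])

# Number of points removed by the test
n_dropped = sum(math.isnan(v) for v in cleaned)
```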
Range Tests
Uses the check_range Pecos function.
Defines the upper and lower bounds of the expected range of the data. This is helpful when you are familiar with an area and know, for example, that Significant Wave Height values should not exceed 10 m.
Values outside of the range are dropped.
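A minimal sketch of a range test in plain Python follows. It only mirrors the logic described above and is not the Pecos code; the function name `range_test`, the bounds, and the sample values are assumptions.

```python
import math

def range_test(values, lower=None, upper=None):
    """Mask values outside [lower, upper] and count why each was removed."""
    cleaned = []
    counts = {"below": 0, "above": 0}
    for v in values:
        if lower is not None and v < lower:
            counts["below"] += 1
            cleaned.append(math.nan)
        elif upper is not None and v > upper:
            counts["above"] += 1
            cleaned.append(math.nan)
        else:
            cleaned.append(v)
    return cleaned, counts

# Hypothetical significant wave height values (m), expected range 0 to 10 m
hs = [0.8, 2.5, 14.2, -0.3, 3.1]
cleaned, counts = range_test(hs, lower=0.0, upper=10.0)
```

The per-bound counts correspond to the "number of points dropped and why" summary described in the Results section.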
Delta Tests
Uses the check_delta Pecos function.
Checks for stagnant data and/or abrupt changes across a rolling window of the time series data. The delta is the difference between the maximum and minimum values in the window (max - min).
If the calculated delta falls below the lower bound or above the upper bound, the entire window is dropped.
The direction argument of the Pecos function is set to None, so a change is caught whether the max occurs before the min or the min occurs before the max in the rolling window.
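The window logic can be sketched in plain Python as shown below. This is a simplified stand-in, not the Pecos implementation: the window here is a fixed number of samples rather than a time span, and the function name `delta_test`, the bounds, and the sample data are assumptions.

```python
def delta_test(values, window, lower=None, upper=None):
    """Flag every point of any window whose (max - min) is out of bounds."""
    flagged = [False] * len(values)
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        # Order of max and min is ignored, like direction=None
        delta = max(chunk) - min(chunk)
        too_flat = lower is not None and delta < lower
        too_steep = upper is not None and delta > upper
        if too_flat or too_steep:
            for i in range(start, start + window):
                flagged[i] = True
    return flagged

# Abrupt change: a single spike fails every window that contains it
spiky = [1.0, 1.2, 1.1, 1.3, 6.0, 1.4, 1.2, 1.5, 1.3, 1.6]
spike_flags = delta_test(spiky, window=3, upper=2.0)

# Stagnant data: windows with no variation fall below the lower bound
flat = [2.0, 2.0, 2.0, 2.0, 2.3]
flat_flags = delta_test(flat, window=3, lower=0.01)
```

Because whole windows are flagged, one spike removes its neighbors as well, which is worth keeping in mind when choosing the window length.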
Outlier
Uses the check_outlier Pecos function.
Remove outliers, calculated across a rolling window, from normalized data. Data is normalized using:
\[\bar{x} = \frac{x-\mu}{\sigma}\]
Where:
- \(\mu\) is the mean of the dataset
- \(\sigma\) is the standard deviation of the dataset
Specify outliers via the number of standard deviations away from the mean.
The tool only accepts the upper bound parameter and passes absolute_value=True to the check_outlier function. The same bound is therefore applied symmetrically above and below the mean in the rolling window.
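The normalization and the effect of the absolute value can be sketched as follows. This is an illustrative plain-Python version applied to a single window, not the Pecos code; the function name `outlier_flags`, the use of the sample standard deviation, and the sample data are assumptions.

```python
import statistics

def outlier_flags(values, bound, absolute_value=True):
    """Normalize values as (x - mean) / std and flag those beyond the bound."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    if sigma == 0:
        return [False] * len(values)  # no spread, so nothing can be an outlier
    z = [(v - mu) / sigma for v in values]
    if absolute_value:
        # Same bound applied above and below the mean
        return [abs(s) > bound for s in z]
    # Without the absolute value only high outliers exceed an upper bound
    return [s > bound for s in z]

# Hypothetical window with one unrealistically low reading
hs = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 5.0, 4.9, 0.5]
flags_abs = outlier_flags(hs, bound=2.0, absolute_value=True)
flags_upper_only = outlier_flags(hs, bound=2.0, absolute_value=False)
```

With the absolute value the low reading is flagged; with an upper bound alone it would pass unnoticed.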
Results
Other than the two listed below, the results simply show the number of points dropped and why (above/below a bound) per test.
Timestamp
Timestamp checks are built on the check_timestamp function from Pecos and follow this high-level flow:
- Extract dominant temporal resolution in the time series data (measured data sources often aren’t entirely evenly spaced)
- Use the dominant temporal resolution and check_timestamp to locate the gaps in the data
- Use the dominant temporal resolution to calculate the percent of the data set that is missing
- Report the largest gaps found in the dataset from the results of check_timestamp
The checks give the following results:
- Period of record - cleaning can potentially remove significant portions of the data. Make sure the cleaning process didn’t reduce the years covered too much.
- Gaps in data - the five largest gaps in the data, if any exist. Look at the number of days missing, combined with the start and end dates, to see whether the data set is missing mostly winter months. The data could be a poor representation of the extreme wave conditions if winter is often missing.
- Temporal Resolution - observed spacing between measurements. The most frequently observed temporal resolution is used in statistical calculations.
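The timestamp flow above can be sketched in plain Python as shown below. This is a stand-in for illustration and does not call check_timestamp; the function name `timestamp_summary` and the hourly sample record are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

def timestamp_summary(times, top_n=5):
    """Return the dominant time step, the largest gaps, and percent missing."""
    times = sorted(times)
    steps = [b - a for a, b in zip(times, times[1:])]
    # Most frequently observed spacing between consecutive measurements
    dominant = Counter(steps).most_common(1)[0][0]
    # Any spacing longer than the dominant step counts as a gap
    gaps = [(b - a, a, b) for a, b in zip(times, times[1:]) if b - a > dominant]
    gaps.sort(key=lambda g: g[0], reverse=True)
    # Points expected over the period of record at the dominant resolution
    expected = (times[-1] - times[0]) // dominant + 1
    pct_missing = 100.0 * max(0, expected - len(times)) / expected
    return dominant, gaps[:top_n], pct_missing

# Hypothetical hourly record for one day with four missing hours
start = datetime(2020, 1, 1)
missing_hours = {5, 6, 7, 15}
times = [start + timedelta(hours=h) for h in range(24) if h not in missing_hours]
dominant, gaps, pct_missing = timestamp_summary(times)
```

Each gap is returned as (length, last timestamp before the gap, first timestamp after it), so the start and end dates can be inspected for seasonal patterns.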
Water Depth
When water depth data is available from the data source, it is displayed.
Some hindcast sources report a water depth of 0 m or a negative water depth. The hindcast grid only approximates the coastline, so some hindcast points end up outside of the water line.