Paper Number
ECIS2026-2024
Paper Type
CRP
Abstract
Data quality (DQ) defects such as missing values substantially impair the performance of machine learning (ML) models by causing predictive uncertainty. This prompts organizations to invest in DQ improvement for better decision quality. However, given the vast and ever-increasing volume of data available, it is rarely feasible to repair all defects, raising the question of which defects should be prioritized. This paper introduces an approach to prioritize repairs of individual missing values, based on predictive uncertainty. By estimating the extent to which each missing value contributes to predictive uncertainty, the approach prioritizes those whose repair improves model performance most effectively. The approach is evaluated through structured experiments across datasets from multiple domains and with established ML models, showing significantly higher model performance than existing practices. It thereby enables more effective use of limited repair resources and paves the way for higher-quality decisions in ML-informed decision-making processes.
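The general idea of ranking missing values by their contribution to predictive uncertainty can be illustrated with a minimal sketch. It is not the method of the paper, whose estimator is not described in this abstract. It assumes a simple proxy: for each missing cell, the cell is filled with every observed value of the same feature, and the variance of the resulting predictions is taken as that cell's uncertainty contribution. The names `toy_model` and `rank_missing_cells`, the model weights, and the example data are all hypothetical.

```python
import math
from statistics import mean, pvariance


def toy_model(row):
    """Hypothetical fitted model: logistic score that depends mostly on feature 0."""
    z = 3.0 * row[0] + 0.1 * row[1]
    return 1.0 / (1.0 + math.exp(-z))


def rank_missing_cells(rows, predict):
    """Rank missing cells (None) by the prediction variance they induce.

    Assumption: the spread of predictions over plausible fills is used as a
    proxy for a cell's contribution to predictive uncertainty. Every feature
    must have at least one observed value.
    """
    n_features = len(rows[0])
    # Observed values per feature serve as candidate fills.
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_features)]
    means = [mean(col) for col in observed]

    scores = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Fill all gaps in this row with feature means, then vary only cell (i, j).
            base = [means[k] if v is None else v for k, v in enumerate(row)]
            preds = []
            for candidate in observed[j]:
                filled = list(base)
                filled[j] = candidate
                preds.append(predict(filled))
            scores.append(((i, j), pvariance(preds)))

    # Highest variance first: these cells would be repaired first.
    return sorted(scores, key=lambda s: s[1], reverse=True)


rows = [
    [-1.0, 5.0],
    [1.0, -5.0],
    [None, 0.0],   # missing value in the influential feature
    [0.5, None],   # missing value in the weak feature
    [0.0, 2.0],
]
ranking = rank_missing_cells(rows, toy_model)
for cell, score in ranking:
    print(cell, round(score, 4))
```

In this toy setup the missing value in the influential feature ranks ahead of the one in the weak feature, so a limited repair budget would be spent on it first. A real application would replace the proxy with a proper uncertainty estimate and a trained model.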
Recommended Citation
Hägele, Lukas Jakob and Traub, Sebastian, "Repair What Matters: Uncertainty-Aware Prioritization For Data Quality Defects" (2026). ECIS 2026 Proceedings. 12.
https://aisel.aisnet.org/ecis2026/datasc_isresearch/datasc_isresearch/12