Paper Number

ECIS2026-2024

Paper Type

CRP

Abstract

Data quality (DQ) defects such as missing values substantially impair the performance of machine learning (ML) models by causing predictive uncertainty. This prompts organizations to invest in DQ improvement for better decision quality. However, given the vast and ever-increasing volume of data available, it is rarely feasible to repair all defects, raising the question of which defects should be prioritized. This paper introduces an approach to prioritize repairs of individual missing values, based on predictive uncertainty. By estimating the extent to which each missing value contributes to predictive uncertainty, the approach prioritizes those whose repair improves model performance most effectively. The approach is evaluated through structured experiments, conducted across datasets from multiple domains and established ML models, showing significantly higher model performance compared to existing practices. It thereby enables more effective use of limited repair resources and paves the way for higher-quality decisions in ML-informed decision-making processes.
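To make the idea of ranking missing values by their contribution to predictive uncertainty more concrete, the following is a minimal sketch under stated assumptions. It is not the paper's method, which the abstract does not specify. It assumes a fixed logistic model with made-up weights and uses a simple proxy for uncertainty: how much the prediction varies when a missing cell is filled with each value observed in its column. All names (`predict`, `rank_missing_cells`, the toy data) are hypothetical.

```python
import math
from statistics import mean, pvariance


def predict(row, weights, bias):
    """Toy logistic model standing in for any trained ML model (assumption)."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def rank_missing_cells(rows, weights, bias):
    """Rank missing cells (None) by a simple uncertainty proxy.

    For each missing cell, the prediction is recomputed once per value
    observed in the same column; the variance of those predictions is the
    cell's score. Other missing cells in the row are held at the column mean.
    Assumes every column has at least one observed value.
    """
    n_cols = len(weights)
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    col_means = [mean(vals) for vals in observed]

    scores = []
    for i, row in enumerate(rows):
        # Baseline row with every missing cell filled by its column mean.
        base = [col_means[j] if v is None else v for j, v in enumerate(row)]
        for j, v in enumerate(row):
            if v is not None:
                continue
            preds = []
            for candidate in observed[j]:
                trial = list(base)
                trial[j] = candidate
                preds.append(predict(trial, weights, bias))
            scores.append(((i, j), pvariance(preds)))

    # Highest score first: repairing these cells is expected to matter most.
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy data (invented for illustration): feature 0 strongly drives the model,
# feature 1 only weakly.
rows = [
    [1.0, 10.0],
    [-1.0, 20.0],
    [None, 15.0],
    [0.5, None],
    [-0.5, 5.0],
]
weights = [3.0, 0.1]
bias = -1.25

ranking = rank_missing_cells(rows, weights, bias)
for (row_idx, col_idx), score in ranking:
    print(f"row {row_idx}, column {col_idx}: score {score:.4f}")
```

In this toy setup the missing value in the influential feature ranks ahead of the one in the weak feature, so a limited repair budget would be spent on it first. A realistic implementation would replace the variance proxy with whatever uncertainty estimate the chosen model supports.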

Jun 14th, 12:00 AM

Repair What Matters: Uncertainty-Aware Prioritization For Data Quality Defects
