Paper Type
Complete
Abstract
The design of data collection interfaces can significantly influence the quality of training data for machine learning (ML) systems. Class-based designs, which structure information into predefined categories, are widely used, but their rigid classifications risk omitting critical details. This study compares class-based data collection with an instance-based approach, in which contributors freely describe observed phenomena, to assess the impact of each on ML performance. We conducted a controlled experiment in the autonomous driving domain: participants (N = 260) described driving scenes using either a class-based or an instance-based interface. ML models trained on the resulting datasets were evaluated for predictive accuracy. Models trained on instance-based data outperformed those trained on class-based data across all models, with higher accuracy and robustness, improving generalization to real-world complexity. For settings in which data is crowdsourced to train ML models, this work shows the benefits of an instance-based data collection design that integrates structured input with open-ended reporting.
Paper Number
2298
Recommended Citation
Khosh Raftar Nouri, Aida and Parsons, Jeffrey, "The Cost of Rigidity: How Class-Based Data Limits Machine Learning Performance" (2025). AMCIS 2025 Proceedings. 5.
https://aisel.aisnet.org/amcis2025/acctinfosys/acctinfosys/5
The Cost of Rigidity: How Class-Based Data Limits Machine Learning Performance


Comments
SIGASYS