Paper Type

Complete

Abstract

The design of data collection interfaces can significantly influence the quality of training data for machine learning (ML) systems. Class-based designs, which structure information into predefined categories, are widely used, but their rigid classifications risk omitting critical details. This study compares class-based data collection with an instance-based approach, in which contributors freely describe observed phenomena, to assess the impact of each on ML performance. We conducted a controlled experiment in autonomous driving: participants (N = 260) described driving scenes using either a class-based or an instance-based interface. ML models trained on the resulting datasets were evaluated for predictive accuracy. Results show that instance-based data outperformed class-based data across all models, with higher accuracy and robustness, improving generalization to real-world complexity. For settings where data is crowdsourced to train ML models, this work shows the benefits of an instance-based data collection design that integrates structured input with open-ended reporting to enhance ML performance.

Paper Number

2298

Author Connect URL

https://authorconnect.aisnet.org/conferences/AMCIS2025/papers/2298

Comments

SIGASYS

Best Paper Nominee badge
Top 25 Paper Badge
 
Aug 15th, 12:00 AM

The Cost of Rigidity: How Class-Based Data Limits Machine Learning Performance

