Abstract
Data bias is a recognized problem in research and data-driven decision-making. Existing treatments focus on how data was collected: sampling bias from non-representative selection, measurement bias from recording errors, annotation bias from inconsistent labeling, and omitted variable bias from excluded variables (Mehrabi et al., 2021). These accounts share a common assumption: the data collection schema is neutral. The schema, meaning the set of categories that structure what gets recorded, is treated as a given rather than as a source of bias. This paper challenges that assumption and introduces classification-induced data bias: systematic distortion arising from the imposition of a class-based schema at the point of data collection. Class-based schemas, which require contributors to assign every observation to a predefined category, reflect the foundational assumption of structured data modeling (Parsons & Wand, 2000) and are the prevailing practice in real-world data collection platforms (Lukyanenko et al., 2019). Under this assumption of inherent classification, schema designers fix a set of classes before collection begins, embedding their conceptualization into the data architecture. Phenomena that do not fit this structure cannot be recorded naturally: contributors must force-fit observations, record them inaccurately, or abandon the contribution. Lukyanenko et al. (2019) demonstrate this empirically, showing that class-based schemas significantly suppress both the completeness and accuracy of contributed observations. We define classification-induced data bias as systematic distortion arising when a class-based schema requires contributors to assign observations to predefined categories, causing exclusion, misclassification, or suppression of phenomena that do not conform to the designer’s conceptualization.
This construct is structurally distinct from existing bias types. Unlike sampling bias, it concerns which properties of observed phenomena can be captured, not which units are selected. Unlike omitted variable bias, which concerns variables absent from the analysis of already-collected data, it operates at the moment of recording itself, before any observation enters the dataset. Unlike measurement bias, its distortion originates in the design of the schema rather than in the schema's application. It manifests through three mechanisms: exclusion (phenomena outside predefined classes cannot be recorded), forced misclassification (observations are mapped to the nearest available class regardless of fit), and participation suppression (contributors who cannot classify disengage, producing non-random attrition). The result is data that over-represents phenomena anticipated by designers and under-represents novel or atypical ones. We propose that classification-induced bias is more severe in complex, multi-dimensional domains and when contributors are less familiar with the class structure. This bias matters because it distorts which phenomena can enter a dataset and thereby affects the validity of downstream analysis; future research should develop methods to detect it and evaluate design alternatives that reduce its effects.
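The three mechanisms can be made concrete with a toy sketch. The following Python fragment is illustrative only and is not drawn from the paper: the class list, the `Observation` fields, and the decision rule in `record` are all hypothetical simplifications, and the example observations are invented to exercise each branch rather than taken from any dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical set of classes fixed by a schema designer before collection.
SCHEMA_CLASSES = {"robin", "sparrow", "crow"}


@dataclass
class Observation:
    true_kind: str                # what the contributor actually observed
    nearest_class: Optional[str]  # closest predefined class, if the contributor sees one
    will_force_fit: bool          # whether the contributor accepts an ill-fitting class


def record(obs: Observation) -> Tuple[Optional[str], str]:
    """Return (recorded_class, outcome) under a class-based schema.

    Assumed decision rule, for illustration only:
    - the observation matches a predefined class -> recorded as is
    - no predefined class is even close          -> exclusion
    - contributor picks the nearest class anyway -> forced misclassification
    - contributor declines to pick and leaves    -> participation suppression
    """
    if obs.true_kind in SCHEMA_CLASSES:
        return obs.true_kind, "recorded"
    if obs.nearest_class is None:
        return None, "exclusion"
    if obs.will_force_fit:
        return obs.nearest_class, "forced_misclassification"
    return None, "participation_suppression"


# Invented observations, one per branch of the rule above.
observations = [
    Observation("robin", "robin", True),
    Observation("house finch", "sparrow", True),
    Observation("house finch", "sparrow", False),
    Observation("unidentified moth", None, False),
]

outcomes = Counter(record(obs)[1] for obs in observations)
print(dict(outcomes))
```

In this sketch only the first observation enters the dataset accurately. The second is stored under a class that does not describe it, and the last two leave no record at all, so the resulting data contain only designer-anticipated classes even though three of the four observations fell outside them.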
Recommended Citation
Nouri, Aida and Parsons, Jeffrey, "Classification-Induced Data Bias: How Class-Based Schemas Distort Data at Collection" (2026). AMCIS 2026 TREOs. 172.
https://aisel.aisnet.org/treos_amcis2026/172