Paper Type
Complete
Abstract
This study investigates the LLM-as-a-Judge paradigm as a critical Information System (IS) whose reliability and neutrality determine organizational adoption. Using dictionary definition evaluation as a methodologically neutral testbed, we conduct 8,000 blind pairwise comparisons between definitions from five established English dictionaries and four large language models, judged by the same four LLMs. Our results reveal a pronounced position bias (Definition A preferred 65.7% of the time), a massive self-preference bias (up to +73.3 percentage points when a model judges its own output), and moderate inter-judge agreement (Fleiss’ kappa = 0.357). Lexical diversity analysis shows that LLM-generated definitions are shorter yet lexically comparable to human dictionaries. These structural biases constitute barriers to trust and adoption. Explicit bias measurement, diversified blind evaluation pipelines, and human-calibrated assessment are essential for reliable deployment of LLM-as-a-Judge systems in organizational contexts.
Paper Number
1446
Recommended Citation
Piaget, Jonathan; Rosselet, Ulysse; and Gaspoz, Cédric, "Structural Biases in LLM-as-a-Judge Systems: Implications for reliable IS Adoption" (2026). AMCIS 2026 Proceedings. 1.
https://aisel.aisnet.org/amcis2026/sig_svs/svs/1
Structural Biases in LLM-as-a-Judge Systems: Implications for reliable IS Adoption
Comments
SIG SVS