Paper Type

Complete

Abstract

This study investigates the LLM-as-a-Judge paradigm as a critical Information System (IS) whose reliability and neutrality determine organizational adoption. Using dictionary definition evaluation as a methodologically neutral testbed, we conduct 8,000 blind pairwise comparisons between definitions from five established English dictionaries and four large language models, judged by the same four LLMs. Our results reveal a pronounced position bias (Definition A preferred 65.7% of the time), a massive self-preference bias (up to +73.3 percentage points when a model judges its own output), and moderate inter-judge agreement (Fleiss’ kappa = 0.357). Lexical diversity analysis shows that LLM-generated definitions are shorter yet lexically comparable to human dictionaries. These structural biases constitute barriers to trust and adoption. Explicit bias measurement, diversified blind evaluation pipelines, and human-calibrated assessment are essential for reliable deployment of LLM-as-a-Judge systems in organizational contexts.
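As a rough illustration of two of the quantities reported in the abstract, the sketch below computes a position-bias rate (the share of comparisons won by the first-listed option) and Fleiss' kappa from the standard formula. The function names, the "A"/"B" labels, and the toy data are assumptions made for this example; they are not taken from the paper, whose data layout and exact procedure are not described here.

```python
from collections import Counter


def position_bias_rate(choices):
    """Share of pairwise comparisons in which the first-listed option ("A") was chosen."""
    return sum(1 for c in choices if c == "A") / len(choices)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of judges.

    `ratings` is a list of lists: one inner list per item, holding one
    category label per judge (hypothetical layout chosen for this sketch).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the proportion of agreeing judge pairs.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (invented for illustration): four comparisons, four judges each.
toy_ratings = [
    ["A", "A", "A", "A"],
    ["A", "A", "B", "B"],
    ["B", "B", "B", "B"],
    ["A", "B", "B", "B"],
]
toy_choices = [label for item in toy_ratings for label in item]

bias = position_bias_rate(toy_choices)
kappa = fleiss_kappa(toy_ratings)
```

In an unbiased blind setup with randomized ordering, the position-bias rate would be expected to sit near 0.5, so a value such as the 65.7% reported above indicates a preference tied to presentation order rather than content.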

Paper Number

1446

Comments

SIG SVS

Aug 15th, 12:00 AM

Structural Biases in LLM-as-a-Judge Systems: Implications for Reliable IS Adoption
