Abstract

This study examines the reliability of automatic evaluation metrics in assessing responses generated by large language models (LLMs) in the context of university recruitment. A total of 113 domain-specific questions were used to prompt five prominent LLMs, each in three configurations: basic, document-context, and internet-context. The generated responses were evaluated using three categories of metrics: lexical, semantic, and LLM-as-a-Judge. These metric-based assessments were subsequently compared with expert evaluations conducted using a 5-point Likert scale. The findings indicate that although automatic metrics offer considerable efficiency, their consistency with expert judgments varies substantially. Moreover, the results suggest that both the model configuration and its underlying architecture significantly affect evaluation outcomes. Among the metric categories, LLM-as-a-Judge appears to yield the highest alignment with expert assessments, suggesting that it is the most reliable of the three approaches.
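The abstract does not specify how agreement between automatic metric scores and expert Likert ratings was quantified. As a minimal sketch of one plausible approach, the snippet below computes Spearman rank correlation between the two score series; the data values, variable names, and the choice of correlation measure are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: quantifying agreement between an automatic metric and
# expert Likert ratings with Spearman rank correlation. All values are
# illustrative placeholders, not data from the study.
from scipy.stats import spearmanr

# Automatic metric scores per response (e.g., a semantic similarity in [0, 1])
metric_scores = [0.81, 0.64, 0.92, 0.55, 0.73, 0.88]
# Expert ratings for the same responses on a 5-point Likert scale
expert_ratings = [4, 3, 5, 2, 4, 5]

# Spearman's rho measures how consistently the metric ranks responses
# in the same order as the experts do.
rho, p_value = spearmanr(metric_scores, expert_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rank-based measure like this is often preferred over Pearson correlation in such comparisons because Likert ratings are ordinal rather than interval-scaled.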

Recommended Citation

Balsamski, B., Kanclerz, J., Put, D., & Stal, J. (2025). Expert Versus Metric-Based Evaluation: Testing the Reliability of Evaluation Metrics in Large Language Models Assessment. In I. Luković, S. Bjeladinović, B. Delibašić, D. Barać, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Empowering the Interdisciplinary Role of ISD in Addressing Contemporary Issues in Digital Transformation: How Data Science and Generative AI Contributes to ISD (ISD2025 Proceedings). Belgrade, Serbia: University of Gdańsk, Department of Business Informatics & University of Belgrade, Faculty of Organizational Sciences. ISBN: 978-83-972632-1-5. https://doi.org/10.62036/ISD.2025.50

Paper Type

Short Paper

DOI

10.62036/ISD.2025.50

