Abstract

This study examines the reliability of automatic evaluation metrics in assessing responses generated by large language models (LLMs) in the context of university recruitment. A total of 113 domain-specific questions were used to prompt five prominent LLMs, each in three configurations: basic, document-context, and internet-context. The generated responses were evaluated using three categories of metrics: lexical, semantic, and LLM-as-a-Judge. These metric-based assessments were subsequently compared with expert evaluations conducted using a 5-point Likert scale. The findings indicate that although automatic metrics offer considerable efficiency, their consistency with expert judgments varies substantially. Moreover, the results suggest that both the model configuration and its underlying architecture significantly affect evaluation outcomes. Among the metric categories, LLM-as-a-Judge appears to yield the highest alignment with expert assessments, suggesting that it is the most reliable of the three approaches.
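The abstract does not specify how agreement between automatic metric scores and expert Likert ratings was quantified. As a minimal sketch of one plausible approach, the snippet below computes Spearman rank correlation between the two score series; the data values, variable names, and the choice of correlation measure are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: quantifying agreement between an automatic metric and
# expert Likert ratings with Spearman rank correlation. All values are
# illustrative placeholders, not data from the study.
from scipy.stats import spearmanr

# Automatic metric scores per response (e.g., a semantic similarity in [0, 1])
metric_scores = [0.81, 0.64, 0.92, 0.55, 0.73, 0.88]
# Expert ratings for the same responses on a 5-point Likert scale
expert_ratings = [4, 3, 5, 2, 4, 5]

# Spearman's rho measures how consistently the metric ranks responses
# in the same order as the experts do.
rho, p_value = spearmanr(metric_scores, expert_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A rank-based measure like this is often preferred over Pearson correlation in such comparisons because Likert ratings are ordinal rather than interval-scaled.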

Recommended Citation

Balsamski, B., Kanclerz, J., Put, D., & Stal, J. (2025). Expert Versus Metric-Based Evaluation: Testing the Reliability of Evaluation Metrics in Large Language Models Assessment. In I. Luković, S. Bjeladinović, B. Delibašić, D. Barać, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Empowering the Interdisciplinary Role of ISD in Addressing Contemporary Issues in Digital Transformation: How Data Science and Generative AI Contributes to ISD (ISD2025 Proceedings). Belgrade, Serbia: University of Gdańsk, Department of Business Informatics & University of Belgrade, Faculty of Organizational Sciences. ISBN: 978-83-972632-1-5. https://doi.org/10.62036/ISD.2025.50

Paper Type

Short Paper

DOI

10.62036/ISD.2025.50

