Paper Number
1724
Paper Type
Complete Research Paper
Abstract
Large language models such as ChatGPT-4 are said to have major potential in digital education. However, empirical research to date has focused mostly on their disadvantages, such as cheating in exams, while possible advantages, like scoring essay tasks in exams, are typically discussed only in theory. In this study, 100 answers to each of two essay tasks from an exam at a German university were scored by human scorers and by ChatGPT-4. Overall, ChatGPT-4 awarded significantly more points than the human scorers, and this effect was particularly strong for the more complex of the two tasks. Although good answers tend to be longer than bad ones, a high correlation between answer length and ChatGPT-4's scores was demonstrated even for wrong answers. To make the results more interpretable, they were further analyzed using a cluster analysis, which identified four clusters.
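The two statistical steps named in the abstract, a length–score correlation and a cluster analysis yielding four clusters, can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration: the data are synthetic, and the method choices (Pearson correlation, k-means on length and the two scores) are not taken from the paper, which does not spell out its exact pipeline here.

```python
# Illustrative sketch only: synthetic data and assumed methods
# (Pearson correlation, k-means with k=4), not the authors' actual analysis.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical exam data: answer length in words, human score, ChatGPT-4 score.
length = rng.integers(20, 400, size=100)
human = rng.uniform(0, 10, size=100)
gpt = np.clip(0.02 * length + rng.normal(0, 1, size=100), 0, 10)

# Correlation between answer length and the ChatGPT-4 score.
r, p = pearsonr(length, gpt)
print(f"length vs. ChatGPT-4 score: r = {r:.2f}, p = {p:.3f}")

# Cluster answers on length and both scores; k=4 mirrors the four clusters
# reported in the abstract.
features = np.column_stack([length, human, gpt])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print("answers per cluster:", np.bincount(labels))
```

With data like the abstract describes, a high r for length vs. ChatGPT-4 score even on the subset of wrong answers would indicate that the model rewards length rather than content.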
Recommended Citation
Hartmann, Philipp; Bialas, Sanoa-Amina; Hobert, Sebastian; and Schumann, Matthias, "Does Content Matter? — an Empirical Investigation of ChatGPT-4's Ability to Score Essay Tasks in Exams" (2024). ECIS 2024 Proceedings. 3.
https://aisel.aisnet.org/ecis2024/track13_learning_teach/track13_learning_teach/3
Does Content Matter? — an Empirical Investigation of ChatGPT-4's Ability to Score Essay Tasks in Exams