Paper Number

1724

Paper Type

Complete Research Paper

Abstract

Large language models such as ChatGPT-4 are said to have major potential in digital education. However, empirical research to date has focused mostly on the disadvantages, such as cheating in exams, while possible advantages, such as essay scoring in exams, are typically discussed only in theory. In this study, 100 answers to each of two essay tasks in an exam at a German university were scored by human scorers and by ChatGPT-4. Overall, ChatGPT-4 awarded significantly more points than the human scorers, and this effect was particularly strong for the more complex of the two tasks. Although good answers often tend to be longer than bad ones, a high correlation between answer length and score was found even for wrong answers. To make the results easier to interpret, they were further analyzed using a cluster analysis, which identified four clusters.

Jun 14th, 12:00 AM

Does Content Matter? — an Empirical Investigation of ChatGPT-4's Ability to Score Essay Tasks in Exams
