Paper Number

1452

Paper Type

Short

Description

The expectation that conversational agents (also known as “chatbots”) communicate in a human-like way dates back to the proposal of the “Turing Test” in the 1950s, and it has grown in recent years with technological breakthroughs in large language models such as ChatGPT. However, the current evaluation of conversational agents lacks a top-down theoretical framework, as most evaluation instruments are created by computer science researchers in a bottom-up manner. Thus, the resulting evaluation dimensions may work well in certain contexts but fail to generalize to human-like conversations. In this paper, we design a novel theory-driven survey instrument for evaluating conversational agents, based on the results of our mapping mechanism between theoretical measures (TMs) in linguistics and existing empirically developed dimensions (EDs). We further identify the most representative EDs for each TM through theory-constrained clustering in an empirical study.

Comments

20-Theory

Dec 15th, 12:00 AM

Back to Principles: Theory-driven Evaluation of AI-based Conversational Agents
