Paper Number

ECIS2026-1675

Paper Type

SP

Abstract

Artificial intelligence (AI) systems with conversational interfaces are increasingly used to augment humans and even automate complex workflows. Yet systematic functional testing remains difficult because of open-ended multi-turn input. Existing evaluation benchmarks focus on isolated single-turn performance and provide limited insight into how systems behave in realistic interaction scenarios. This paper investigates how synthetic multi-turn conversations can be generated to support scalable functional testing of conversational AI systems. Using a design science research approach, we develop an artifact that employs LLM-based role play to generate synthetic conversations for testing an insurance claim decision system. The generation process combines software testing techniques such as equivalence partitioning and boundary value analysis with persona-based user simulation. We report preliminary results from the first design cycle and derive six design principles for generating diverse, faithful, and diagnostically useful conversational test data.
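
To illustrate how equivalence partitioning and boundary value analysis might be combined with persona-based user simulation, the following is a minimal sketch under stated assumptions, not the artifact described in the paper. The partition names, claim amount thresholds, persona labels, and the helper functions `boundary_values` and `build_test_specs` are all hypothetical. Each resulting specification could serve as the seed for one LLM role-played conversation.

```python
from itertools import product

# Hypothetical equivalence partitions of a claim amount (closed integer ranges).
# Labels and thresholds are illustrative assumptions, not taken from the paper.
PARTITIONS = {
    "auto_approve": (0, 500),
    "manual_review": (501, 5000),
    "reject": (5001, 100000),
}

# Hypothetical user personas for the simulated claimant.
PERSONAS = ["terse", "verbose", "confused"]


def boundary_values(low, high):
    """Return the edges of a closed range, their inner neighbours, and a midpoint."""
    mid = (low + high) // 2
    return sorted({low, low + 1, mid, high - 1, high})


def build_test_specs(partitions, personas):
    """Cross every partition with every persona and expand to boundary values."""
    specs = []
    for (label, (low, high)), persona in product(partitions.items(), personas):
        for amount in boundary_values(low, high):
            specs.append(
                {"persona": persona, "claim_amount": amount, "expected": label}
            )
    return specs


specs = build_test_specs(PARTITIONS, PERSONAS)
```

In this sketch each specification carries its expected decision, so a generated conversation can be checked against the outcome its partition implies.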

Jun 14th, 12:00 AM

Generating Synthetic Multi-Turn Conversations for Scalable Functional Testing of Conversational AI Systems
