Paper Number
ECIS2026-1675
Paper Type
SP
Abstract
Artificial intelligence (AI) systems with conversational interfaces are increasingly used to augment human work and even to automate complex workflows. Yet systematic functional testing remains difficult because user input is open-ended and unfolds over multiple turns. Existing evaluation benchmarks focus on isolated single-turn performance and provide limited insight into how systems behave in realistic interaction scenarios. This paper investigates how synthetic multi-turn conversations can be generated to support scalable functional testing of conversational AI systems. Using a design science research approach, we develop an artifact that employs LLM-based role play to generate synthetic conversations for testing an insurance claim decision system. The generation process combines software testing techniques such as equivalence partitioning and boundary value analysis with persona-based user simulation. We report preliminary results from the first design cycle and derive six design principles for generating diverse, faithful, and diagnostically useful conversational test data.
Recommended Citation
Weller, Niklas; Cai, Shijing; and Zhou, Syang, "Generating Synthetic Multi-Turn Conversations For Scalable Functional Testing Of Conversational AI Systems" (2026). ECIS 2026 Proceedings. 10.
https://aisel.aisnet.org/ecis2026/datasc_isresearch/datasc_isresearch/10
Generating Synthetic Multi-Turn Conversations For Scalable Functional Testing Of Conversational AI Systems