Paper Type
ERF
Abstract
We present a novel multi-stage large language model (LLM) framework for generating high-quality, realistic synthetic medical datasets. Real-world medical data is limited by privacy and regulatory constraints; hence, synthetic alternatives are critical. Our approach utilises multiple LLMs in a staged pipeline to create detailed patient profiles and authentic medical dialogues. The process begins with generating diverse chief complaints, augmented by culturally relevant demographic data, and continues with structured patient profile synthesis, incorporating realistic medical histories, conditions, and treatment plans. Medical interviews follow, simulating natural clinical conversations while maintaining linguistic and clinical accuracy. Rigorous evaluation via n-gram frequency analysis and manual review demonstrates enhanced diversity and reduced biases compared to naive generation methods. This framework effectively bridges the gap between data privacy and medical realism.
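As an illustration of the n-gram frequency analysis mentioned in the abstract, the following is a minimal sketch of one common way to quantify lexical diversity in generated text. The abstract does not state the exact metric used, so the distinct-n ratio (unique n-grams divided by total n-grams), the whitespace tokenisation, and the function names `ngrams`, `distinct_n`, and `top_ngrams` are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, n=2):
    """Count n-grams over a corpus (hypothetical helper).

    Tokenisation is a naive lowercase whitespace split, which is an
    assumption; a real evaluation would likely use a proper tokenizer.
    """
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams; higher means more diverse."""
    counts = ngram_counts(texts, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def top_ngrams(texts, n=2, k=3):
    """Most frequent n-grams, useful for spotting repeated phrasing."""
    return ngram_counts(texts, n).most_common(k)


# Toy corpora standing in for generated chief complaints (illustrative only).
repetitive = ["patient reports chest pain", "patient reports chest pain"]
varied = ["patient reports chest pain", "sudden headache since morning"]

print(distinct_n(repetitive, 2))  # 0.5
print(distinct_n(varied, 2))      # 1.0
```

Under these assumptions, a corpus that repeats the same phrasing scores lower than one with varied phrasing, and the most frequent n-grams point to the templates a generator tends to overuse.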
Paper Number
1906
Recommended Citation
Holysz, Mikolaj; Bystronski, Mateusz; Piotrowski, Grzegorz Aleksander; Chodak, Grzegorz; and Kajdanowicz, Tomasz, "A Multi-Stage LLM Framework for Generating Realistic Synthetic Medical Datasets" (2025). AMCIS 2025 Proceedings. 10.
https://aisel.aisnet.org/amcis2025/intelfuture/intelfuture/10
A Multi-Stage LLM Framework for Generating Realistic Synthetic Medical Datasets
Comments
IntelFuture