Paper Type

ERF

Abstract

We present a novel multi-stage large language model (LLM) framework for generating high-quality, realistic synthetic medical datasets. Access to real-world medical data is limited by privacy and regulatory constraints, which makes synthetic alternatives critical. Our approach utilises multiple LLMs in a staged pipeline to create detailed patient profiles and authentic medical dialogues. The process begins by generating diverse chief complaints, augmented with culturally relevant demographic data, and continues with structured patient profile synthesis that incorporates realistic medical histories, conditions, and treatment plans. Medical interviews are then generated to simulate natural clinical conversations while maintaining linguistic and clinical accuracy. Rigorous evaluation via n-gram frequency analysis and manual review demonstrates enhanced diversity and reduced bias compared to naive generation methods. This framework effectively bridges the gap between data privacy and medical realism.
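The abstract names n-gram frequency analysis as one of the evaluation methods but does not specify the metric. As a rough illustration only, the sketch below computes a distinct-n ratio (unique n-grams divided by total n-grams) over a set of generated texts. The function names `ngrams`, `distinct_n`, and `top_ngrams`, the whitespace tokenisation, and the choice of distinct-n are all assumptions made for this example, not the authors' method.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, n):
    """Count n-grams across texts (hypothetical helper; lowercased, whitespace-split)."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    return counts


def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams; higher suggests more lexical diversity."""
    counts = ngram_counts(texts, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def top_ngrams(texts, n=2, k=5):
    """Most frequent n-grams, useful for spotting over-represented phrasings."""
    return ngram_counts(texts, n).most_common(k)
```

Under these assumptions, one could compare `distinct_n` for a staged pipeline's output against a single-prompt baseline, and inspect `top_ngrams` to see which phrasings dominate each set.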

Paper Number

1906

Author Connect URL

https://authorconnect.aisnet.org/conferences/AMCIS2025/papers/1906

Comments

IntelFuture

Aug 15th, 12:00 AM

A Multi-Stage LLM Framework for Generating Realistic Synthetic Medical Datasets

