Abstract
This research investigates the perceived naturalness of synthesized speech in the context of Polish medical terminology, a critical factor for applications such as voice-enabled medical dialogue systems. We conducted a comparative analysis of three speech synthesis models: SpeechGen, ElevenLabs, and a version of ToucanTTS fine-tuned on a specialized corpus of Polish medical recordings. The evaluation combined an objective measure, the NISQA metric, with subjective assessments collected through Mean Opinion Score (MOS) surveys. Our findings indicate that SpeechGen and ElevenLabs produce synthesized speech that closely rivals the naturalness of human speech, as evidenced by both NISQA scores and MOS ratings. In contrast, despite improvements, the fine-tuned ToucanTTS model did not achieve comparable levels of perceived naturalness. Notably, participants occasionally rated the advanced synthesized speech as more natural than human speech recorded in non-studio environments, underscoring the potential of these technologies in real-world applications. This study emphasizes the significance of naturalness in enhancing user experience, particularly in specialized linguistic domains. It also provides insights into the current capabilities and limitations of speech synthesis for less-resourced languages like Polish.
Paper Type
Short Paper
DOI
10.62036/ISD.2025.38
Comparing Speech Synthesis Models for Polish Medical Speech Naturalness
Recommended Citation
Krasiński, W., Rośleń, P., Czyzewski, A. & Zielonka, M. (2025). Comparing Speech Synthesis Models for Polish Medical Speech Naturalness. In I. Luković, S. Bjeladinović, B. Delibašić, D. Barać, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Empowering the Interdisciplinary Role of ISD in Addressing Contemporary Issues in Digital Transformation: How Data Science and Generative AI Contributes to ISD (ISD2025 Proceedings). Belgrade, Serbia: University of Gdańsk, Department of Business Informatics & University of Belgrade, Faculty of Organizational Sciences. ISBN: 978-83-972632-1-5. https://doi.org/10.62036/ISD.2025.38