Abstract

Automatic coding of Polish clinical text is still under-explored. We therefore benchmark six transformers—DISTILBERT, HERBERT, POLBERT, POLKA 1.1B CHAT, XLM-ROBERTA and PAPUGAPT2—on 31 034 de-identified phrases grouped into six clinical categories. A unified fine-tuning and statistically rigorous pipeline (Kruskal–Wallis/ANOVA, Holm-corrected Dunn tests, Cliff's δ, bootstrap CIs) over 186 210 predictions shows that architecture alone explains ~96% of the variance in top-1 confidence (H = 1.7 × 10^5, p < 10^-300). Multilingual XLM-ROBERTA leads; only POLBERT overlaps meaningfully (|δ| = 0.30), whereas all other pairs are near-maximally separated (|δ| > 0.90). The top-quartile confidence peaks at 0.68—below clinical automation thresholds—highlighting the need for domain-specific pre-training and macro-F1 evaluation. Open-source code and templates make the benchmark fully reproducible and extensible for Polish biomedical NLP.
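The statistical pipeline named above can be sketched in a few lines of Python. This is an illustrative reconstruction on synthetic data, not the paper's actual code: the model names and score distributions are invented, and pairwise Mann–Whitney U tests with a hand-rolled Holm step-down correction stand in for the Holm-corrected Dunn tests used in the paper.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for per-model top-1 confidence scores
# (the real benchmark pools 186 210 predictions from six transformers).
scores = {
    "model_a": rng.beta(8, 4, size=400),
    "model_b": rng.beta(6, 6, size=400),
    "model_c": rng.beta(4, 8, size=400),
}

# Omnibus test: do confidence distributions differ across architectures?
H, p = stats.kruskal(*scores.values())

# Pairwise tests with Holm's step-down correction (stand-in for Dunn tests).
pairs = list(combinations(scores, 2))
raw_p = np.array([stats.mannwhitneyu(scores[a], scores[b]).pvalue
                  for a, b in pairs])
order = np.argsort(raw_p)
adj_p = np.empty_like(raw_p)
running = 0.0
for rank, idx in enumerate(order):
    running = max(running, (len(raw_p) - rank) * raw_p[idx])
    adj_p[idx] = min(running, 1.0)

def cliffs_delta(x, y):
    """Cliff's delta: P(X > Y) - P(X < Y), an ordinal effect size in [-1, 1]."""
    x, y = np.asarray(x), np.asarray(y)
    gt = (x[:, None] > y[None, :]).sum()
    lt = (x[:, None] < y[None, :]).sum()
    return (gt - lt) / (x.size * y.size)

def bootstrap_ci(x, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean confidence."""
    means = np.array([rng.choice(x, size=x.size).mean() for _ in range(n_boot)])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

delta = cliffs_delta(scores["model_a"], scores["model_c"])
lo, hi = bootstrap_ci(scores["model_a"])
```

With well-separated synthetic distributions, δ approaches its extremes, mirroring the |δ| > 0.90 separations the abstract reports for most model pairs.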

Recommended Citation

Cieślak, D. & Czyżewski, A. (2025). Annotator-Aware Evidential Learning for Polish Clinical Sentences. In I. Luković, S. Bjeladinović, B. Delibašić, D. Barać, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Empowering the Interdisciplinary Role of ISD in Addressing Contemporary Issues in Digital Transformation: How Data Science and Generative AI Contributes to ISD (ISD2025 Proceedings). Belgrade, Serbia: University of Gdańsk, Department of Business Informatics & University of Belgrade, Faculty of Organizational Sciences. ISBN: 978-83-972632-1-5. https://doi.org/10.62036/ISD.2025.33

Paper Type

Short Paper

DOI

10.62036/ISD.2025.33


Annotator-Aware Evidential Learning for Polish Clinical Sentences