Fine-Tuned Transformers and Large Language Models for Entity Recognition in Complex Eligibility Criteria for Clinical Trials

Abstract

This paper evaluates the \texttt{gpt-4-turbo} model’s proficiency in recognizing named entities within clinical trial eligibility criteria. We apply prompt learning to a dataset comprising $49\,903$ criteria from $3\,314$ trials, with $120\,906$ annotated entities across 15 classes. We compare the performance of \texttt{gpt-4-turbo} to state-of-the-art BERT-based Transformer models\footnote{Due to page limits, detailed results and code listings are presented in the supplementary material available at \url{https://github.com/megaduks/isd24}}. Contrary to expectations, BERT-based models outperform \texttt{gpt-4-turbo} after only moderate fine-tuning, particularly in low-resource settings. The \texttt{CODER} model consistently surpasses the others in both low- and high-resource environments, likely owing to its term normalization and extensive pre-training on the UMLS thesaurus. However, it is important to recognize that traditional NER evaluation metrics, such as precision, recall, and the $F_1$ score, can unfairly penalize generative language models even when they correctly identify entities, because exact span matching grants no credit for boundary mismatches.
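
To make the last point concrete, below is a minimal Python sketch (not the paper's evaluation code) of entity-level precision, recall, and $F_1$ under exact span matching. The entity spans, the \texttt{Condition} label, and the function name are hypothetical illustrations: a generative model that finds the right entity with slightly shifted boundaries receives zero credit.

def strict_f1(gold, pred):
    """Entity-level precision/recall/F1 under exact (start, end, label) matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # only exact triple matches count as true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical gold annotation: "type 2 diabetes" as a Condition, chars 0-15.
gold = [(0, 15, "Condition")]
# A generative model returning "diabetes" (chars 7-15) identifies the entity
# but scores as both a false positive and a false negative.
pred = [(7, 15, "Condition")]

print(strict_f1(gold, pred))  # (0.0, 0.0, 0.0)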

Recommended Citation

Kantor, K. & Morzy, M. (2024). Fine-Tuned Transformers and Large Language Models for Entity Recognition in Complex Eligibility Criteria for Clinical Trials. In B. Marcinkowski, A. Przybylek, A. Jarzębowicz, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Harnessing Opportunities: Reshaping ISD in the post-COVID-19 and Generative AI Era (ISD2024 Proceedings). Gdańsk, Poland: University of Gdańsk. ISBN: 978-83-972632-0-8. https://doi.org/10.62036/ISD.2024.53

Paper Type

Poster

DOI

10.62036/ISD.2024.53
