Abstract
This paper evaluates the \texttt{gpt-4-turbo} model’s proficiency in recognizing named entities within clinical trial eligibility criteria. We apply prompt learning to a dataset comprising $49\,903$ criteria from $3\,314$ trials, with $120\,906$ annotated entities in 15 classes. We compare the performance of \texttt{gpt-4-turbo} to state-of-the-art BERT-based Transformer models\footnote{Due to page limits, detailed results and code listings are presented in the supplementary material available at https://github.com/megaduks/isd24}. Contrary to expectations, BERT-based models outperform \texttt{gpt-4-turbo} after moderate fine-tuning, particularly in low-resource settings. The \texttt{CODER} model consistently surpasses the others in both low- and high-resource settings, likely owing to term normalization and extensive pre-training on the UMLS thesaurus. However, it is important to recognize that traditional NER evaluation metrics, such as precision, recall, and the $F_1$ score, can unfairly penalize generative language models even when they correctly identify entities.
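To make the prompt-learning setup concrete, the following is a minimal sketch of how a generative model such as \texttt{gpt-4-turbo} might be queried to extract entities from a single eligibility criterion. It is not the authors' implementation: the prompt wording, the example entity classes shown, and the use of the \texttt{openai} Python client are illustrative assumptions only; the supplementary material cited above contains the actual code.

\begin{verbatim}
# Illustrative sketch of prompt-based NER with gpt-4-turbo.
# Assumptions: the `openai` Python package (>= 1.0) is installed and
# OPENAI_API_KEY is set; prompt text and entity classes are hypothetical.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You annotate clinical trial eligibility criteria. "
    "Extract every named entity and assign it one of the given classes "
    "(e.g. Condition, Drug, Procedure, Value). "
    "Return a JSON list of {\"text\": ..., \"class\": ...} objects."
)

criterion = "Patients with type 2 diabetes on metformin >= 1500 mg daily."

response = client.chat.completions.create(
    model="gpt-4-turbo",
    temperature=0,  # deterministic output for reproducible evaluation
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": criterion},
    ],
)

# The returned annotations must still be aligned with gold spans before
# computing precision, recall, and F1 -- the step at which strict span
# matching can penalize otherwise correct generative outputs.
print(response.choices[0].message.content)
\end{verbatim}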
Paper Type
Poster
DOI
10.62036/ISD.2024.53
Fine-Tuned Transformers and Large Language Models for Entity Recognition in Complex Eligibility Criteria for Clinical Trials
Recommended Citation
Kantor, K. & Morzy, M. (2024). Fine-Tuned Transformers and Large Language Models for Entity Recognition in Complex Eligibility Criteria for Clinical Trials. In B. Marcinkowski, A. Przybylek, A. Jarzębowicz, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Harnessing Opportunities: Reshaping ISD in the post-COVID-19 and Generative AI Era (ISD2024 Proceedings). Gdańsk, Poland: University of Gdańsk. ISBN: 978-83-972632-0-8. https://doi.org/10.62036/ISD.2024.53