Abstract

The implementation of artificial intelligence (AI) in the public sector offers great potential. Repetitive and labor-intensive tasks can be automated to improve overall efficiency. Generative AI, in particular, opens up new possibilities for structuring and integrating heterogeneous data sources. At the same time, AI introduces challenges such as technical complexity and ethical issues that must be addressed during development and implementation. This paper investigates the potential and challenges of using AI in the extract, transform, load (ETL) process in a public sector study. Our findings demonstrate that open-source large language models (LLMs) can efficiently transfer over 5,000 unstructured documents into the structured format of a relational database, achieving a success rate of approximately 96%. The quality of the results was significantly improved through optimization measures, particularly in terms of prompt engineering and post-processing. While the results are encouraging, challenges remain, including processing extensive documents and adapting the data model to greater complexity.

Recommended Citation

Bongertmann, H., Nast, B., Griesch, L., Rotzoll, H. & Sandkuhl, K. (2025). Large Language Models for Structuring and Integration of Heterogeneous Data. In I. Luković, S. Bjeladinović, B. Delibašić, D. Barać, N. Iivari, E. Insfran, M. Lang, H. Linger, & C. Schneider (Eds.), Empowering the Interdisciplinary Role of ISD in Addressing Contemporary Issues in Digital Transformation: How Data Science and Generative AI Contributes to ISD (ISD2025 Proceedings). Belgrade, Serbia: University of Gdańsk, Department of Business Informatics & University of Belgrade, Faculty of Organizational Sciences. ISBN: 978-83-972632-1-5. https://doi.org/10.62036/ISD.2025.66

Paper Type

Full Paper

DOI

10.62036/ISD.2025.66

