Abstract
Generative AI, especially large language models (LLMs), is revolutionizing software engineering, but the variable accuracy of these models poses risks. Existing benchmarks fail to capture the nuances of the software development lifecycle. This paper introduces a process-aware evaluation platform to address these challenges. The paper presents: (1) a manually annotated and balanced subset of the CodeAlpaca-20k dataset, labeled with software development lifecycle processes; (2) a comprehensive evaluation architecture for reproducible and interpretable LLM comparisons; and (3) a comparative study of three LLMs (Gemma3:1b, Gemma3:4b, and Qwen3:4b) using diverse automatic metrics. The results provide critical insights into LLM performance across software engineering processes, enabling more systematic and reliable evaluation.
Recommended Citation
Czyżewski, Adam; Poniszewska-Marańda, Aneta; and Czyżewski, Piotr, "Process-aware evaluation of generative AI in software engineering: a comprehensive platform and comparative study" (2025). Proceedings of the 2025 Pre-ICIS SIGDSA Symposium. 82.
https://aisel.aisnet.org/sigdsa2025/82