Abstract
Generative AI, especially large language models (LLMs), is revolutionizing software engineering, but the variable accuracy of these models poses risks. Existing benchmarks fail to capture the nuances of the software development lifecycle. This paper introduces a process-aware evaluation platform to address these challenges. The paper presents: (1) a manually annotated and balanced subset of the CodeAlpaca-20k dataset, labeled with software development lifecycle processes; (2) a comprehensive evaluation architecture for reproducible and interpretable LLM comparisons; and (3) a comparative study of three LLMs (Gemma3:1b, Gemma3:4b, and Qwen3:4b) using diverse automatic metrics. The results provide critical insights into LLM performance across software engineering processes, enabling more systematic and reliable evaluation.
Recommended Citation
Czyżewski, Adam; Poniszewska-Marańda, Aneta; and Czyżewski, Piotr, "Process-aware evaluation of generative AI in software engineering: a comprehensive platform and comparative study" (2025). Proceedings of the 2025 Pre-ICIS SIGDSA Symposium. 82.
https://aisel.aisnet.org/sigdsa2025/82