Abstract

Generative AI, especially large language models (LLMs), is revolutionizing software engineering, but the variable accuracy of these models poses risks. Existing benchmarks fail to capture the nuances of the software development lifecycle. This paper introduces a process-aware evaluation platform to address these challenges. The paper presents: (1) a manually annotated and balanced subset of the CodeAlpaca-20k dataset, labeled with software development lifecycle processes; (2) a comprehensive evaluation architecture for reproducible and interpretable LLM comparisons; and (3) a comparative study of three LLMs (Gemma3:1b, Gemma3:4b, and Qwen3:4b) using diverse automatic metrics. The results provide critical insights into LLM performance across software engineering processes, enabling more systematic and reliable evaluation.