Abstract

Large Language Models (LLMs) promise to transform supply chain management (SCM) through improved forecasting and automated decision support (Aggarwal & Davè, 2018). However, generic benchmarks (Maslej et al., 2025) reveal little about domain-specific performance, leaving open whether supply chain managers can trust LLM-derived proposals or whether the field is building on unverified assumptions. We argue that the field needs a dedicated, domain-specific benchmarking approach that accounts for the operational realities of supply chain tasks. In preliminary work, we ran repeated forecasting trials with agentic LLM orchestration on self-hosted infrastructure, using CrewAI and Retrieval-Augmented Generation (RAG; Lewis et al., 2020). In these trials, LLM-generated forecasts did not outperform traditional algorithmic approaches. This result indicates that a gap between LLM potential and domain-specific performance exists and warrants systematic, rigorous investigation. We invite the Information Systems (IS) community to shape a research agenda for LLM evaluation in SCM that focuses on dimensions such as accuracy, consistency, contextual fit, and cost efficiency. Configuration choices, including temperature settings, prompt design, and model architecture, play a significant role in operational outcomes and deserve explicit attention. From a socio-technical perspective (Bostrom & Heinen, 1977), this agenda includes examining how firms, especially small and medium-sized enterprises (SMEs), can build the evaluation capabilities needed for responsible AI adoption in line with data sovereignty requirements.
