Paper Type
Complete
Abstract
Corporate communications teams face information overload, refresh pressure, and the risk of overreliance on LLM-generated briefings. We present the production deployment of an on-premises agentic LLM system for enterprise news intelligence that couples retrieval, clustering, and report generation with ensemble LLM-as-Judge quality governance. Over six months of operation at a conglomerate (Sep 2025–Feb 2026), Corporate Monitor ingested 36,910 unique articles and the Top News pipeline executed 126 runs (52,462 articles). Reports are scored on source faithfulness, factuality, informativeness, and coherence; grade B (≥75) defines the automation–oversight boundary. Sub-threshold outputs trigger up to three retries; outputs that still fall below the threshold are flagged for human review. Results show a 100% success rate for scheduled pipelines, over 99% reliability across user-initiated workflows, and an 85.3% pass rate across 2,592 quality-gated outputs. A pilot expert study (40 evaluations) shows directional agreement between judge scores and expert judgments and identifies corporate relevance as a missing scoring dimension. We present design principles for responsible agentic AI governance at scale.
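The quality gate described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function and class names, the 0–100 score scale per dimension, the mean-of-means aggregation across judges, and the reading of "up to three retries" as three regenerations after the first attempt are all assumptions made for illustration. Only the four dimensions, the threshold of 75, and the retry limit come from the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Dimensions, threshold, and retry limit are taken from the abstract.
DIMENSIONS = ("source_faithfulness", "factuality", "informativeness", "coherence")
PASS_THRESHOLD = 75.0  # grade B boundary
MAX_RETRIES = 3        # "up to three retries"

Scores = Dict[str, float]
Judge = Callable[[str], Scores]  # hypothetical judge interface: report -> scores


def ensemble_score(report: str, judges: List[Judge]) -> float:
    """Average each dimension across judges, then average the dimensions.

    The aggregation rule is an assumption; the abstract does not specify it.
    """
    per_dimension = [mean(judge(report)[d] for judge in judges) for d in DIMENSIONS]
    return mean(per_dimension)


@dataclass
class GateResult:
    report: str
    score: float
    attempts: int
    needs_review: bool


def quality_gate(generate: Callable[[], str], judges: List[Judge]) -> GateResult:
    """Generate a report, regenerate while it scores below threshold, then flag."""
    report = generate()
    score = ensemble_score(report, judges)
    attempts = 1
    # Retry only while the output is sub-threshold and retries remain.
    while score < PASS_THRESHOLD and attempts <= MAX_RETRIES:
        report = generate()
        score = ensemble_score(report, judges)
        attempts += 1
    # Anything still below threshold is routed to human review.
    return GateResult(report, score, attempts, needs_review=score < PASS_THRESHOLD)
```

In this sketch a report that never reaches the threshold is generated four times in total (one initial attempt plus three retries) and returned with `needs_review=True`, which corresponds to the automation–oversight boundary the abstract describes.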
Paper Number
1279
Recommended Citation
Lim, Kyongmook; Choi, Muryul; Han, Jeong Su; and Lee, ChiHoon, "Deploying Agentic LLM Pipelines at Scale: Quality-Gated Ensemble Governance for Enterprise News Intelligence" (2026). AMCIS 2026 Proceedings. 3.
https://aisel.aisnet.org/amcis2026/ai_systdesign/ai_systdesign/3
Deploying Agentic LLM Pipelines at Scale: Quality-Gated Ensemble Governance for Enterprise News Intelligence
Comments
AI SYSTEM