Paper Type

Complete

Abstract

Corporate communications teams face information overload, refresh pressure, and the risk of overreliance on LLM-generated briefings. We present the production deployment of an on-premises agentic LLM system for enterprise news intelligence that couples retrieval, clustering, and report generation with ensemble LLM-as-Judge quality governance. Over six months of operation at a conglomerate (Sep 2025–Feb 2026), Corporate Monitor ingested 36,910 unique articles and the Top News pipeline executed 126 runs (52,462 articles). Reports are scored on source faithfulness, factuality, informativeness, and coherence; grade B (≥75) defines the automation–oversight boundary. Sub-threshold outputs trigger up to three retries; outputs that still fall below the threshold are flagged for human review. Results show 100% success for scheduled pipelines, 99%+ reliability across user-initiated workflows, and an 85.3% pass rate across 2,592 quality-gated outputs. A pilot expert study (40 evaluations) shows directional agreement between the ensemble scores and expert judgments and flags corporate relevance as a missing dimension. We present design principles for responsible agentic AI governance at scale.
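The quality gate described in the abstract can be illustrated with a minimal sketch. The threshold (75), the four scoring dimensions, and the three-retry limit come from the abstract. Everything else is an assumption made for illustration: the abstract does not say how judge scores are combined, so the unweighted mean across judges and dimensions, the 0–100 scale, the reading of "up to three retries" as one initial attempt plus three regenerations, and the names `ensemble_score`, `quality_gate`, `generate`, and the status strings are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Dimensions and threshold as stated in the abstract.
DIMENSIONS = ("source_faithfulness", "factuality", "informativeness", "coherence")
PASS_THRESHOLD = 75.0  # grade B boundary
MAX_RETRIES = 3        # assumed to mean retries after the initial attempt

# Hypothetical judge interface: maps a report to a 0-100 score per dimension.
Judge = Callable[[str], Dict[str, float]]


def ensemble_score(report: str, judges: List[Judge]) -> float:
    """Unweighted mean over judges of each judge's mean dimension score.

    The aggregation rule is an assumption; the paper may weight
    dimensions or judges differently.
    """
    per_judge = [mean(judge(report)[d] for d in DIMENSIONS) for judge in judges]
    return mean(per_judge)


def quality_gate(
    generate: Callable[[int], str], judges: List[Judge]
) -> Tuple[str, float, str]:
    """Generate a report, score it, and retry while it is below threshold.

    `generate` receives the attempt index (0 for the initial attempt).
    Returns the last report, its score, and a status string.
    """
    report, score = "", 0.0
    for attempt in range(1 + MAX_RETRIES):
        report = generate(attempt)
        score = ensemble_score(report, judges)
        if score >= PASS_THRESHOLD:
            return report, score, "published"
    # All attempts fell below the threshold: hand off to a human reviewer.
    return report, score, "flagged_for_review"
```

Under these assumptions, a report is released automatically only when the ensemble score reaches the grade B boundary; otherwise the last attempt is kept and routed to review rather than discarded.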

Paper Number

1279

Comments

AI SYSTEM

Aug 15th, 12:00 AM

Deploying Agentic LLM Pipelines at Scale: Quality-Gated Ensemble Governance for Enterprise News Intelligence
