Abstract

When organizations delegate authority to autonomous AI agents, two assumptions hold the arrangement together: that principals can observe what the agent is doing, and that the agent will flag when something is wrong. This paper argues that both assumptions fail, and fail together, closing every meaningful error-correction pathway in the delegation chain. The first failure is opacity. Deep neural networks are structurally unobservable, making the monitoring contracts at the heart of Principal-Agent Theory (PAT) unworkable (Liu et al., 2024; Hadfield-Menell & Hadfield, 2019). PAT treats information asymmetry as a gap that contracts can narrow. For AI, that gap is a structural property of the model: the principal cannot see in. The second failure, sycophancy, follows directly. Agents trained with reinforcement learning from human feedback (RLHF) exhibit what Shapira et al. (2026) formalize as unconditional convergence: a reward structure that systematically favors user agreement over accuracy (Sharma et al., 2023). PAT treats divergence between principal and agent as the central risk. RLHF produces the inverse: an agent so compliant that it can no longer say "I think you are wrong." With both safeguards gone, automation bias seals the loop: principals accept AI outputs without genuine evaluation, a pattern that awareness alone cannot correct (Parasuraman & Manzey, 2010). Flawed instructions execute unchallenged and return as accepted outputs, and the error compounds invisibly. This causal chain (opacity disabling top-down monitoring, sycophancy disabling bottom-up correction, automation bias disabling human detection) is a governance failure that PAT was never designed to address (Eisenhardt, 1989). This paper introduces the dual-failure framing as a foundation for an IS governance research agenda, shifting the design burden from model developers to the organizations that structure how these agents operate.
