Abstract
Student instructional evaluations (SIEs) remain the dominant mechanism for assessing teaching performance in higher education and frequently inform high-stakes decisions such as promotion, tenure, and contract renewal. However, extensive research demonstrates that SIEs capture not only instructional quality but also contextual factors such as course difficulty, grading expectations, and instructor characteristics. As universities increasingly experiment with AI-enabled governance tools, multimodal large language models (MLLMs) introduce the possibility of evaluating teaching directly from instructional artifacts, including lecture video, audio, and transcripts. This development raises an important Information Systems question: can AI function as a reliable, valid, and scalable evaluative infrastructure within academic decision-making systems? This study examines the reproducibility and cross-context validity of multimodal AI-based teaching evaluation in higher education. We analyze 118 lecture videos across eight university courses representing two contrasting pedagogical contexts: quantitative management science courses and traditional information systems courses. Using a multimodal model that processes transcripts and sampled video frames, we scored each video on five evaluation questions and compared the AI scores with the corresponding end-of-semester student instructional evaluations. Two principal findings emerged. First, AI scoring is highly reproducible yet systematically conservative. Across five evaluation dimensions, AI ratings are consistently lower than student evaluations (mean gap ≈ 0.5 points on a 5-point scale), with good-to-excellent reliability (ICC = 0.63–0.79). The consistency of this downward shift indicates that conservative scoring reflects a stable property of the model rather than stochastic variance. Second, we identify substantial discipline-level algorithmic bias.
Evaluation gaps are significantly larger in interactive, lab-based Information Systems courses than in structured Management Science courses (Cohen’s d = 1.08). A within-instructor comparison, in which the same instructor taught in both contexts, demonstrates that the magnitude of the evaluation gap shifts with pedagogical format rather than instructor identity. Mixed-effects modeling confirms that the discipline effect persists after accounting for instructor-level clustering. These findings suggest that current multimodal AI systems more effectively detect structured and visibly observable instructional behaviors than relational, iterative, and interaction-intensive pedagogies. Importantly, although AI assigns lower absolute scores, it converges with student evaluations on rank-order extremes, identifying the same highest- and lowest-performing instructors. This pattern indicates divergence in scale without complete misalignment in relative ordering. This research contributes to the Information Systems literature by conceptualizing AI evaluation systems as algorithmic governance mechanisms and empirically demonstrating how bias can emerge from pedagogical structure rather than demographic attributes. The findings raise critical design and policy considerations regarding calibration, construct validity, and the institutional implications of embedding AI-generated metrics into performance evaluation workflows. The TREO session will invite discussion on the appropriate role of AI in academic governance, the boundary conditions under which multimodal evaluation may be valid, and the safeguards necessary to prevent systematic penalization of interaction-intensive pedagogies.
Recommended Citation
Underwood, Alexis; Wisener, LaRue K.; and Pham, Hieu, "AI as Evaluator: Algorithmic Bias in Teaching Assessment" (2026). AMCIS 2026 TREOs. 18.
https://aisel.aisnet.org/treos_amcis2026/18