Abstract

Short-form video platforms are key venues for healthcare support, yet their multimodal form, combining visual, auditory, and textual channels, complicates scalable analysis. We present a framework grounded in Social Support Theory, implemented with a Masked Ordinal Expectation-Maximization algorithm that integrates modality-specific annotations with expert guidance. A 2×2 design varying architecture (orchestrated vs. holistic) and prompt structure (focused vs. combined) reveals clear trade-offs. The orchestrated framework paired with focused prompts yields the highest accuracy in classifying support types, whereas the holistic approach with combined prompts better captures relational patterns. This trade-off appears only in high-capability models; simpler models lack the capacity to benefit from orchestration. These findings provide methodological guidance for multimodal analysis and practical insights for building systems that more effectively detect and recommend supportive health content online.
