Paper Type

Complete

Paper Number

PACIS2026-2135

Description

Analyzing media representation requires scalable data extraction, but proprietary LLMs pose data governance and cost risks, limiting adoption in journalism. We address this by developing and evaluating a locally deployable, hybrid LLM pipeline using small, open-source models to extract attributes such as occupations, quotes, and demographics. We test this architecture on 200 human-annotated German news articles from a 100,000-article corpus, spanning diverse genres such as interviews, reports, and reviews. We compare its performance against monolithic, single-prompt LLM baselines such as GPT-4o. Results show the hybrid pipeline significantly outperforms all baselines, solving the critical recall deficits of single-prompt methods. It remains robust to genre variations while baseline performance degrades on narrative texts. This research demonstrates that strategic task decomposition within a local LLM pipeline yields superior extraction performance, establishing a highly accurate and governable alternative to commercial LLMs.
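As a rough illustration of what "task decomposition" can mean here, the sketch below contrasts a single-prompt call with a two-stage flow that first lists the persons and then queries each attribute separately. The abstract does not specify the pipeline's stages, prompts, or models, so everything in this sketch is an assumption made for illustration: the `LLM` callable type, the `Person` container, the prompt wording, and the `fake_llm` stub are all hypothetical, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to model text.
# In practice this would wrap a locally hosted open-source model.
LLM = Callable[[str], str]

# Attribute names taken from the abstract; the real schema is not given.
ATTRIBUTES = ["occupation", "quotes", "demographics"]


@dataclass
class Person:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def extract_monolithic(article: str, llm: LLM) -> str:
    """Single-prompt baseline: one call asks for everything at once."""
    prompt = (
        "List every person in the article with "
        + ", ".join(ATTRIBUTES)
        + ".\n\n"
        + article
    )
    return llm(prompt)


def extract_decomposed(article: str, llm: LLM) -> List[Person]:
    """Decomposed flow: find persons first, then ask one question per attribute."""
    names_raw = llm("List each person named in the article, one per line.\n\n" + article)
    names = [line.strip() for line in names_raw.splitlines() if line.strip()]

    people: List[Person] = []
    for name in names:
        person = Person(name)
        for attr in ATTRIBUTES:
            answer = llm(
                f"Article:\n{article}\n\nState the {attr} of {name}, or 'unknown'."
            )
            person.attributes[attr] = answer.strip()
        people.append(person)
    return people


def fake_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs without any model installed."""
    if prompt.startswith("List each person"):
        return "Anna Schmidt\nJonas Weber\n"
    return "unknown"


people = extract_decomposed("Beispieltext", fake_llm)
```

Under these assumptions, each narrow prompt asks the model for one thing only. That is one plausible reason a decomposed flow could miss fewer persons than a single broad prompt, although the abstract itself does not state the mechanism.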

Comments

01-AIML

Jul 5th, 12:00 AM

From News to Data: A Hybrid LLM Pipeline for Robust Person-Centric Information Extraction
