Paper Number

ICIS2025-2599

Paper Type

Short

Abstract

The proliferation of misinformation challenges the integrity of digital information ecosystems, a core concern for Information Systems (IS). Automated fact-checking offers a potential solution, yet the reliability of Large Language Models (LLMs) for this task remains variable. This study investigates the interplay between LLM scale and retrieval-augmented generation (RAG), the practice of providing external evidence, for fact-checking accuracy. Across three tasks, each tied to a distinct research question, we benchmark eight LLM variants, including Google's Gemini and OpenAI's GPT-4 families, on the PolitiFact-based LIAR dataset (n = 1,267 claims), comparing performance with and without evidence augmentation. We show that top-tier LLMs (e.g., GPT-4.5) already outperform legacy fact-checking systems with no additional engineering, whereas smaller-scale models may become viable only when paired with a simple evidence-retrieval layer. This cost-performance map helps IS practitioners decide whether to pay for premium models or to invest in retrieval pipelines to achieve reliable, scalable misinformation control.
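
To illustrate the with/without-evidence comparison the abstract describes, the sketch below shows how a single LIAR-style claim might be rated zero-shot and then again with a retrieved evidence snippet. It is a minimal illustration only, assuming the OpenAI Python SDK; the model name, prompt wording, and evidence text are hypothetical placeholders, not the authors' actual pipeline.

```python
# Illustrative sketch of zero-shot vs. evidence-augmented fact-checking.
# Assumes the OpenAI Python SDK (pip install openai) and OPENAI_API_KEY set;
# model name, prompts, and evidence are placeholders, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()

# The six LIAR truthfulness labels.
LABELS = ["true", "mostly-true", "half-true", "barely-true", "false", "pants-fire"]

def fact_check(claim: str, evidence: str | None = None, model: str = "gpt-4o") -> str:
    """Rate a claim, optionally augmenting the prompt with retrieved evidence."""
    prompt = f"Claim: {claim}\n"
    if evidence:
        prompt += f"Evidence:\n{evidence}\n"
    prompt += f"Answer with exactly one label from: {', '.join(LABELS)}."
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a careful political fact-checker."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

claim = "The unemployment rate doubled last year."
print(fact_check(claim))  # zero-shot: claim only
print(fact_check(claim, evidence="Official statistics show unemployment fell from 4.1% to 3.7%."))
```

In a full pipeline, the evidence string would come from a retrieval step, for example a web search or a vector index over fact-checking articles; that retrieval step corresponds to the "simple evidence-retrieval layer" referred to in the abstract.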

Comments

09-Cybersecurity

Dec 14th, 12:00 AM

Benchmarking Evidence-Augmented Large Language Models for Misinformation Fact-Checking: An Information Systems Perspective on Scale and Augmentation
