Paper Number
ECIS2026-2002
Paper Type
CRP
Abstract
Active Learning (AL) tools are increasingly used to assist the title and abstract screening phase of systematic literature reviews, with the promise of reducing workload while maintaining comprehensiveness. However, empirical evidence regarding their effectiveness relative to validated human screening decisions remains limited. This study conducts an exploratory benchmarking evaluation of AI-assisted screening by comparing two widely used AI-enabled tools, ASReview and Covidence, against a validated human consensus dataset derived from a published meta-synthesis. Performance was assessed using recall, screening duration, the number of full-text review candidates (FTRC), and workload reduction. Manual screening of 5,808 references required 1,018 minutes and resulted in 92 FTRC and a final sample of 28 studies. Using the same dataset, ASReview achieved full recall relative to the manual baseline while screening 746 references in 92 minutes, corresponding to an 87.2% workload reduction. Covidence screened 505 references in 53 minutes but retrieved only 5 of the 28 relevant studies, resulting in a recall rate of 17.8%. The findings highlight substantial variation in recall outcomes across AL implementations and underscore the importance of benchmarking AI-assisted screening against validated human consensus standards. More broadly, this study identifies methodological considerations, such as calibration procedures, stopping rules, and transparent reporting, that are critical for responsibly integrating AL tools into systematic review workloads.
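The headline figures in the abstract can be reproduced with a short calculation. The sketch below is illustrative only: it assumes recall is the share of the 28 final studies that a tool retrieved, and that workload reduction is the share of the 5,808 references that did not need to be screened. These definitions are assumptions that happen to match the reported 87.2% figure; the abstract itself does not spell out the formulas, and the function names are hypothetical.

```python
# Minimal sketch reproducing the abstract's metrics.
# Assumed definitions (not stated explicitly in the abstract):
#   recall             = relevant studies retrieved / relevant studies in baseline
#   workload reduction = 1 - references screened / total references

TOTAL_REFERENCES = 5808   # references in the manual baseline
RELEVANT_STUDIES = 28     # final sample from manual screening


def recall(found: int, relevant: int) -> float:
    """Fraction of the relevant studies that were retrieved."""
    return found / relevant


def workload_reduction(screened: int, total: int) -> float:
    """Fraction of references that did not have to be screened."""
    return 1 - screened / total


# ASReview: 746 references screened; "full recall" read here as 28 of 28.
asreview_recall = recall(RELEVANT_STUDIES, RELEVANT_STUDIES)
asreview_wr = workload_reduction(746, TOTAL_REFERENCES)

# Covidence: 5 of the 28 relevant studies retrieved.
covidence_recall = recall(5, RELEVANT_STUDIES)

print(f"ASReview recall: {asreview_recall:.1%}")
print(f"ASReview workload reduction: {asreview_wr:.1%}")
print(f"Covidence recall: {covidence_recall:.1%}")
```

Under these assumptions, the ASReview workload reduction comes out at 87.2%, as reported. The Covidence recall evaluates to 5/28 ≈ 0.1786, which rounds to 17.9%; the reported 17.8% is consistent with truncating rather than rounding.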
Recommended Citation
Ringeval, Mickael; Paré, Guy; Vial, Gregory; and Motulsky, Aude, "Active Learning In Systematic Literature Screening: Benchmarking AI-Assisted Screening Against Human Consensus" (2026). ECIS 2026 Proceedings. 11.
https://aisel.aisnet.org/ecis2026/litrev/litrev/11
Active Learning In Systematic Literature Screening: Benchmarking AI-Assisted Screening Against Human Consensus