Paper Type

Complete

Paper Number

PACIS2026-2106

Description

This paper presents an embedding-driven, reproducible pipeline to identify research gaps in systematic literature reviews (SLRs). We combine semantic keyword clustering with quantitative gap-detection criteria to semi-automate a formerly manual process, improving scalability and reproducibility. The pipeline queries Scopus, ranks publications by semantic similarity to a user problem description (using nomic-embed-text-v1.5), extracts keywords from the top-ranked items, clusters them with k-means, labels the clusters with GPT-4, and visualizes the results via t-SNE. A keyword is flagged as a gap when it is (a) low-occurrence, (b) semantically peripheral (distance > cluster-specific 95th percentile), (c) present in highly cited works, and (d) not identified as a method. In a gamification-in-marketing case study (14,302 records), we identified eight candidate gaps. Compared to VOSviewer co-occurrence maps, our approach shows superior semantic coherence and is less affected by keyword variation. Validation across five LLMs shows the method's effectiveness: nearly twice as many gap classifications and a 24–30% increase in novelty scores.
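
The four gap criteria (a) to (d) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "distance" means Euclidean distance from a keyword's embedding to the centroid of its cluster. The occurrence and citation thresholds (`max_occurrences`, `min_citations`), the `Keyword` record, and all function names are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Keyword:
    term: str
    vector: Sequence[float]  # keyword embedding (assumed to be precomputed)
    cluster: int             # cluster id, e.g. from k-means
    occurrences: int         # how often the keyword appears in the corpus
    max_citations: int       # citations of the most-cited paper using it
    is_method: bool          # True if the keyword names a method


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def flag_gaps(keywords, max_occurrences=2, min_citations=100):
    """Return terms meeting all four criteria (thresholds are assumptions)."""
    by_cluster = {}
    for kw in keywords:
        by_cluster.setdefault(kw.cluster, []).append(kw)

    gaps = []
    for members in by_cluster.values():
        center = centroid([kw.vector for kw in members])
        dists = [math.dist(kw.vector, center) for kw in members]
        cutoff = percentile(dists, 95)  # cluster-specific 95th percentile
        for kw, d in zip(members, dists):
            if (kw.occurrences <= max_occurrences        # (a) low-occurrence
                    and d > cutoff                       # (b) peripheral
                    and kw.max_citations >= min_citations  # (c) highly cited
                    and not kw.is_method):               # (d) not a method
                gaps.append(kw.term)
    return gaps
```

Because the cutoff is computed separately for each cluster, a keyword is judged peripheral relative to its own cluster's spread and not against a single global distance threshold.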

Comments

15-Method

Jul 5th, 12:00 AM

Automated Identification of Research Gaps Using Keyword Clustering and an Embedding Model
