Paper Type
Complete
Paper Number
PACIS2026-2106
Description
This paper presents an embedding-driven, reproducible pipeline to identify research gaps in systematic literature reviews (SLRs). We combine semantic keyword clustering with quantitative gap-detection criteria to semi-automate a formerly manual process, improving scalability and reproducibility. The pipeline queries Scopus, ranks publications by semantic similarity to a user problem description (nomic-embed-text-v1.5), extracts keywords from the top-ranked items, clusters them with k-means, labels clusters with GPT-4, and visualizes results via t-SNE. A keyword is flagged as a gap when it is (a) low-occurrence, (b) semantically peripheral (distance > cluster-specific 95th percentile), (c) present in highly cited works, and (d) not identified as a method. In a gamification-in-marketing case study (14,302 records), we identified eight candidate gaps. Compared to VOSviewer co-occurrence maps, our approach shows superior semantic coherence and is less affected by keyword variation. Validation across five LLMs indicates the method's effectiveness: nearly twice as many gap classifications and a 24–30% increase in novelty scores.
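The four gap criteria (a) to (d) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `flag_gap_candidates`, the dictionary keys, the Euclidean distance to the cluster centroid, and the thresholds `max_count` and `min_citations` are all assumptions made for illustration, since the abstract gives only the 95th-percentile rule explicitly.

```python
import math

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def flag_gap_candidates(keywords, max_count=2, min_citations=50, pct=95.0):
    """Return the terms that satisfy all four criteria.

    Each keyword is a dict with the hypothetical keys
    'term', 'vector', 'cluster', 'count', 'citations', 'is_method'.
    max_count and min_citations are assumed thresholds, not values from the paper.
    """
    # Group keywords by their (already assigned) cluster label.
    clusters = {}
    for kw in keywords:
        clusters.setdefault(kw["cluster"], []).append(kw)

    flagged = []
    for members in clusters.values():
        dim = len(members[0]["vector"])
        centroid = [sum(m["vector"][d] for m in members) / len(members)
                    for d in range(dim)]
        # Assumption: "distance" means Euclidean distance to the centroid.
        dists = [math.dist(m["vector"], centroid) for m in members]
        cutoff = percentile(dists, pct)  # cluster-specific threshold
        for m, dist in zip(members, dists):
            if (m["count"] <= max_count                 # (a) low occurrence
                    and dist > cutoff                   # (b) semantically peripheral
                    and m["citations"] >= min_citations # (c) in highly cited works
                    and not m["is_method"]):            # (d) not a method
                flagged.append(m["term"])
    return flagged
```

Because the cutoff is computed per cluster, a keyword counts as peripheral only relative to its own cluster's spread, so tight and diffuse clusters are treated on their own scale.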
Recommended Citation
Frankowski, Pawel Karol; Wiśniewska, Joanna; and Matysik, Sebastian, "Automated Identification of Research Gaps Using Keyword Clustering and an Embedding Model" (2026). PACIS 2026 Proceedings. 7.
https://aisel.aisnet.org/pacis2026/adv_theory/adv_theory/7
Automated Identification of Research Gaps Using Keyword Clustering and an Embedding Model
Comments
15-Method