Business & Information Systems Engineering

Document Type

Research Paper


Existing methodologies for identifying data quality problems are typically user-centric, where data quality requirements are first determined in a top-down manner following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they may not have any ownership of, but that are perceived to have relevance and potential to create value for them. Such repurposed datasets can be found in government open data portals, data markets and several publicly available data repositories. In such scenarios, applying top-down data quality checking approaches is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers – data scientists and analysts – need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets to facilitate well-informed decisions on their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed through a Design Science approach on the basis of semiotics theory and data quality dimensions. LANG is empirically validated in terms of soundness of the approach, its repeatability and generalizability.