Abstract
An algebraic formalization of distributed data processing is considered under very general assumptions. The concept of an information space is defined for a given data processing procedure, the existence of the smallest information space is proved, and its structure is investigated. It is shown that the binary operation of “information addition” and an ordering that reflects the concept of “information quality” are naturally expressed in terms of the information space. The use of the smallest information space makes it possible to parallelize the process of information accumulation most effectively within the framework of the MapReduce distributed data analysis model, and to organize efficient processing without transferring or accumulating the original data. In the context of this model, Map “extracts information” from the source datasets, transforming them into elements of the information space, and Reduce combines all these pieces of partial information into one element, which represents the information contained in all the original data.
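The Map and Reduce roles described above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes the processing procedure is the estimation of a mean, so that a pair (count, total) can serve as an element of the information space. The names `Info`, `extract`, and `combine` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable


@dataclass(frozen=True)
class Info:
    """Hypothetical information element: sufficient statistics for a mean."""
    count: int = 0
    total: float = 0.0

    def __add__(self, other: "Info") -> "Info":
        # "Information addition": associative and commutative,
        # with Info() acting as the neutral element (no data seen).
        return Info(self.count + other.count, self.total + other.total)

    def mean(self) -> float:
        # Final result computed from accumulated information only.
        return self.total / self.count


def extract(chunk: Iterable[float]) -> Info:
    """Map step: turn a raw dataset into an element of the information space."""
    values = list(chunk)
    return Info(len(values), float(sum(values)))


def combine(parts: Iterable[Info]) -> Info:
    """Reduce step: merge pieces of partial information into one element."""
    return reduce(lambda a, b: a + b, parts, Info())


# Three datasets that could live on different nodes (illustrative values).
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

# Distributed route: only small Info elements are combined.
distributed = combine(extract(c) for c in chunks)

# Centralized route: all raw data gathered in one place first.
centralized = extract(x for c in chunks for x in c)

print(distributed == centralized, distributed.mean())
```

In this sketch only the small `Info` elements would need to travel between nodes, and the combined element matches what processing the pooled raw data would give.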
Recommended Citation
Golubtsov, Peter, "Towards Theoretical Foundations of Efficient Distributed Big Data Processing: Information Spaces for MapReduce Model" (2022). International Conference on Information Systems 2022 Special Interest Group on Big Data Proceedings. 6.
https://aisel.aisnet.org/sigbd2022/6