Abstract

Data warehouses at large corporations keep growing in size. Many companies have adopted distributed data warehouse systems, which store data across many machines. Every day, millions of ETL jobs load data into these warehouses, but some jobs fail for lack of resources and must be restarted. Predicting the resource demands of ETL jobs in distributed data warehouse systems is therefore crucial for using resources efficiently and for improving the execution performance of ETL pipeline tasks. Resource-demand prediction for ETL data pipelines has not yet been discussed in the literature. This paper presents a method for predicting resource demands from historical usage. A linear regression function, y = kx + b, is used to predict memory and disk usage, thereby improving the accuracy of resource usage and the execution performance of ETL pipeline tasks.
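As a rough illustration of how a line y = kx + b can be fitted to historical observations, the following minimal sketch uses ordinary least squares. The abstract does not say what the predictor x is or how the paper estimates k and b, so the choice of input size as x, the function names `fit_line` and `predict`, and the sample numbers are all assumptions made for this example only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = k*x + b to paired observations by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of squared deviations of x; zero means all x values are identical.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Sum of cross-deviations of x and y.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    k = sxy / sxx
    b = y_bar - k * x_bar
    return k, b


def predict(k, b, x):
    """Evaluate the fitted line at x."""
    return k * x + b


# Hypothetical history for one ETL job: input size (GB) and peak memory (GB).
# These values are invented for illustration and are not from the paper.
input_gb = [1.0, 2.0, 3.0, 4.0]
peak_memory_gb = [3.0, 5.0, 7.0, 9.0]

k, b = fit_line(input_gb, peak_memory_gb)
estimated_memory_gb = predict(k, b, 5.0)  # estimate for a 5 GB input
```

The same fit could be repeated with disk usage as y to obtain a separate line for each resource.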
