The Challenges of Apache Airflow in a #TrueDataOps World
Data engineering, with its need to build optimized, orchestrated data pipelines that extract data from multiple, disparate sources and load it into a centralized data platform, has risen to prominence. Before the arrival of automated workflow orchestration tools (such as Airflow), data pipeline functionality was either hand-coded and manually implemented, or run through batches of lengthy cron jobs and repetitive custom API calls. This overwhelmingly manual approach to data pipeline management eroded the quality of the resulting data insights.
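To make the contrast concrete, the sketch below shows a minimal, illustrative Airflow DAG (the pipeline name, schedule, and task bodies are hypothetical): instead of a cron entry invoking an opaque script, the tasks and their dependencies are declared explicitly, and the scheduler handles running, retrying, and tracking them.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system (e.g. an API or a database).
    print("extracting source data")


def load():
    # Placeholder: load the extracted data into the centralized data platform.
    print("loading into the data platform")


with DAG(
    dag_id="example_extract_load",      # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",         # what a cron entry used to express
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # declared dependency: extract runs before load
```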
There is now a need to apply some (or all) of the DevOps principles, battle-hardened in the software development industry, to this world of data, ensuring that Agile and Lean principles are embedded in the data pipeline creation, testing, deployment, monitoring, and maintenance lifecycles. This white paper delves into the #TrueDataOps philosophy and explains why Apache Airflow, the forerunner and originator of automated workflow monitoring and management tools, was never an ideal solution for orchestrating data pipeline workflows.