ETL — High Quality Data Pipelines
As you are probably aware, "data pipeline" is a rather broad term. At its core, it is a collection of tasks that move, transform, or serve data. A simple example is loading a .txt file into a table or another file; at the other extreme, it can be as complex as real-time aggregations with machine learning scoring on top.
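To make the simple end of that spectrum concrete, here is a minimal sketch of the ".txt file into a table" case. The file name, delimiter, table, and columns are all hypothetical placeholders, not part of any specific project:

```python
import csv
import sqlite3

def load_txt_to_table(path: str = "events.txt", db: str = "warehouse.db") -> None:
    """Load a delimited .txt file into a SQLite table (illustrative only)."""
    conn = sqlite3.connect(db)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (event_id TEXT, event_ts TEXT, payload TEXT)"
    )
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter="|"))  # assumes pipe-delimited lines with 3 fields
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load_txt_to_table()
```

Even this trivial pipeline already raises the questions the rest of this article is about: what happens on reload, how bad rows are handled, and how you know the load actually worked.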
The data tools ecosystem has grown significantly over the 16 years I have worked with it. Yet I still regularly find myself in situations where the cost of data infrastructure and tooling is prohibitively high:
- Projects are complicated and time-consuming.
- Even resolving fundamental data quality issues is difficult.
- Maintaining confidence in data assets is extremely hard.
- In many organizations, decision makers' data literacy is inadequate.
- Stakeholders expect miracles, or simply expect things to work.
- Very few people can describe the characteristics, calculations, metrics, insights, and their implications.
- Everyone makes empty promises and postpones the inevitable reality of confronting and resolving hard problems.
When building a data pipeline that brings data from heterogeneous sources into your landing areas, some very common concerns come to mind, such as restartability…
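As one way to picture restartability, here is a hedged sketch of a checkpoint-based approach: the pipeline records which batches have completed, so a rerun after a failure skips work that already succeeded. The checkpoint file, batch identifiers, and `process` step are hypothetical, not a reference to any particular tool:

```python
import json
import os

CHECKPOINT = "pipeline_checkpoint.json"  # hypothetical state file

def load_checkpoint() -> set:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def save_checkpoint(done: set) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def process(batch: str) -> None:
    print(f"processing {batch}")  # placeholder for the actual extract/transform/load step

def run_pipeline(batches: list) -> None:
    done = load_checkpoint()
    for batch in batches:
        if batch in done:
            continue  # already processed on a previous run; skipped on restart
        process(batch)
        done.add(batch)
        save_checkpoint(done)  # persist progress so a failure can resume from here

if __name__ == "__main__":
    run_pipeline(["2024-01-01", "2024-01-02", "2024-01-03"])
```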