Data Engineering — Construct a Standard Data Pipeline Design Pattern
One common challenge encountered by developers and data professionals in data processing and analysis is whether to import data directly into a database or store it as .csv files before loading. This decision matters because it affects data continuity, error logging, and data recovery in the event of a system failure.
In contrast to the abundance of examples for code design patterns and data modeling techniques, there are few, if any, for data flow design patterns. In my experience building pipelines, using the appropriate data flow patterns increases feature delivery speed, reduces toil during pipeline failures, and builds trust with stakeholders.
The standard design pattern is to bring data into a blob storage container called raw, standardize it into Parquet or another columnar format, and then load it into a database system, or, if you are using Spark with Delta Lake, build a Delta table whose location points at the storage path.
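Here is a minimal sketch of that flow in PySpark, assuming a Spark session with Delta Lake already configured. The storage paths, account name, and the analytics.orders table are hypothetical placeholders, not part of any specific platform.

```python
from pyspark.sql import SparkSession

# Minimal sketch of the raw -> standardized -> served flow described above.
# The container paths, account name, and table name are hypothetical.
spark = SparkSession.builder.appName("standard-pipeline").getOrCreate()

RAW_PATH = "abfss://raw@account.dfs.core.windows.net/orders/2024-06-01/"
STANDARDIZED_PATH = "abfss://standardized@account.dfs.core.windows.net/orders/"

# 1. Land the files untouched in the raw container (done upstream),
#    then read them back exactly as they arrived.
raw_df = spark.read.option("header", "true").csv(RAW_PATH)

# 2. Standardize: persist the same records in a columnar format.
#    Delta Lake stores Parquet files plus a transaction log.
raw_df.write.format("delta").mode("append").save(STANDARDIZED_PATH)

# 3. Serve: register a table whose LOCATION points at the storage path,
#    so queries read the files in place instead of copying them.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS analytics.orders
    USING DELTA
    LOCATION '{STANDARDIZED_PATH}'
""")
```

Keeping the untouched raw copy separate from the standardized copy is what makes replays cheap: if a transformation fails or a bug is found, you rebuild the Delta table from raw without going back to the source system.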
Blob storage is often preferable because it keeps things flexible and makes debugging easier, particularly if you can compress the data, which can easily save 90%+ on egress, ingress, and storage costs. However, if the data coming from the API is small enough, the choice doesn't matter much.
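As a rough illustration of the compression point, the snippet below gzips a hypothetical JSON API payload before it would be uploaded to the raw container; repetitive, text-heavy data of this kind often shrinks by an order of magnitude.

```python
import gzip
import json

# Hypothetical API payload: repetitive JSON records, the typical case
# for data pulled from a REST endpoint.
records = [{"id": i, "status": "shipped", "country": "DE"} for i in range(100_000)]
payload = json.dumps(records).encode("utf-8")

compressed = gzip.compress(payload)

print(f"uncompressed: {len(payload) / 1e6:.1f} MB")
print(f"gzip:         {len(compressed) / 1e6:.1f} MB")
# The compressed bytes are what you would upload to blob storage,
# cutting transfer and storage costs roughly in proportion to the ratio.
```

The exact ratio depends on the data, but structured, text-heavy payloads compress very well, which is where the 90%+ savings figure comes from.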