Data Engineering — Construct a Standard Data Pipeline Design Pattern

Ryan Arjun
6 min read · Nov 3, 2023

One common challenge encountered by developers and data professionals in data processing and analysis is whether to import data directly into a database or to store it as .csv files before loading. This decision is crucial because it affects data continuity, error logging, and data recovery in the event of a system failure.

The standard design pattern is to land the data in a blob storage zone called raw, standardise it into Parquet or another columnar format, and then load it into a database system; or, if you are using Spark with Delta Lake, build a Delta table whose location points at the storage.
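A minimal PySpark sketch of that flow, assuming Delta Lake is available on the cluster; the mount points, paths, and table names are purely illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw-to-delta")
    .getOrCreate()
)

# 1. Land the API payload as-is in the "raw" zone (blob storage).
raw_df = spark.read.json("/mnt/raw/orders/2023-11-03/")

# 2. Standardise it into a columnar format in the "standardised" zone.
(raw_df
    .write
    .mode("overwrite")
    .parquet("/mnt/standardised/orders/"))

# 3. Build a Delta table whose location points back at storage,
#    so the lakehouse/warehouse layer reads the data in place.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
(spark.read.parquet("/mnt/standardised/orders/")
    .write
    .format("delta")
    .mode("overwrite")
    .option("path", "/mnt/curated/orders/")
    .saveAsTable("curated.orders"))
```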

Blob storage is often preferable since it keeps things flexible and makes debugging easier, particularly if you can compress the data, which can easily save 90+% on egress, ingress, and storage costs. However, if the data coming from the API is small enough, it doesn't really matter.
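A quick way to see the effect is to write the same DataFrame out uncompressed and compressed and compare the on-disk footprint. This sketch continues from the example above and uses local temp paths just for illustration:

```python
import os

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in MB."""
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return total / (1024 * 1024)

df = spark.read.json("/mnt/raw/orders/2023-11-03/")

# Same data, three footprints: plain CSV, gzip-compressed CSV, snappy Parquet.
df.write.mode("overwrite").csv("/tmp/orders_csv")
df.write.mode("overwrite").option("compression", "gzip").csv("/tmp/orders_csv_gz")
df.write.mode("overwrite").parquet("/tmp/orders_parquet")  # snappy by default

for label, path in [("csv", "/tmp/orders_csv"),
                    ("csv.gz", "/tmp/orders_csv_gz"),
                    ("parquet", "/tmp/orders_parquet")]:
    print(f"{label:>8}: {dir_size_mb(path):.1f} MB")
```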

It also doesn't really matter if you're constructing a data warehouse in a relational database such as Oracle Exadata (Oracle's engineered database platform).

Having said that, whatever form the incoming data takes (JSON, CSV, XML, mainframe .dat files or SAS sas7bdat files), the first step is to standardise it into a columnar format like Parquet/ORC/Iceberg, as it offers better compression and…
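A sketch of that standardisation step, assuming pandas is available for the SAS files and the spark-xml package is installed for XML; the source paths are illustrative:

```python
import pandas as pd

# CSV and JSON are handled by Spark out of the box.
spark.read.option("header", True).csv("/mnt/raw/customers.csv") \
    .write.mode("overwrite").parquet("/mnt/standardised/customers/")

spark.read.json("/mnt/raw/events/") \
    .write.mode("overwrite").parquet("/mnt/standardised/events/")

# XML needs the spark-xml package (com.databricks:spark-xml) on the cluster.
spark.read.format("xml").option("rowTag", "record") \
    .load("/mnt/raw/invoices.xml") \
    .write.mode("overwrite").parquet("/mnt/standardised/invoices/")

# SAS sas7bdat files: read with pandas, then hand over to Spark.
sas_df = pd.read_sas("/mnt/raw/claims.sas7bdat", format="sas7bdat")
spark.createDataFrame(sas_df) \
    .write.mode("overwrite").parquet("/mnt/standardised/claims/")
```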
