Data Reliability Challenges in Building with Data Lakes
--
A data lake can store all data, regardless of source, structure, and (usually) size. An ideal data lake also embraces the nontraditional data types that come from nontraditional data sources, such as web server logs, sensor data, social network activity, text, and images. New use cases for these data types continue to be identified.
Because all data in the data lake is stored in its raw form, access can be given to anyone who needs to analyze the data quickly. For data science, data lakes provide a convenient storage layer for experimental data, covering both the input and the output of data analysis and learning tasks. Data can be created and used autonomously, without coordination with other programs or analysts.
The curated data area contains structured data views, similar to what users are accustomed to when querying the data warehouse, with somewhat less effort to create. The curated data may be physically populated on a schedule, or may be exposed via a semantic layer.
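The two ways of exposing curated data can be contrasted in a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for the lake's query engine; the table and view names (`raw_orders`, `curated_totals`, `curated_totals_v`) and the sample rows are hypothetical, and a plain SQL view is assumed here as the simplest form of a semantic layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table: amounts arrive as untyped text, as raw data often does.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("a", "5.0"), ("a", "7.5"), ("b", "2.0")],
)

# One curated definition, reused by both approaches.
CURATED_SQL = (
    "SELECT customer, SUM(CAST(amount AS REAL)) AS total "
    "FROM raw_orders GROUP BY customer"
)

# Option 1: physically populate a curated table (a scheduler would rerun this).
conn.execute(f"CREATE TABLE curated_totals AS {CURATED_SQL}")

# Option 2: expose the same logic as a view, computed at query time.
conn.execute(f"CREATE VIEW curated_totals_v AS {CURATED_SQL}")

# New raw data lands after the curated table was populated.
conn.execute("INSERT INTO raw_orders VALUES ('b', '3.0')")

table_rows = conn.execute(
    "SELECT customer, total FROM curated_totals ORDER BY customer"
).fetchall()
view_rows = conn.execute(
    "SELECT customer, total FROM curated_totals_v ORDER BY customer"
).fetchall()
```

In this sketch the populated table stays as it was until the next scheduled refresh, while the view reflects the newly landed row immediately. The trade-off is that the view recomputes the aggregation on every query.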
In the ideal data lake, no changes should be allowed to the raw data, as it is considered immutable. Direct access to the raw data should be highly restricted, although select users could be empowered to conduct early analysis. It is particularly important to retain, and securely back up, raw data in its native format to ensure:
- All data which is transformed downstream from the raw data can be regenerated on-demand or when necessary
- Access to the raw data is possible in select circumstances; for instance, data scientists frequently request the raw data because no context has been applied to it
- Transformations or algorithms which adapt over time can be reprocessed, thus improving accuracy of the historical curated data
- Point-in-time analysis can be accomplished if data has been stored in a way which supports historical reporting
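The points above can be illustrated with a minimal sketch, under several assumptions: the raw zone is a local directory, batches are JSON files partitioned by a `load_date=` folder, and the helper names `write_raw` and `rebuild_curated` are hypothetical. It shows write-once raw storage, regeneration of curated rows from raw data, and a simple point-in-time filter.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def write_raw(raw_root: Path, source: str, load_date: str, records: list) -> Path:
    """Land one batch in the raw zone; refuse to overwrite an existing batch."""
    target = raw_root / source / f"load_date={load_date}" / "batch.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file exists, so raw data is write-once.
    with open(target, "x", encoding="utf-8") as fh:
        json.dump(records, fh)
    return target


def rebuild_curated(
    raw_root: Path,
    source: str,
    transform: Callable[[dict], dict],
    as_of: Optional[str] = None,
) -> list:
    """Regenerate curated rows from raw batches, optionally only those loaded up to as_of."""
    rows = []
    for part in sorted((raw_root / source).glob("load_date=*")):
        load_date = part.name.split("=", 1)[1]
        if as_of is not None and load_date > as_of:
            continue  # ISO-formatted dates compare correctly as strings
        with open(part / "batch.json", encoding="utf-8") as fh:
            rows.extend(transform(record) for record in json.load(fh))
    return rows


def to_celsius(record: dict) -> dict:
    """Example transformation; it can change later and be re-applied to all history."""
    return {"temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}


raw_root = Path(tempfile.mkdtemp())
write_raw(raw_root, "sensors", "2024-01-01", [{"temp_f": 50.0}])
write_raw(raw_root, "sensors", "2024-01-02", [{"temp_f": 68.0}])

all_rows = rebuild_curated(raw_root, "sensors", to_celsius)
jan_first = rebuild_curated(raw_root, "sensors", to_celsius, as_of="2024-01-01")
```

Because the raw batches are never modified, swapping `to_celsius` for an improved transformation and calling `rebuild_curated` again reprocesses the full history. A production lake would typically enforce immutability through storage permissions or object versioning rather than a file-open mode.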
There are many challenges to working with raw “un-curated” data. Only highly adept data analysts and data scientists will want to tackle the wrangling of raw data. A compromise for the less specialized data analyst is to introduce a curated data…