PySpark — Read All files from nested Folders/Directories

Ryan Arjun
3 min read · Jul 11, 2023

PySpark is the Python API for Apache Spark, an analytical processing engine for large-scale, sophisticated distributed data processing and machine learning applications.

If you are a PySpark developer, data scientist, or data analyst, you will frequently need to load data from a hierarchical data directory. These nested directories are typically produced when an ETL process keeps writing data for different dates into separate folders, and you then want to load all of those CSV files into a Spark DataFrame for further analysis. In this article, I will discuss loading data from nested directories.
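Before walking through the steps, here is a minimal sketch of the end result, assuming a hypothetical layout such as data/2023/07/11/sales.csv. Spark can pick up nested CSV files either through wildcard path patterns or through the reader's recursiveFileLookup option (available since Spark 3.0); the paths and app name below are placeholders.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession, the entry point for the DataFrame API.
spark = SparkSession.builder.appName("NestedFolderRead").getOrCreate()

# Option 1: wildcard path patterns, one '*' per directory level.
# Matches files such as data/2023/07/11/sales.csv (hypothetical layout).
df_glob = spark.read.option("header", "true").csv("data/*/*/*/*.csv")

# Option 2: recursiveFileLookup (Spark 3.0+) walks every subdirectory,
# however deeply the files are nested.
df_recursive = (
    spark.read
    .option("header", "true")
    .option("recursiveFileLookup", "true")
    .csv("data/")
)

df_recursive.show(5)
```

Wildcards are useful when the folder depth is fixed and known; recursiveFileLookup is the simpler choice when files may sit at arbitrary depths.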

Step 1: Import all the necessary libraries into our code, as described below and sketched in the code example after this list:

  • SparkContext is the entry gate to Apache Spark functionality. The first step of any Spark driver application is to create a SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
  • SparkSession is the entry point to underlying PySpark functionality and is used to programmatically create PySpark RDDs and DataFrames.
  • SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files, whereas SparkContext is the lower-level entry point used for RDD operations.
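Here is a minimal sketch of these imports; the app name ReadNestedFolders is a placeholder. Note that in modern PySpark a SparkSession alone is usually sufficient, and SQLContext has been deprecated since Spark 2.0.

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

# SparkSession is the unified entry point; building it creates
# (or reuses) the underlying SparkContext.
spark = SparkSession.builder \
    .appName("ReadNestedFolders") \
    .getOrCreate()

# The SparkContext behind the session: used for RDDs, accumulators,
# and broadcast variables.
sc = spark.sparkContext

# SQLContext is shown only because it is introduced above; since
# Spark 2.0 it is deprecated in favor of SparkSession.
sqlContext = SQLContext(sc)
```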
