PySpark — Read CSV file into DataFrame

Ryan Arjun
Jan 15, 2021

As we know, PySpark is a Python API for Apache Spark, while Apache Spark itself is an analytical processing engine for powerful, large-scale distributed data processing and machine learning applications.

Read CSV file into spark dataframe, drop some columns, and add new columns

If you want to process a large dataset that is saved as a CSV file, you can read the CSV file into a Spark DataFrame, drop some columns, and add new columns. So, let's walk through this operation step by step (and put the whole thing together in a sketch at the end of this post).

Note: I’m using a Jupyter Notebook for this process and assuming that you have already set up PySpark on it.

Step 1: Import all the necessary libraries into our code, as given below:

  • SparkContext is the entry gate to Apache Spark functionality. The most important step of any Spark driver application is to create a SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
  • SparkSession is the entry point to the underlying PySpark functionality and is used to programmatically create PySpark RDDs and DataFrames.
  • SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. A SparkContext backs the SQLContext, which in turn wraps a SparkSession.
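
Here is a minimal sketch of this step, assuming PySpark is already set up in your notebook environment. The application name is an illustrative choice, and SQLContext is created only for completeness (since Spark 2.0 it is kept mainly for backward compatibility):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

# SparkSession is the unified entry point since Spark 2.0;
# getOrCreate() reuses a session if the notebook already started one.
spark = SparkSession.builder \
    .appName("ReadCSVIntoDataFrame") \
    .getOrCreate()

# The SparkContext backing this session (the connection to the cluster),
# and an SQLContext wrapping it for backward compatibility.
sc = spark.sparkContext
sqlContext = SQLContext(sc)
```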

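Putting it all together, here is a rough sketch of the full operation promised above: read the CSV into a DataFrame, drop some columns, and add new ones. The file path, column names, and the derived column are illustrative assumptions, not values from an actual dataset:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col

spark = SparkSession.builder.appName("ReadCSVIntoDataFrame").getOrCreate()

# Read the CSV file into a Spark DataFrame; header and inferSchema
# are options of the built-in CSV reader.
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

# Drop the columns we don't need (placeholder names).
df = df.drop("unwanted_col_1", "unwanted_col_2")

# Add new columns: a constant column via lit(), and one derived from
# an existing (assumed) numeric column.
df = df.withColumn("source_file", lit("sample.csv")) \
       .withColumn("amount_with_tax", col("amount") * 1.10)

df.show(5)
```

Each withColumn call returns a new DataFrame rather than mutating the old one, so chaining the calls keeps the transformation pipeline easy to read.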
