PySpark — Read CSV File into DataFrame
4 min read · Jan 15, 2021
PySpark is the Python API for Apache Spark, an analytics engine for large-scale distributed data processing and machine learning applications.
Suppose you have a large dataset saved as a CSV file and you want to read it into a Spark DataFrame, drop some columns, and add new ones. We'll walk through this operation step by step.
Note: I'm using a Jupyter Notebook for this walkthrough and assuming you have already set up PySpark in it.
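As a quick preview, here is the whole operation in one sketch. The file path and column names are hypothetical placeholders, and `spark` is the SparkSession we'll create in Step 1:

```python
# Hypothetical preview: read a CSV, drop a column, add a derived column.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df = df.drop("unwanted_col")                             # drop a column we don't need
df = df.withColumn("price_with_tax", df["price"] * 1.2)  # add a new derived column
df.show(5)
```

Each of these steps is explained in detail below.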
Step 1: Import all the necessary libraries, as shown in the sketch after the following recap:
- SparkContext is the entry gate to Apache Spark functionality. The first step of any Spark driver application is to create a SparkContext, which represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
- SparkSession is the entry point to underlying PySpark functionality and is used to programmatically create PySpark RDDs and DataFrames. Since Spark 2.0 it subsumes the older contexts.
- SQLContext can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. A SparkContext backs the SQLContext, which in turn wraps a SparkSession.
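A minimal sketch of these imports and of creating the session in a notebook cell. The application name is my own choice, and `local[*]` assumes you're running Spark locally:

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

# Build (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder \
    .appName("ReadCSVIntoDataFrame") \
    .master("local[*]") \
    .getOrCreate()

sc = spark.sparkContext      # the underlying SparkContext
sqlContext = SQLContext(sc)  # SQLContext backed by that SparkContext
```

In Spark 2.0 and later, the SparkSession already covers everything SQLContext used to do, so the last line is only needed if you're working with older-style code.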