Member-only story

PySpark— Update multiple columns in a dataframe

3 min readNov 14, 2023

Working as a PySpark developer, data engineer, data analyst, or data scientist for any organisation requires you to be familiar with dataframes because data manipulation is the act of transforming, cleansing, and organising raw data into a format that can be used for analysis and decision making.

Note: We are using Databricks environment to articulate this example.

We understand, we can add a column to a dataframe and update its values to the values returned from a function or other dataframe column’s values. In PySpark, you can update multiple columns in a DataFrame using the withColumn method along with the col function from the pyspark.sql.functions module as given below -

## importing sparksession from  
## pyspark.sql module 
from pyspark.sql import SparkSession 
from pyspark.sql.functions import col

# Create a Spark session and giving an app name 
spark = SparkSession.builder.appName("UpdateMutliColumns").getOrCreate()

When you see data in a list in PySpark, it signifies you have a collection of data in a PySpark driver.

PySpark— Update multiple columns in a dataframe

Written by Ryan Arjun

No responses yet