Ryan Arjun
5 min readNov 18, 2023

PySpark — Retrieve Unmatched rows from two Dataframes

Data integrity refers to the quality, consistency, and reliability of data throughout its life cycle. Data engineering pipelines are methods and structures that collect, transform, store, and analyse data from many sources.

If you are working as a PySpark developer, data engineer, data analyst, or data scientist for any organisation requires you to be familiar with dataframes because data manipulation is the act of transforming, cleansing, and organising raw data into a format that can be used for analysis and decision making.

For example, you have some user’s data in dataframe-1, and you have to new users’ data in a dataframe-2, then you must find out all the Unmatched records from both dataframe-2 and dataframe-1.

In PySpark, you can retrieve Unmatched rows from two Dataframes using the join, full join and Substract operations.

You have to use some tricks such as —

Matched Dataframe — Use inner join operation combines rows from two Dataframes based on a common column.

Full Dataframe — Use full join operation combines all rows from two Dataframes based on a common column.

Unmatched Dataframe — use substract operation on Full Dataframe by Matched Dataframe

# importing sparksession from  
from pyspark.sql import SparkSession
# pyspark.sql module
from pyspark.sql.functions import col

# Create a Spark session and giving an app name
spark = SparkSession.builder.appName("UnmatchedData").getOrCreate()

Dataset 1: In this dataset, we have three columns such as Name, Age and Occupation and have a pre-defined schema for our PySpark dataframe as given below —

# Sample data for DataFrame1
dataset1 = [("Ryan Arjun", 25, "Engineer"),
("Kimmy Wang", 30, "Data Scientist"),
("Saurabh Yadav", 22, "Analyst")]

# Define the schema for DataFrame1
ds_schema1 = ["Name", "Age", "Occupation"]
Ryan Arjun

BI Specialist || Azure || AWS || GCP — SQL|Python|PySpark — Talend, Alteryx, SSIS — PowerBI, Tableau, SSRS