Data Engineering — Scala vs Python for Spark

Ryan Arjun
5 min read · Nov 3, 2023

If you are a Data Engineer, you will most likely need to know Python anyway. Which language to prioritize really depends on what you want to do within data engineering and where you want to work. I agree that SQL and Python are the most important for starting out and give you access to a lot more opportunities than Scala. The Scala market is super niche and dominated by Spark, which is pretty unpleasant to work with.

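Spark runs at the same pace in Scala and Python (save for UDFs), because DataFrame operations written in either language are turned into the same execution plan and run on the JVM. Performance is therefore not a meaningful reason to choose one over the other.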
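To make the UDF exception concrete, here is a minimal PySpark sketch, not a benchmark. It assumes `pyspark` is installed and a local Spark session can be started; the helper `shout`, the function `run_demo`, and the sample names are hypothetical and exist only for illustration. The first query uses built-in column functions, which are evaluated inside the JVM; the second wraps plain Python logic in a UDF, so rows have to be shipped to a Python worker process and back.

```python
def shout(name):
    """Plain Python logic (hypothetical helper) that is wrapped as a UDF below."""
    return None if name is None else name.strip().upper()


def run_demo():
    # Imported lazily so that shout() can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("udf-vs-builtin")
        .getOrCreate()
    )
    # Illustrative sample data, not taken from the article.
    df = spark.createDataFrame([(" alice ",), ("bob",), (None,)], ["name"])

    # Built-in column functions: the expression is evaluated inside the JVM.
    builtin = df.select(F.upper(F.trim(F.col("name"))).alias("name"))

    # Python UDF: Spark serializes rows to a Python worker and back.
    shout_udf = F.udf(shout, StringType())
    with_udf = df.select(shout_udf(F.col("name")).alias("name"))

    builtin_rows = [row["name"] for row in builtin.collect()]
    udf_rows = [row["name"] for row in with_udf.collect()]
    spark.stop()
    return builtin_rows, udf_rows
```

Calling `run_demo()` should return the same values from both queries; only the place where the work happens differs. The practical takeaway is to prefer built-in functions where one exists and to reach for a Python UDF only when the logic cannot be expressed with them.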

Keep in mind that the two are vastly different to learn. Python is incredibly simple; rather than studying it, you basically just pick it up. Scala, on the other hand, is a “Scalable Language” with depths worth exploring that will keep you on your toes for years. Then again, if you only learn it to write Spark code, there is not much to learn apart from the Spark DSL.

Practically, Python is an interpreted language and one of the fastest-growing programming languages. Whether it’s data manipulation with Pandas, creating visualizations with Seaborn, or deep learning with TensorFlow, Python seems to have a tool for everything. I have never met a data engineer who doesn’t know Python.

Apache Beam — a data processing framework that’s gaining popularity because it can handle both streaming and batch…

Ryan Arjun

BI Specialist || Azure || AWS || GCP — SQL|Python|PySpark — Talend, Alteryx, SSIS — PowerBI, Tableau, SSRS