Table of Contents
- Getting Started with Data Engineering: Key Installations of Java, Spark, & Python PySpark
Getting Started with Data Engineering: Key Installations of Java, Spark, & Python PySpark
In this guide, I will show how to set up a complete data engineering environment on Windows, including Java, full Apache Spark with Hadoop winutils, Python, and PySpark. Additionally, I'll describe the significance of the different environment variables, their roles, and the key differences between PySpark and a full Spark setup.
Install Java [Oracle JDK]
I’ve opted for the traditional Java, bringing with it the familiar folder system. Feel free to explore variants like OpenJDK.
- Download the JDK 11 (or later) installer from the Oracle JDK Downloads page.
- Install the JDK in the default directory (typically C:\Program Files\Java\jdk-11).
- To verify the installation, enter java -version in your command prompt. The output should show the installed JDK version.
Install Full Apache Spark
- Download spark-3.5.0-bin-hadoop3.tgz from spark.apache.org
- Create a folder C:\Spark. Place the unzipped contents of spark-3.5.0-bin-hadoop3.tgz inside it. Your C:\Spark folder should now contain bin, jars, etc.
- Establish this folder structure: C:\hadoop\bin
- Download winutils.exe from github/cdarlint and place it inside C:\hadoop\bin (the expected layout is sketched below)
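For reference, a rough sketch of the expected layout after these steps (exact folder contents vary by Spark version):

C:\Spark
  bin\        (spark-submit.cmd, spark-shell.cmd, pyspark.cmd, ...)
  conf\
  jars\
  python\

C:\hadoop
  bin\
    winutils.exe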
Install Python [python.org]
- Download python-3.12.0-amd64.exe (or similar) from the Python Downloads page.
- Execute the downloaded installer and opt for Customize Installation. Ensure you select Add python.exe to PATH.
- Proceed with all optional features and click Next.
- In Advanced Options, select “Install Python 3.12 for all users”.
- A successful installation ends with a Setup Success message.
- Verify the installation by typing python --version in your command prompt. The Python version number indicates a successful installation.
Set Env Variables
Entries
Navigate to Run ➤ SYSDM.CPL ➤ Advanced ➤ Environment Variables, and create or set these environment variables at the system (recommended) or user levels:
Variable | Value |
---|---|
JAVA_HOME | C:\Program Files\Java\jdk-11 |
SPARK_HOME | C:\Spark |
HADOOP_HOME | C:\hadoop |
PYSPARK_PYTHON | C:\Program Files\Python312\python.exe |
Path | %JAVA_HOME%\bin |
| %SPARK_HOME%\bin |
| %HADOOP_HOME%\bin |
If you set these variables from PowerShell with Admin privileges instead, remember to change 'Machine' (system-wide) to 'User' (user level) as required.
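A minimal PowerShell sketch (run as Administrator; the values assume the default paths used in this guide, so adjust them to your machine):

[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Program Files\Java\jdk-11", "Machine")
[Environment]::SetEnvironmentVariable("SPARK_HOME", "C:\Spark", "Machine")
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\hadoop", "Machine")
[Environment]::SetEnvironmentVariable("PYSPARK_PYTHON", "C:\Program Files\Python312\python.exe", "Machine")

# Append the three bin folders to the machine-level Path (expanded paths, to keep the example simple)
$path = [Environment]::GetEnvironmentVariable("Path", "Machine")
$newPath = "$path;C:\Program Files\Java\jdk-11\bin;C:\Spark\bin;C:\hadoop\bin"
[Environment]::SetEnvironmentVariable("Path", $newPath, "Machine")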
Explanation
Link python.exe with PYSPARK_PYTHON
We set the PYSPARK_PYTHON environment variable to the full path of python.exe (for the all-users install above, typically C:\Program Files\Python312\python.exe) to specify which Python executable Spark should use. This is vital, particularly if you have multiple Python installations.
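If you prefer to pin the interpreter per script rather than system-wide, here is a small sketch (it assumes the script itself already runs under the Python you want Spark to use):

import os, sys
from pyspark.sql import SparkSession

# Point Spark's workers (and driver) at this script's own interpreter.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.appName("InterpreterCheck").getOrCreate()
print(spark.version)
spark.stop()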
%JAVA_HOME%\bin to PATH
While C:\Program Files\Common Files\Oracle\Java\javapath might already be in your system’s Path environment variable, it’s generally advisable to add %JAVA_HOME%\bin to your Path. This ensures your system uses the JDK’s executables, rather than those from another Java installation.
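A quick way to confirm which java wins on your Path (open a new command prompt first so the updated variables are picked up):

import shutil

# Should print a path under %JAVA_HOME%\bin, e.g. C:\Program Files\Java\jdk-11\bin\java.exe
print(shutil.which("java"))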
Install PySpark
Background
If your code involves creating a Spark session and DataFrames, you’ll need the PySpark libraries. Install them using pip install pyspark. This does two things:
- Installs the libraries
- Installs a ‘miniature, standalone’ Spark environment for testing
However, in our case, we don’t need the ‘miniature Spark’ that comes with the PySpark libraries. We’ll manage potential Spark conflicts with the SPARK_HOME variable set to our full Spark environment.
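To see which Spark installation is actually in play, a small sketch contrasting the pip-installed library with the SPARK_HOME set earlier:

import os
import pyspark

# Location of the pip-installed library in site-packages ...
print(pyspark.__file__)
# ... versus the full Spark installation that PySpark will launch against.
print(os.environ.get("SPARK_HOME"))   # expected: C:\Spark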
Install PySpark System-Wide
Open a command prompt with Admin privileges. Use pip (included with our Python) and execute pip install pyspark for a system-wide installation.
Check the Installation
After installation, confirm that PySpark is in the global site-packages:
pip show pyspark
The Location: field in the output shows where it was installed.
See It in Action
Test your PySpark installation by starting a Spark session in a Python environment:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("TestApp") \
.getOrCreate()
print(spark.version)
spark.stop()
If Spark starts without errors, your PySpark setup with Python is successful.
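To go a step further than printing the version, a short sketch with made-up sample rows also exercises DataFrame operations end to end:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestApp").getOrCreate()

# Hypothetical sample data, purely to confirm DataFrame creation and actions work.
df = spark.createDataFrame([("alice", 1), ("bob", 2)], ["name", "id"])
df.show()
print(df.count())   # expected: 2
spark.stop()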
Appendix
PYSPARK_PYTHON
Overview
- Selects Python Interpreter: Designates which Python version Spark executors should use for UDFs and transformations. Key in setups with multiple Python versions (see the sketch after this list).
- Uniformity in Clusters: Guarantees that all cluster nodes use the same Python environment, maintaining consistency in PySpark.
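As an illustration of the first point, a hedged sketch that asks the executors which interpreter they run (the lambda is evaluated in the worker Python process selected by PYSPARK_PYTHON):

import sys
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("WhichPython").getOrCreate()

# Zero-argument UDF: it runs on the executors, so it reports their Python version.
executor_python = udf(lambda: sys.version.split()[0], StringType())

spark.range(1).select(executor_python().alias("executor_python")).show()
spark.stop()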
PySpark vs Full Spark Overview
I have covered this in a separate section. Read More...
© D Das
📧 das.d@hotmail.com | ddasdocs@gmail.com