How to set up, install & run Apache Spark on Jupyter Notebook — Windows / Mac (M1/M2)

Anish Mahapatra
6 min read · Nov 14, 2023


In this blog, we will learn how to set up and run Apache Spark on Windows and macOS.


Install Anaconda Navigator

Head on to the Anaconda website and download the relevant Anaconda installer for your system.

Launch Jupyter Notebook from the Navigator.

In Jupyter, use the file browser to navigate to the folder that will be your working directory.

I have selected Documents/DataEngineering as my working directory here.

Install Spark on your Mac M1/M2

If you have Windows, you can use this link as a reference:
Install Spark on Windows 10

# Move to your working directory
cd ~/Documents/DataEngineering

# Download and extract Spark (3.2.0 with Hadoop 3.2 shown here; pick the version you need)
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xzvf spark-3.2.0-bin-hadoop3.2.tgz

# Point SPARK_HOME at the extracted folder and add its bin directory to PATH
export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH

# Tell PySpark which Python to use
export PYSPARK_PYTHON=python3

# Verify the installation
spark-submit --version

We have successfully installed Spark on our system!
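Note that export only sets the variables for the current terminal session. To make them permanent, append them to your shell profile; a minimal sketch, assuming the default zsh shell on macOS:

echo 'export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2' >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.zshrc
source ~/.zshrc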

Install Spark on Windows

Download Anaconda for Windows

Head on to the Anaconda downloads page for Windows and install Anaconda to make sure that you have Jupyter Notebook.

Download Java for Windows

Head on to the Java downloads page and install Java as well. Then open Command Prompt and verify the Java installation:

java -version

Download Spark for Windows

Download the Apache Spark .tgz file for Windows from the Apache Spark downloads page.

Then cd to your Downloads folder and compute the SHA512 checksum of the archive, so you can compare it against the checksum published on the download page:

cd %USERPROFILE%\Downloads
certutil -hashfile .\spark-3.5.0-bin-hadoop3.tgz SHA512
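If you want the reference value on hand, the official checksum file is published next to the archive; a quick sketch, assuming the standard Apache archive layout for Spark 3.5.0:

curl -O https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz.sha512
type spark-3.5.0-bin-hadoop3.tgz.sha512

The two hashes should match exactly before you proceed.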

Create folders for hadoop and Spark

Create a new folder in C:\ called hadoop.

Create a new folder in C:\Program Files called Spark.

Place unzipped (untarred) Spark file in Spark Folder

Extract the .tgz file and place the resulting folder inside the Spark folder that you created (the folder name should not end in .tgz). You can use 7-Zip in case Windows cannot extract it natively.

Download and place winutils.exe in the hadoop folder

Download winutils.exe from here: https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.5/bin

Inside the hadoop folder you created in C:\, make a bin subfolder and place the winutils.exe file there.

Add environment variables for SPARK_HOME (user and system variables)

Search for environment variables in Windows Search and select the “Edit the system environment variables” option.

In the System Properties window that opens, select “Environment Variables…”.

Under User Variables, select New, enter SPARK_HOME as the variable name, and set the value to the path where we stored Spark in our file system, e.g. C:\Program Files\Spark\spark-3.5.0-bin-hadoop3. (Make sure there are no double quotes.)

Now, under System Variables, double-click on Path.

Select New and enter:

%SPARK_HOME%\bin

Add environment variables for HADOOP_HOME (user and system variables)

Similarly, add HADOOP_HOME under User Variables, pointing to the hadoop folder:

C:\hadoop

And add Hadoop's bin folder to Path under System Variables:

%HADOOP_HOME%\bin

Add environment variables for JAVA_HOME (user and system variables)

Also, be sure to add JAVA_HOME under User Variables, pointing to your Java installation folder.

And add %JAVA_HOME%\bin to Path under System Variables.
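Before moving on, open a new Command Prompt and sanity-check the variables; the exact paths printed will depend on where you installed everything:

echo %SPARK_HOME%
echo %HADOOP_HOME%
echo %JAVA_HOME%
where winutils.exe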

Run Command Prompt as Admin

Head on to Command Prompt and right-click to select “Run as Administrator”

Change the directory to the location where you stored Spark and go into its bin folder. In my case, it is the following:

cd C:\Program Files\Spark\spark-3.5.0-bin-hadoop3\bin

Run Spark-Shell

Then, run spark-shell

spark-shell
Tada! Spark is now installed.
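You can also launch the PySpark shell from the same bin folder to confirm the Python bindings work. A quick smoke test (spark.range(5) builds a five-row DataFrame, so count() should return 5):

pyspark
>>> spark.range(5).count()
5

Exit the shell with exit() when you are done.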

Run Anaconda Navigator and Launch Jupyter Notebook

Great, Spark is now successfully installed on your system. Please feel free to connect with me on LinkedIn — I am always happy to help!

https://www.linkedin.com/in/anishmahapatra/

Validate Spark on Jupyter Notebook

!pip install findspark
!pip install pyspark

# Import the 'warnings' module and filter out warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

# Import 'findspark' and initialize it to set up the necessary environment variables for Spark
import findspark
findspark.init()

# Import the 'SparkSession' class from PySpark
from pyspark.sql import SparkSession

# Create a SparkSession with the application name "SparkDemo"
spark = SparkSession.builder.appName("SparkDemo").getOrCreate()

# Define sample data in the form of a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]

# Define column names for the DataFrame
columns = ["Name", "Age"]

# Create a PySpark DataFrame from the data and columns
df = spark.createDataFrame(data, columns)

# Display the contents of the DataFrame
df.show()

Demo of Finding Word Count on Spark

Make a file called input.txt with a few lines of sample text, and place it in the working directory.
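The word-count job itself is only a few lines of PySpark. Here is a minimal sketch, assuming input.txt sits in the working directory (the app name is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for the word-count demo
spark = SparkSession.builder.appName("WordCountDemo").getOrCreate()

# Read the file as an RDD of lines
lines = spark.sparkContext.textFile("input.txt")

# Split each line into words, pair each word with 1, and sum the counts per word
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

# Print each (word, count) pair
for word, count in word_counts.collect():
    print(word, count)

Demo of Spark SQL

Beyond RDD-style jobs, Spark can also run SQL queries directly against a DataFrame, as the next example shows.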

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Spark SQL Demo") \
    .getOrCreate()

# Sample data
data = [Row(name="Alice", age=25),
        Row(name="Bob", age=30),
        Row(name="Charlie", age=35)]

# Create a DataFrame from the data
df = spark.createDataFrame(data)
df.show()

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Perform a query using Spark SQL
query = "SELECT * FROM people"
result_df = spark.sql(query)
result_df.show()

# Filter rows with a WHERE clause
query = "SELECT name, age FROM people WHERE age > 28"
result_df = spark.sql(query)

# Show the results
result_df.show()

# Stop the SparkSession
spark.stop()
