How to set up, install & run Apache Spark on Jupyter Notebook — Windows / Mac (M1/M2)
In this blog, we will learn how to set up and run Apache Spark on Windows and macOS (M1/M2).
Install Anaconda Navigator
Head over to the Anaconda website and download the relevant Anaconda installer for your system; it includes Anaconda Navigator.
Launch Jupyter Notebook from the Navigator.
In Jupyter, navigate to the folder that will be your working directory.
I have selected Documents/DataEngineering as my working directory here.
Install Spark on your Mac M1/M2
If you have Windows, you can use this link as a reference:
Install Spark on Windows 10
cd ~/Documents/DataEngineering
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xzvf spark-3.2.0-bin-hadoop3.2.tgz
export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
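These export statements only apply to the current terminal session. To make them permanent, append the same lines to your shell profile (zsh is the default shell on M1/M2 Macs) and reload it:
echo 'export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2' >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.zshrc
source ~/.zshrc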
spark-submit --version
We have successfully installed Spark on our system!
Install Spark on Windows
Download Anaconda for Windows
Head over to the Anaconda downloads page for Windows and download Anaconda, which ensures that Jupyter Notebook is available on your system.
Download Java for Windows
Head over to the Java downloads page and download and install Java as well. Then open Command Prompt / Terminal and verify the installation:
java -version
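If the command prints a version number, Java is installed and on your PATH; if it is not recognized, revisit the Java installation before proceeding.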
Download Spark for Windows
Download the Apache Spark .tgz file for Windows. Then cd to your Downloads folder and verify the download's integrity:
certutil -hashfile .\spark-3.5.0-bin-hadoop3.tgz SHA512
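Compare the printed hash with the SHA512 checksum published for that release on the Apache Spark downloads page; the two should match exactly.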
Create folders for hadoop and Spark
Create a new folder in C:\ called hadoop.
Create a new folder in C:\Program Files called Spark.
Place unzipped (untarred) Spark file in Spark Folder
Extract the archive and place the resulting folder (its name should not end in .tgz or .tar) inside the Spark folder that you created. You can use 7-Zip if Windows cannot extract it natively; you may need to extract twice, once for the .tgz and once for the inner .tar.
Download and place winutils.exe in the hadoop folder
Download winutils.exe from here: https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.5/bin
Inside the C:\hadoop folder you created earlier, make a bin folder and place the winutils.exe file there, so that it sits at C:\hadoop\bin\winutils.exe.
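If you prefer the command line, the same folder layout can be created from Command Prompt:
mkdir C:\hadoop\bin
mkdir "C:\Program Files\Spark"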
Add environment variables for SPARK_HOME (user and system variables)
Search for environment variables in Windows Search and select the “Edit the system environment variables” option.
In the System Properties window that opens, click the “Environment Variables…” button.
Under User variables, select New and add SPARK_HOME as the variable name, with the path where we placed Spark in our file system as the value, for example C:\Program Files\Spark\spark-3.5.0-bin-hadoop3. (Make sure there are no double quotes.)
Now, under System variables, double-click on Path, select New, and enter:
%SPARK_HOME%\bin
Add environment variables for HADOOP_HOME (user and system variables)
Similarly, add HADOOP_HOME under User variables, with the value:
C:\hadoop
And add the Hadoop bin folder to Path under System variables:
%HADOOP_HOME%\bin
Add environment variables for JAVA_HOME (user and system variables)
Also, be sure to add JAVA_HOME under User variables, pointing to the folder where the JDK is installed, and add %JAVA_HOME%\bin under Path in System variables.
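Alternatively, the same user variables can be set from Command Prompt with setx (a sketch; the Java path below is an example and should match your installed JDK):
setx SPARK_HOME "C:\Program Files\Spark\spark-3.5.0-bin-hadoop3"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk-17"
Note that setx only sets the user variables; the %SPARK_HOME%\bin, %HADOOP_HOME%\bin, and %JAVA_HOME%\bin entries on Path are still easiest to add through the dialog described above, and you will need to open a new Command Prompt for the changes to take effect.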
Run Command Prompt as Admin
Search for Command Prompt in Windows Search, right-click it, and select “Run as administrator”.
Change the directory to the location where you placed Spark and go into its bin folder. In my case it is the following:
cd C:\Program Files\Spark\spark-3.5.0-bin-hadoop3\bin
Run Spark-Shell
Then, run spark-shell
spark-shell
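If the setup is correct, spark-shell prints the Spark welcome banner with the version number and drops you into a scala> prompt (type :quit to exit).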
Run Anaconda Navigator and Launch Jupyter Notebook
Great, Spark is now successfully installed on your system. Please feel free to connect with me on LinkedIn — I am always happy to help!
https://www.linkedin.com/in/anishmahapatra/
Validate Spark on Jupyter Notebook
!pip install findspark
!pip install pyspark
# Import the 'warnings' module and filter out warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")
# Import 'findspark' and initialize it to set up the necessary environment variables for Spark
import findspark
findspark.init()
# Import the 'SparkSession' class from PySpark
from pyspark.sql import SparkSession
# Create a SparkSession with the application name "SparkDemo"
spark = SparkSession.builder.appName("SparkDemo").getOrCreate()
# Define sample data in the form of a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
# Define column names for the DataFrame
columns = ["Name", "Age"]
# Create a PySpark DataFrame from the data and columns
df = spark.createDataFrame(data, columns)
# Display the contents of the DataFrame
df.show()
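If everything is wired up correctly, the output should resemble the following:
+-------+---+
|   Name|Age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+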
Demo of Finding Word Count on Spark
Make a file called input.txt with a few lines of sample text, and place it in the working directory; the sketch below counts the words in it.
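Here is a minimal word-count sketch using the RDD API (the app name "WordCount" is just the one used here):
# Import the 'SparkSession' class from PySpark
from pyspark.sql import SparkSession
# Create (or reuse) a SparkSession with the application name "WordCount"
spark = SparkSession.builder.appName("WordCount").getOrCreate()
# Read input.txt from the working directory as an RDD of lines
lines = spark.sparkContext.textFile("input.txt")
# Split each line into words, pair each word with a count of 1,
# and sum the counts per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
# Collect the (word, count) pairs back to the driver and print them
for word, count in counts.collect():
    print(word, count)
Demo of Spark SQL on Jupyter Notebook
Next is a short Spark SQL demo: we create a DataFrame, register it as a temporary view, and query it with SQL.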
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Spark SQL Demo") \
    .getOrCreate()
# Sample data
data = [Row(name="Alice", age=25),
        Row(name="Bob", age=30),
        Row(name="Charlie", age=35)]
# Create a DataFrame from the data
df = spark.createDataFrame(data)
df.show()
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")
# Perform a query using Spark SQL
query = "SELECT * from people"
result_df = spark.sql(query)
result_df.show()
# Perform a filtered query using Spark SQL
query = "SELECT name, age FROM people WHERE age > 28"
result_df = spark.sql(query)
# Show the results
result_df.show()
# Stop the SparkSession
spark.stop()
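For reference, the filtered query keeps only the rows where age is greater than 28, so the final output should resemble:
+-------+---+
|   name|age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+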