How to set up, install & run Apache Spark on Jupyter Notebook — Windows / Mac (M1/M2)
In this blog, we will learn how to set up and run Apache Spark on Windows and macOS (M1/M2).
Install Anaconda Navigator
Head over to the Anaconda website and download the relevant Anaconda installer for your system; it includes Anaconda Navigator.
Launch Jupyter Notebook from the Navigator.
In Jupyter, navigate to the folder that will be your working directory.
I have selected Documents/DataEngineering as my working directory here.
Install Spark on your Mac M1/M2
If you have Windows, you can use this link as a reference:
Install Spark on Windows 10
cd ~/Documents/DataEngineering
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xzvf spark-3.2.0-bin-hadoop3.2.tgz
export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
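These export statements only apply to the current terminal session. To make them permanent, append the same lines to your shell profile (zsh is the default shell on M1/M2 Macs) and reload it:
echo 'export SPARK_HOME=~/Documents/DataEngineering/spark-3.2.0-bin-hadoop3.2' >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.zshrc
source ~/.zshrc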
spark-submit --version
We have successfully installed Spark on our system!
Install Spark on Windows
Download Anaconda for Windows
Head over to the Anaconda downloads page for Windows and download Anaconda, which ensures that Jupyter Notebook is available on your system.
Download Java for Windows
Head over to the Java downloads page and download and install Java as well. Then open Command Prompt / Terminal and verify the installation:
java -version
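If the command prints a version number, Java is installed and on your PATH; if it is not recognized, revisit the Java installation before proceeding.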
Download Spark for Windows
Download the Apache Spark .tgz file for Windows. Then cd to your Downloads folder and verify the download's integrity:
certutil -hashfile .\spark-3.5.0-bin-hadoop3.tgz SHA512
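Compare the printed hash with the SHA512 checksum published for that release on the Apache Spark downloads page; the two should match exactly.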
Create folders for hadoop and Spark
Create a new folder in C:\ called hadoop.
Create a new folder in C:\Program Files called Spark.
Place unzipped (untarred) Spark file in Spark Folder
Extract the archive and place the resulting folder (its name should not end in .tgz or .tar) inside the Spark folder that you created. You can use 7-Zip if Windows cannot extract it natively; you may need to extract twice, once for the .tgz and once for the inner .tar.
Download and place winutils.exe in the hadoop folder
Download winutils.exe from here: https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.5/bin
Inside the C:\hadoop folder you created earlier, make a bin folder and place the winutils.exe file there, so that it sits at C:\hadoop\bin\winutils.exe.
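If you prefer the command line, the same folder layout can be created from Command Prompt:
mkdir C:\hadoop\bin
mkdir "C:\Program Files\Spark"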
Add environment variables for SPARK_HOME (user and system variables)
Search for environment variables in Windows Search and select the “Edit the system environment variables” option.
In the System Properties window that opens, click the “Environment Variables…” button.
Under User variables, select New and add SPARK_HOME as the variable name, with the path where we placed Spark in our file system as the value, for example C:\Program Files\Spark\spark-3.5.0-bin-hadoop3. (Make sure there are no double quotes.)
Now, under System variables, double-click on Path, select New, and enter:
%SPARK_HOME%\bin
Add environment variables for HADOOP_HOME (user and system variables)
Similarly, add HADOOP_HOME under User variables, with the value:
C:\hadoop
And add the Hadoop bin folder to Path under System variables:
%HADOOP_HOME%\bin
Add environment variables for JAVA_HOME (user and system variables)
Also, be sure to add JAVA_HOME under User variables, pointing to the folder where the JDK is installed, and add %JAVA_HOME%\bin under Path in System variables.
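Alternatively, the same user variables can be set from Command Prompt with setx (a sketch; the Java path below is an example and should match your installed JDK):
setx SPARK_HOME "C:\Program Files\Spark\spark-3.5.0-bin-hadoop3"
setx HADOOP_HOME "C:\hadoop"
setx JAVA_HOME "C:\Program Files\Java\jdk-17"
Note that setx only sets the user variables; the %SPARK_HOME%\bin, %HADOOP_HOME%\bin, and %JAVA_HOME%\bin entries on Path are still easiest to add through the dialog described above, and you will need to open a new Command Prompt for the changes to take effect.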
Run Command Prompt as Admin
Search for Command Prompt in Windows Search, right-click it, and select “Run as administrator”.
Change the directory to the location where you placed Spark and go into its bin folder. In my case it is the following:
cd C:\Program Files\Spark\spark-3.5.0-bin-hadoop3\bin
Run Spark-Shell
Then, run spark-shell
spark-shell
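If the setup is correct, spark-shell prints the Spark welcome banner with the version number and drops you into a scala> prompt (type :quit to exit).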
Run Anaconda Navigator and Launch Jupyter Notebook
Great, Spark is now successfully installed on your system. Please feel free to connect with me on LinkedIn — I am always happy to help!
https://www.linkedin.com/in/anishmahapatra/
Validate Spark on Jupyter Notebook
!pip install findspark
!pip install pyspark
# Import the 'warnings' module and filter out warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")
# Import 'findspark' and initialize it to set up the necessary environment variables for Spark
import findspark
findspark.init()
# Import the 'SparkSession' class from PySpark
from pyspark.sql import SparkSession
# Create a SparkSession with the application name "SparkDemo"
spark = SparkSession.builder.appName("SparkDemo").getOrCreate()
# Define sample data in the form of a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
# Define column names for the DataFrame
columns = ["Name", "Age"]
# Create a PySpark DataFrame from the data and columns
df = spark.createDataFrame(data, columns)
# Display the contents of the DataFrame
df.show()
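If everything is wired up correctly, the output should resemble the following:
+-------+---+
|   Name|Age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+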
Demo of Finding Word Count on Spark
Make a file called input.txt with a few lines of sample text, and place it in the working directory; the sketch below counts the words in it.
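Here is a minimal word-count sketch using the RDD API (the app name "WordCount" is just the one used here):
# Import the 'SparkSession' class from PySpark
from pyspark.sql import SparkSession
# Create (or reuse) a SparkSession with the application name "WordCount"
spark = SparkSession.builder.appName("WordCount").getOrCreate()
# Read input.txt from the working directory as an RDD of lines
lines = spark.sparkContext.textFile("input.txt")
# Split each line into words, pair each word with a count of 1,
# and sum the counts per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
# Collect the (word, count) pairs back to the driver and print them
for word, count in counts.collect():
    print(word, count)
Demo of Spark SQL on Jupyter Notebook
Next is a short Spark SQL demo: we create a DataFrame, register it as a temporary view, and query it with SQL.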
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Spark SQL Demo") \
    .getOrCreate()
# Sample data
data = [Row(name="Alice", age=25),
        Row(name="Bob", age=30),
        Row(name="Charlie", age=35)]
# Create a DataFrame from the data
df = spark.createDataFrame(data)
df.show()
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")
# Perform a query using Spark SQL
query = "SELECT * from people"
result_df = spark.sql(query)
result_df.show()
# Perform a filtered query using Spark SQL
query = "SELECT name, age FROM people WHERE age > 28"
result_df = spark.sql(query)
# Show the results
result_df.show()
# Stop the SparkSession
spark.stop()
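For reference, the filtered query keeps only the rows where age is greater than 28, so the final output should resemble:
+-------+---+
|   name|age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+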