Setup

Last updated on 2024-07-26

For this class, we will be using Python notebooks to connect to a Spark server. This Spark server can be hosted locally or deployed on CloudLab, an NSF-funded experimental testbed for computing research and education.

The instructions below first walk through setting up a Spark environment on your local machine, then through deploying Spark on CloudLab.

Spark Setup: Locally hosted


Setup instructions for each operating system are given in the sections below.

Windows

  • Make sure that your Windows is up to date prior to completing the remaining steps.
  • If you are running on Windows Home, make sure that you switch out of S mode.
  • The minimal configuration this setup has been tested on is:
    • Intel Celeron CPU N3550
    • 4GB memory
    • Windows 10 Home version 20H2

Windows Terminal

  • Make sure that your computer is updated.
  • I will assume that everyone’s computer is running Windows 10, as it is the standard OS on Windows-based laptops sold nowadays.
  • Open the Microsoft Store app, then search for and install Windows Terminal.
[Figure: Windows Terminal in the Microsoft Store]

Install Java

  • It is most likely that you already have Java installed. To confirm this, first, open a terminal.
  • Run javac -version and java -version.

BASH

javac -version
java -version
  • You should see the version of Java displayed on screen; your version might differ from mine.
  • Spark 3.5 supports Java 8, 11, and 17.
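
For reference, the output should resemble the following (the exact version strings will differ from machine to machine; this sample is from a Temurin JDK 11 install):

OUTPUT

javac 11.0.24
openjdk version "11.0.24" 2024-07-16
OpenJDK Runtime Environment Temurin-11.0.24+8 (build 11.0.24+8)
OpenJDK 64-Bit Server VM Temurin-11.0.24+8 (build 11.0.24+8, mixed mode)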

If you do not have both java and javac, you will need to install a JDK (Java Development Kit). We will be using the Java distribution maintained by OpenJDK.

  • Go to OpenJDK website.
  • Choose OpenJDK 8 (LTS) or OpenJDK 11 (LTS) for Version and HotSpot for JVM.
  • Click on the download link to begin downloading the installation package.
  • Run the installer. You can keep all default settings.
  • Once the installation finishes, rerun the tests above in a new terminal to confirm that both java and javac are available.

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for your system.
  • Select the 64-bit installer.
  • Run the Anaconda installation.
  • Remember the installation directory for Anaconda.
  • For Windows, this is typically C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.
  • Open a terminal and run:

BASH

conda init powershell
  • Restart the terminal
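
To confirm that the initialization worked, you can check the conda version in the new terminal (any recent version number is fine):

BASH

conda --version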

Download Spark

  • Download Spark 3.5.1
  • Untar and store the final directory somewhere that is easily accessible.
  • You might need to download and install 7-Zip to decompress .tgz file on Windows.
[Figure: Spark download page]
  • When decompressing, you might have to do it twice, because the first decompression will return a .tar file, and the second decompression is needed to completely retrieve the Spark directory.
  • Move the resulting directory under the C: drive.

Install libraries to support Hadoop functionalities

  • Open Windows Terminal, and create a hadoop directory and a bin subdirectory under the C: drive.

BASH

cd c:\
mkdir -p hadoop\bin
[Figure: creating the hadoop\bin directory in the terminal]
  • Visit the link to winutils.exe, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
  • Visit the link to hadoop.dll, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
[Figure: downloading Windows support files for Hadoop]

Setup environment variables

  • Click on the Windows icon and start typing environment variables in the search box.
  • Click on Edit the system environment variables.
[Figure: accessing and modifying system environment variables]
  • Click on Environment Variables.
  • Under User variables for ..., click New and enter the following name/value pair for each of the items below. Click OK when done.
    • Java
      • Variable name: JAVA_HOME.
      • Variable value: Typically C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot.
    • Spark
      • Variable name: SPARK_HOME.
      • Variable value: C:\spark-3.5.1-bin-hadoop3.
    • Hadoop
      • Variable name: HADOOP_HOME.
      • Variable value: C:\hadoop.
    • Anaconda3
      • Variable name: ANACONDA_HOME.
      • Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3.
[Figure: creating environment variables for Windows]
  • In User variables for ..., select Path and click Edit. Then press New and add each of the items below to the list. Click OK when done.
    • Java: %JAVA_HOME%\bin
    • Spark: %SPARK_HOME%\bin
    • Hadoop: %HADOOP_HOME%\bin
    • Anaconda3: %ANACONDA_HOME%\Scripts
[Figure: listing environment variables]
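
If you prefer the command line, the first three variables can also be set with Windows’ setx command; adjust the paths to your actual install locations. The Path entries are still easiest to add through the GUI, since setx overwrites an existing value rather than appending to it.

BASH

setx JAVA_HOME "C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot"
setx SPARK_HOME "C:\spark-3.5.1-bin-hadoop3"
setx HADOOP_HOME "C:\hadoop"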
  • Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:

BASH

where.exe javac
where.exe spark-shell
where.exe winutils
[Figure: showing Windows paths]
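
If everything is set up, where.exe prints the full path of each executable. The output should look similar to the following (your paths will reflect your own install locations):

OUTPUT

C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot\bin\javac.exe
C:\spark-3.5.1-bin-hadoop3\bin\spark-shell
C:\spark-3.5.1-bin-hadoop3\bin\spark-shell.cmd
C:\hadoop\bin\winutils.exe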

Setup Jupyter and pyspark

  • Open a terminal and run the following:

BASH

pip install findspark

Test Jupyter and pyspark

  • Download the complete works of Shakespeare from http://www.gutenberg.org/files/100/100-0.txt and remember where you save it.
  • Launch Jupyter Notebook:

BASH

jupyter notebook
  • Enter the Python code sketched after this list into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once (if) the run completes successfully, you can revisit your Jupyter server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that marks a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.
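
The exact notebook cell is not reproduced on this page, so the following is a minimal sketch of a cell matching the description above. The names number_cores, memory_gb, PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE, and output-word-count-1 come from the instructions; the word-count logic itself is the standard PySpark example:

PYTHON

import findspark
findspark.init()  # locate the Spark installation via SPARK_HOME

from pyspark.sql import SparkSession

number_cores = 2  # adjust to your machine's logical core count
memory_gb = 2     # adjust to the memory you can spare for Spark

spark = (
    SparkSession.builder
    .master(f"local[{number_cores}]")
    .config("spark.driver.memory", f"{memory_gb}g")
    .appName("word-count")
    .getOrCreate()
)

# Count word occurrences in the downloaded Shakespeare text.
text = spark.sparkContext.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
counts = (
    text.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
)

# Fails if the output directory already exists; delete it between runs.
counts.saveAsTextFile("output-word-count-1")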
[Figure: resulting output files of Word Count]
[Figure: contents of resulting files]
  • You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook:
[Figure: Web UI of the local Spark cluster]
  • Change the Spark page’s tab to Executors to observe the configuration of the cluster:
    • The cluster has 8 cores.
    • Only 8.4GB of the 16GB of memory is shown as available; this is due to Spark’s memory storage reservation/protection.
[Figure: log files of Spark executors]
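
As a rough, back-of-the-envelope explanation (based on Spark’s documented unified memory model, not on anything specific to this setup): Spark reserves a fixed 300MB of the JVM heap and then exposes only spark.memory.fraction (default 0.6) of the remainder as storage/execution memory. The JVM itself reports a usable heap somewhat below the configured 16GB, so roughly (14.3GB − 0.3GB) × 0.6 ≈ 8.4GB shows up in the UI.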

MacOS

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for your system.
  • Select the 64-bit installer.
  • Run the Anaconda installation.

Install Java

You can test whether Java is available by running:

BASH

java --version
javac --version

You should see a Java version of at least 8. (Note: Java 8 itself does not recognize the --version flag; if the commands above fail with an unrecognized-option error, try java -version instead.) If you don’t have a suitable Java, you can set up Java 11 as follows:

  • If you have an M1/M2 Mac machine

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK bundles keep their home directory under Contents/Home.
echo "export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home" >> ~/.zshrc
# Single quotes stop $JAVA_HOME from expanding before .zshrc is read.
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
  • If you have an Intel-based Mac machine.

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK bundles keep their home directory under Contents/Home.
echo "export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home" >> ~/.zshrc
# Single quotes stop $JAVA_HOME from expanding before .zshrc is read.
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Install Spark

Run the following to download, extract, and set up Apache Spark. (If wget is not available on your Mac, install it via Homebrew or replace wget with curl -LO.)

BASH

cd
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xzf spark-3.5.1-bin-hadoop3.tgz
echo "export SPARK_HOME=$HOME/spark-3.5.1-bin-hadoop3" >> ~/.zshrc
# Single quotes stop $SPARK_HOME from expanding before .zshrc is read.
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
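
To confirm the installation, you can print the Spark version; a banner reporting version 3.5.1 should appear:

BASH

spark-submit --version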

Test Jupyter and pyspark

  • Download the complete works of Shakespeare from Project Gutenberg:

BASH

wget http://www.gutenberg.org/files/100/100-0.txt
  • Launch Jupyter notebook

BASH

jupyter notebook
  • Enter the same Python code as in the Windows section above (see the sketch there) into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once (if) the run completes successfully, you can revisit your Jupyter server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that marks a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.
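
Assuming the notebook was launched from your home directory, you can also inspect the output from a terminal (the paths below are illustrative):

BASH

ls output-word-count-1
head output-word-count-1/part-00000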

Linux

For Linux users, the setup process is similar to that of MacOS users. Depending on your shell, you might replace .zshrc with .bashrc.

Spark Setup: Online platform on CloudLab


CloudLab is an NSF-funded experimental testbed for future computing research and education. CloudLab gives researchers control, down to the bare metal, over a diverse and distributed set of resources at large scale. As a result, it allows repeatable and scientific design of experiments. Some key characteristics of CloudLab include:

  • Sliceability: the ability to support virtualization while maintaining some degree of isolation for simultaneous experiments.
  • Deep programmability: the ability to influence the behavior of computing, storage, routing, and forwarding components deep inside the network, not just at or near the network edge.

Create a CloudLab account

  • Visit CloudLab’s website
  • Click Request an Account
  • Fill in the information as shown below and click Submit Request
    • Username: Use your WCUPA username. Make sure everything is lowercase.
    • Full Name: Your first and last name
    • Email: Use a non-WCUPA email. WCUPA email tends to block the confirmation email.
    • Select Country: United States
    • Select State: Pennsylvania
    • City: Malvern
    • Institutional Affiliation: West Chester University of Pennsylvania
    • Ignore the SSH Public Key file box for now.
    • Enter a password of your choice in Password and Confirm Password boxes.
    • Check Join Existing Project for Project Information.
    • Project Name: SecureEDU
  • Wait for a confirmation email to arrive at the email address you provided. You might have to resubmit a new request if you don’t see this email within about half an hour.
[Figure: joining a CloudLab project]
  • After your account is confirmed, the instructor will be able to see your application and can grant you access to CloudLab.
  • If you already had a CloudLab account, you can select Start/Join Project under your username, then select Join Existing Project and provide the name SecureEDU.
  • Launch your terminal (set up in the previous section) and run the following command:

Callout

  • Hit Enter for all questions.
  • Do not enter a password or change the default location of the files.

BASH

cd
ssh-keygen -t rsa
[Figure: generating an SSH key]
  • Run the following command to display the public key

BASH

cat ~/.ssh/id_rsa.pub
[Figure: example content of a public key]
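
A public key is a single line with roughly the following shape (this example is abbreviated and not a real key; yours will contain a much longer base64 string):

OUTPUT

ssh-rsa AAAAB3NzaC1yc2EAAADAQAB...EAAQ== yourname@yourcomputer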
  • Select and copy the key (just the text, with no extra spaces after the last character).
  • Log into CloudLab, click on your username (top right) and select Manage SSH Keys:
[Figure: managing SSH keys in CloudLab]
  • Paste the key into the Key box and click Add Key.

Setup GitHub repository

  • If you do not already have a GitHub account, go to the GitHub website and create one.
  • In your GitHub account, under Repositories, select New.
    • You can select any name for your repo.
    • It must be a public repository.
    • The Add a README file box must be checked.
    • Click Create repository when done.
  • In your new Git repository, click Add file and select Create new file
  • Type profile.py for the file name and enter the following content into the text editor.

PYTHON

import geni.portal as portal
import geni.rspec.pg as rspec

# Create a Request object to start building the RSpec.
request = portal.context.makeRequestRSpec()
# Create a XenVM
node = request.XenVM("node")
node.disk_image = "urn:publicid:IDN+emulab.net+image+emulab-ops:UBUNTU22-64-STD"
node.routable_control_ip = "true"

# Startup services: update the package index, install the Apache web
# server, and report its status when the node boots.
node.addService(rspec.Execute(shell="/bin/sh", command="sudo apt update"))
node.addService(rspec.Execute(shell="/bin/sh", command="sudo apt install -y apache2"))
node.addService(rspec.Execute(shell="/bin/sh", command="sudo systemctl status apache2"))

# Print the RSpec to the enclosing page.
portal.context.printRequestRSpec()
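For orientation: this profile requests a single Xen VM with a publicly routable control IP, running the Ubuntu 22.04 standard image, and registers three startup commands that update the package index, install the Apache web server, and report its status.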
  • Click Commit new file when done.
  • Log in to your CloudLab account, click Experiments on the top left, and select Create Experiment Profile.
[Figure: creating a new CloudLab experiment profile]
  • Click on Git Repo
  • Paste the URL of your previously created Git repo here and click Confirm
[Figure: copying the GitHub repository link]
  • Enter a name for your profile and put in a few words for the Description.
  • You will not have a drop-down list of Project.
  • Click Create when done.
[Figure: CloudLab profile]
  • Click Instantiate to launch an experiment from your profile.
[Figure: instantiating an experiment from a profile]
  • Select a Cluster from Wisconsin, Clemson, or Emulab, then click Next.
  • Do not do anything on the next Start on date/time screen. Click Finish.
[Figure: selecting a physical cluster before finalizing deployment]
  • Your experiment is now being provisioned.
[Figure: CloudLab provisioning the requested resources]
  • Once resources are provisioned, CloudLab will boot up your experiment.
[Figure: nodes of the experiment booting up]
  • When it is ready, you can use the provided SSH command to log in to your experiment (assuming your key was set up correctly).
  • The command is in the List View tab.
[Figure: SSH command to connect to a CloudLab experiment]
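
The command has roughly the following shape (the username and hostname below are placeholders; use the exact command CloudLab displays, which may also include a -p port number for a VM):

BASH

ssh YOUR_CLOUDLAB_USERNAME@pcXXX.emulab.net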

Callout

  • If you fail to connect to the ready experiment, it is most likely that your public SSH key has not been copied correctly into your CloudLab account.
  • Check the copied key carefully, and repeat the copy process if necessary (you do not need to generate a new key).
  • Click Terminate, then find your profile and instantiate it again.