Setup

Last updated on 2024-07-26

For this class, we will be using Python notebooks to connect to a Spark server. This Spark server can be hosted locally or deployed on CloudLab, an NSF-funded experimental testbed for computing research and education.

The instructions below first walk through setting up a Spark environment on your local machine, then through deploying Spark on CloudLab.

Spark Setup: Locally hosted


Setup instructions for each operating system are given in the sections below.

Windows

  • Make sure that your Windows is up to date prior to completing the remaining steps.
  • If you are running on Windows Home, make sure that you switch out of S mode.
  • The minimal configuration this setup has been tested on is:
    • Intel Celeron CPU N3550
    • 4GB memory
    • Windows 10 Home version 20H2

Windows Terminal

  • Make sure that your computer is updated.
  • I will assume that everyone’s computer is running Windows 10, as it is the standard OS on Windows-based laptops sold nowadays.
  • Open the Microsoft Store app, then search for and install Windows Terminal.
[Figure: Windows Terminal in the Microsoft Store]

Install Java

  • It is most likely that you already have Java installed. To confirm this, first, open a terminal.
  • Run javac -version and java -version.

BASH

javac -version
java -version
  • You should see the version of Java displayed on screen; your version might differ from mine.
  • Spark 3.5 supports Java 8, 11, and 17.
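
For reference, the output should resemble the following (the exact version strings will differ from machine to machine; this sample is from a Temurin JDK 11 install):

OUTPUT

javac 11.0.24
openjdk version "11.0.24" 2024-07-16
OpenJDK Runtime Environment Temurin-11.0.24+8 (build 11.0.24+8)
OpenJDK 64-Bit Server VM Temurin-11.0.24+8 (build 11.0.24+8, mixed mode)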

If you do not have both java and javac, you will need to install a JDK (Java Development Kit). We will be using the Java distribution maintained by OpenJDK.

  • Go to OpenJDK website.
  • Choose OpenJDK 8 (LTS) or OpenJDK 11 (LTS) for Version and HotSpot for JVM.
  • Click on the download link to begin downloading the installation package.
  • Run the installer. You can keep all default settings.
  • Once the installation finishes, rerun the tests above in a new terminal to confirm that both java and javac are available.

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for your system.
  • Select the 64-bit installer.
  • Run the Anaconda installation.
  • Remember the installation directory for Anaconda.
  • For Windows, this is typically C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.
  • Open a terminal and run:

BASH

conda init powershell
  • Restart the terminal
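
To confirm that the initialization worked, you can check the conda version in the new terminal (any recent version number is fine):

BASH

conda --version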

Download Spark

  • Download Spark 3.5.1
  • Untar and store the final directory somewhere that is easily accessible.
  • You might need to download and install 7-Zip to decompress .tgz file on Windows.
[Figure: Spark download page]
  • When decompressing, you might have to do it twice, because the first decompression will return a .tar file, and the second decompression is needed to completely retrieve the Spark directory.
  • Move the resulting directory under the C: drive.

Install libraries to support Hadoop functionalities

  • Open Windows Terminal, and create a hadoop directory and a bin subdirectory under the C: drive.

BASH

cd c:\
mkdir -p hadoop\bin
[Figure: creating the hadoop\bin directory in the terminal]
  • Visit the link to winutils.exe, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
  • Visit the link to hadoop.dll, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
[Figure: downloading Windows support files for Hadoop]

Setup environment variables

  • Click on the Windows icon and start typing environment variables in the search box.
  • Click on Edit the system environment variables.
[Figure: accessing and modifying system environment variables]
  • Click on Environment Variables.
  • Under User variables for ..., click New and enter the following name/value pair for each of the items below. Click OK when done.
    • Java
      • Variable name: JAVA_HOME.
      • Variable value: Typically C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot.
    • Spark
      • Variable name: SPARK_HOME.
      • Variable value: C:\spark-3.5.1-bin-hadoop3.
    • Hadoop
      • Variable name: HADOOP_HOME.
      • Variable value: C:\hadoop.
    • Anaconda3
      • Variable name: ANACONDA_HOME.
      • Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3.
[Figure: creating environment variables for Windows]
  • In User variables for ..., select Path and click Edit. Then press New and add each of the items below to the list. Click OK when done.
    • Java: %JAVA_HOME%\bin
    • Spark: %SPARK_HOME%\bin
    • Hadoop: %HADOOP_HOME%\bin
    • Anaconda3: %ANACONDA_HOME%\Scripts
[Figure: listing environment variables]
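
If you prefer the command line, the first three variables can also be set with Windows’ setx command; adjust the paths to your actual install locations. The Path entries are still easiest to add through the GUI, since setx overwrites an existing value rather than appending to it.

BASH

setx JAVA_HOME "C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot"
setx SPARK_HOME "C:\spark-3.5.1-bin-hadoop3"
setx HADOOP_HOME "C:\hadoop"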
  • Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:

BASH

where.exe javac
where.exe spark-shell
where.exe winutils
[Figure: showing Windows paths]
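
If everything is set up, where.exe prints the full path of each executable. The output should look similar to the following (your paths will reflect your own install locations):

OUTPUT

C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot\bin\javac.exe
C:\spark-3.5.1-bin-hadoop3\bin\spark-shell
C:\spark-3.5.1-bin-hadoop3\bin\spark-shell.cmd
C:\hadoop\bin\winutils.exe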

Setup Jupyter and pyspark

  • Open a terminal and run the following:

BASH

pip install findspark

Test Jupyter and pyspark

  • Download the complete works of Shakespeare from http://www.gutenberg.org/files/100/100-0.txt and remember where you save it.
  • Launch Jupyter Notebook:

BASH

jupyter notebook
  • Enter the Python code sketched after this list into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once (if) the run completes successfully, you can revisit your Jupyter server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that marks a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.
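
The exact notebook cell is not reproduced on this page, so the following is a minimal sketch of a cell matching the description above. The names number_cores, memory_gb, PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE, and output-word-count-1 come from the instructions; the word-count logic itself is the standard PySpark example:

PYTHON

import findspark
findspark.init()  # locate the Spark installation via SPARK_HOME

from pyspark.sql import SparkSession

number_cores = 2  # adjust to your machine's logical core count
memory_gb = 2     # adjust to the memory you can spare for Spark

spark = (
    SparkSession.builder
    .master(f"local[{number_cores}]")
    .config("spark.driver.memory", f"{memory_gb}g")
    .appName("word-count")
    .getOrCreate()
)

# Count word occurrences in the downloaded Shakespeare text.
text = spark.sparkContext.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
counts = (
    text.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
)

# Fails if the output directory already exists; delete it between runs.
counts.saveAsTextFile("output-word-count-1")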
[Figure: resulting output files of Word Count]
[Figure: contents of resulting files]
  • You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook:
[Figure: Web UI of the local Spark cluster]
  • Change the Spark page’s tab to Executors to observe the configuration of the cluster:
    • The cluster has 8 cores.
    • Only 8.4GB of the 16GB of memory is shown as available; this is due to Spark’s memory storage reservation/protection.
[Figure: log files of Spark executors]
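
As a rough, back-of-the-envelope explanation (based on Spark’s documented unified memory model, not on anything specific to this setup): Spark reserves a fixed 300MB of the JVM heap and then exposes only spark.memory.fraction (default 0.6) of the remainder as storage/execution memory. The JVM itself reports a usable heap somewhat below the configured 16GB, so roughly (14.3GB − 0.3GB) × 0.6 ≈ 8.4GB shows up in the UI.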

MacOS

Install Anaconda

  • Visit Anaconda’s download page and download the Anaconda installer for your system.
  • Select the 64-bit installer.
  • Run the Anaconda installation.

Install Java

You can test whether Java is available by running:

BASH

java --version
javac --version

You should see a Java version of at least 8. (Note: Java 8 itself does not recognize the --version flag; if the commands above fail with an unrecognized-option error, try java -version instead.) If you don’t have a suitable Java, you can set up Java 11 as follows:

  • If you have an M1/M2 Mac machine

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK bundles keep their home directory under Contents/Home.
echo "export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home" >> ~/.zshrc
# Single quotes stop $JAVA_HOME from expanding before .zshrc is read.
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
  • If you have an Intel-based Mac machine.

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK bundles keep their home directory under Contents/Home.
echo "export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home" >> ~/.zshrc
# Single quotes stop $JAVA_HOME from expanding before .zshrc is read.
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Install Spark

Run the following to download, extract, and set up Apache Spark. (If wget is not available on your Mac, install it via Homebrew or replace wget with curl -LO.)

BASH

cd
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xzf spark-3.5.1-bin-hadoop3.tgz
echo "export SPARK_HOME=$HOME/spark-3.5.1-bin-hadoop3" >> ~/.zshrc
# Single quotes stop $SPARK_HOME from expanding before .zshrc is read.
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
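
To confirm the installation, you can print the Spark version; a banner reporting version 3.5.1 should appear:

BASH

spark-submit --version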

Test Jupyter and pyspark

  • Download the complete works of Shakespeare from Project Gutenberg:

BASH

wget http://www.gutenberg.org/files/100/100-0.txt
  • Launch Jupyter notebook

BASH

jupyter notebook
  • Enter the same Python code as in the Windows section above (see the sketch there) into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once (if) the run completes successfully, you can revisit your Jupyter server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that marks a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.
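
Assuming the notebook was launched from your home directory, you can also inspect the output from a terminal (the paths below are illustrative):

BASH

ls output-word-count-1
head output-word-count-1/part-00000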

Linux

For Linux users, the setup process is similar to that of MacOS users. Depending on your shell, you might replace .zshrc with .bashrc.

Spark Setup: Online platform on CloudLab


CloudLab is an NSF-funded experimental testbed for future computing research and education. CloudLab gives researchers control, down to the bare metal, over a diverse and distributed set of resources at large scale. As a result, it allows repeatable and scientific design of experiments. Some key characteristics of CloudLab include:

  • Sliceability: the ability to support virtualization while maintaining some degree of isolation for simultaneous experiments.
  • Deep programmability: the ability to influence the behavior of computing, storage, routing, and forwarding components deep inside the network, not just at or near the network edge.

Create a CloudLab account

  • Visit CloudLab’s website
  • Click Request an Account
  • Fill in the information as shown below and click Submit Request
    • Username: Use your WCUPA username. Make sure everything is lowercase.
    • Full Name: Your first and last name
    • Email: Use a non-WCUPA email. WCUPA email tends to block the confirmation email.
    • Select Country: United States
    • Select State: Pennsylvania
    • City: Malvern
    • Institutional Affiliation: West Chester University of Pennsylvania
    • Ignore the SSH Public Key file box for now.
    • Enter a password of your choice in Password and Confirm Password boxes.
    • Check Join Existing Project for Project Information.
    • Project Name: SecureEDU
  • Wait for a confirmation email to arrive at the email address you provided. You might have to resubmit a new request if you don’t see this email within about half an hour.
[Figure: joining a CloudLab project]
  • After your account is confirmed, the instructor will be able to see your application and can grant you access to CloudLab.
  • If you already had a CloudLab account, you can select Start/Join Project under your username, then select Join Existing Project and provide the name SecureEDU.
  • Launch your terminal (set up in the previous section) and run the following command:

Callout

  • Hit Enter for all questions.
  • Do not enter a password or change the default location of the files.

BASH

cd
ssh-keygen -t rsa
[Figure: generating an SSH key]
  • Run the following command to display the public key

BASH

cat ~/.ssh/id_rsa.pub
[Figure: example content of a public key]
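
A public key is a single line with roughly the following shape (this example is abbreviated and not a real key; yours will contain a much longer base64 string):

OUTPUT

ssh-rsa AAAAB3NzaC1yc2EAAADAQAB...EAAQ== yourname@yourcomputer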
  • Select and copy the key (just the text, with no extra spaces after the last character).
  • Log into CloudLab, click on your username (top right) and select Manage SSH Keys:
[Figure: managing SSH keys in CloudLab]
  • Paste the key into the Key box and click Add Key.

Setup GitHub repository

  • If you do not already have a GitHub account, go to the GitHub website and create one.
  • In your GitHub account, under Repositories, select New.
    • You can select any name for your repo.
    • It must be a public repository.
    • The Add a README file box must be checked.
    • Click Create repository when done.
  • In your new Git repository, click Add file and select Create new file
  • Type profile.py for the file name and enter the following content into the text editor.

PYTHON

import geni.portal as portal
import geni.rspec.pg as rspec

# Create a Request object to start building the RSpec.
request = portal.context.makeRequestRSpec()
# Create a XenVM
node = request.XenVM("node")
node.disk_image = "urn:publicid:IDN+emulab.net+image+emulab-ops:UBUNTU22-64-STD"
node.routable_control_ip = "true"

# Startup services: update the package index, install the Apache web
# server, and report its status when the node boots.
node.addService(rspec.Execute(shell="/bin/sh", command="sudo apt update"))
node.addService(rspec.Execute(shell="/bin/sh", command="sudo apt install -y apache2"))
node.addService(rspec.Execute(shell="/bin/sh", command="sudo systemctl status apache2"))

# Print the RSpec to the enclosing page.
portal.context.printRequestRSpec()
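For orientation: this profile requests a single Xen VM with a publicly routable control IP, running the Ubuntu 22.04 standard image, and registers three startup commands that update the package index, install the Apache web server, and report its status.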
  • Click Commit new file when done.
  • Log in to your CloudLab account, click Experiments on the top left, and select Create Experiment Profile.
[Figure: creating a new CloudLab experiment profile]
  • Click on Git Repo
  • Paste the URL of your previously created Git repo here and click Confirm
[Figure: copying the GitHub repository link]
  • Enter a name for your profile and put in a few words for the Description.
  • You will not have a drop-down list of Project.
  • Click Create when done.
[Figure: CloudLab profile]
  • Click Instantiate to launch an experiment from your profile.
[Figure: instantiating an experiment from a profile]
  • Select a Cluster from Wisconsin, Clemson, or Emulab, then click Next.
  • Do not do anything on the next Start on date/time screen. Click Finish.
[Figure: selecting a physical cluster before finalizing deployment]
  • Your experiment is now being provisioned.
[Figure: CloudLab provisioning the requested resources]
  • Once resources are provisioned, CloudLab will boot up your experiment.
[Figure: nodes of the experiment booting up]
  • When it is ready, you can use the provided SSH command to log in to your experiment (assuming your key was set up correctly).
  • The command is in the List View tab.
[Figure: SSH command to connect to a CloudLab experiment]
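
The command has roughly the following shape (the username and hostname below are placeholders; use the exact command CloudLab displays, which may also include a -p port number for a VM):

BASH

ssh YOUR_CLOUDLAB_USERNAME@pcXXX.emulab.net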

Callout

  • If you fail to connect to the ready experiment, it is most likely that your public SSH key has not been copied correctly into your CloudLab account.
  • Check the copied key carefully, and repeat the copy process if necessary (you do not need to generate a new key).
  • Click Terminate, then find your profile and instantiate it again.