Summary and Schedule

This is the curriculum for CSC 467, Big Data Engineering. The website is built with The Carpentries Workbench.

  • Instructor: Linh B. Ngo
  • Office: UNA 138
  • Office Hours: By appointment
  • Email: lngo AT wcupa DOT edu
  • Phone: 610-436-2595

Course Information


  • The course runs from August 26, 2024 until December 09, 2024. It is an in-person course.

Course Description


This course will investigate engineering approaches to solving challenges in data-intensive and big data computing problems. Course topics include distributed tools and parallel algorithms that help with acquiring, cleaning, and mining very large amounts of data.

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

For this class, we will be using Python notebooks to connect to a Spark server. This Spark server can be hosted locally or deployed on Kaggle, an equivalent of LeetCode for data scientists.

The instructions below first show the steps to set up a Spark environment on a local device, and then demonstrate the steps to deploy Spark on Kaggle.

Spark Setup: Locally hosted


Details

Setup instructions for different systems can be found in the dropdown menu below.

  • Make sure that your Windows is up to date prior to completing the remaining steps.
  • If you are running on Windows Home, make sure that you switch out of S mode.
  • The minimal configuration this setup has been tested on is:
    • Intel Celeron CPU N3550
    • 4GB memory
    • Windows 10 Home version 20H2

Windows Terminal

  • Make sure that your computer is updated.
  • I will assume that everyone’s computer is running Windows 10, as it is the standard OS on Windows-based laptops being sold nowadays.
  • Go to the Microsoft Store app, then search for and install Windows Terminal.
[Screenshot: Windows Terminal]

Install Java

  • It is most likely that you already have Java installed. To confirm this, first, open a terminal.
  • Run javac -version and java -version.

BASH

javac -version
java -version
  • You should see the version of Java displayed on screen. Your version might be different from mine.
  • Spark 3.5 supports Java 8, 11, and 17.
  • javac will similarly display its version information.

If you do not have both java and javac, you will need to install a Java Development Kit (JDK). We will be using the Java distribution maintained by OpenJDK.

  • Go to OpenJDK website.
  • Choose OpenJDK 8 (LTS) or OpenJDK 11 (LTS) for Version and HotSpot for JVM.
  • Click on the download link to begin downloading the installation package.
  • Run the installer. You can keep all default settings on the installer.
  • Once the installation process finishes, carry out the tests above again in another terminal to confirm that you have both java and javac.

Install Anaconda

  • Visit Anaconda’s download page and download the corresponding Anaconda installers.
  • You should select the 64-Bit variety for your installers.
  • Run the installation for Anaconda.
  • Remember the installation directory for Anaconda.
  • For Windows, this is typically under C:\Users\YOUR_WINDOWS_USERNAME\anaconda3 or C:\ProgramData\anaconda3.
  • Open a terminal and run:

BASH

conda init powershell
  • Restart the terminal

Download Spark

  • Download Spark 3.5.1
  • Untar and store the final directory somewhere that is easily accessible.
  • You might need to download and install 7-Zip to decompress .tgz file on Windows.
[Screenshot: Download Spark]
  • When decompressing, you might have to do it twice: the first decompression returns a .tar file, and a second decompression of that file retrieves the Spark directory.
  • Move the resulting directory under the C: drive.

Install libraries to support Hadoop functionalities

  • Open Windows Terminal, and create a hadoop directory and a bin subdirectory under the C: drive.

BASH

cd c:\
mkdir -p hadoop\bin
[Screenshot: terminal example]
  • Visit the link to winutils.exe, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
  • Visit the link to hadoop.dll, right-click on Download, and choose Save Link As.
  • Save the file to C:\hadoop\bin.
[Screenshot: download Windows support files for Hadoop]

Setup environment variables

  • Click on the Windows icon and start typing environment variables in the search box.
  • Click on Edit the system environment variables.
[Screenshot: access and modify System environment variables]
  • Click on Environment Variables.
  • Under User variables for ..., click New and enter the following variable name and value for each of the items below. Click OK when done.
    • Java
      • Variable name: JAVA_HOME.
      • Variable value: typically C:\Program Files\AdoptOpenJDK\jdk-11.0.11.9-hotspot (adjust to match the JDK version you installed).
    • Spark
      • Variable name: SPARK_HOME.
      • Variable value: C:\spark-3.5.1-bin-hadoop3.
    • Hadoop
      • Variable name: HADOOP_HOME.
      • Variable value: C:\hadoop.
    • Anaconda3
      • Variable name: ANACONDA_HOME.
      • Variable value: C:\Users\YOUR_WINDOWS_USERNAME\anaconda3.
[Screenshot: create environment variables for Windows]
  • In User variables for ..., select Path and click Edit. Next, press New and add each of the following executable paths to the list. Click OK when done.
    • Java: %JAVA_HOME%\bin
    • Spark: %SPARK_HOME%\bin
    • Hadoop: %HADOOP_HOME%\bin
    • Anaconda3: %ANACONDA_HOME%\Scripts
[Screenshot: listing environment variables]
  • Close your terminal and relaunch it. Test that all paths are set up correctly by running the following:

BASH

where.exe javac
where.exe spark-shell
where.exe winutils
[Screenshot: showing Windows paths]

Setup Jupyter and pyspark

  • Open a terminal and run the following:

BASH

pip install findspark
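
To quickly check that findspark can locate your Spark installation (this assumes SPARK_HOME was set as described above), you can run a short snippet in a Python session:

PYTHON

import findspark

# findspark uses SPARK_HOME to locate Spark and add pyspark to sys.path.
findspark.init()
print(findspark.find())  # prints the path of the detected Spark installation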

Test Jupyter and pyspark

BASH

jupyter notebook
  • Enter the Python code shown after this list into a cell of the new notebook.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to your downloaded copy of the Shakespeare text file.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running an Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once the run completes successfully, you can revisit your Jupyter Server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that signals a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.
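
A minimal sketch of the notebook cell is shown below. It mirrors the Kaggle cell later in this section, minus the Kaggle-specific environment setup, and writes its results to the output-word-count-1 directory mentioned above. The input file is the Shakespeare text from Project Gutenberg (http://www.gutenberg.org/files/100/100-0.txt); the path placeholder and the core/memory values are example settings for you to adjust.

PYTHON

import findspark
findspark.init()

import pyspark

# Example values; adjust to your machine (see the notes above).
number_cores = 2
memory_gb = 2

# Run Spark locally with number_cores workers and memory_gb GB of driver memory.
conf = (pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

# Word count over the downloaded Shakespeare text.
textFile = sc.textFile("PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE")
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("output-word-count-1")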
[Screenshots: resulting output files of Word Count and their contents]
  • You can also visit 127.0.0.1:4040/jobs to observe the running Spark cluster spawned by the Jupyter notebook:
[Screenshot: Web UI of the local Spark cluster]
  • Change the Spark page’s tab to Executors to observe the configuration of the cluster:
    • The cluster has 8 cores.
    • Only 8.4GB of the 16GB of memory is available; this is due to Spark’s memory storage reservation/protection.
[Screenshot: log files of Spark executors]

Install Anaconda

  • Visit Anaconda’s download page and download the corresponding Anaconda installers.
  • You should select the 64-Bit variety for your installers.
  • Run the installation for Anaconda.

Install Java

You can test whether Java is available by running:

BASH

java --version
javac --version

The commands should report a Java version of 8 or higher (note that Java 8 only accepts the single-dash -version flag). If they don’t, you can set up Java 11 as follows:

  • If you have an M1/M2 (Apple silicon) Mac machine:

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_aarch64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK tarballs unpack with a Contents/Home layout; single quotes keep the
# variables from being expanded before .zshrc is sourced.
echo 'export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home' >> ~/.zshrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
  • If you have an Intel-based Mac machine:

BASH

cd
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_x64_mac_hotspot_11.0.24_8.tar.gz
# macOS JDK tarballs unpack with a Contents/Home layout; single quotes keep the
# variables from being expanded before .zshrc is sourced.
echo 'export JAVA_HOME=$HOME/jdk-11.0.24+8/Contents/Home' >> ~/.zshrc
echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Install Spark

Run the following to download, extract, and set up Apache Spark:

BASH

# make sure the archive is extracted under your home directory
cd
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xzf spark-3.5.1-bin-hadoop3.tgz
echo 'export SPARK_HOME=$HOME/spark-3.5.1-bin-hadoop3' >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc

Test Jupyter and pyspark

  • Download the full text of Shakespeare from Project Gutenberg (if wget is not installed on your Mac, curl -O works as well):

BASH

wget http://www.gutenberg.org/files/100/100-0.txt
  • Launch Jupyter notebook

BASH

jupyter notebook
  • Enter the Python code into a cell of the new notebook; you can reuse the word-count sketch shown in the local setup instructions above.
  • Replace PATH_TO_DOWNLOADED_SHAKESPEARE_TEXT_FILE with the actual path (including the file name) to where you downloaded the file earlier.
  • Adjust the values for number_cores and memory_gb based on your computer’s configuration.
    • For example, on a computer running an Intel Celeron N3550 with 2 logical cores and 4GB of memory, number_cores is set to 2 and memory_gb is set to 2.
  • Run the cell.
  • Once the run completes successfully, you can revisit your Jupyter Server and observe that an output-word-count-1 directory has been created.
    • _SUCCESS is an empty file that signals a successful execution.
    • part-00000 and part-00001 contain the resulting outputs.

For Linux users, the setup process is similar to that of MacOS users. If your login shell is bash rather than zsh, replace .zshrc with .bashrc, as in the sketch below.
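
For instance, on a 64-bit Intel Linux machine using bash, the Java and Spark steps might look as follows. This is a sketch under stated assumptions: the JDK URL is the Linux build of the same Temurin release used in the Kaggle section below, and the exact paths may differ on your machine.

BASH

cd
# Java 11 (Temurin build for x64 Linux; unpacks to $HOME/jdk-11.0.24+8)
wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_x64_linux_hotspot_11.0.24_8.tar.gz
tar xzf OpenJDK11U-jdk_x64_linux_hotspot_11.0.24_8.tar.gz
echo 'export JAVA_HOME=$HOME/jdk-11.0.24+8' >> ~/.bashrc

# Spark 3.5.1
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar xzf spark-3.5.1-bin-hadoop3.tgz
echo 'export SPARK_HOME=$HOME/spark-3.5.1-bin-hadoop3' >> ~/.bashrc

echo 'export PATH=$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc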

Spark Setup: Online platform on Kaggle


Kaggle is an online platform for learning and practicing data science and engineering. You can think of it as LeetCode for data science.

  • First, you need to create an account on Kaggle and log in.
  • Select Code, then + New Notebook.
[Screenshot: launching a new notebook]
  • After the notebook is launched, on the right side of the browser, go to Session options and make sure that Internet is on.
[Screenshot: setting the Internet option to on]
  • Run the following commands in the first cell. This sets up Java inside the notebook environment.

BASH

!wget https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.24%2B8/OpenJDK11U-jdk_x64_linux_hotspot_11.0.24_8.tar.gz
!tar xzf OpenJDK11U-jdk_x64_linux_hotspot_11.0.24_8.tar.gz
!rm OpenJDK11U-jdk_x64_linux_hotspot_11.0.24_8.tar.gz
  • Run the following commands in the second cell. This sets up Spark.

BASH

!wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
!tar xzf spark-3.5.1-bin-hadoop3.tgz
!rm spark-3.5.1-bin-hadoop3.tgz
  • Run the following command in the third cell. This installs findspark.

BASH

!pip install findspark
  • Run the following Python code in the fourth cell of the notebook. This will initiate the Spark stand-alone cluster inside the Kaggle VM.

PYTHON

import os
import sys

os.environ["JAVA_HOME"] = "/kaggle/working/jdk-11.0.24+8/"
os.environ["SPARK_HOME"] = "/kaggle/working/spark-3.5.1-bin-hadoop3/"
spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9.7-src.zip")

import findspark
findspark.init()

import pyspark
number_cores = 8
memory_gb = 4

conf = (pyspark.SparkConf()
        .setMaster('local[{}]'.format(number_cores))
        .set('spark.driver.memory', '{}g'.format(memory_gb)))

sc = pyspark.SparkContext(conf=conf)
  • We will test Spark in the next cell with the following code:

PYTHON

textFile = sc.textFile("100-0.txt")
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("output-wordcount-01")
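
Optionally, you can sanity-check the counts directly in the notebook before looking at the output files. take and collect are standard RDD actions; the word "Hamlet" below is just an arbitrary example token:

PYTHON

# Peek at the first few (word, count) pairs.
print(wordcount.take(5))

# Look up the count for one specific token ("Hamlet" is an arbitrary example).
print(wordcount.filter(lambda pair: pair[0] == "Hamlet").collect())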
  • Note that /kaggle/working is the current working directory inside the Kaggle VM. If you expand the Output tab on the right side, you will see the contents of this directory. Open output-wordcount-01 and you will see the _SUCCESS file.
[Screenshot: output directory]

Callout

  • Kaggle’s environments are temporary. You will need to rerun the first three setup cells each time you return to work on Kaggle.