Content from Syllabus
Last updated on 2024-07-26
This is the curriculum for CSC 467, Big Data Engineering. The website is built with The Carpentries Workbench.
- Instructor: Linh B. Ngo
- Office: UNA 138
- Office Hours: By appointment
- Email: lngo AT wcupa DOT edu
- Phone: 610-436-2595
Course Information
- The course runs from August 26, 2024 until December 09, 2024. It is an in-person course.
Required Materials:
- Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman.
- Author’s Textbook Download Page.
Course Description
This course investigates engineering approaches to solving challenges in data-intensive and big data computing problems. Course topics include distributed tools and parallel algorithms that help with acquiring, cleaning, and mining very large amounts of data.
Learning Objectives
Course Student Learning Outcomes (CSLO):
- Be able to set up and deploy appropriate data engineering technologies to manage big data sets.
- Be able to understand MapReduce, one of the key enabling programming concepts.
- Be able to implement key data mining techniques using Spark programming libraries.
BS in CS Program Objectives (CSPO):
- Be able to apply theory, techniques, and methodologies to create and/or maintain high quality computing systems that function effectively and reliably in the emerging and future information infrastructure.
ABET Objectives (APO):
- Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions (ABET 1).
- Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline (ABET 2).
Assessments and Grading:
Method of Evaluation
Assessment | % of Final Grade | Course Objectives Assessed | Program Objectives Assessed | ABET Objectives |
---|---|---|---|---|
Assignments | 40% | 1,2,3 | 1 | 1 |
Course Project | 40% | 1,2,3 | 1 | 2 |
Class Participation | 5% | 1 | 1 | 1 |
Final | 15% | 1,2,3 | 1 | 1 |
University Policies
Academic & Personal Integrity
It is the responsibility of each student to adhere to the university’s standards for academic integrity. Violations of academic integrity include any act that violates the rights of another student in academic work, that involves misrepresentation of your own work, or that disrupts the instruction of the course. Other violations include (but are not limited to): cheating on assignments or examinations; plagiarizing, which means copying any part of another’s work and/or using ideas of another and presenting them as one’s own without giving proper credit to the source; selling, purchasing, or exchanging of term papers; falsifying of information; and using your own work from one class to fulfill the assignment for another class without significant modification. Proof of academic misconduct can result in the automatic failure and removal from this course. For questions regarding Academic Integrity, the No-Grade Policy, Sexual Harassment, or the Student Code of Conduct, students are encouraged to refer to the Department Undergraduate Handbook, the Undergraduate Course Catalog, the Ram’s Eye View, or the University Website.
Students with Disabilities
If you have a disability that requires accommodations under the Americans with Disabilities Act (ADA), please present your letter of accommodations and meet with me as soon as possible so that I can support your success in an informed manner. Accommodations cannot be granted retroactively. If you would like to know more about West Chester University’s Services for Students with Disabilities (OSSD), please visit them at 223 Lawrence Center. Their phone number is 610-436-2564, their fax number is 610-436-2600, their email address is ossd@wcupa.edu, or visit the OSSD website. In an effort to assist students who either receive or may believe they are entitled to receive accommodations under the Americans with Disabilities Act and Section 504 of the Rehabilitation Act of 1973, the University has appointed a student advocate to be a contact for students who have questions regarding the provision of their accommodations or their right to accommodations. The advocate will assist any student who may have questions regarding these rights. The Director for Equity and Compliance/Title IX Coordinator has been designated in this role. Students who need assistance with their rights to accommodations should contact them at 610-436-2433.
The University’s Americans with Disabilities policy is available on the website. If you encounter an area of this course that is not accessible to you, please contact me.
Excused Absences Policy
Students are advised to carefully read and comply with the excused absences policy, including absences for university-sanctioned events, contained in the WCU Undergraduate Catalog. In particular, please note that the responsibility for meeting academic requirements rests with the student, that this policy does not excuse students from completing required academic work, and that professors can require a fair alternative to attendance on those days that students must be absent from class in order to participate in a University-Sanctioned Event.
Reporting Incidents of Sexual Violence
West Chester University and its faculty are committed to assuring a safe and productive educational environment for all students. In order to comply with the requirements of Title IX of the Education Amendments of 1972 and the University’s commitment to offering supportive measures in accordance with the new regulations issued under Title IX, the University requires faculty members to report incidents of sexual violence shared by students to the University’s Title IX Coordinator. The only exceptions to the faculty member’s reporting obligation are when incidents of sexual violence are communicated by a student during a classroom discussion, in a writing assignment for a class, or as part of a University-approved research project. Faculty members are obligated to report sexual violence or any other abuse of a student who was, or is, a child (a person under 18 years of age) when the abuse allegedly occurred to the person designated in the University Protection of Minors Policy. Information regarding the reporting of sexual violence and the resources that are available to victims of sexual violence is set forth at the webpage for the Office of Diversity, Equity, and Inclusion.
Inclusive Learning Environment and Anti-Racist Statement
Diversity, equity, and inclusion are central to West Chester University’s mission as reflected in our Mission Statement, Values Statement, Vision Statement and Strategic Plan: Pathways to Student Success. We disavow racism and all actions that silence, threaten, or degrade historically marginalized groups in the U.S. We acknowledge that all members of this learning community may experience harm stemming from forms of oppression including but not limited to classism, ableism, heterosexism, sexism, Islamophobia, anti-Semitism, and xenophobia, and recognize that these forms of oppression are compounded by racism.
Our core commitment as an institution of higher education shapes our expectation for behavior within this learning community, which represents diverse individual beliefs, backgrounds, and experiences. Courteous and respectful behavior, interactions, and responses are expected from all members of the University. We must work together to make this a safe and productive learning environment for everyone. Part of this work is recognizing how race and other aspects of who we are shape our beliefs and our experiences as individuals. It is not enough to condemn acts of racism. For real, sustainable change, we must stand together as a diverse coalition against racism and oppression of any form, anywhere, at any time.
Resources for education and action are available through WCU’s Office for Diversity, Equity, and Inclusion (ODEI), DEI committees within departments or colleges, the student ombudsperson, and centers on campus committed to doing this work (e.g., Dowdy Multicultural Center, Center for Women and Gender Equity, and the Center for Trans and Queer Advocacy).
Guidance on how to report incidents of discrimination and harassment is available at the University’s Office of Diversity, Equity and Inclusion.
Emergency Preparedness
All students are encouraged to sign up for the University’s free WCU ALERT service, which delivers official WCU emergency text messages directly to your cell phone. For more information, visit https://www.wcupa.edu/wcualert. To report an emergency, call the Department of Public Safety at 610-436-3311.
Electronic Mail Policy
It is expected that faculty, staff, and students activate and maintain regular access to University provided e-mail accounts. Official university communications, including those from your instructor, will be sent through your university e-mail account. You are responsible for accessing that mail to be sure to obtain official University communications. Failure to access will not exempt individuals from the responsibilities associated with this course.
Course Topics and Schedules
| Week | Topic | Assessments |
|---|---|---|
| | Introduction to Big Data | - |
| | MapReduce Programming Paradigm | - |
| | Data Parallel Computing Environment in Python | - |
| | Frequent Itemsets | - |
| | Locality-Sensitive Hashing | - |
| | Clustering | - |
| | Dimensionality Reduction | - |
| | Recommendation Systems | - |
| | Page Rank | - |
| | Decision Trees | - |
| | Data Parallel Computing with Spark | - |
Content from Introduction
Last updated on 2024-07-26
1. Terminologies
- Big data problems: not only processing power but also the size of the data is a limiting factor in finding a timely solution.
- Big Data (Industry, Social Sciences)
  - Input data carry the characteristics of Big Data (the 4 Vs).
  - The computational process is simple and straightforward, with minimal intermediate data being generated.
- Data Intensive Computing (Natural Sciences)
  - Input data may or may not be big data.
  - The computational process produces massive and complex intermediate data that needs to be analyzed during the process.
2. Big Data in Science (Data-enabled Science)

- Scientific process:
- empirical
- theoretical
- computational
- data-enabled/data-intensive science (the fourth pillar of scientific discovery).
- Big data analytics in science and engineering for data sets that are:
- Too big
- Too complex
- Too fast (streaming)
- Too noisy
- Too heterogeneous

3. Big Data in Industry
- Old statistics:
  - In 2008-2009:
    - Google processed 20PB a day.
    - Facebook had 2.5PB of user data and added 15TB/day.
    - eBay had 6.5PB of user data and added 50TB/day.
  - In 2010-2011:
    - Facebook had 400M users and 125PB of user data.
    - eBay had 10PB of user data in 2010 and expected to double this number in 2011.
  - In 2012-2013:
    - Facebook had 900M users.
    - Twitter had 400M tweets/day.
- We don't see these kinds of statistics collected anymore because big data has become the norm.
- Newer (and different) statistics:
  - 53% of companies are using big data analytics today.
  - Popular use cases include data warehouse optimization, customer/social analysis, predictive maintenance, clickstream analytics, fraud detection, and the Internet of Things.
  - Spark, MapReduce, and Yarn are among the most popular frameworks.
  - Medicine, Retail, Construction, Banking, and Transportation are new industries being defined by big data analytic capabilities.
- People don’t explicitly talk about Big Data anymore:
  - Gartner Hype Cycle 2014
  - Gartner Hype Cycle 2019
  - Gartner Hype Cycle 2020
4. The Vs of Big Data
- Used to be four: Volume, Velocity, Variety, Veracity.
- Other Vs are added over time.
- Volume: the size of the files used to archive and spread data.
- Velocity: the speed with which data is generated and processed.
- Variety: formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests.
- Veracity: the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias.
- Validity: the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context.
- Volatility: the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term.
- Value: the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances.
5. Programming paradigm for big data
- Multi-faceted challenges:
- Require not only parallel computation but also parallel data processing.
- New computational tools and strategies.
- New data intensive scalable architectures.
- Science is moving increasingly from hypothesis-driven to data-driven discoveries.
- Industry is at a stage where big data infrastructures are integrated and big data sets are beginning to be analyzed to produce business insights.
- Example general paradigm:

- It is difficult to write parallel programs:
  - Difficulty in converting algorithms from serial to parallel.
  - Difficulty in identifying the different ways that the program can fail.
  - No reliable way to detect failure of a process.
- It is even more difficult to write parallel programs at large scale:
  - The same set of errors, but scaled up with size.
- It is even more difficult to debug large-scale parallel programs:
  - What if the program doesn't fail but only produces incorrect results?
6. Data-intensive approach
- Scale “out”, not “up”
- It is easier and cheaper to add nodes to an existing cluster than to build a faster cluster.
- Move computation to the data
- Reduce data movement.
- Sequential processing, avoid random access
- Reduce seek movement on disks.
- Seamless scalability
{% include links.md %}
Content from MapReduce Programming Paradigm
Last updated on 2024-07-26
Overview
Questions
- What is the MapReduce programming paradigm, and why do we need it for big data?
Objectives
- Explain the map and reduce operations and how the MapReduce abstraction simplifies large-scale data processing.
1. The reality of working with big data
- Hundreds or thousands of machines to support big data.
- Distribute data for storage (not within the scope of this class).
- Parallelize data computation (MapReduce)
- Handle failure (MapReduce)
- The paper: MapReduce: simplified data processing on large clusters
2. The reality of working with big data (as outlined by the paper)
- Challenges:
- input data is usually large
- the computations have to be distributed across hundreds or thousands of machines
- finish in a reasonable amount of time.
- Addressing these challenges causes the original simple computation to become obscured by large amounts of complex supporting code.
- What MapReduce does:
- is a new abstraction
- expresses the simple core computations
- hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
- Why MapReduce?
  - Inspired by the `map` and `reduce` primitives present in Lisp and many other functional languages.
  - Most data computations involve:
    - applying a `map` operation to each logical record in the input to compute a set of intermediate key/value pairs, and then
    - applying a `reduce` operation to all the values that share the same key to combine the derived data appropriately.
3. MapReduce in a nutshell
- What is `map`? A function/procedure that is applied to every individual element of a collection/list/array/…

int square(x) { return x*x; }
map square [1,2,3,4] -> [1,4,9,16]

- What is `reduce`? A function/procedure that performs an operation on a list. This operation folds/reduces the list into a single value (or a smaller subset).

reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
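The same idea in plain Python (not Spark-specific), using the built-in `map` and `functools.reduce`:

```python
from functools import reduce

def square(x):
    return x * x

# map applies square to every element of the collection
print(list(map(square, [1, 2, 3, 4])))           # [1, 4, 9, 16]

# reduce folds the collection into a single value
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # sum -> 10
print(reduce(lambda a, b: a * b, [1, 2, 3, 4]))  # multiply -> 24
```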
4. Word Count: the "Hello, World" of big data
- We have a large amount of text …
- Could be stored in a single massive file.
- Could be stored in multiple files.
- We want to count the number of times each distinct word appears in the files
- Sample applications:
- Analyze web server logs to find popular URLs/keywords.
- Analyze security logs to find incidents.
- Standard parallel programming approach:
- Count number of files or set up seek distances within a file.
- Set number of processes
- Possibly setting up dynamic workload assignment
- A lot of data transfer
- Significant coding effort
5. Word Count: MapReduce workflow

6. Word Count: MapReduce workflow - what you actually do
- Input: a set of key-value pairs
- Programmer specifies two methods:
  - `Map(k, v) -> <(k', v')>`
    - Takes a key-value pair and outputs a set of intermediate key-value pairs.
    - E.g., the key is the filename and the value is a single line in the file.
    - There is one Map call for every (k, v) pair.
  - `Reduce(k', <v'>) -> <(k', v'')>`
    - All values `v'` with the same key `k'` are reduced together and processed in `v'` order.
    - There is one Reduce function call per unique key `k'`.
- MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication
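To make the division of labor concrete, here is a minimal single-machine sketch (plain Python; the names `map_fn` and `reduce_fn` are made up for illustration) of the two pieces the programmer writes for Word Count, with a tiny driver standing in for the framework's group-by-key step:

```python
from collections import defaultdict

def map_fn(key, value):
    """key: filename, value: one line of text -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """key: a word, values: all counts emitted for that word -> (word, total)."""
    return (key, sum(values))

# A tiny driver that mimics the framework's partitioning and group-by-key step.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)
for k, v in (pair for line in lines for pair in map_fn("file.txt", line)):
    groups[k].append(v)
print([reduce_fn(k, vs) for k, vs in groups.items()])
```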
7. Word Count: MapReduce workflow at scale


8. Applications well suited to the MapReduce programming paradigm
- Text tokenization, indexing, and search
- Web access log stats
- Inverted index construction
- Term-vector per host
- Distributed grep/sort
- Graph creation
- Web link-graph reversal (Google’s PageRank)
- Data Mining and machine learning
- Document clustering
- Machine learning
- Statistical machine translation
{% include links.md %}
Content from Spark Computing Environment
Last updated on 2024-07-26
Overview
Questions
- What is Spark, and how do we set up a local Spark computing environment?
Objectives
- Describe Spark's design and run a word count application in local mode.
Spark computing environment
1. What is Spark?
- A unified compute engine and a set of libraries for parallel data processing on computer clusters.

2. Design philosophy
- Unified: Spark supports a wide range of data analytic tasks over the same computing engine and a consistent set of APIs.
- Computing engine: Spark handles loading data from storage systems and performs computation on the data (in memory) rather than on permanent storage. To adhere to the data locality principle, Spark relies on APIs to provide a transparent common interface to different storage systems for all applications.
- Libraries: Via its APIs, Spark supports a wide array of internal and external libraries for complex data analytic tasks.
3. A brief history of Spark
- Research project at UC Berkeley AMP Lab in 2009 to address drawbacks of Hadoop MapReduce.
- Paper published in 2010: Spark: Cluster Computing with Working Sets
- The source code was contributed to Apache in 2013. The project had more than 100 contributors from more than 30 organizations outside UC Berkeley.
- Version 1.0 was released in 2014.
- Currently, Spark is used extensively in academia and industry (NASA, CERN, Uber, Netflix, …).
4. Spark applications
- A Spark application typically consists of a `driver` process and a set of `executor` processes.
- The `driver` runs the main function and is responsible for:
  - maintaining information about the Spark application,
  - responding to a user's program or input, and
  - analyzing, distributing, and scheduling work across the executors.
- The `executors` carry out the actual work assigned to them by the `driver`.

- Spark also has a local mode (what we are using for this class), where driver and executors are simply processes on the same machine.
- A Spark application developed in local mode can be carried over almost as-is to run in cluster mode (one of the attractive features of Spark).
- Spark supports the following language APIs: Scala, Java, Python, SQL (ANSI SQL 2003 standard), and R.
5. Hands-on: Word Count in Spark
- Open a terminal.
- Activate the `pyspark` conda environment, then launch Jupyter notebook:

$ conda activate pyspark
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-1`.

- Enter the following Python code into the first cell of `spark-1`, then run the cell.
  - This code sets up the paths to Spark's libraries and executables.
import os
import sys
spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9-src.zip")
import findspark
findspark.init()
import pyspark

- Enter the following Python code into the next cell, adjust `number_cores` and `memory_gb` based on your computer's hardware, then run the cell.
  - This code sets up the configuration for the supporting Spark cluster (in local mode).
  - `number_cores` should be one less than the total number of cores on your computer.
  - `memory_gb` should be half the amount of memory that you have, or at least 3GB less than that amount.
number_cores = 8
memory_gb = 16
conf = (pyspark.SparkConf().setMaster('local[{}]'.format(number_cores)).set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

- At this point, if you visit `127.0.0.1:4040` in another browser tab, you should also see the local Spark cluster up and running (with no job).

- Enter the following Python code into the third cell.
- Pay attention to the strings inside `sc.textFile("A_STRING")` and `wordcount.saveAsTextFile("ANOTHER_STRING")`.
  - These strings represent the path to where the original data is located (`sc.textFile`) and where the output directory is to be saved (`wordcount.saveAsTextFile`).
  - Take your time here to organize the directories properly.
textFile = sc.textFile("../data/shakespeare/shakespeare-complete.txt")
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("../data/output/output-wordcount-01")
- A successful run will generate an output directory containing a `_SUCCESS` flag file (size 0).

6. Data distribution in Spark
- Data in Spark are managed as a distributed collection: when running on a cluster, parts of the data are distributed across different machines and are manipulated by different executors.
- To allow executors to perform work in parallel, Spark breaks up the data into chunks called `partitions`.
- Run the following code in a cell of `spark-1`:

textFile.getNumPartitions()

7. Transformation
In Spark, the core data structures are immutable, meaning they cannot be changed after creation. To "change" a data collection means to create a new data collection that is a transformation of the old one.

- There are two types of transformations:
  - Narrow dependencies (1-to-1 transformation).
  - Wide dependencies (1-to-N transformation).
We can make a copy of our `textFile` that is distributed across more partitions. This is still a narrow-dependency transformation. Run the following code in a cell of `spark-1`:

textFile_4 = textFile.repartition(4)
textFile_4.getNumPartitions()

- Run the following code in a cell of `spark-1`:
wordcount = textFile_4.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("../data/output/output-wordcount-02")

- There are now four resulting output files:

8. Lazy evaluation and actions
- Transformations are only logical plans.
- Spark will wait until the very last moment to execute the graph of computation instructions (the logical plan).
- To trigger the computation, we run an action.
- List of common Spark actions:
wordcount = textFile_4.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
print(wordcount)

- There are three kinds of actions:
  - Actions to view data in the console (e.g., `take`).
  - Actions to collect data into native objects (e.g., `collect`).
  - Actions to write to output data sources (e.g., `saveAsTextFile`).
wordcount.take(10)
local_words = wordcount.collect()
print(local_words[1:10])

9. Hands on: Word Count workflow breakdown
- Run each of the following segments of Python code in a separate cell.
- The RDD of the text file has a single item per line of the file.
textFile_4.take(10)
- `flatMap` breaks lines into lists of words and concatenates these lists into a single new RDD.
wordcount_1 = textFile_4.flatMap(lambda line: line.split(" "))
wordcount_1.take(10)
- Each item in this new RDD is turned into a (key/value) pair, where the key is a word and the value is `1`.
wordcount_2 = wordcount_1.map(lambda word: (word, 1))
wordcount_2.take(10)
- All pairs with the same key are reduced together using pairwise summation to get the final count for each key.
wordcount_3 = wordcount_2.reduceByKey(lambda a, b: a + b)
wordcount_3.take(10)

Challenge 1
- Augment the mapping process of WordCount with a function that removes punctuation and capitalization from the words.
- Hint: The string module is helpful for removing punctuation.
## Solution
import string
translator = str.maketrans('', '', string.punctuation)
wordcount_enhanced = textFile.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word.translate(translator).lower(), 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount_enhanced.take(100)
{: .solution} {: .challenge}
Challenge 2
- Look up the Spark Python API for filter.
- Augment the results from Challenge 1 to remove the empty strings (`''`).
## Solution
wordcount_filtered = wordcount_enhanced.filter(lambda x: x[0] != '')
wordcount_filtered.take(100)
{: .solution} {: .challenge}
{% include links.md %}
Content from Data Parallel Computing with Spark
Last updated on 2024-07-26
Overview
Questions
- How do we use Spark to analyze a real data set?
Objectives
- Set up a Spark application and perform data analytics on a movie data set.
Data parallel computing with Spark
Hands-on: Data analytics in Spark
- Download the Movie Dataset.
- Unzip the movie data file.
- Open a terminal.
- Activate the `pyspark` conda environment, then launch Jupyter notebook:

$ conda activate pyspark
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-2`.
- Copy the code from `spark-1` to set up and launch a Spark application.
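As a starting point, here is a hedged sketch of loading the ratings into an RDD. The file name `ratings.csv` and its `userId,movieId,rating,timestamp` layout are assumptions based on the common MovieLens format; adjust the path and column positions to match the file you actually unzipped. It assumes the `sc` SparkContext created by the setup cells copied from `spark-1`.

```python
# Assumption: the unzipped data contains a MovieLens-style ratings.csv.
ratings_raw = sc.textFile("../data/movies/ratings.csv")
header = ratings_raw.first()
ratings = ratings_raw.filter(lambda line: line != header) \
                     .map(lambda line: line.split(",")) \
                     .map(lambda f: (int(f[0]), int(f[1]), float(f[2])))

# Example analysis: average rating per movie.
avg_by_movie = ratings.map(lambda r: (r[1], (r[2], 1))) \
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                      .mapValues(lambda s: s[0] / s[1])
print(avg_by_movie.take(5))
```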
{% include links.md %}
Content from Page Rank
Last updated on 2024-07-26
Overview
Questions
- How can we rank web pages by importance?
Objectives
- Explain the PageRank formulation and implement it with power iteration in Spark.
Page Rank
1. Web page organization in the past
- Web pages were manually curated and organized.
- Does not scale.

- Next, web search engines were developed (information retrieval).
  - Information retrieval focuses on finding documents within a trusted set.

- Challenges for web search:
  - Who to trust given the massive variety of information sources?
    - Trustworthy pages may point to each other.
  - What is the best answer to a keyword search (given a context)?
    - Pages using the keyword in the same contexts tend to point to each other.
  - All pages are not equally important.
- Web pages/sites on the Internet can be represented as a graph:
- Web pages/sites are nodes on the graph.
- Links from one page to another represent the incoming/outgoing connections between nodes.
- We can compute the importance of nodes in this graph based on the distribution/intensity of these links.
2. PageRank: initial formulation
- Links as votes:
  - A page (node) is more important if it has more links.
    - Incoming or outgoing?
    - Think of incoming links as votes.
  - Are all incoming links equal?
    - Links from important pages count more.

- Each link’s vote is proportional to the importance of its source page.
- If page j with importance rj has n outgoing links, each outgoing link has rj/n votes.
- The importance of page j is the sum of the votes on its incoming links.

3. PageRank: the flow model
- Summary:
  - A vote from an important page is worth more.
  - A page is important if it is linked to by other important pages.

- Flow equations:
- ry = ry/2 + ra/2
- ra = ry/2 + rm
- rm = ra/2
- General equation for calculating the rank rj of page j: rj = Σ(i→j) ri/di, where di is the number of outgoing links of page i.

- Three equations (only two of which are independent), three unknowns.
  - No unique solution.
- Additional constraint to force uniqueness:
- ry + ra + rm = 1
- Gaussian elimination:
- ry = 2/5, ra = 2/5, rm = 1/5
- Does not scale to Internet-size!
4. PageRank: matrix formulation
- Set up the flow equations as a stochastic adjacency matrix M.
- M is an N x N matrix, where N is the number of nodes in the graph (web pages on the Internet).
  - Let page i have di outgoing links.
  - If there is an outgoing link from page i to page j, then Mji = 1/di; otherwise Mji = 0.
  - Stochastic: the values in each column of M sum to 1.
  - Adjacency: a non-zero value indicates the presence of a link.
- Rank vector r: a vector with one entry per page.
  - The order of pages in the rank vector should be the same as the order of pages in the rows and columns of M.
- Flow equations:
- ry = ry/2 + ra/2
- ra = ry/2 + rm
- rm = ra/2
- General equation for calculating the rank rj of page j with regard to all pages: rj = Σi Mji · ri.

- This can be rewritten in matrix form: r = M · r.

- Visualization:
  - Suppose page i has importance ri and has outgoing links to three other pages, including page j.

- Final perspective:
- Also, this is why we need to study advanced math …

5. PageRank: power iteration
- We want to find the page rank vector r.
- Power method: an iterative scheme.
  - Suppose there are N web pages on the Internet.
  - All web pages start out with the same importance: 1/N.
  - Rank vector r: r(0) = [1/N, …, 1/N]^T
  - Iterate: r(t+1) = M · r(t)
  - Stopping condition: |r(t+1) - r(t)| < some small positive error threshold e.
- Example: a small numerical sketch of this iteration is shown below.
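A minimal sketch of power iteration on the three-page graph (y, a, m) from the flow model above, assuming numpy is available; it should converge to approximately r = [2/5, 2/5, 1/5].

```python
import numpy as np

# Column-stochastic matrix M for the graph y -> {y, a}, a -> {y, m}, m -> {a}.
# Columns (and rows) are ordered y, a, m; M[j][i] = 1/di if there is a link i -> j.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

N = 3
r = np.full(N, 1.0 / N)                    # r(0) = [1/N, ..., 1/N]
for _ in range(100):
    r_new = M @ r                          # r(t+1) = M . r(t)
    done = np.abs(r_new - r).sum() < 1e-10 # stopping condition
    r = r_new
    if done:
        break
print(r)                                    # ~ [0.4, 0.4, 0.2]
```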
6. PageRank: the Google formulation
- The three questions
- Does the above equation converge?
- Does it converge to what we want?
- Are the results reasonable?
## Challenge 6.1: Does it converge? The spider trap problem

- Build the stochastic adjacency matrix for the following graph and calculate the ranking for `a` and `b`.

Solution:
{: .solution} {: .challenge}
## Challenge 6.2: Does it converge to what we want? The dead end problem

- Build the stochastic adjacency matrix for the following graph and calculate the ranking for `a` and `b`.

Solution:
{: .solution} {: .challenge}
- The solution: random teleport.
  - At each time step, the random surfer has two options:
    - A probability of β to follow an outgoing link at random.
    - A probability of 1 - β to jump to some random page.
  - The value of β is typically between 0.8 and 0.9.
  - The surfer will teleport out of a spider trap within a few time steps.
  - The surfer will definitely teleport out of a dead end.

- Final equation for PageRank: rj = Σ(i→j) β ri/di + (1 - β)/N.

7. Hands-on: Page Rank in Spark
- Create a file called `small_graph.dat` with the following contents:
y y
y a
a y
a m
m a
- This data file describes the graph shown in the earlier slides.
- Open a terminal.
- Activate the `pyspark` conda environment, install `psutil`, then launch Jupyter notebook:

$ conda activate pyspark
$ conda install -y psutil
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-3`.
- Copy the code from `spark-1` to set up and launch a Spark application.
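No further code is given for this step, so here is a hedged sketch adapted from the well-known Spark PageRank example. It assumes the `sc` SparkContext from the setup cells and the `small_graph.dat` file created above (adjust the path if you saved it elsewhere), and it uses plain power iteration without the teleport term.

```python
# Parse "src dst" pairs and group the outgoing links per page.
links = sc.textFile("small_graph.dat") \
          .map(lambda line: tuple(line.split())) \
          .distinct() \
          .groupByKey() \
          .cache()

N = links.count()
ranks = links.map(lambda page: (page[0], 1.0 / N))   # start every page at 1/N

for _ in range(20):                                   # fixed number of iterations
    # Each page splits its current rank evenly over its outgoing links.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b)

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
```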
8. Hands-on: Page Rank in Spark - Hollins dataset
- Download the Hollins dataset
- Hollins University webbot crawl in 2004.
- Which page is the most important (internally)?
{% include links.md %}
Content from Locality Sensitive Hashing
Last updated on 2024-07-26
Overview
Questions
- How can we efficiently find similar items in a very large data set?
Objectives
- Explain shingling, minhashing, and locality-sensitive hashing.
Locality Sensitive Hashing
1. Applications of set-similarity
Many data mining problems can be expressed as finding similar sets:
- Pages with similar words, e.g., for classification by topic.
- NetFlix users with similar tastes in movies, for recommendation systems.
- Dual: movies with similar sets of fans.
- Entity resolution.
2. Application: similar documents
Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:
- Mirror sites, or approximate mirrors.
- Application: Don’t want to show both in a search.
- Plagiarism, including large quotations.
- Similar news articles at many news sites.
  - Application: Cluster articles by the same story: topic modeling, trend identification.
3. Three essential techniques for similar documents
- Shingling : convert documents, emails, etc., to sets.
- Minhashing : convert large sets to short signatures, while preserving similarity.
- Locality sensitive hashing : focus on pairs of signatures likely to be similar.

4. Shingles
- A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
- Example: k = 2; doc = abcab. Set of 2-shingles: {ab, bc, ca}.
- Represent a doc by its set of k-shingles.
- Documents that are intuitively similar will have many shingles in common.
- Changing a word only affects the k-shingles within distance k from the word.
- Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
- Example: k = 3, `The dog which chased the cat` versus `The dog that chased the cat`.
  - The only 3-shingles replaced are `g_w`, `_wh`, `whi`, `hic`, `ich`, `ch_`, and `h_c`.
5. Minhashing: Jaccard Similarity
- The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.

- Convert from sets to boolean matrices.
- Rows: elements of the universal set. In other words, all elements in the union.
- Columns: individual sets.
  - A cell value of `1` in row e and column S if and only if e is a member of S.
  - A cell value of `0` otherwise.
  - Column similarity is the Jaccard similarity of the sets of rows in which the columns have the value `1`.
  - Typically sparse.
- This gives you another way to calculate similarity: column similarity = Jaccard similarity.

- Generally speaking, given two columns, rows may be classified as:
- a: 1 1
- b: 1 0
- c: 0 1
- d: 0 0
- Sim(C1, C2) = a/(a+b+c)
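A small Python check of this row-type counting (a, b, c, d refer to the row classes just listed); the sets and row names here are made up for illustration:

```python
def jaccard(c1, c2):
    """Jaccard similarity of two sets."""
    return len(c1 & c2) / len(c1 | c2)

# Columns represented as the sets of rows where they have a 1.
C1 = {"r1", "r2", "r5"}
C2 = {"r1", "r3", "r5"}
print(jaccard(C1, C2))   # a=2 (rows r1, r5), b=1, c=1 -> 2/4 = 0.5
```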
6. Minhashing
- Imagine the rows permuted randomly.
- Define the minhash function h(C) = the number of the first (in the permuted order) row in which column C has a 1.
- Use several (e.g., 100) independent hash functions to create a signature for each column.
- The signatures can be displayed in another matrix called the signature matrix, in which
  - the columns represent the sets, and
  - the rows represent the minhash values, in order, for each column.

7. Minhashing: surprising property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
- Both are a/(a+b+c)!
- The similarity of signatures is the fraction of the minhash functions in which they agree.
- The expected similarity of two signatures equals the Jaccard similarity of the columns.
  - The longer the signatures, the smaller the expected error will be.

8. Hands-on Minhashing: implementation
- We can't realistically permute billions of rows:
  - Too many permutation entries to store.
  - Random access on big data (a big no-no).
- How do we calculate the hashes in sequential order?
  - Pick approximately 100 hash functions (standing in for 100 random permutations).
  - Go through the data set row by row.
  - For each row r and each hash function i, maintain a variable M(i, c) that holds the smallest value of hi(r) seen so far over the rows r in which column c has a 1 (see the sketch below).
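A minimal sketch of this row-by-row update in plain Python, using hash functions of the form (a·r + b) mod p as stand-ins for random permutations (the function and variable names are made up for illustration):

```python
import random

def minhash_signatures(matrix, num_hashes=100, prime=10007):
    """matrix[r] is the set of column indices that have a 1 in row r."""
    num_rows = len(matrix)
    num_cols = 1 + max(c for row in matrix for c in row)
    hashes = [(random.randrange(1, prime), random.randrange(0, prime))
              for _ in range(num_hashes)]
    # M[i][c] = smallest h_i(r) seen over the rows r where column c has a 1.
    M = [[float("inf")] * num_cols for _ in range(num_hashes)]
    for r in range(num_rows):
        hr = [(a * r + b) % prime for a, b in hashes]
        for c in matrix[r]:
            for i in range(num_hashes):
                if hr[i] < M[i][c]:
                    M[i][c] = hr[i]
    return M

# Four rows, two columns: column 0 has 1s in rows {0, 2}, column 1 in rows {1, 2, 3}.
sig = minhash_signatures([{0}, {1}, {0, 1}, {1}], num_hashes=4)
print(sig)
```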

9. Minhashing:
- Create three additional row permutations, update the signature matrix, and recalculate the signature similarity.
- Does the signature similarity become closer to the column similarity?
10. Locality-sensitive-hashing (LSH)
- Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
- For signature matrices: hash columns into many buckets and make elements of the same bucket candidate pairs.
- Pick a similarity threshold `t`, a fraction < 1.
- We want a pair of columns `c` and `d` of the signature matrix M to be a candidate pair if and only if their signatures agree in at least a fraction `t` of the rows.
11. LSH
- Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
- Reality: we don’t need to study the entire column.

- Divide matrix M into `b` bands of `r` rows each.
- For each band, hash its portion of each column to a hash table with `k` buckets, with `k` as large as possible.
- Candidate column pairs are those that hash to the same bucket for at least one band.
- Fine-tune `b` and `r`.
- We will not go into the math here; a small sketch of the banding step follows.
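To make the banding idea concrete, here is a small plain-Python sketch (the helper name `lsh_candidates` and the toy signatures are made up for illustration) that hashes each band's portion of every signature and reports columns that share a bucket in at least one band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """signatures[c] is the minhash signature (length bands*rows) of column c."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for c, sig in enumerate(signatures):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[hash(chunk)].append(c)        # hash this band's portion
        for cols in buckets.values():             # same bucket => candidate pair
            candidates.update(combinations(sorted(cols), 2))
    return candidates

sigs = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 9, 9, 9], [7, 8, 0, 4, 5, 6]]
print(lsh_candidates(sigs, bands=2, rows=3))      # {(0, 1), (0, 2)}
```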
12. Hands on LSH
- Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/inaugural_speeches.zip.
- Launch a Spark notebook called `spark-6.ipynb` and create the two initial setup cells.
{% include links.md %}
Content from Frequent Itemsets
Last updated on 2024-07-26
Overview
Questions
- How do we find itemsets that frequently appear together in large data sets?
Objectives
- Explain the market-basket model, association rules, and the A-Priori and SON algorithms.
Frequent Itemsets
1. The market-basket model
- Describes a common form of many-to-many relationship between two kinds of objects:
  - A large set of items, e.g., things sold in a supermarket.
  - A large set of baskets, each of which is a small set of items, e.g., the things one customer buys on one trip to the supermarket.
    - The number of baskets cannot fit into memory.
2. Definition of a frequent itemset
- A set of items that appears in many baskets is said to be frequent.
- Assume a value `s`: the support threshold.
- If `I` is a set of items:
  - The support for `I` is the number of baskets in which `I` is a subset.
  - `I` is frequent if its support is `s` or higher.
3. Example: frequent itemsets
- items = {milk, coke, pepsi, beer, juice}
- B1: m,c,b
- B2: m,p,j
- B3: m,b
- B4: c,j
- B5: m,p,b
- B6: m,c,b,j
- B7: c,b,j
- B8: b,c
- Support threshold `s` = 3 (three baskets).
- Frequent itemsets:
  - {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
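A quick Python check of this example, counting the support of every single item and pair with s = 3:

```python
from itertools import combinations
from collections import Counter

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 3

# Count the support of every single item and every pair.
counts = Counter()
for basket in baskets:
    for k in (1, 2):
        for itemset in combinations(sorted(basket), k):
            counts[itemset] += 1

frequent = {itemset for itemset, c in counts.items() if c >= s}
print(sorted(frequent, key=len))   # singletons m, c, b, j and pairs {m,b}, {b,c}, {c,j}
```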
4. Applications
- Items: products; Baskets: sets of products.
- Given that many people buy beer and diapers together: run a sale on diapers and raise the price of beer.
- Given that many people buy hot dogs and mustard together: run a sale on hot dogs and raise the price of mustard.
- Items = documents; baskets = sentences/phrases.
- Items that appear together too often could represent plagiarism.
- Items = words, basket = documents.
- Unusual words appearing together in a large number of documents may indicate an interesting relationship.
5. Scale of the problem
- Walmart sells hundreds of thousands of items, and has billions of transactions (shopping basket/cart at checkout).
- The Web has billions of words and many billions of pages.
6. Association Rules:
- If-then rules about the contents of baskets.
- `{i_1, i_2, ..., i_k} -> j` means: "If a basket contains all of i_1, …, i_k, then it is likely to contain j."
- The confidence of this association rule is the probability of j given {i_1, …, i_k}.
  - The fraction of the baskets containing {i_1, …, i_k} that also contain j.
- Example:
- B1: m,c,b
- B2: m,p,j
- B3: m,b
- B4: c,j
- B5: m,p,b
- B6: m,c,b,j
- B7: c,b,j
- B8: b,c
- An association rule: {m,b} -> c
- Basket contains m and b: B1, B3, B5, B6
- Basket contains m, b, and c: B1, B6
- Confidence = 2 / 4 = 50%
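The same calculation in a couple of lines of Python:

```python
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

# Confidence of {m, b} -> c = support({m, b, c}) / support({m, b}).
has_mb = [b for b in baskets if {"m", "b"} <= b]
has_mbc = [b for b in has_mb if "c" in b]
print(len(has_mbc) / len(has_mb))   # 2 / 4 = 0.5
```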
7. Finding association rules
- Find all association rules with support >= s and confidence >= c.
  - The hard part: finding the frequent itemsets.
8. Computation model
- Data is stored in flat files on disk.
- Most likely basket-by-basket.
- Expand baskets into pairs, triples, etc. as you read the baskets.
  - Use `k` nested loops to generate all sets of size `k`.
- The I/O cost is measured in passes (all baskets are read in each pass).
9. Main-memory bottleneck
- For many frequent-itemset algorithms, main memory is the critical resource.
- We need to keep count of things (occurrences of pairs/triples/…) when we read baskets.
- The number of different things we can count is limited by main memory.
- Swapping counts is going to be horrible.
10. Naive algorithm
- The hardest problem is finding frequent pairs, because they are the most common.
- Read the file once, counting in main memory the occurrences of each pair.
- For each basket of `n` items, there are `n(n-1)/2` pairs, generated by a double-nested loop.
- If the number of possible pairs (roughly `n^2` for `n` distinct items) exceeds main memory, the approach fails.
11. Naive algorithm: how do we count?
- Count all pairs using a triangular matrix.
  - Requires 4 bytes per pair for all possible pairs: about 2n^2 bytes.
- Or keep a table of triples [i, j, c], where c is the count of the pair {i, j}.
  - Requires 12 bytes, but only for pairs with count > 0: 12p bytes, where p is the number of pairs that actually occur.
12. A-Priori algorithm.
- Limit the need for main memory.
- Key idea: monotonicity
  - If a set of items appears at least `s` times, so does every subset of this set.
- Contrapositive: if an item `i` does not appear in `s` baskets, then no pair containing `i` can appear in `s` baskets.
can appear in s baskets. - A-Priori algorithm:
Pass 1: read baskets and count the item occurrences. Keep only the items that appear at least `s` times - the frequent items.

Pass 2: read the baskets again and count in main memory only those pairs whose items were both found to be frequent in Pass 1.

Repeat the process, growing the itemset size and keeping only sets built from itemsets already found to be frequent.

C1 = all items.

In general, L_k are the members of C_k with support greater than or equal to `s`.

C_(k+1) includes the (k+1)-item sets every k-item subset of which is in L_k.
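A minimal in-memory sketch of the first two passes (items, then pairs) on the toy baskets from the earlier example:

```python
from itertools import combinations
from collections import Counter

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 3

# Pass 1: count single items, keep the frequent ones (L1).
item_counts = Counter(item for basket in baskets for item in basket)
L1 = {item for item, c in item_counts.items() if c >= s}

# Pass 2: count only pairs whose items are both in L1 (candidates C2).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & L1), 2):
        pair_counts[pair] += 1
L2 = {pair for pair, c in pair_counts.items() if c >= s}
print(L1, L2)
```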
13. A-Priori at scale
- Takes only two passes over the data.
- SON: Savasere, Omiecinski, and Navathe.
  - Adaptable to a distributed data model (MapReduce).
  - Repeatedly read small subsets of the baskets into main memory and run A-Priori on each subset, using a support threshold scaled down by the number of subsets (i.e., proportional to the fraction of baskets in each subset).
  - Aggregate all candidate itemsets and determine which are frequent in the entire data set.
14. Hands-on: SON on Spark
- Download the small movie dataset
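No code is provided for this hands-on, so here is a hedged two-pass SON sketch in PySpark for items and pairs only. It assumes the `sc` SparkContext from the setup cells and an RDD named `baskets` of Python sets (for example, the set of movies each user rated); building that RDD from the movie data set is left to the exercise.

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, fraction_support):
    """Run a simple in-memory pass (items and pairs) on one chunk of baskets."""
    chunk = list(partition)
    threshold = fraction_support * len(chunk)
    counts = Counter()
    for basket in chunk:
        for k in (1, 2):
            counts.update(combinations(sorted(basket), k))
    return [itemset for itemset, c in counts.items() if c >= threshold]

s = 0.05                      # support as a fraction of all baskets (assumption)
total = baskets.count()

# Phase 1: candidates = union of locally frequent itemsets per partition.
candidates = baskets.mapPartitions(lambda p: local_frequent(p, s)) \
                    .distinct().collect()
bc = sc.broadcast(set(candidates))

# Phase 2: count each candidate over the full data and keep the truly frequent.
frequent = baskets.flatMap(
        lambda basket: [(c, 1) for c in bc.value if set(c) <= basket]) \
    .reduceByKey(lambda a, b: a + b) \
    .filter(lambda kv: kv[1] >= s * total)
print(frequent.take(10))
```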
15. Hands-on: Processing XML data
- Download the NSF Awards Data between 2011 and 2021.
- Review the XML Schema
- What is an interesting question?
{% include links.md %}
Content from Clustering
Last updated on 2024-07-26
Overview
Questions
- How do we group similar data points at scale?
Objectives
- Explain distance measures and hierarchical and k-means clustering.
Clustering
1. General problem statement:
- Given a set of data points, with a notion of distance between points, group the points into some number of clusters so that:
- Members of a cluster are close/similar to each other.
- Members of different clusters are dissimilar.
- Usually
- Points are in high-dimensional space (observations have many attributes).
- Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance …
2. Clustering is a hard problem
- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- In most cases, looks are not deceiving.
- But:
- Many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different.
3. Example:
- Clustering Sky Objects:
- A catalog of 2 billion sky objects represents objects by their radiation in 7 dimensions (frequency bands)
- Problem: cluster into similar objects, e.g., galaxies, stars, quasars, etc.
- Clustering music albums
  - Music divides into categories, and customers prefer a few categories.
    - But are categories simply genres?
  - Similar albums have similar sets of customers, and vice versa.
- Clustering documents
  - Group together documents on the same topic.
  - Documents with similar sets of words may be about the same topic.
  - Dual formulation: a topic is a group of words that co-occur in many documents.
4. Cosine, Jaccard, Euclidean
- Different ways of representing documents or music albums lead to different distance measures.
- Document as a set of words:
  - Jaccard distance.
- Document as a point in a space of words:
  - x_i = 1 if word `i` appears in the doc.
  - Euclidean distance.
- Document as a vector in a space of words:
  - Vector from the origin to …
  - Cosine distance.
5. Overview: methods of clustering
- Hierarchical:
- Agglomerative (bottom up): each point is a cluster, repeatedly combining two nearest cluster.
- Divisive (top down): start with one cluster and recursively split it.
- Point assignment:
- Maintain a set of clusters
  - Points belong to the nearest cluster.
6. Point assignment: K-means clustering
- Assumes a Euclidean space/distance.
- Pick `k`, the number of clusters.
- Initialize the clusters by picking one point per cluster.
- Until convergence:
  - For each point, place it in the cluster whose current centroid it is nearest.
    - A cluster centroid has its coordinates calculated as the averages of all its points' coordinates.
  - After all points are assigned, update the locations of the centroids of the `k` clusters.
  - Reassign all points to their closest centroid.
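A minimal sketch of one way to run these iterations, using Spark MLlib's built-in k-means rather than writing the loop by hand; the two-dimensional points here are made up for illustration, and `sc` is assumed to be an existing SparkContext.

```python
from pyspark.mllib.clustering import KMeans

# Toy two-dimensional points (made-up data); in practice this would be an RDD
# built from your real feature vectors.
points = sc.parallelize([[0.0, 0.0], [0.5, 0.5], [9.0, 9.0], [9.5, 8.5]])

# Train k-means with k = 2 clusters using Euclidean distance.
model = KMeans.train(points, k=2, maxIterations=20)

print(model.clusterCenters)            # the final centroids
print(model.predict(points).collect()) # cluster id assigned to each point
```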
7. The big question
- How do we select `k`?
- Try different values of `k`, looking at the change in the average distance to the centroid as `k` increases.
- Approach 1: sampling
  - Cluster a sample of the data using hierarchical clustering to obtain `k` clusters.
  - Pick a point from each cluster (e.g., the point closest to the centroid).
  - The sample fits in main memory.
- Approach 2: pick a dispersed set of points
  - Pick the first point at random.
  - Pick the next point to be the one whose minimum distance from the selected points is as large as possible.
  - Repeat until we have `k` points.
Content from Recommendation Systems
Last updated on 2024-07-26
Overview
Questions
- How do recommendation systems suggest items to users?
Objectives
- Describe the main approaches to building recommendation systems.
Content from Distributed Machine Learning with Spark
Last updated on 2024-07-26
Overview
Questions
- How can we train machine learning models on distributed data with Spark?
Objectives
- Explain linear classifiers and support vector machines and train a classifier with Spark.
Distributed machine learning with Spark
1. Application: Spam Filtering
| | viagra | learning | the | dating | nigeria | spam? |
|---|---|---|---|---|---|---|
| X1 | 1 | 0 | 1 | 0 | 0 | y1 = 1 |
| X2 | 0 | 1 | 1 | 0 | 0 | y2 = -1 |
| X3 | 0 | 0 | 0 | 0 | 1 | y3 = 1 |
- Instance spaces X1, X2, X3 belong to the set X (data points).
- Binary or real-valued feature vector X of word occurrences.
  - `d` features (words and other things); d is approximately 100,000.
- Class Y
- Spam = 1
- Ham = -1
2. Linear models for classification

- Vector Xj contains real values.
  - The Euclidean norm of each vector is 1.
  - Each vector has a label yj.
- The goal is to find a vector W = (w1, w2, …, wd), with each wj a real number, such that the labeled points are clearly separated by a line.
- Dot is spam, minus is ham!
3. Linear classifiers
- Each feature `i` has a weight wi.
- Prediction is based on the weighted sum f(x) = Σi wi xi.
- If f(x) is:
  - Positive: predict +1
  - Negative: predict -1
4. Support Vector Machine
- Originally developed by Vapnik and collaborators as a linear classifier.
- Can be modified to support non-linear classification by mapping data into high-dimensional spaces.
- Problem statement:
  - We want to separate `+` from `-` using a line.
  - Training examples: pairs (xi, yi), where xi is a feature vector and yi is +1 or -1.
  - Inner product: w · x = Σj wj xj.
- Which is the best linear separator defined by w?
5. Support Vector Machine: largest margin
- The distance from the separating line corresponds to the confidence of the prediction.
  - For example, we are more sure about the class of `A` and `B` than of `C`.
- Margin γ: the distance from the separating line to the closest training point.
- Maximizing the margin while identifying `w` is good according to intuition, theory, and practice.
- A math question: how do you narrate this equation?
6. Support Vector Machine: what is the margin?
Slide from the book
- Notation:
  - Gamma (γ) is the distance from point A to the linear separator L: d(A, L) = |AH|.
  - If we select a random point M on line L, then d(A, L) is the projection of AM onto the vector `w`.
  - If we assume the normalized Euclidean norm of `w`, |w|, is equal to one, that brings us to the result in the slide.
- In other words, maximizing the margin is directly related to how `w` is chosen.
  - For the i-th data point, the margin is γi = yi(w · xi + b).
7. Some more math …
- After some more mathematical manipulations, everything comes back to an optimization problem on `w`: minimize (1/2)·||w||^2 subject to yi(w · xi + b) >= 1 for all i.
.
8. SVM: Non-linearly separable data
- For each data point:
  - If the margin is greater than 1, we don't care.
  - If the margin is less than 1, we pay a linear penalty.
- Introducing slack variables ξi: minimize (1/2)·||w||^2 + C·Σi ξi subject to yi(w · xi + b) >= 1 - ξi and ξi >= 0.
.
9. Hands-on: SVM
- Download the bank marketing data set from https://www.cs.wcupa.edu/lngo/data/bank.csv
- Activate the `pyspark` conda environment, install `pandas`, then launch Jupyter notebook:

$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-7`.
- Copy the code from `spark-6` to set up and launch a Spark application.
- Documentation:
- Question: Can you predict whether a client will subscribe to a term deposit (feature `deposit`)?
- Problems:
  - What form should the bank data be converted to?
  - How do we handle categorical data?
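One possible starting point is a hedged sketch using Spark's DataFrame-based ML API rather than the low-level RDD API. It assumes a SparkSession built on top of the existing `sc`, and that `bank.csv` has a header row with a `deposit` label column plus a mix of numeric and categorical columns; the specific column names below are assumptions, so inspect `df.printSchema()` and adjust them to the actual file.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.getOrCreate()          # reuses the existing SparkContext
df = spark.read.csv("../data/bank.csv", header=True, inferSchema=True)

# Column names below are assumptions -- check df.printSchema() first.
categorical = ["job", "marital", "education"]
numeric = ["age", "balance", "duration"]

# Index the label and each categorical feature, then assemble a feature vector.
stages = [StringIndexer(inputCol="deposit", outputCol="label")]
stages += [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
stages += [VectorAssembler(inputCols=[c + "_idx" for c in categorical] + numeric,
                           outputCol="features")]
stages += [LinearSVC(featuresCol="features", labelCol="label", maxIter=50)]

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train)
model.transform(test).select("label", "prediction").show(10)
```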
{% include links.md %}