Content from Syllabus
Last updated on 2024-07-26
This is the curriculum for CSC 467, Big Data Engineering. The website is built with The Carpentries Workbench.
- Instructor: Linh B. Ngo
- Office: UNA 138
- Office Hours: By appointment
- Email: lngo AT wcupa DOT edu
- Phone: 610-436-2595
Course Information
- The course runs from August 26, 2024 until December 09, 2024. It is an in-person course.
Required Materials:
- Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman.
- Author’s Textbook Download Page.
Course Description
This course investigates engineering approaches to solving challenges in data-intensive and big data computing problems. Course topics include distributed tools and parallel algorithms that help with acquiring, cleaning, and mining very large amounts of data.
Learning Objectives
Course Student Learning Outcomes (CSLO):
- Be able to set up and deploy appropriate data engineering technologies to manage big data sets.
- Be able to understand MapReduce, one of the key enabling programming concepts.
- Be able to implement key data mining techniques using Spark programming libraries.
BS in CS Program Objectives (CSPO):
- Be able to apply theory, techniques, and methodologies to create and/or maintain high quality computing systems that function effectively and reliably in the emerging and future information infrastructure.
ABET Objectives (APO):
- Analyze a complex computing problem and to apply principles of computing and other relevant disciplines to identify solutions (ABET 1).
- Design, implement, and evaluate a computing-based solution to meet a given set of computing requirements in the context of the program’s discipline (ABET 2).
Assessments and Grading:
Method of Evaluation
Assessment | % of Final Grade | Course Objectives Assessed | Program Objectives Assessed | ABET Objectives |
---|---|---|---|---|
Assignments | 40% | 1,2,3 | 1 | 1 |
Course Project | 40% | 1,2,3 | 1 | 2 |
Class Participation | 5% | 1 | 1 | 1 |
Final | 15% | 1,2,3 | 1 | 1 |
University Policies
Academic & Personal Integrity
It is the responsibility of each student to adhere to the university’s standards for academic integrity. Violations of academic integrity include any act that violates the rights of another student in academic work, that involves misrepresentation of your own work, or that disrupts the instruction of the course. Other violations include (but are not limited to): cheating on assignments or examinations; plagiarizing, which means copying any part of another’s work and/or using ideas of another and presenting them as one’s own without giving proper credit to the source; selling, purchasing, or exchanging of term papers; falsifying of information; and using your own work from one class to fulfill the assignment for another class without significant modification. Proof of academic misconduct can result in the automatic failure and removal from this course. For questions regarding Academic Integrity, the No-Grade Policy, Sexual Harassment, or the Student Code of Conduct, students are encouraged to refer to the Department Undergraduate Handbook, the Undergraduate Course Catalog, the Ram’s Eye View, or the University Website.
Students with Disabilities
If you have a disability that requires accommodations under the Americans with Disabilities Act (ADA), please present your letter of accommodations and meet with me as soon as possible so that I can support your success in an informed manner. Accommodations cannot be granted retroactively. If you would like to know more about West Chester University’s Services for Students with Disabilities (OSSD), please visit them at 223 Lawrence Center. Their phone number is 610-436-2564, their fax number is 610-436-2600, their email address is ossd@wcupa.edu, or visit the OSSD website. In an effort to assist students who either receive or may believe they are entitled to receive accommodations under the Americans with Disabilities Act and Section 504 of the Rehabilitation Act of 1973, the University has appointed a student advocate to be a contact for students who have questions regarding the provision of their accommodations or their right to accommodations. The advocate will assist any student who may have questions regarding these rights. The Director for Equity and Compliance/Title IX Coordinator has been designated in this role. Students who need assistance with their rights to accommodations should contact them at 610-436-2433.
The University’s Americans with Disabilities policy is available on the website. If you encounter an area of this course that is not accessible to you, please contact me.
Excused Absences Policy
Students are advised to carefully read and comply with the excused absences policy, including absences for university-sanctioned events, contained in the WCU Undergraduate Catalog. In particular, please note that the responsibility for meeting academic requirements rests with the student, that this policy does not excuse students from completing required academic work, and that professors can require a fair alternative to attendance on those days that students must be absent from class in order to participate in a University-Sanctioned Event.
Reporting Incidents of Sexual Violence
West Chester University and its faculty are committed to assuring a safe and productive educational environment for all students. In order to comply with the requirements of Title IX of the Education Amendments of 1972 and the University’s commitment to offering supportive measures in accordance with the new regulations issued under Title IX, the University requires faculty members to report incidents of sexual violence shared by students to the University’s Title IX Coordinator. The only exceptions to the faculty member’s reporting obligation are when incidents of sexual violence are communicated by a student during a classroom discussion, in a writing assignment for a class, or as part of a University-approved research project. Faculty members are obligated to report sexual violence or any other abuse of a student who was, or is, a child (a person under 18 years of age) when the abuse allegedly occurred to the person designated in the University Protection of Minors Policy. Information regarding the reporting of sexual violence and the resources that are available to victims of sexual violence is set forth at the webpage for the Office of Diversity, Equity, and Inclusion.
Inclusive Learning Environment and Anti-Racist Statement
Diversity, equity, and inclusion are central to West Chester University’s mission as reflected in our Mission Statement, Values Statement, Vision Statement and Strategic Plan: Pathways to Student Success. We disavow racism and all actions that silence, threaten, or degrade historically marginalized groups in the U.S. We acknowledge that all members of this learning community may experience harm stemming from forms of oppression including but not limited to classism, ableism, heterosexism, sexism, Islamophobia, anti-Semitism, and xenophobia, and recognize that these forms of oppression are compounded by racism.
Our core commitment as an institution of higher education shapes our expectation for behavior within this learning community, which represents diverse individual beliefs, backgrounds, and experiences. Courteous and respectful behavior, interactions, and responses are expected from all members of the University. We must work together to make this a safe and productive learning environment for everyone. Part of this work is recognizing how race and other aspects of who we are shape our beliefs and our experiences as individuals. It is not enough to condemn acts of racism. For real, sustainable change, we must stand together as a diverse coalition against racism and oppression of any form, anywhere, at any time.
Resources for education and action are available through WCU’s Office for Diversity, Equity, and Inclusion (ODEI), DEI committees within departments or colleges, the student ombudsperson, and centers on campus committed to doing this work (e.g., Dowdy Multicultural Center, Center for Women and Gender Equity, and the Center for Trans and Queer Advocacy).
Guidance on how to report incidents of discrimination and harassment is available at the University’s Office of Diversity, Equity and Inclusion.
Emergency Preparedness
All students are encouraged to sign up for the University’s free WCU ALERT service, which delivers official WCU emergency text messages directly to your cell phone. For more information, visit https://www.wcupa.edu/wcualert. To report an emergency, call the Department of Public Safety at 610-436-3311.
Electronic Mail Policy
It is expected that faculty, staff, and students activate and maintain regular access to University provided e-mail accounts. Official university communications, including those from your instructor, will be sent through your university e-mail account. You are responsible for accessing that mail to be sure to obtain official University communications. Failure to access will not exempt individuals from the responsibilities associated with this course.
Course Topics and Schedules
| Week | Topic | Assessments |
|---|---|---|
| | Introduction to Big Data | - |
| | MapReduce Programming Paradigm | - |
| | Data Parallel Computing Environment in Python | - |
| | Frequent Itemsets | - |
| | Locality-Sensitive Hashing | - |
| | Clustering | - |
| | Dimensionality Reduction | - |
| | Recommendation Systems | - |
| | Page Rank | - |
| | Decision Trees | - |
| | Data Parallel Computing with Spark | - |
Content from Introduction
Last updated on 2024-07-26
1. Terminologies
- Big data problems: not only processing power but also the size of the data is a limiting factor in finding a timely solution.
- Big Data (Industry, Social Sciences)
  - Input data carry the characteristics of Big Data (the 4 Vs).
  - The computational process is simple and straightforward, with minimal intermediate data being generated.
- Data Intensive Computing (Natural Sciences)
  - Input data may or may not be big data.
  - The computational process produces massive and complex intermediate data that needs to be analyzed during the process.
2. Big Data in Science (Data-enabled Science)

- Scientific process:
- empirical
- theoretical
- computational
- data-enabled/data-intensive science (the fourth pillar of scientific discovery).
- Big data analytics in science and engineering for data sets that are:
- Too big
- Too complex
- Too fast (streaming)
- Too noisy
- Too heterogeneous

3. Big Data in Industry
- Old statistics:
  - In 2008-2009:
    - Google processed 20PB a day.
    - Facebook had 2.5PB of user data and added 15TB/day.
    - eBay had 6.5PB of user data and added 50TB/day.
  - In 2010-2011:
    - Facebook had 400M users and 125PB of user data.
    - eBay had 10PB of user data in 2010 and expected to double this number in 2011.
  - In 2012-2013:
    - Facebook had 900M users.
    - Twitter had 400M tweets/day.
- We don't see these kinds of statistics collected anymore because big data has become the norm.
- Newer (and different) statistics:
  - 53% of companies are using big data analytics today.
  - Popular use cases include data warehouse optimization, customer/social analysis, predictive maintenance, clickstream analytics, fraud detection, and the Internet of Things.
  - Spark, MapReduce, and Yarn are among the most popular frameworks.
  - Medicine, Retail, Construction, Banking, and Transportation are new industries being defined by big data analytic capabilities.
- People don’t explicitly talk about Big Data anymore:
  - Gartner Hype Cycle 2014
  - Gartner Hype Cycle 2019
  - Gartner Hype Cycle 2020
4. The Vs of Big Data
- Used to be four: Volume, Velocity, Variety, Veracity.
- Other Vs are added over time.
- Volume: the size of the files used to archive and spread data.
- Velocity: the speed with which data is generated and processed.
- Variety: formats and purposes of data, which may include objects as different as samples of animal tissue, free-text observations, humidity measurements, GPS coordinates, and the results of blood tests.
- Veracity: the extent to which the quality and reliability of big data can be guaranteed. Data with high volume, velocity and variety are at significant risk of containing inaccuracies, errors and unaccounted-for bias.
- Validity: the selection of appropriate data with respect to the intended use. The choice of a specific dataset as evidence base requires adequate and explicit justification, including recourse to relevant background knowledge to ground the identification of what counts as data in that context.
- Volatility: the extent to which data can be relied upon to remain available, accessible and re-interpretable despite changes in archival technologies. This is significant given the tendency of formats and tools used to generate and analyse data to become obsolete, and the efforts required to update data infrastructures so as to guarantee data access in the long term.
- Value: the multifaceted forms of significance attributed to big data by different sections of society, which depend as much on the intended use of the data as on historical, social and geographical circumstances.
5. Programming paradigm for big data
- Multi-faceted challenges:
- Require not only parallel computation but also parallel data processing.
- New computational tools and strategies.
- New data intensive scalable architectures.
- Science is moving increasingly from hypothesis-driven to data-driven discoveries.
- Industry is at a stage where big data infrastructures are integrated and big data sets are beginning to be analyzed to produce business insights.
- Example general paradigm:

- It is difficult to write parallel programs:
  - Difficulty in converting algorithms from serial to parallel.
  - Difficulty in identifying the different ways that the program can fail.
  - No reliable way to detect failure of a process.
- It is even more difficult to write parallel programs at large scale:
  - The same set of errors, but scaled up with size.
- It is even more difficult to debug large-scale parallel programs:
  - What if the program doesn't fail but only produces incorrect results?
6. Data-intensive approach
- Scale “out”, not “up”
- It is easier and cheaper to add nodes to an existing cluster than to build a faster cluster.
- Move computation to the data
- Reduce data movement.
- Sequential processing, avoid random access
- Reduce seek movement on disks.
- Seamless scalability
{% include links.md %}
Content from MapReduce Programming Paradigm
Last updated on 2024-07-26
Overview
Questions
- What is the MapReduce programming paradigm, and why do we need it for big data?
Objectives
- Explain the map and reduce operations and how the MapReduce abstraction simplifies large-scale data processing.
1. The reality of working with big data
- Hundreds or thousands of machines to support big data.
- Distribute data for storage (not within the scope of this class).
- Parallelize data computation (MapReduce)
- Handle failure (MapReduce)
- The paper: MapReduce: simplified data processing on large clusters
2. The reality of working with big data (as outlined by the paper)
- Challenges:
- input data is usually large
- the computations have to be distributed across hundreds or thousands of machines
- finish in a reasonable amount of time.
- Addressing these challenges causes the original simple computation to become obscured by large amounts of complex supporting code.
- What MapReduce does:
- is a new abstraction
- expresses the simple core computations
- hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
- Why MapReduce?
  - Inspired by the `map` and `reduce` primitives present in Lisp and many other functional languages.
  - Most data computations involve:
    - applying a `map` operation to each logical record in the input to compute a set of intermediate key/value pairs, and then
    - applying a `reduce` operation to all the values that share the same key to combine the derived data appropriately.
3. MapReduce in a nutshell
- What is `map`? A function/procedure that is applied to every individual element of a collection/list/array/…

int square(x) { return x*x; }
map square [1,2,3,4] -> [1,4,9,16]

- What is `reduce`? A function/procedure that performs an operation on a list. This operation folds/reduces the list into a single value (or a smaller subset).

reduce ([1,2,3,4]) using sum -> 10
reduce ([1,2,3,4]) using multiply -> 24
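The same idea in plain Python (not Spark-specific), using the built-in `map` and `functools.reduce`:

```python
from functools import reduce

def square(x):
    return x * x

# map applies square to every element of the collection
print(list(map(square, [1, 2, 3, 4])))           # [1, 4, 9, 16]

# reduce folds the collection into a single value
print(reduce(lambda a, b: a + b, [1, 2, 3, 4]))  # sum -> 10
print(reduce(lambda a, b: a * b, [1, 2, 3, 4]))  # multiply -> 24
```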
4. Word Count: the "Hello, World" of big data
- We have a large amount of text …
- Could be stored in a single massive file.
- Could be stored in multiple files.
- We want to count the number of times each distinct word appears in the files
- Sample applications:
- Analyze web server logs to find popular URLs/keywords.
- Analyze security logs to find incidents.
- Standard parallel programming approach:
- Count number of files or set up seek distances within a file.
- Set number of processes
- Possibly setting up dynamic workload assignment
- A lot of data transfer
- Significant coding effort
5. Word Count: MapReduce workflow

6. Word Count: MapReduce workflow - what you actually do
- Input: a set of key-value pairs
- Programmer specifies two methods:
  - `Map(k, v) -> <(k', v')>`
    - Takes a key-value pair and outputs a set of intermediate key-value pairs.
    - E.g., the key is the filename and the value is a single line in the file.
    - There is one Map call for every (k, v) pair.
  - `Reduce(k', <v'>) -> <(k', v'')>`
    - All values `v'` with the same key `k'` are reduced together and processed in `v'` order.
    - There is one Reduce function call per unique key `k'`.
- MapReduce environment takes care of:
- Partitioning the input data
- Scheduling the program’s execution across a set of machines
- Performing the group by key step
- Handling machine failures
- Managing required inter-machine communication
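To make the division of labor concrete, here is a minimal single-machine sketch (plain Python; the names `map_fn` and `reduce_fn` are made up for illustration) of the two pieces the programmer writes for Word Count, with a tiny driver standing in for the framework's group-by-key step:

```python
from collections import defaultdict

def map_fn(key, value):
    """key: filename, value: one line of text -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """key: a word, values: all counts emitted for that word -> (word, total)."""
    return (key, sum(values))

# A tiny driver that mimics the framework's partitioning and group-by-key step.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)
for k, v in (pair for line in lines for pair in map_fn("file.txt", line)):
    groups[k].append(v)
print([reduce_fn(k, vs) for k, vs in groups.items()])
```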
7. Word Count: MapReduce workflow at scale


8. Applications well suited to the MapReduce programming paradigm
- Text tokenization, indexing, and search
- Web access log stats
- Inverted index construction
- Term-vector per host
- Distributed grep/sort
- Graph creation
- Web link-graph reversal (Google’s PageRank)
- Data Mining and machine learning
- Document clustering
- Machine learning
- Statistical machine translation
{% include links.md %}
Content from Spark Computing Environment
Last updated on 2024-07-26
Overview
Questions
- What is Spark, and how do we set up a local Spark computing environment?
Objectives
- Describe Spark's design and run a word count application in local mode.
Spark computing environment
1. What is Spark?
- A unified compute engine and a set of libraries for parallel data processing on computer clusters.

2. Design philosophy
- Unified: Spark supports a wide range of data analytic tasks over the same computing engine and a consistent set of APIs.
- Computing engine: Spark handles loading data from storage systems and performs computation on the data (in memory) rather than on permanent storage. To adhere to the data locality principle, Spark relies on APIs to provide a transparent common interface to different storage systems for all applications.
- Libraries: Via its APIs, Spark supports a wide array of internal and external libraries for complex data analytic tasks.
3. A brief history of Spark
- Research project at UC Berkeley AMP Lab in 2009 to address drawbacks of Hadoop MapReduce.
- Paper published in 2010: Spark: Cluster Computing with Working Sets
- The source code was contributed to Apache in 2013. The project had more than 100 contributors from more than 30 organizations outside UC Berkeley.
- Version 1.0 was released in 2014.
- Currently, Spark is used extensively in academia and industry (NASA, CERN, Uber, Netflix, …).
4. Spark applications
- A Spark application typically consists of a `driver` process and a set of `executor` processes.
- The `driver` runs the main function and is responsible for:
  - maintaining information about the Spark application,
  - responding to a user's program or input, and
  - analyzing, distributing, and scheduling work across the executors.
- The `executors` carry out the actual work assigned to them by the `driver`.

- Spark also has a local mode (what we are using for this class), where driver and executors are simply processes on the same machine.
- A Spark application developed in local mode can be carried over almost as-is to run in cluster mode (one of the attractive features of Spark).
- Spark supports the following language APIs: Scala, Java, Python, SQL (ANSI SQL 2003 standard), and R.
5. Hands-on: Word Count in Spark
- Open a terminal.
- Activate the `pyspark` conda environment, then launch Jupyter notebook:

$ conda activate pyspark
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-1`.

- Enter the following Python code into the first cell of `spark-1`, then run the cell.
  - This code sets up the paths to Spark's libraries and executables.
import os
import sys
spark_path = os.environ['SPARK_HOME']
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.9-src.zip")
import findspark
findspark.init()
import pyspark

- Enter the following Python code into the next cell, adjust `number_cores` and `memory_gb` based on your computer's hardware, then run the cell.
  - This code sets up the configuration for the supporting Spark cluster (in local mode).
  - `number_cores` should be one less than the total number of cores on your computer.
  - `memory_gb` should be half the amount of memory that you have, or at least 3GB less than that amount.
number_cores = 8
memory_gb = 16
conf = (pyspark.SparkConf().setMaster('local[{}]'.format(number_cores)).set('spark.driver.memory', '{}g'.format(memory_gb)))
sc = pyspark.SparkContext(conf=conf)

- At this point, if you visit `127.0.0.1:4040` in another browser tab, you should also see the local Spark cluster up and running (with no job).

- Enter the following Python code into the third cell.
- Pay attention to the strings inside `sc.textFile("A_STRING")` and `wordcount.saveAsTextFile("ANOTHER_STRING")`.
  - These strings represent the path to where the original data is located (`sc.textFile`) and where the output directory is to be saved (`wordcount.saveAsTextFile`).
  - Take your time here to organize the directories properly.
textFile = sc.textFile("../data/shakespeare/shakespeare-complete.txt")
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("../data/output/output-wordcount-01")
- A successful run will generate an output directory containing a `_SUCCESS` flag file (size 0).

6. Data distribution in Spark
- Data in Spark are managed as a distributed collection: when running on a cluster, parts of the data are distributed across different machines and are manipulated by different executors.
- To allow executors to perform work in parallel, Spark breaks up the data into chunks called `partitions`.
- Run the following code in a cell of `spark-1`:

textFile.getNumPartitions()

7. Transformation
In Spark, the core data structures are immutable, meaning they cannot be changed after creation. To "change" a data collection means to create a new data collection that is a transformation of the old one.

- There are two types of transformations:
  - Narrow dependencies (1-to-1 transformation).
  - Wide dependencies (1-to-N transformation).
We can make a copy of our `textFile` that is distributed across more partitions. This is still a narrow-dependency transformation. Run the following code in a cell of `spark-1`:

textFile_4 = textFile.repartition(4)
textFile_4.getNumPartitions()

- Run the following code in a cell of `spark-1`:
wordcount = textFile_4.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile("../data/output/output-wordcount-02")

- There are now four resulting output files:

8. Lazy evaluation and actions
- Transformations are only logical plans.
- Spark will wait until the very last moment to execute the graph of computation instructions (the logical plan).
- To trigger the computation, we run an action.
- List of common Spark actions:
wordcount = textFile_4.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
print(wordcount)

- There are three kinds of actions:
  - Actions to view data in the console (e.g., `take`).
  - Actions to collect data into native objects (e.g., `collect`).
  - Actions to write to output data sources (e.g., `saveAsTextFile`).
wordcount.take(10)
local_words = wordcount.collect()
print(local_words[1:10])

9. Hands on: Word Count workflow breakdown
- Run each of the following segments of Python code in a separate cell.
- The RDD of the text file has a single item per line of the file.
textFile_4.take(10)
- `flatMap` breaks lines into lists of words and concatenates these lists into a single new RDD.
wordcount_1 = textFile_4.flatMap(lambda line: line.split(" "))
wordcount_1.take(10)
- Each item in this new RDD is turned into a (key/value) pair, where the key is a word and the value is `1`.
wordcount_2 = wordcount_1.map(lambda word: (word, 1))
wordcount_2.take(10)
- All pairs with the same key are reduced together using pairwise summation to get the final count for each key.
wordcount_3 = wordcount_2.reduceByKey(lambda a, b: a + b)
wordcount_3.take(10)

Challenge 1
- Augment the mapping process of WordCount with a function that removes punctuation and capitalization from the words.
- Hint: The string module is helpful for removing punctuation.
## Solution
import string
translator = str.maketrans('', '', string.punctuation)
wordcount_enhanced = textFile.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word.translate(translator).lower(), 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount_enhanced.take(100)
{: .solution} {: .challenge}
Challenge 2
- Look up the Spark Python API for filter.
- Augment the results from Challenge 1 to remove the empty strings (`''`).
## Solution
wordcount_filtered = wordcount_enhanced.filter(lambda x: x[0] != '')
wordcount_filtered.take(100)
{: .solution} {: .challenge}
{% include links.md %}
Content from Data Parallel Computing with Spark
Last updated on 2024-07-26
Overview
Questions
- How do we use Spark to analyze a real data set?
Objectives
- Set up a Spark application and perform data analytics on a movie data set.
Data parallel computing with Spark
Hands-on: Data analytics in Spark
- Download the Movie Dataset.
- Unzip the movie data file.
- Open a terminal.
- Activate the `pyspark` conda environment, then launch Jupyter notebook:

$ conda activate pyspark
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-2`.
- Copy the code from `spark-1` to set up and launch a Spark application.
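As a starting point, here is a hedged sketch of loading the ratings into an RDD. The file name `ratings.csv` and its `userId,movieId,rating,timestamp` layout are assumptions based on the common MovieLens format; adjust the path and column positions to match the file you actually unzipped. It assumes the `sc` SparkContext created by the setup cells copied from `spark-1`.

```python
# Assumption: the unzipped data contains a MovieLens-style ratings.csv.
ratings_raw = sc.textFile("../data/movies/ratings.csv")
header = ratings_raw.first()
ratings = ratings_raw.filter(lambda line: line != header) \
                     .map(lambda line: line.split(",")) \
                     .map(lambda f: (int(f[0]), int(f[1]), float(f[2])))

# Example analysis: average rating per movie.
avg_by_movie = ratings.map(lambda r: (r[1], (r[2], 1))) \
                      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
                      .mapValues(lambda s: s[0] / s[1])
print(avg_by_movie.take(5))
```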
{% include links.md %}
Content from Page Rank
Last updated on 2024-07-26
Overview
Questions
- How can we rank web pages by importance?
Objectives
- Explain the PageRank formulation and implement it with power iteration in Spark.
Page Rank
1. Web page organization in the past
- Web pages were manually curated and organized.
- Does not scale.

- Next, web search engines were developed (information retrieval).
  - Information retrieval focuses on finding documents within a trusted set.

- Challenges for web search:
  - Who to trust given the massive variety of information sources?
    - Trustworthy pages may point to each other.
  - What is the best answer to a keyword search (given a context)?
    - Pages using the keyword in the same contexts tend to point to each other.
  - All pages are not equally important.
- Web pages/sites on the Internet can be represented as a graph:
- Web pages/sites are nodes on the graph.
- Links from one page to another represent the incoming/outgoing connections between nodes.
- We can compute the importance of nodes in this graph based on the distribution/intensity of these links.
2. PageRank: initial formulation
- Links as votes:
  - A page (node) is more important if it has more links.
    - Incoming or outgoing?
    - Think of incoming links as votes.
  - Are all incoming links equal?
    - Links from important pages count more.

- Each link’s vote is proportional to the importance of its source page.
- If page j with importance rj has n outgoing links, each outgoing link has rj/n votes.
- The importance of page j is the sum of the votes on its incoming links.

3. PageRank: the flow model
- Summary:
  - A vote from an important page is worth more.
  - A page is important if it is linked to by other important pages.

- Flow equations:
- ry = ry/2 + ra/2
- ra = ry/2 + rm
- rm = ra/2
- General equation for calculating the rank rj of page j: rj = Σ(i→j) ri/di, where di is the number of outgoing links of page i.

- Three equations (only two of which are independent), three unknowns.
  - No unique solution.
- Additional constraint to force uniqueness:
- ry + ra + rm = 1
- Gaussian elimination:
- ry = 2/5, ra = 2/5, rm = 1/5
- Does not scale to Internet-size!
4. PageRank: matrix formulation
- Set up the flow equations as a stochastic adjacency matrix M.
- M is an N x N matrix, where N is the number of nodes in the graph (web pages on the Internet).
  - Let page i have di outgoing links.
  - If there is an outgoing link from page i to page j, then Mji = 1/di; otherwise Mji = 0.
  - Stochastic: the values in each column of M sum to 1.
  - Adjacency: a non-zero value indicates the presence of a link.
- Rank vector r: a vector with one entry per page.
  - The order of pages in the rank vector should be the same as the order of pages in the rows and columns of M.
- Flow equations:
- ry = ry/2 + ra/2
- ra = ry/2 + rm
- rm = ra/2
- General equation for calculating the rank rj of page j with regard to all pages: rj = Σi Mji · ri.

- This can be rewritten in matrix form: r = M · r.

- Visualization:
  - Suppose page i has importance ri and has outgoing links to three other pages, including page j.

- Final perspective:
- Also, this is why we need to study advanced math …

5. PageRank: power iteration
- We want to find the page rank vector r.
- Power method: an iterative scheme.
  - Suppose there are N web pages on the Internet.
  - All web pages start out with the same importance: 1/N.
  - Rank vector r: r(0) = [1/N, …, 1/N]^T
  - Iterate: r(t+1) = M · r(t)
  - Stopping condition: |r(t+1) - r(t)| < some small positive error threshold e.
- Example: a small numerical sketch of this iteration is shown below.
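A minimal sketch of power iteration on the three-page graph (y, a, m) from the flow model above, assuming numpy is available; it should converge to approximately r = [2/5, 2/5, 1/5].

```python
import numpy as np

# Column-stochastic matrix M for the graph y -> {y, a}, a -> {y, m}, m -> {a}.
# Columns (and rows) are ordered y, a, m; M[j][i] = 1/di if there is a link i -> j.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

N = 3
r = np.full(N, 1.0 / N)                    # r(0) = [1/N, ..., 1/N]
for _ in range(100):
    r_new = M @ r                          # r(t+1) = M . r(t)
    done = np.abs(r_new - r).sum() < 1e-10 # stopping condition
    r = r_new
    if done:
        break
print(r)                                    # ~ [0.4, 0.4, 0.2]
```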
6. PageRank: the Google formulation
- The three questions
- Does the above equation converge?
- Does it converge to what we want?
- Are the results reasonable?
## Challenge 6.1: Does it converge? The spider trap problem

- Build the stochastic adjacency matrix for the following graph and calculate the ranking for `a` and `b`.

Solution:
{: .solution} {: .challenge}
## Challenge 6.2: Does it converge to what we want? The dead end problem

- Build the stochastic adjacency matrix for the following graph and calculate the ranking for `a` and `b`.

Solution:
{: .solution} {: .challenge}
- The solution: random teleport.
  - At each time step, the random surfer has two options:
    - A probability of β to follow an outgoing link at random.
    - A probability of 1 - β to jump to some random page.
  - The value of β is typically between 0.8 and 0.9.
  - The surfer will teleport out of a spider trap within a few time steps.
  - The surfer will definitely teleport out of a dead end.

- Final equation for PageRank: rj = Σ(i→j) β ri/di + (1 - β)/N.

7. Hands-on: Page Rank in Spark
- Create a file called `small_graph.dat` with the following contents:
y y
y a
a y
a m
m a
- This data file describes the graph shown in the earlier slides.
- Open a terminal.
- Activate the `pyspark` conda environment, install `psutil`, then launch Jupyter notebook:

$ conda activate pyspark
$ conda install -y psutil
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-3`.
- Copy the code from `spark-1` to set up and launch a Spark application.
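No further code is given for this step, so here is a hedged sketch adapted from the well-known Spark PageRank example. It assumes the `sc` SparkContext from the setup cells and the `small_graph.dat` file created above (adjust the path if you saved it elsewhere), and it uses plain power iteration without the teleport term.

```python
# Parse "src dst" pairs and group the outgoing links per page.
links = sc.textFile("small_graph.dat") \
          .map(lambda line: tuple(line.split())) \
          .distinct() \
          .groupByKey() \
          .cache()

N = links.count()
ranks = links.map(lambda page: (page[0], 1.0 / N))   # start every page at 1/N

for _ in range(20):                                   # fixed number of iterations
    # Each page splits its current rank evenly over its outgoing links.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b)

print(sorted(ranks.collect(), key=lambda kv: -kv[1]))
```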
8. Hands-on: Page Rank in Spark - Hollins dataset
- Download the Hollins dataset
- Hollins University webbot crawl in 2004.
- Which page is the most important (internally)?
{% include links.md %}
Content from Locality Sensitive Hashing
Last updated on 2024-07-26
Overview
Questions
- How can we efficiently find similar items in a very large data set?
Objectives
- Explain shingling, minhashing, and locality-sensitive hashing.
Locality Sensitive Hashing
1. Applications of set-similarity
Many data mining problems can be expressed as finding similar sets:
- Pages with similar words, e.g., for classification by topic.
- NetFlix users with similar tastes in movies, for recommendation systems.
- Dual: movies with similar sets of fans.
- Entity resolution.
2. Application: similar documents
Given a body of documents, e.g., the Web, find pairs of documents with a lot of text in common, such as:
- Mirror sites, or approximate mirrors.
- Application: Don’t want to show both in a search.
- Plagiarism, including large quotations.
- Similar news articles at many news sites.
  - Application: Cluster articles by the same story: topic modeling, trend identification.
3. Three essential techniques for similar documents
- Shingling : convert documents, emails, etc., to sets.
- Minhashing : convert large sets to short signatures, while preserving similarity.
- Locality sensitive hashing : focus on pairs of signatures likely to be similar.

4. Shingles
- A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
- Example: k = 2; doc = abcab. Set of 2-shingles: {ab, bc, ca}.
- Represent a doc by its set of k-shingles.
- Documents that are intuitively similar will have many shingles in common.
- Changing a word only affects the k-shingles within distance k from the word.
- Reordering paragraphs only affects the 2k shingles that cross paragraph boundaries.
- Example: k = 3, `The dog which chased the cat` versus `The dog that chased the cat`.
  - The only 3-shingles replaced are `g_w`, `_wh`, `whi`, `hic`, `ich`, `ch_`, and `h_c`.
5. Minhashing: Jaccard Similarity
- The Jaccard similarity of two sets is the size of their intersection divided by the size of their union.

- Convert from sets to boolean matrices.
- Rows: elements of the universal set. In other words, all elements in the union.
- Columns: individual sets.
  - A cell value of `1` in row e and column S if and only if e is a member of S.
  - A cell value of `0` otherwise.
  - Column similarity is the Jaccard similarity of the sets of rows in which the columns have the value `1`.
  - Typically sparse.
- This gives you another way to calculate similarity: column similarity = Jaccard similarity.

- Generally speaking, given two columns, rows may be classified as:
- a: 1 1
- b: 1 0
- c: 0 1
- d: 0 0
- Sim(C1, C2) = a/(a+b+c)
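A small Python check of this row-type counting (a, b, c, d refer to the row classes just listed); the sets and row names here are made up for illustration:

```python
def jaccard(c1, c2):
    """Jaccard similarity of two sets."""
    return len(c1 & c2) / len(c1 | c2)

# Columns represented as the sets of rows where they have a 1.
C1 = {"r1", "r2", "r5"}
C2 = {"r1", "r3", "r5"}
print(jaccard(C1, C2))   # a=2 (rows r1, r5), b=1, c=1 -> 2/4 = 0.5
```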
6. Minhashing
- Imagine the rows permuted randomly.
- Define the minhash function h(C) = the number of the first (in the permuted order) row in which column C has a 1.
- Use several (e.g., 100) independent hash functions to create a signature for each column.
- The signatures can be displayed in another matrix called the signature matrix, in which
  - the columns represent the sets, and
  - the rows represent the minhash values, in order, for each column.

7. Minhashing: surprising property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
- Both are a/(a+b+c)!
- The similarity of signatures is the fraction of the minhash functions in which they agree.
- The expected similarity of two signatures equals the Jaccard similarity of the columns.
  - The longer the signatures, the smaller the expected error will be.

8. Hands-on Minhashing: implementation
- We can't realistically permute billions of rows:
  - Too many permutation entries to store.
  - Random access on big data (a big no-no).
- How do we calculate the hashes in sequential order?
  - Pick approximately 100 hash functions (standing in for 100 random permutations).
  - Go through the data set row by row.
  - For each row r and each hash function i, maintain a variable M(i, c) that holds the smallest value of hi(r) seen so far over the rows r in which column c has a 1 (see the sketch below).
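A minimal sketch of this row-by-row update in plain Python, using hash functions of the form (a·r + b) mod p as stand-ins for random permutations (the function and variable names are made up for illustration):

```python
import random

def minhash_signatures(matrix, num_hashes=100, prime=10007):
    """matrix[r] is the set of column indices that have a 1 in row r."""
    num_rows = len(matrix)
    num_cols = 1 + max(c for row in matrix for c in row)
    hashes = [(random.randrange(1, prime), random.randrange(0, prime))
              for _ in range(num_hashes)]
    # M[i][c] = smallest h_i(r) seen over the rows r where column c has a 1.
    M = [[float("inf")] * num_cols for _ in range(num_hashes)]
    for r in range(num_rows):
        hr = [(a * r + b) % prime for a, b in hashes]
        for c in matrix[r]:
            for i in range(num_hashes):
                if hr[i] < M[i][c]:
                    M[i][c] = hr[i]
    return M

# Four rows, two columns: column 0 has 1s in rows {0, 2}, column 1 in rows {1, 2, 3}.
sig = minhash_signatures([{0}, {1}, {0, 1}, {1}], num_hashes=4)
print(sig)
```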

9. Minhashing:
- Create three additional row permutations, update the signature matrix, and recalculate the signature similarity.
- Does the signature similarity become closer to the column similarity?
10. Locality-sensitive-hashing (LSH)
- Generate from the collection of signatures a list of candidate pairs: pairs of elements where similarity must be evaluated.
- For signature matrices: hash columns into many buckets and make elements of the same bucket candidate pairs.
- Pick a similarity threshold `t`, a fraction < 1.
- We want a pair of columns `c` and `d` of the signature matrix M to be a candidate pair if and only if their signatures agree in at least a fraction `t` of the rows.
11. LSH
- Big idea: hash columns of signature matrix M several times and arrange that only similar columns are likely to hash to the same bucket.
- Reality: we don’t need to study the entire column.

- Divide matrix M into `b` bands of `r` rows each.
- For each band, hash its portion of each column to a hash table with `k` buckets, with `k` as large as possible.
- Candidate column pairs are those that hash to the same bucket for at least one band.
- Fine-tune `b` and `r`.
- We will not go into the math here; a small sketch of the banding step follows.
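To make the banding idea concrete, here is a small plain-Python sketch (the helper name `lsh_candidates` and the toy signatures are made up for illustration) that hashes each band's portion of every signature and reports columns that share a bucket in at least one band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands, rows):
    """signatures[c] is the minhash signature (length bands*rows) of column c."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for c, sig in enumerate(signatures):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[hash(chunk)].append(c)        # hash this band's portion
        for cols in buckets.values():             # same bucket => candidate pair
            candidates.update(combinations(sorted(cols), 2))
    return candidates

sigs = [[1, 2, 3, 4, 5, 6], [1, 2, 3, 9, 9, 9], [7, 8, 0, 4, 5, 6]]
print(lsh_candidates(sigs, bands=2, rows=3))      # {(0, 1), (0, 2)}
```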
12. Hands on LSH
- Download the set of inaugural speeches from https://www.cs.wcupa.edu/lngo/data/inaugural_speeches.zip.
- Launch a Spark notebook called `spark-6.ipynb` and create the two initial setup cells.
{% include links.md %}
Content from Frequent Itemsets
Last updated on 2024-07-26
Overview
Questions
- How do we find itemsets that frequently appear together in large data sets?
Objectives
- Explain the market-basket model, association rules, and the A-Priori and SON algorithms.
Frequent Itemsets
1. The market-basket model
- Describes a common form of many-to-many relationship between two kinds of objects:
  - A large set of items, e.g., things sold in a supermarket.
  - A large set of baskets, each of which is a small set of items, e.g., the things one customer buys on one trip to the supermarket.
    - The number of baskets cannot fit into memory.
2. Definition of a frequent itemset
- A set of items that appears in many baskets is said to be frequent.
- Assume a value `s`: the support threshold.
- If `I` is a set of items:
  - The support for `I` is the number of baskets in which `I` is a subset.
  - `I` is frequent if its support is `s` or higher.
3. Example: frequent itemsets
- items = {milk, coke, pepsi, beer, juice}
- B1: m,c,b
- B2: m,p,j
- B3: m,b
- B4: c,j
- B5: m,p,b
- B6: m,c,b,j
- B7: c,b,j
- B8: b,c
- Support threshold `s` = 3 (three baskets).
- Frequent itemsets:
  - {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}
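A quick Python check of this example, counting the support of every single item and pair with s = 3:

```python
from itertools import combinations
from collections import Counter

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 3

# Count the support of every single item and every pair.
counts = Counter()
for basket in baskets:
    for k in (1, 2):
        for itemset in combinations(sorted(basket), k):
            counts[itemset] += 1

frequent = {itemset for itemset, c in counts.items() if c >= s}
print(sorted(frequent, key=len))   # singletons m, c, b, j and pairs {m,b}, {b,c}, {c,j}
```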
4. Applications
- Items: products; Baskets: sets of products.
- Given that many people buy beer and diapers together: run a sale on diapers and raise the price of beer.
- Given that many people buy hot dogs and mustard together: run a sale on hot dogs and raise the price of mustard.
- Items = documents; baskets = sentences/phrases.
- Items that appear together too often could represent plagiarism.
- Items = words, basket = documents.
- Unusual words appearing together in a large number of documents may indicate an interesting relationship.
5. Scale of the problem
- Walmart sells hundreds of thousands of items, and has billions of transactions (shopping basket/cart at checkout).
- The Web has billions of words and many billions of pages.
6. Association Rules:
- If-then rules about the contents of baskets.
- `{i_1, i_2, ..., i_k} -> j` means: "If a basket contains all of i_1, …, i_k, then it is likely to contain j."
- The confidence of this association rule is the probability of j given {i_1, …, i_k}.
  - The fraction of the baskets containing {i_1, …, i_k} that also contain j.
- Example:
- B1: m,c,b
- B2: m,p,j
- B3: m,b
- B4: c,j
- B5: m,p,b
- B6: m,c,b,j
- B7: c,b,j
- B8: b,c
- An association rule: {m,b} -> c
- Basket contains m and b: B1, B3, B5, B6
- Basket contains m, b, and c: B1, B6
- Confidence = 2 / 4 = 50%
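The same calculation in a couple of lines of Python:

```python
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]

# Confidence of {m, b} -> c = support({m, b, c}) / support({m, b}).
has_mb = [b for b in baskets if {"m", "b"} <= b]
has_mbc = [b for b in has_mb if "c" in b]
print(len(has_mbc) / len(has_mb))   # 2 / 4 = 0.5
```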
7. Finding association rules
- Find all association rules with support >= s and confidence >= c.
  - The hard part: finding the frequent itemsets.
8. Computation model
- Data is stored in flat files on disk.
- Most likely basket-by-basket.
- Expand baskets into pairs, triples, etc. as you read the baskets.
  - Use `k` nested loops to generate all sets of size `k`.
- The I/O cost is measured in passes (all baskets are read in each pass).
9. Main-memory bottleneck
- For many frequent-itemset algorithms, main memory is the critical resource.
- We need to keep count of things (occurrences of pairs/triples/…) when we read baskets.
- The number of different things we can count is limited by main memory.
- Swapping counts is going to be horrible.
10. Naive algorithm
- The hardest problem is finding frequent pairs, because they are the most common.
- Read the file once, counting in main memory the occurrences of each pair.
- For each basket of `n` items, there are `n(n-1)/2` pairs, generated by a double-nested loop.
- If the number of possible pairs (roughly `n^2` for `n` distinct items) exceeds main memory, the approach fails.
11. Naive algorithm: how do we count?
- Count all pairs using a triangular matrix.
  - Requires 4 bytes per pair for all possible pairs: about 2n^2 bytes.
- Or keep a table of triples [i, j, c], where c is the count of the pair {i, j}.
  - Requires 12 bytes, but only for pairs with count > 0: 12p bytes, where p is the number of pairs that actually occur.
12. A-Priori algorithm.
- Limit the need for main memory.
- Key idea: monotonicity
  - If a set of items appears at least `s` times, so does every subset of this set.
- Contrapositive: if an item `i` does not appear in `s` baskets, then no pair containing `i` can appear in `s` baskets.
can appear in s baskets. - A-Priori algorithm:
Pass 1: read baskets and count the item occurrences. Keep only the items that appear at least `s` times - the frequent items.

Pass 2: read the baskets again and count in main memory only those pairs whose items were both found to be frequent in Pass 1.

Repeat the process, growing the itemset size and keeping only sets built from itemsets already found to be frequent.

C1 = all items.

In general, L_k are the members of C_k with support greater than or equal to `s`.

C_(k+1) includes the (k+1)-item sets every k-item subset of which is in L_k.
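A minimal in-memory sketch of the first two passes (items, then pairs) on the toy baskets from the earlier example:

```python
from itertools import combinations
from collections import Counter

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"}]
s = 3

# Pass 1: count single items, keep the frequent ones (L1).
item_counts = Counter(item for basket in baskets for item in basket)
L1 = {item for item, c in item_counts.items() if c >= s}

# Pass 2: count only pairs whose items are both in L1 (candidates C2).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & L1), 2):
        pair_counts[pair] += 1
L2 = {pair for pair, c in pair_counts.items() if c >= s}
print(L1, L2)
```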
13. A-Priori at scale
- Takes only two passes over the data.
- SON: Savasere, Omiecinski, and Navathe.
  - Adaptable to a distributed data model (MapReduce).
  - Repeatedly read small subsets of the baskets into main memory and run A-Priori on each subset, using a support threshold scaled down by the number of subsets (i.e., proportional to the fraction of baskets in each subset).
  - Aggregate all candidate itemsets and determine which are frequent in the entire data set.
14. Hands-on: SON on Spark
- Download the small movie dataset
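No code is provided for this hands-on, so here is a hedged two-pass SON sketch in PySpark for items and pairs only. It assumes the `sc` SparkContext from the setup cells and an RDD named `baskets` of Python sets (for example, the set of movies each user rated); building that RDD from the movie data set is left to the exercise.

```python
from itertools import combinations
from collections import Counter

def local_frequent(partition, fraction_support):
    """Run a simple in-memory pass (items and pairs) on one chunk of baskets."""
    chunk = list(partition)
    threshold = fraction_support * len(chunk)
    counts = Counter()
    for basket in chunk:
        for k in (1, 2):
            counts.update(combinations(sorted(basket), k))
    return [itemset for itemset, c in counts.items() if c >= threshold]

s = 0.05                      # support as a fraction of all baskets (assumption)
total = baskets.count()

# Phase 1: candidates = union of locally frequent itemsets per partition.
candidates = baskets.mapPartitions(lambda p: local_frequent(p, s)) \
                    .distinct().collect()
bc = sc.broadcast(set(candidates))

# Phase 2: count each candidate over the full data and keep the truly frequent.
frequent = baskets.flatMap(
        lambda basket: [(c, 1) for c in bc.value if set(c) <= basket]) \
    .reduceByKey(lambda a, b: a + b) \
    .filter(lambda kv: kv[1] >= s * total)
print(frequent.take(10))
```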
15. Hands-on: Processing XML data
- Download the NSF Awards Data between 2011 and 2021.
- Review the XML Schema
- What is an interesting question?
{% include links.md %}
Content from Clustering
Last updated on 2024-07-26
Overview
Questions
- How do we group similar data points at scale?
Objectives
- Explain distance measures and hierarchical and k-means clustering.
Clustering
1. General problem statement:
- Given a set of data points, with a notion of distance between points, group the points into some number of clusters so that:
- Members of a cluster are close/similar to each other.
- Members of different clusters are dissimilar.
- Usually
- Points are in high-dimensional space (observations have many attributes).
- Similarity is defined using a distance measure: Euclidean, Cosine, Jaccard, edit distance …
2. Clustering is a hard problem
- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- In most cases, looks are not deceiving.
- But:
- Many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different.
3. Example:
- Clustering Sky Objects:
- A catalog of 2 billion sky objects represents objects by their radiation in 7 dimensions (frequency bands)
- Problem: cluster into similar objects, e.g., galaxies, stars, quasars, etc.
- Clustering music albums
  - Music divides into categories, and customers prefer a few categories.
    - But are categories simply genres?
  - Similar albums have similar sets of customers, and vice versa.
- Clustering documents
  - Group together documents on the same topic.
  - Documents with similar sets of words may be about the same topic.
  - Dual formulation: a topic is a group of words that co-occur in many documents.
4. Cosine, Jaccard, Euclidean
- Different ways of representing documents or music albums lead to different distance measures.
- Document as a set of words:
  - Jaccard distance.
- Document as a point in a space of words:
  - x_i = 1 if word `i` appears in the doc.
  - Euclidean distance.
- Document as a vector in a space of words:
  - Vector from the origin to …
  - Cosine distance.
5. Overview: methods of clustering
- Hierarchical:
- Agglomerative (bottom up): each point is a cluster, repeatedly combining two nearest cluster.
- Divisive (top down): start with one cluster and recursively split it.
- Point assignment:
- Maintain a set of clusters
  - Points belong to the nearest cluster.
6. Point assignment: K-means clustering
- Assumes a Euclidean space/distance.
- Pick `k`, the number of clusters.
- Initialize the clusters by picking one point per cluster.
- Until convergence:
  - For each point, place it in the cluster whose current centroid it is nearest.
    - A cluster centroid has its coordinates calculated as the averages of all its points' coordinates.
  - After all points are assigned, update the locations of the centroids of the `k` clusters.
  - Reassign all points to their closest centroid.
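A minimal sketch of one way to run these iterations, using Spark MLlib's built-in k-means rather than writing the loop by hand; the two-dimensional points here are made up for illustration, and `sc` is assumed to be an existing SparkContext.

```python
from pyspark.mllib.clustering import KMeans

# Toy two-dimensional points (made-up data); in practice this would be an RDD
# built from your real feature vectors.
points = sc.parallelize([[0.0, 0.0], [0.5, 0.5], [9.0, 9.0], [9.5, 8.5]])

# Train k-means with k = 2 clusters using Euclidean distance.
model = KMeans.train(points, k=2, maxIterations=20)

print(model.clusterCenters)            # the final centroids
print(model.predict(points).collect()) # cluster id assigned to each point
```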
7. The big question
- How do we select `k`?
- Try different values of `k`, looking at the change in the average distance to the centroid as `k` increases.
- Approach 1: sampling
  - Cluster a sample of the data using hierarchical clustering to obtain `k` clusters.
  - Pick a point from each cluster (e.g., the point closest to the centroid).
  - The sample fits in main memory.
- Approach 2: pick a dispersed set of points
  - Pick the first point at random.
  - Pick the next point to be the one whose minimum distance from the selected points is as large as possible.
  - Repeat until we have `k` points.
Content from Recommendation Systems
Last updated on 2024-07-26
Overview
Questions
- How do recommendation systems suggest items to users?
Objectives
- Describe the main approaches to building recommendation systems.
Content from Distributed Machine Learning with Spark
Last updated on 2024-07-26
Overview
Questions
- How can we train machine learning models on distributed data with Spark?
Objectives
- Explain linear classifiers and support vector machines and train a classifier with Spark.
Distributed machine learning with Spark
1. Application: Spam Filtering
| | viagra | learning | the | dating | nigeria | spam? |
|---|---|---|---|---|---|---|
| X1 | 1 | 0 | 1 | 0 | 0 | y1 = 1 |
| X2 | 0 | 1 | 1 | 0 | 0 | y2 = -1 |
| X3 | 0 | 0 | 0 | 0 | 1 | y3 = 1 |
- Instance spaces X1, X2, X3 belong to the set X (data points).
- Binary or real-valued feature vector X of word occurrences.
  - `d` features (words and other things); d is approximately 100,000.
- Class Y
- Spam = 1
- Ham = -1
2. Linear models for classification

- Vector Xj contains real values.
  - The Euclidean norm of each vector is 1.
  - Each vector has a label yj.
- The goal is to find a vector W = (w1, w2, …, wd), with each wj a real number, such that the labeled points are clearly separated by a line.
- Dot is spam, minus is ham!
3. Linear classifiers
- Each feature `i` has a weight wi.
- Prediction is based on the weighted sum f(x) = Σi wi xi.
- If f(x) is:
  - Positive: predict +1
  - Negative: predict -1
4. Support Vector Machine
- Originally developed by Vapnik and collaborators as a linear classifier.
- Can be modified to support non-linear classification by mapping data into high-dimensional spaces.
- Problem statement:
  - We want to separate `+` from `-` using a line.
  - Training examples: pairs (xi, yi), where xi is a feature vector and yi is +1 or -1.
  - Inner product: w · x = Σj wj xj.
- Which is the best linear separator defined by w?
5. Support Vector Machine: largest margin
- The distance from the separating line corresponds to the confidence of the prediction.
  - For example, we are more sure about the class of `A` and `B` than of `C`.
- Margin γ: the distance from the separating line to the closest training point.
- Maximizing the margin while identifying `w` is good according to intuition, theory, and practice.
- A math question: how do you narrate this equation?
6. Support Vector Machine: what is the margin?
Slide from the book
- Notation:
  - Gamma (γ) is the distance from point A to the linear separator L: d(A, L) = |AH|.
  - If we select a random point M on line L, then d(A, L) is the projection of AM onto the vector `w`.
  - If we assume the normalized Euclidean norm of `w`, |w|, is equal to one, that brings us to the result in the slide.
- In other words, maximizing the margin is directly related to how `w` is chosen.
  - For the i-th data point, the margin is γi = yi(w · xi + b).
7. Some more math …
- After some more mathematical manipulations, everything comes back to an optimization problem on `w`: minimize (1/2)·||w||^2 subject to yi(w · xi + b) >= 1 for all i.
.
8. SVM: Non-linearly separable data
- For each data point:
  - If the margin is greater than 1, we don't care.
  - If the margin is less than 1, we pay a linear penalty.
- Introducing slack variables ξi: minimize (1/2)·||w||^2 + C·Σi ξi subject to yi(w · xi + b) >= 1 - ξi and ξi >= 0.
.
9. Hands-on: SVM
- Download the bank marketing data set from https://www.cs.wcupa.edu/lngo/data/bank.csv
- Activate the `pyspark` conda environment, install `pandas`, then launch Jupyter notebook:

$ conda activate pyspark
$ conda install -y pandas
$ jupyter notebook

- Create a new notebook using the `pyspark` kernel, then change the notebook's name to `spark-7`.
- Copy the code from `spark-6` to set up and launch a Spark application.
- Documentation:
- Question: Can you predict whether a client will subscribe to a term deposit (feature `deposit`)?
- Problems:
  - What form should the bank data be converted to?
  - How do we handle categorical data?
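One possible starting point is a hedged sketch using Spark's DataFrame-based ML API rather than the low-level RDD API. It assumes a SparkSession built on top of the existing `sc`, and that `bank.csv` has a header row with a `deposit` label column plus a mix of numeric and categorical columns; the specific column names below are assumptions, so inspect `df.printSchema()` and adjust them to the actual file.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.getOrCreate()          # reuses the existing SparkContext
df = spark.read.csv("../data/bank.csv", header=True, inferSchema=True)

# Column names below are assumptions -- check df.printSchema() first.
categorical = ["job", "marital", "education"]
numeric = ["age", "balance", "duration"]

# Index the label and each categorical feature, then assemble a feature vector.
stages = [StringIndexer(inputCol="deposit", outputCol="label")]
stages += [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in categorical]
stages += [VectorAssembler(inputCols=[c + "_idx" for c in categorical] + numeric,
                           outputCol="features")]
stages += [LinearSVC(featuresCol="features", labelCol="label", maxIter=50)]

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train)
model.transform(test).select("label", "prediction").show(10)
```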
{% include links.md %}