Page Rank

Last updated on 2024-07-26 | Edit this page

Overview

Questions

  • How does Linux come to be?

Objectives

  • Explain the historical development of Linux

Page Rank

1. Web page organization in the past

  • Web pages were manually curated and organized.
  • Does not scale.
  • Next, web search engines were developed: information retrieval
  • Information retrieval focuses on finding documents from a trusted set.
  • Challenges for web search:
    • Who to trust given the massive variety of information sources?
      • Trustworthy pages may point to each other.
    • What is the best answer to a keyword search (given a context)?
      • Pages using the keyword in the same contexts tend to point to each other.
    • All pages are not equally important.
  • Web pages/sites on the Internet can be represented as a graph:
    • Web pages/sites are nodes on the graph.
    • Links from one page to another represents the incoming/outgoing connections between nodes.
    • We can compute the importance of nodes in this graph based on the distribution/intensity of these links.

2. PageRank: initial formulation

  • Link as votes:
    • A page (node) is more important if it has more links.
      • Incoming or outgoing?
    • Think of incoming links as votes.
  • Are all incoming links equals?
    • Links from important pages count more.
  • Each link’s vote is proportional to the importance of its source page.
  • If page j with importance rj has n outgoing links, each outgoing link has rj/n votes.
  • The importance of page j is the sum of the votes on its incoming links.

3. PageRank: the flow model

  • Summary:
    • A vote from an important page is worth more.
    • A page is important if it is linked to by other important pages.
  • Flow equations:
    • ry = ry/2 + ra/2
    • ra = ry/2 + rm
    • rm = ra/2
  • General equation for calculating rank rj for page j:
  • Three equations (actually two), three unknown.
    • No unique solutions
  • Additional constraint to force uniqueness:
    • ry + ra + rm = 1
  • Gaussian elimination:
    • ry = 2/5, ra = 2/5, rm = 1/5
  • Does not scale to Internet-size!

4. PageRank: matrix formulation

  • Setup the flow equations as a stochastic adjacency matrix M
  • Matrix M of size N: N is the number of nodes in the graph (web pages on the Internet).
    • Let page i has *di outgoing links.
    • If there is an outgoing link from i to j, then
      • Mij = 1/di
      • else Mij = 0
    • Stochastic: sum of all values in a column of M is equal to 1.
    • Adjacency: non-zero value indicates the availability of a link.
  • Rank vector r: A vector with an entry per page.
    • The order of pages in the rank vector should be the same as the order of pages in rows and columns of M.
  • Flow equations:
    • ry = ry/2 + ra/2
    • ra = ry/2 + rm
    • rm = ra/2
  • General equation for calculating rank rj for page j with regard to all pages:
  • This can be rewritten in matrix form:
  • Visualization:
    • Suppose page i has importance ri and has outgoing links to three other pages, including page j*.
  • Final perspective:
    • Also, this is why we need to study advanced math …

5. PageRank: power iteration

  • We want to find the page rank value r
  • Power method: an iterative scheme.
    • Support there are N web pages on the Internet.
    • All web pages start out with same importance: 1/N.
    • Rank vector r: r(0) = [1/N, …,1/N]T
    • Iterate: r(t+1) = M • r
    • Stopping condition: r(t+1) - r(t) < some small positive error threshold e.
  • Example:

6. PageRank: the Google formulation

  • The three questions
    • Does the above equation converge?
    • Does it converge to what we want?.
    • Are the results reasonable?

## Challenge 6.1: Does it converge: - The spider trap problem - Build the stochastic adjacency matrix for the following: graph and calculate the ranking for a and b.

Solution:

{: .solution} {: .challenge}

## Challenge 6.2: Does it converge to what we want?: - The dead end problem - Build the stochastic adjacency matrix for the following: graph and calculate the ranking for a and b.

Solution:

{: .solution} {: .challenge}

  • The solution: random teleport.
  • At each time step, the random surfer has two options:
    • A probably of β to follow an out-going link at random.
    • A probably of 1 - β to jump to some random page.
    • The value of β are between 0.8 to 0.9.
  • The surfer will teleport out of the spider trap within a few time steps.
  • The surfer will definitely teleport out of the dead-end.
  • Final equation for PageRank:

7. Hands-on: Page Rank in Spark

  • Create a file called small_graph.dat with the following contents:
y y
y a
a y
a m
m a
  • This data file describes the graph in the easlier slides.
  • Open a terminal.
  • Activate the pyspark conda environment, then launch Jupyter notebook
$ conda activate pyspark
$ conda install -y psutil
$ jupyter notebook
  • Create a new notebook using the pyspark kernel, then change the notebook’s name to spark-3.
  • Copy the code from spark-1 to setup and launch a Spark application.

8. Hands-on: Page Rank in Spark - Hollins dataset

  • Download the Hollins dataset
  • Hollins University webbot crawl in 2004.
  • Which page is most important (internally).

{% include links.md %}