Distributed and Cluster Computing

The demand for computational speed

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why do we need faster computing platforms?

Objectives
  • Know standard computational speed measurements

  • Know representative examples of how large-scale computation serves science, industry, business, and society.

How do we measure speed in computation?

  • Floating-Point Operations per Second (FLOPS)
    • Counts the number of floating-point calculations (arithmetic operations) per second.
  • Not MIPS (millions of instructions per second), as MIPS also counts non-arithmetic operations such as data movement and conditional checks.
  • MFLOPS (megaFLOPS) = 1,000,000 FLOPS
  • GFLOPS (gigaFLOPS) = 1,000,000,000 FLOPS

Modern measurement of speed

  • TFLOPS (teraFLOPS) = 1,000,000,000,000 FLOPS
    • Intel’s ASCI Red for Sandia National Laboratory (DOE) was the first supercomputer in the world to achieve 1 TFLOPS in 1997.
    • ASCI Red was used for large-scale simulations in nuclear weapons development and materials analysis.
  • PFLOPS (petaFLOPS) = 1,000,000,000,000,000 FLOPS
    • IBM RoadRunner for Los Alamos National Laboratory (DOE) was the first supercomputer to achieve 1 PFLOPS in 2008.
    • Oak Ridge National Laboratory’s Summit (DOE) is the second fastest supercomputer in the world at 148.6 PFLOPS.
    • Fugaku of RIKEN Center for Computational Science (Japan) is the current fastest supercomputer at 415 PFLOPS.
  • EFLOPS (exaFLOPS) = 1,000,000,000,000,000,000 FLOPS
    • Aurora (> 1 EFLOPS, 2021, Argonne National Lab)
    • Frontier (> 1.5 EFLOPS, 2021, Oak Ridge National Lab)
    • El Capitan (> 2 EFLOPS, 2023, Lawrence Livermore National Lab)
    • Tianhe 3 (first China Exascale computer, 2020, National University of Defense Technology)
    • Subsequent update to Fugaku (first Japan Exascale computer, 2021, RIKEN Center for Computational Science)
    • Europe, India, Taiwan on track

The bragging list (The TOP500 Project)

  • Lists the 500 most powerful computers in the world.
  • Counts FLOPS by having each supercomputer run a well-known, computationally intensive task:
    • Solve Ax=b, dense random matrix problem.
    • Primarily dense matrix-matrix multiplications.
  • Updated twice a year:
    • International Supercomputing conference (June, Germany)
    • Supercomputing conference (November, US).
  • Website: http://www.top500.org

Why do we need this much speed?

  • The four modern paradigms of science
    • Theory
    • Experiment
    • Simulation
    • Data analysis
  • Simulation: study things that are too big, too small, too fast, too slow, too expensive, or too dangerous.
  • Data analysis: study data that are too big, too complex, too fast (streaming data), or too noisy.

Faster computer gives more details

Hurricane Sandy (2012)

  • the deadliest and most destructive, as well as the strongest hurricane of the 2012 Atlantic hurricane season,
  • the second-costliest hurricane on record in the United States (nearly $70 billion in damage in 2012),
  • affected 24 states, with particularly severe damage in New Jersey and New York,
  • hit New York City on October 29, flooding streets, tunnels, and subway lines and cutting power in and around the city.

Forecast of Hurricane Sandy on October 26, 2012

Various forecasts of Sandy

Various prediction models for Sandy

  • Geophysical Fluid Dynamics Laboratory hurricane model (National Oceanic and Atmospheric Administration)
  • Hurricane Weather Research and Forecasting model (NOAA/Naval Research Laboratory/Florida State University)
  • European Centre for Medium Range Weather Forecast model (ECMWF)
  • Global Forecast Model (National Weather Service)
  • Which model is closest to reality?

One of the contributing factors

Computational power for weather forecasting in 2013

  • http://blogs.agu.org/wildwildscience/2013/02/17/seriously-behind-the-numerical-weather-prediction-gap/

The US has caught up

  • NOAA Weather Computer upgrade
  • Two new supercomputers, Luna and Surge
  • 2.89 PFLOPS each for a total of 5.78 PFLOPS (previous generation is only 776 TFLOPS)
  • Increase water quantity forecast from 4000 locations to 2.7 million locations (700-fold increase in spatial density)
  • Can track and forecast 8 storms at any given time
  • A $44.5 million investment

Covid and HPC

  • The US COVID-19 HPC Consortium was formed to contribute computing resources to COVID research.
  • Much other work (including papers published in Nature) uses supercomputing for molecular dynamics simulation and data analysis.

Manufacturing and HPC

  • For 767 development, Boeing built and tested 77 physical prototypes for wing design.
  • For 787 development, only 11 prototypes were built.
    • Optimized via more than 800,000 hours of computer simulation.

Oil and Gas Exploration improvement with HPC

  • Los Alamos National Lab
  • Development of large-scale data analytic techniques to simulate and predict subsurface fluid distribution, temperature, and pressure
  • This reduces the need for observation wells (has demonstrated commercial success)

Fraud Detection at PayPal

  • 10M+ logins, 13M+ transactions, 300 variables per event
  • ~4B inserts, ~8B selects
  • MPI-like applications, Lustre parallel file system, Hadoop.
  • Saved over $700M in fraudulent transactions during the first year of deployment.

Key Points

  • Computational speed is critical in solving humanity’s growing complex problems across all aspects of society.


Introduction to parallel and distributed computing

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is parallel computing?

Objectives
  • Understand basic architectures and computing models for parallel computing.

  • Understand and be able to carry out rudimentary speedup and efficiency studies.

Envision the payroll problem

A simple payroll problem

Components of a computation problem

  • Computational task
  • Execution framework.
  • Computing resources.

Parallelizing payroll

Parallelizing payroll

Computational tasks should be able to …

  • Be broken apart into discrete pieces of work that can be solved simultaneously.
  • Be solved in less time with multiple computing resources than with a single computing resource.

Execution framework should be able to …

  • Execute multiple program instructions concurrently at any moment in time

Computing resources might be …

  • A single computer with multiple processors.
  • An arbitrary number of computers connected by a network.
  • A special computational component inside a single computer, separate from the main processors (GPU), or
  • Any combination of the above.

The progress

How do parallel and distributed computing resources evolve?

Single site, single computer, single core

Single-core computer

Single site, single computer, multiple cores

Multi-core computer

Single site, multiple computers, multiple cores

cluster of computers

Multiple sites, multiple computers, multiple cores, federated domains

grid computing

Multiple sites, multiple computers, multiple cores, virtual unified domain

cloud computing

Distributed computing systems

“A collection of individual computing devices that can communicate with each other.” (Attiya and Welch, 2004).


Can we just throw more computers at the problem?

  • Parallel speedup: how much faster the program becomes once some computing resources are added.
  • Parallel efficiency: Ratio of performance improvement per individual unit of computing resource.

Parallel speedup

  • Given p processors,
  • Speedup, S(p), is the ratio of the time it takes to run the program using a single processor to the time it takes to run the program using p processors: S(p) = t_s / t_p.
  • The time it takes to run the program using a single processor, t_s: the sequential run time.
  • The time it takes to run the program using multiple processors, t_p: the parallel run time.

Example 1

A program takes 30 seconds to run on a single-core machine and 20 seconds to run on a dual-core machine. What is the speedup of this program?

Solution

S(2) = t_s / t_p = 30 / 20 = 1.5

Theoretical max

  • Let f be the fraction of the program that is not parallelizable.
  • Assume no overhead.
  • Running the program using one processor will take time t_s.
  • The parallel run time, t_p, can be calculated as the time it takes to run the non-parallelizable fraction (f * t_s) plus the time to run the remaining parallelizable fraction ((1 - f) * t_s / p): t_p = f * t_s + (1 - f) * t_s / p.
  • If f = 0, this simplifies to t_p = t_s / p.
  • Assuming no overhead, this means that the run time is halved (the speed is doubled) each time we double the number of processors.
  • And so on … scaling number of processors

Amdahl’s Law

  • This brings us to Amdahl's Law, which quantifies speedup in terms of the number of processors p and the fraction of non-parallelizable code f:

    S(p) = 1 / (f + (1 - f) / p)
Parallel efficiency

  • The efficiency E is then defined as the ratio of the speedup S(p) over the number of processors p: E = S(p) / p.
  • E is often expressed as a percentage.
  • For example, E = 0.8 means the parallel efficiency is 80% (see the sketch below).
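
A small sketch (not one of the lesson's hands-on files) that evaluates Amdahl's Law and the corresponding efficiency for an assumed serial fraction f = 0.1:

#include <stdio.h>

/* Amdahl's Law: S(p) = 1 / (f + (1 - f) / p); efficiency E = S(p) / p. */
double speedup(double f, int p) {
  return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
  double f = 0.1;                       /* assumed non-parallelizable fraction */
  int procs[] = {1, 2, 4, 8, 16, 32};

  for (int i = 0; i < 6; i++) {
    int p = procs[i];
    double s = speedup(f, p);
    printf("p = %2d   S(p) = %5.2f   E = %3.0f%%\n", p, s, 100.0 * s / p);
  }
  return 0;
}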

Example 2

Suppose that 4% of my application is serial. What is my predicted speedup according to Amdahl’s Law on 5 processors?

Solution

S(5) = 1 / (0.04 + 0.96 / 5) = 1 / 0.232 ≈ 4.31


Example 3

Suppose that I get a speedup of 8 when I run my application on 10 processors. According to Amdahl’s Law:

  • What portion of my code is serial?
  • What is the speedup on 20 processors?
  • What is the efficiency on 5 processors? 20 processors?
  • What is the best speedup that I could achieve?

Serial portion

According to Amdahl's Law: 8 = 1 / (f + (1 - f) / 10)
So f + (1 - f) / 10 = 1 / 8 = 0.125, which gives 0.9f = 0.025, so f ≈ 0.028.
About 2.8% of the code is serial.

Speedup on 20 processors

S(20) = 1 / (0.028 + 0.972 / 20) ≈ 13.1

Efficiency

On 5 processors: S(5) = 1 / (0.028 + 0.972 / 5) ≈ 4.5, so E = 4.5 / 5 = 0.9 (90%).
On 20 processors: E = 13.1 / 20 ≈ 0.65 (65%).

Best speedup

As the number of processors grows, S(p) approaches 1 / f = 1 / 0.028 ≈ 36.

  • In other words, the best achievable speedup is about 36, so there is little benefit in using many more than 36 processors for this problem.

Limiting factors of parallel speedup

  • Non-parallelizable code.
  • Communication overhead.

If there is no limiting factor …

  • 0% non-parallelizable code.
  • No communication overhead.

Superlinear speedup

  • The unicorn of parallel and distributed computing.
  • Poor sequential reference implementation.
  • Memory caching.
  • I/O blocking.

Computer architecture: Flynn’s Taxonomy

Flynn's taxonomy

Types of distributed computing systems

  • Streaming SIMD extensions for x86 architectures.
  • Shared memory.
  • Distributed shared memory.
  • Heterogeneous computing (accelerators).
  • Message passing.

Streaming SIMD

Streaming simd

Shared memory

  • One process, multiple threads.
  • All threads have read/write access to the same memory.
  • Programming models:
    • Threads (pthread) - programmer manages all parallelism.
    • OpenMP: compiler directives and runtime handle most of the thread management.
    • Vendor libraries (e.g., Intel MKL - Math Kernel Library).

Heterogeneous computing

  • GPU
  • FPGA
  • Co-processors

GPU - graphics processing unit

  • Processing unit on graphics cards designed to support graphics rendering (numerical manipulation).
  • Significant advantage for certain classes of scientific problems.
  • Programming models:
    • CUDA: Library developed by NVIDIA for their GPUs.
    • OpenACC: Standard developed by NVIDIA, Cray, and the Portland Group (PGI).
    • C++ AMP: Extension to Visual C++ to direct computation to the GPU.
    • OpenCL: Open standard by the group that developed OpenGL.

FPGA - field-programmable gate array

  • Dynamically reconfigurable integrated circuit.
  • Expensive, difficult to program.
  • Power efficient, low heat.

Co-processors

  • Enables offloading of computationally intensive tasks from the main CPU.
  • Similar to GPUs, but can support a wider range of computational tasks.
  • Intel
    • Xeon Phi processor line.
    • PCIe-based add-on cards, but can also be used as standalone CPUs.
    • Unlike GPUs, the Intel Xeon Phi supports all programs targeted at standard x86 CPUs (with very minor modifications, if any).

Message passing distributed computing

  • Processes handle their own memory.
  • Data is passed between processes via messages.
    • Scales well.
    • Cluster can be built from commodity parts.
    • Cluster can easily be expanded.
    • Cluster can be heterogeneous.
  • Programming models:
    • MPI: standardized message passing library.
    • MPI + OpenMP: hybrid model.
    • MapReduce programming model for big data processing.

Benchmarking

  • LINPACK (Linear Algebra Package): Dense Matrix Solver
  • HPCC: High-Performance Computing Challenge.
    • HPL (LINPACK to solve linear system of equations)
    • DGEMM (Double precision general matrix multiply)
    • STREAM (Memory bandwidth)
    • PTRANS (Parallel matrix transpose to measure processors communication)
    • RandomAccess (random memory updates)
    • FFT (double precision complex discrete Fourier transform)
    • Communication bandwidth and latency
  • SHOC: Scalable heterogeneous computing
    • Non-traditional system (GPU)
  • TestDFSIO
    • I/O performance of MapReduce/Hadoop Distributed File System.

Ranking

  • TOP500: Rank the supercomputers based on their LINPACK score.
  • GREEN500: Rank the supercomputers with emphasis on energy usage (LINPACK/power consumption).
  • GRAPH500: Rank systems based on benchmarks designed for data-intensive computing.

Key Points

  • There are various parallel computing architectures/models.

  • Users need to find the one that gives the best speedup/efficiency measures.

  • Speedup/efficiency depends on how well programs are parallelized.


Introduction to OpenMP

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is the OpenMP model?

Objectives
  • Know the target hardware architecture for OpenMP

  • Be able to compile and run an OpenMP program.

  • Understand the concept of parallel regions.

Target hardware

  • Single computing node, multiple sockets, multiple cores.
  • Dell PowerEdge M600 Blade Server (multi-socket motherboard).
  • Intel Sandy Bridge CPU.
  • In summary
    • Node with up to four sockets.
    • Each socket has up to 60 cores.
    • Each core is an independent CPU.
    • Each core has access to all the memory on the node.

Target software

  • Provide wrappers for threads and fork/join model of parallelism.
    • Program originally runs in sequential mode.
    • When parallelism is activated, multiple threads are forked from the original process/thread (the master thread).
    • Once the parallel tasks are done, threads are joined back to the original process and return to sequential execution.
      threads/fork-join models
  • The threads have access to all data in the master thread. This is shared data.
  • The threads also have their own private memory stack.

Basic requirements to write, compile, and run an OpenMP program

  • The source code (C) needs to include omp.h: #include <omp.h>
  • The compile command needs the -fopenmp flag.
  • Specify the environment variable OMP_NUM_THREADS.

OMP directives

  • OpenMP must be told when to parallelize.
  • For C/C++, pragma is used to annotate:
    #pragma omp somedirective clause(value, othervalue)
    parallel statement;
    
  • or
    #pragma omp somedirective clause(value, othervalue)
    {
    parallel statement 1;
    parallel statement 2;
    ...
    }
    

Hands-on 1: Setup directory

  • Create a directory named csc466 inside your home directory, then change into that directory.
  • Next, create a directory called openmp, and change into that directory
    $ cd
    $ mkdir csc466
    $ cd csc466
    $ mkdir openmp
    $ cd openmp
    

create directories

Hands-on 2: Create hello_omp.c

  • In the EXPLORER window, right-click on csc466/openmp and select New File.
  • Type hello_omp.c as the file name and hit Enter.
  • Enter the following source code in the editor window:
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  /* Fork a team of threads giving them their own copies of variables */
  #pragma omp parallel 
  {
    /* Obtain thread number */
    int tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0) {
      int nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and disband */
}

create hello_omp.c

  • Line 1: Include omp.h to have libraries that support OpenMP.
  • Line 7: Declare the beginning of the parallel region. Pay attention to how the curly brackets are set up, compared to the other curly brackets.
  • Line 10: omp_get_thread_num gets the ID assigned to the thread and assigns it to a variable named tid of type int.
  • Line 15: omp_get_num_threads gets the number of threads in the current team (set via OMP_NUM_THREADS) and assigns it to a variable named nthreads of type int.

What’s important?

  • tid and nthreads.
  • They allow us to coordinate the parallel workloads.
  • Specify the environment variable OMP_NUM_THREADS.
$ export OMP_NUM_THREADS=4

Example: trapezoidal

  • Problem: estimate the integral of f(x) = x^2 on [a, b] using the trapezoidal rule.
  • With 4 threads: nthreads=4.
    • How to decide which thread will handle which segment?
    • How to get all results back together?
      trapezoidal rule

Hands-on 3: Trapezoid implementation

  • In the EXPLORER window, right-click on csc466/openmp and select New File.
  • Type trapezoid.c as the file name and hit Enter.
  • Enter the following source code in the editor window:
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  //init parameters and evaluators
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);
  int nthreads = atoi(argv[4]);
  double partial_sum[nthreads];
  double h = ((b - a) / nthreads);    

  omp_set_num_threads(nthreads);
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = 0.0;   /* initialize this thread's partial sum before accumulating */
    /* number of trapezoids per thread */
    int partial_n = N / nthreads;
    double delta = (b - a)/N;
    double local_a = a + h * tid;
    double local_b = local_a + delta;
    for (int i = 0; i < partial_n; i++) {
      partial_sum[tid] += (local_a * local_a + local_b * local_b) * delta / 2;
      local_a = local_b;
      local_b += delta;
    }
  } 
  double sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %.4f\n", sum);
  return 0;
}

create trapezoid.c

Hands-on 4: A bit more detailed

  • Modify trapezoid.c so that it looks like the code below.
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  //init parameters and evaluators
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);
  int nthreads = atoi(argv[4]);
  double partial_sum[nthreads];
  double h = ((b - a) / nthreads);    

  omp_set_num_threads(nthreads);
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = 0.0;   /* initialize this thread's partial sum before accumulating */
    /* number of trapezoids per thread */
    int partial_n = N / nthreads;
    double delta = (b - a)/N;
    double local_a = a + h * tid;
    double local_b = local_a + delta;
    for (int i = 0; i < partial_n; i++) {
      partial_sum[tid] += (local_a * local_a + local_b * local_b) * delta / 2;
      local_a = local_b;
      local_b += delta;
    }
    printf("Thread %d calculate a partial sum of %.4f from %.4f to %.4f\n", tid, partial_sum[tid], a + h*tid, local_a);
  } 
  double sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %.4f\n", sum);
  return 0;
}

modify trapezoid.c

Challenge 1:

Alter the trapezoid.c code so that the parallel region invokes a function to calculate the partial sum.

Solution

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

double trap(double a, double b, int N, int nthreads, int tid) {
  double h = ((b - a) / nthreads);  
  int partial_n = N / nthreads;
  double delta = (b - a)/N;
  double local_a = a + h * tid;
  double local_b = local_a + delta;
  double p_sum = 0;
  for (int i = 0; i < partial_n; i++) {
    p_sum += (local_a * local_a + local_b * local_b) * delta / 2;
    local_a = local_b;
    local_b += delta;
  }
  return p_sum;
}

int main (int argc, char *argv[]) {
   //init parameters and evaluators
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);
  int nthreads = atoi(argv[4]);
  double partial_sum[nthreads];

  omp_set_num_threads(nthreads);
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = trap(a, b, N, nthreads, tid) ;
  } 
  double sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %.4f\n", sum);
  return 0;
} 

Challenge 2:

  • Write a program called sum_series.c that takes an integer N (and the number of threads) as command line arguments and calculates the sum of the first N non-negative integers.
  • Speed up the summation portion by using OpenMP.
  • Assume N is divisible by the number of threads.

Solution

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int sum(int N, int nthreads, int tid) {
  int count = N / nthreads;  
  int start = count * tid + 1;
  int p_sum = 0;
  for (int i = start; i < start + count; i++) {
    p_sum += i;
  }
  return p_sum;
}

int main (int argc, char *argv[]) {
  int N = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  int partial_sum[nthreads];
   
  omp_set_num_threads(nthreads);
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = sum(N, nthreads, tid) ;
  } 
  int sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The sum of series is: %.4f\n", sum);
  return 0;
} 

Challenge 3:

  • Write a program called sum_series_2.c that takes an integer N (and the number of threads) as command line arguments and calculates the sum of the first N non-negative integers.
  • Speed up the summation portion by using OpenMP.
  • There is no assumption that N is divisible by the number of threads.

Solution

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int sum(int N, int nthreads, int tid) {
  int count = N / nthreads;
  int remainder = N % nthreads;   /* leftover integers when N is not divisible */
  int start = count * tid;
  int end = start + count;
  int p_sum = 0;

  for (int i = start; i < end; i++) {
    p_sum += i;
  }
  /* the leftover integers count*nthreads .. N-1 are handed out one per thread */
  if (tid < remainder) {
    p_sum += count * nthreads + tid;
  }

  return p_sum;
}

int main (int argc, char *argv[]) {
  int N = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  int partial_sum[nthreads];
   
  omp_set_num_threads(nthreads);
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = sum(N, nthreads, tid) ;
  } 
  int sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The sum of series is: %.4f\n", sum);
  return 0;
} 

Hands-on 5: Trapezoid implementation with timing

  • In the EXPLORER window, right-click on csc466/openmp and select New File.
  • Type trapezoid_time.c as the file name and hit Enter.
  • Enter the following source code in the editor window (you can copy the contents of trapezoid.c with the function from Challenge 1 as a starting point):
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main (int argc, char *argv[]) {
  //init parameters and evaluators
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);
  int nthreads = atoi(argv[4]);
  double partial_sum[nthreads];
  double h = ((b - a) / nthreads);  
  clock_t start, end;  

  omp_set_num_threads(nthreads);
  start = clock();
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = 0.0;   /* initialize this thread's partial sum before accumulating */
    /* number of trapezoids per thread */
    int partial_n = N / nthreads;
    double delta = (b - a)/N;
    double local_a = a + h * tid;
    double local_b = local_a + delta;
    for (int i = 0; i < partial_n; i++) {
      partial_sum[tid] += (local_a * local_a + local_b * local_b) * delta / 2;
      local_a = local_b;
      local_b += delta;
    }
  } 
  end = clock();
  double sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %.4f\n", sum);
  printf("The run time is: %.4f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
  return 0;
}

create trapezoid_time.c

  • How’s the run time?

Key Points


Introduction to XSEDE

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is XSEDE?

Objectives
  • Know the history, organization, and impact of XSEDE

  • Be able to access one XSEDE resource

XSEDE

  • Extreme Science and Engineering Discovery Environment
    • 2011 - now
    • 17 institutions at the beginning.
  • Successor of the TeraGrid program
    • 2001 - 2011.
    • 9 sites (UIUC, SDSC, ANL, CalTech, PSC, ORNL, Purdue, Indiana, TACC)
  • Unit of computation: SU (Service Unit)
    • 1 SU = 1 core-hour of computing time.

What is currently part of XSEDE

  • Stampede2 (TACC): 18 PFLOPS
    • 4,200 Intel Knights Landing (68 cores), 96GB memory
    • 1,736 Intel Xeon Skylake (48 cores), 192GB memory.
    • 31 PB storage.
  • Comet (SDSC): 2.76 PFLOPS
    • 1944 Intel Haswell (24 cores), 320 GB memory.
    • 72 Intel Haswell (24 cores) with GPUs (36 K80, 36 P100), 128GB memory.
    • 4 large memory nodes (1.5TB memory each), Intel CPU (64 cores).
    • 7.6 PB storage.
  • Bridges (PSC) - we use this.
    • 752 Intel Haswell (28 cores), 128GB memory.
    • 42 Intel Broadwell (36-40 cores), 3TB memory.
    • 4 Intel Broadwell (36-40 cores), 12 TB memory.
    • 16 Haswell nodes with dual K80 GPU, 32 Haswell nodes with dual P100 GPU.
    • 9 Intel Skylake nodes with 8 V100 GPUs per node, 128GB memory.
    • 1 DGX-2 (Intel Cascadelake) with 16 V100 GPU, 1.5 TB memory.
  • Jetstream (IU)
    • Cloud provider (baremetal) for academic research and education.
    • 640 nodes, totaling 7680 cores, 40TB memory, and 640 TB local storage.

What will be part of XSEDE in the next five years?

  • Jetstream2 (IU) - $10M
    • Predicted to be 8 PFLOPS.
  • Delta (NCSA) - $10M
    • TBD
  • Anvil (Purdue) - $10M
    • 1,000 AMD Epyc Milan (128 cores), 256GB memory.
    • 32 AMD Epyc Milan (128 cores), 1TB memory
    • 16 AMD Epyc Milan (128 cores), 256GB memory, 4 NVIDIA A100 GPUs.
  • Neocortex (PSC) - $5M.
    • Special purpose supercomputer built to support AI research.
    • 2 Cerebras CS-1: largest chip ever built with 400,000 cores, 16GB on-chip memory, and 9.6PB/s memory bandwidth.
    • 1 HPE Superdome gateway: 24 TB memory, 204.6 local storage, 24 100GB network cards, 16 Infiniband network cards.
  • Voyager (SDSC) - $5M.
    • TBD.

What do we have access to?

  • 50,000 SUs on Bridges RSM.
  • 500GB total storage on Bridges Pylon.

Hands-on 1: Login to PSC Bridges

  • Open a browser and go to https://ondemand.bridges.psc.edu.
  • Enter the Bridges user name and password from the confirmation email.


Hands-on 2: Login to PSC Bridges

  • Open a browser and go to https://ondemand.bridges.psc.edu.
  • Enter the Bridges user name and password from the confirmation email.

Login to ondemand

Hands-on 3: OnDemand navigation

  • Click on the Files and then Home Directory.

Go to Files

  • Top button row (Go To, Open in Terminal, New File, New Dir, Upload) operates at directory level.
  • Next row (View, Edit, Rename/Move, Copy, Paste) operates on files.

Files and Directory

Hands-on 4: Trapezoid implementation with timing

  • Click New Dir and enter openmp. Hit OK.

Create the openmp directory

  • Double click on openmp.
  • Click New File and enter trapezoid.c. Hit OK.
  • Select trapezoid.c and click Edit.
  • Type/copy the code below into the editor window.
  • Click Save to save the file.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double trap(double a, double b, int N, int nthreads, int tid) {
  double h = ((b - a) / nthreads);  
  int partial_n = N / nthreads;
  double delta = (b - a)/N;
  double local_a = a + h * tid;
  double local_b = local_a + delta;
  double p_sum = 0;
  for (int i = 0; i < partial_n; i++) {
    p_sum += (local_a * local_a + local_b * local_b) * delta / 2;
    local_a = local_b;
    local_b += delta;
  }
  return p_sum;
}

int main (int argc, char *argv[]) {
  //init parameters and evaluators
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);
  int nthreads = atoi(argv[4]);
  double partial_sum[nthreads];
  clock_t start, end;  

  omp_set_num_threads(nthreads);
  start = clock();
  #pragma omp parallel 
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = trap(a, b, N, nthreads, tid);
    printf("Thread %d has partial sum %f\n", tid, partial_sum[tid]);
  } 
  end = clock();
  double sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %.4f\n", sum);
  printf("The run time is: %.4f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
  return 0;
}

Editing trapezoid.c

Hands-on 5: Bridges shell access

  • Go back to the Open OnDemand tab and navigate to Bridges Shell Access under Clusters

Bridges Shell Access

  • This enables an in-browser terminal to a virtual login VM on Bridges.

Bridges Shell

Hands-on 6: Requesting interactive allocation on Bridges

  • Login node = lobby.
  • We need to request resources to run jobs.
  • interact: request an interactive allocation
  • -p: partition requested (always RM or RM-small)
  • -t: duration (HH:MM:SS)
  • -N: number of nodes (chunks)
  • --ntasks-per-node: number of threads/cores per node (per chunk)
  • --mem: amount of memory requested.
$ interact -p RM-small -t 00:30:00 -N 1 --ntasks-per-node=4 --mem=4GB
$ cd openmp
$ gcc -std=c11 -o trapezoid trapezoid.c -openmp
$ ./trapezoid 0 10 100000000 1
$ ./trapezoid 0 10 100000000 2
$ ./trapezoid 0 10 100000000 4

interactive allocation

  • What is the problem?

Hands-on 7: Submitting batch job on Bridges

  • For long running jobs.
  • Switch to the File Explorer tab.
  • Inside openmp, click New File and enter trapezoid.sh. Hit OK.
  • Type/copy the code below into the editor window.
  • Click Save to save the file.
#!/bin/bash
#SBATCH -N 1
#SBATCH -p RM
#SBATCH --ntasks-per-node 28
#SBATCH -t 00:20:00

# echo commands to stdout 
set -x
cd ~/openmp

# build and trapezoid
gcc -std=c11 -o trapezoid trapezoid.c -openmp
./trapezoid 0 10 100000000 1
./trapezoid 0 10 100000000 2
./trapezoid 0 10 100000000 4
./trapezoid 0 10 100000000 8
./trapezoid 0 10 100000000 16
./trapezoid 0 10 100000000 20

Batch submission script

  • To submit the batch script:
  • This must be done on the login node. If you are still on a compute (r) node, you need to exit back to a login node.
$ cd openmp
$ sbatch trapezoid.sh
  • Be careful with copy/pasting.
  • Use dos2unix to remove DOS line breaks if that happens.

Submit batch job

Job status

  • An output report file will be created. The filename format is slurm-JOBID.out

Job report

  • Use more to view the content of the output file. You can also use Edit or View to view this file on the File Explorer tab.
  • The professor made a mistake in the submission script. What is it?
  • Why does the program still run even though there was an error in the gcc command?

Job report output

Key Points


OpenMP: parallel regions and loop parallelism

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Do we have to manually do everything?

Objectives
  • Know how to use parallel for pragma

Loop parallelism

  • Very common type of parallelism in scientific code
  • In the previous trapezoid example, we calculated the division of iterations manually.
  • An alternative is to use the parallel for pragma.

Hands-on 1: Sum series implementation

  • In the EXPLORER window, right-click on csc466/openmp and select New File.
  • Type sum_series_for.c as the file name and hit Enter.
  • Enter the following source code in the editor window:
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main (int argc, char *argv[]) {
  int N = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  int partial_sum[nthreads];
  clock_t start, end;

  omp_set_num_threads(nthreads);
  start = clock();
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = 0;
    #pragma omp for
    for (int i = 0; i <  N; i++) {
      partial_sum[tid] += i;
    }
  }
  end = clock();

  int sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %d\n", sum);
  printf("The run time is: %.4f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
  return 0;
}

compile and run sum_series_for.c

Hands-on 2: Improving sum series implementation

  • In the EXPLORER window, right-click on csc466/openmp and select New File.
  • Type sum_series_for_2.c as the file name and hit Enter.
  • Enter the following source code in the editor window:
  • Save the file when you are done:
    • Ctrl-S for Windows/Linux
    • Command-S for Macs
  • Memorize your key combos!
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main (int argc, char *argv[]) {
  int N = atoi(argv[1]);
  int nthreads = atoi(argv[2]);
  int partial_sum[nthreads];
  clock_t start, end;

  omp_set_num_threads(nthreads);
  start = clock();
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    partial_sum[tid] = 0;
    int psum = 0;
    #pragma omp for
    for (int i = 0; i <  N; i++) {
      psum += i;
    }
    partial_sum[tid] = psum;
  }
  end = clock();

  int sum = 0;
  for (int i = 0; i < nthreads; i++) {
    sum += partial_sum[i];
  }
  printf("The integral is: %d\n", sum);
  printf("The run time is: %.4f\n", ((double) (end - start)) / CLOCKS_PER_SEC);
  return 0;
}

compile and run sum_series_for_2.c

Key Points

  • Parallel for allows simplification of code


OpenMP: Work sharing and controlling thread data

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do OpenMP threads implicitly and explicitly share work among themselves?

  • How is data managed?

Objectives
  • Understand the basic work sharing constructs: for, sections, and single.

  • Understand thread-based data sharing in OpenMP

  • Understand the outcomes of OpenMP programs with different work sharing constructs.

1. Work sharing constructs

  • OpenMP utilizes work sharing constructs to facilitate dividing parallelizable work among a number of threads.
  • The work sharing constructs are:
    • for: divide loop iterations among threads.
    • sections: divide sections of codes among themselves.
    • single: the section is executed by a single thread.

2. Work sharing construct: sections

  • Used to parallelize a predetermined number of independent work units.
  • Within an enclosing sections construct, there can be multiple section constructs.
  • A section can be executed by any available thread in the current team, including having multiple sections done by the same thread.

3. Hands on: sections

  • In the EXPLORER window, double-click on csc466/openmp and select New Dir to create a new directory in openmp called sections.
  • Inside sections, create a file named hello_sections.c with the following contents:
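
A minimal sketch of what hello_sections.c could look like, assuming three independent sections that each print the ID of the thread executing them (the exact lesson listing may differ):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { printf("Section 1 executed by thread %d\n", omp_get_thread_num()); }

      #pragma omp section
      { printf("Section 2 executed by thread %d\n", omp_get_thread_num()); }

      #pragma omp section
      { printf("Section 3 executed by thread %d\n", omp_get_thread_num()); }
    }
  }
  return 0;
}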

compile and run hello_sections.c

4. Challenge

Given the following function: y = x^4 + 15x^3 + 10x^2 + 2x,
develop an OpenMP program called poly_openmp.c with sections/section directives. Each section should handle the calculation of one term of the polynomial.

Solution
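
One possible sketch, assuming x is read as a command line argument and each of the four terms is computed in its own section:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  double x = atof(argv[1]);
  double t1, t2, t3, t4;   /* shared: each is written by exactly one section */

  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      { t1 = x * x * x * x; }      /* x^4  */

      #pragma omp section
      { t2 = 15 * x * x * x; }     /* 15x^3 */

      #pragma omp section
      { t3 = 10 * x * x; }         /* 10x^2 */

      #pragma omp section
      { t4 = 2 * x; }              /* 2x   */
    }
  }
  printf("y = %.4f\n", t1 + t2 + t3 + t4);
  return 0;
}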

5. Work sharing construct: single

  • Limits the execution of a block to a single thread.
  • All other threads will skip the execution of this block but wait until the block is finished before moving on.
  • To allow the other threads to proceed without waiting, a nowait clause can be added.

6. Hands on: single

  • Inside sections, create the following files:

hello_sections_nosingle.c:

hello_sections_single.c:

hello_sections_single_nowait.c:
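
A sketch along the lines of hello_sections_single_nowait.c, assuming it contrasts a single block (with a nowait clause) against work done by every thread; the nosingle and single variants would omit the single pragma or the nowait clause, respectively:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    #pragma omp single nowait
    {
      /* executed by exactly one thread; nowait lets the others continue immediately */
      printf("Single block executed by thread %d\n", tid);
    }

    /* executed by every thread in the team */
    printf("Thread %d is past the single block\n", tid);
  }
  return 0;
}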

Compile and run the above files:

compile and run singles

7. Shared and private data

  • Data declared outside of a parallel region will be shared among all threads.
  • Data declared inside of a parallel region will be private to each individual thread.

8. Hands on: problem with shared data

  • Inside sections, create a file named counter_openmp.c with the following contents:
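
A minimal sketch of what counter_openmp.c might contain, assuming the demonstration is a shared counter incremented by all threads without synchronization, so the updates race and the final value is usually wrong:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  int counter = 0;               /* declared outside the parallel region: shared */

  #pragma omp parallel
  {
    for (int i = 0; i < 100000; i++) {
      counter++;                 /* unsynchronized read-modify-write: a data race */
    }
  }
  /* With 4 threads the expected value is 400000, but the printed value is
     usually smaller because increments are lost. */
  printf("Final counter value: %d\n", counter);
  return 0;
}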

compile and run counter_openmp.c

Key Points


Introduction to MPI

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives
  • Understand the origin of message passing interface (MPI)

  • Know the basic MPI routines to identify rank and get total number of participating processes.

  • Understand how to leverage process rank and total process count for work assignment.

1. Message passing

  • Processes communicate via messages
  • Messages can be:
    • Raw data used in actual calculations
    • Signals and acknowledgements for the receiving processes regarding the workflow.

2. History of MPI: early 80s

  • Various message passing environments were developed.
  • Many similar fundamental concepts:
    • Cosmic Cube and nCUBE/2 (Caltech),
    • P4 (Argonne),
    • PICL and PVM (Oak Ridge),
    • LAM (Ohio SC)

3. History of MPI: 1991-1992

  • More than 80 researchers from different institutions in the US and Europe agreed to develop and implement a common standard for message passing.
  • The first meeting of the working group was hosted at the Supercomputing 1992 conference in Minnesota.

MPI letters

4. After finalization of working technical draft

  • MPI becomes the de-facto standard for distributed memory parallel programming.
  • Available on every popular operating system and architecture.
  • Interconnect manufacturers commonly provide MPI implementations optimized for their hardware.
  • MPI standard defines interfaces for C, C++, and Fortran.
  • Language bindings available for many popular languages (quality varies)
    • Python (mpi4py)
    • Java (no longer active)

5. 1994: MPI-1

  • Communicators
    • Information about the runtime environments
    • Creation of customized topologies
  • Point-to-point communication
    • Send and receive messages
    • Blocking and non-blocking variations
  • Collectives
    • Broadcast and reduce
    • Gather and scatter

6. 1998: MPI-2

  • One-sided communication (non-blocking)
    • Get & Put (remote memory access)
  • Dynamic process management
    • Spawn
  • Parallel I/O
    • Multiple readers and writers for a single file
    • Requires file-system level support (LustreFS, PVFS)

7. 2012: MPI-3

  • Revised remote-memory access semantic
  • Fault tolerance model
  • Non-blocking collective communication
  • Access to internal variables, states, and counters for performance evaluation purposes

8. Hands-on: taz and submitty

  • Your account (login/password) will work on both taz and submitty.
  • USERNAME represents the login name that you received in email.
  • To access taz from a terminal:
$ ssh USERNAME@taz.cs.wcupa.edu
  • To access submitty from a terminal:
$ ssh USERNAME@submitty.cs.wcupa.edu
  • The environments on taz and submitty are similar to one another. In the remainder of these lectures, example screenshots will be taken from submitty, but all commands will work on taz as well.

9. Hands-on: create and compile MPI codes

  • Create a directory named intro-mpi
  • Change into intro-mpi
$ cd
$ mkdir intro-mpi
$ cd intro-mpi
  • To create a file from terminal, run nano -c file_name.
  • When finished editing, press Ctrl-X to quit and save.
  • Press Y to confirm that you want to save.
  • Press Enter to confirm that you are saving to file_name.
  • Inside intro-mpi, create a file named first.c with the following contents (see the sketch below)
  • Compile and run first.c:
$ mpicc -o first first.c
$ mpirun -np 1 ./first
$ mpirun -np 2 ./first
$ mpirun -np 4 ./first
  • Both taz and submitty only have four computing cores, therefore we can (and should) only run a maximum of four processes.
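
A minimal sketch of what first.c might contain, assuming it simply enters and leaves message-passing mode and prints a line from each process:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  /* Enter message-passing mode: all processes are active after this call */
  MPI_Init(&argc, &argv);

  printf("Hello from an MPI process\n");

  /* Leave message-passing mode */
  MPI_Finalize();
  return 0;
}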

compile and run first.c

10. MPI in a nutshell

  • All processes are launched at the beginning of the program execution.
    • The number of processes is user-specified
    • This number could be modified during runtime (MPI-2 standards)
    • Typically, this number is matched to the total number of cores available across the entire cluster
  • All processes have their own memory space and have access to the same source codes.
  • MPI_Init: indicates that all processes are now working in message-passing mode.
  • MPI_Finalize: indicates that all processes are now working in sequential mode (only one process active) and there are no more message-passing activities.

11. MPI in a nutshell

  • MPI_COMM_WORLD: Global communicator
  • MPI_Comm_rank: return the rank of the calling process
  • MPI_Comm_size: return the total number of processes that are part of the specified communicator.
  • MPI_Get_processor_name: return the name of the processor (core) running the process.

12. MPI communicators (first defined in MPI-1)

MPI defines communicator groups for point-to-point and collective communications:

  • Unique IDs (rank) are defined for individual processes within a communicator group.
  • Communications are performed based on these IDs.
  • The default global communicator (MPI_COMM_WORLD) contains all processes.
  • For N processes, ranks go from 0 to N−1.

13. Hands-on: hello.c

  • Inside intro-mpi, create a file named hello.c with the following contents (see the sketch below)
  • Compile and run hello.c:
$ mpicc -o hello hello.c
$ mpirun -np 1 ./hello
$ mpirun -np 2 ./hello
$ mpirun -np 4 ./hello
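
A sketch of what hello.c might contain, assuming it reports each process's rank, the communicator size, and the processor name:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size, name_len;
  char name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* this process's ID */
  MPI_Comm_size(MPI_COMM_WORLD, &size);        /* total number of processes */
  MPI_Get_processor_name(name, &name_len);     /* name of the host running it */

  printf("Hello from rank %d out of %d on %s\n", rank, size, name);

  MPI_Finalize();
  return 0;
}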

compile and run hello.c

14. Hands-on: evenodd.c

  • In MPI, processes’ ranks are used to enforce execution/exclusion of code segments within the original source code.
  • Inside intro-mpi, create a file named evenodd.c with the following contents (see the sketch below)
  • Compile and run evenodd.c:
$ mpicc -o evenodd evenodd.c
$ mpirun -np 1 ./evenodd
$ mpirun -np 2 ./evenodd
$ mpirun -np 4 ./evenodd
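
A sketch of what evenodd.c might contain, assuming the parity of the rank selects which branch each process executes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank % 2 == 0) {
    printf("Process %d: I am an even-ranked process\n", rank);
  } else {
    printf("Process %d: I am an odd-ranked process\n", rank);
  }

  MPI_Finalize();
  return 0;
}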

compile and run evenodd.c

15. Hands-on: rank_size.c

  • In MPI, the values of ranks and size can be used as means to calculate and distribute workload (data) among the processes.
  • Inside intro-mpi, create a file named rank_size.c with the following contents (see the sketch below)
  • Compile and run rank_size.c:
$ mpicc -o rank_size rank_size.c
$ mpirun -np 1 ./rank_size
$ mpirun -np 2 ./rank_size
$ mpirun -np 4 ./rank_size
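
A sketch of what rank_size.c might contain, assuming rank and size are used to split a hypothetical workload of 100 items into contiguous chunks:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size;
  const int N = 100;                       /* hypothetical total workload */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int chunk = N / size;                    /* assume N is divisible by size */
  int start = rank * chunk;
  int end   = start + chunk;

  printf("Process %d of %d handles items %d to %d\n", rank, size, start, end - 1);

  MPI_Finalize();
  return 0;
}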

compile and run rank_size.c

Key Points


MPI: point-to-point, data types, and communicators

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can processes pass messages directly to one another (point-to-point)?

Objectives
  • Understand meaning of parameters in syntax of MPI_Send and MPI_Recv

  • Understand the proper ordering of p2p communication to avoid deadlock.

  • Know how to invoke MPI_Send and MPI_Recv to transfer data between processes.

1. Addresses in MPI

Data messages (objects) being sent/received (passed around) in MPI are referred to by their addresses:

  • Memory location to read from to send
  • Memory location to write to after receiving.

2. Parallel workflow

Individual processes rely on communication (message passing) to enforce workflow

  • Point-to-point communication: MPI_Send, MPI_Recv
  • Collective communication: MPI_Broadcast, MPI_Scatter, MPI_Gather, MPI_Reduce, MPI_Barrier.

3. Point-to-point: MPI_Send and MPI_Recv

  • int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
    • *buf: pointer to the address containing the data elements to be sent.
    • count: how many data elements will be sent.
    • MPI_Datatype: MPI_BYTE, MPI_PACKED, MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_UNSIGNED_CHAR, and other user-defined types.
    • dest: rank of the process these data elements are sent to.
    • tag: an integer identifying the message. The programmer is responsible for managing tags.
    • comm: communicator (typically just MPI_COMM_WORLD)
  • int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
    • *buf: pointer to the address containing the data elements to be written to.
    • count: how many data elements will be received.
    • MPI_Datatype: MPI_BYTE, MPI_PACKED, MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_UNSIGNED_CHAR, and other user-defined types.
    • source: rank of the process from which the data elements are to be received.
    • tag: an integer identifying the message. The programmer is responsible for managing tags.
    • comm: communicator (typically just MPI_COMM_WORLD)
    • *status: pointer to an address containing a special MPI_Status struct that carries additional information about the send/receive process.

4. Hands-on: send_recv.c

  • We want to write a program called send_recv.c that allows two processes to exchange their ranks:
    • Process 0 receives 1 from process 1.
    • Process 1 receives 0 from process 0.
  • Inside intro-mpi, create a file named send_recv.c with the following contents
  • Compile and run send_recv.c:
$ mpicc -o send_recv send_recv.c
$ mpirun -np 2 ./send_recv
  • Did we get what we want? Why?

compile and run send_recv.c

  • Correction: separate sending and receiving buffers.
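
A corrected sketch in that spirit, using separate send and receive buffers (and ordering the calls so that one side sends first):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, send_value, recv_value;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  send_value = rank;                 /* each process sends its own rank */
  int partner = 1 - rank;            /* works for exactly two processes */

  if (rank == 0) {
    MPI_Send(&send_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(&recv_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
  } else {
    MPI_Recv(&recv_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&send_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
  }

  printf("Process %d received %d\n", rank, recv_value);

  MPI_Finalize();
  return 0;
}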

5. Hands-on: p2p communication at scale

  • Rely on rank and size and math.
  • We want to shift the data elements with only message passing among adjacent processes.
  • Inside intro-mpi, create a file named multi_send_recv.c with the following contents
  • Compile and run multi_send_recv.c:
$ mpicc -o multi_send_recv multi_send_recv.c
$ mpirun -np 4 ./multi_send_recv
  • Did we get what we want? Why?

compile and run multi_send_recv.c

6. Hands-on: blocking risk

  • MPI_Recv is a blocking call.
  • Meaning the process will stop until it receives the message.
    • What if the message never arrives?
  • Inside intro-mpi, create a file named deadlock_send_recv.c with the following contents (see the sketch below)
  • Compile and run deadlock_send_recv.c:
$ mpicc -o deadlock_send_recv deadlock_send_recv.c
$ mpirun -np 2 ./deadlock_send_recv
  • What happens?
  • To get out of deadlock, press Ctrl-C.
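
A sketch of the deadlock pattern being demonstrated, assuming both processes post their blocking MPI_Recv before any MPI_Send, so each waits forever for a message that is never sent:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, send_value, recv_value;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  send_value = rank;
  int partner = 1 - rank;            /* run with exactly two processes */

  /* Both ranks block here waiting to receive; neither ever reaches MPI_Send. */
  MPI_Recv(&recv_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
  MPI_Send(&send_value, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);

  printf("Process %d received %d\n", rank, recv_value);   /* never reached */

  MPI_Finalize();
  return 0;
}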

compile and run deadlock_send_recv.c

  • Correction:
    • Pay attention to send/receive pairs.
    • The number of MPI_Send calls must always equal the number of MPI_Recv calls.
    • MPI_Send should be called first (preferably).

Key Points


MPI: Functional parallelism and collectives

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do multiple processes communicate as a group at the same time?

Objectives
  • Understand how collective communication work relative to process ranking.

  • Know how to properly invoke a collective MPI routine to carry out a data distribution/collection/reduction task.

1. Collective communication

  • Must involve ALL processes within the scope of a communicator.
  • Unexpected behavior, including program failure, can occur if even one process does not participate.
  • Types of collective communications:
    • Synchronization: barrier
    • Data movement: broadcast, scatter/gather
    • Collective computation (aggregate data to perform computation): Reduce

collective communication patterns

2. Hands-on: MPI_Bcast

  • int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
    • *buf:
      • If on root, pointer to the address containing the data elements to be broadcasted
      • If not on root, pointer to the address where broadcasted data to be stored.
    • count: how many data elements will be broadcasted.
    • MPI_Datatype: MPI_BYTE, MPI_PACKED, MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_UNSIGNED_CHAR, and other user-defined types.
    • root: rank of the process from which the original data will be broadcast.
    • comm: communicator (typically just MPI_COMM_WORLD)
  • Don’t need to specify a TAG or DESTINATION
  • Must specify the SENDER (root)
  • Blocking call for all processes
  • Inside intro-mpi, create a file named bcast.c with the following contents (see the sketch below)
  • Compile and run bcast.c:
$ mpicc -o bcast bcast.c
$ mpirun -np 4 ./bcast
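
A sketch of what bcast.c might contain, assuming the root initializes a value and broadcasts it so that every process can print its own copy:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, value = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    value = 42;                       /* only the root has the data initially */
  }

  /* After the call, every process's copy of value holds the root's data. */
  MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("Process %d has value %d\n", rank, value);

  MPI_Finalize();
  return 0;
}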

compile and run bcast.c

3. Hands-on: MPI_Scatter

  • int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
    • *sendbuf: pointer to the address containing the array of data elements to be scattered.
    • sendcount: how many data elements to be sent to each process of the communicator.
    • *recvbuf: pointer to the address on each process of the communicator, where the scattered portion will be written.
    • recvcount: how many data elements to be received by each process of the communicator.
    • root: rank of the process from which the original data will be scattered.
    • comm: communicator (typically just MPI_COMM_WORLD)
  • Inside intro-mpi, create a file named scatter.c with the following contents (see the sketch below)
  • Compile and run scatter.c:
$ mpicc -o scatter scatter.c
$ mpirun -np 4 ./scatter
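
A sketch of what scatter.c might contain, assuming the root builds an array with one integer per process and scatters it:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int rank, size, recv_value;
  int *send_data = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0) {
    /* Only the root needs the full array: one element per process */
    send_data = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) send_data[i] = 10 * i;
  }

  /* Each process (including the root) receives exactly one element. */
  MPI_Scatter(send_data, 1, MPI_INT, &recv_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("Process %d received %d\n", rank, recv_value);

  if (rank == 0) free(send_data);
  MPI_Finalize();
  return 0;
}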

compile and run scatter.c

4. Hands-on: MPI_Gather

  • int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
    • *sendbuf: pointer to the address on each process of the communicator, containing the array of data elements to be gathered.
    • sendcount: how many data elements from each process of the communicator to be sent back to the root process.
    • *recvbuf: pointer to the address on the root process where all gathered data will be written.
    • recvcount: how many data elements to be received from each process of the communicator.
    • root: rank of the process where the gathered data will be collected.
    • comm: communicator (typically just MPI_COMM_WORLD)
  • Inside intro-mpi, create a file named gather.c with the following contents (see the sketch below)
  • Compile and run gather.c:
$ mpicc -o gather gather.c
$ mpirun -np 4 ./gather
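
A sketch of what gather.c might contain, assuming each process contributes one value (its rank squared) and the root collects them into an array:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  int rank, size, send_value;
  int *recv_data = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  send_value = rank * rank;                   /* each process's contribution */

  if (rank == 0) {
    recv_data = malloc(size * sizeof(int));   /* only the root needs space for all */
  }

  MPI_Gather(&send_value, 1, MPI_INT, recv_data, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (int i = 0; i < size; i++) {
      printf("Value gathered from process %d: %d\n", i, recv_data[i]);
    }
    free(recv_data);
  }

  MPI_Finalize();
  return 0;
}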

compile and run gather.c

5. Hands-on: MPI_Reduce

  • int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
    • *sendbuf: pointer to the address on each process of the communicator, containing the array of data elements to be reduced.
    • *recvbuf: pointer to the address on the root process where all final reduced data will be written.
    • count: how many data elements to be received from each process of the communicator. If count > 1, then operation is performed element-wise.
    • op: may be MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD (twelve predefined operations in total). Programmers may also define their own operations, which must be commutative and associative.
    • root: rank of the process where the final reduced result will be stored.
    • comm: communicator (typically just MPI_COMM_WORLD).
  • Inside intro-mpi, create a file named reduce.c with the following contents (see the sketch below)
  • Compile and run reduce.c:
$ mpicc -o reduce reduce.c
$ mpirun -np 4 ./reduce
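
A sketch of what reduce.c might contain, assuming each process contributes its rank and the root receives the element-wise sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size, local_value, total = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  local_value = rank;                    /* each process's contribution */

  /* Sum local_value across all processes; the result is delivered to rank 0. */
  MPI_Reduce(&local_value, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    printf("Sum of ranks 0..%d is %d\n", size - 1, total);
  }

  MPI_Finalize();
  return 0;
}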

compile and run reduce.c

Key Points


MPI: pleasantly parallel and workload allocation

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives
  • Understand the characteristics that make a computational job pleasantly parallel.

  • Know the various approaches to workload division in pleasantly parallel computation.

1. Definition

  • Embarrassingly/naturally/pleasantly parallel.
  • A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process.
  • Each process can do its tasks without any interaction with the other processes, therefore, …
    • No communication or very little communication among the processes.

2. Example: integral estimation using the trapezoid method

  • A technique to approximate the integral of a function f.
  • Integral is defined as the area of the region bounded by the graph f, the x-axis, and two vertical lines x=a and x=b.
  • We can estimate this area by dividing it into small trapezoids.

trapezoids

3. Example: integral estimation using the trapezoid method

  • Divide the area under the curve into 8 trapezoids with equal base h.
    • N = 8
    • a = 0
    • b = 2
  • The base h can be calculated as:
    • h = (b - a) / N = 0.25

trapezoids

4. Example: integral estimation using the trapezoid method

  • Which trapezoid (workload/task) goes to which process?
    • Start with small number of processes.
    • Calculate the workload assignment manually for each count of processes.
    • Generalize assignment for process i based on sample calculations.

trapezoids

5. Example: integral estimation using the trapezoid method

  • 4 processes: P0, P1, P2, P3: size = 4
  • N = 8
  • a = 0
  • b = 2
  • The height h can be calculated as:
    • h = (b - a) / N = 0.25
  • The amount of trapezoid per process:
    • local_n = N / size = 2;
  • local_a: a variable representing the starting point of the local interval for each process. local_a changes as a process finishes calculating one trapezoid and moves to the next.
    • local_a for P0 = 0 = 0 + 0 * 2 * 0.25
    • local_a for P1 = 0.5 = 0 + 1 * 2 * 0.25
    • local_a for P2 = 1 = 0 + 2 * 2 * 0.25
    • local_a for P3 = 1.5 = 0 + 3 * 2 * 0.25

trapezoids

6. Handson: integral estimation using the trapezoid method

  • Your account (login/password) will work on both taz and submitty.
  • USERNAME represents the login name that you received in email.
  • To access taz from a terminal:
$ ssh USERNAME@taz.cs.wcupa.edu
  • To access submitty from a terminal:
$ ssh USERNAME@submitty.cs.wcupa.edu
  • The environments on taz and submitty are similar to one another. In the remainder of these lectures, example screenshots will be taken from submitty, but all commands will work on taz as well.

  • Change into intro-mpi

$ cd intro-mpi
  • To create a file from terminal, run nano -c file_name.
  • When finished editing, press Ctrl-X to quit and save.
  • Press Y to confirm that you want to save.
  • Press Enter to confirm that you are saving to file_name.
  • Inside intro-mpi, create a file named trapezoid.c with the following contents (see the sketch below)
  • Compile and run trapezoid.c:
$ mpicc -o trapezoid trapezoid.c
$ mpirun -np 4 ./trapezoid 0 1 10
$ mpirun -np 4 ./trapezoid 0 1 100
$ mpirun -np 4 ./trapezoid 0 1 1000
$ mpirun -np 4 ./trapezoid 0 1 10000
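
A sketch of what this MPI trapezoid.c might contain, assuming f(x) = x * x as in the OpenMP version, contiguous (block) assignment of trapezoids, and an MPI_Reduce to combine the partial sums:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* f(x) = x^2, matching the OpenMP trapezoid example (an assumption) */
double f(double x) { return x * x; }

int main(int argc, char *argv[]) {
  int rank, size;
  double a = atof(argv[1]);
  double b = atof(argv[2]);
  int N = atoi(argv[3]);

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double h = (b - a) / N;                /* width of each trapezoid */
  int local_n = N / size;                /* trapezoids per process (assume divisible) */
  double local_a = a + rank * local_n * h;
  double local_sum = 0.0;

  for (int i = 0; i < local_n; i++) {
    double x0 = local_a + i * h;
    local_sum += (f(x0) + f(x0 + h)) * h / 2.0;
  }

  double total = 0.0;
  MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    printf("The integral is: %.4f\n", total);
  }

  MPI_Finalize();
  return 0;
}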

trapezoids

7. Hands-on: static workload assignment

  • Is this fair?
  • Create a copy of trapezoid.c called trapezoid_static.c and modify trapezoid_static.c to have the following contents
  • Compile and run trapezoid_static.c:
$ mpicc -o trapezoid_static trapezoid_static.c
$ mpirun -np 4 ./trapezoid_static 0 1 1000

trapezoid static

  • This is called static workload assignment.

mandelbrot static

8. Hands-on: cyclic workload assignment

  • Create a copy of trapezoid_static.c called trapezoid_cyclic.c and modify trapezoid_cyclic.c to have the following contents
  • Compile and run trapezoid_cyclic.c:
$ mpicc -o trapezoid_cyclic trapezoid_cyclic.c
$ mpirun -np 4 ./trapezoid_cyclic 0 1 1000

trapezoid cyclic

  • This is called cyclic workload assignment.

mandelbrot cyclic

9. Hands-on: dynamic workload assignment

  • Create a file called trapezoid_dynamic.c with the following contents:
  • Compile and run trapezoid_dynamic.c:
$ mpicc -o trapezoid_dynamic trapezoid_dynamic.c
$ mpirun -np 4 ./trapezoid_dynamic 0 1 1000
$ mpirun -np 4 ./trapezoid_dynamic 0 1 1000
$ mpirun -np 4 ./trapezoid_dynamic 0 1 1000

trapezoid dynamic

  • This is called dynamic workload assignment.

mandelbrot dynamic

Key Points


Introduction to CloudLab

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

1. What is CloudLab

  • Experimental testbed for future computing research
  • Allows researchers control down to the bare metal
  • Diverse, distributed resources at large scale
  • Allow repeatable and scientific design of experiments

CloudLab

2. What is GENI

  • Global Environment for Networking Innovation
  • "Combining heterogeneous resource types, each virtualized along one or more suitable dimensions, to produce a single platform for network science researchers"
  • Key components:
    • GENI racks: virtualized computation and storage resources
    • Software-defined networks (SDNs): virtualized, programmable network resources
    • WiMAX: virtualized cellular wireless communication

Berman, M., Chase, J.S., Landweber, L., Nakao, A., Ott, M., Raychaudhuri, D., Ricci, R. , and Seskar, I., 2014. GENI: A federated testbed for innovative network experiments. Computer Networks, 61, pp.5-23.

3. Key experimental concepts

  • Sliceability: the ability to support virtualization while maintaining some degree of isolation for simultaneous experiments
  • Deep programmability: the ability to influence the behavior of computing, storage, routing, and forwarding components deep inside the network, not just at or near the network edge.

4. Hardware

  • Utah/HP: Low-power ARM64 (785 nodes)
    • 315 m400: 1X 8-core ARMv8 at 2.4GHz, 64GB RAM, 120GB flash
    • 270 m510: 1X 8-core Intel Xeon D-1548 at 2.0 GHz, 64GB RAM, 256 GB flash
    • 200 xl170: 1X 10-core Intel E5-2640v4 at 2.4 GHz, 64 GB RAM, 480 GB SSD
  • Wisconsin/Cisco: 530 nodes
    • 90 c220g1: 2X 8-core Intel Haswell at 2.4GHz, 128GB RAM, 1X 480GB SSD, 2X 1.2TB HDD
    • 10 c240g1: 2X 8-core Intel Haswell at 2.4GHz, 128GB RAM, 1X 480GB SSD, 1X 1TB HDD, 12X 3TB HDD
    • 163 c220g2: 2X 10-core Intel Haswell at 2.6GHz, 160GB RAM, 1X 480GB SSD, 2X 1.2TB HDD
    • 7 c240g2: 2X 10-core Intel Haswell at 2.6GHz, 160GB RAM, 2X 480GB SSD, 12X 3TB HDD
    • 224 c220g5: 2X 10-core Intel Skylake at 2.20GHz, 192GB RAM, 1TB HDD
    • 32 c240g5: 2X 10-core Intel Skylake at 2.20GHz, 192GB RAM, 1TB HDD, 1 NVIDIA P100 GPU
    • 4 c4130: 2X 8-core Intel Broadwell at 3.20GHz, 128GB RAM, 2X 960GB HDD, 4 NVIDIA V100 GPU
  • Clemson/Dell: 256 nodes
    • 96 c8220: 2X 10-core Intel Ivy Bridge at 2.2GHz, 256GB RAM, 2X 1TB HDD
    • 4 c8220x: 2X 10-core Intel Ivy Bridge at 2.2GHz, 256GB RAM, 8X 1TB HDD, 12X 4TB HDD
    • 84 c6420: 2X 14-core Intel Haswell at 2.0GHz, 256GB RAM, 2X 1TB HDD
    • 2 c4130: 2X 12-core Intel Haswell at 2.5GHz, 256GB RAM, 2X 1TB HDD, 2 NVIDIA K40m GPU
    • 2 dss7500: 2X 6-core Intel Haswell at 2.4GHz, 128GB RAM, 2X 126GB SSD, 45X 6TB HDD
    • 72 c6420: 2X 16-core Intel Skylake at 2.6GHz, 386GB RAM, 2X 1TB HDD

5. Setup SSH

  • SSH into taz.cs.wcupa.edu or submitty.cs.wcupa.edu and run the following commands:
  • Hit Enter for all questions. Do not enter a password or change the default location of the files.
$ cd
$ ssh-keygen -t rsa

  • Run the following command to display the public key
  • Drag your mouse over the key to select/copy it (just the text, no extra spaces after the last character)
$ cat ~/.ssh/id_rsa.pub

  • Log into CloudLab, click on your username (top right) and select Manage SSH Keys:

  • Paste the key into the Key box and click Add Key:

6. Setup GitHub repository

  • Go to your GitHub account, under Repositories, select New.

  • You can select any name for your repo.
  • It must be public.
  • The Add a README file box must be checked.
  • Click Create repository when done.

  • Click Add file and select Create new file

  • Type profile.py for the file name and enter THIS CONTENT into the text editor.
  • Click Commit new file when done.
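For orientation only: CloudLab profiles are geni-lib Python scripts. The course provides the exact profile.py via the THIS CONTENT link above; a minimal profile of this kind generally looks something like the sketch below, where the node name and disk image URN are illustrative assumptions.

"""Hypothetical minimal CloudLab profile (the course supplies the real profile.py)."""
import geni.portal as portal

# Create a portal context and a request object describing the experiment.
pc = portal.Context()
request = pc.makeRequestRSpec()

# Ask for a single bare-metal node; the image URN here is illustrative.
node = request.RawPC("node-0")
node.disk_image = "urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU18-64-STD"

# Emit the RSpec that CloudLab will instantiate.
pc.printRequestRSpec(request)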

7. Setup CloudLab profile

  • Login to your CloudLab account, click Experiments on top left, select Create Experiment Profile.

  • Click on Git Repo

  • Paste the URL of your previously created Git repo here and click Confirm

  • Enter the name for your profile, put in some words for the Description.
  • You will not have a drop-down list of Project.
  • Click Create when done.

  • Click Instantiate to launch an experiment from your profile.

  • Select a Cluster from Wisconsin, Clemson, or Emulab, then click Next.
  • Do not do anything on the next Start on date/time screen. Click Finish.

  • Your experiment is now being provisioned, and will then boot.

  • When it is ready, you can use the provided SSH command to log in to your experiment (assuming your key was set up correctly)

Key Points


Deploying compute nodes for a supercomputer

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How do we deploy compute nodes and relevant software programmatically?

Objectives

1. Setup webhook in CloudLab

  • A webhook is a mechanism that automatically pushes any update on your GitHub repo to the corresponding repository copy on CloudLab.

  • Login to CloudLab.
  • Top left, under Experiments, select My Profiles.

  • Click on the profile created for CSC466, then select Edit

  • Click on the Push URL bar, then click on the blue Clipboard icon.

  • Go to the corresponding GitHub repository and go to Settings.

  • Go to Webhooks.

  • Click Add webhook. You will need to re-enter the password for your GitHub account next.

  • Paste the value copied from CloudLab into the Payload URL box.
  • Click on Add webhook.

2. Create a new branch on your GitHub repo

  • In your GitHub repository, create a new branch called compute.

  • If you set up your webhook correctly, the compute branch should show up on your CloudLab profile as well.

  • In your GitHub repository, edit the profile.py file with THIS CONTENT.
  • Click Commit changes once done to save.

  • Create a new file called install_mpi.sh with THIS CONTENT
  • Click Commit new file once done to save.
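The real install_mpi.sh is provided via the THIS CONTENT link. As a hedged sketch of what such a script typically does, the version below builds OpenMPI 3.1.2 into /opt/openmpi/3.1.2, matching the PATH and LD_LIBRARY_PATH settings used later in these notes; the package list is an assumption.

#!/bin/bash
# Hypothetical sketch of install_mpi.sh: build OpenMPI 3.1.2 into /opt/openmpi/3.1.2.
set -e
sudo yum install -y gcc gcc-c++ make wget
wget https://download.open-mpi.org/release/open-mpi/v3.1/openmpi-3.1.2.tar.gz
tar xzf openmpi-3.1.2.tar.gz
cd openmpi-3.1.2
./configure --prefix=/opt/openmpi/3.1.2
make -j "$(nproc)"
sudo make install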

  • Go to CloudLab, refresh the page, and confirm that the hash of your compute branch here matches the hash on the GitHub repo.
  • Instantiate from the compute branch.

Key Points


Distributed File Systems

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to arrange read/write accesses with processes running on computers that are part of a computing cluster?

Objectives

0. Hands-on: What does it take to run MPI on a distributed computing system?

  • Log into CloudLab
  • Go to Experiments/My Experiments

Sun

  • Go to the DistributedFS experiment under Experiments in my Projects.

Sun

  • Under the List View tab, SSH command column, use the provided SSH command to login to compute-1.

Sun

  • Run the following bash commands from your home directory:
$ cd
$ cp -R /local/repository/source .
$ cd source
$ mpicc -o hello hello.c 
$ mpirun -np 4 ./hello
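The hello.c file ships with the course repository; it is presumably a standard MPI hello world along these lines (a sketch, not necessarily the exact course file), where each process reports its rank, the communicator size, and its host name.

/* Hypothetical sketch of hello.c from /local/repository/source. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}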

Sun

  • What does it take to run on multiple cores?
    • MPI launches processes over SSH: you need to set up passwordless SSH (hit Enter for all questions!)
    • You need to list the host's IP address once for each process you want to launch (see the -H lists below).
$ cd 
$ sudo passwd $USER
$ ssh-keygen -t rsa
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ ssh localhost
$ exit
$ cd source
$ cat /etc/hosts | grep compute
$ mpirun -np 2 -H 192.168.1.2,192.168.1.2 ./hello
$ mpirun -np 4 -H 192.168.1.2,192.168.1.2,192.168.1.2,192.168.1.2 ./hello

Sun Sun

  • How can we get this to run across multiple nodes?
    • Open a new terminal.
    • Log into compute-2 and repeat the above process to set a password and set up passwordless SSH.
$ cd 
$ sudo passwd $USER
$ ssh-keygen -t rsa
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ ssh localhost
$ exit

Sun

  • Copy the key from compute-1 to compute-2.
    • Switch to the terminal that logged onto compute-1
    • Answer yes if asked about the authenticity of the host.
    • Enter the password for your account on compute-2 (that you set up previously!)
    • If passwordless SSH was set up successfully, a subsequent SSH from compute-1 to compute-2 will not require any password.
  • Repeat the process to setup passwordless SSH from compute-2 to compute-1.
$ ssh-copy-id compute-2 
$ ssh compute-2
$ exit

Sun

  • On each node, compute-1 and compute-2, run the following:
$ echo "export PATH=/opt/openmpi/3.1.2/bin/:$PATH" >> ~/.bashrc 
$ echo "export LD_LIBRARY_PATH=/opt/openmpi/3.1.3/bin/:$LD_LIBRARY_PATH" >> ~/.bashrc
  • On compute-2, repeat the process to copy source to your home directory and to compile hello.c.
  • Switch back to the terminal that logged into compute-1.
  • View the file machine_file inside the source directory and run the MPI job
$ cd
$ cd source
$ cat machine_file
$ mpirun -np 8 --hostfile machine_file ./hello
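For reference, an OpenMPI hostfile for a two-node setup like this one typically contains one line per host with a slot count, along the lines below; the actual machine_file ships with the repository, and the host names and slot counts here are assumptions.

compute-1 slots=4
compute-2 slots=4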

Sun

1. Type of distributed file systems

  • Networked File System
    • Allows transparent access to files stored on a remote disk.
  • Clustered File System
    • Allows transparent access to files stored on a large remote set of disks, which could be distributed across multiple computers.
  • Parallel File System
    • Enables parallel access to files stored on a large remote set of disks, which could be distributed across multiple computers.

2. Networked File System

Sun

  • Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., & Lyon, B. (1985, June). “Design and implementation of the Sun network filesystem.” In Proceedings of the Summer USENIX conference (pp. 119-130)
  • Design goals
    • Machine and operating system independence
    • Crash recovery
    • Transparent access
    • UNIX semantics maintained on client
    • Reasonable performance (target 80% as fast as local disk)

3. Networked File System Design

  • NFS Protocol
    • Remote Procedure Call mechanism
    • Stateless protocol
    • Transport independence (UDP/IP)
  • Server side
    • Must commit modifications before return results
    • Generation number in inode and filesystem id in superblock
  • Client side
    • Additional virtual file system interface in the Linux kernel.
    • Attach remote file system via mount

NFS

4. Clustered File Systems

  • Additional middleware layers are introduced so that the tasks of a file system server can be distributed among a cluster of computers.
  • Example: The Zettabyte File System (ZFS) by Sun Microsystems
  • Bonwick, Jeff, Matt Ahrens, Val Henson, Mark Maybee, and Mark Shellenbaum. “The zettabyte file system.” In Proc. of the 2nd Usenix Conference on File and Storage Technologies, vol. 215. 2003.
  • “One of the most striking design principles in modern file systems is the one-to-one association between a file system and a particular storage device (or portion thereof). Volume managers do virtualize the underlying storage to some degree, but in the end, a file system is still assigned to some particular range of blocks of the logical storage device. This is counterintuitive because a file system is intended to virtualize physical storage, and yet there remains a fixed binding between a logical namespace and a specific device (logical or physical, they both look the same to the user).”

5. Clustered File Systems: design principles

  • Simple administration: simplify and automate administration of storage to a much greater degree.
  • Pooled storage: decouple file systems from physical storage with allocation being done on the pooled storage side rather than the file system side.
  • Dynamic file system size
  • Always-consistent on-disk data.
  • Immense capacity (Prediction in 2003: 16 Exabyte datasets to appear in 10.5 years).
  • Error detection and correction
  • Integration of volume manager
  • Excellent performance

ZFS ZFS

6. Parallel File Systems

  • Ross, Robert, Philip Carns, and David Metheny. “Parallel file systems.” In Data Engineering, pp. 143-168. Springer, Boston, MA, 2009.

“… The storage hardware selected must provide enough raw throughput for the expected workloads. Typical storage hardware architectures also often provide some redundancy to help in creating a fault tolerant system. Storage software, specifically file systems, must organize this storage hardware into a single logical space, provide efficient mechanisms for accessing that space, and hide common hardware failures from compute elements.
Parallel file systems (PFSes) are a particular class of file systems that are well suited to this role. This chapter will describe a variety of PFS architectures, but the key feature that classifies all of them as parallel file systems is their ability to support true parallel I/O. Parallel I/O in this context means that many compute elements can read from or write to the same files concurrently without significant performance degradation and without data corruption. This is the critical element that differentiates PFSes from more traditional network file systems …”

PFS

7. Parallel File Systems: fundamental design concepts

  • Single namespace, including the file and directory hierarchy.
  • Actual data are distributed over storage servers.
  • Only large files are split up into contiguous data regions.
  • Metadata regarding namespace and data distribution are stored:
    • Dedicated metadata servers (PVFS)
    • Distributed across storage servers (CephFS)

8. Parallel File Systems: access mechanism

  • Shared-file (N-to-1): A single file is created, and all application tasks write to that file (usually to disjoint regions)
    • Increased usability: only one file is needed
    • Can create lock contention and reduce performance
  • File-per-process (N-to-N): Each application task creates a separate file, and writes only to that file.
    • Avoids lock contention
    • Can create a massive number of small files
    • Does not support application restart on a different number of tasks

9. Parallel File Systems: data distribution

  • Original File: Sequence of Bytes
  • The sequence of bytes is converted into a sequence of offsets (each offset can cover multiple bytes)
  • Offsets are mapped to objects
    • not necessarily ordered mapping
    • reversible to allow clients to contact specific PFS server for specific data content
  • Objects are distributed across PFS servers
  • Information about where the objects are is stored at the metadata server

10. Parallel File Systems: object placement

  • Round robin is reasonable default solution
  • Work consistently on most systems
    • Default solutions for: GPFS, Lustre, PVFS
  • Potential scalability issue with massive scaling of file servers and file size
    • Two dimensional distribution
    • Limit number of servers per file
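As a rough illustration of round-robin object placement (not taken from the lesson), the following sketch maps a byte offset to the storage server that holds it, assuming a fixed stripe size.

/* Sketch: round-robin striping maps a byte offset to a stripe index, and
 * the stripe index to a storage server.                                   */
#include <stdio.h>

int server_for_offset(long offset, long stripe_size, int num_servers) {
    long stripe = offset / stripe_size;      /* which fixed-size stripe holds this byte */
    return (int)(stripe % num_servers);      /* stripes are dealt out round-robin       */
}

int main(void) {
    /* 64 KiB stripes across 4 servers: byte 200000 lives in stripe 3 -> server 3. */
    printf("%d\n", server_for_offset(200000, 64 * 1024, 4));
    return 0;
}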

11. Design challenges

  • Performance
    • How well the file system interfaces with applications
  • Consistency Semantics
  • Interoperability:
    • POSIX/UNIX
    • MPI/IO
  • Fault Tolerance:
    • Amplified by the PFS's multiple storage devices and I/O paths
  • Management Tools

12. Hands-on: Setup NFS

  • Log into CloudLab
  • Log into the head node on your cluster
  • Run the following bash commands from your home directory:
$ sudo yum install -y nfs-utils
$ sudo mkdir -p /scratch
$ sudo chown nobody:nobody /scratch
$ sudo chmod 777 /scratch
$ sudo systemctl enable rpcbind
$ sudo systemctl enable nfs-server
$ sudo systemctl enable nfs-lock
$ sudo systemctl enable nfs-idmap
$ sudo systemctl start rpcbind
$ sudo systemctl start nfs-server
$ sudo systemctl start nfs-lock
$ sudo systemctl start nfs-idmap
$ echo "/scratch 192.168.1.2(rw,sync,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports
$ echo "/scratch 192.168.1.3(rw,sync,no_root_squash,no_subtree_check)" | sudo tee -a /etc/exports
$ sudo systemctl restart nfs-server
  • Log into compute-1 and compute-2 nodes on your cluster
  • Run the following bash commands from your home directory on both nodes:
$ echo "export PATH=/opt/openmpi/3.1.2/bin/:$PATH" >> ~/.bashrc 
$ echo "export LD_LIBRARY_PATH=/opt/openmpi/3.1.2/bin/:$LD_LIBRARY_PATH" >> ~/.bashrc
$ sudo yum install -y nfs-utils
$ sudo mkdir -p /scratch
$ sudo mount -t nfs 192.168.1.1:/scratch /scratch
$ df -h

NFS mount

  • On compute-1, run the following:
$ cd 
$ ssh-keygen -t rsa -f .ssh/id_rsa -N ''
$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
$ cp -R .ssh /scratch
  • On compute-2, run the following
$ cp -R /scratch/.ssh .


  • Test passwordless SSH by attempting to SSH to compute-1 from compute-2 and vice versa.
  • Log on to compute-1 and run the following:
$ cd /scratch
$ cp -R /local/repository/source .
$ cd source
$ mpicc -o hello hello.c 
$ mpirun -np 8 --hostfile machine_file ./hello

Key Points


Schedulers for cluster of computers

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What does it mean to schedule jobs on a supercomputer?

Objectives

1. Introduction

  • The term scheduling applies at several levels of the system, from application/task scheduling, which deals with scheduling of threads, to meta-scheduling, where jobs are scheduled across multiple supercomputers.
  • Here, we focus on job scheduling: the scheduling of jobs on a supercomputer, the enabling software infrastructure, and the underlying algorithms.
  • This is an optimization problem in packing with space (processor) and time constraints
    • Dynamic environment
    • Tradeoffs in turnaround, utilization, and fairness

2. Introduction: Job scheduling

  • Simplest way to schedule in a parallel system is using a queue
  • Each job is submitted to a queue and each job upon appropriate resource match executes to completion while other jobs wait.
  • Since each application utilizes only a subset of the system's processors, the processors outside that subset could remain idle during execution. Reducing the number of unused processors and maximizing system utilization is the focus of much of scheduling research.
  • Space sharing: a natural extension of the queue, where the system allows another job in the queue to execute on the idle processors if enough are available.
  • Time sharing: another model, where a processor's time is shared among threads of different parallel programs. In such approaches each processor executes a thread for some time, pauses it, and then begins executing a new thread (e.g., gang scheduling).
  • Since context switching in time sharing involves overhead, complex job resource requirements, and memory management, most supercomputing installations prefer space-sharing scheduling systems.

3. CPU scheduling

  • Scheduling of CPUs is fundamental to Operating System design
  • Process execution consists of a cycle of CPU execution and I/O wait: a CPU burst is followed by an I/O burst, then another CPU burst, and so on.
  • The OS invokes the short-term scheduler (CPU scheduler) to pick the next process to run.
  • The scheduler selects from among the processes in memory that are ready to execute and allocates the CPU to one of them.
  • Scheduling happens under one of 4 conditions:
    • When process switches from running state to the waiting state (Non Preemptive)
    • When a process switches from the running state to the ready state (Preemptive)
    • When a process switches from the waiting state to the ready state (Preemptive)
    • When a process terminates (Non Preemptive)

4. CPU scheduling

  • Scheduling criteria:
    • Max CPU utilization: keep the CPU as busy as possible.
    • Max throughput: the number of processes completed per time unit.
    • Min turnaround time: the interval from the submission of a process to its completion.
    • Min waiting time: the sum of the periods the process spends waiting in the ready queue.
    • Min response time: the time from the submission of a request until the first response is produced.

5. Workload management systems

  • Supercomputing centers often support several hundred users running thousands of jobs across a shared supercomputing resource consisting of a large number of compute nodes and storage systems.
  • It can be extremely difficult for an administrator of such a resource to manually manage users and resource allocations while ensuring optimal utilization of the large supercomputing resource.
  • Workload management systems (WMS) help address this problem by providing users with simple resource access and providing the administrator with a set of powerful tools and utilities for resource management, monitoring, scheduling, accounting, and policy enforcement.

workload management systems

6. Workload management systems: main activities

  • Queuing
  • Scheduling
  • Monitoring
  • Resource Management
  • Accounting

7. Workload management systems: a layer between users and computing resources

  • Users submit jobs, specifying the work to be performed, to a queue
  • Jobs wait in queue until they are scheduled to start on the cluster.
  • Scheduling is governed by stipulated policies and algorithms that implement the policy. (Policies usually ensure fair sharing of resources and attempt to optimize overall system utilization.)
  • Resource management mechanisms handle launching of jobs and subsequent cleanup.
  • The workload management system is simultaneously monitoring the status of various system resources and accounting resource utilization.

8. Workload management systems: queueing

  • Queuing is the most visible part of the WMS: the system collects the jobs to be executed.
  • Submission of jobs is usually performed in a container called a batch job (usually specified in the form of a file).
  • The batch job consists of two primary parts:
    • A set of resource directives (number of CPUs, amount of memory, etc.)
    • A description of the task(s) to be executed (executable, arguments, etc.)
  • Upon submission, the batch job is held in a queue until a matching resource is found. Queue wait times for submitted jobs can vary depending on the demand for the resource among users.
  • Production or real-world supercomputing resources often have multiple queues, each of which can be preconfigured to run certain kinds of jobs. Example: the Tezpur cluster has a debug queue and a workq queue.

available queues on Bridges

9. Workload management systems: scheduling

  • Scheduling selects the best job to run based on the current resource availability and scheduling policy.
  • Scheduling can be generally broken into two primary activities :
    • Policy enforcement: enforce resource utilization based on policies set by supercomputing sites (controls job priority and schedulability).
    • Resource optimization: pack jobs efficiently and exploit underused resources to boost overall resource utilization.
  • Balancing policy enforcement with resource optimization in order to pick the best job to run is the difficult part of scheduling
  • Common scheduling algorithms include First In First Out, Backfill, Fairshare.

10. Workload management systems: monitoring

  • Resource monitoring by the WMS provides administrators, users, and scheduling systems with status information about jobs and resources. Monitoring is often performed for three critical states:
    • For idle nodes, to verify their working order and readiness to run another job.
    • For busy nodes, to monitor memory, CPU, network, I/O, and utilization of system resources to ensure proper distribution of workload and effective utilization of nodes.
    • For completed jobs, to ensure that no processes remain from the completed job and that the node is still in working order before a new job is started on it.

11. Workload management systems: resource management

  • The resource management component is responsible for starting, stopping, and cleaning up after jobs.
  • A batch system's resource management is set up so that jobs run under the identity of the submitting user without requiring the user to be present at run time.
  • Jobs are started only on the nodes that are functioning properly.
  • Resource management also includes removing or adding of resources to the available pool of systems
  • Clusters are dynamic resources, systems go down, or additional resources are added.
  • Registration of new nodes and the marking of nodes as unavailable are additional aspects of resource management

12. Workload management systems: accounting and reporting

  • Workload accounting can be defined as the process of collecting resource usage data for the batch jobs that run on the resource (e.g., % CPU utilization, memory utilization, etc.).
  • Data obtained from accounting is often used to:
    • Produce weekly/monthly per-user usage reports
    • Tune the scheduling policy
    • Calculate future resource allocations
    • Anticipate future computer component requirements
    • Determine areas for improvement within the system

13. Scheduling algorithm: FCFS/FIFO

FIFO 1 FIFO 2 FIFO 3

  • Definitions
    • Shadow time: time at which the first job in the queue starts execution
    • Extra nodes: number of nodes idle when the first job in the queue starts execution
  • Simplest scheduling option: FCFS
    • First Come First Serve
  • Problem:
    • Fragmentation: processors can sit idle while the job at the head of the queue waits for enough of them to become free.

FIFO 4

14. Scheduling algorithm: FCFS with Backfilling

FIFO Backfilling

  • Which job(s) should be picked for promotion through the queue?
  • Many heuristics are possible
  • Two have been studied in detail
    • EASY
    • Conservative Back Filling (CBF)
  • In practice EASY (or variants of it) is used, while CBF is not
    • Extensible Argonne Scheduling System
    • Maintain only one “reservation”, for the first job in the queue
    • Go through the queue in order starting with the 2nd job
    • Backfill a job if
      • it will terminate by the shadow time, or
      • it needs less than the extra nodes

FIFO EASY FIFO EASY FIFO EASY FIFO EASY FIFO EASY
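The EASY backfill rule above can be expressed as a small predicate. The following sketch (not from the lesson) assumes the scheduler already knows the current free node count, the shadow time, and the number of extra nodes.

/* Sketch of the EASY backfilling test: a waiting job may be started now if it
 * either finishes before the shadow time or fits within the extra nodes that
 * remain free when the head-of-queue job starts.                            */
#include <stdio.h>

int can_backfill(int job_nodes, double job_runtime,
                 int free_nodes, int extra_nodes,
                 double now, double shadow_time) {
    if (job_nodes > free_nodes) return 0;            /* not enough idle nodes right now */
    if (now + job_runtime <= shadow_time) return 1;  /* done before the reservation     */
    return job_nodes <= extra_nodes;                 /* or fits in the leftover nodes   */
}

int main(void) {
    /* Example: 3 free nodes, the head job starts at t = 100 leaving 1 extra node. */
    printf("%d\n", can_backfill(2, 50.0, 3, 1, 0.0, 100.0));  /* 1: finishes by t=50  */
    printf("%d\n", can_backfill(2, 200.0, 3, 1, 0.0, 100.0)); /* 0: too big, too long */
    return 0;
}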

  • Problem:
    • The first job in the queue will never be delayed by backfilled jobs
    • BUT, other jobs may be delayed infinitely!

FIFO EASY unbounded delay FIFO EASY unbounded delay FIFO EASY unbounded delay FIFO EASY unbounded delay

  • Unbounded Delay
    • The first job in the queue will never be delayed by backfilled jobs
    • BUT, other jobs may be delayed infinitely!
  • No starvation
    • Delay of first job is bounded by runtime of current jobs
    • When the first job runs, the second job becomes the first job in the queue
    • Once it is the first job, it cannot be delayed further
  • EASY favors small long jobs
  • EASY harms large short jobs

15. Conservative Backfilling

  • EVERY job has a “reservation”
  • A job may be backfilled only if it does not delay any other job ahead of it in the queue.
  • Fixes the unbounded delay problem that EASY has
  • More complicated to implement: The algorithm must find holes in the schedule

16. How good is the schedule?

  • All of this is great, but how do we know what a good schedule is?
    • FCFS, EASY, CBF, random?
  • What we need are metrics to quantify how good a schedule is
    • It has to be an aggregate metric over all jobs
  • Metric #1: Turn-around time
    • Also called flow
    • Wait time + Run time
    • But:
      • Job #1 needs 1h of compute time and waits 1s
      • Job #2 needs 1s of compute time and waits 1h
    • Clearly Job #1 is really happy, and Job #2 is not happy at all
  • Metric #2: wait time
    • But
      • Job #1 asks for 1 node and waits 1 h
      • Job #2 asks for 512 nodes and waits 1h
    • Again, Job #1 is unhappy while Job #2 is probably sort of happy

17. SLURM Scheduler

  • Work funded by Department of Energy at Lawrence Livermore National Laboratory.
  • SLURM: Simple Linux Utility for Resource Management.
  • SLURM characteristics:
    • Simple
    • Open source
    • Portable (written in C, requires no kernel modification)
    • Fault-tolerant
    • Secure
    • System administrator friendly
    • Scalable (16K nodes, 128K processors)

18. SLURM Scheduler: entities

  • Nodes: Individual computers
  • Partitions: Job queues
  • Jobs: Resource allocations
  • Job steps: Set of (typically parallel) tasks

SLURM entities

19. SLURM Scheduler: job states

SLURM job states

  • Jobs:
    • Resource allocation: specific processors and memory or entire nodes allocated to a user for some time period.
    • Can be interactive (executed in real-time) or batch (script queued for later execution).
    • Many constraints available for user request
    • Identified by ID number
  • Job steps:
    • A set of tasks launched at the same time and sharing a common communication mechanism (e.g. switch windows configured for the tasks).
    • Allocated resources within the job’s allocation
    • Multiple job steps can be executed concurrently or sequentially on unique or overlapping resources
  • Identified by ID number: jobid.stepid
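As an illustration of jobs versus job steps (not taken from the lesson), a minimal SLURM batch script requests an allocation with #SBATCH resource directives and then launches job steps with srun; the partition name below is hypothetical.

#!/bin/bash
#SBATCH --job-name=hello-mpi      # job name shown by squeue
#SBATCH --partition=normal        # hypothetical partition (queue) name
#SBATCH --nodes=2                 # resource directives: 2 nodes...
#SBATCH --ntasks-per-node=4       # ...with 4 tasks each
#SBATCH --time=00:10:00           # wall-clock limit

# Each srun launches one job step within the job's allocation,
# identified as jobid.0, jobid.1, ...
srun ./hello
srun hostname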

20. SLURM Scheduler: control daemon and compute node daemon

SLURM job states

  • Control daemon: slurmctld
    • Orchestrates SLURM activities across the cluster
    • Primary components
      • Node Manager: Monitors node state
      • Partition Manager: Groups nodes into partitions with various configuration parameters and allocates nodes to jobs
      • Job Manager: Accepts user job requests and places pending jobs into priority-ordered queue. Uses the partition manager to allocate resources to the jobs and then launch them.
  • Compute node daemon: slurmd
    • Monitors state of a single node
    • Manages user jobs and job steps within that node
    • Very light-weight
    • Supports hierarchical communications with configurable fanout (new in version 1.1)

Key Points


Data intensive computing

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


High throughput computing

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


Introduction to IOT

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points