
CS 267 Applications of Parallel Processors

Lecture 9: Computational Electromagnetics - Large Dense Linear Systems

- 2/19/97
- Horst D. Simon
- http://www.cs.berkeley.edu/cs267

Outline - Lecture 9

- Computational Electromagnetics
- Sources of large dense linear systems
- Review of solution of linear systems with Gaussian elimination
- BLAS and memory hierarchy for linear algebra kernels

Outline - Lecture 10

- Layout of matrices on distributed memory machines
- Distributed Gaussian elimination
- Speeding up with advanced algorithms
- LINPACK and LAPACK
- LINPACK benchmark
- Tflops result

Outline - Lecture 11

- Designing portable libraries for parallel machines
- BLACS
- ScaLAPACK for dense linear systems
- other linear algebra algorithms in ScaLAPACK

Computational Electromagnetics

- developed during the 1980s, driven by defense applications
- determine the RCS (radar cross section) of an airplane
- reduce the signature of a plane (stealth technology)
- other applications are antenna design and medical equipment
- two fundamental numerical approaches: MOM (method of moments, frequency domain) and finite differences (time domain)

Computational Electromagnetics

- discretize the surface into triangular facets using standard modeling tools
- the amplitudes of the currents on the surface are the unknowns
- the integral equation is discretized into a set of linear equations

[image: Northwestern Univ. Computational Electromagnetics Laboratory, http://nueml.ece.nwu.edu/]

Computational Electromagnetics (MOM)

After discretization the integral equation has the form

    Z J = V

where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector. Z is given as a four-dimensional integral. (See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta," IEEE Supercomputing '92, pp. 538-542.)

Computational Electromagnetics (MOM)

The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations
D) computing the fields scattered from the object

Analysis of MOM for Parallel Implementation

Task         Work     Parallelism        Parallel Speed
-----------  -------  -----------------  --------------
Fill         O(n^2)   embarrassing       low
Factor       O(n^3)   moderately diff.   very high
Solve        O(n^2)   moderately diff.   high
Field Calc.  O(n)     embarrassing       high

For most scientific applications the biggest gain in performance can be obtained by focusing on one task.

Results for Parallel Implementation on Delta

Task         Time (hours)   Performance (Gflop/s)
-----------  -------------  ---------------------
Fill         9.20           1.0
Factor       8.25           10.35
Solve        2.17           -
Field Calc.  0.12           3.0

The problem solved was for a matrix of size 48,672. (The world record in 1991.)

Current Records for Solving Dense Systems

Year     System Size   Machine
-------  ------------  -----------
1950's   O(100)
1991     55,296        CM-2
1992     75,264        Intel
1993     75,264        Intel
1994     76,800        CM-5
1995     128,600       Paragon XP
1996     215,000       ASCI Red

source: Alan Edelman, http://www-math.mit.edu/edelman/records.html

Sources for large dense linear systems

- not many outside CEM
- even within the CEM community, alternatives such as FD-TD are heavily debated

In many instances choices of algorithms or methods in existing scientific codes or applications are not the result of careful planning and design. At best they reflect the state-of-the-art at the time; at worst they are purely coincidental.

Review of Gaussian Elimination

Gaussian elimination to solve Ax = b:
- start with a dense matrix
- add multiples of each row to subsequent rows in order to create zeros below the diagonal
- end up with an upper triangular matrix U
- solve a linear system with U by substitution, starting with the last variable

see Demmel, http://HTTP.CS.Berkeley.EDU/demmel/cs267/lecture12/lecture12.html

Review of Gaussian Elimination (cont.)

... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
    ... for each row j below row i
    for j = i+1 to n
        ... add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

Review of Gaussian Elimination (cont.)

... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
    ... for each row j below row i
    for j = i+1 to n
        ... add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

... the multiplier m = A(j,i)/A(i,i) is independent of k, so it can be computed once per row j

Review of Gaussian Elimination (cont.)

for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)

... the inner loop now starts at k = i+1, avoiding computation of the known matrix entry (the eliminated A(j,i) is zero)

Review of Gaussian Elimination (cont.)

It will be convenient to store the multipliers m in the implicitly created zeros below the diagonal, so we can use them later to transform the right hand side b:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
    for j = i+1 to n
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)

Review of Gaussian Elimination (cont.)

Now we use Matlab (data parallel) notation to express the algorithm even more compactly:

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector operation. Note that the loop looks elegant, but is no longer intuitive.

Review of Gaussian Elimination (cont.)

Lemma (LU Factorization). If the above algorithm terminates (i.e., it did not try to divide by zero), then A = LU.

Now we can state our complete algorithm for solving Ax = b:
1) Factorize A = LU.
2) Solve Ly = b for y by forward substitution.
3) Solve Ux = y for x by backward substitution.
Then x is the solution we seek, because Ax = L(Ux) = Ly = b.
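
A hedged sketch of steps 2) and 3), assuming A has already been overwritten with packed factors by a routine like lu_no_pivot above (my own helper, not the lecture's code):

import numpy as np

def solve_packed_lu(LU, b):
    """Solve Ax = b given packed factors: the strict lower triangle of
    LU holds L's multipliers (L has unit diagonal), the upper triangle
    holds U."""
    n = LU.shape[0]
    y = np.array(b, dtype=float)
    for i in range(n):                       # forward substitution: Ly = b
        y[i] -= LU[i, :i] @ y[:i]
    x = y
    for i in range(n - 1, -1, -1):           # backward substitution: Ux = y
        x[i] = (x[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x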

Review of Gaussian Elimination (cont.)

Here are some obvious problems with this algorithm, which we need to address:
- If A(i,i) is zero, the algorithm cannot proceed. If A(i,i) is tiny, we will also have numerical problems.
- The majority of the work is done by a rank-one update, which does not exploit a memory hierarchy as well as an operation like matrix-matrix multiplication.

Pivoting for Small A(i,i)

Why is pivoting needed?

    A = [ 0  1 ]
        [ 1  0 ]

Even if A(i,i) is tiny but not zero, difficulties can arise (see the example in Jim Demmel's lecture notes). This problem is resolved by partial pivoting.

Partial Pivoting

Reorder the rows of A so that A(i,i) is large at each step of the algorithm. At step i of the algorithm, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n, i)|.

for i = 1 to n-1
    find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
    if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
    if i != k, swap rows i and k of A
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)    ... each quotient lies in [-1,1]
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
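
A direct NumPy transcription of this pseudocode (a sketch; the error handling and names are my choices):

import numpy as np

def lu_partial_pivot(A):
    """LU with partial pivoting; returns packed factors and the row
    permutation, following the pseudocode above."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    perm = np.arange(n)
    for i in range(n - 1):
        k = i + np.argmax(np.abs(A[i:, i]))       # largest |A(k,i)|, k >= i
        if A[k, i] == 0.0:
            raise ValueError("A is singular, or nearly so")
        if k != i:
            A[[i, k], :] = A[[k, i], :]           # swap rows i and k
            perm[[i, k]] = perm[[k, i]]
        A[i+1:, i] /= A[i, i]                     # quotients lie in [-1, 1]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A, perm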

Partial Pivoting (cont.)

- for the 2-by-2 example, we get a very accurate answer
- several choices as to when to swap rows i and k
- could use indirect addressing and not swap them at all, but this would be slow
- keep the permutation; then solving Ax = b only requires the additional step of permuting b

Fast linear algebra kernels: BLAS

- Simple linear algebra kernels such as matrix-matrix multiply (exercise) can be performed fast on memory hierarchies.
- More complicated algorithms can be built from some very basic building blocks and kernels.
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines, or BLAS.
- Early agreement on a standard interface (around 1980) led to portable libraries for vector and shared memory parallel machines.
- BLAS are classified into three categories: levels 1, 2, and 3.

see Demmel, http://HTTP.CS.Berkeley.EDU/demmel/cs267/lecture02.html

Level 1 BLAS

Operate mostly on vectors (1D arrays) or pairs of vectors; perform O(n) operations; return either a vector or a scalar.

Examples:
saxpy: y(i) = a * x(i) + y(i), for i = 1 to n. Saxpy is an acronym for the operation (a times x plus y). S stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex.
sscal: y = a * x
srot: replaces vectors x and y by c*x + s*y and -s*x + c*y, where c and s are typically a cosine and sine.
sdot: computes s = sum_{i=1}^{n} x(i) * y(i)
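
For illustration, here are NumPy expressions performing the same O(n) operations (the Fortran BLAS themselves are also reachable from Python through scipy.linalg.blas; the sizes below are made up):

import numpy as np

n = 8
a = 2.0
x = np.linspace(0.0, 1.0, n)
y = np.ones(n)

y = a * x + y                            # saxpy: y = a*x + y
y = a * x                                # sscal (as defined above): y = a*x
s = np.dot(x, y)                         # sdot: s = sum_i x(i)*y(i)
c, sn = np.cos(0.3), np.sin(0.3)
x, y = c*x + sn*y, -sn*x + c*y           # srot: plane rotation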

Level 2 BLAS

Operate mostly on a matrix (2D array) and a vector; return a matrix or a vector; O(n^2) operations.

Examples:
sgemv: matrix-vector multiplication; computes y = y + A*x, where A is m-by-n, x is n-by-1 and y is m-by-1.
sger: rank-one update; computes A = A + y*x', where A is m-by-n, y is m-by-1, x is n-by-1, and x' is the transpose of x. This is a short way of saying A(i,j) = A(i,j) + y(i)*x(j) for all i,j.
strsv: triangular solve; solves y = T*x for x, where T is a triangular matrix.
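
The NumPy counterparts of these O(n^2) kernels (a sketch with made-up sizes; scipy.linalg.solve_triangular plays the role of strsv here):

import numpy as np
from scipy.linalg import solve_triangular

m, n = 4, 3
A = np.ones((m, n))
x = np.arange(1.0, n + 1.0)
y = np.zeros(m)

y = y + A @ x                                   # sgemv: y = y + A x
A = A + np.outer(y, x)                          # sger: A = A + y x'
T = np.triu(np.full((n, n), 1.0)) + np.eye(n)   # nonsingular upper triangular T
x = solve_triangular(T, np.ones(n))             # strsv: solve T x = y for x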

Level 3 BLAS

Operate on pairs or triples of matrices, returning a matrix; complexity is O(n^3).

Examples:
sgemm: matrix-matrix multiplication; computes C = C + A*B, where C is m-by-n, A is m-by-k, and B is k-by-n.
strsm: multiple triangular solve; solves Y = T*X for X, where T is a triangular matrix and X is a rectangular matrix.
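
And the O(n^3) analogues (again a sketch; solve_triangular applied to a matrix right-hand side corresponds to strsm):

import numpy as np
from scipy.linalg import solve_triangular

m, n, k = 4, 3, 5
A = np.random.rand(m, k)
B = np.random.rand(k, n)
C = np.zeros((m, n))

C = C + A @ B                                    # sgemm: C = C + A B
T = np.tril(np.random.rand(k, k)) + np.eye(k)    # nonsingular lower triangular T
Y = np.random.rand(k, n)
X = solve_triangular(T, Y, lower=True)           # strsm: solve T X = Y for X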

Performance of BLAS

[figure: measured performance of the Level 3, Level 2, and Level 1 BLAS]

Performance of BLAS (cont.)

- BLAS are specially optimized by the vendor (IBM) to take advantage of all features of the RS 6000/590.
- Potentially a big speed advantage if an algorithm can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1.
- The top speed of the BLAS3, about 250 Mflops, is very close to the peak machine speed of 266 Mflops.
- We will reorganize algorithms, like Gaussian elimination, so that they use BLAS3 rather than BLAS1 or BLAS2.

Explanation of Performance of BLAS

m = number of memory references to slow memory (read + write)
f = number of floating point operations
q = f/m = average number of flops per slow memory reference

        m            justification for m             f       q
saxpy   3n           read x(i), y(i); write y(i)     2n      2/3
sgemv   n^2 + O(n)   read each A(i,j) once           2n^2    2
sgemm   4n^2         read A(i,j), B(i,j), C(i,j);    2n^3    n/2
                     write C(i,j) once
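
To make the table concrete, a few lines of Python evaluating q = f/m for n = 1000, using the counts above; the q that grows with n for sgemm is why BLAS3 can run near peak:

n = 1000
kernels = {
    "saxpy": (3 * n,    2 * n),      # (m, f) from the table
    "sgemv": (n**2,     2 * n**2),   # ignoring the O(n) vector terms
    "sgemm": (4 * n**2, 2 * n**3),
}
for name, (m_refs, flops) in kernels.items():
    print(f"{name}: q = {flops / m_refs:.2f}")
# saxpy: q = 0.67   sgemv: q = 2.00   sgemm: q = 500.00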

CS 267 Applications of Parallel Processors

Lecture 10: Large Dense Linear Systems - Distributed Implementations

- 2/21/97
- Horst D. Simon
- http://www.cs.berkeley.edu/cs267

Review - Lecture 9

- computational electromagnetics and linear systems
- rewrote Gaussian elimination as vector and matrix-vector operations (level 2 BLAS)
- discussed the efficiency of level 3 BLAS in terms of reducing the number of memory accesses

Outline - Lecture 10

- Layout of matrices on distributed memory machines
- Distributed Gaussian elimination
- Speeding up with advanced algorithms
- LINPACK and LAPACK
- LINPACK benchmark
- Tflops result

Review of Gaussian Elimination

Now we use Matlab (data parallel) notation to express the algorithm even more compactly:

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector operation. Note that the loop looks elegant, but is no longer intuitive.

Partial Pivoting

Reorder the rows of A so that A(i,i) is large at each step of the algorithm. At step i of the algorithm, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n, i)|.

for i = 1 to n-1
    find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
    if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
    if i != k, swap rows i and k of A
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)    ... each quotient lies in [-1,1]
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

How to Use Level 3 BLAS?

The current algorithm uses only level 1 and level 2 BLAS. We want to use level 3 BLAS because of their higher performance. The standard technique is called blocking or delayed updating: we save up a sequence of level 2 operations and then do them all at once.

How to Use Level 3 BLAS in LU Decomposition

- process the matrix in blocks of b columns at a time; b is called the block size
- do a complete LU decomposition of just the b columns in the current block, essentially using the BLAS2 code above
- then update the remainder of the matrix, doing b rank-one updates all at once, which turns out to be a single matrix-matrix multiplication of size b

Block GE with Level 3 BLAS

Gaussian elimination with partial pivoting, BLAS3 implementation:

... process matrix b columns at a time
for ib = 1 to n-1 step b
    ... point to end of block of b columns
    end = min(ib+b-1, n)
    ... LU factorize A(ib:n, ib:end) with BLAS2
    for i = ib to end
        find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
        if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
        if i != k, swap rows i and k of A
        A(i+1:n, i) = A(i+1:n, i) / A(i,i)
        ... only update columns i+1 to end
        A(i+1:n, i+1:end) = A(i+1:n, i+1:end) - A(i+1:n, i) * A(i, i+1:end)
    endfor

Block GE with Level 3 BLAS (cont.)

    ... Let LL be the b-by-b lower triangular
    ... matrix whose subdiagonal entries are
    ... stored in A(ib:end, ib:end), and with 1s
    ... on the diagonal. Do delayed update
    ... of A(ib:end, end+1:n) by solving
    ... n-end triangular systems
    A(ib:end, end+1:n) = LL \ A(ib:end, end+1:n)

    ... do delayed update of rest of matrix
    ... using matrix-matrix multiplication
    A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                          - A(end+1:n, ib:end) * A(ib:end, end+1:n)
endfor

Block GE with Level 3 BLAS (cont.)

- LU factorization of A(ib:n, ib:end) uses the same algorithm as before (level 2 BLAS)
- Solving a system of n-end equations with triangular coefficient matrix LL is a single call to a BLAS3 subroutine (strsm) designed for that purpose.
- No work or data motion is required to refer to LL; it is done with a pointer.
- When n >> b, almost all the work is done in the final line, which multiplies an (n-end)-by-b matrix by a b-by-(n-end) matrix in a single BLAS3 call (to sgemm).
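
A NumPy sketch of the blocked algorithm (pivoting omitted for brevity, so this assumes a matrix that is safe to factor without pivoting; the panel loop is the BLAS2 part, solve_triangular stands in for strsm, and the final @ is the sgemm call):

import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=64):
    """Blocked right-looking LU without pivoting, mirroring the BLAS3
    pseudocode above; packed factors overwrite a copy of A."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for ib in range(0, n, b):
        end = min(ib + b, n)
        for i in range(ib, end):                  # BLAS2 panel factorization
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
        LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
        A[ib:end, end:] = solve_triangular(       # strsm: delayed update of U block
            LL, A[ib:end, end:], lower=True, unit_diagonal=True)
        A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]   # sgemm update
    return A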

How to select b?

b will be chosen in a machine dependent way to maximize performance. A good value of b will have the following properties:
- b is small enough that the b columns currently being LU-factorized fit in the fast memory (cache, say) of the machine
- b is large enough to make matrix-matrix multiplication fast
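
One way to pick b in practice is simply to time a few candidates on the target machine. A rough, hypothetical tuning loop using the blocked_lu sketch from the previous slide (not part of the lecture):

import time
import numpy as np

n = 1500
A = np.random.rand(n, n) + n * np.eye(n)   # diagonally dominant: safe without pivoting
for b in (16, 32, 64, 128, 256):
    t0 = time.perf_counter()
    blocked_lu(A, b)
    print(f"b = {b:3d}: {time.perf_counter() - t0:.2f} s")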

LINPACK - LAPACK - ScaLAPACK

LINPACK - linear systems, least squares problems; level 1 BLAS; late 70s
LAPACK - redesigned LINPACK to include eigenvalue software; level 3 BLAS, for vector and shared memory parallel machines; late 80s
ScaLAPACK - scalable LAPACK, based on the BLACS for communication; distributed memory machines; mid 90s

Efficiency on Cray C90

Comparison of Different Machines

Machine               Procs   Clock Speed (MHz)   Peak Mflops   Block Size b
-------------------   -----   -----------------   -----------   ------------
Convex C4640          1       135                 810           64
Convex C4640          4       135                 3240          64
Cray C90              1       240                 952           128
Cray C90              16      240                 15238         128
DEC Alpha 3000-500X   1       200                 200           32
IBM RS 6000/590       1       66                  264           64
SGI Power Challenge   1       75                  300           64

Efficiency of LAPACK LU, for n = 1000

LU factorization is almost as efficient as matrix-matrix multiply for most machines, except on the C90 (16 processors). (Why?) LAPACK LU is almost as good as the best vendor effort. There is a trade-off between performance and portability. Vendors place a premium on LU performance - why?

LINPACK Benchmark

- named after the LINPACK package
- originally consisted of timings for 100-by-100 matrices; no vendor optimization (code changes) permitted
- an interesting historical record, with literally every machine of the last two decades listed in decreasing order of speed, from the largest supercomputers to a hand-held calculator
- as machines grew faster, 1000-by-1000 matrices were introduced (all code changes allowed)
- a third benchmark was added for large parallel machines, which measured their speed on the largest linear system that would fit in memory, as well as the size of the system required to get half the Mflop rate of the largest matrix

LINPACK Benchmark

Computer                                      Num_Procs   Rmax (GFlops)   Nmax (order)   N1/2 (order)   Rpeak (GFlops)
-------------------------------------------   ---------   -------------   ------------   ------------   --------------
Intel ASCI Option Red (200 MHz Pentium Pro)   7264        1068.           215000         53400          1453
CP-PACS (150 MHz PA-RISC based CPU)           2048        368.2           103680         30720          614
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)     6768        281.1           128600         25700          338
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)     6144        256.2           122500         24300          307
Numerical Wind Tunnel (9.5 ns)                167         229.7           66132          18018          281
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)     5376        223.6           114500         22900          269
HITACHI SR2201/1024 (150 MHz)                 1024        220.4           138240         34560          307
Fujitsu VPP500/153 (10 nsec)                  153         200.6           62730          17000          245
Numerical Wind Tunnel (9.5 ns)                140         195.0           60480          15730          236
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)     4608        191.5           106000         21000          230
Numerical Wind Tunnel (9.5 ns)                128         179.2           56832          14800          216

Efficiency of LAPACK LU, for n = 100

Data Layouts for Distributed Memory Machines

The two main issues in choosing a data layout for Gaussian elimination are:
1) load balance, or splitting the work reasonably evenly among the processors
2) ability to use the BLAS3 during computations on a single processor, to account for the memory hierarchy on each processor

Several layouts will be discussed here. All of these are part of HPF. Solving linear systems served as a prototype for these designs.

Gaussian Elimination using BLAS 3

Column Blocked

- column i is stored on processor floor(i/c), where c = ceiling(n/p) is the maximum number of columns stored per processor
- does not permit good load balancing: after c columns have been computed, processor 0 is idle
- a row blocked layout has a similar problem

n = 16 and p = 4.
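
A two-line helper makes the mapping concrete (a sketch of my own; 0-based indexing assumed):

from math import ceil

def column_blocked_owner(i, n, p):
    """Processor owning column i under the column blocked layout:
    floor(i / c) with c = ceil(n / p)."""
    return i // ceil(n / p)

print([column_blocked_owner(i, n=16, p=4) for i in range(16)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]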

Column Cyclic

- each processor owns approximately 1/p-th of the square southeast corner of the matrix: good load balance
- single columns are stored rather than blocks, which means we cannot use the BLAS3 to update
- the transpose of this layout, the Row Cyclic Layout, has a similar problem

Column Block Cyclic

- choose a block size b, divide the columns into groups of size b, and distribute these groups cyclically
- for b > 1, slightly worse load balance than the Column Cyclic Layout, but can use the BLAS2 and BLAS3
- for b < c, better load balance than the Column Blocked Layout, but can only call the BLAS on smaller subproblems and take less advantage of the local memory hierarchy
- disadvantage: the factorization of A(ib:n, ib:end) will take place on perhaps just one processor; a possible serial bottleneck

n = 16, p = 4 and b = 2. (b is not necessarily the BLAS3 block size.)

Row and Column Block Cyclic

- processors and matrix blocks are distributed in a 2D array
- pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol
- the serial bottleneck is eased
- need not be symmetric in rows and columns
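
The owner of an entry under the 2D block cyclic layout follows from two independent 1D block cyclic maps. A small sketch (0-based indices; the prow x pcol grid shape is an assumption for illustration):

def block_cyclic_owner(i, j, brow, bcol, prow, pcol):
    """Grid coordinates of the processor owning entry (i, j) under a
    2D block cyclic layout with block sizes brow x bcol."""
    return ((i // brow) % prow, (j // bcol) % pcol)

# 2x2 processor grid with 2x2 blocks: entry (5, 9) lands on processor (0, 0)
print(block_cyclic_owner(5, 9, brow=2, bcol=2, prow=2, pcol=2))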

Skewered Block

- each row and each column is shared among all p processors, so p-fold parallelism is available for any row operation or any column operation
- in contrast, the 2D block cyclic layout can have at most sqrt(p)-fold parallelism in all the rows and all the columns
- not useful for Gaussian elimination, but useful in a variety of other matrix operations

Distributed GE with a 2D Block Cyclic Layout

- the block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol
- shaded regions indicate busy processors or communication performed
- it is unnecessary to have a barrier between each step of the algorithm; e.g., steps 9, 10, and 11 can be pipelined


ScaLAPACK LU Performance Results

Teraflop/s Performance Result

"Sorry for the delay in responding. The system had about 7000 200 MHz Pentium Pro processors. It solved a 64-bit real matrix of size 216,000. It did not use Strassen. The algorithm was basically the same that Robert van de Geijn used on the Delta years ago. It does a 2D block cyclic map of the matrix and requires a power of 2 number of nodes in the vertical direction. The basic block size was 64x64. A custom dual processor matrix multiply was written for the DGEMM call. It took a little less than 2 hours to run."