# CS 267 Applications of Parallel Processors Lecture 9: Computational Electromagnetics Large Dense Lin - PowerPoint PPT Presentation


1
CS 267 Applications of Parallel Processors
Lecture 9: Computational Electromagnetics - Large Dense Linear Systems
• 2/19/97
• Horst D. Simon
• http://www.cs.berkeley.edu/~cs267

2
Outline - Lecture 9
- Computational Electromagnetics - sources of large dense linear systems
- Review of solution of linear systems with Gaussian elimination
- BLAS and memory hierarchy for linear algebra kernels
3
Outline - Lecture 10
- Layout of matrices on distributed memory machines
- Distributed Gaussian elimination
- Speeding up with advanced algorithms
- LINPACK and LAPACK
- LINPACK benchmark
- Tflops result
4
Outline - Lecture 11
- Designing portable libraries for parallel machines
- BLACS
- ScaLAPACK for dense linear systems
- other linear algebra algorithms in ScaLAPACK
5
Computational Electromagnetics
- developed during the 1980s, driven by defense applications
- determine the RCS (radar cross section) of an airplane
- reduce the signature of a plane (stealth technology)
- other applications: antenna design, medical equipment
- two fundamental numerical approaches: MOM, the method of moments (frequency domain), and finite differences (time domain)
6
Computational Electromagnetics
- discretize the surface into triangular facets using standard modeling tools
- the amplitudes of the currents on the surface are the unknowns
- the integral equation is discretized into a set of linear equations

image: NW Univ. Comp. Electromagnetics Laboratory http://nueml.ece.nwu.edu/
7
Computational Electromagnetics (MOM)
After discretization the integral equation has the form

    Z J = V

where Z is the impedance matrix, J is the unknown vector of amplitudes, and V is the excitation vector. Z is given as a four-dimensional integral. (See Cwik, Patterson, and Scott, "Electromagnetic Scattering on the Intel Touchstone Delta," IEEE Supercomputing 92, pp. 538-542.)
8
Computational Electromagnetics (MOM)
The main steps in the solution process are:
A) computing the matrix elements
B) factoring the dense matrix
C) solving for one or more excitations
D) computing the fields scattered from the object
9
Analysis of MOM for Parallel Implementation
Step         Work    Parallelism       Speed
Fill         O(n2)   embarrassing      low
Factor       O(n3)   moderately diff.  very high
Solve        O(n2)   moderately diff.  high
Field Calc.  O(n)    embarrassing      high

For most scientific applications the biggest gain in performance can be obtained by focusing on the most expensive step - here the O(n3) factorization.
10
Results for Parallel Implementation on Delta
Step         Time (hours)   Gflop/s
Fill              9.20        1.0
Factor            8.25       10.35
Solve             2.17         -
Field Calc.       0.12        3.0

The problem solved was for a matrix of size 48,672. (The world record in 1991.)
11
Current Records for Solving Dense Systems
Year    Size     Machine
1950's  O(100)
1991    55,296   CM-2
1992    75,264   Intel
1993    75,264   Intel
1994    76,800   CM-5
1995    128,600  Paragon XP
1996    215,000  ASCI Red

source: Alan Edelman http://www-math.mit.edu/~edelman/records.html
12
Sources for large dense linear systems
- not many outside CEM
- even within the CEM community, alternatives such as FD-TD are heavily debated

In many instances choices of algorithms or methods in existing scientific codes or applications are not the result of careful planning and design. At best they reflect the state-of-the-art at the time; at worst they are purely coincidental.
13
Review of Gaussian Elimination
Given a dense matrix, add multiples of each row to subsequent rows in order to create zeros below the diagonal, ending up with an upper triangular matrix U. Solve a linear system with U by substitution, starting with the last variable.

see Demmel http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture12/lecture12.html
14
Review of Gaussian Elimination (cont.)
... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
    ... for each row j below row i
    for j = i+1 to n
        ... add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)
15
Review of Gaussian Elimination (cont.)
16
Review of Gaussian Elimination (cont.)
... for each column i,
... zero it out below the diagonal by
... adding multiples of row i to later rows
for i = 1 to n-1
    ... for each row j below row i
    for j = i+1 to n
        ... add a multiple of row i to row j
        for k = i to n
            A(j,k) = A(j,k) - (A(j,i)/A(i,i)) * A(i,k)

(the figure highlights the multiplier m = A(j,i)/A(i,i))
17
Review of Gaussian Elimination (cont.)
for i = 1 to n-1
    for j = i+1 to n
        m = A(j,i)/A(i,i)
        for k = i+1 to n
            A(j,k) = A(j,k) - m * A(i,k)

(starting k at i+1 avoids computation of the known matrix entry)
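The hoisted-multiplier loop above can be transcribed almost line for line into Python/NumPy (a sketch, not code from the lecture; 0-based indexing, no pivoting):

```python
import numpy as np

def gaussian_eliminate(A):
    """Reduce A to upper triangular form (no pivoting)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        for j in range(i + 1, n):
            m = A[j, i] / A[i, i]          # the multiplier m
            for k in range(i + 1, n):      # skip the known entry at k = i
                A[j, k] -= m * A[i, k]
            A[j, i] = 0.0                  # eliminated by construction
    return A

U = gaussian_eliminate(np.array([[2., 1.], [4., 5.]]))
# U → [[2., 1.], [0., 3.]]
```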
18
Review of Gaussian Elimination (cont.)
It will be convenient to store the multipliers m in the implicitly created zeros below the diagonal, so we can use them later to transform the right-hand side b:

for i = 1 to n-1
    for j = i+1 to n
        A(j,i) = A(j,i)/A(i,i)
    for j = i+1 to n
        for k = i+1 to n
            A(j,k) = A(j,k) - A(j,i) * A(i,k)
19
Review of Gaussian Elimination (cont.)
Now we use Matlab (data parallel) notation to express the algorithm even more compactly:

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector operation. Note that the loop looks elegant, but is no longer intuitive.
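The slicing form maps directly onto NumPy. A minimal sketch (0-based indexing, no pivoting, so it assumes nonzero pivots) that also stores the multipliers below the diagonal as on the previous slide:

```python
import numpy as np

def lu_in_place(A):
    """Overwrite A with U above the diagonal and the multipliers
    (the entries of L) below it."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] = A[i+1:, i] / A[i, i]                   # vector operation
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])   # rank-one update
    return A

F = lu_in_place(np.array([[4., 3.], [6., 3.]]))
L = np.tril(F, -1) + np.eye(2)   # unit lower triangular
U = np.triu(F)
# L @ U reproduces the original matrix
```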
20
Review of Gaussian Elimination (cont.)
21
Review of Gaussian Elimination (cont.)
Lemma (LU Factorization). If the above algorithm terminates (i.e. it did not try to divide by zero) then A = LU.

Now we can state our complete algorithm for solving Ax = b:
1) Factorize A = LU.
2) Solve Ly = b for y by forward substitution.
3) Solve Ux = y for x by backward substitution.

Then x is the solution we seek because Ax = L(Ux) = Ly = b.
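Steps 2 and 3 can be sketched in Python (hypothetical helper, not from the slides): given the in-place factors, forward substitution uses the stored multipliers and the unit diagonal of L, then backward substitution divides by the pivots of U.

```python
import numpy as np

def solve_lu(F, b):
    """Solve Ax = b given the in-place LU factors F
    (L below the diagonal with unit diagonal, U above)."""
    n = len(b)
    y = b.astype(float).copy()
    for i in range(n):                      # forward: Ly = b
        y[i] -= F[i, :i] @ y[:i]
    x = y
    for i in range(n - 1, -1, -1):          # backward: Ux = y
        x[i] = (x[i] - F[i, i+1:] @ x[i+1:]) / F[i, i]
    return x

F = np.array([[4., 3.], [1.5, -1.5]])       # factors of A = [[4,3],[6,3]]
x = solve_lu(F, np.array([10., 12.]))       # b = A @ [1, 2]
# x → [1., 2.]
```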
22
Review of Gaussian Elimination (cont.)
Here are some obvious problems with this algorithm, which we need to address:
- If A(i,i) is zero, the algorithm cannot proceed. If A(i,i) is tiny, we will also have numerical problems.
- The majority of the work is done by a rank-one update, which does not exploit a memory hierarchy as well as an operation like matrix-matrix multiplication.
23
Pivoting for Small A(i,i)
Why is pivoting needed?

    A = [ 0  1 ]
        [ 1  0 ]

Even if A(i,i) is tiny but not zero, difficulties can arise (see the example in Jim Demmel's lecture notes). This problem is resolved by partial pivoting.
24
Partial Pivoting
Reorder the rows of A so that A(i,i) is large at each step of the algorithm. At step i of the algorithm, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n, i)|.

for i = 1 to n-1
    find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
    if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
    if i != k, swap rows i and k of A
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)   ... each quotient lies in [-1,1]
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
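The pivoted pseudocode above, as a NumPy sketch (0-based; returns a permutation p so that A[p] = L U, which is one way of keeping the permutation rather than carrying a full permutation matrix):

```python
import numpy as np

def lu_partial_pivot(A):
    """LU with partial pivoting; returns factors F and permutation p
    with A[p] = L @ U."""
    A = A.astype(float).copy()
    n = A.shape[0]
    p = np.arange(n)
    for i in range(n - 1):
        k = i + np.argmax(np.abs(A[i:, i]))     # find the pivot row
        if A[k, i] == 0:
            raise ValueError("A is singular, or nearly so")
        if k != i:
            A[[i, k]] = A[[k, i]]               # swap rows i and k
            p[[i, k]] = p[[k, i]]
        A[i+1:, i] /= A[i, i]                   # quotients lie in [-1, 1]
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A, p

A = np.array([[0., 1.], [1., 0.]])              # the 2-by-2 example
F, p = lu_partial_pivot(A)
# rows were swapped: A[p] equals L @ U
```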
25
Partial Pivoting (cont.)
- for the 2-by-2 example, we get a very accurate answer
- several choices as to when to swap rows i and k
- could use indirect addressing and not swap them at all, but this would be slow
- if we keep the permutation, then solving Ax = b only requires the additional step of permuting b
26
Fast linear algebra kernels BLAS
- Simple linear algebra kernels such as matrix-matrix multiply (exercise) can be performed fast on memory hierarchies.
- More complicated algorithms can be built from some very basic building blocks and kernels.
- The interfaces of these kernels have been standardized as the Basic Linear Algebra Subroutines, or BLAS.
- Early agreement on a standard interface (around 1980) led to portable libraries for vector and shared memory parallel machines.
- BLAS are classified into three categories: levels 1, 2, and 3.

see Demmel http://HTTP.CS.Berkeley.EDU/~demmel/cs267/lecture02.html
27
Level 1 BLAS
Operate mostly on vectors (1D arrays) or pairs of vectors; perform O(n) operations; return either a vector or a scalar.

Examples:
saxpy: y(i) = a * x(i) + y(i), for i = 1 to n. Saxpy is an acronym for the operation: S stands for single precision; daxpy is for double precision, caxpy for complex, and zaxpy for double complex.
sscal: y = a * x
srot: replaces vectors x and y by c*x + s*y and -s*x + c*y, where c and s are typically a cosine and sine.
sdot: computes s = sum_{i=1}^{n} x(i) * y(i)
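In NumPy these level 1 operations are one-liners (a sketch; NumPy dispatches to the platform BLAS for some of these, but the point here is just the shape of the operations):

```python
import numpy as np

a = 2.0
x = np.array([1., 2., 3.])
y = np.array([10., 20., 30.])

# saxpy: y = a*x + y -- one pass over the data, O(n) flops
y = a * x + y                  # → [12., 24., 36.]

# sdot: s = sum_i x(i) * y(i)
s = x @ y
```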
28
Level 2 BLAS
Operate mostly on a matrix (2D array) and a vector; return a matrix or a vector; O(n2) operations.

Examples:
sgemv: matrix-vector multiplication; computes y = y + A*x, where A is m-by-n, x is n-by-1 and y is m-by-1.
sger: rank-one update; computes A = A + y*x', where A is m-by-n, y is m-by-1, x is n-by-1, and x' is the transpose of x. This is a short way of saying A(i,j) = A(i,j) + y(i)*x(j) for all i, j.
strsv: triangular solve; solves T*x = y for x, where T is a triangular matrix.
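The same two level 2 operations as a NumPy sketch (illustrative shapes only):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])   # m-by-n, m=3, n=2
x = np.array([1., 1.])                         # n-by-1
y = np.array([1., 1., 1.])                     # m-by-1

# sgemv: y = y + A x  (O(m*n) flops on O(m*n) data)
y2 = y + A @ x                                 # → [4., 8., 12.]

# sger: rank-one update A = A + y x'
A2 = A + np.outer(y, x)                        # adds 1 to every entry here
```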
29
Level 3 BLAS
Operate on pairs or triples of matrices, returning a matrix; complexity is O(n3).

Examples:
sgemm: matrix-matrix multiplication; computes C = C + A*B, where C is m-by-n, A is m-by-k, and B is k-by-n.
strsm: multiple triangular solve; solves T*X = Y for X, where T is a triangular matrix and X is a rectangular matrix.
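A NumPy sketch of both level 3 kernels. Note `np.linalg.solve` does not exploit the triangular structure the way strsm does; it stands in here only to show the operation:

```python
import numpy as np

m, k, n = 3, 4, 2
A = np.ones((m, k)); B = np.ones((k, n)); C = np.zeros((m, n))

# sgemm: C = C + A B -- 2*m*k*n flops on only O(mk + kn + mn) data,
# which is why BLAS3 can be cache-friendly
C = C + A @ B                        # every entry becomes k = 4

# strsm-style multiple triangular solve: find X with T X = Y
T = np.array([[2., 0.], [1., 1.]])   # lower triangular
Y = np.array([[2., 4.], [3., 5.]])
X = np.linalg.solve(T, Y)
```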
30
Performance of BLAS
(figure: BLAS performance - Level 3 fastest, then Level 2, then Level 1)
31
Performance of BLAS (cont.)
- The BLAS are specially optimized by the vendor (IBM) to take advantage of all features of the RS 6000/590.
- Potentially a big speed advantage if an algorithm can be expressed in terms of the BLAS3 instead of BLAS2 or BLAS1.
- The top speed of the BLAS3, about 250 Mflops, is very close to the peak machine speed of 266 Mflops.
- We will reorganize algorithms, like Gaussian elimination, so that they use BLAS3 rather than BLAS1 or BLAS2.
32
Explanation of Performance of BLAS
m = number of memory references to slow memory (read + write)
f = number of floating point operations
q = f/m = average number of flops per slow memory reference

Kernel  m                                        f     q
saxpy   3n  (read x(i), y(i); write y(i))        2n    2/3
sgemv   n2 + O(n)  (read each A(i,j) once)       2n2   2
sgemm   4n2  (read A, B, C; write C(i,j) once)   2n3   n/2
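The table's ratios can be checked numerically. A small sketch (the counts are the ones from the table, evaluated at n = 1000):

```python
# Flop-to-memory-reference ratios q = f/m for the three BLAS levels.
n = 1000

kernels = {
    "saxpy": (2 * n,    3 * n),           # (f, m): read x, y; write y
    "sgemv": (2 * n**2, n**2 + 3 * n),    # read each A(i,j) once, plus x, y
    "sgemm": (2 * n**3, 4 * n**2),        # read A, B, C; write C once
}
for name, (f, m) in kernels.items():
    print(f"{name}: q = {f / m:.1f}")
# sgemm's q grows like n/2, so it can keep fast memory busy;
# saxpy's q is stuck at 2/3 no matter how large n gets
```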
33
CS 267 Applications of Parallel Processors
Lecture 10: Large Dense Linear Systems - Distributed Implementations
• 2/21/97
• Horst D. Simon
• http://www.cs.berkeley.edu/~cs267

34
Review - Lecture 9
- computational electromagnetics and linear systems
- rewrote Gaussian elimination as vector and matrix-vector operations (level 2 BLAS)
- discussed the efficiency of level 3 BLAS in terms of reducing the number of memory accesses
35
Outline - Lecture 10
- Layout of matrices on distributed memory machines
- Distributed Gaussian elimination
- Speeding up with advanced algorithms
- LINPACK and LAPACK
- LINPACK benchmark
- Tflops result
36
Review of Gaussian Elimination
Now we use Matlab (data parallel) notation to express the algorithm even more compactly:

for i = 1 to n-1
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

The inner loop consists of one vector operation and one matrix-vector operation. Note that the loop looks elegant, but is no longer intuitive.
37
Review of Gaussian Elimination (cont.)
38
Partial Pivoting
Reorder the rows of A so that A(i,i) is large at each step of the algorithm. At step i of the algorithm, row i is swapped with row k > i if |A(k,i)| is the largest entry among |A(i:n, i)|.

for i = 1 to n-1
    find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
    if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
    if i != k, swap rows i and k of A
    A(i+1:n, i) = A(i+1:n, i) / A(i,i)   ... each quotient lies in [-1,1]
    A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)
39
How to Use Level 3 BLAS ?
The current algorithm only uses level 1 and level 2 BLAS. We want to use level 3 BLAS because of their higher performance. The standard technique is called blocking, or delayed updating: we save up a sequence of level 2 operations and do them all at once.
40
How to Use Level 3 BLAS in LU Decomposition
- process the matrix in blocks of b columns at a time; b is called the block size
- do a complete LU decomposition just of the b columns in the current block, essentially using the above BLAS2 code
- then update the remainder of the matrix, doing b rank-one updates all at once, which turns out to be a single matrix-matrix multiplication of size b
41
Block GE with Level 3 BLAS
42
Block GE with Level 3 BLAS
Gaussian elimination with partial pivoting, BLAS3 implementation:

... process matrix b columns at a time
for ib = 1 to n-1 step b
    ... point to end of block of b columns
    end = min(ib+b-1, n)
    ... LU factorize A(ib:n, ib:end) with BLAS2
    for i = ib to end
        find and record k where |A(k,i)| = max_{i <= j <= n} |A(j,i)|
        if |A(k,i)| = 0, exit with a warning that A is singular, or nearly so
        if i != k, swap rows i and k of A
        A(i+1:n, i) = A(i+1:n, i) / A(i,i)
        ... only update columns i+1 to end
        A(i+1:n, i+1:end) = A(i+1:n, i+1:end) - A(i+1:n, i) * A(i, i+1:end)
    endfor
43
Block GE with Level 3 BLAS (cont.)
... Let LL be the b-by-b lower triangular
... matrix whose subdiagonal entries are
... stored in A(ib:end, ib:end), and with 1s
... on the diagonal. Do delayed update
... of A(ib:end, end+1:n) by solving
... n-end triangular systems
... (A(ib:end, end+1:n) is pink in the figure)
A(ib:end, end+1:n) = LL \ A(ib:end, end+1:n)

... do delayed update of rest of matrix
... using matrix-matrix multiplication
... (A(end+1:n, end+1:n) is green in the figure)
... (A(end+1:n, ib:end) is blue in the figure)
A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(end+1:n, ib:end) * A(ib:end, end+1:n)
endfor
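The whole blocked algorithm can be sketched in NumPy. This version omits pivoting to stay short (so it assumes the pivots are safe, e.g. a diagonally dominant matrix), and uses `np.linalg.solve` where a real implementation would call the BLAS3 strsm and sgemm:

```python
import numpy as np

def blocked_lu(A, b=2):
    """Blocked right-looking LU without pivoting: BLAS2-style panel
    factorization, then one triangular solve plus one matrix multiply
    as the delayed update."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for ib in range(0, n, b):
        end = min(ib + b, n)
        # factorize the panel A[ib:n, ib:end] with the BLAS2 code
        for i in range(ib, end):
            A[i+1:, i] /= A[i, i]
            A[i+1:, i+1:end] -= np.outer(A[i+1:, i], A[i, i+1:end])
        if end < n:
            # delayed update: LL \ A[ib:end, end:n] (strsm in BLAS3)
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            # one matrix-matrix multiply updates the trailing submatrix (sgemm)
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A

rng = np.random.default_rng(0)
M = rng.random((6, 6)) + 6 * np.eye(6)   # diagonally dominant: safe without pivoting
F = blocked_lu(M, b=2)
L = np.tril(F, -1) + np.eye(6)
U = np.triu(F)
# L @ U reconstructs M
```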
44
Block GE with Level 3 BLAS (cont.)
- LU factorization of A(ib:n, ib:end) uses the same algorithm as before (level 2 BLAS).
- Solving a system of n-end equations with triangular coefficient matrix LL is a single call to a BLAS3 subroutine (strsm) designed for that purpose.
- No work or data motion is required to refer to LL; it is done with a pointer.
- When n >> b, almost all the work is done in the final line, which multiplies an (n-end)-by-b matrix times a b-by-(n-end) matrix in a single BLAS3 call (to sgemm).
45
How to select b?
b will be chosen in a machine-dependent way to maximize performance. A good value of b will have the following properties:
- b is small enough that the b columns currently being LU-factorized fit in the fast memory (cache, say) of the machine.
- b is large enough to make matrix-matrix multiplication fast.
46
LINPACK - LAPACK - ScaLAPACK

LINPACK   - linear systems, least squares problems; level 1 BLAS - late 70s
LAPACK    - redesigned LINPACK to include eigenvalue software; level 3 BLAS for vector and shared memory parallel machines - late 80s
ScaLAPACK - scalable LAPACK, based on the BLACS for communication; distributed memory machines - mid 90s
47
Efficiency on Cray C90
48
Comparison of Different Machines
Machine               Procs  Clock (MHz)  Peak Mflops  Block Size b
Convex C4640            1       135           810          64
Convex C4640            4       135          3240          64
Cray C90                1       240           952         128
Cray C90               16       240         15238         128
DEC Alpha 3000-500X     1       200           200          32
IBM RS 6000/590         1        66           264          64
SGI Power Challenge     1        75           300          64
49
Efficiency of LAPACK LU, for n = 1000
50
Efficiency of LAPACK LU, for n = 1000

LU factorization is almost as efficient as matrix-matrix multiply for most machines, except on the C90 (16 processors). (Why?) LAPACK LU is almost as good as the best vendor effort. There is a trade-off between performance and portability. Vendors place a premium on LU performance - why?
51
LINPACK Benchmark
- named after the LINPACK package
- originally consisted of timings for 100-by-100 matrices; no vendor optimization (code changes) permitted
- an interesting historical record, with literally every machine of the last two decades listed in decreasing order of speed, from the largest supercomputers to a hand-held calculator
- as machines grew faster, 1000-by-1000 matrices were introduced (all code changes allowed)
- a third benchmark was added for large parallel machines, which measures their speed on the largest linear system that will fit in memory, as well as the size of the system required to get half the Mflop rate of the largest matrix
52
LINPACK Benchmark
Computer                                     Num_Procs  Rmax(GFlops)  Nmax(order)  N1/2(order)  Rpeak(GFlops)
Intel ASCI Option Red (200 MHz Pentium Pro)     7264      1068.         215000       53400         1453
CP-PACS (150 MHz PA-RISC based CPU)             2048       368.2        103680       30720          614
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)       6768       281.1        128600       25700          338
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)       6144       256.2        122500       24300          307
Numerical Wind Tunnel (9.5 ns)                   167       229.7         66132       18018          281
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)       5376       223.6        114500       22900          269
HITACHI SR2201/1024 (150 MHz)                   1024       220.4        138240       34560          307
Fujitsu VPP500/153 (10 nsec)                     153       200.6         62730       17000          245
Numerical Wind Tunnel (9.5 ns)                   140       195.0         60480       15730          236
Intel Paragon XP/S MP (50 MHz, OS=SUNMOS)       4608       191.5        106000       21000          230
Numerical Wind Tunnel (9.5 ns)                   128       179.2         56832       14800          216
53
Efficiency of LAPACK LU, for n = 100
54
Data Layouts for Distributed Memory Machines
The two main issues in choosing a data layout for Gaussian elimination are:
1) load balance, or splitting the work reasonably evenly among the processors
2) the ability to use the BLAS3 during computations on a single processor, to account for the memory hierarchy on each processor.

Several layouts will be discussed here. All of these are part of HPF. Solving linear systems served as a prototype for these designs.
55
Gaussian Elimination using BLAS 3
56
Column Blocked
- column i is stored on processor floor(i/c), where c = ceiling(n/p) is the maximum number of columns stored per processor
- does not permit good load balancing: after c columns have been computed, processor 0 is idle
- a row blocked layout has a similar problem

(figure: n = 16 and p = 4)
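The owner computation for this layout is a one-liner. A sketch with 0-based columns and processors (hypothetical helper name, not from the slides):

```python
import math

def owner_column_blocked(i, n, p):
    """Column i lives on processor floor(i/c), c = ceil(n/p)."""
    c = math.ceil(n / p)            # max columns per processor
    return i // c

n, p = 16, 4                        # the figure's example
owners = [owner_column_blocked(i, n, p) for i in range(n)]
# → [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
# processor 0 owns only the first 4 columns, hence goes idle first
```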
57
Column Cyclic
- each processor owns approximately 1/p-th of the square southeast corner of the matrix: good load balance
- single columns are stored rather than blocks, which means we cannot use the BLAS3 to update
- the transpose of this layout, the Row Cyclic Layout, has a similar problem
58
Column Block Cyclic
- choose a block size b, divide the columns into groups of size b, and distribute these groups cyclically
- for b > 1, slightly worse load balance than the Column Cyclic Layout, but can use the BLAS2 and BLAS3
- for b < c, better load balance than the Column Blocked Layout, but we can only call the BLAS on smaller subproblems and so take less advantage of the local memory hierarchy
- disadvantage: the factorization of A(ib:n, ib:end) will take place on perhaps just one processor - a possible serial bottleneck

(figure: n = 16, p = 4 and b = 2; b is not necessarily the BLAS3 block size)
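The block-cyclic owner mapping generalizes both previous layouts (a 0-based sketch; with b = 1 it reduces to column cyclic, with b = c to column blocked):

```python
def owner_block_cyclic(i, b, p):
    """Blocks of b columns are dealt cyclically to the p processors."""
    return (i // b) % p

n, p, b = 16, 4, 2                  # the figure's example
owners = [owner_block_cyclic(i, b, p) for i in range(n)]
# → [0,0, 1,1, 2,2, 3,3, 0,0, 1,1, 2,2, 3,3]
```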
59
Row and Column Block Cyclic
- processors and matrix blocks are distributed in a 2D array
- pcol-fold parallelism in any column, and calls to the BLAS2 and BLAS3 on matrices of size brow-by-bcol
- the serial bottleneck is eased
- need not be symmetric in rows and columns
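The 2D mapping is the two 1D block-cyclic mappings applied independently to rows and columns. A sketch (0-based; `owner_2d` is a hypothetical helper returning a processor's grid coordinates):

```python
def owner_2d(i, j, brow, bcol, prow, pcol):
    """Matrix element (i, j) on a prow-by-pcol processor grid with
    brow-by-bcol blocks dealt cyclically in both dimensions."""
    return ((i // brow) % prow, (j // bcol) % pcol)

# 2-by-2 processor grid, 2-by-2 blocks:
print(owner_2d(4, 6, 2, 2, 2, 2))   # → (0, 1)
print(owner_2d(0, 0, 2, 2, 2, 2))   # → (0, 0)
```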
60
Skewered Block
- each row and each column is shared among all p processors, so p-fold parallelism is available for any row operation or any column operation
- in contrast, the 2D block cyclic layout can have at most sqrt(p)-fold parallelism in all the rows and all the columns
- not useful for Gaussian elimination, but useful in a variety of other matrix operations
61
Distributed GE with a 2D Block Cyclic Layout
- the block size b in the algorithm and the block sizes brow and bcol in the layout satisfy b = brow = bcol
- shaded regions indicate busy processors or communication performed
- it is unnecessary to have a barrier between each step of the algorithm; e.g. steps 9, 10, and 11 can be pipelined
62
Distributed GE with a 2D Block Cyclic Layout
64
ScaLAPACK LU Performance Results
65
Teraflop/s Performance Result
Sorry for the delay in responding. The system