Title: The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
1. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema
Parallel and Distributed Systems Group, TU Delft
ACM/IEEE Intl. Symposium on High Performance Distributed Computing
2. The VL-e project
Natural gas price ? for grid computing
- A grid project in the Netherlands (2004-)
- Natural gas money: VL-e 45 MEuro / 800 MEuro total research package
- Overall aim
  - to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).
- Goals
  - create prototypes of application-specific e-science environments
  - design and develop re-usable ICT/grid components
  - validate with real-life applications in testbeds
3. The VL-e project: application areas
- Philips, Unilever, IBM
- Data Intensive Science
- Medical Diagnosis & Imaging
- Bio-Diversity
- Bio-Informatics
- Food Informatics
- Dutch Telescience
- Virtual Laboratory (VL) Application Oriented Services
- Management of comm. & computing
4. The VL-e project: application areas
- Philips, Unilever, IBM
- Data Intensive Science
- Bags-of-Tasks
- Medical Diagnosis & Imaging
- Bio-Diversity
- Bio-Informatics
- Food Informatics
- Dutch Telescience
- Virtual Laboratory (VL) Application Oriented Services
- Management of comm. & computing
6. The Challenge
- Complete scientific work better,
  - User-oriented performance metrics (time: a critical performance component)
  - Bags-of-tasks for ease-of-use
- in real systems
  - Workloads (now that real traces are available)
  - Information unavailability
- What to do?
  - Hint: the next 10% improvement won't cut it!
7. The Challenge (contd.)
- System model: What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks?
- Input model: What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?
- What is the best setup for such system/input?
  - How to find the best?
  - If a best is found, can there be another?
8. The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
- Introduction and Motivation
- Context: System Model
- Workload Model
- Design Space Exploration
- Conclusion
9. Context: System Model 1/4: Overview
- System Model
  - Clusters: execute jobs
  - Resource managers: coordinate job execution
  - Resource management architectures: route jobs among resource managers
  - Task selection policies: create the eligible set
  - Task scheduling policies: schedule the eligible set
10. Context: System Model 2/4: Resource Management Architectures (route jobs among resource managers)
11. Context: System Model 3/4: Task Selection Policies (create the eligible set)
- Age-based
  - S-T: Select Tasks in the order of their arrival.
  - S-BoT: Select BoTs in the order of their arrival.
- User priority based
  - S-U-Prio: Select the tasks of the User with the highest Priority.
- Based on fairness in resource consumption
  - S-U-T: Select the Tasks of the User with the lowest resource consumption.
  - S-U-BoT: Select the BoTs of the User with the lowest resource consumption.
  - S-U-GRR: Select the User Round-Robin / all tasks for this user.
  - S-U-RR: Select the User Round-Robin / one task for this user.
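Three of the selection policies above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the `Task` record, the function names, and the `consumed` dictionary of per-user resource consumption are hypothetical, and taking a BoT's arrival time to be the arrival of its earliest task is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    user: str
    bot_id: int      # bag-of-tasks this task belongs to
    arrival: float   # arrival time of the task


def select_s_t(queue):
    """S-T: order tasks by their arrival time."""
    return sorted(queue, key=lambda t: (t.arrival, t.task_id))


def select_s_bot(queue):
    """S-BoT: order BoTs by arrival; tasks of one BoT stay together.

    Assumption: a BoT arrives when its earliest task arrives.
    """
    bot_arrival = {}
    for t in queue:
        bot_arrival[t.bot_id] = min(bot_arrival.get(t.bot_id, t.arrival), t.arrival)
    return sorted(
        queue,
        key=lambda t: (bot_arrival[t.bot_id], t.bot_id, t.arrival, t.task_id),
    )


def select_s_u_t(queue, consumed):
    """S-U-T: tasks of the user with the lowest resource consumption so far.

    `consumed` maps user name to consumed resources (hypothetical bookkeeping);
    ties between users are broken by name.
    """
    users = {t.user for t in queue}
    chosen = min(users, key=lambda u: (consumed.get(u, 0.0), u))
    return select_s_t([t for t in queue if t.user == chosen])
```

Each function returns the eligible set in the order in which a scheduling policy would then consider it; the remaining policies would follow the same pattern with a different sort key or user choice.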
12. Context: System Model 4/4: Task Scheduling Policies (schedule the eligible set)
- Information availability
  - Known
  - Unknown
  - Historical records
- Sample policies
  - Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))
  - Fastest Processor First (FPF)
  - (Dynamic) Fastest Processor Largest Task ((D)FPLT)
  - Shortest Task First w/ Replication (STFR)
  - Work Queue w/ Replication (WQR)
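As an illustration of one of the sample policies, the following is a minimal Python sketch of a greedy Earliest Completion Time assignment. It assumes that all tasks are available at time 0, that task work and processor speeds are known (the "Known" information case), and that ties go to the lower-numbered processor; the function name and argument layout are hypothetical and not taken from the article.

```python
def schedule_ect(task_work, proc_speed):
    """Greedy ECT sketch: place each task, in queue order, on the processor
    where it would complete earliest.

    task_work:  amount of work per task (time units on a speed-1.0 processor)
    proc_speed: relative speed of each processor
    Returns (assignment, makespan), where assignment[i] is the pair
    (processor index, completion time) of task i.
    """
    free_at = [0.0] * len(proc_speed)   # time at which each processor is free
    assignment = []
    for work in task_work:
        # Completion time of this task on every processor.
        finish = [free_at[p] + work / proc_speed[p] for p in range(len(proc_speed))]
        best = min(range(len(proc_speed)), key=lambda p: (finish[p], p))
        free_at[best] = finish[best]
        assignment.append((best, finish[best]))
    return assignment, max(free_at)
```

For example, `schedule_ect([10, 10, 10, 5], [1.0, 2.0])` places the tasks on processors 1, 0, 1, 1 and yields a makespan of 12.5. The ECT-P variant would replace the known work values with predicted runtimes.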
14. Workload Modeling 101: What Matters
- Job arrival process, job service time
  - Self-similarity (burstiness) vs. Poisson [Leland Ott ToN94]
- Job grouping: bags-of-tasks are the dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar07]
- Job size: almost always 1 CPU [IosupDELW Grid06]
[Figure: No. Packets/Time Unit vs. Time Units, at Time Unit = 100 s and Time Unit = 0.01 s; annotation: longer queues]
15. A Bag-of-Tasks Workload Model
- Model
  - Users, Bags-of-Tasks, Tasks
  - Heavy-tailed distributions for inter-arrival time, job service time -> can model self-similar workloads
  - More details (e.g., parameter values): see article
- Validation data: the Grid Workloads Archive
  - 7 long-term grid traces
  - >5 million tasks
  - >2500 users
  - >40k CPUs
  - Domains: HEP, graphics, AI, math, biomed, climate, finance, aero
http://gwa.ewi.tudelft.nl/
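The structure of such a model (users, bags-of-tasks, tasks, heavy-tailed inter-arrival and service times) can be sketched in Python as below. The Pareto distribution is used here only as a convenient heavy-tailed example from the standard library; the distributions and parameter values of the actual model are given in the article, and every name and default value in this sketch (`generate_bots`, `alpha`, `scale`, `max_tasks`) is an assumption made for illustration.

```python
import random


def generate_bots(n_bots, users, seed=0, alpha=1.5, scale=10.0, max_tasks=50):
    """Generate a synthetic bag-of-tasks workload (illustrative only).

    Inter-arrival times and task runtimes are drawn from a Pareto
    distribution (heavy-tailed); all parameter values are placeholders.
    """
    rng = random.Random(seed)
    now = 0.0
    bots = []
    for bot_id in range(n_bots):
        # Heavy-tailed gap between consecutive BoT arrivals.
        now += scale * rng.paretovariate(alpha)
        size = rng.randint(1, max_tasks)  # number of tasks in this BoT
        runtimes = [scale * rng.paretovariate(alpha) for _ in range(size)]
        bots.append(
            {
                "bot_id": bot_id,
                "user": rng.choice(users),
                "arrival": now,
                "runtimes": runtimes,
            }
        )
    return bots
```

A call such as `generate_bots(100, ["u1", "u2", "u3"], seed=42)` returns 100 BoTs with increasing arrival times; fixing the seed makes the workload reproducible across simulation runs.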
17. Design Space Exploration 1/5: Overview
- Design space exploration: time to understand how our solutions fit into the complete system.
- Study the impact of
  - The Task Scheduling Policy (s policies)
  - The Workload Characteristics (P characteristics)
  - The Dynamic System Information (I levels)
  - The Task Selection Policy (S policies)
  - The Resource Management Architecture (A policies)
s x 7P x I x S x A x (environment) -> >2M design points
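The size of a full-factorial design space is the product of the number of levels per dimension, which is why it grows so quickly. The Python sketch below enumerates a deliberately small space; the dimension names follow the slides, but the level lists are an illustrative subset chosen for this example (the real counts behind the >2M figure are not given here), and the helper names are hypothetical.

```python
from itertools import product
from math import prod

# Illustrative levels only; not the full sets explored in the study.
design_space = {
    "task_scheduling_policy": ["ECT", "FPF", "DFPLT", "STFR", "WQR"],
    "task_selection_policy": [
        "S-T", "S-BoT", "S-U-Prio", "S-U-T", "S-U-BoT", "S-U-GRR", "S-U-RR",
    ],
    "information": ["K", "U", "H"],
    "architecture": ["centralized", "separated", "distributed"],
    "system_load": [0.20, 0.50, 0.95],
}


def count_points(space):
    """Number of design points: product of the level counts."""
    return prod(len(levels) for levels in space.values())


def iter_points(space):
    """Yield every design point as a dict mapping dimension to level."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))
```

Even this reduced space has 5 x 7 x 3 x 3 x 3 = 945 points, each of which would correspond to one or more simulation runs.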
18. Design Space Exploration 2/5: Experimental Setup
- Simulator
  - DGSim [IosupETFL SC07, IosupSE EuroPar08]
- System
  - DAS + Grid5000 [Cappello Bal CCGrid07]
  - >3,000 CPUs: relative perf. 1-1.75
- Metrics
  - Makespan
  - Normalized Schedule Length speed-up
- Workloads
  - Real: DAS + Grid5000
  - Realistic: system load 20-95% (from workload model)
19. Design Space Exploration 3/5: Selected Results A: Design Guidelines for Scheduling Policies
- Influence of the information type
  - (K,K): best balance between MS and NSL
  - (*,U), (U,*): surprisingly good (FPF) to surprisingly poor (WQR4x)
  - (*,H), (H,*): poor. Simple runtime predictors don't work (see article)
- Where to invest time?
  - K -> H, K -> U: adapt for information type with lowest variation
[Figure: results plot; labeled curves: WQR4x, FPF]
20. Design Space Exploration 4/5: Selected Results B: Task Selection, Only for Busy Systems
- Not much difference until system load is over 50%.
- For DAS + Grid5000: no change of task selection policy.
[Figure: results plot; labeled curves: S-BoT, S-T; annotation: same performance]
21. Design Space Exploration 5/5: Selected Results C: Resource Management Architecture
- Centralized, separated, or distributed?
  - Centralized is best. Note: job overhead not considered.
  - Distributed: good for system load below 50%; over 50% it does not finish all tasks.
23. Conclusion
- System Model: Resource Management Architecture, Task Selection Policy, Task Scheduling Policy
- Information availability framework
- BoT workload model
- Design space exploration: the performance of bags-of-tasks
Future Work
- Better predictors
- (H,H) task scheduling policies
24. Thank you! Questions? Remarks? Observations?
- Contact: A.Iosup_at_gmail.com (google "Iosup")
- Web sites
  - http://www.vl-e.nl : VL-e project
  - http://www.pds.ewi.tudelft.nl : PDS group articles & software
- Help building the Grid Workloads Archive: http://gwa.ewi.tudelft.nl
25. What About Other Workloads?
- (High Performance vs. High Throughput Computing) Parallel jobs vs. bags-of-tasks; Workflows
- We need your traces! We work blindly without them.
  - For parallel jobs, the architecture counts much more [IosETFL SC07]
  - For workflows, we don't know much about performance.