Title Page
Abstract
Contents
Ⅰ. Introduction
1.1. Contributions
1.2. Organization
Ⅱ. Background
2.1. Heterogeneous Embedded Systems and Inference
2.2. Heterogeneous Memory Systems
Ⅲ. Related Work
3.1. Model Slicing and Execution for Efficient Deep Learning Inference
3.2. Data Placement and Migration for High-Performance Deep Learning
3.3. Resource Management for QoS-Aware and Efficient Workload Consolidation
Ⅳ. Heterogeneity-, Communication-, and Constraint-Aware Model Slicing and Execution for Accurate and Efficient Inference
4.1. Introduction
4.2. Experimental Methodology
4.3. Need for Heterogeneity-, Communication-, and Constraint-Aware Inference
4.4. Design and Implementation
4.4.1. Inference Workload Profiler
4.4.2. Execution and Communication Cost Estimators
4.4.3. Model Slicer and Scheduler
4.4.4. Inference Workload Executor
4.5. Evaluation
4.5.1. Overview
4.5.2. Inference Latency
4.5.3. Inference Energy
4.5.4. Impact of the MOSAIC Components
4.5.5. Discussion
4.6. Summary
Ⅴ. Reinforcement Learning-Augmented System for Efficient Real-Time Inference on Heterogeneous Embedded Systems
5.1. Introduction
5.2. Background: Deep Q-Network
5.3. Design and Implementation
5.3.1. Profiler
5.3.2. Execution and Communication Cost Estimators
5.3.3. Model Slicing and Execution Planner
5.3.4. Runtime System
5.4. Experimental Methodology
5.5. Evaluation
5.5.1. Overview
5.5.2. Inference Latency and Energy Efficiency
5.5.3. Sensitivity
5.5.4. Generality
5.5.5. Energy-Delay Product Efficiency
5.5.6. Training Time
5.6. Summary
Ⅵ. Hotness- and Lifetime-Aware Data Placement and Migration for High-Performance Deep Learning on Heterogeneous Memory Systems
6.1. Introduction
6.2. Background
6.2.1. TensorFlow Machine-Learning System
6.2.2. NUMA-Aware Memory Policies
6.2.3. Heterogeneous Memory Systems
6.2.4. Terminology
6.3. Experimental Methodology
6.3.1. System Configuration
6.3.2. Deep-Learning Applications
6.4. Characterization of DL Applications
6.4.1. Execution Time Characteristics
6.4.2. Tensor Characteristics
6.5. Design and Implementation
6.5.1. Tensor Hotness Analyzer
6.5.2. Tensor Lifetime Analyzer
6.5.3. Tensor Combiner
6.5.4. Tensor Manager
6.5.5. Discussion
6.6. Evaluation
6.6.1. Performance and Energy Results
6.6.2. Performance Overheads
6.6.3. Performance Sensitivity
6.6.4. Impact of the Optimization Techniques
6.7. Summary
Ⅶ. Coordinated Management of Cores, Memory, and Compressed Memory Swap for QoS-Aware and Efficient Workload Consolidation for Memory-Intensive Applications
7.1. Introduction
7.2. Background
7.2.1. Memory Reclaim and CMS
7.2.2. Workload Consolidation
7.3. Experimental Methodology
7.3.1. System Configuration
7.3.2. Benchmarks
7.4. Characterization
7.5. Design and Implementation
7.5.1. Profiler
7.5.2. System State Space Explorer
7.5.3. Resource Allocator
7.6. Evaluation
7.6.1. QoS and Throughput
7.6.2. Sensitivity
7.6.3. Explored System States
7.6.4. Dynamic Resource Management
7.7. Summary
Ⅷ. Conclusion
References
List of Figures
Figure 1. Hardware and software stacks for deep-learning inference on heterogeneous embedded systems
Figure 2. Evaluated heterogeneous embedded system and power monitor
Figure 3. Performance heterogeneity of inference workloads
Figure 4. Energy heterogeneity of inference workloads
Figure 5. Communication overheads
Figure 6. Overall architecture of MOSAIC
Figure 7. Communication time with various tensor sizes
Figure 8. Inference latency
Figure 9. Inference energy
Figure 10. Latency impact of the MOSAIC components
Figure 11. Energy impact of the MOSAIC components
Figure 12. Inference latency with smaller models
Figure 13. Inference energy with smaller models
Figure 14. Latency estimation accuracy
Figure 15. Energy estimation accuracy
Figure 16. Overheads for performance optimization
Figure 17. Overheads for energy optimization
Figure 18. Overall architecture of HERTI
Figure 19. DQN architecture of MSEP
Figure 20. Inference latency
Figure 21. Inference energy
Figure 22. Sensitivity to the inference deadline
Figure 23. Sensitivity to the system heterogeneity
Figure 24. Generality of HERTI
Figure 25. Energy-delay product
Figure 26. Training time comparison
Figure 27. Networks with linear and non-linear connections
Figure 28. Per-operation execution time of VGG
Figure 29. Execution time breakdowns
Figure 30. Per-operation execution time of GN
Figure 31. Tensor characteristics of VGG
Figure 32. Tensor characteristics of GN
Figure 33. Overall architecture of HALO
Figure 34. Overall performance results
Figure 35. Execution breakdowns with HALO and various memory management policies
Figure 36. Memory traffic
Figure 37. Energy consumption breakdowns
Figure 38. Performance overheads of HALO
Figure 39. Sensitivity to the application working-set size
Figure 40. Impact of the optimization techniques
Figure 41. Impact of cores, memory, and CMS allocated to the LC container with low load and low MOR
Figure 42. Impact of cores, memory, and CMS allocated to the LC container with low load and high MOR
Figure 43. Impact of cores, memory, and CMS allocated to the LC container with high load and low MOR
Figure 44. Impact of cores, memory, and CMS allocated to the LC container with high load and high MOR
Figure 45. Overall architecture of COSMOS
Figure 46. Execution flow of the system state space explorer
Figure 47. Quality of service
Figure 48. Effective machine utilization
Figure 49. Sensitivity to the memory overcommit ratio
Figure 50. Sensitivity to the load for the LC container
Figure 51. Sensitivity to the load and memory overcommit ratio
Figure 52. Number of explored system states
Figure 53. Effectiveness of dynamic resource management
A wide range of applications have become data-intensive as they operate on the massive amounts of data generated by social network services, multimedia devices, and Internet of Things sensors. These data-intensive applications typically require enormous computational and memory resources to extract useful information from the data they process. To accommodate these demands, the hardware resources in computing systems are becoming highly heterogeneous. Specifically, numerous hardware accelerators, such as tensor processing units (TPUs) and neural processing units (NPUs), have been developed to address the ever-increasing computing demands of deep-learning applications. In addition, new memory devices, such as high-bandwidth memory (HBM) and non-volatile memory (NVM), have been developed to meet the growing demand for higher memory performance, capacity, and cost-efficiency.
Heterogeneous computing and memory have great potential to significantly improve the performance and efficiency of data-intensive applications. Taking full advantage of them, however, poses significant challenges to system software: it is the underlying system software that must manage the heterogeneous computing and memory resources effectively so as to maximize the metric of interest, such as performance or energy efficiency. This dissertation presents heterogeneity-aware resource management techniques that significantly improve the performance and efficiency of data-intensive applications by effectively exploiting heterogeneous computing and memory resources.
First, we investigate system software techniques that effectively schedule computations on heterogeneous computing devices for efficient deep-learning inference. To this end, we propose MOSAIC, a software-based system for heterogeneity-, communication-, and constraint-aware model slicing and execution for accurate and efficient inference on heterogeneous embedded systems. MOSAIC employs accurate models for estimating the execution and communication costs of the target inference workload and, using a dynamic-programming-based algorithm, generates an efficient model slicing and execution plan for that workload.
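To make the dynamic-programming formulation concrete, the sketch below shows one way such a planner could be structured. It is a minimal illustration under assumed interfaces, not MOSAIC's actual implementation: the function names, the per-layer placement granularity, and the cost tables standing in for the execution and communication estimators are all assumptions.

```python
# Hypothetical sketch of DP-based model slicing across heterogeneous
# devices; exec_cost and comm_cost stand in for MOSAIC's estimators.

def plan_slicing(n_layers, n_devices, exec_cost, comm_cost):
    """exec_cost[l][d]: estimated time of layer l on device d.
    comm_cost[l][s][d]: estimated time to move layer l's output
    tensor from device s to device d (0 when s == d)."""
    INF = float("inf")
    # best[l][d]: minimal cost of layers 0..l with layer l placed on d
    best = [[INF] * n_devices for _ in range(n_layers)]
    prev = [[0] * n_devices for _ in range(n_layers)]
    for d in range(n_devices):
        best[0][d] = exec_cost[0][d]
    for l in range(1, n_layers):
        for d in range(n_devices):
            for s in range(n_devices):
                c = best[l - 1][s] + comm_cost[l - 1][s][d] + exec_cost[l][d]
                if c < best[l][d]:
                    best[l][d], prev[l][d] = c, s
    # Recover the device assignment by backtracking the DP table
    d = min(range(n_devices), key=lambda x: best[n_layers - 1][x])
    plan = [d]
    for l in range(n_layers - 1, 0, -1):
        d = prev[l][d]
        plan.append(d)
    plan.reverse()
    return plan, best[n_layers - 1][plan[-1]]
```

The O(layers x devices^2) recurrence mirrors the trade-off described above: each layer's placement weighs its execution cost on a device against the cost of communicating the preceding output tensor between devices.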
Second, we propose HERTI, a reinforcement learning-augmented system for efficient real-time inference on heterogeneous embedded systems. Through reinforcement learning, HERTI efficiently explores the state space and robustly finds a state that significantly improves the efficiency of the target inference workload while satisfying its deadline constraint. In addition, HERTI significantly accelerates training by drawing on accurate and lightweight cost estimators.
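The sketch below illustrates this idea with a tabular Q-learning stand-in for HERTI's deep Q-network; the state and action encodings, the reward shaping, and the estimator hooks (est_latency, est_energy) are assumptions made for illustration. Because the reward comes from cost estimators rather than on-device measurements, every training step is cheap, which is the intuition behind the training speedup.

```python
# Tabular Q-learning stand-in for a DQN-based planner. Hypothetical
# estimator callbacks supply the reward, so training never has to run
# the workload on real hardware.
import random

def train_planner(states, actions, step, est_latency, est_energy,
                  deadline, episodes=500, alpha=0.1, gamma=0.9, eps=0.2):
    """states/actions: lists of hashable configurations and knob moves.
    step(s, a) must return a state from the enumerated state space."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(50):  # bounded episode length
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda x: Q[(s, x)]))
            s2 = step(s, a)
            # Reward favors low estimated energy and penalizes states
            # whose estimated latency would miss the deadline
            r = -est_energy(s2) - (100.0 if est_latency(s2) > deadline
                                   else 0.0)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)]
                                                  for x in actions)
                                  - Q[(s, a)])
            s = s2
    return Q
```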
Third, we investigate a system software technique that effectively manages heterogeneous memory for high-performance deep learning. We analyze the characteristics of representative deep-learning workloads on a real heterogeneous memory system. Guided by the characterization results, we propose HALO, hotness- and lifetime-aware data placement and migration for high-performance deep learning on heterogeneous memory systems. HALO extracts hotness and lifetime information for the tensors of the target deep-learning application from its dataflow graph, and then dynamically places and migrates the tensors across heterogeneous memory nodes based on these characteristics.
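A minimal sketch of how hotness and lifetime might be derived from an execution-ordered dataflow graph and used for placement is shown below; the graph encoding, the size table, and the allocation hooks are hypothetical, and HALO's actual analyzers, tensor combiner, and migration machinery go well beyond this.

```python
# Hypothetical sketch: derive per-tensor hotness (consumer count) and
# lifetime (first/last use step) from an execution-ordered dataflow
# graph, then greedily place the hottest tensors on the fast node.

def analyze_tensors(graph):
    """graph: list of ops in execution order, each op given as
    (op_name, input_tensor_ids, output_tensor_ids)."""
    info = {}
    for t_step, (_, inputs, outputs) in enumerate(graph):
        for t in outputs:  # a tensor is born when its producer runs
            info[t] = {"first": t_step, "last": t_step, "hotness": 0}
        for t in inputs:   # each consumer extends lifetime, adds hotness
            # Tensors with no producer (weights, graph inputs) are
            # treated as live from step 0
            rec = info.setdefault(t, {"first": 0, "last": t_step,
                                      "hotness": 0})
            rec["last"] = t_step
            rec["hotness"] += 1
    return info

def place_tensors(info, sizes, fast_budget, alloc_fast, alloc_slow):
    """Put the hottest tensors on the fast memory node until its budget
    is exhausted; the rest go to the capacity node. Tensors past their
    last-use step could then be migrated out to reclaim fast memory."""
    used = 0
    for t in sorted(info, key=lambda t: info[t]["hotness"], reverse=True):
        if used + sizes[t] <= fast_budget:
            alloc_fast(t)
            used += sizes[t]
        else:
            alloc_slow(t)
```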
Finally, we investigate a system software technique for QoS-aware and efficient workload consolidation on heterogeneous memory systems based on software-defined far memory. We conduct an in-depth characterization of the impact of cores, memory, and compressed memory swap (CMS) on the QoS and throughput of consolidated latency-critical (LC) and batch applications. Guided by the characterization results, we propose COSMOS, a software-based runtime system that coordinates the management of cores, memory, and CMS for QoS-aware and efficient consolidation of memory-intensive workloads. COSMOS dynamically collects runtime data from the consolidated applications and the underlying system, and allocates resources to the consolidated applications in a way that achieves high throughput with strong QoS guarantees.
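As a rough illustration of the coordination loop, the sketch below adjusts the knobs from observed tail latency; every hook (tail_latency, set_cores, set_cms) is a hypothetical placeholder for cgroup and compressed-swap interfaces, and unlike this simple stepwise controller, COSMOS chooses allocations by searching a profiled system-state space.

```python
# Hypothetical feedback loop coordinating cores and CMS between a
# latency-critical (LC) container and a batch container; the memory
# limit would be adjusted analogously.
import time

CMS_STEP = 64 << 20  # arbitrary 64 MiB step for the CMS knob

def control_loop(lc, batch, qos_target, period=1.0):
    """lc/batch: handles exposing hypothetical monitoring and
    actuation hooks for the two consolidated containers."""
    while True:
        lat = lc.tail_latency()  # observed LC tail latency this period
        if lat > qos_target:
            # QoS at risk: shift a core from batch to LC and push more
            # of batch's cold pages into the compressed memory swap
            if batch.cores > 1:
                batch.set_cores(batch.cores - 1)
                lc.set_cores(lc.cores + 1)
            batch.set_cms(batch.cms + CMS_STEP)
        elif lat < 0.8 * qos_target:
            # Ample slack: return resources to batch for throughput
            if lc.cores > 1:
                lc.set_cores(lc.cores - 1)
                batch.set_cores(batch.cores + 1)
            batch.set_cms(max(0, batch.cms - CMS_STEP))
        time.sleep(period)
```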