Title Page
Contents
Abstract 10
Chapter 1. Introduction 12
1.1. GPU Computing System 12
1.2. Preemptive Scheduling for GPU 15
1.3. GPU Resource Sharing for Multi-Tasking 18
1.4. Organization 21
Chapter 2. Background and Related Work 22
2.1. OpenCL Runtime Model on HSA 22
2.2. GPU Kernel Idempotence 24
2.3. GPU Preemptive Scheduling 27
2.4. GPU Resource Sharing for Deep Learning 29
Chapter 3. A GPU Kernel Transactionization Scheme 36
3.1. Overview 36
3.2. Kernel Snapshotting for Transactionization 39
3.3. Transactionization Process 42
3.4. Evaluation 45
3.4.1. Evaluation Environment 45
3.4.2. Turnaround Time 47
3.4.3. Scheduling Delay 53
3.4.4. Overhead 56
3.5. Summary 59
Chapter 4. Idempotent-Based Preemptive GPU Kernel Scheduling 60
4.1. Overview 60
4.2. Idempotent Kernel Classification 62
4.3. GPU Kernel Transactionization with Idempotence 68
4.4. Priority-based GPU Preemptive Scheduling 69
4.5. Evaluation 74
4.5.1. Evaluation Environment 74
4.5.2. Turnaround Time 77
4.5.3. Scheduling Delay 84
4.5.4. Overhead 87
4.6. Summary 91
Chapter 5. Distributed Training with MIG 92
5.1. Overview 92
5.2. MIG Context Manager 94
5.3. Communication for Distributed Training with MIG 97
5.4. Evaluation 101
5.4.1. Evaluation Setup 101
5.4.2. Performance of Single GPU and Single Training Workload 105
5.4.3. MIG DDP Performance 109
5.5. Summary 116
Chapter 6. Conclusion 118
References 120
Abstract (in Korean) 130
Table 1. System configuration for evaluation. 46
Table 2. Workload characteristics of Rodinia benchmark suite. 47
Table 3. Workload description and idempotent kernel information. 75
Table 4. Workload performance characteristics. 76
Table 5. System configuration for evaluating distributed training using MIG. 101
Table 6. CNN-based models with diverse features. 102
Fig. 1. OpenCL runtime model and memory management in HSA systems. 23
Fig. 2. NVIDIA Multi-Instance GPU (MIG) architecture. 33
Fig. 3. An example of preemptive scheduling through snapshot-based GPU kernel transactionization. 37
Fig. 4. Snapshotting steps for each kernel execution stage. 38
Fig. 5. Snapshot target constitution and a data structure for recording snapshot targets. 41
Fig. 6. An independent transaction thread handles snapshotting and launching of a kernel instance. 43
Fig. 7. Normalized turnaround time of benchmark programs while varying the number of disturbing background tasks. [N] denotes the results obtained from... 49
Fig. 8. Normalized turnaround time of benchmark programs while varying the execution time of a background disturbing kernel instance. 51
Fig. 9. Normalized turnaround time of benchmarks when three instances of the same benchmark with different priorities run at the same time. Each instance... 52
Fig. 10. Distribution of eviction delay caused by abort command for preemption. 54
Fig. 11. Minimum, maximum, and average pre-preemption delay and kernel launch delay observed during execution of benchmark programs. 55
Fig. 12. Distribution of scheduling delay of benchmark programs executed on unmodified Linux kernel. 56
Fig. 13. Performance degradation of benchmark applications when each was independently run under proposed schemes. 57
Fig. 14. Snapshot creation time per MB of GPU buffer data for different workloads. 58
Fig. 15. An example of the idempotent-based preemptive scheduling method. 61
Fig. 16. A clobber anti-dependency pattern is a necessary and sufficient condition for a non-idempotent code block. 63
Fig. 17. Four representative cases in which complex anti-dependency patterns appear. 64
Fig. 18. Example read/write access patterns represented in ASTs produced by ANTLR 4. 67
Fig. 19. Idempotent kernel classification and snapshotting steps for each kernel execution stage. 68
Fig. 20. GPU kernel scheduling structure for target device. 70
Fig. 21. Flowchart of pre-preemption routine. 72
Fig. 22. Kernel scheduling states and pre-/post-preemption procedures. 73
Fig. 23. Normalized turnaround times of benchmark programs while varying the number of disturbing background tasks. [N] denotes the results obtained using... 77
Fig. 24. Normalized turnaround times of benchmark programs while varying the execution time of a background disturbing kernel instance. 80
Fig. 25. Normalized turnaround times of benchmarks when three instances of the same benchmark with different priorities were run at the same time. Each... 81
Fig. 26. Normalized turnaround time of a low-priority kernel while varying its kernel length and the disturbance interval of the high-priority kernel. 82
Fig. 27. Distribution of preemption delay depending on preemption scheme. 84
Fig. 28. Distribution of the scheduling delay of benchmark programs executed on the unmodified Linux kernel. 85
Fig. 29. Minimum, maximum, and average pre-preemption and kernel launch delays observed during the execution of benchmark programs. 86
Fig. 30. Performance degradation of benchmark applications when independently run under the proposed schemes. 88
Fig. 31. Snapshot creation time per MB of GPU buffer data for different workloads. 89
Fig. 32. Communication mechanism for distributed data parallel on MIG. 98
Fig. 33. Chunk division and reduction for shared gradients. 99
Fig. 34. Training scenario utilizing MIG for evaluation. 104
Fig. 35. Performance of concurrent training through multiple MIG instances in a single-GPU environment. 106
Fig. 36. Performance of concurrent training with batch size splitting for a single training model across MIG instances without communication. 108
Fig. 37. Performance of distributed training with batch size splitting for a single training workload in a single-GPU environment using MIG DDP. 110
Fig. 38. The SM usage pattern of kernels constituting MobileNet Small. 111
Fig. 39. Performance of distributed training with batch size splitting for a single training workload in a multi-GPU environment using MIG DDP. 113
Fig. 40. Performance comparison of distributed training with two GPUs versus distributed training using MIG DDP on a single GPU. 114
Fig. 41. Concurrent distributed training performance using MIG DDP compared to sequential distributed training with two users and two GPUs. 115