Title Page
Contents
Abstract 10
Chapter 1. Introduction 12
1.1. GPU Computing System 12
1.2. Preemptive Scheduling for GPU 15
1.3. GPU Resource Sharing for Multi-Tasking 18
1.4. Organization 21
Chapter 2. Background and Related Work 22
2.1. OpenCL Runtime Model on HSA 22
2.2. GPU Kernel Idempotence 24
2.3. GPU Preemptive Scheduling 27
2.4. GPU Resource Sharing for Deep Learning 29
Chapter 3. A GPU Kernel Transactionization Scheme 36
3.1. Overview 36
3.2. Kernel Snapshotting for Transactionization 39
3.3. Transactionization Process 42
3.4. Evaluation 45
3.4.1. Evaluation Environment 45
3.4.2. Turnaround Time 47
3.4.3. Scheduling Delay 53
3.4.4. Overhead 56
3.5. Summary 59
Chapter 4. Idempotent-Based Preemptive GPU Kernel Scheduling 60
4.1. Overview 60
4.2. Idempotent Kernel Classification 62
4.3. GPU Kernel Transactionization with Idempotence 68
4.4. Priority-based GPU Preemptive Scheduling 69
4.5. Evaluation 74
4.5.1. Evaluation Environment 74
4.5.2. Turnaround Time 77
4.5.3. Scheduling Delay 84
4.5.4. Overhead 87
4.6. Summary 91
Chapter 5. Distributed Training with MIG 92
5.1. Overview 92
5.2. MIG Context Manager 94
5.3. Communication for Distributed Training with MIG 97
5.4. Evaluation 101
5.4.1. Evaluation Setup 101
5.4.2. Performance of Single GPU and Single Training Workload 105
5.4.3. MIG DDP Performance 109
5.5. Summary 116
Chapter 6. Conclusion 118
References 120
Abstract (in Korean) 130
Table 1. System configuration for evaluation. 46
Table 2. Workload characteristics of Rodinia benchmark suite. 47
Table 3. Workload description and idempotent kernel information. 75
Table 4. Workload performance characteristics. 76
Table 5. System configuration for evaluating distributed training using MIG. 101
Table 6. CNN-based models with diverse features. 102
Fig. 1. OpenCL runtime model and memory management in HSA systems. 23
Fig. 2. NVIDIA Multi-Instance GPU (MIG) architecture. 33
Fig. 3. An example of preemptive scheduling through snapshot-based GPU kernel transactionization. 37
Fig. 4. Snapshotting steps for each kernel execution stage. 38
Fig. 5. Snapshot target constitution and a data structure for recording snapshot targets. 41
Fig. 6. An independent transaction thread handles snapshotting and launching of a kernel instance. 43
Fig. 7. Normalized turnaround time of benchmark programs while varying the number of disturbing background tasks. [N] denotes the results obtained from... 49
Fig. 8. Normalized turnaround time of benchmark programs while varying the execution time of a background disturbing kernel instance. 51
Fig. 9. Normalized turnaround time of benchmarks when three instances of the same benchmark with different priorities run at the same time. Each instance... 52
Fig. 10. Distribution of eviction delay caused by abort command for preemption. 54
Fig. 11. Minimum, maximum, and average pre-preemption delay and kernel launch delay observed during execution of benchmark programs. 55
Fig. 12. Distribution of scheduling delay of benchmark programs executed on unmodified Linux kernel. 56
Fig. 13. Performance degradation of benchmark applications when each was independently run under proposed schemes. 57
Fig. 14. Snapshot creation time per MB of GPU buffer data for different workloads. 58
Fig. 15. An example of the idempotent-based preemptive scheduling method. 61
Fig. 16. A clobber anti-dependency pattern is a necessary and sufficient condition for a non-idempotent code block. 63
Fig. 17. Four representative cases in which complex anti-dependency patterns appear. 64
Fig. 18. Example read/write access patterns represented in ASTs produced by ANTLR 4. 67
Fig. 19. Idempotent kernel classification and snapshotting steps for each kernel execution stage. 68
Fig. 20. GPU kernel scheduling structure for target device. 70
Fig. 21. Flowchart of pre-preemption routine. 72
Fig. 22. Kernel scheduling states and pre-/post-preemption procedures. 73
Fig. 23. Normalized turnaround times of benchmark programs while varying the number of disturbing background tasks. [N] denotes the results obtained using... 77
Fig. 24. Normalized turnaround times of benchmark programs while varying the execution time of a background disturbing kernel instance. 80
Fig. 25. Normalized turnaround times of benchmarks when three instances of the same benchmark with different priorities were run at the same time. Each... 81
Fig. 26. Normalized turnaround time of a low-priority kernel while varying its kernel length and the disturbance interval of the high-priority kernel. 82
Fig. 27. Distribution of preemption delay depending on preemption scheme. 84
Fig. 28. Distribution of the scheduling delay of benchmark programs executed on the unmodified Linux kernel. 85
Fig. 29. Minimum, maximum, and average pre-preemption and kernel launch delays observed during the execution of benchmark programs. 86
Fig. 30. Performance degradation of benchmark applications when independently run under the proposed schemes. 88
Fig. 31. Snapshot creation time per MB of GPU buffer data for different workloads. 89
Fig. 32. Communication mechanism for distributed data parallel on MIG. 98
Fig. 33. Chunk division and reduction for shared gradients. 99
Fig. 34. Training scenario utilizing MIG for evaluation. 104
Fig. 35. Performance of concurrent training through multiple MIG instances in a single-GPU environment. 106
Fig. 36. Performance of concurrent training with batch size splitting for a single training model across MIG instances without communication. 108
Fig. 37. Performance of distributed training with batch size splitting for a single training workload in a single-GPU environment using MIG DDP. 110
Fig. 38. The SM usage pattern of kernels constituting MobileNet Small. 111
Fig. 39. Performance of distributed training with batch size splitting for a single training workload in a multi-GPU environment using MIG DDP. 113
Fig. 40. Performance comparison of distributed training with two GPUs versus distributed training using MIG DDP on a single GPU. 114
Fig. 41. Concurrent distributed training performance using MIG DDP compared to sequential distributed training with two users and two GPUs. 115