Title Page
Contents
ABSTRACT 10
Ⅰ. Introduction 11
Ⅱ. Background and Related Work 14
2.1. Convolutional Neural Network 14
2.2. Compression of Neural Networks 16
2.3. Object Detection with YOLO 17
2.4. Data Quantization 21
2.5. CNN Acceleration with FPGA 23
2.5.1. Research on Accelerator Design Method 23
2.5.2. Parallel computing method 24
Ⅲ. FPGA Accelerator Design 27
3.1. Model Analysis and Task Division 27
3.2. Design and Optimization of Hardware Accelerator 28
3.2.1. Accelerator High-Level Architecture 28
3.2.2. Dynamically configurable line buffering 30
3.2.3. Tiling strategy for large-size feature maps 32
3.2.4. Design of the convolution module 35
3.2.5. Design of the sampling module 37
3.2.6. Ping-pong buffer 38
3.3. Design in CPU 41
3.3.1. Construction of YOLO network 41
3.3.2. Design of the decoding part 42
Ⅳ. Analysis of results 45
4.1. Development environment 45
4.2. Resource consumption of the accelerator 47
4.3. Performance analysis of the accelerator 48
Ⅴ. Conclusion 54
References 55
Abstract in Korean 58
〈Table 2-1〉 YOLOv7-Tiny layers and their input and output sizes 20
〈Table 4-1〉 Development environment 45
〈Table 4-2〉 Resource consumption of the SoC-YOLO accelerator 47
〈Table 4-3〉 Comparison of two line buffering strategies 48
〈Table 4-4〉 Comparison with methods from the literature 49
〈Table 4-5〉 Performance comparison on different platforms 51
〈Figure 2-1〉 Architecture of a typical CNN 14
〈Figure 2-2〉 Convolutional layer calculation process 15
〈Figure 2-3〉 Architecture of YOLOv7-Tiny network 18
〈Figure 2-4〉 Architecture of Multi-Branch Concat Module 19
〈Figure 2-5〉 Linear Post-Training Static Quantization 22
〈Figure 2-6〉 Principle of Unroll Loop 25
〈Figure 2-7〉 C code description of the convolution 25
〈Figure 3-1〉 Flow chart from CNN model to FPGA acceleration 27
〈Figure 3-2〉 Execution schedule in the accelerator 29
〈Figure 3-3〉 Architecture of the accelerator 30
〈Figure 3-4〉 Basic line buffer structure diagram 31
〈Figure 3-5〉 Dynamically configurable line buffering mechanism 32
〈Figure 3-6〉 Two tiling strategies for large-size feature maps 33
〈Figure 3-7〉 Calculation ways for two different strategies 34
〈Figure 3-8〉 An example of the tiling strategy 34
〈Figure 3-9〉 C++ code for convolution operation 35
〈Figure 3-10〉 Unrolling of KERNEL_SIZE in two dimensions 36
〈Figure 3-11〉 Hardware structure of maxpooling 38
〈Figure 3-12〉 Structure of Ping-pong buffer 39
〈Figure 3-13〉 Data processing flow of Ping-pong buffer 39
〈Figure 3-14〉 C++ code of Ping-pong buffer module 41
〈Figure 3-15〉 Convolution module design on the CPU side 42
〈Figure 3-16〉 Decoding operation for anchor 43
〈Figure 3-17〉 Effect of NMS algorithm 44
〈Figure 4-1〉 Block design of the accelerator 46
〈Figure 4-2〉 RTL schematic of the accelerator 46
〈Figure 4-3〉 Output information in SDK terminal 52
〈Figure 4-4〉 Sample output image of our hardware accelerator 52
〈Figure 4-5〉 The performance comparison in complex scenes 53