Title Page
Contents
List of Abbreviations 13
ABSTRACT 14
CHAPTER I. Introduction 17
1.1. Motivation 17
1.2. Background 24
1.2.1. Transformer architecture in object detection 24
1.2.2. Camera-based 3D object detection 27
1.3. Contributions 29
1.4. Outline 31
CHAPTER II. Related Work 32
2.1. Sparse Query-based Methods in Multi-view 3D Object Detection 32
2.2. Query Denoising Strategy 34
2.3. Cross Attention in Object Detection 37
CHAPTER III. Proposed Method 40
3.1. Overall Architecture 40
3.2. 4D Query Denoising 47
3.3. Distance-wise Feature Sampling 50
3.4. Query Initialization with 2D Guidance 53
CHAPTER IV. Experimental Results 57
4.1. Dataset and Metrics 57
4.2. Implementation Details 59
4.3. Performance Analysis 60
4.4. Ablation Studies 63
4.4.1. Ablation study on different denoising strategies 63
4.4.2. Ablation study on different settings of noise scale 65
4.4.3. Ablation study on different weight designs of the feature sampling strategy 66
4.4.4. Ablation study on the different number of depth points 67
4.5. Visualization 69
CHAPTER V. Conclusion 73
References 75
ABSTRACT IN KOREAN 85
List of Tables
TABLE 4.1. The performance comparison on the nuScenes validation set. 61
TABLE 4.2. The performance comparison on the nuScenes test set. 62
TABLE 4.3. Ablation study on the effectiveness of the negative samples. 63
TABLE 4.4. Ablation study on the different number of noised queries. 63
TABLE 4.5. Ablation study on different settings of the noise scale α_θ. 65
TABLE 4.6. Ablation study on different settings of the noise scale α_vel. 66
TABLE 4.7. The performance comparison based on the feature sampling weight design. 67
TABLE 4.8. The performance comparison based on the different number of depth points. 68
List of Figures
Fig. 1.1. An example of the nuScenes dataset. 19
Fig. 1.2. The prediction parameters in 2D and 3D object detection. 20
Fig. 1.3. The typical pipeline of a DETR-like detector. 25
Fig. 1.4. An outline of Lift-Splat-Shoot. 28
Fig. 2.1. An illustration of the denoising training strategy in the decoder. 35
Fig. 2.2. An illustration of CDN from DINO. 36
Fig. 2.3. An example of results from DINO. 37
Fig. 3.1. Overall architecture of the proposed scheme. 41
Fig. 3.2. Illustration of the deformable attention module in the proposed network. 44
Fig. 3.3. An illustration of generating the noised reference points. 48
Fig. 3.4. The dimensions of noised parameters: (a) center and scale, (b) yaw angle, and (c) velocity. λ₁ and λ₂ represent the boundaries for posi-... 49
Fig. 3.5. An illustration of how boxes of the same size appear differently in the 2D image view depending on distance. 50
Fig. 3.6. An example of the nuScenes dataset, where most objects are on the ground plane. 53
Fig. 3.7. The architecture of the 3D reference point generator. 54
Fig. 3.8. An illustration of depth located on the ray. 55
Fig. 4.1. An example of the nuScenes dataset. 58
Fig. 4.2. Convergence curves of the proposed scheme and StreamPETR [3] with the ResNet-50 backbone. 64
Fig. 4.3. Visualization of the prediction results of a crowded scene by the proposed method. 70
Fig. 4.4. Visualization of the prediction results of a night scene by the proposed method. 71
Fig. 4.5. Visualization of the prediction results of a cloudy scene by the proposed method. 72