Title Page
Contents
Abstract 12
Chapter 1. INTRODUCTION 14
1. Overview 14
2. Contributions and Thesis Organization 15
Chapter 2. RELATED WORK 19
1. 3D Point Cloud Processing 19
2. 3D Hand Pose Estimation 21
3. 3D Scene Flow Estimation 24
Chapter 3. Accurate and Efficient 3D Hand Pose Estimation - HandFoldingNet 27
1. Introduction 27
2. HandFoldingNet 32
2.1. Point Cloud Preprocessing 33
2.2. Hierarchical PointNet Encoder 34
2.3. Global-Feature Guided Folding Decoder 35
2.4. Joint-Wise Local-Feature Guided Folding Block 36
2.5. Loss Function 40
3. Experiments 41
3.1. Experiment Settings 41
3.2. Datasets and Evaluation Metrics 42
3.3. Comparison with State-of-the-Art Methods 44
3.4. Ablation Study 46
3.5. Runtime and Model Size 49
4. Conclusion 50
Chapter 4. Accurate and Efficient 3D Hand Pose Estimation - HandR2N2 52
1. Introduction 52
2. Iterative 3D Hand Pose Estimation 56
2.1. Residual Recurrent Unit 58
2.2. Initialization of the Hidden States and Joint Coordinates 62
2.3. Training Loss Function 62
3. Experiments 63
3.1. Experiment Settings 63
3.2. Datasets and Evaluation Metrics 64
3.3. Comparison with State-of-the-Art Methods 65
3.4. Ablation Study 69
4. Conclusion 73
Chapter 5. Accurate and Efficient 3D Scene Flow Estimation - Bi-PointFlowNet 74
1. Introduction 74
2. Problem Definition 77
3. Bi-PointFlowNet 78
3.1. Hierarchical Feature Extraction 79
3.2. Bidirectional Flow Embedding 80
3.3. Decomposed Form of Bidirectional Flow Embedding 81
3.4. Upsampling and Warping 83
3.5. Scene Flow Prediction 84
3.6. Loss Function 85
4. Experiments 85
4.1. Experimental Settings 85
4.2. Evaluation Metrics 86
4.3. Training and Evaluation on FlyingThings3D 87
4.4. Generalization on KITTI 89
4.5. Ablation Study 91
4.6. Runtime 94
5. Conclusion 95
Chapter 6. Accurate and Efficient 3D Scene Flow Estimation - MSBRN 96
1. Introduction 96
2. MSBRN Architecture 100
2.1. Bidirectional Gated Recurrent Unit 101
2.2. Hybrid Correlation 104
2.3. Coarse-to-Fine Architecture 105
2.4. Loss Function 107
3. Experiments 107
3.1. Experimental Settings 107
3.2. Evaluation Metrics 108
3.3. Training and Evaluation on FlyingThings3D 109
3.4. Generalization on KITTI 111
3.5. Ablation Study 113
3.6. Runtime 116
4. Conclusion 117
Chapter 7. Joint Segmentation and Prediction based on Scene Flow 118
1. Introduction 118
2. Problem Definition 121
3. Joint Segmentation and Prediction Network 122
3.1. Architecture 122
3.2. RNN Unit 123
3.3. Down-sampling and Up-sampling 126
3.4. Training 127
4. Experimental Results 130
4.1. Experiment Settings 130
4.2. Evaluation Metrics 131
4.3. Evaluation on SemanticKITTI 133
4.4. Evaluation on Argoverse 138
4.5. Ablation Study 139
5. Conclusion 142
Chapter 8. Conclusion 144
References 146
List of Tables
Table 1. Implementation specifications. Each block contains four types of hyperparameters: search radius (r), the number of grouping neighbors (S),... 42
Table 2. Comparison of the proposed method with previous state-of-the-art methods on the ICVL, MSRA and NYU datasets. Mean error indicates the mean... 44
Table 3. Comparison of different numbers of local folding blocks used in the model. # Local fold indicates the number of local folding blocks attached after... 47
Table 4. Comparison of different settings between the local feature and spatial dependency. 48
Table 5. Comparison of different set abstraction levels for local features. 49
Table 6. Comparison of the model size and inference time for the methods using the 3D input. Speed stands for the frame rate (fps) on a single GPU. Time stands... 50
Table 7. Comparison of the proposed method with previous state-of-the-art methods on the ICVL, MSRA and NYU datasets. Input indicates the input type... 67
Table 8. Effect of the sampling size K. For each line, the inference Iin is the same, but the number K of resampled points per iteration increases. All models are trained... 72
Table 9. Comparison with conventional components. All models are trained and tested with Itr = Iin = 3. 73
Table 10. Comparison of the proposed method with previous state-of-the-art methods on the non-occluded FT3Ds and KITTIs datasets. All methods are... 87
Table 11. Comparison of the proposed method with previous state-of-the-art methods on the occluded FT3Do and KITTIo datasets. All methods are... 89
Table 12. Ablation of the bidirectional flow embedding layer. BFP indicates whether the BFP module is used. All methods are trained only on the... 92
Table 13. Ablation of the decomposed form of the bidirectional flow embedding layer. Decomp. indicates whether the decomposed form of BFE is used. GFLOPs... 93
Table 14. Comparison of the bidirectional feature propagation on PointPWC and FlowNet3D. Although the selected baselines showed strong performance, our... 93
Table 15. Runtime comparison on a single TITAN RTX GPU. 94
Table 16. Comparison of the proposed method with previous state-of-the-art methods on the non-occluded FT3Ds and KITTIs datasets. All methods are... 110
Table 17. Comparison of the proposed method with previous state-of-the-art methods on the occluded FT3Do and KITTIo datasets. All methods are trained... 112
Table 18. Comparison on the EPE2D metric between different training/inference iteration configurations. The model is first trained under a specific Ktr on the... 113
Table 19. Comparison between different numbers of the neighbor points for the correlation extraction. 'Dist' and 'feat' indicate the Euclidean distance-based... 114
Table 20. Ablations of different components. IBFA, HC and GRU represent whether the iterative bidirectional feature augmentation, hybrid correlation and... 116
Table 21. Runtime comparison of iterative methods. The results are evaluated on a single TITAN RTX GPU. PV-RAFT, FlowStep3D and RCP are evaluated... 116
Table 22. Implementation specifications. Each layer contains four types of hyperparameters: search radius (r), the number of sampling neighbors (k),... 130
Table 23. Segmentation performance on SemanticKITTI. The P (Phase) indicates the i-th frame output from the proposed network. 134
Table 24. Prediction performance on SemanticKITTI. The Phase indicates the i-th frame output from the proposed network. 134
Table 25. Class-wise segmentation performance on SemanticKITTI. Note that the ratio is the overall average ratio of specific class points over the entire testing set. 'P+R' indicates PointRNN+RandLA-Net. 136
Table 26. Segmentation and prediction performance on Argoverse. 139
Table 27. Prediction performance on SemanticKITTI. Our model with a flow feature constraint and local geometric constraint performs better than the... 140
Table 28. Runtime and model size. The input sequence length is 5 and the number of input points is 4096. The speed is measured by the average frame... 141
List of Figures
Figure 1. Point cloud representation (left), voxel representation (center) and multi-view representation (right) of a 3D point cloud. 19
Figure 2. Illustration of the folding concept. The network can be interpreted as emulating the "force" through multi-scale features extracted from the point... 29
Figure 3. The HandFoldingNet architecture. It takes the preprocessed normalized point cloud with surface normal vectors from a 2D depth image as... 32
Figure 4. An example of a 2D hand skeleton based on the ICVL dataset. The skeleton contains J=16 points, each of which is represented as a 2D coordinate. 35
Figure 5. Joint-wise local feature guided folding block. The local folding block accepts three inputs, which are the previously estimated joint coordinates,... 37
Figure 6. The spatial dependency mapping between hand joints of the ICVL dataset (left). Each joint permutes its embedding ej to map with its two adjacent... 38
Figure 7. Comparison with the state-of-the-art methods using the ICVL (left), MSRA (middle) and NYU (right) dataset. The success rate is shown in this figure. 44
Figure 8. Qualitative results of HandFoldingNet on the ICVL (left), MSRA (middle) and NYU (right) dataset. Hand depth images are transformed into 3D... 46
Figure 9. Illustration of the recurrent concept. For each iteration, our proposed recurrent architecture samples local points around the joints from the previous iteration... 53
Figure 10. Comparison between operation count and estimation error. I is the iteration number of HandR2N2 during inference. 53
Figure 11. The HandR2N2 pipeline. HandR2N2 takes the normalized point cloud transformed from a 2D depth image as the input. The PointNet-based local... 56
Figure 12. The architecture of the residual graph GRU. It first exploits a graph reasoning module to evolve hidden states. The evolved hidden states are... 58
Figure 13. Visualization of the output from the reset gate in the proposed GRU. The small colored points in the circle are the grouped neighbors of the red input... 61
Figure 14. Comparison with the state-of-the-art methods using the ICVL (left), MSRA (middle) and NYU (right) dataset. The success rate is shown in... 65
Figure 15. Qualitative results of HandR2N2 on the ICVL (left), MSRA (middle) and NYU (right) dataset. Hand depth images are transformed into 3D points as... 65
Figure 16. Qualitative visualization of the model trained with Itr=3 and evaluated at different inference Iin on the NYU dataset. Mean error indicates the... 66
Figure 17. Visualization of the failure cases whose mean joint error > 15 mm on the NYU dataset. 66
Figure 18. Training Itr vs. inference Iin. For each column, the models are trained with a fixed iteration number and evaluated at different inference Iin. 69
Figure 19. Effect of the sampling size K. For each line, the inference Iin is the same, but the number K of resampled points per iteration increases. All models are trained... 71
Figure 20. Illustration of the bidirectional learning for scene flow estimation. The features extracted from each input frame are propagated bidirectionally for... 74
Figure 21. Architecture of Bi-PointFlowNet for scene flow estimation. (UP stands for an upsampling layer. BFE&FP stand for a bidirectional flow... 78
Figure 22. Bidirectional feature propagation in the novel Bidirectional Flow Embedding layer. Each point first groups the nearest neighbors from the other... 80
Figure 23. Qualitative results of Bi-PointFlowNet on the non-occluded KITTIs dataset. Points are colored to indicate whether they are from the source frame, the target frame,... 90
Figure 24. Illustration of the multi-scale bidirectional recurrent network for scene flow estimation. The multi-scale features extracted from each input... 96
Figure 25. Examples of two variants of correlation extraction. The bold red and green points are the grouped correlated points of the bold blue source point... 98
Figure 26. Architecture of MSBRN for scene flow estimation. First, the two consecutive input point frames are fed into the shared hierarchical feature... 100
Figure 27. Structure of the bidirectional gated recurrent unit (BGRU). The proposed BGRU accepts source/target point features, flow estimations and... 102
Figure 28. Qualitative results of MSBRN on the non-occluded KITTI dataset. Points are colored to indicate whether they are from the source frame, the target frame or as... 111
Figure 29. Joint segmentation and prediction of point cloud. Our method directly outputs the point-wise segmentation labels (C1, C2, C3,...) as well as predicted... 119
Figure 30. Overall architecture of joint segmentation and prediction. It is based on a hierarchical seq2seq architecture that contains a stack of set abstraction... 122
Figure 31. Temporal Seq2Seq architecture. The bottom RNN units require point sets and their corresponding labels. The upper RNN units receive the output... 123
Figure 32. Architecture of the RNN unit. The RNN unit requires the current coordinates and corresponding labels as input and adopts flow embedding... 124
Figure 33. Visualization of the segmentation and prediction output on the SemanticKITTI benchmark. The colors indicate the different semantic labels of... 132
Figure 34. The segmentation and prediction performance comparison in terms of various input and output sequence lengths on the SemanticKITTI dataset. 137
Figure 35. The segmentation and prediction performance comparison in terms of various numbers of neighbors on the SemanticKITTI dataset. 141