Title Page
Contents
List of Abbreviations 14
ABSTRACT 15
CHAPTER I. Introduction 18
1.1. Motivation 18
1.2. Related Work 22
1.2.1. Relation Network 23
1.3. Contributions 24
1.4. Outline 25
CHAPTER II. Preliminaries 27
2.1. Overview of FER 27
2.1.1. FER Issues Related to Methods and Datasets 30
2.2. Overview of FER with FSL 31
2.2.1. Generalization on Novel Data 34
2.2.2. Domain Adaptation 35
CHAPTER III. Proposed Method 37
3.1. Data Preprocessing 37
3.2. Network Architecture 40
3.2.1. Feature Embedding 42
3.2.2. Channel Selection 44
3.2.3. Emotion Similarity Learning 48
3.3. Overall Training Process 49
CHAPTER IV. Experimental Results and Discussion 52
4.1. Dataset 52
4.1.1. Dataset Division 52
4.1.2. Dataset Construction 54
4.2. Training Details 56
4.3. Performance Analysis 57
4.4. Ablation Study 63
4.4.1. Ablation Study Comparing the Channel Information of the Sample Feature and the Average Feature 63
4.4.2. Ablation Study on the Impact of Feature Size in the CS Module 65
4.4.3. Ablation Study on Fine-Tuning Weights on Individual Modules 66
CHAPTER V. Conclusion 69
References 71
ABSTRACT IN KOREAN 83
Curriculum Vitae 85
List of Tables
Table 2.1. Various issues existing in facial datasets. 32
Table 2.2. Various issues related to methods. 33
Table 4.1. The number of images per class used in the experiments. 53
Table 4.2. The value of nsample in one training episode. 55
Table 4.3. Environment conditions for the experiments. 56
Table 4.4. Accuracy comparison with additional lip and eye data. 56
Table 4.5. Performance comparison of FER accuracy. 59
Table 4.6. Performance comparison by channel size of the CS module. 65
Table 4.7. Comparison of accuracy (%) with different weights on individual modules. 67
Table 4.8. Comparison of accuracy (%) for various combinations of weights on individual modules. 67
List of Figures
Fig. 1.1. In-the-lab vs. in-the-wild. 19
Fig. 1.2. Deep learning-based SOTA model accuracy for FER datasets. 20
Fig. 1.3. Relation Network architecture. 24
Fig. 2.1. The general pipeline of the facial expression recognition task. 30
Fig. 3.1. The process of obtaining aligned faces from a video dataset. 38
Fig. 3.2. Overview of the MAXDIST peak-frame selection method. 39
Fig. 3.3. An example of selecting a peak frame. 41
Fig. 3.4. The process of cropping the eye and lip regions using landmark information. 41
Fig. 3.5. The overall architecture of the proposed model. 43
Fig. 3.6. Overview of the Channel Selection process. 46
Fig. 3.7. The full pipeline of the Channel Selective Spatial Relation Network. 51
Fig. 4.1. Accuracy of the proposed method with different ratios of the channel-selection parameter λ on RAF-DB. 57
Fig. 4.2. Visual comparison between the sample channel features and the averaged channel features. 64