Title Page
Contents
List of Abbreviations 14
ABSTRACT 15
CHAPTER I. Introduction 18
1.1. Motivation 18
1.2. Related Work 22
1.2.1. Relation Network 23
1.3. Contributions 24
1.4. Outline 25
CHAPTER II. Preliminaries 27
2.1. Overview of FER 27
2.1.1. FER Issues Related to Methods and Datasets 30
2.2. Overview of FER with FSL 31
2.2.1. Generalization on Novel Data 34
2.2.2. Domain Adaptation 35
CHAPTER III. Proposed Method 37
3.1. Data Preprocessing 37
3.2. Network Architecture 40
3.2.1. Feature Embedding 42
3.2.2. Channel Selection 44
3.2.3. Emotion Similarity Learning 48
3.3. Overall Training Process 49
CHAPTER IV. Experimental Results and Discussion 52
4.1. Dataset 52
4.1.1. Dataset Division 52
4.1.2. Dataset Construction 54
4.2. Training Details 56
4.3. Performance Analysis 57
4.4. Ablation Study 63
4.4.1. Ablation Study Comparing the Channel Information of the Sample Feature and the Average Feature 63
4.4.2. Ablation Study on the Impact of Feature Size in the CS Module 65
4.4.3. Ablation Study on Fine-Tuning Weights on Individual Modules 66
CHAPTER V. Conclusion 69
References 71
ABSTRACT IN KOREAN 83
Curriculum Vitae 85
List of Tables
Table 2.1. Various issues existing in facial datasets. 32
Table 2.2. Various issues related to methods. 33
Table 4.1. The number of images per class used in the experiments. 53
Table 4.2. The value of nsample in one training episode. 55
Table 4.3. Environment conditions for the experiments. 56
Table 4.4. Accuracy comparison with additional lip and eye data. 56
Table 4.5. Performance comparison of FER accuracy. 59
Table 4.6. Performance comparison by channel size of the CS module. 65
Table 4.7. Comparison of accuracy (%) with different weights on individual modules. 67
Table 4.8. Comparison of accuracy (%) for various combinations of weights on individual modules. 67
List of Figures
Fig. 1.1. In-the-lab vs. in-the-wild. 19
Fig. 1.2. Deep learning-based SOTA model accuracy for FER datasets. 20
Fig. 1.3. Relation Network architecture. 24
Fig. 2.1. The general pipeline of the facial expression recognition task. 30
Fig. 3.1. The process of obtaining aligned faces from a video dataset. 38
Fig. 3.2. Overview of the MAXDIST peak-frame selection method. 39
Fig. 3.3. An example of selecting a peak frame. 41
Fig. 3.4. The process of cropping the eye and lip regions using landmark information. 41
Fig. 3.5. The overall architecture of the proposed model. 43
Fig. 3.6. Overview of the Channel Selection process. 46
Fig. 3.7. The full pipeline of the Channel Selective Spatial Relation Network. 51
Fig. 4.1. Accuracy of the proposed method with different ratios of the channel-selection parameter λ on RAF-DB. 57
Fig. 4.2. Visual comparison between the sample channel features and the averaged channel features. 64