Table of Contents

Title Page

Contents

Abstract 9

Chapter 1. Introduction 10

1.1. Motivation 10

1.2. Overview of the proposed method 11

Chapter 2. Related Works 13

2.1. Speech Synthesis 13

2.1.1. Neural Audio Codec-Based TTS 14

2.2. Talking Head Generation 17

2.2.1. Text-Driven 17

2.2.2. Audio-Driven 17

2.2.3. NeRF-Based Talking Head Synthesis 18

Chapter 3. Proposed Method 20

3.1. Overview 20

3.2. Input Features 22

3.2.1. Audio and Text Features 22

3.2.2. Video Features 22

3.3. Text-to-Speech Module 23

3.4. NeRF-based Talking Head Generation Module 23

3.5. Loss Function 24

Chapter 4. Experiments 25

4.1. Dataset and Evaluation Metrics 25

4.2. Quantitative Evaluation Results 25

4.3. Qualitative Evaluation Results 26

4.3.1. User Study 26

4.4. Advantages of TTS-integrated Synthesis 28

4.4.1. Synchronization 28

4.4.2. Efficiency 29

Chapter 5. Conclusion 31

5.1. Conclusion 31

5.2. Limitations and Future Work 31

Bibliography 33

Table 2.1. A comparison between neural audio codec-based TTS and previous TTS systems. 14

Table 4.1. Quantitative comparison under real audio condition. 26

Table 4.2. User study. 26

Table 4.3. User study. 28

Table 4.4. Comparison of synthesis time for TTS-integrated and cascading approaches. 30

Figure 2.1. Neural audio codec-based TTS 15

Figure 2.2. Pre-trained neural codec model architecture 15

Figure 2.3. RAD-NeRF architecture 19

Figure 3.1. Overview of the proposed method 21

Figure 3.2. EnCodec feature processing 22

Figure 3.3. Audio feature extraction module 23

Figure 4.1. Qualitative evaluation results 27