Title Page
Contents
Abstract 9
Chapter 1. Introduction 10
1.1. Motivation 10
1.2. Overview of the proposed method 11
Chapter 2. Related Works 13
2.1. Speech Synthesis 13
2.1.1. Neural Audio Codec based TTS 14
2.2. Talking Head Generation 17
2.2.1. Text-Driven 17
2.2.2. Audio-Driven 17
2.2.3. NeRF-Based Talking Head Synthesis 18
Chapter 3. Proposed Method 20
3.1. Overview 20
3.2. Input Features 22
3.2.1. Audio and Text Features 22
3.2.2. Video Features 22
3.3. Text-to-Speech Module 23
3.4. NeRF-based Talking Head Generation Module 23
3.5. Loss Function 24
Chapter 4. Experiments 25
4.1. Dataset and Evaluation Metrics 25
4.2. Quantitative Evaluation Results 25
4.3. Qualitative Evaluation Results 26
4.3.1. User Study 26
4.4. Advantages of TTS-integrated Synthesis 28
4.4.1. Synchronization 28
4.4.2. Efficiency 29
Chapter 5. Conclusion 31
5.1. Conclusion 31
5.2. Limitations and Future Work 31
Bibliography 33
Table 2.1. A comparison between neural audio codec based TTS and previous TTS systems. 14
Table 4.1. Quantitative comparison under real audio condition. 26
Table 4.2. User study. 26
Table 4.3. User study. 28
Table 4.4. Comparison of synthesis time for TTS-integrated and cascading approaches. 30
Figure 2.1. Neural Audio Codec based TTS 15
Figure 2.2. Pre-trained neural codec model architecture 15
Figure 2.3. Rad-NeRF Architecture 19
Figure 3.1. Overview of the proposed method 21
Figure 3.2. EnCodec feature processing 22
Figure 3.3. Audio feature extraction module 23
Figure 4.1. Qualitative evaluation results 27