
Table of Contents

Title Page

Contents

Abstract 9

Chapter 1. Introduction 10

1.1. Motivation 10

1.2. Overview of the proposed method 11

Chapter 2. Related Works 13

2.1. Speech Synthesis 13

2.1.1. Neural Audio Codec based TTS 14

2.2. Talking Head Generation 17

2.2.1. Text-Driven 17

2.2.2. Audio-Driven 17

2.2.3. NeRF-Based Talking Head Synthesis 18

Chapter 3. Proposed Method 20

3.1. Overview 20

3.2. Input Features 22

3.2.1. Audio and Text Features 22

3.2.2. Video Features 22

3.3. Text-to-Speech Module 23

3.4. NeRF-based Talking Head Generation Module 23

3.5. Loss Function 24

Chapter 4. Experiments 25

4.1. Dataset and Evaluation Metrics 25

4.2. Quantitative Evaluation Results 25

4.3. Qualitative Evaluation Results 26

4.3.1. User Study 26

4.4. Advantages of TTS-integrated Synthesis 28

4.4.1. Synchronization 28

4.4.2. Efficiency 29

Chapter 5. Conclusion 31

5.1. Conclusion 31

5.2. Limitations and Future Work 31

Bibliography 33

List of Tables

Table 2.1. A comparison between neural audio codec based TTS and previous TTS systems. 14

Table 4.1. Quantitative comparison under real audio condition. 26

Table 4.2. User study. 26

Table 4.3. User study. 28

Table 4.4. Comparison of synthesis time for TTS-integrated and cascading approaches 30

List of Figures

Figure 2.1. Neural Audio Codec based TTS 15

Figure 2.2. Pre-trained Neural Codec Model Architecture 15

Figure 2.3. RAD-NeRF Architecture 19

Figure 3.1. Overview of the proposed method 21

Figure 3.2. EnCodec feature processing 22

Figure 3.3. Audio Feature Extraction Module 23

Figure 4.1. Qualitative evaluation results 27

Abstract

With the progression of deep learning techniques, automatically generating videos from audio or text inputs has emerged as a promising and rapidly evolving area of research. This paper presents NeRF-THIS (Neural Radiance Field based Talking Head Synthesis Incorporating Text-to-Speech), a novel approach to text-driven talking head generation that combines the strengths of text-based audio generation models with audio-driven video generation models. The method builds a Neural Radiance Fields (NeRF) based talking head generation architecture integrated with text-to-speech (TTS). This approach has several advantages: 1) it requires only 5 minutes of training data; 2) it is not constrained by Automatic Speech Recognition (ASR) models, thereby offering freedom from language barriers; and 3) it supports real-time inference at low computational cost. Our findings indicate a promising direction for future research in multimedia content generation, opening new avenues for applications in virtual reality, digital entertainment, and interactive media.