National Assembly Library of Korea


Thesis: Decoupled denoising diffusion models with disentangled representation and prior mixup for verified robust voice conversion [electronic resource] = 고품질 제로샷 음성 변환 (High-quality zero-shot voice conversion)

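Author
Ha-Yeong Choi
Publication
Seoul : Korea University Graduate School, 2024.2
Call number
Available in electronic form only
Physical description
1 online resource : PDF
Reading room: Electronic resources
Control number
KDMT12024000023251
Notes
Thesis (Master's) -- Korea University Graduate School, Dept. of Artificial Intelligence, 2024.2. Advisor: Seong-Whan Lee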

Table of Contents

Title Page

Abstract

Abstract (Korean)

Preface

Contents

Chapter 1. Introduction 17

Chapter 2. Related Works 20

2.1. Voice Conversion 20

2.2. Zero-shot Voice Style Transfer 21

2.3. Diffusion Probabilistic Models 23

Chapter 3. Method 24

3.1. Decoupled Denoising Diffusion Models 24

3.2. DDDM-VC 26

3.2.1. Speech Disentanglement 26

3.2.2. Speech Resynthesis 27

3.2.3. Prior Mixup 29

Chapter 4. Experiment and Result 32

4.1. Experimental Setup 32

4.1.1. Datasets 32

4.1.2. Preprocessing 32

4.1.3. Training 33

4.2. Evaluation Metrics 33

4.2.1. Subjective Metrics 33

4.2.2. Objective Metrics 34

4.3. Many-to-Many Voice Conversion 35

4.4. Zero-shot Voice Conversion 36

4.5. One-shot Speaker Adaptation 36

4.6. Zero-shot Cross-lingual Voice Conversion 37

4.7. Ablation Study 38

4.7.1. Prior Mixup 38

4.7.2. Disentangled Denoiser 39

4.7.3. Normalized F0 40

4.7.4. Data-driven Prior 40

Chapter 5. Conclusion 41

Chapter 6. Broader Impact and Limitation 42

6.1. Broader Impact 42

6.2. Limitation 42

Appendices 43

Appendix A. Implementation Details 43

A.1. DDDM-VC 43

A.2. Baseline Models 44

A.3. Computational Cost and Inference Speed 45

A.4. Vocoder 45

Appendix B. Speech Resynthesis on LibriTTS 49

Appendix C. Zero-shot Voice Conversion on LibriTTS 51

Appendix D. One-shot Speaker Adaptation 53

Appendix E. Style Control 55

Appendix F. t-SNE Visualization 56

Appendix G. Evaluation Details 58

Appendix H. Comparison with PPG-based Voice Conversion Model 61

Appendix I. Comparison with Traditional Voice Conversion Methods 63

Appendix J. Audio Mixing 64

J.1. DDDM-Mixer 64

J.2. Implementation Details 65

J.3. Evaluation 66

Appendix K. Text-to-Speech 67

K.1. DDDM-TTS 67

K.2. Text-to-Vec (TTV) 67

K.3. Implementation Details 69

K.4. Evaluation 69

Bibliography 71

List of Tables

Table 4.1. Many-to-many VC results on seen speakers from LibriTTS dataset 34

Table 4.2. Zero-shot VC results on unseen speakers from VCTK dataset. We additionally report the one-shot speaker adaptation result of DDDM-VC model (DDDM-VC-Fine-... 35

Table 4.3. Results of ablation study on many-to-many VC tasks with seen speakers from LibriTTS. 36

Table 4.4. CMOS results between zero-shot and one-shot VC. 36

Table 4.5. Ablation study for Prior Mixup with wrong prior. 38

Table 4.6. Objective evaluation for scaling-up and Prior Mixup. We train each model with LibriTTS-train-960 dataset and evaluate the zero-shot VC performance on VCTK dataset. 38

List of Appendix Tables

Table A.1. Hyperparameters of DDDM-VC. 47

Table A.2. Comparison results of computational cost and inference speed 48

Table A.3. The vocoder evaluation results. We used the pre-trained HiFi-GAN as a vocoder, and we replaced the multi-scale discriminator and multi-period discriminator with MS-STFTD. 48

Table B.1. Speech resynthesis results from dev-clean subset of LibriTTS 50

Table B.2. Speech resynthesis results from dev-other subset of LibriTTS 50

Table B.3. Speech resynthesis results from test-clean subset of LibriTTS 50

Table B.4. Speech resynthesis results from test-other subset of LibriTTS 50

Table C.1. Zero-shot voice conversion results on unseen speakers from LibriTTS-dev-clean subset 52

Table C.2. Zero-shot voice conversion results on unseen speakers from LibriTTS-dev-other subset 52

Table C.3. Zero-shot voice conversion results on unseen speakers from LibriTTS-test-clean subset 52

Table C.4. Zero-shot voice conversion results on unseen speakers from LibriTTS-test-other subset 52

Table E.1. Objective evaluation for controlling each attribute. We report the EER and SECS for the target speaker as the ground truth. 55

Table J.1. The objective evaluation of audio mixing 65

Table K.1. Text-to-Speech results on seen speakers from LibriTTS-train-clean subset 70

Table K.2. Text-to-Speech results on unseen speakers from VCTK dataset 70

List of Figures

Figure 1.1. Speech synthesis in DDDM and standard diffusion model. Although a single denoiser with same parameter is used for all denoising steps in standard diffusion models,... 18

Figure 3.1. Overall framework of DDDM-VC 25

Figure 3.2. (a) Speech resynthesis from disentangled speech representations (training). (b) Voice conversion from converted speech representations (inference). (c) Prior mixup for... 28

Figure 4.1. CER results for zero-shot cross-lingual VC on unseen languages from CSS10 multi-lingual dataset. 37

List of Appendix Figures

Figure D.1. EER and SECS results according to training steps of fine-tuning for one-shot speaker adaptation. 53

Figure D.2. CER and WER results according to training steps of fine-tuning for one-shot speaker adaptation. 53

Figure F.1. t-SNE visualization of content and speaker representation. 57

Figure G.1. The screenshots of the Amazon MTurk MOS survey. $0.08 per 1 hit is paid to participants for nMOS and sMOS. 58

Figure J.1. Overall framework of DDDM-Mixer 65

Figure K.1. Overall framework of DDDM-TTS 68

Abstract

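Diffusion-based generative models have recently shown strong generative performance. However, because the data distribution contains many attributes and sharing model parameters across all steps of the generation process has several limitations, controlling a specific style for each attribute remains a challenging problem. To address this, the thesis proposes decoupled denoising diffusion models (DDDMs) with disentangled representations, which enable effective style transfer for each attribute in generative models. In particular, DDDMs are applied to voice conversion to disentangle speech components such as linguistic information, intonation, and timbre, so that each speech attribute can be controlled individually. First, self-supervised representations are used to disentangle the speech components. DDDMs are then applied to resynthesize speech from the disentangled components, performing style transfer for each attribute. In addition, for robust voice style transfer, the thesis proposes prior mixup, which uses a mixed speaker style as the prior distribution of the diffusion model. Experimental results show that the method outperforms publicly available state-of-the-art voice conversion systems. Moreover, the proposed model shows strong generative performance even with a smaller model size.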
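The prior mixup idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract only states that a mixed speaker style serves as the prior of the diffusion model, so the mixing scheme (shuffling style vectors within a batch), the additive toy prior, and the names `mix_speaker_styles` and `build_prior` are all assumptions made for illustration.

```python
import random


def mix_speaker_styles(styles, rng):
    """Shuffle speaker-style vectors across a batch (assumed mixing scheme).

    Returns the shuffled styles and the permutation that produced them.
    """
    perm = list(range(len(styles)))
    rng.shuffle(perm)
    return [styles[j] for j in perm], perm


def build_prior(content, style):
    """Toy prior mean: add one style vector to every content frame.

    Hypothetical stand-in for the learned encoders of a real system.
    """
    return [[c + s for c, s in zip(frame, style)] for frame in content]


if __name__ == "__main__":
    rng = random.Random(0)
    styles = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # one style per utterance
    contents = [[[0.5, 0.5]], [[1.0, -1.0]], [[0.0, 0.0]]]  # one frame each
    mixed, perm = mix_speaker_styles(styles, rng)
    for i, (content, style) in enumerate(zip(contents, mixed)):
        # The prior uses the style of utterance perm[i]; a denoiser would
        # still be conditioned on styles[i] and trained to restore utterance i.
        print(i, perm[i], build_prior(content, style))
```

Under these assumptions, the point of the sketch is that the prior and the training target deliberately disagree about speaker style, so a model trained this way has to correct a mismatched prior, which is the situation that arises at conversion time.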
