Title Page
Abstract
Abstract (in Korean)
Contents
Chapter 1. Introduction 19
1.1. Introduction 19
1.2. Outline for this dissertation 22
Chapter 2. Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks 24
2.1. Summary 24
2.2. Introduction 25
2.3. Spatial aliasing and phase replication method 26
2.4. Proposed method 31
2.5. Experimental results 35
Chapter 3. Residual Echo Suppression Considering Harmonic Distortion and Temporal Correlation 40
3.1. Summary 40
3.2. Introduction 41
3.3. Problem Formulation 43
3.4. Harmonic Distortion Residual Echo Suppression 45
3.5. Residual Echo Suppression Considering Harmonic Distortion and Temporal Correlation 46
3.6. Experiments 51
3.6.1. Experiments with Simulated Data 52
3.6.2. Experiments with Real-Recorded Data 55
3.6.3. Computational Complexity of the Proposed RES Algorithm 58
Chapter 4. Speech Enhancement using MLP-based Architecture with Convolutional Token Mixing Module and Squeeze-and-Excitation Network 60
4.1. Summary 60
4.2. Introduction 61
4.3. Speech enhancement and DeepMMSE framework 64
4.4. Proposed cgMLP-SE model 68
4.4.1. Convolutional Token-Mixing Module for Data of Various Lengths 68
4.4.2. Squeeze-and-Excitation Module for Global Contextual Information 70
4.4.3. Additional Feed-forward Module for Better Performance 72
4.5. Experiments 73
4.5.1. Dataset 73
4.5.2. Experimental Setup 75
4.5.3. Experimental Results 77
4.5.4. Ablation Study 78
Chapter 5. WavLM for Speech Enhancement 80
5.1. Summary 80
5.2. Introduction 80
5.3. Method 83
5.3.1. WavLM 83
5.3.2. Regression Loss and Noise Mixing Training 84
5.3.3. Fine-tuning and Exploration Design 85
5.4. Experiment 90
5.4.1. Experiment Setup 90
5.4.2. Evaluation Results for Low Fine-Tuning Resource 91
5.4.3. Evaluation Results for High Fine-Tuning Resource 92
5.4.4. Impact of Different Amounts of Pre-Training Data 92
Chapter 6. Conclusion 94
References 97
List of Tables
Table 2.1. Average minRMSEs for the conventional method ([1]), that with spectral masks only (mask), and the proposed method (Prop.) depending on the SNR and the DOA of source #2 when RT60 = 200 ms. 37
Table 2.2. Average minRMSEs for the conventional method ([1]), that with spectral masks only (mask), and the proposed method (Prop.) depending on the reverberation time and the DOA of source #2 when SNR = 20 dB. 37
Table 3.1. Echo return loss enhancements (ERLEs) in various echo-to-noise ratio (ENR) conditions for the acoustic echo canceller (AEC) without residual echo suppression... 54
Table 3.2. Average perceptual evaluation of speech quality (PESQ) scores in various signal-to-noise ratio (SNR) and signal-to-echo ratio (SER) conditions for the AEC... 55
Table 3.3. ERLEs in various ENR conditions for the AEC, with the PFRES [2], with the HDRES [3], and with the proposed RES during the far-end single-talk periods for the... 56
Table 3.4. Average PESQ scores in various SNR and SER conditions for the AEC without RES, with the PFRES [2], with the HDRES [3], and with the proposed RES... 57
Table 4.1. Comparison of the number of parameters and the computational cost of each model for a 4-second input. M denotes a million. 72
Table 4.2. Speech enhancement performance using the DeepMMSE framework for various models on the TIMIT-DNS noise dataset. 73
Table 4.3. Speech enhancement performance using the DeepMMSE framework for various models on the Voice Bank-DEMAND noise dataset. 74
Table 4.4. Ablation study on the contribution of each module in the cgMLP-SE blocks to the performance improvement for the TIMIT-DNS noise dataset. SE module and FF module denote the squeeze-and-excitation module and the additional... 76
Table 5.1. Results on the low fine-tuning resource setup with WavLM and its variants. Here, reg. and m.n denote the regression loss and the noise mixing training strategy, respectively. 88
Table 5.2. Results on the high fine-tuning resource setup with WavLM and its variants pre-trained with 94k data. Here, reg. and m.n denote the regression loss and the noise mixing training strategy, respectively. 89
List of Figures
Figure 1.1. Overview of the system with the proposed modules for improving speech quality and speech recognition accuracy. 22
Figure 2.1. Block diagram of the proposed multiple sound source localization system. 28
Figure 2.2. DOA histograms for (a) applying the spectral masks only, and (b) applying both the spectral masks and probabilistic voting when two sources are located at -30°... 32
Figure 2.3. The room configurations and the locations of the microphones and sound sources in the experiments. 34
Figure 3.1. General block diagram of the acoustic echo cancellation system with residual echo suppression. 45
Figure 3.2. The ERLEs of the AEC system without RES, with PFRES, with HDRES, and with the proposed RES over time, together with the microphone signal, for one real-... 56
Figure 4.1. The block diagram of the DeepMMSE framework for speech enhancement. 64
Figure 4.2. Illustrations of the proposed cgMLP-SE blocks for speech enhancement: (a) the proposed cgMLP-SE block and the token-mixing module, (b) the Squeeze-and-Excitation (SE) module, and (c) the feed-forward module. 67
Figure 4.3. The MLP-based blocks with the convolutional token-mixing modules: (a) the cMLP-mixer block and (b) the cgMLP block. 70
Figure 5.1. An illustration of (a) the pre-training stage of WavLM and (b) the fine-tuning stage for speech enhancement. The masked predictive self-supervised encoder... 82