RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템 = One-shot multi-speaker text-to-speech using RawNet3 speaker representation / 한소희, 엄지섭, 김회린


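Authors
한소희, 엄지섭, 김회린
Publication details
Seoul: 한국음성학회, 2024. 3. 31
Journal
말소리와 음성과학 = Phonetics and speech sciences. Vol. 16, No. 1, Serial No. 61 (March 2024), p. 67-76
Reading room: Seoul Library, electronic resources
Control number
KINX2024104656
Notes
Provided by the National Research Foundation of Korea; KCI-listed (candidate) journal
Main text in Korean; abstracts in English and Korean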

Abstract

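Recent advances in text-to-speech (TTS) technology have greatly improved the quality of synthesized speech, to the point where it can sound close to human speech. In particular, TTS models that offer diverse voice characteristics and personalized voices are widely used in areas such as artificial intelligence (AI) tutors, advertising, and video dubbing. This paper therefore proposes a one-shot multi-speaker TTS system that ensures acoustic diversity and provides personalized voices by synthesizing speech from an utterance of a speaker not seen during training. The proposed model combines a RawNet3-based speaker encoder with a TTS model consisting of a FastSpeech2 acoustic model and a HiFi-GAN vocoder. The speaker encoder extracts an embedding that captures the speaker's timbre from the target speech. The paper implements not only an English one-shot multi-speaker TTS model but also a Korean one. Both objective and subjective metrics were used to evaluate the naturalness and speaker similarity of the speech synthesized by the proposed model. In the subjective evaluation, the proposed Korean one-shot multi-speaker TTS model achieved a naturalness mean opinion score (NMOS) of 3.36 and a similarity MOS (SMOS) of 3.16. In the objective evaluation, the prediction MOS (P-MOS) of the proposed English and Korean one-shot multi-speaker TTS models was 2.54 and 3.74, respectively. These results indicate that the proposed model outperforms the comparison models in both speaker similarity and naturalness.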
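
One common way to combine a speaker encoder with a FastSpeech2-style acoustic model is to project the speaker embedding to the encoder's hidden size and add it to every position of the text-encoder output. The abstract does not say where or how the embedding is injected, so the NumPy sketch below is only an illustration under that assumption. The function name `condition_on_speaker`, the dimensions (192 and 256), and the random projection matrix are hypothetical stand-ins, not the authors' implementation and not the RawNet3 or FastSpeech2 API.

```python
import numpy as np


def condition_on_speaker(encoder_out, speaker_emb, proj):
    """Add a projected, L2-normalised speaker embedding to every encoder position.

    encoder_out: (time, hidden) text-encoder output
    speaker_emb: (emb_dim,) embedding from a speaker encoder
    proj:        (emb_dim, hidden) linear projection (assumed, learned in practice)
    """
    # Normalise so the embedding's scale does not depend on the reference utterance.
    unit = speaker_emb / (np.linalg.norm(speaker_emb) + 1e-8)
    # Map the speaker embedding to the encoder's hidden size.
    projected = unit @ proj  # shape: (hidden,)
    # Broadcast over the time axis: the same speaker vector is added at every position.
    return encoder_out + projected[None, :]


rng = np.random.default_rng(0)
encoder_out = rng.standard_normal((12, 256))   # 12 phonemes, hidden size 256 (assumed)
speaker_emb = rng.standard_normal(192)         # hypothetical embedding size
proj = rng.standard_normal((192, 256)) * 0.01  # random stand-in for a learned layer
conditioned = condition_on_speaker(encoder_out, speaker_emb, proj)
print(conditioned.shape)
```

Because the same vector is added at every position, swapping in an embedding from a different reference utterance shifts the whole sequence toward another speaker without changing the text content, which is the property a one-shot system relies on.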

Articles in this issue

Article title	Author(s)	Pages
ToBI and beyond : phonetic intonation of Seoul Korean ani in Korean Intonation Corpus (KICo)	Ji-eun Kim	p. 1-9

(The) fundamental frequency (f0) distribution of American speakers in a spontaneous speech corpus	Byunggon Yang	p. 11-16

Voice quality distinctions of the three-way stop contrast under prosodic strengthening in Korean	Jiyoung Jang, Sahyang Kim, Taehong Cho	p. 17-24

(A) perception-based analysis of voice onset time (VOT) dissimilation in Korean	Hijo Kang, Mira Oh	p. 25-31

치조 마찰음 왜곡 오류 유무에 따른 아동 발화 적률분석 비교 = Spectral moment analysis of distortion errors in alveolar fricatives in Korean children	한윤주, 김도형, 황자은, 장대현, 김재원	p. 33-40

자기보고식 음성장애 설문지 문항의 가독성 평가 = Evaluation of the readability of self-reported voice disorder questionnaires	곽혜림, 이석재, 이승진, 김향희	p. 41-48

4세 말소리발달 선별검사 개발과 한국어말소리분석도구(Korean Speech Sound Analysis Tool, KSAT)의 활용 = Developing the speech screening test for 4-year-old children and application of Korean speech sound analysis tool (KSAT)	김수진, 장기완, 장문수	p. 49-55

정상압 수두증 환자와 정상 노인의 조음교대운동 수행력 비교 = Characteristics of accurate token and all token diadochokinesis in patients with normal pressure hydrocephalus	윤성희, 박기수, 강경훈, 윤장혁, 하지완	p. 57-65

RawNet3를 통해 추출한 화자 특성 기반 원샷 다화자 음성합성 시스템 = One-shot multi-speaker text-to-speech using RawNet3 speaker representation	한소희, 엄지섭, 김회린	p. 67-76

