Abstract

This study investigated pronunciation differences between L1 and L2 English speech using phonetic posteriorgram (PPG) vectors obtained from an acoustic model trained for automatic speech recognition (ASR). A PPG is the frame-wise phone-classification output of a conventional ASR system and has been widely applied to Computer-Assisted Pronunciation Training (CAPT). More recently, PPGs have been applied to foreign accent conversion, a field of voice conversion research that can be applied to user-adaptable pronunciation training. This study aimed to investigate L1-L2 differences in PPG patterns in the speech data previously used for this research. PPG vectors of 18 native and 24 non-native English speakers were extracted using a deep neural network (DNN)-based ASR system. When analyzed at the phone level, L1 PPGs were more concentrated on the correct phone dimension than L2 PPGs, which were more dispersed across dimensions. This result is in line with previous PPG-based CAPT research that identified distinct L2 patterns in English speech. The findings are further discussed in the context of their application to pronunciation training.

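This study analyzed the differences between native and non-native English speakers as reflected in phonetic posteriorgram (PPG) features, the output of the acoustic model in automatic speech recognition. PPG features are the frame-wise phone-classification results of conventional speech recognition systems and have been widely used in Computer-Assisted Pronunciation Training (CAPT). More recently, they have also been used in accent conversion research, a type of voice conversion, which is known to be applicable to user-adapted pronunciation training. This study examined the differences in PPG patterns between the native and non-native speech used in such accent conversion research. PPG vectors of 18 native and 24 non-native speakers in total were analyzed, extracted with a conventional speech recognition model based on a deep neural network (DNN). In the phone-level analysis, the per-phone similarity between native and non-native speakers was high. However, the mean PPG features of native speakers showed sharper phone classification than those of non-native speakers, whereas the non-native PPG features showed a relatively dispersed pattern. As in earlier CAPT research, these results suggest that non-native English speech patterns have characteristics that distinguish them from native speech. On this basis, the study confirms the native/non-native differences in pronunciation patterns in the data used for accent conversion research and discusses their application to pronunciation education.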
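
The contrast between "concentrated" and "dispersed" PPGs can be made concrete with a small sketch. This is not the analysis procedure of the study, which the abstract does not specify. It assumes a PPG stored as a frames-by-phones matrix whose rows sum to 1, plus a frame-level reference phone index for each frame. The function name `ppg_concentration`, the two measures (mean posterior on the reference phone and mean per-frame entropy), and the toy numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np

def ppg_concentration(ppg, labels):
    """Summarize how concentrated a PPG is on its reference phones.

    ppg:    (frames, phones) array, each row a posterior distribution.
    labels: (frames,) integer index of the reference phone per frame.
    Returns (mean posterior on the reference phone, mean frame entropy in nats).
    """
    ppg = np.asarray(ppg, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Posterior mass assigned to the reference phone in each frame.
    correct = ppg[np.arange(len(labels)), labels]
    # Shannon entropy per frame; a higher value means a more dispersed posterior.
    entropy = -np.sum(ppg * np.log(np.clip(ppg, 1e-12, None)), axis=1)
    return float(correct.mean()), float(entropy.mean())

# Toy posteriors over three phones (invented values, not data from the study).
sharp = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10]])
diffuse = np.array([[0.50, 0.30, 0.20],
                    [0.30, 0.40, 0.30]])
labels = np.array([0, 1])

sharp_correct, sharp_entropy = ppg_concentration(sharp, labels)
diffuse_correct, diffuse_entropy = ppg_concentration(diffuse, labels)
print(sharp_correct, sharp_entropy)
print(diffuse_correct, diffuse_entropy)
```

Under these assumptions, a pattern like the one reported for L1 speech would show a higher mean posterior on the reference phone and a lower entropy than a pattern like the one reported for L2 speech.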