Abstract

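This study develops a model that predicts the default risk of household debt using social big data and machine learning, and presents its results. The data comprise 3,927,165 debt-related documents that appeared on 280 online channels over the five years from 2014 to 2018. Supervised learning was used for the machine learning, with random forest and decision tree models applied as the specific algorithms. In addition, association analysis, an unsupervised learning algorithm, was conducted to identify the interrelationships among the predictors of debt default risk. The results show that machine learning algorithms can predict the default risk of household debt at a considerably high level from combinations of the various debt-related and sociodemographic factors that appear in online documents, without personal information such as income, amount of debt, principal and interest repayments, or credit transactions. The study is significant in that it offers practical implications through the development of a household debt default risk prediction model and demonstrates an application of machine learning to understanding social phenomena and predicting at-risk groups.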

This study aims to predict the quality of household debt using big data and a machine learning approach. The data comprise 3,927,165 debt-related documents collected through 280 publicly available online channels in South Korea over the five-year period from 2014 to 2018. The supervised machine learning techniques used include naïve Bayes classification, logistic regression, random forest, decision tree, artificial neural network, and support vector machine algorithms. An unsupervised machine learning technique, association analysis, was also applied. The results show that the machine learning algorithms were highly capable of predicting the quality of household debt from a combination of debt-related and sociodemographic characteristics, without information such as income, assets, total amount of debt, or amount of repayment. Practical and methodological implications of the findings are discussed.
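
As a rough illustration of the association analysis mentioned above, the following sketch computes support, confidence, and lift for rules between factors that co-occur in documents. It is only a minimal example under stated assumptions: the toy documents, the factor names, the helper functions, and the `min_support` threshold are all hypothetical and are not taken from the study's data or code.

```python
from itertools import permutations

# Hypothetical toy data: each set holds the factors mentioned in one document.
docs = [
    {"loan", "overdue", "unemployed"},
    {"loan", "overdue"},
    {"loan", "repayment"},
    {"overdue", "unemployed"},
    {"loan", "overdue", "unemployed"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    if s_a == 0 or s_c == 0:
        # The rule is undefined when either side never occurs.
        return s_ac, 0.0, 0.0
    confidence = s_ac / s_a
    lift = confidence / s_c
    return s_ac, confidence, lift


def pairwise_rules(transactions, min_support=0.5):
    """List single-item rules a -> b whose joint support meets min_support."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s, conf, lift = rule_metrics({a}, {b}, transactions)
        if s >= min_support:
            rules.append((a, b, s, conf, lift))
    # Highest lift first: lift above 1 suggests the factors co-occur
    # more often than they would if they were independent.
    return sorted(rules, key=lambda r: r[4], reverse=True)


if __name__ == "__main__":
    for a, b, s, conf, lift in pairwise_rules(docs):
        print(f"{a} -> {b}: support={s:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
```

In this toy data, every document that mentions "unemployed" also mentions "overdue", so that rule has a confidence of 1.0 and a lift above 1. A real analysis would typically rely on a dedicated implementation such as Apriori rather than enumerating pairs.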