Abstract

Trackback is a feature used mainly in personal media such as blogs: it automatically creates, on a target post chosen by the author, a link to a post that the author has written and published. By adding a link to the trackback on the original post, it connects blogs to one another and builds a communication network among them. Because authors write on their own blogs, they can also write more freely than they could in a comment. This freedom is often abused, however, to generate trackback spam, which wastes network resources and gives visitors misleading information, lowering trust in the post concerned. Trackback spam draws users to the spammer's post by attaching itself to popular posts, which makes it difficult to apply general web spam detection techniques. In this paper, we propose a trackback spam detection method that exploits the nature of trackback, namely that an author links his or her post from another person's post because the author considers the two related. The main idea of the proposed method is to analyze three things: the content similarity among the original page, the trackback page, and the pages that the trackback page links out to; whether sentences in the trackback page are contained in the original text; and the proportion of links in the content of the trackback page. In experiments comparing the method with SVM, which is widely used as a binary classifier, the harmonic mean of the recall on ham and on spam was about 19% higher, which confirms that the performance of the proposed system is satisfactory.
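The three signals named in the abstract, and the evaluation measure, can be illustrated with a minimal Python sketch. It is not the implementation used in the paper. The function names, the bag-of-words cosine similarity, the punctuation-based sentence splitter, and the URL-token definition of link ratio are all assumptions made here for illustration, since the abstract does not specify how each quantity is computed.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercased word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two texts (assumed measure)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def sentences(text):
    """Naive sentence split on ., ! and ? (assumed splitter)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def contained_sentence_ratio(trackback_text, original_text):
    """Fraction of trackback sentences that occur word for word in the original."""
    sents = sentences(trackback_text)
    if not sents:
        return 0.0
    # Pad with spaces so that only whole-word sequences match.
    haystack = " " + " ".join(tokens(original_text)) + " "
    hits = sum(1 for s in sents if " " + " ".join(tokens(s)) + " " in haystack)
    return hits / len(sents)


def link_ratio(text):
    """Fraction of whitespace-separated tokens that are URLs (assumed definition)."""
    words = text.split()
    if not words:
        return 0.0
    links = sum(1 for w in words if w.startswith(("http://", "https://")))
    return links / len(words)


def harmonic_mean_recall(ham_recall, spam_recall):
    """Harmonic mean of the recall on ham and the recall on spam."""
    total = ham_recall + spam_recall
    return 2 * ham_recall * spam_recall / total if total else 0.0
```

In such a sketch, `cosine_similarity` would be applied pairwise to the original page, the trackback page, and the text of the trackback page's outlinks. How the resulting values are combined into a ham or spam decision is left open here, because the abstract does not state it.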