Title Page
Abstract
Contents
Chapter 1. Introduction 15
Chapter 2. Preliminaries 19
2.1. Program Slicing 19
2.2. Source Code Embedding 22
2.2.1. Word2Vec, GloVe and FastText 23
2.2.2. Sent2Vec and Doc2Vec 25
2.3. Sequence-to-Sequence Model 26
2.4. Fuzzing Method 31
2.4.1. Fuzzing Overview 31
2.4.2. Library Fuzzing 33
2.4.3. Code Coverage in Fuzzing 34
2.5. Test Framework 36
Chapter 3. Related Works 38
3.1. Static-based Vulnerability Detection 38
3.1.1. Code Similarity-based Approach 39
3.1.2. Pattern-based Approach 40
3.2. Intelligent Fuzzing Method 42
3.2.1. Mutation Scheme 42
3.2.2. Seed Corpus Generation 44
3.2.3. Fuzzing Executable Generation 47
Chapter 4. Static Analysis: Vulnerability Detection with Deep Learning 49
4.1. Contribution 49
4.2. Scope and Assumptions 50
4.2.1. Threat Model 50
4.2.2. Assumptions 52
4.3. Proposed Method: AutoVAS 53
4.3.1. Pre-processing 53
4.3.2. Embedding 61
4.4. Experiment Results 64
4.4.1. Evaluation Metrics 64
4.4.2. Experimental Setup and Takeaways 65
4.4.3. Comparison of Embedding Methods 69
4.4.4. Effectiveness of the AutoVAS 75
4.4.5. Comparison of Other Approaches 77
4.4.6. AutoVAS in Practice 78
Chapter 5. Dynamic Analysis I: Data-driven Fuzzing for Seed Generation 80
5.1. Contribution 80
5.2. Design Consideration 81
5.3. Proposed Method: DDRFuzz 82
5.3.1. Architecture Overview of DDRFuzz 82
5.3.2. Data Gathering Phase 83
5.3.3. Seed Generation Phase 86
5.3.4. Fuzzing Phase 93
5.4. Experiment Results 94
5.4.1. Evaluation Metrics 94
5.4.2. Experiment Setup and Takeaways 94
5.4.3. Comparison of Performance between Seq2seq Models 97
5.4.4. Effectiveness of DDRFuzz 99
5.4.5. Analysis of Fuzzing Efficiency Affected by Model Accuracy 101
5.4.6. Vulnerability Analysis 103
5.4.7. Summary of Experimental Results 105
Chapter 6. Dynamic Analysis II: Automated Fuzz Driver Builder using Test Framework 106
6.1. Contribution 106
6.2. Scope and Assumptions 108
6.2.1. Scope 108
6.2.2. Assumptions 109
6.3. Design Consideration 110
6.4. Proposed Method: FuzzBuilderEx 111
6.4.1. Inferring Fuzzable APIs 111
6.4.2. Generating Seed Corpus 114
6.4.3. Connecting Fuzzer and Library 117
6.4.4. Generating Fuzzing Executables 120
6.5. Experiment Results 124
6.5.1. Evaluation Metrics 124
6.5.2. Experiment Setup and Takeaways 125
6.5.3. Comparison of Library Fuzzing Automation 127
6.5.4. Effectiveness of a Seed Corpus 128
6.5.5. Effectiveness of Fuzzing Executables 131
6.5.6. Vulnerability Analysis 135
6.5.7. Summary of Experimental Results 136
Chapter 7. Discussion 138
7.1. Vulnerability Detection with Deep Learning 138
7.1.1. Dealing with various environments 138
7.1.2. Vulnerability detection using unsupervised learning 139
7.1.3. Searching for the defect origin using Interpretable Machine Learning (IML) 139
7.1.4. Exploitable vulnerability 140
7.1.5. Future directions for better representation 140
7.2. Data-driven Fuzzing for Seed Generation 141
7.2.1. Extension of interesting behavior 141
7.2.2. Applying deep-learning approach in seed selection 141
7.2.3. Applying advanced feedback mechanism 142
7.2.4. Seed generation for highly structured input format 142
7.3. Automated Fuzz Driver Builder using Test Framework 143
7.3.1. Complex Function Sequence 143
7.3.2. Extension of the Target Language 143
7.3.3. Seed Augmentation 144
7.3.4. Lessons Learned 144
Chapter 8. Conclusion 146
Chapter 9. Appendix 149
9.1. Vulnerability types of NVD and SARD Dataset 149
9.2. Vulnerability types of AutoVAS Dataset 150
9.3. Vulnerability types of C language 151
9.4. Seed Generation Algorithm 153
9.5. Fuzzing Executable Generation Algorithm 154
Bibliography 157
Table 2.1. Attention score functions: There are several functions to calculate the attention score, and the various types of attention score functions are as follows. In the table, s_t is the... 29
Table 4.1. Mapping between Vulnerability types and Threats 51
Table 4.2. Overview of AutoVAS Dataset 66
Table 4.3. Hyper-parameters of neural network models 68
Table 4.4. Experiment result for program slicing 70
Table 4.5. Experiment result for word embedding 71
Table 4.6. Experiment result for symbolization 73
Table 4.7. Experiment result for padding and split 74
Table 4.8. Summary of optimal embedding method 74
Table 4.9. Hyper-parameters of deep learning model 76
Table 4.10. Comparative experiment result with other approaches 77
Table 4.11. Found Known Vulnerabilities from open-source projects 78
Table 4.12. Found Unknown Vulnerabilities from open-source projects 79
Table 5.1. Project list and dataset: This study selected five target programs, which are popular open-source projects that use image, audio, and video multimedia files as their input formats... 96
Table 5.2. Experiment environment 97
Table 5.3. Model parameters: The simple seq2seq and attention models used one-layer LSTM as the neural network, and a neural network in the transformer model used a deep neural net-... 98
Table 5.4. Comparison of crash and coverage (path) of different seq2seq models 99
Table 5.5. Effectiveness of DDRFuzz: According to the experimental results, this study achieves higher coverage and detects more crashes in all five target programs. Particularly, in the case of... 100
Table 5.6. Effectiveness of DDRFuzz compared with other fuzzers: In addition to the five target projects used above, we compare performance with other fuzzers using the UniFuzz benchmark... 101
Table 5.7. Effectiveness of DDRFuzz applied to various fuzzers: We compare the effectiveness of DDRFuzz when applied to various fuzzers. According to the experimental results, DDRFuzz with AFL-Fast... 102
Table 5.8. Impact of NLP model accuracy on fuzzing efficiency: According to the experimental results, this study cannot discover a specific accuracy range that consistently achieves the best... 103
Table 5.9. DDRFuzz discovered three vulnerabilities, of which two are known vulnerabilities and one is a zero-day vulnerability. 105
Table 6.1. Experimental Design 126
Table 6.2. Fuzzing Target Library 127
Table 6.3. Library Fuzzing Process 127
Table 6.4. Total FA Count of Target Libraries 128
Table 6.5. FA List of Target Libraries 130
Table 6.6. Fuzzing Executable List of Target Libraries 132
Table 6.7. Comparison of the Line Coverage and Unique Crash Count for Fuzzing Executable 133
Table 6.8. Found Known Vulnerabilities from Open-source Libraries 136
Table 6.9. Found Unknown Vulnerabilities from Open-source Libraries 137
Table 9.1. Vulnerability types of AutoVAS Dataset 150
Table 9.2. Vulnerability types of C language 153
Figure 2.1. CBOW and Skip-gram. CBOW uses surrounding words to predict the central word's meaning, whereas Skip-gram predicts the meaning of surrounding words through the... 23
Figure 2.2. Word embedding for Doc2Vec. Even if the words are different, the embedding vectors of each sentence become similar because the words' embedding vectors are similar. 26
Figure 2.3. Seq2Seq model. This is a representative model for a machine translator. The figure represents a process in which the encoder receives the English input sentence "I am a... 27
Figure 2.4. Attention mechanism: Basic concept of the attention mechanism is that the decoder refers to the entire input sentence of the encoder whenever it predicts the output word.... 28
Figure 2.5. Three attentions of transformer model: The first self-attention is at the encoder, whereas the second self-attention and the third encoder–decoder attention are at the decoder;... 30
Figure 2.6. Illustration of the general fuzzing process. Many advanced fuzzers improve the algorithms based on the illustrated process. 31
Figure 2.7. Two properties of the fuzzing process. First, the exploration property indicates that the energy of closer seeds is similar to the energy of seeds that are further away. Second,... 32
Figure 4.1. Threat Model of AutoVAS. When a hacker attacks a vulnerability inherent in software, an exploit occurs. Unlike legacy safeguards, AutoVAS uses an automated template... 50
Figure 4.2. Overview of AutoVAS. It consists of a learning phase that trains a model using a dataset and a detection phase that uses the trained model to detect whether the input source... 53
Figure 4.3. Process of preprocessing activity. After synthesizing a code snippet by slicing the source code, tokenization is used to prepare for embedding vector generation. 54
Figure 4.4. Labeling the ground truth. This study labeled data based on the function name for the SARD dataset and the file name for the NVD dataset. 59
Figure 4.5. Example of tokenization. The code snippet generated through program-slicing is divided into token units. 61
Figure 4.6. Process of embedding activity. A fixed-length embedding vector is created by using the set of code snippets divided by token units as a corpus. Additionally, to create the... 61
Figure 4.7. Example of code snippets. After forward/backward slicing based on the slicing criteria, the result is assembled to create a code snippet. 67
Figure 4.8. Vulnerability types in the dataset. The vulnerabilities included in the NVD and SARD datasets are classified into eight types, indicating the composition ratios. The buffer... 69
Figure 4.9. Comparison of the symbolization effect of Doc2Vec. The figure shows the comparative performance depending on whether symbolic representation is applied to each word... 72
Figure 4.10. Experimental results depending on snippet size. The figure shows the performance change according to the snippet size for each embedding algorithm. The FastText algorithm... 73
Figure 4.11. Experimental results for the padding and split method depending on snippet size. The figure compares and shows the change in performance according to snippet size for each... 75
Figure 5.1. Architecture overview: DDRFuzz sequentially operates three phases using two engines: data gathering, seed generation, and fuzzing phases. 83
Figure 5.2. Data gathering phase: With AFL4DDRFuzz, this study collected input/output data that trigger interesting behavior as the training data. 84
Figure 5.3. Data gathering process: This study defines the criteria for an interesting seed file as triggering a unique crash or a new path. When the interesting behavior occurs, this study saves... 84
Figure 5.4. Seed generation phase. It sequentially operates the training and inference steps. In the training step, it trains the seq2seq model with the data encoded in Base64 format. Then,... 86
Figure 5.5. Encoding and decoding process: Encoding converts the multimedia binary to an ASCII character string and then generates fixed-length data by applying post-padding... 87
Figure 5.6. Simple seq2seq architecture: This architecture is the RNN-based seq2seq model; this study uses a one-layer LSTM as the neural network. After receiving all the words of the input... 89
Figure 5.7. Seq2seq with attention architecture: To compensate for the limitations of the simple seq2seq model, the attention model uses the attention mechanism wherein the decoder... 90
Figure 5.8. Transformer architecture: The transformer model follows the existing seq2seq structure; however, it implements the encoder/decoder using only the attention mechanism.... 91
Figure 5.9. Fuzzing phase: This is the fuzzing process of the target program using the valuable seed corpus. Separating the previous phases from the fuzzing phase allows us to use various... 92
Figure 5.10. Feedback mechanism: DDRFuzz can add the paired data that trigger the interesting behavior while running the fuzzer. Note that excessive usage of the feedback mechanism... 93
Figure 5.11. Impact of input data length on fuzzing efficiency: When given small input data, as with mpg123 and libavc, the attention or transformer model shows better performance than... 99
Figure 5.12. Data distribution: This study analyzed the distribution of input and inferred data of libavif, which had better performance at the highest accuracy model, and libtiff, which... 104
Figure 6.1. Overview of LLVM IR. LLVM is designed around a language-independent intermediate representation that serves as a portable, high-level assembly language that can be... 109
Figure 6.2. Process overview of inferring FA. A common API list is generated using the target library and test binary, and FAs are selected considering the FA constraints. 112
Figure 6.3. Overall process to generate seed corpus. First, the modified library bitcode and test code are used to collect logs, and then, the seed corpus is generated through parsing and... 114
Figure 6.4. Overall process to Connect Fuzzer and Library. Replacing FA parameters is performed to feed a seed corpus provided by the fuzzer to the target library. 120
Figure 6.5. Overall process to Generate Fuzzing Executables. The fuzzing executable is created after deleting code statements in test functions that are unnecessary for fuzzing. 121
Figure 6.6. Comparison of Line Coverage for Seed Corpus. This study achieved higher code coverage than OSSFuzz in six libraries, excluding boringssl, and higher performance in all... 129
Figure 6.7. Comparison of Execution Speed. Compared to the previous study, FuzzBuilderEx improves execution speed in all seven libraries. Compared to OSSFuzz, FuzzBuilderEx does not... 133
A software vulnerability is an error, failure, or fault in a computer system that causes it to behave incorrectly and allows an attacker to gain unauthorized access or otherwise compromise the system. Owing to advances in automated hacking and analysis technologies in recent years, numerous software security vulnerabilities have been announced. Software vulnerabilities are increasing rapidly, whereas the methods to analyze and cope with them still depend on manual analysis, which results in slow responses. Furthermore, as the exploitation of vulnerabilities as a means of cyber-attack increases, it is essential to detect vulnerabilities before attackers abuse them.
This thesis is a comprehensive study that enhances the efficiency of software vulnerability analysis using data-driven methods, covering both static and dynamic approaches. First, to resolve the limitations of existing static rule-based or similarity-based studies, this study proposes a deep learning-based automated vulnerability analysis system (AutoVAS). AutoVAS effectively represents source code as embedding vectors by using various datasets from the National Vulnerability Database (NVD) and the Software Assurance Reference Database (SARD). Second, this study finds that an NLP model can improve fuzzing efficiency when only a restricted set of seed files is available. Based on this observation, this study designs a sequence-to-sequence (seq2seq) model-based fuzzing system, denoted Data-DRiven Fuzzing (DDRFuzz). DDRFuzz builds a training dataset from interesting seed files that trigger a unique crash or a new path in the target program and infers valuable seed files using the trained model. Finally, to reduce the dependency on human knowledge and further automate fuzzing, this study proposes the FuzzBuilderEx system, which provides an automated fuzzing environment for a library by utilizing its test framework. FuzzBuilderEx performs static/dynamic analysis of the test code to automatically generate a seed corpus and fuzzing executables that enable automated library fuzzing.
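To make the embedding step described above more concrete, the following is a minimal illustrative sketch, not the AutoVAS implementation: it assumes two hypothetical labeled code snippets, a simple regular-expression tokenizer, and arbitrary hyper-parameters (vector size 32, snippet length 50), then trains a Skip-gram Word2Vec model with gensim and pads the token vectors to a fixed length, which is the general form of input a neural network classifier would consume.

```python
# Illustrative sketch only: hypothetical snippets, tokenizer, and
# hyper-parameters; not the AutoVAS pipeline itself.
import re
import numpy as np
from gensim.models import Word2Vec

def tokenize(snippet: str) -> list[str]:
    # Split a code snippet into identifier, number, and operator tokens.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z0-9_]", snippet)

# Hypothetical corpus of program slices labeled vulnerable (1) / safe (0).
snippets = [
    ("char buf[8]; strcpy(buf, argv[1]);", 1),
    ("char buf[8]; strncpy(buf, argv[1], sizeof(buf) - 1);", 0),
]
corpus = [tokenize(code) for code, _ in snippets]

# Train token embeddings with Skip-gram Word2Vec (one of the word
# embedding methods discussed in Section 2.2.1).
w2v = Word2Vec(sentences=corpus, vector_size=32, window=5,
               min_count=1, sg=1, epochs=50)

def embed(tokens: list[str], max_len: int = 50) -> np.ndarray:
    # Map tokens to vectors and post-pad/truncate to a fixed length,
    # producing a (max_len, dim) matrix for the downstream model.
    vecs = [w2v.wv[t] for t in tokens[:max_len]]
    pad = [np.zeros(w2v.vector_size)] * (max_len - len(vecs))
    return np.stack(vecs + pad)

X = np.stack([embed(toks) for toks in corpus])
y = np.array([label for _, label in snippets])
print(X.shape, y.shape)  # e.g. (2, 50, 32) (2,)
```

The thesis itself also evaluates GloVe, FastText, Sent2Vec, and Doc2Vec as alternatives (Sections 2.2.1 and 2.2.2), and compares padding and split strategies for fixing the snippet length (Table 4.7); the fixed-length padding shown here is only one of those options.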
The experiments in this study used datasets and open-source projects from the wild to evaluate the proposed systems. The experimental results demonstrate that the proposed methods achieve high detection performance and significant improvements over prior studies. Notably, this thesis discovered eight zero-day vulnerabilities, five of which were patched after being reported by this study. Three vulnerabilities received Common Vulnerabilities and Exposures (CVE) IDs (CVE-2019-15903, CVE-2019-1010239, CVE-2022-34096) after being detected by AutoVAS, DDRFuzz, and FuzzBuilderEx.