Dissertation: Android malware classification using deep learning

Many papers also convert API calls to vectors [9, 14, 70, 73], and using these vectors as model input produces good results. S. K. Sasidharan et al. [70] trained a model using the Profile Hidden Markov Model (PHMM). API calls and methods from malware in the DREBIN dataset were transformed into an encoded list, with 70% of the data used for training and 30% for testing. The accuracy reached 94.5% with a 7% false positive rate, and the precision and recall were 0.93 and 0.95, respectively. Although opcodes are used less often than permissions and API calls, many studies have relied on them exclusively for malware detection [32, 52, 53, 74, 75, 76, 77, 78, 79, 80, 81]. In one approach, the extracted opcodes were converted to grey images and fed into a deep-learning model, resulting in a detection accuracy of 95.55% and a classification accuracy of 89.96%. V. Sihag et al. [53] used opcodes to address code obfuscation: with the Random Forest algorithm, detection accuracy reached 98.8% on the Drebin and PRAGuard code-obfuscation datasets, using 10,479 malware apps. In [79], the authors proposed an effective opcode extraction method and applied a Convolutional Neural Network for classification, using k-max pooling in the pooling phase to achieve an accuracy of more than 99%. M. Amin et al. [80] vectorized the extracted opcodes through encoding and trained deep neural networks such as bidirectional long short-term memory networks (BiLSTMs). With a dataset of more than 1.8 million apps, the paper reported 99.9% accuracy. Other feature groups are usually combined with permissions, API calls, or opcodes: because these groups often contain few features and are not available in every app, they are difficult to use independently.
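The opcode-to-image step mentioned above can be illustrated with a minimal sketch. It assumes each opcode is a one-byte value (0 to 255) that is used directly as a pixel intensity, and that the image is made roughly square with zero padding; the cited papers may use a different layout. The function name `opcodes_to_grey_image` and the sample bytes are hypothetical.

```python
import math


def opcodes_to_grey_image(opcodes, width=None):
    """Map one-byte opcode values (0-255) to a 2-D grey image.

    Returns a list of rows; each pixel is an int in 0..255.
    The final row is zero-padded. When `width` is not given it
    defaults to ceil(sqrt(n)), which yields a roughly square image.
    """
    if not opcodes:
        raise ValueError("opcode sequence is empty")
    if any(not 0 <= op <= 255 for op in opcodes):
        raise ValueError("opcodes must be one-byte values in 0..255")
    if width is None:
        width = math.ceil(math.sqrt(len(opcodes)))
    height = math.ceil(len(opcodes) / width)
    # Pad with zeros (black pixels) so every row has the same length.
    padded = list(opcodes) + [0] * (width * height - len(opcodes))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical opcode bytes, for illustration only.
sample = [0x12, 0x6E, 0x0C, 0x1A, 0x71, 0x0E, 0x22]
image = opcodes_to_grey_image(sample)
```

The resulting rows could then be resized to a fixed shape and passed to an image classifier; the resizing and the model are outside this sketch.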
From 2019 to the present, according to dblp statistics, only two papers [82, 83] use the Intent feature independently. They report an accuracy of 95.1% [82] and an F1-score of 97% [83]; however, the datasets are self-collected, and the number of usable files is small. Some common API packages in Android malware detection datasets are described in Table 1.4. Feature combination is commonly used; permissions and API calls appear most often because they play a crucial part in malware detection [14, 25, 33, 44, 84, 85, 86, 87, 88, 89, 90]. In many research papers, combining feature groups has proved highly effective in the evaluation results.
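As a rough illustration of feature combination, the sketch below encodes the permissions and API calls of one app as binary vectors over fixed vocabularies and concatenates them. The vocabularies, the helper names `binary_vector` and `combine_features`, and the sample app are assumptions made for this example; they are not taken from the cited papers.

```python
def binary_vector(present, vocabulary):
    """Return 1 for each vocabulary item found in the app, else 0."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]


def combine_features(permissions, api_calls, perm_vocab, api_vocab):
    """Concatenate the permission vector and the API-call vector."""
    return binary_vector(permissions, perm_vocab) + binary_vector(api_calls, api_vocab)


# Tiny illustrative vocabularies; real studies use far larger ones.
PERM_VOCAB = [
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
]
API_VOCAB = [
    "android.telephony.SmsManager.sendTextMessage",
    "java.net.URL.openConnection",
]

# Hypothetical app: two permissions and one API call observed.
vector = combine_features(
    permissions=["android.permission.INTERNET", "android.permission.SEND_SMS"],
    api_calls=["android.telephony.SmsManager.sendTextMessage"],
    perm_vocab=PERM_VOCAB,
    api_vocab=API_VOCAB,
)
```

Plain concatenation keeps each feature group in a fixed position of the input vector, so a model can weight the groups separately; other combination schemes are possible.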

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

LE DUC THUAN

ANDROID MALWARE CLASSIFICATION USING DEEP LEARNING

Major: Computer Engineering
Code: 9480106

DOCTORAL DISSERTATION OF COMPUTER ENGINEERING

SUPERVISORS
Ph.D. Nguyen Kim Khanh
Ph.D. Hoang Van Hiep

Hanoi−2024

DECLARATION OF AUTHORSHIP

I declare that my dissertation titled "Android malware classification using deep learning" has been entirely composed by myself, supervised by my co-supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep. I affirm the following:
• This work was done as a part of the requirements for the degree of Ph.D. at Hanoi University of Science and Technology.
• This dissertation has not previously been submitted for any degree.
• The results in my dissertation are my independent work, except where works in collaboration have been included. Other appropriate acknowledgments are given within this dissertation by explicit references.

Hanoi, April, 2024
Ph.D. Student LE DUC THUAN
SUPERVISORS: Ph.D. NGUYEN KIM KHANH, Ph.D. HOANG VAN HIEP

ACKNOWLEDGEMENT

My dissertation was realized during my doctoral course at the School of Information and Communication Technology (SoICT), Hanoi University of Science and Technology (HUST). HUST is a special place where I accumulated immense knowledge during my Ph.D. A Ph.D. is not a one-man process. Therefore, I am heartily thankful to my supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep, whose encouragement, guidance, and support from start to finish enabled me to develop my research skills and understanding of the subject. I have learned countless things from them. This dissertation would not have been possible without their precious support.
I would like to thank the Executive Board and all members of the Computer Engineering Department, SoICT, and HUST for their frequent support in my Ph.D. course. I thank my colleagues at the Academy of Cryptography Techniques for their help. Last but not least, I would like to thank my family, my parents, my wife, and my friends, who have supported me spiritually throughout my life. They were always there to cheer me up and stand by me through good and bad times.

Hanoi, April, 2024
Ph.D. Student LE DUC THUAN

CONTENTS

CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
1 OVERVIEW OF ANDROID MALWARE CLASSIFICATION BASED ON MACHINE LEARNING
  1.1 Background Information
    1.1.1 Android Platform
    1.1.2 Overview of Android Malware
  1.2 Android Malware Classification Methods
    1.2.1 Signature-based Method
    1.2.2 Anomaly-based Method
    1.2.3 Android Malware Classification Evaluation Metrics
      1.2.3.1 Metrics for the Binary Classification Problem
      1.2.3.2 Metrics for Multi-labelled Classification Problem
    1.2.4 Android Malware Dataset
  1.3 Machine Learning-based Method for Android Malware Classification
  1.4 Related Works
    1.4.1 Related Works on Feature Extraction
      1.4.1.1 Features Extraction Methods
      1.4.1.2 Feature Augmentation Methods
      1.4.1.3 Feature Selection Methods
    1.4.2 Related Works on Machine Learning-based Methods
      1.4.2.1 Random Forest Algorithm
      1.4.2.2 Support Vector Machine
      1.4.2.3 K-Nearest Neighbor Algorithm
      1.4.2.4 Deep Belief Network
      1.4.2.5 Convolutional Neural Network
      1.4.2.6 Some Other Models
  1.5 Proposed Methodology
  1.6 Chapter Summary
2 PROPOSED METHODS FOR FEATURE EXTRACTION
  2.1 Feature Augmentation based on Co-occurrence Matrix
    2.1.1 Proposed Idea
    2.1.2 Raw Feature Extraction
    2.1.3 Co-occurrence Matrix Feature Computation
    2.1.4 Experimental Results
      2.1.4.1 Experimental Dataset
      2.1.4.2 Experimental Scenario
      2.1.4.3 Malware Classification based on CNN Model
      2.1.4.4 Summary of Experimental Results
  2.2 Feature Augmentation based on Apriori Algorithm
    2.2.1 Proposed Idea
    2.2.2 Apriori Algorithm
      2.2.2.1 Introduction to Apriori Algorithm
      2.2.2.2 Apriori Algorithm
    2.2.3 Feature Set Creation
      2.2.3.1 Raw Android Feature Set
      2.2.3.2 The Feature Augmentation Set
      2.2.3.3 Input Feature Normalization
      2.2.3.4 Feature Augmentation Set
    2.2.4 Experimental Results
      2.2.4.1 Experimental Dataset and Scenario
      2.2.4.2 Experiment based on CNN Model
      2.2.4.3 Summary of Experimental Results
      2.2.4.4 Evaluation
  2.3 Feature Selection Based on Popularity and Contrast Value in a Multi-objective Approach
    2.3.1 Proposed Idea
    2.3.2 Popularity and Contrast Computation
    2.3.3 Pareto Multi-objective Optimization Method
    2.3.4 Selection Function and Implementation
    2.3.5 Experimental Results
      2.3.5.1 Experimental Dataset
      2.3.5.2 Experimental Scenario
      2.3.5.3 Summary of Experimental Results
      2.3.5.4 Evaluation
  2.4 Chapter Summary
3 DEEP LEARNING-BASED ANDROID MALWARE CLASSIFICATION
  3.1 Applying DBN Model
    3.1.1 DBN Model
    3.1.2 Boltzmann Machine and Deep Belief Network
      3.1.2.1 Restricted Boltzmann Machine
      3.1.2.2 Deep Belief Network
    3.1.3 Experimental Results
      3.1.3.1 Experimental Dataset
      3.1.3.2 Experimental Scenario
      3.1.3.3 Summary of Experimental Results
  3.2 Applying CNN Model
    3.2.1 CNN Model
    3.2.2 Experimental Results
      3.2.2.1 Experimental Dataset
      3.2.2.2 Raw Feature Dataset
      3.2.2.3 Malware Classification using CNN Model
      3.2.2.4 Summary of Experimental Results
  3.3 Proposed Method using WDCNN Model for Android Malware Classification
    3.3.1 Proposed Idea
    3.3.2 Building Components in the WDCNN Model
      3.3.2.1 Mathematical Model
      3.3.2.2 Data Partitioning
    3.3.3 Experimental Results
      3.3.3.1 Experimental Dataset
      3.3.3.2 Feature Extraction
      3.3.3.3 Experimental Scenarios
      3.3.3.4 Summary of Experimental Results
      3.3.3.5 Evaluation Results
  3.4 Applying Federated Learning Model
    3.4.1 Federated Learning Model
    3.4.2 Implement Federated Learning Model
      3.4.2.1 Mathematical Model
      3.4.2.2 The Process of Synthesizing Weight Set
    3.4.3 Experimental Results
      3.4.3.1 Experimental Dataset
      3.4.3.2 Experimental Scenario
      3.4.3.3 Summary of Experimental Results
  3.5 Chapter Summary
CONCLUSIONS
PUBLICATIONS
BIBLIOGRAPHY

ABBREVIATIONS

No.  Abbreviation  Meaning
1    Acc           Accuracy
2    API           Application Programming Interface
3    CNN           Convolutional Neural Network
4    DBN           Deep Belief Network
5    DNN           Deep Neural Network
6    FN            False Negative
7    FP            False Positive
8    GA            Genetic Algorithm
9    GAN           Generative Adversarial Network
10   RGB           Red-Green-Blue
11   IG            Information Gain
12   KNN           K-Nearest Neighbors
13   LSTM          Long Short-Term Memory
14   PSO           Particle Swarm Optimization
15   RF            Random Forest
16   RNN           Recurrent Neural Network
17   SVM           Support Vector Machine
18   TF-IDF        Term Frequency – Inverse Document Frequency
19   TN            True Negative
20   TP            True Positive
21   RBM           Restricted Boltzmann Machine
22   WDCNN         Wide and Deep CNN
23   XML           AndroidManifest.xml
24   DEX           classes.dex

LIST OF TABLES

1.1 Types of malware
1.2 Summary of Android malware datasets
1.3 Sensitive permissions
1.4 Common API packages
1.5 Common suspicious API call
1.6 Some typical traffic flows
2.1 Details of parameters set in the CNN model
2.2 Classification with CNN model using accuracy measure (%)
2.3 Measurements evaluate effectiveness (%)
2.4 Details of parameters set in the CNN model
2.5 Classification results by CNN
2.6 Results of using CNN with measurements (%)
2.7 Details of parameters set in the CNN model for selection feature
2.8 Summary of feature evaluation measures selectivity functions (top (10)) – with API set
2.9 Summary of results with datasets and feature sets
2.10 Summary of results of proposed feature augmentation methods
3.1 Result with Acc measure (%) in scenario 1
3.2 Result with Acc measure (%) in scenario 2
3.3 Results with measures in scenario 3 (%)
3.4 Experimental results using CNN model
3.5 The datasets used for the experiment
3.6 Experimental results of Simple dataset
3.7 Experimental results of Complex dataset
3.8 Experimental results when comparing models
3.9 Accuracy comparison of models (features: images 128x128 + permission + API)
3.10 Experimental results with scenario 3 (%)
3.11 Average set of weights (accuracy - %)
3.12 Set of weights according to the number of samples (accuracy - %)
3.13 Our proposed set of weights (accuracy - %)
3.14 Summary of results of proposed machine learning, deep learning models and comparison

LIST OF FIGURES

1.1 Architecture of Android OS system [37]
1.2 The increase of malware on Android OS
1.3 Types of malware on Android OS
1.4 Anomaly-Based Detection Technique
1.5 Overview of the problem of detecting malware on the Android
1.6 General model of feature extraction methods
1.7 Statistics of papers using machine learning and deep learning from 2019-2022 on dblp
1.8 Architecture of the CNN model [133]
2.1 Evaluation model for Android malware classification using co-occurrence matrix
2.2 Output matrix with different size
2.3 Top (10) malware families in Drebin dataset
2.4 CNN having multi-convolutional networks
2.5 The process of research and experiment using Apriori
2.6 Apply the Apriori algorithm to the feature set
2.7 Architecture of CNN model used in the experiment with Apriori
2.8 Learning method implementation results
2.9 Proposing a feature selection model
2.10 Top (20) family of malware with the most samples in the AMD dataset
2.11 Experimental model when applying feature selection algorithm
2.12 Experimental results when applying feature selection algorithm
3.1 System development and evaluation process using the DBN
3.2 Architectural diagram of DBN application in Android malware detection
3.3 The overall model of the training and classification of malware using the CNN model
3.4 Test rate according to the 10-fold
3.5 WDCNN model operation diagram
3.6 Structure and parameters of the WDCNN model
3.7 Top 20 malware family AMD and Drebin
3.8 Experimental model
3.9 Classification of malware depending on the number of labels
3.10 DEX file size by size in the Drebin dataset
3.11 Overall model using federated learning
3.12 Compare the results of the weighted aggregation methods
3.13 Classification results with influence factor

INTRODUCTION

In the present day, there is a growing inclination towards the adoption of digital transformation and artificial intelligence in smart device applications across diverse operating systems. This trend aligns with the advancements of the fourth industrial revolution and is being observed in numerous domains of social and economic activity. According to the statistics [1], in June 2023 Android dominated the market for mobile operating systems with a 70.79% share. Furthermore, the Android operating system is used in a diverse range of smart devices, including but not limited to mobile phones, televisions, watches, automobiles, vending machines, and network routers. The rapid growth and variety of devices that use the Android operating system (OS) have contributed to a significant increase in the number, style, and appearance of malware. According to the statistics [2], in 2021 a total of 3.36 million pieces of malware were found in the Android OS market. This situation endangers users of mobile operating systems. Solving the problem of malware detection is, therefore, urgent and necessary. As reported in the DBLP database [3], 1,081 research publications addressed this issue from 2013 to 2022.

Two main approaches are commonly applied to detect Android malware: static and dynamic analysis. Static analysis involves inspecting a program's executable file structure, characteristics, and source code.
The advantage of static analysis is that it does not require the code to be executed (running a malware file on a real system is, of course, quite dangerous). By examining the decompiled code, static analysis can determine the flows and actions of the executable file and thus identify it as either malware or benign. The disadvantage, however, is that some sophisticated malware can include malicious runtime behavior that goes undetected. Dynamic analysis, on the other hand, involves executing potentially malicious code in a real or sandbox environment to monitor its behavior. The sandbox environment helps analysts examine potential threats without putting the system at risk of infection. Although dynamic analysis can detect threats that static analysis might miss, it requires more time and resources and may not cover all possible execution paths of the malware. In summary, static analysis helps find known threats and vulnerabilities, whereas dynamic analysis is suited to finding new types of malware and uncovering previously undocumented threats (i.e., zero-day threats). For malware detection, dynamic analysis is recommended for organizations that need a deeper understanding of malware behavior or impact and have the necessary tools and expertise to perform it. For malware classification, static analysis is more popular because it is more straightforward to implement. This dissertation also uses static analysis as the main method for feature extraction [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].

Malware classification assigns malware samples to specific malware families, including benign ones. Signature-based and machine learning-based methods have usually been used for this problem. Signature-based methods are traditional and widely used [15, 16, 17]. They rely on matching the "signature" of known malware samples with unknown ones.
As mentioned in the previous pa
