In addition, many papers convert API calls into vectors [9, 14, 70, 73], and using
such vectors as model input also produces good results. S. K. Sasidharan et al. [70]
trained a model using the Profile Hidden Markov Model (PHMM). API calls and
methods from malware in the DREBIN dataset were transformed into an encoded
list, with 70% of the data used for training and 30% for testing. The model reached
94.5% accuracy with a 7% false positive rate; precision and recall were 0.93 and
0.95, respectively.
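To make the relationship between the reported measures concrete, the following is a minimal sketch of how accuracy, precision, recall, and false positive rate follow from the four confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not the data of [70].

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification measures from confusion counts.

    tp/fn count malware samples classified correctly/incorrectly;
    tn/fp count benign samples classified correctly/incorrectly.
    """
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "false_positive_rate": _ratio(fp, fp + tn),
    }


# Hypothetical counts: 100 malware and 100 benign test samples.
metrics = binary_metrics(tp=95, fp=7, tn=93, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts the false positive rate is 7/100 and the recall is 95/100, which shows how a high accuracy can coexist with a non-negligible false positive rate.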
Although opcodes are used less often than permissions and API calls, many studies
have relied on them exclusively for malware detection [32, 52, 53, 74, 75, 76,
77, 78, 79, 80, 81]. Extracted opcodes have been converted to grey images and fed into
a deep-learning model, yielding a detection accuracy of 95.55% and a classification
accuracy of 89.96%. V. Sihag et al. [53] used opcodes to address the problem of code
obfuscation, achieving 98.8% detection accuracy with the Random Forest algorithm
on the Drebin and PRAGuard (obfuscated code) datasets, using 10,479 malware apps.
In [79], the authors proposed an effective opcode extraction method and applied a
Convolutional Neural Network for classification; k-max pooling was used in the
pooling phase, and the accuracy exceeded 99%. M. Amin et al. [80], on the other hand,
vectorized the extracted opcodes through encoding and trained deep neural networks,
e.g., bidirectional long short-term memory (BiLSTM) networks. With a dataset of more
than 1.8 million apps, the paper reported 99.9% accuracy.
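The grey-image conversion mentioned above can be sketched as follows. The cited studies do not specify their exact procedure here, so this is only one plausible reading: each opcode is treated as a one-byte intensity value, and the sequence is laid out row by row with zero padding. The function name and the image width are assumptions made for illustration.

```python
def opcodes_to_gray_image(opcodes, width):
    """Lay out a sequence of opcode values as rows of 8-bit grey pixels.

    opcodes: iterable of integers (values are masked to the range 0-255).
    width:   number of pixels per row; the last row is padded with zeros.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = [op & 0xFF for op in opcodes]
    padding = (-len(pixels)) % width  # zeros needed to fill the last row
    pixels.extend([0] * padding)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Hypothetical opcode sequence, laid out as a 4-pixel-wide image.
image = opcodes_to_gray_image([0x12, 0x6E, 0x0C, 0x0E, 0x1A], width=4)
for row in image:
    print(row)
```

A fixed width keeps all images comparable in one dimension; in practice the rows would typically be resized or truncated to a fixed height before being passed to a CNN.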
Other feature groups are usually combined with permissions, API calls, or opcodes.
Because these groups often contain few features and are not available in all apps,
they are difficult to use independently. According to dblp statistics, from 2019 until
now only two papers [82, 83] have used the Intent feature on its own. They report an
accuracy of 95.1% [82] and an F1-score of 97% [83]; however, the datasets are
self-collected, and the number of usable files in each dataset is small.
Some common API packages found in Android malware detection datasets are
described in Table 1.4.
Feature combination is commonly used, with permissions and API calls appearing
most often because they play a crucial part in malware detection [14, 25, 33, 44, 84,
85, 86, 87, 88, 89, 90]. In many research papers, evaluation results show that
combining feature groups is highly effective.
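As an illustration of feature combination, the sketch below encodes the permissions and API calls of an app as two binary vectors over fixed vocabularies and concatenates them into one input vector. The vocabularies are deliberately tiny and hypothetical; real studies use far larger feature sets, and the helper names are not taken from any cited work.

```python
# Illustrative vocabularies; real feature sets contain many more entries.
PERMISSIONS = [
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
]
API_CALLS = [
    "android.telephony.SmsManager.sendTextMessage",
    "java.net.URL.openConnection",
]


def encode(present, vocabulary):
    """Return a binary vector marking which vocabulary items are present."""
    present = set(present)
    return [1 if name in present else 0 for name in vocabulary]


def combine_features(app_permissions, app_api_calls):
    """Concatenate the permission vector and the API-call vector."""
    return encode(app_permissions, PERMISSIONS) + encode(app_api_calls, API_CALLS)


vector = combine_features(
    ["android.permission.INTERNET"],
    ["java.net.URL.openConnection"],
)
print(vector)
```

Simple concatenation keeps each feature group in a fixed position, so a model can learn interactions between a permission and an API call that uses it.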
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
LE DUC THUAN
ANDROID MALWARE CLASSIFICATION
USING DEEP LEARNING
DOCTORAL DISSERTATION OF
COMPUTER ENGINEERING
Hanoi−2024
Major: Computer Engineering
Code: 9480106
SUPERVISORS
Ph.D. Nguyen Kim Khanh
Ph.D. Hoang Van Hiep
DECLARATION OF AUTHORSHIP
I declare that my dissertation titled "Android malware classification using deep
learning" has been entirely composed by myself, under the supervision of my
co-supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep. I affirm the
following:
• This work was done as a part of the requirements for the degree of Ph.D. at Hanoi
University of Science and Technology.
• This dissertation has not previously been submitted for any degree.
• The results in my dissertation are my independent work, except where collaborative
work has been included. Other appropriate acknowledgments are given within this
dissertation by explicit references.
Hanoi, April, 2024
Ph.D. Student
LE DUC THUAN
SUPERVISORS
Ph.D. NGUYEN KIM KHANH
Ph.D. HOANG VAN HIEP
ACKNOWLEDGEMENT
My dissertation was completed during my doctoral course at the School of Information
and Communication Technology (SoICT), Hanoi University of Science and Technology
(HUST). HUST is a special place where I accumulated immense knowledge during my
Ph.D. studies.
A Ph.D. is not a one-person undertaking. Therefore, I am heartily thankful to my
supervisors, Ph.D. Nguyen Kim Khanh and Ph.D. Hoang Van Hiep, whose encouragement,
guidance, and support from start to finish enabled me to develop my research
skills and understanding of the subject. I have learned countless things from them.
This dissertation would not have been possible without their precious support.
I would like to thank the Executive Board and all members of the Computer
Engineering Department, SoICT, and HUST for their frequent support during my Ph.D.
course. I thank my colleagues at the Academy of Cryptography Techniques for their help.
Last but not least, I would like to thank my family, my parents, my wife, and my
friends, who have supported me spiritually throughout my life. They were always there
to cheer me up and stand by me through good and bad times.
Hanoi, April, 2024
Ph.D. Student
LE DUC THUAN
CONTENTS
CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
1 OVERVIEW OF ANDROID MALWARE CLASSIFICATION BASED ON MACHINE LEARNING
1.1 Background Information
1.1.1 Android Platform
1.1.2 Overview of Android Malware
1.2 Android Malware Classification Methods
1.2.1 Signature-based Method
1.2.2 Anomaly-based Method
1.2.3 Android Malware Classification Evaluation Metrics
1.2.3.1 Metrics for the Binary Classification Problem
1.2.3.2 Metrics for Multi-labelled Classification Problem
1.2.4 Android Malware Dataset
1.3 Machine Learning-based Method for Android Malware Classification
1.4 Related Works
1.4.1 Related Works on Feature Extraction
1.4.1.1 Features Extraction Methods
1.4.1.2 Feature Augmentation Methods
1.4.1.3 Feature Selection Methods
1.4.2 Related Works on Machine Learning-based Methods
1.4.2.1 Random Forest Algorithm
1.4.2.2 Support Vector Machine
1.4.2.3 K-Nearest Neighbor Algorithm
1.4.2.4 Deep Belief Network
1.4.2.5 Convolutional Neural Network
1.4.2.6 Some Other Models
1.5 Proposed Methodology
1.6 Chapter Summary
2 PROPOSED METHODS FOR FEATURE EXTRACTION
2.1 Feature Augmentation based on Co-occurrence Matrix
2.1.1 Proposed Idea
2.1.2 Raw Feature Extraction
2.1.3 Co-occurrence Matrix Feature Computation
2.1.4 Experimental Results
2.1.4.1 Experimental Dataset
2.1.4.2 Experimental Scenario
2.1.4.3 Malware Classification based on CNN Model
2.1.4.4 Summary of Experimental Results
2.2 Feature Augmentation based on Apriori Algorithm
2.2.1 Proposed Idea
2.2.2 Apriori Algorithm
2.2.2.1 Introduction to Apriori Algorithm
2.2.2.2 Apriori Algorithm
2.2.3 Feature Set Creation
2.2.3.1 Raw Android Feature Set
2.2.3.2 The Feature Augmentation Set
2.2.3.3 Input Feature Normalization
2.2.3.4 Feature Augmentation Set
2.2.4 Experimental Results
2.2.4.1 Experimental Dataset and Scenario
2.2.4.2 Experiment based on CNN Model
2.2.4.3 Summary of Experimental Results
2.2.4.4 Evaluation
2.3 Feature Selection Based on Popularity and Contrast Value in a Multi-objective Approach
2.3.1 Proposed Idea
2.3.2 Popularity and Contrast Computation
2.3.3 Pareto Multi-objective Optimization Method
2.3.4 Selection Function and Implementation
2.3.5 Experimental Results
2.3.5.1 Experimental Dataset
2.3.5.2 Experimental Scenario
2.3.5.3 Summary of Experimental Results
2.3.5.4 Evaluation
2.4 Chapter Summary
3 DEEP LEARNING-BASED ANDROID MALWARE CLASSIFICATION
3.1 Applying DBN Model
3.1.1 DBN Model
3.1.2 Boltzmann Machine and Deep Belief Network
3.1.2.1 Restricted Boltzmann Machine
3.1.2.2 Deep Belief Network
3.1.3 Experimental Results
3.1.3.1 Experimental Dataset
3.1.3.2 Experimental Scenario
3.1.3.3 Summary of Experimental Results
3.2 Applying CNN Model
3.2.1 CNN Model
3.2.2 Experimental Results
3.2.2.1 Experimental Dataset
3.2.2.2 Raw Feature Dataset
3.2.2.3 Malware Classification using CNN Model
3.2.2.4 Summary of Experimental Results
3.3 Proposed Method using WDCNN Model for Android Malware Classification
3.3.1 Proposed Idea
3.3.2 Building Components in the WDCNN Model
3.3.2.1 Mathematical Model
3.3.2.2 Data Partitioning
3.3.3 Experimental Results
3.3.3.1 Experimental Dataset
3.3.3.2 Feature Extraction
3.3.3.3 Experimental Scenarios
3.3.3.4 Summary of Experimental Results
3.3.3.5 Evaluation Results
3.4 Applying Federated Learning Model
3.4.1 Federated Learning Model
3.4.2 Implement Federated Learning Model
3.4.2.1 Mathematical Model
3.4.2.2 The Process of Synthesizing Weight Set
3.4.3 Experimental Results
3.4.3.1 Experimental Dataset
3.4.3.2 Experimental Scenario
3.4.3.3 Summary of Experimental Results
3.5 Chapter Summary
CONCLUSIONS
PUBLICATIONS
BIBLIOGRAPHY
ABBREVIATIONS
No. Abbreviation Meaning
1 Acc Accuracy
2 API Application Programming Interface
3 CNN Convolutional Neural Network
4 DBN Deep Belief Network
5 DNN Deep Neural Network
6 FN False Negative
7 FP False Positive
8 GA Genetic Algorithm
9 GAN Generative Adversarial Network
10 RGB Red-Green-Blue
11 IG Information Gain
12 KNN K-Nearest Neighbors
13 LSTM Long Short-Term Memory
14 PSO Particle Swarm Optimization
15 RF Random Forest
16 RNN Recurrent Neural Network
17 SVM Support Vector Machine
18 TF-IDF Term Frequency – Inverse Document Frequency
19 TN True Negative
20 TP True Positive
21 RBM Restricted Boltzmann Machine
22 WDCNN Wide and Deep CNN
23 XML AndroidManifest.xml
24 DEX classes.dex
LIST OF TABLES
1.1 Types of malware
1.2 Summary of Android malware datasets
1.3 Sensitive permissions
1.4 Common API packages
1.5 Common suspicious API call
1.6 Some typical traffic flows
2.1 Details of parameters set in the CNN model
2.2 Classification with CNN model using accuracy measure (%)
2.3 Measurements evaluate effectiveness (%)
2.4 Details of parameters set in the CNN model
2.5 Classification results by CNN
2.6 Results of using CNN with measurements (%)
2.7 Details of parameters set in the CNN model for selection feature
2.8 Summary of feature evaluation measures selectivity functions (top (10)) – with API set
2.9 Summary of results with datasets and feature sets
2.10 Summary of results of proposed feature augmentation methods
3.1 Result with Acc measure (%) in scenario 1
3.2 Result with Acc measure (%) in scenario 2
3.3 Results with measures in scenario 3 (%)
3.4 Experimental results using CNN model
3.5 The datasets used for the experiment
3.6 Experimental results of Simple dataset
3.7 Experimental results of Complex dataset
3.8 Experimental results when comparing models
3.9 Accuracy comparison of models. Features: Images 128x128 + permission + API
3.10 Experimental results with scenario 3 (%)
3.11 Average set of weights (accuracy - %)
3.12 Set of Weights according to the number of samples (accuracy - %)
3.13 Our proposed set of weights (accuracy - %)
3.14 Summary of results of proposed machine learning, deep learning models and comparison
LIST OF FIGURES
1.1 Architecture of Android OS system [37]
1.2 The increase of malware on Android OS
1.3 Types of malware on Android OS
1.4 Anomaly-Based Detection Technique
1.5 Overview of the problem of detecting malware on the Android
1.6 General model of feature extraction methods
1.7 Statistics of papers using machine learning and deep learning from 2019-2022 on dblp
1.8 Architecture of the CNN model [133]
2.1 Evaluation model for Android malware classification using co-occurrence matrix
2.2 Output matrix with different size
2.3 Top (10) malware families in Drebin dataset
2.4 CNN having multi-convolutional networks
2.5 The process of research and experiment using Apriori
2.6 Apply the Apriori algorithm to the feature set
2.7 Architecture of CNN model used in the experiment with Apriori
2.8 Learning method implementation results
2.9 Proposing a feature selection model
2.10 Top (20) family of malware with the most samples in the AMD dataset
2.11 Experimental model when applying feature selection algorithm
2.12 Experimental results when applying feature selection algorithm
3.1 System development and evaluation process using the DBN
3.2 Architectural diagram of DBN application in Android malware detection
3.3 The overall model of the training and classification of malware using the CNN model
3.4 Test rate according to the 10-fold
3.5 WDCNN model operation diagram
3.6 Structure and parameters of the WDCNN model
3.7 Top 20 malware family AMD and Drebin
3.8 Experimental model
3.9 Classification of malware depending on the number of labels
3.10 DEX file size by size in the Drebin dataset
3.11 Overall model using federated learning
3.12 Compare the results of the weighted aggregation methods
3.13 Classification results with influence factor
INTRODUCTION
Digital transformation and artificial intelligence are increasingly being adopted in
smart-device applications across diverse operating systems. This trend aligns with the
advancements of the fourth industrial revolution and can be observed in numerous
domains of social and economic activity. According to statistics from June 2023 [1],
Android dominated the mobile operating system market with a 70.79% share.
Furthermore, the Android operating system is used in a diverse range of smart devices,
including but not limited to mobile phones, televisions, watches, automobiles, vending
machines, and network routers. The rapid growth and variety of devices that use the
Android operating system (OS) have contributed to a significant increase in the number,
types, and forms of malware. According to the statistics in [2], a total of 3.36 million
malware samples were found on the Android OS market in 2021. This situation endangers
users of mobile operating systems, so solving the problem of malware detection is both
urgent and necessary. According to the DBLP database [3], 1,081 studies on this issue
were published from 2013 to 2022.
Two main approaches are commonly applied to detect Android malware: static
and dynamic analysis. Static analysis involves inspecting a program's executable file
structure, characteristics, and source code. Its advantage is that it does not require
the code to be executed (running a malware file on a real system is, of course,
dangerous). By examining the decompiled code, static analysis can determine the flows
and actions of the executable file and thus identify it as either malware or benign.
The disadvantage, however, is that sophisticated malware can include malicious runtime
behavior that goes undetected. Dynamic analysis, on the other hand, involves executing
potentially malicious code in a real or sandbox environment to monitor its behavior.
The sandbox environment helps analysts examine potential threats without putting the
system at risk of infection. Although dynamic analysis can detect threats that static
analysis might miss, it requires more time and resources, and it may not be able to
cover all possible execution paths of the malware. In summary, static analysis
is said to help find known threats and vulnerabilities. In contrast, dynamic analysis
is suitable for finding new types and uncovering threats not previously documented
(i.e., zero-day threats). For malware detection, dynamic analysis is recommended for
organizations that need a deeper understanding of malware behavior or impact and have
the necessary tools and expertise to perform it. For malware classification, static
analysis is more popular because it is more straightforward to implement. This
dissertation also uses static analysis as the main method for feature extraction
[4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14].
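A minimal sketch of static feature extraction is given below: it reads the requested permissions from a manifest without executing any application code. It assumes the AndroidManifest.xml has already been decoded from the binary form stored inside an APK into plain-text XML by a separate tool; the function name and the sample manifest are hypothetical and are not the extraction pipeline used in this dissertation.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in a manifest.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
    <uses-permission android:name="android.permission.SEND_SMS" />
    <uses-permission android:name="android.permission.INTERNET" />
    <application android:label="Example" />
</manifest>"""


def extract_permissions(manifest_xml):
    """Return the sorted, de-duplicated permissions requested in a manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"  # namespaced form of android:name
    names = {
        element.get(name_attr)
        for element in root.iter("uses-permission")
        if element.get(name_attr)
    }
    return sorted(names)


permissions = extract_permissions(SAMPLE_MANIFEST)
print(permissions)
```

Because only the manifest text is parsed, the sample is never run, which is the property that makes static analysis safe to apply to suspected malware.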
Malware classification assigns samples to specific malware families, including a
benign class. Signature-based and machine learning-based methods have usually been
used for this problem. Signature-based methods are traditional and widely used
[15, 16, 17]. They rely on matching the "signature" of known malware samples
against unknown ones. As mentioned in the previous pa