Going deep into the cat and the mouse game: deep learning

Gibert Llauradó, Daniel

Supervised by:

Jordi Planes Cid (Supervisor)
Carlos Mateu Piñol (Co-supervisor)

University of defense: Universitat de Lleida

Date of defense: 15 December 2020

Examination committee:

César Fernández Camón (Chair)
António Morgado (Secretary)
Eva Armengol Voltas (Member)

Type: Thesis

Teseo: 645277

Abstract

The fight against malware has never stopped since the dawn of computing. This fight has turned out to be a never-ending and cyclical arms race: as security analysts and researchers improve their defenses, malware developers continue to innovate, find new infection vectors and enhance their obfuscation techniques. Lately, due to the massive growth of malware streams, new methods have to be devised to complement traditional detection approaches and keep pace with new attacks and variants. The aim of this thesis is the design, implementation, and evaluation of machine learning approaches for the task of malware detection and classification, chosen for their ability to handle large volumes of data and to generalize to never-before-seen malware. This thesis is structured into four main parts. The first part provides a systematic and detailed overview of machine learning techniques to tackle the problem of malware detection and classification. This dissertation presents the following contributions that extend and complement previous work: (1) it provides a complete description of the methods and features in a traditional machine learning workflow for malware detection and classification; (2) it explores the challenges and limitations of traditional machine learning; (3) it analyzes recent trends and developments in the field with special emphasis on deep learning approaches; (4) it presents the research issues and unsolved challenges of the state-of-the-art techniques; and (5) it discusses new directions of research. The second part is devoted to automating the feature engineering process through deep learning. Traditional machine learning approaches in the literature rely on the manual extraction of hand-crafted features defined by experts.
However, these solutions depend almost entirely on the ability of the domain experts to extract characterizing features that accurately represent malware, and for some types of features, such as n-gram features, extraction becomes very time-consuming and memory-intensive. Deep learning replaces the feature engineering process with an underlying system, typically a neural network with multiple layers, that performs both feature learning and classification. With deep learning one can start with raw data, as features will be automatically created by the network during the training procedure. This is achieved by stacking one or more convolutional layers, where the first ones learn to extract n-gram-like features from the hexadecimal representation of the malware's binary content and from its assembly language source code. The third part of this thesis is devoted to investigating mechanisms to combine multiple modalities of information to increase the robustness of deep learning classifiers. Modalities are, essentially, channels of information. Data from multiple sources are semantically correlated and sometimes provide complementary information to each other, thus reflecting patterns that are not visible when working with individual modalities on their own. Consequently, by taking only the raw bytes or opcodes as input, a great deal of useful information for classification is overlooked. Subsequently, this thesis investigates how to combine various sources of information in deep learning architectures using an intermediate fusion strategy, and it presents a wide and deep learning framework, named HYDRA, that combines the benefits of feature engineering and deep learning.
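The link between hand-crafted n-gram features and convolutional filters mentioned above can be illustrated with a minimal sketch. It is not the architecture used in the thesis: the function names (`byte_ngrams`, `conv1d_ngram_filter`, `global_max_pool`) are hypothetical, the bytes are assumed to be one-hot encoded, and the filter weights are fixed by hand rather than learned. Under those assumptions, a filter of width n responds most strongly exactly where one particular byte n-gram occurs, which is the sense in which the first convolutional layers can act as n-gram detectors.

```python
from collections import Counter
from typing import List


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Hand-crafted feature: count every contiguous n-byte window."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def conv1d_ngram_filter(data: bytes, pattern: bytes) -> List[int]:
    """Slide a filter of width len(pattern) over one-hot encoded bytes.

    With one-hot inputs and filter weights equal to the one-hot encoding of
    `pattern`, the dot product at each position is the number of matching
    bytes, so it reaches len(pattern) only on an exact n-gram match.
    (Assumption: a real network would learn these weights from data.)
    """
    n = len(pattern)
    return [
        sum(1 for a, b in zip(data[i:i + n], pattern) if a == b)
        for i in range(len(data) - n + 1)
    ]


def global_max_pool(activations: List[int]) -> int:
    """Keep the strongest response, i.e. 'does this n-gram appear anywhere?'."""
    return max(activations, default=0)
```

For example, on the toy byte string `bytes.fromhex("558bec83ec08558bec")`, the 3-gram `55 8b ec` is counted twice by `byte_ngrams`, and the matching filter reaches its maximum response of 3 at the same two positions.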
The fourth part of this dissertation discusses the main issues and challenges faced by security researchers, such as the availability of public benchmarks for malware research and the problems of class imbalance, concept drift and adversarial learning. To this end, it provides an extensive evaluation of deep learning approaches for malware classification against common metamorphic techniques, and it explores the use of these techniques to augment the training set and reduce class imbalance. The metamorphic techniques analyzed are the following: (1) dead code insertion, (2) register reassignment, (3) subroutine reordering and (4) code reordering through jumps.
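As an illustration of the first metamorphic technique listed above, the following minimal sketch inserts dead code into a toy instruction listing. It is not the tooling used in the thesis: the function names, the `rate` and `seed` parameters, and the choice of filler instructions are assumptions made for the example, and the fillers are assumed to be semantic no-ops on 32-bit x86. Each call produces a variant with the same behavior but a different instruction sequence, which is how such a transformation could be used either to stress a classifier or to generate extra samples for an under-represented class.

```python
import random
from typing import List

# Filler instructions assumed to have no effect on 32-bit x86 (illustrative choice).
DEAD_CODE = ("nop", "mov edi, edi")


def insert_dead_code(instructions: List[str], rate: float = 0.3, seed: int = 0) -> List[str]:
    """Return a variant with a filler instruction inserted before some instructions.

    `rate` is the probability of inserting a filler before each instruction;
    `seed` makes the variant reproducible.
    """
    rng = random.Random(seed)
    variant = []
    for instruction in instructions:
        if rng.random() < rate:
            variant.append(rng.choice(DEAD_CODE))
        variant.append(instruction)
    return variant


def strip_dead_code(instructions: List[str]) -> List[str]:
    """Remove the known fillers; recovers the original only if it contained none."""
    return [ins for ins in instructions if ins not in DEAD_CODE]
```

Calling `insert_dead_code(program, rate=0.5, seed=s)` with different seeds `s` yields different variants of the same toy program, and `strip_dead_code` confirms that only filler instructions were added.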