
CNN for malware analysis

  • James
  • 31 Dec 2025
  • 3 min read

The idea is simple but extremely interesting. CNNs, or convolutional neural networks, are powerful machine learning algorithms that marked a leap forward in the task of classifying objects in images. So let's try to apply them beyond that task: what happens if we somehow represent an executable as an image and then use these algorithms to classify it?


If you follow along, what we build is a rudimentary, primitive AI classifier for executables. Let's see whether this simple experiment already reveals characteristics that differentiate malware from a legitimate program.


My aim is to work alongside you on a project that applies machine learning algorithms to cybersecurity, to share an approach to experimentation and discovery that I hope you'll find inspiring, and to show you my own journey in building customized tools for cybersecurity.




Binary-to-Image Representation

Each Windows EXE file is treated as a raw byte stream and converted into a fixed-size visual representation. The binary content is read sequentially and mapped to an RGB image by grouping three consecutive bytes into a single pixel, preserving local byte adjacency. Images are stored in the lossless PNG format to avoid compression artifacts and ensure deterministic reconstruction. All images are resized to a fixed spatial resolution using nearest-neighbor interpolation to preserve the discrete structure of the original binary data.

This representation enables the use of convolutional neural networks while retaining low-level structural information of the executable.
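The conversion described above can be sketched as follows. This is a minimal illustration, assuming NumPy and Pillow; the square grid layout and the 256×256 output resolution are my own choices, since the article does not state the exact dimensions.

```python
# Sketch of the binary-to-image step: 3 consecutive bytes -> 1 RGB pixel,
# nearest-neighbor resize to a fixed resolution, lossless PNG output.
import numpy as np
from PIL import Image

def exe_to_image(path: str, size: int = 256) -> Image.Image:
    data = np.fromfile(path, dtype=np.uint8)
    # Pad so the byte count is divisible by 3 (one RGB pixel per 3 bytes).
    pad = (-len(data)) % 3
    data = np.pad(data, (0, pad))
    pixels = data.reshape(-1, 3)
    # Arrange pixels into a roughly square grid, zero-padding the tail.
    side = int(np.ceil(np.sqrt(len(pixels))))
    grid = np.zeros((side * side, 3), dtype=np.uint8)
    grid[: len(pixels)] = pixels
    img = Image.fromarray(grid.reshape(side, side, 3), mode="RGB")
    # Nearest-neighbor resize preserves the discrete byte structure.
    return img.resize((size, size), resample=Image.NEAREST)
```

In use, `exe_to_image("sample.exe").save("sample.png")` would produce the lossless PNG representation.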





Feature Extraction via Convolutional Neural Networks

A convolutional neural network (ResNet-18) is employed as a feature extractor. The final classification layer is removed, and the network outputs a fixed-length embedding vector for each executable. The CNN is not trained for malware classification; instead, it is used to learn a generic representation of structural patterns present in executable binaries.

Each executable is thus mapped to a high-dimensional latent vector, defining a continuous embedding space in which similarity reflects structural resemblance rather than explicit semantic labels.


Dimensionality Reduction and Visualization

To analyze the geometric structure of the learned representations, dimensionality reduction is applied. For small datasets, Principal Component Analysis (PCA) is used directly to project embeddings into a three-dimensional space, ensuring numerical stability. For larger datasets, PCA is first used for noise reduction, followed by Uniform Manifold Approximation and Projection (UMAP) to capture non-linear manifold structure.

This adaptive strategy allows consistent visualization across different dataset sizes while avoiding instability in low-sample regimes.
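The adaptive strategy can be sketched like this, assuming scikit-learn and umap-learn; the sample-count threshold of 50 and the 50-component PCA denoising step are illustrative choices, not values stated in the article.

```python
# Sketch: plain PCA for small datasets, PCA denoising + UMAP for larger ones.
import numpy as np
from sklearn.decomposition import PCA

def reduce_to_3d(embeddings: np.ndarray, small_threshold: int = 50) -> np.ndarray:
    n = len(embeddings)
    if n <= small_threshold:
        # Low-sample regime: deterministic, numerically stable projection.
        return PCA(n_components=min(3, n)).fit_transform(embeddings)
    # Larger dataset: PCA reduces noise, UMAP captures non-linear structure.
    import umap
    denoised = PCA(n_components=min(50, n - 1)).fit_transform(embeddings)
    return umap.UMAP(n_components=3).fit_transform(denoised)
```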


Exploratory Analysis of Benign and Malicious Samples

The resulting low-dimensional embeddings are visualized in a three-dimensional space. Benign executables typically form a compact, high-density region, while malicious samples tend to occupy peripheral areas of the embedding space. This separation emerges without explicit supervision and reflects structural deviations from common benign software rather than behavioral intent.

The approach is therefore framed as exploratory representation learning and anomaly analysis, rather than supervised malware classification.
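One simple way to operationalize this geometry is to score each sample by its distance from the benign cluster. The centroid-and-spread statistic below is my own illustrative choice; the article only describes the qualitative separation.

```python
# Sketch: anomaly score = distance to the benign centroid, normalized by
# the typical spread of benign samples in embedding space.
import numpy as np

def anomaly_scores(benign: np.ndarray, samples: np.ndarray) -> np.ndarray:
    centroid = benign.mean(axis=0)
    spread = np.linalg.norm(benign - centroid, axis=1).mean()
    return np.linalg.norm(samples - centroid, axis=1) / (spread + 1e-12)
```

Samples far from the compact benign region receive high scores, matching the peripheral placement of malicious samples observed in the visualization.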


The observed geometric organization suggests that convolutional representations learned from binary visualizations capture meaningful structural properties of executable files. While the separation between benign and malicious samples may partially reflect factors such as packing, entropy, or compilation artifacts, these characteristics are nonetheless relevant for anomaly detection and triage in malware analysis workflows.

Importantly, the method does not assume prior knowledge of malware families or behaviors and is robust to limited availability of labeled malicious samples.







Future Work

Future work will focus on strengthening the learned representations by adopting self-supervised training directly on executable-derived images, enabling more domain-specific embeddings without reliance on labels. The framework can be extended by integrating complementary views of binaries, such as entropy-based or section-aware representations, to improve interpretability. Additionally, the embedding space naturally supports quantitative anomaly scoring and large-scale analysis, opening the way to automated triage and longitudinal studies of malware evolution.





 
 