Endgame Malware BEnchmark for Research (EMBER) Dataset

Description

EMBER is an open source dataset, published in 2018, of features extracted from Windows Portable Executable (PE) files.  PE is the file format used for Windows executables, object code, and DLLs.  A PE file contains all of the information the Windows operating system needs to load, parse, and execute the code within it.  It fills the same role as the Executable and Linkable Format (ELF) on Linux or the Mach-O format on macOS and iOS.

The EMBER dataset is split into training and test samples as follows:

Training Samples (900k)
300k Malicious
300k Benign
300k Unlabeled
Test Samples (200k)
100k Malicious
100k Benign
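
The allocation above can be checked by tallying labels over the samples.  The following is a minimal sketch, assuming each sample is stored as one JSON object per line with a `label` field that uses 1 for malicious, 0 for benign, and -1 for unlabeled.  The field name, the label encoding, and the `count_labels` helper are assumptions made for illustration and should be verified against the files in use.

```python
import io
import json
from collections import Counter

# Assumed label encoding; verify against the dataset files you actually use.
LABEL_NAMES = {1: "malicious", 0: "benign", -1: "unlabeled"}


def count_labels(lines):
    """Tally samples per class from an iterable of JSON-lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[LABEL_NAMES.get(record.get("label"), "unknown")] += 1
    return counts


# Tiny in-memory stand-in for a feature file (made-up records).
sample_file = io.StringIO(
    '{"sha256": "aa", "label": 1}\n'
    '{"sha256": "bb", "label": 0}\n'
    '{"sha256": "cc", "label": -1}\n'
    '{"sha256": "dd", "label": 1}\n'
)
label_counts = count_labels(sample_file)
print(dict(label_counts))
```

Passing an open file handle instead of the in-memory sample would tally a real file line by line without loading it all into memory.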

Features

The EMBER dataset provides both raw features and vectorized features.  The raw features are extracted directly from the PE files, while the vectorized features are derived from the raw features.  The features fall into two groups: parsed features and format-agnostic features.

Parsed Features

General file information covers the file size and basic details from the PE header (e.g., virtual size, number of imported and exported functions, number of symbols, and whether the file has a debug section, thread local storage, resources, relocations, or a signature).

Header information is extracted from the COFF header of the PE file (e.g., timestamp, target machine, list of image characteristics).

Imported functions are parsed to list the functions that the PE file imports.

Exported functions are likewise parsed from the PE file and added to the dataset.

Section information is extracted for each section and includes the name, size, entropy, virtual size, and a list of strings describing the section's characteristics.

Format-agnostic features

Byte Histogram:  Counts the occurrences of each of the 256 possible byte values in the binary, giving a 256-bin histogram.
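
As a minimal sketch of the byte histogram, the following counts each byte value in a buffer.  The `byte_histogram` helper and the sample bytes are illustrative and are not taken from the EMBER code.

```python
from collections import Counter
from typing import List


def byte_histogram(data: bytes) -> List[int]:
    """Return 256 counts, one per possible byte value (0 through 255)."""
    counts = Counter(data)  # iterating over bytes yields integers
    return [counts.get(value, 0) for value in range(256)]


# First eight bytes of a typical DOS header, used here only as sample input.
sample = b"MZ\x90\x00\x03\x00\x00\x00"
hist = byte_histogram(sample)
print(hist[0x00], hist[0x4D], hist[0x5A])
```

Dividing each count by the file length would turn the counts into a distribution, which makes files of different sizes comparable.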

Byte-Entropy Histogram:  A two-dimensional histogram that approximates the joint distribution p(H, X) of entropy H and byte value X.
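
One way to approximate such a joint distribution is to slide a window over the file, compute the entropy of each window, and pair that entropy with every byte value in the window.  The sketch below is a simplified illustration and not the EMBER implementation.  The 2048-byte window, 1024-byte step, and 16 x 16 binning are assumptions chosen for the example.

```python
import math
from typing import List


def window_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window in bits per byte (0.0 to 8.0)."""
    counts = [0] * 256
    for b in window:
        counts[b] += 1
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def byte_entropy_histogram(data: bytes, window: int = 2048,
                           step: int = 1024) -> List[List[int]]:
    """Return a 16 x 16 grid: rows are entropy bins, columns are byte-value bins."""
    hist = [[0] * 16 for _ in range(16)]
    if not data:
        return hist
    last_start = max(len(data) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = data[start:start + window]
        # Map entropy in [0, 8] onto 16 bins; 8.0 itself falls in the top bin.
        h_bin = min(int(window_entropy(chunk) * 2), 15)
        for b in chunk:
            hist[h_bin][b >> 4] += 1  # coarsen the byte value to 16 bins
    return hist


low = byte_entropy_histogram(bytes(2048))             # all zero bytes
high = byte_entropy_histogram(bytes(range(256)) * 8)  # uniform byte values
print(low[0][0], high[15][0])
```

Constant data lands entirely in the lowest entropy row, while uniformly distributed bytes land in the highest row, which is the contrast that helps separate padding from packed or encrypted regions.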

String Information:  Simple statistics about the printable strings in the file.  Specifically, the following are reported: the number of strings, their average length, a histogram of the printable characters, and the entropy of the characters across all printable strings.
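
The string statistics can be sketched as follows.  This example assumes a printable string is a run of at least five consecutive characters in the range 0x20 to 0x7e.  That threshold, the character range, and the `string_stats` helper are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Assumed definition: five or more consecutive printable ASCII characters.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{5,}")


def string_stats(data: bytes) -> dict:
    """Compute simple statistics over the printable strings in a byte buffer."""
    strings = PRINTABLE_RUN.findall(data)
    chars = b"".join(strings)
    counts = Counter(chars)
    total = len(chars)
    entropy = (
        -sum((c / total) * math.log2(c / total) for c in counts.values())
        if total else 0.0
    )
    return {
        "num_strings": len(strings),
        "avg_length": total / len(strings) if strings else 0.0,
        # One bin per printable character from 0x20 through 0x7e (95 bins).
        "char_histogram": [counts.get(v, 0) for v in range(0x20, 0x7F)],
        "entropy": entropy,
    }


stats = string_stats(b"\x00\x01hello\x00hi\x00world!\x02")
print(stats["num_strings"], stats["avg_length"])
```

In the sample input, "hi" is ignored because it is shorter than the assumed minimum length, so only "hello" and "world!" contribute to the statistics.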

Advantages

This dataset includes both benign and malicious samples, whereas prior datasets included only malicious samples.  This matters because a classifier trained only on malicious samples has no examples of benign files to learn from, which makes training exceedingly difficult and prone to a high false positive rate.

Disadvantages

The EMBER dataset is a features-only dataset: it does not include the raw binaries, which limits both the extraction of new features and experiments with featureless deep-learning algorithms.

The original paper describing this data set can be found at: https://arxiv.org/abs/1804.04637