Kaggle Malicious URLs Security Dataset

Description

The malicious URLs dataset is a set of about 650 thousand URLs: 430 thousand benign URLs, 96 thousand defacement URLs, 94 thousand phishing URLs, and 32 thousand malware URLs. It is designed to help researchers train machine learning algorithms to detect and prevent data exfiltration or attacks carried out via malicious URLs. The dataset is simple: each record holds only the raw URL and its type, where type is one of benign, defacement, phishing, or malware.
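
As a quick sanity check, the class balance described above can be recomputed from the file itself. The following is a minimal sketch, assuming the data is a CSV with a header row naming the two columns `url` and `type`; those column names and the sample rows are illustrative assumptions, not taken from the dataset documentation.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count URLs per label in a CSV that has `url` and `type` columns (assumed names)."""
    reader = csv.DictReader(csv_file)
    # Normalise case and whitespace so "Malware" and "malware " count as one class.
    return Counter(row["type"].strip().lower() for row in reader)


# Tiny in-memory stand-in for the real file; these rows are made up for illustration.
sample = io.StringIO(
    "url,type\n"
    "example.com/index.html,benign\n"
    "example.org/login.php?id=1,phishing\n"
    "example.net/a.exe,malware\n"
    "example.com/about,benign\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

With the real data, the same function could be given an open file handle in place of the in-memory sample.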

Advantages

This dataset contains raw data, so researchers can extract or construct any features they choose rather than being constrained by a precomputed feature set.
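
To illustrate what extracting features from a raw URL might look like, here is a small sketch using only the Python standard library. The function name `extract_features` and the particular features chosen are hypothetical examples, not a recommended or author-endorsed feature set.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "ftp://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def extract_features(url):
    """Derive a few simple lexical features from one raw URL string."""
    # Assumption: some raw URLs lack a scheme. Prefixing "//" makes urlsplit
    # treat the leading component as the host instead of as part of the path.
    to_parse = url if _SCHEME_RE.match(url) else "//" + url
    try:
        parts = urlsplit(to_parse)
        host = parts.hostname or ""
        scheme, query = parts.scheme, parts.query
    except ValueError:
        # Malformed input (e.g. unbalanced IPv6 brackets): fall back to empty parts.
        host, scheme, query = "", "", ""

    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "host_is_ip": host_is_ip,
        "uses_https": scheme == "https",
        "num_query_params": len(query.split("&")) if query else 0,
    }


# Made-up example URLs, not rows from the dataset.
ip_features = extract_features("http://192.168.0.1/login.php?user=a&pw=b")
plain_features = extract_features("example.com/index.html")
print(ip_features)
print(plain_features)
```

Because the features are computed from the string alone, they can be swapped or extended freely without re-collecting any data.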

Disadvantages

This dataset was built by joining several source datasets into a single one. This introduces risk, because the sources may differ in quality and in how they were labeled. Not all of the source data may have been held to the same standard, so the burden of verification falls on the consumer of this dataset, who must evaluate the individual sources.
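
One inexpensive check a consumer might run is to look for URLs that appear more than once with different labels, which can happen when merged sources disagree. The sketch below assumes the records are available as `(url, type)` pairs; the helper name `find_label_conflicts` and the sample rows are hypothetical. It only surfaces disagreements and does not determine which label is correct.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return {url: set_of_labels} for every URL that carries more than one label."""
    labels_by_url = defaultdict(set)
    for url, label in rows:
        # Normalise whitespace and case so trivial differences are not flagged.
        labels_by_url[url.strip()].add(label.strip().lower())
    return {url: labels for url, labels in labels_by_url.items() if len(labels) > 1}


# Made-up example rows, not taken from the dataset.
rows = [
    ("example.com/index.html", "benign"),
    ("example.org/login.php", "phishing"),
    ("example.org/login.php", "benign"),   # same URL, conflicting label
    ("example.com/index.html", "benign"),  # exact duplicate, not a conflict
]
conflicts = find_label_conflicts(rows)
print(conflicts)
```

Any conflicts found would still need to be resolved by going back to the individual sources.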