Intrusion detection is the process of identifying malicious activity in network traffic. When a cyberattack is detected, countermeasures can be taken to limit the damage it causes, which makes intrusion detection valuable for network security. One way to detect malicious traffic is to use a machine learning algorithm. Supervised machine learning algorithms are commonly used, but they require labeled data, which are often not available. Unsupervised machine learning algorithms provide an alternative that does not require labeled data, and they are the focus of this thesis. Another challenge is that network traffic data often contain many features. Reducing the number of features can speed up the algorithms applied to the data, and removing redundant features may also improve their accuracy. The process of selecting a subset of features is known as feature selection, and it is also explored in this work.
This thesis compares the unsupervised algorithms K-means, Mini Batch K-means, Gaussian Mixture Model, DBSCAN, BIRCH, Isolation Forest, and One-Class Support Vector Machine on the intrusion detection dataset UNSW-NB15. The algorithms are evaluated in two separate experiments designed to measure their clustering and classification ability. For comparison, three supervised algorithms are included in the experiments: K-Nearest Neighbors, Random Forest, and Support Vector Machine. The experiments are performed both with all features and with a feature subset selected through feature selection with a Genetic Algorithm. Among the unsupervised algorithms, Gaussian Mixture Model performs best for clustering, while BIRCH and Mini Batch K-means perform best for classification. The supervised algorithms outperform the unsupervised ones in all experiments. Additionally, feature selection is found to improve the performance of the unsupervised algorithms.
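As a rough illustration of feature selection with a genetic algorithm, the sketch below evolves bit masks over a small feature set. It is not the procedure used in the thesis. The synthetic data (standing in for UNSW-NB15), the nearest-centroid fitness function, the per-feature penalty of 0.01, and all names and parameter values (`make_data`, `fitness`, `select_features`, population size, mutation rate) are assumptions made only for this example.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic two-class data (assumption): features 0-1 are informative, 2-5 are noise."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        informative = [rng.gauss(3.0 * y, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
        rows.append(informative + noise)
        labels.append(y)
    return rows, labels


def fitness(mask, rows, labels):
    """Hypothetical fitness: nearest-centroid accuracy on the selected
    features, minus a small penalty for every feature kept."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0
    centroids = {}
    for c in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[j] for r in members) / len(members) for j in idx]
    correct = 0
    for r, y in zip(rows, labels):
        dists = {
            c: sum((r[j] - m) ** 2 for j, m in zip(idx, centroids[c]))
            for c in centroids
        }
        pred = min(dists, key=dists.get)
        correct += (pred == y)
    return correct / len(rows) - 0.01 * len(idx)


def select_features(rows, labels, n_features, pop_size=20,
                    generations=15, mutation_rate=0.1, seed=1):
    """Minimal genetic algorithm over feature masks (lists of 0/1)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, rows, labels),
                        reverse=True)
        # Truncation selection: the better half survives unchanged.
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, rows, labels))


rows, labels = make_data()
best = select_features(rows, labels, n_features=6)
print("selected mask:", best)
print("fitness:", round(fitness(best, rows, labels), 3))
```

In a setting closer to the one described above, the fitness function would instead score a clustering or classification algorithm trained on the candidate feature subset. The per-feature penalty is one simple way to favor smaller subsets.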