Web-scale outlier detection

Driesprong, F.T. (2015) Web-scale outlier detection. Master's Thesis / Essay, Computing Science.

Masterscriptie.pdf - Published Version


The growth of information in today’s society is clearly exponential. Classical approaches are not up to the task of processing these staggering amounts of data. Instead, we need highly parallel software running on tens, hundreds, or even thousands of servers. This research presents an introduction to outlier detection and its applications. An outlier is one or more observations that deviate quantitatively from the majority and may be the subject of further investigation. After comparing different approaches to outlier detection, a scalable implementation of the unsupervised Stochastic Outlier Selection algorithm is given. The Docker-based microservice architecture allows dynamic scaling according to current needs. The application stack consists of Apache Spark as the computational engine, Apache Kafka as the data store, and Apache Zookeeper to ensure high reliability. Using this stack, we empirically observe the expected quadratic time complexity of the algorithm. We explore the importance of matching the number of worker nodes to the underlying hardware. Finally, we discuss the effect of distributed data shuffles, which are sometimes necessary for synchronizing data between the worker nodes.
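As a rough illustration of where the quadratic cost mentioned in the abstract comes from, the following is a minimal single-machine sketch of Stochastic Outlier Selection in plain Python. It is not the thesis's Spark implementation. The function names, the use of squared Euclidean distance in the Gaussian affinity, and the bisection settings are assumptions made for this example. Each point binds to every other point with a probability derived from their distance, and a point's outlier probability is the probability that no other point binds to it, so the full pairwise matrix has to be computed.

```python
import math


def binding_row(sq_dists, i, perplexity, max_iter=100, tol=1e-5):
    """Binding probabilities from point i to every other point.

    The kernel precision `beta` is tuned by bisection so that the entropy
    of the row matches log(perplexity), where that is reachable.
    """
    others = [j for j in range(len(sq_dists)) if j != i]
    # Shifting by the smallest distance avoids underflow and leaves the
    # normalised probabilities unchanged.
    d_min = min(sq_dists[j] for j in others)
    target = math.log(perplexity)
    beta, lo, hi = 1.0, 0.0, math.inf
    row = [0.0] * len(sq_dists)
    for _ in range(max_iter):
        aff = {j: math.exp(-beta * (sq_dists[j] - d_min)) for j in others}
        total = sum(aff.values())
        for j in others:
            row[j] = aff[j] / total
        entropy = -sum(p * math.log(p) for p in row if p > 0.0)
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            # Row is too flat: sharpen the kernel.
            lo = beta
            beta = beta * 2.0 if hi == math.inf else (lo + hi) / 2.0
        else:
            # Row is too peaked: widen the kernel.
            hi = beta
            beta = (lo + hi) / 2.0
    return row


def sos_outlier_probabilities(points, perplexity):
    """Outlier probability per point; builds an n-by-n matrix, hence O(n^2)."""
    n = len(points)
    # Assumption: squared Euclidean distance as the dissimilarity in the kernel.
    sq = [[math.dist(p, q) ** 2 for q in points] for p in points]
    binding = [binding_row(sq[i], i, perplexity) for i in range(n)]
    # A point is an outlier if none of the other points binds to it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]
```

Calling `sos_outlier_probabilities(points, perplexity=3.0)` on a list of coordinate tuples returns one value between 0 and 1 per point, with larger values indicating more outlying points. In a distributed setting the pairwise step is the part that requires exchanging data between workers, which is consistent with the abstract's remarks on data shuffles.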

Item Type: Thesis (Master's Thesis / Essay)
Degree programme: Computing Science
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 15 Feb 2018 08:10
Last Modified: 15 Feb 2018 08:10
