Outlier Detection
Code and Datasets
A scaffolding of the implementation is provided here.
Three datasets of varying size to test the implementations on are also provided.
Q1: Simple Nested Loop Algorithm
Definition: Outliers are the top n data elements whose
average distance to the k nearest neighbours is greatest.
Implement an algorithm which uses the above definition to detect
outliers in a dataset.
Q2: Nested Loop with Randomization and Pruning
Implement a variation of the above algorithm that makes use of pruning
and randomization to reduce the computational complexity.
- randomize the data X
- compute the outlier score for N data points
- for the remaining points
- iteratively compute the k nearest neighbours
- if the average distance is smaller then the weakest outlier score the point is a non-outlier
- update the top outliers and the weakest outlier score