Outlier Detection

Code and Datasets

A scaffolding of the implementation is provided here. Three datasets of varying size to test the implementations on are also provided.

Q1: Simple Nested Loop Algorithm

Definition: Outliers are the top n data elements whose average distance to the k nearest neighbours is greatest.

Implement an algorithm which uses the above definition to detect outliers in a dataset.

Q2: Nested Loop with Randomization and Pruning

Implement a variation of the above algorithm that makes use of pruning and randomization to reduce the computational complexity.
  1. randomize the data X
  2. compute the outlier score for N data points
  3. for the remaining points
    1. iteratively compute the k nearest neighbours
    2. if the average distance is smaller then the weakest outlier score the point is a non-outlier
    3. update the top outliers and the weakest outlier score