Outlier Detection & General Questions

Code and Datasets

Scaffolding for the implementation is provided here, along with three datasets of varying size for testing your implementations.

Q1: Simple Nested Loop Algorithm

Definition: The outliers are the n data points whose average distance to their k nearest neighbours is greatest.

Implement an algorithm which uses the above definition to detect outliers in a dataset.
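
As a starting point, the definition above can be sketched as a plain nested loop. This is a minimal illustration rather than the provided scaffolding: the function names knn_outlier_scores and top_n_outliers are hypothetical, Euclidean distance is an assumption (the definition does not name a metric), and the dataset is assumed to contain more than k points.

```python
import math


def knn_outlier_scores(X, k):
    """Return, for every point, the average distance to its k nearest neighbours.

    Assumes Euclidean distance and len(X) > k.
    """
    scores = []
    for i in range(len(X)):              # outer loop: point being scored
        dists = []
        for j in range(len(X)):          # inner loop: every other point
            if i != j:
                dists.append(math.dist(X[i], X[j]))
        dists.sort()
        scores.append(sum(dists[:k]) / k)
    return scores


def top_n_outliers(X, k, n):
    """Return the n (index, score) pairs with the largest scores, strongest first."""
    scores = knn_outlier_scores(X, k)
    ranked = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:n]]


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
    print(top_n_outliers(data, k=2, n=1))
```

Every point is compared with every other point, so the cost grows quadratically with the size of the dataset.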

Q2: Nested Loop with Randomization and Pruning

Implement a variation of the above algorithm that makes use of pruning and randomization to reduce the computational complexity.
  1. randomize the data X
  2. compute the outlier score for the first n data points to initialise the list of top outliers
  3. for the remaining points
    1. iteratively compute the k nearest neighbours
    2. if the average distance is smaller than the weakest outlier score, the point is a non-outlier
    3. otherwise, update the top outliers and the weakest outlier score
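
The steps above can be sketched as follows. This is one possible reading, not the provided scaffolding: the name pruned_top_n_outliers and the seed parameter are hypothetical, Euclidean distance is assumed, the dataset is assumed to contain more than k points, and the initial scoring in step 2 is taken to cover the first n shuffled points. Pruning is safe because the running average over the k closest neighbours found so far can only decrease as more points are examined, so once it drops below the weakest outlier score the point cannot enter the top n.

```python
import heapq
import math
import random


def pruned_top_n_outliers(X, k, n, seed=0):
    """Top-n outliers by average k-NN distance, with randomisation and pruning.

    Assumes Euclidean distance and len(X) > k.
    """
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)       # step 1: randomise the data

    top = []  # min-heap of (score, index); top[0] is the weakest outlier
    for i in order:
        # Until n points have been scored there is no cutoff (step 2).
        cutoff = top[0][0] if len(top) == n else -math.inf
        nearest = []  # negated distances: max-heap of the k closest so far
        pruned = False
        for j in order:                      # step 3.1: build the neighbours
            if j == i:
                continue
            d = math.dist(X[i], X[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Step 3.2: the running average is an upper bound on the score.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -sum(nearest) / k
        if len(top) < n:                     # step 3.3: update the top outliers
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
    return sorted(((i, s) for s, i in top), key=lambda t: t[1], reverse=True)


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
    print(pruned_top_n_outliers(data, k=2, n=1))
```

Shuffling matters because the inner loop is more likely to meet close neighbours early when the data is in random order, which lets most non-outliers be discarded after only a few distance computations.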

Q3: Questions

The following are some questions you should now be able to answer:
  1. What are the main uses of PCA?
  2. How can you decide the number of principal components to use?
  3. What is a frequent itemset, and how can it be used to reduce the computational complexity of the Apriori algorithm?
  4. What's the formula for Bayes' theorem and why is it useful?
  5. How does K-means work and what are its drawbacks?
  6. How does supervised classification work?
  7. How does the k-nearest-neighbour classifier differ from support vector machines?
  8. What is the conceptual difference between methods such as naive Bayes and logistic regression, and others such as SVM and kNN?
  9. What is the goal of outlier detection and why is it important?
  10. What are different types of methods for outlier detection? What are the advantages and disadvantages of these methods?