Outlier Detection & General Questions
Code and Datasets
Scaffolding code for the implementation is provided here.
Three datasets of varying size are also provided for testing your implementations.
Q1: Simple Nested Loop Algorithm
Definition: Outliers are the n data points whose
average distance to their k nearest neighbours is greatest.
Implement an algorithm which uses the above definition to detect
outliers in a dataset.
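As a starting point, the definition above can be sketched as a brute-force nested loop. This is a minimal illustration and not the provided scaffolding: the function names `knn_outlier_scores` and `top_n_outliers` are hypothetical, Euclidean distance is assumed, and the data is assumed to be a list of numeric tuples with more than k points.

```python
import heapq
import math


def knn_outlier_scores(X, k):
    """Average Euclidean distance from each point to its k nearest neighbours."""
    scores = []
    for i, x in enumerate(X):  # outer loop: the point being scored
        # Inner loop: distances from x to every other point.
        dists = [math.dist(x, y) for j, y in enumerate(X) if j != i]
        nearest = heapq.nsmallest(k, dists)
        scores.append(sum(nearest) / len(nearest))
    return scores


def top_n_outliers(X, k, n):
    """Return (score, index) pairs for the n points with the largest scores."""
    scores = knn_outlier_scores(X, k)
    return heapq.nlargest(n, ((s, i) for i, s in enumerate(scores)))
```

For example, `top_n_outliers(X, k=2, n=1)` on four corner points of a unit square plus one far-away point should rank the far-away point first. Every point is compared with every other point, so the cost grows quadratically with the dataset size.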
Q2: Nested Loop with Randomization and Pruning
Implement a variation of the above algorithm that makes use of pruning
and randomization to reduce the computational complexity.
- randomize the data X
- compute the outlier score for the first N data points
- for the remaining points
  - iteratively compute the k nearest neighbours
  - if the average distance is smaller than the weakest outlier score, the point is a non-outlier
  - update the top outliers and the weakest outlier score
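The steps above could be sketched as follows. This is one possible reading, not the reference solution: the name `top_n_outliers_pruned` and the `seed` parameter are hypothetical, Euclidean distance is assumed, and the sketch assumes that the points scored in full at the start are the first n, so that pruning begins once n candidate outliers are held. Pruning is safe because the running average over the k nearest neighbours found so far can only shrink as more points are examined.

```python
import heapq
import math
import random


def top_n_outliers_pruned(X, k, n, seed=0):
    """Top-n outliers by average k-NN distance, with randomization and pruning."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)  # randomize the processing order

    top = []      # min-heap of (score, index); top[0] is the weakest outlier
    cutoff = 0.0  # weakest outlier score, valid once n candidates are held

    for i in order:
        neighbours = []  # negated distances: a max-heap of the k smallest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(X[i], X[j])
            if len(neighbours) < k:
                heapq.heappush(neighbours, -d)
            elif d < -neighbours[0]:
                heapq.heapreplace(neighbours, -d)
            # The running average is an upper bound on the final score,
            # so a point that falls below the cutoff cannot be a top outlier.
            if len(neighbours) == k and len(top) == n:
                if -sum(neighbours) / k < cutoff:
                    pruned = True
                    break
        if pruned:
            continue
        score = -sum(neighbours) / len(neighbours)
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]  # update the weakest outlier score
    return sorted(top, reverse=True)
```

The worst case is still quadratic, but with a random order most non-outliers tend to meet close neighbours early and are discarded after a few distance computations.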
Q3: Questions
The following are some questions you should now be able to answer:
- What are the main uses of PCA?
- How can you decide the number of principal components to use?
- What is a frequent itemset and how can it be used to reduce the computational complexity of the Apriori algorithm?
- What's the formula for Bayes' theorem and why is it useful?
- How does K-means work and what are its drawbacks?
- How does supervised classification work?
- How does a k-nearest-neighbour classifier differ from a support vector machine?
- What is the conceptual difference between methods such as naive Bayes and logistic regression and others like SVM and kNN?
- What is the goal of outlier detection and why is it important?
- What are different types of methods for outlier detection? What are the advantages and disadvantages of these methods?