School of Information Technologies


 

COMP5318 KNOWLEDGE DISCOVERY AND DATA MINING
Semester 1, 2012



News

3/3/2012
Welcome to COMP5318!



Course outline

This course offers comprehensive coverage of well-known Data Mining topics, including classification, clustering and association rules. A number of specific algorithms and techniques in each category will be discussed. Methods for feature selection, dimensionality reduction and performance evaluation will also be covered. Students will learn and work with appropriate software tools and packages in the laboratory, and will be exposed to relevant Data Mining research.

Teaching staff


Irena Koprinska - course coordinator, lecturer and tutor
Email: irena AT it.usyd.edu.au
Office: room 450, School of IT Building
Consultation time: Monday 5-6pm (before the lectures)


Tim O'Keefe - tutor
Email: tokeefe AT it.usyd.edu.au

Joshua Akehurst - tutor
Email: joshua.akehurst AT sydney.edu.au


Timetable

Activity | Day | Time | Venue
Lectures | Monday | 6-8pm | Architecture LT 1
Laboratory/Tutorial (start in Week 2) | Monday | 8-9pm | SIT labs 115, 116 and 117

Assessment overview
The assignment specifications will be available on the eLearning site. 

Ass1: Test (15%)
Due: w6, in class
Individual/Group: Individual
Notes: Held in the first hour of the lecture (6-7pm). Semi-open, like the exam: students are allowed 1 sheet of their own notes (A4-size, double-sided, handwritten or typed). The test will cover the material on Clustering.
Late submission policy: It is not possible to re-sit the test.

Ass2: Data analysis (20%)
Due: w10, Friday, 5pm
Individual/Group: Individual or in pairs (groups of more than 2 people are not allowed)
Notes: Submission: 1) hard copy in the locker labelled COMP5318 located in the School of IT Building, level 1, in the postgraduate labs wing and 2) electronically via eLearning
Late submission policy:
- A penalty of 1 mark for each day after the deadline
- The maximum delay is 7 days; after that assignments will not be accepted

Ass3: Research paper presentation (15%)
Due: w12 and 13, in class
Individual/Group: Group
Notes: See the final schedule.
Late submission policy:
- No late presentations are allowed; a student who is unable to present on the specified date will receive 0 marks for this assessment

Written exam (50%)
Due: examination period
Individual/Group: Individual
Notes: The exam will be semi-open. You are allowed 1 sheet of your own notes (handwritten or typed, double-sided, A4-size) and a non-programmable calculator (although you will not need a calculator). No other material is allowed (no book, no additional notes). The exam will cover all material except Clustering.

In order to pass the course, the School requires at least 40% in the written exam, at least 40% in the other assessment components together and an overall final mark of 50 or more. This means that students who score less than 40% in the exam will fail the course regardless of their marks during the semester.
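The pass rule above can be sketched as a short check. This is only an illustration, not an official calculator: the function name `passes_course` and the dictionary keys are hypothetical, the weights are those listed in the assessment overview, and each mark is assumed to be already scaled to its weight (e.g. the exam mark is out of 50).

```python
# Assessment weights from the assessment overview (they sum to 100).
WEIGHTS = {"ass1": 15, "ass2": 20, "ass3": 15, "exam": 50}


def passes_course(marks):
    """Return True if the marks satisfy all three pass conditions.

    `marks` maps each component name to the mark earned on its weighted
    scale (hypothetical input format, assumed for this sketch).
    """
    exam = marks["exam"]
    other = marks["ass1"] + marks["ass2"] + marks["ass3"]
    other_max = WEIGHTS["ass1"] + WEIGHTS["ass2"] + WEIGHTS["ass3"]
    total = exam + other

    # Integer arithmetic avoids floating-point rounding at the 40% boundary.
    exam_ok = exam * 100 >= 40 * WEIGHTS["exam"]    # at least 40% in the exam
    other_ok = other * 100 >= 40 * other_max        # at least 40% in the rest
    overall_ok = total >= 50                        # overall mark of 50 or more
    return exam_ok and other_ok and overall_ok
```

For example, a student with full marks in all three assignments (50 out of 50) but only 19 out of 50 in the exam has an overall mark of 69 and still fails, because the exam mark is below 40%.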

Academic honesty: Please read the University Policy on Academic Honesty and submit the appropriate signed cover sheet with each of your assignments. The cover sheets are available from the link above.

Special considerations: If you have a condition requiring special consideration, you must: 1) submit a form within 1 week of the date the assessment was due, 2) include your e-mail address, phone number and the name of your tutor, and 3) e-mail your lecturer to say that you have submitted a special consideration form. For more information please read the Policy on special consideration due to illness or misadventure; you can also download the form from there.


Syllabus

 The teaching materials (lecture notes, lab notes and lab solutions) will be available on the eLearning site.

Week Date Topic
1 5 March Admin matters. Introduction to Data Mining (DM); challenges, origins, DM vs Machine Learning and Knowledge Discovery in Databases; DM tasks.
 
Data: types, cleaning (noise, missing values), pre-processing (aggregation, feature selection, discretization and binarization, normalization), similarity measures.
2 12 March Clustering 1:
Introduction to clustering. Partitional algorithms: k-means, bisecting k-means. Hierarchical algorithms: single, complete and average link; Ward’s method.
3 19 March Clustering 2:
Fuzzy clustering – c-means algorithm. Self-organising maps (SOM).
4 26 March Clustering 3:
Density-based clustering – DBSCAN algorithm.
Evaluating clustering results: unsupervised and supervised measures; determining the number of clusters, evaluating the clustering tendency.
5 2 April Classification 1:
Introduction. Nearest-neighbour algorithm. Rule-based classifiers: 1R and PRISM.
  9 April Mid-semester break
6 16 April Ass1: Test

Classification 2:

Evaluating classifiers: performance measures; evaluation procedures: single holdout, cross validation, bootstrapping. Comparing two classifiers - statistical significance testing.
7 23 April Classification 3:
Bayesian classifiers: Naïve Bayes and Bayesian networks. 
8 30 April Classification 4:
Decision trees: building decision trees, information gain, decision boundary, overfitting and pruning.
9 7 May Feature subset selection: CFS, Relief, Wrapper-based approaches.
Dimensionality reduction - PCA and SVD.
10 14 May Classification 5:
Linear regression.
Support vector machines (SVM); maximum margin hyperplane, finding a maximum margin hyperplane as an optimisation problem, linear SVM with hard and soft margin, nonlinear SVM (kernel trick and Mercer’s theorem).

Ass2: Data analysis due Friday 5pm
11 21 May Association rules 1: Introduction. Mining frequent items. Apriori algorithm.
Association rules 2: Sequential pattern analysis. GSP and PrefixSpan algorithms.

Ass2: Data analysis due Wednesday 5pm

12 28 May Ass3: Student presentations of research papers.
13 4 June Ass3: Student presentations of research papers.

Resources

Textbook

Introduction to Data Mining
Pang-Ning Tan, Michael Steinbach, Vipin Kumar,
Pearson Education (Addison Wesley), 0-321-32136-7, 2006

Chapters 4, 6 and 8 are freely available here and from the publisher.

Recommended book

Data mining - practical machine learning tools and techniques with Java implementations, 3rd edition
Ian H. Witten, Eibe Frank and M. Hall
Morgan Kaufmann, 2011, ISBN: 978-0-12-374856-0

Machine Learning view of Data Mining. Very readable. The book of the WEKA software. You can also use the previous edition of the book (2nd edition).



Other recommended books

Data Mining: Introductory and Advanced Topics
Margaret Dunham, Prentice Hall, 0-13-088892-3, 2003

Good coverage of the topics included in the course. Very readable. Pseudocode and computational complexity are covered.

Data Mining Concepts and Techniques
 J. Han and M. Kamber
Morgan Kaufmann, 2006, ISBN 1-55860-901-6 

Database view of Data Mining.

Principles of Data Mining
D. Hand, H. Mannila, P. Smyth
MIT Press, 2001, ISBN: 0-262-08290-X

Statistical view of Data Mining. Advanced, requires good statistical knowledge.


Tan and Witten are placed in the library Reserve collection (2 Hour Loan collection) and are also available in the Co-op Bookshop.


Last modified: 12 May 2012