UMBC Special Course: Foundations of Data Mining

Classes: Tuesday and Thursday, 1:30pm--2:45pm
Venue: UMBC Tech Center, Room 3.014
Instructor: Hillol Kargupta
Contact Address: hillol@cs.umbc.edu, (410) 455-3972 (Voice), (410) 455-3969 (Fax)
Office hours: Tuesday 3:00--4:00pm or by appointment

Teaching Assistant: Niyati Chhaya
E-mail: niyatic1@umbc.edu
Office hours: Thursday 4--5pm
Office room: ITE 334


GOALS

The field of data mining studies algorithms and systems that allow efficient discovery of patterns hidden in data, paying careful attention to data storage, computing, communication, and human-computer interaction issues. This course serves as an introduction to the field. A data mining process deals with (1) data access and pre-processing, (2) representation construction, (3) analysis, and (4) presentation of the patterns to the user(s). The course will provide a comprehensive introduction to each of these steps using practical applications, with emphasis on the foundations of data mining techniques. It will also give participants ample opportunity to learn about this growing research area and to scout for promising research topics.
 

PREREQUISITES

An undergraduate-level background in linear algebra, statistics, and data structures is expected. You may want to read the appendices (Counting, Sets, and Probability) of Introduction to Algorithms by Cormen, Leiserson, and Rivest. Students will need programming knowledge in C/C++ or Java.
 

TEXTBOOKS

Primary Text:

P.-N. Tan, M. Steinbach, and V. Kumar (2005). Introduction to Data Mining. Addison Wesley.

References:

J. Han and M. Kamber (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. ISBN: 1558604898.

Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations; Morgan Kaufmann Publishers; ISBN: 1-55860-552-5.

T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.

D. Hand, H. Mannila, and P. Smyth (2000). Principles of Data Mining. MIT Press.

S. M. Weiss and N. Indurkhya (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers. ISBN: 1-55860-403-0.

M. Berthold and D. Hand (1999). Intelligent Data Analysis. Springer.

J. Komorowski and J. Zytkow (eds.) (1997). Principles of Data Mining and Knowledge Discovery. Springer. ISBN: 3-540-63223-9.

References on Mathematical Statistics and Information Theory:

Mathematical Statistics: Basic Ideas and Selected Topics. P. J. Bickel and K. A. Doksum.

Elements of Information Theory. Thomas M. Cover, Joy A. Thomas.

An Introduction to Probability Theory and Its Applications, (Vol. 1 & Vol. 2). William Feller.

Web Resources:

Weka Software (Useful link: Introduction to Weka)

MATLAB Primer

Mathematica

Handbook on Engineering Statistics

Statistics Glossary

Kdnuggets.com

Numerical Recipes

JAMA package for Matrix-based Computation
 
 

INSTRUCTIONAL METHODOLOGY AND GRADING POLICY

Classroom lectures, projects, and student project presentations (if time permits).

Basis for evaluation and weighting:
Homework/Quizzes: 20%
Exams: 45% (Exam I, II, III)
Term project: 35%
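
As a rough illustration of how the weights above combine, here is a minimal sketch in Python. The function name, the dictionary keys, and the sample scores are hypothetical. The syllabus does not say how the three exams are weighted against each other, so the sketch takes a single combined exam percentage as input.

```python
# Component weights taken from the grading policy above.
WEIGHTS = {
    "homework_quizzes": 0.20,
    "exams": 0.45,          # Exam I, II, III combined (relative split not stated)
    "term_project": 0.35,
}


def course_score(component_scores):
    """Combine per-component percentages (0-100) into a weighted course score."""
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {"homework_quizzes": 90.0, "exams": 80.0, "term_project": 85.0}
print(round(course_score(example), 2))  # 0.20*90 + 0.45*80 + 0.35*85 = 83.75
```

This only shows the arithmetic; the mapping from a weighted score to a letter grade is not specified here.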

ACADEMIC HONESTY

By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community, in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory. [Statement adopted by UMBC's Undergraduate Council and Provost's Office.]

Cheating in any form will not be tolerated. In particular, plagiarism of any published work, another student's work, or your own previously published or submitted work without proper attribution will not be tolerated. We will discuss plagiarism, summarization, and proper citation techniques in class. If you have any questions about what is acceptable, please bring them to me before submitting your work. The minimum penalty for a violation of the academic honesty policy is a zero on the assignment. Other penalties may include a letter grade reduction, failing the class, or, in extreme or repeated cases, dismissal from the program.

 

COURSE OUTLINE (TO BE UPDATED)

1. Overview and Motivation: (Sept 1 and Sept 3)
   a) Data collection, storage, and pre-processing
   b) Various sources of data: Databases, data warehouses, web sites, and data streams
   c) Review material - linear algebra, probability theory, statistics
   d) Brief overview of basic data mining techniques
   e) Introduction to the data analysis software to be used in the course
        --- Weka (publicly available)
        --- MATLAB, Mathematica (available on university machines)

2. Understanding and Exploring Data: (Sept 8 and Sept 15; No class on Sept 10)
   a) Data types
   b) Data quality
   c) Measures of similarity and dissimilarity
   d) Summary statistics
   e) Visualization
   f) OLAP and multidimensional data analysis

3. Data Pre-processing: (Sept 17 and Sept 24)
   Different statistical techniques to massage and explore data
        -- Normalization
        -- Smoothing techniques
        -- Filtering
        -- Hypothesis testing
        -- Common distributions
        -- Sampling

=====================================
MIDTERM I: Sept 22, in-class, closed book.
Syllabus: Everything covered up to the exam date
=====================================

4. Representation Construction: (Sept 29 and Oct 6)
   a) Feature selection techniques
   b) Feature construction:
        -- Principal Component Analysis
        -- Singular Value Decomposition
        -- Random Projection
        -- Fourier
        -- Wavelet

=====================================
October 1: National Science Foundation Supported Next Generation Data Mining Summit, Baltimore, MD, USA.
=====================================

5. Clustering: (Oct 8 and Oct 13)
   a) Partitioning and agglomerative hierarchical techniques
   b) Self-Organizing Feature Maps
   c) Scalable clustering techniques

=====================================
MIDTERM II: Oct 15, in-class, closed book.
Syllabus: Everything covered up to the exam date
=====================================

6. Association Rule Learning: (Oct 20)
   a) Apriori and other multi-pass algorithms
   b) Single-pass association rule learning

7. Learning Predictive Models and Classifiers: (Oct 22 and Oct 27)
   a) Foundations
   b) Background material

8. Regression and Other Statistical Techniques: (Oct 29)
   a) Linear multi-variate regression
   b) Logistic regression
   c) ARMA and ARIMA

9. Decision Trees: (Nov 3 and Nov 5)
   a) ID3/C4.5, CART
   b) Scalable decision tree learning
        -- SLIQ
        -- BOAT
   c) Online decision tree learning

10. Support Vector Machines (SVM) and Other Classifiers: (Nov 10 and Nov 12)
   a) Introduction
   b) Learning SVM classifiers
   c) Artificial Neural Networks
   d) Ensembles: Bagging and Boosting
 

11. Graphical Models: (Nov 17 and Nov 19)
   a) Bayesian learning
   b) Neural networks

12. Human-Computer Interaction in Data Mining: (Nov 24)
   a) Privacy issues

=====================================
Nov 26, 27 -- Thanksgiving Holidays
=====================================

13. Advanced Topics: (Dec 3 and Dec 10)
 
  a) Distributed data mining
 

=====================================
December 8: IEEE Data Mining Conference, Miami, FL, USA

Last day of class: Dec 10
=====================================

=====================================
Final Exam: Dec 22, 1--3pm, in-class, closed book.
Syllabus: Everything covered up to the exam date
=====================================

HOMEWORK

Homework 1 (Solution)

Data set I (for Question 2).
Data set (for Question 3).

Practice Exercise

Homework 2

Exam I Solution

Homework 3

Homework 4

PROJECT

1. Form small groups of no more than four students each. Talk to me if you have any special needs.

2. If you are looking for a topic, talk to me; I can help you find one. You are also most welcome to propose your own project topic.

3. The project proposal is due on Sept 29, 2009. The proposal should contain the following items:

    a) Title and names of the project members
    b) Executive summary of the proposal (one page)
    c) Problem definition and background
    d) Technical scope of the project
    e) Distribution of the work among the project members
    f) Project schedule
    g) References

4. The intermediate project status report is due on Oct 29, 2009.

5. The final project report is due on Dec 10, 2009.
 

Suggested Project Topics (*File Available Now*)

CLASS NOTES

Notes 1 (*File Available Now*)

Review material (Probability Theory/Basic Statistics/Information Theory) (*File Available Now*)
 

Notes 3 (*File Available Now*)
 
References:
1) A Review of Eigenvalue Computation Techniques
2) Survey of Wavelet Applications in Data Mining

Notes 4 (*File Available Now, last updated on Oct 4, 2009*)
 

Notes 5 (*File Available Now*)
 

Reference material from Vipin's book (*File Available Now*)
 

Useful pointers to introductory material on statistical hypothesis testing
 

Notes 6 (*File Available Now*)
 

A good paper on randomized algorithms for similarity-preserving representation construction
 

Reference material on Clustering from Vipin's book (*File Available Now*)
 

Notes 7 (*File Available Now*)
 

Reference material on Association Rule Learning from Vipin's book (*File Available Now*)
 

Notes 8 (*File Available Now*)
 

Introduction to Neural Networks (reference)
 

SVM (reference)
 

Notes 9
 

Notes 10