GOALS
The field of data mining studies algorithms and systems that allow efficient
discovery of patterns hidden in data by paying careful attention to the data
storage, computing, communication, and human-computer interaction issues. This
course is intended to serve as an introduction to this field. A data mining
process deals with (1) data accessing and pre-processing, (2) representation
construction, (3) analysis, and (4) presentation of the patterns to the
user(s). The course will provide a comprehensive introduction to each of these
steps using practical applications. The emphasis will be on the foundations of
data mining techniques. The course will also give participants ample opportunity
to learn about this growing research area and to scout for promising research
topics.
PREREQUISITES
Undergraduate-level background in linear algebra, statistics, and data
structures. You may want to read the appendices (Counting, Sets, and Probability)
of Introduction to Algorithms by Cormen, Leiserson, and Rivest. Students will
need programming knowledge in C/C++ or Java.
TEXTBOOKS
Primary Text:
V. Kumar et al. (2005). Data Mining. Addison Wesley.
References:
J. Han and M. Kamber (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. ISBN: 1558604898.
Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations; Morgan Kaufmann Publishers; ISBN: 1-55860-552-5.
T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
D. Hand, H. Mannila, and P. Smyth (2000). Principles of Data Mining. MIT Press.
S. M. Weiss and N. Indurkhya (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers. ISBN: 1-55860-403-0.
M. Berthold and D. Hand (1999). Intelligent Data Analysis. Springer.
J. Komorowski and J. Zytkow (eds.), Principles of Data Mining and Knowledge Discovery, Springer, 1997. ISBN: 3-540-63223-9.
References on Mathematical Statistics and Information Theory:
Mathematical Statistics: Basic Ideas and Selected Topics. P. J. Bickel and K. A. Doksum.
Elements of Information Theory. Thomas M. Cover, Joy A. Thomas.
An Introduction to Probability Theory and Its Applications, (Vol. 1 & Vol. 2). William Feller.
Web Resources:
Weka Software (Useful link: Introduction to Weka)
Handbook on Engineering Statistics
JAMA package for Matrix-based Computation
INSTRUCTIONAL METHODOLOGY
Classroom lectures, projects, and student project presentations (if time permits).
Basis for evaluation and weighting:
Homework/Quizzes: 20%
Exams: 45% (Exam I, II, III)
Term project: 35%
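As a rough illustration of how these weights combine into a single score, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that the three exams are averaged with equal weight; neither assumption is stated in this syllabus, and the name `course_score` is made up for the example.

```python
# Weights taken from the evaluation breakdown above.
WEIGHTS = {"homework": 0.20, "exams": 0.45, "project": 0.35}


def course_score(homework, exams, project):
    """Return the weighted course score on a 0-100 scale.

    Assumption (not stated in the syllabus): the exams are averaged
    with equal weight before the 45% exam weight is applied.
    """
    exam_avg = sum(exams) / len(exams)
    return (WEIGHTS["homework"] * homework
            + WEIGHTS["exams"] * exam_avg
            + WEIGHTS["project"] * project)


if __name__ == "__main__":
    # Hypothetical scores: homework 90, exams 80/70/90, project 85.
    print(round(course_score(90, [80, 70, 90], 85), 2))
```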
ACADEMIC HONESTY
By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC's scholarly community, in which everyone's academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult the UMBC Student Handbook, the Faculty Handbook, or the UMBC Policies section of the UMBC Directory. [Statement adopted by UMBC's Undergraduate Council and Provost's Office.]
Cheating in any form will not be tolerated. In particular, plagiarism of any published work, another student's work, or your own previously published or submitted work without proper attribution will not be tolerated. We will be discussing plagiarism, summarization, and proper citation techniques in the class. If you have any questions about what is acceptable, please bring them to me before submitting your work. The minimum penalty for a violation of the academic honesty policy is a zero on the assignment. Other penalties may include a letter grade reduction, failing the class, or, in extreme or repeated cases, dismissal from the program.
COURSE OUTLINE (TO BE UPDATED)
1. Overview and Motivation: (Sept 1 and Sept 3)
a) Data collection, storage, and pre-processing
b) Various sources of data: databases, data warehouses, web sites, and data streams
c) Review material: linear algebra, probability theory, statistics
d) Brief overview of basic data mining techniques
e) Introduction to the data analysis software to be used in the course
-- Weka (publicly available)
-- Matlab, Mathematica (available on university machines)
2. Understanding and Exploring Data: (Sept 8 and Sept 15; No class on Sept 10)
a) Data types
b) Data quality
c) Measures of similarity and dissimilarity
d) Summary statistics
e) Visualization
f) OLAP and multidimensional data analysis
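To make the similarity and dissimilarity measures in this unit concrete, the following Python sketch computes two common ones for numeric vectors: Euclidean distance (a dissimilarity) and cosine similarity. The choice of these two measures and the function names are assumptions made for illustration; the course may cover others.

```python
import math


def euclidean_distance(x, y):
    """Dissimilarity: straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Similarity: cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    print(euclidean_distance([0, 0], [3, 4]))
    print(cosine_similarity([1, 2], [2, 4]))
```

Euclidean distance is sensitive to the scale of each attribute, while cosine similarity depends only on direction, which is one reason the two can rank the same pairs of records differently.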
3. Data Pre-processing: (Sept 17 and Sept 24)
Different statistical techniques to massage and explore data
-- Normalization
-- Smoothing techniques
-- Filtering
-- Hypothesis testing
-- Common distributions
-- Sampling
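As a small example of the normalization item above, this Python sketch shows two widely used variants, min-max scaling and z-score standardization. Which variants the lectures use is not stated here, so treat the selection, the function names, and the sample data as assumptions.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map everything to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
    print(min_max_normalize(data))
    print(z_score_normalize(data))
```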
MIDTERM I: Sept 22, in-class, closed book.
=====================================
4. Representation Construction: (Sept 29 and Oct 6)
a) Feature selection techniques
b) Feature construction:
-- Principal Component Analysis
-- Singular Value Decomposition
-- Random Projection
-- Fourier
-- Wavelet
5. Clustering: (Oct 8 and Oct 13)
a) Partitioning and agglomerative hierarchical techniques
b) Self-Organizing Feature Maps
c) Scalable clustering techniques
MIDTERM II: Oct 15, in-class, closed book.
=====================================
6. Association Rule Learning: (Oct 20)
a) Apriori and other multi-pass algorithms
b) Single-pass association rule learning
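Association rule algorithms such as Apriori are built on two basic quantities, the support of an itemset and the confidence of a rule. The Python sketch below computes both directly from a list of transactions. It is not an implementation of Apriori itself, and the basket data and function names are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0


if __name__ == "__main__":
    # Made-up market-basket data.
    baskets = [
        {"bread", "milk"},
        {"bread", "diapers", "beer", "eggs"},
        {"milk", "diapers", "beer", "cola"},
        {"bread", "milk", "diapers", "beer"},
        {"bread", "milk", "diapers", "cola"},
    ]
    print(support(baskets, {"diapers", "beer"}))
    print(confidence(baskets, {"diapers"}, {"beer"}))
```

A multi-pass algorithm would use a minimum-support threshold to prune candidate itemsets level by level rather than scanning the data once per itemset as this sketch does.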
7. Learning Predictive Models and Classifiers: (Oct 22 and Oct 27)
a) Foundations
b) Background material
8. Regression and Other Statistical Techniques: (Oct 29)
a) Linear multivariate regression
b) Logistic regression
c) ARMA and ARIMA
9. Decision Trees: (Nov 3 and Nov 5)
a) ID3/C4.5, CART
b) Scalable decision tree learning
-- SLIQ
-- BOAT
c) Online decision tree learning
10. Support Vector Machines (
a) Introduction
b) Learning
c) Artificial Neural Networks
d) Ensembles: Bagging and Boosting
a) Bayesian learning
b) Neural networks
a) Privacy issues
a) Distributed data mining
December 8: IEEE Data Mining Conference, Miami, FL, USA
Last day of class: Dec 10
=====================================
Final Exam: Dec 22, 1-3 pm, in-class, closed book.
=====================================
HOMEWORK
PROJECT
1. Form small groups with no more than four students per group. Talk to me if you have any special needs.
2. If you are looking for a topic, talk to me. I can help you find a topic for the project. You are also most welcome to propose your own project topic.
3. The project proposal is due on Sept 29. It should include:
a) Title, name of the project members
b) Executive summary of the proposal (One page)
c) Problem definition and background
d) Technical scope of the project
e) Distribution of the work among the project members
f) Project schedule
g) References
4. Intermediate project status report is due on Oct 29
5. Final project report is due on Dec 10
Suggested Project Topics (*File Available Now*)
CLASS NOTES
Notes 1 (*File Available Now*)
Review material (Probability Theory/Basic Statistics/Information Theory) (*File Available Now*)
Notes 3 (*File Available Now*)
References:
1) A Review of Eigenvalue Computation Techniques
2) Survey of Wavelet Applications in Data Mining
Notes 4 (*File Available Now, last updated on Oct 4, 2009*)
Notes 5 (*File Available Now*)
Reference material from Vipin's book (*File Available Now*)
Useful pointers for Statistical Hypothesis Testing Introductory Material
Notes 6 (*File Available Now*)
A good paper on randomized algorithms for similarity preserving representation construction
Reference material on Clustering from Vipin's book (*File Available Now*)
Notes 7 (*File Available Now*)
Reference material on Association Rule Learning from Vipin's book (*File Available Now*)
Notes 8 (*File Available Now*)
Introduction to Neural Networks (reference)
Notes 9
Notes 10