Distributed Data Mining  

DIADIC Laboratory
Department of Computer Science and Electrical Engineering

University of Maryland at Baltimore County

 



Introduction

Advances in computing and communication over wired and wireless networks have produced many pervasive distributed computing environments. The Internet, intranets, local area networks, ad hoc wireless networks, and sensor networks are some examples. These environments often come with different distributed sources of data and computation, and mining in them naturally calls for proper use of these distributed resources. Moreover, in many privacy-sensitive applications, data sets collected at different sites, possibly by multiple parties, must be processed in a distributed fashion without gathering everything at a single central site. However, most off-the-shelf data mining systems are designed as monolithic centralized applications: they normally download the relevant data to a central location and then perform the data mining operations there. This centralized approach does not work well in many of the emerging distributed, ubiquitous, possibly privacy-sensitive data mining applications.

Distributed Data Mining (DDM) offers an alternative approach to the problem of mining data using distributed resources. DDM pays careful attention to the distributed resources of data, computing, communication, and human factors in order to use them in a near-optimal fashion.

DDM applications come in different flavors. When the data can be transported freely and efficiently from one node to another without significant overhead, DDM algorithms may offer better scalability and response time by (1) properly redistributing the data into different partitions, (2) distributing the computation, or (3) a combination of both. These algorithms often rely on fast communication between participating nodes. However, when the data sources are distributed and cannot be transmitted freely over the network because of privacy constraints, bandwidth limitations, or scalability problems, DDM algorithms work by avoiding or minimizing communication of the raw data.
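As a rough illustration of minimizing raw-data communication, the sketch below has each site reduce its local values to three numbers (count, sum, and sum of squares), which a coordinator then combines into a global mean and variance. This is a generic textbook technique, not an algorithm taken from the DIADIC Laboratory's work; the names `LocalSummary`, `summarize`, and `combine` and the three example sites are hypothetical. Sharing summaries reduces communication but is not by itself a privacy guarantee.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LocalSummary:
    """Summary statistics a site sends instead of its raw records (hypothetical type)."""
    count: int
    total: float
    total_sq: float


def summarize(values: Iterable[float]) -> LocalSummary:
    """Runs at each site: reduce the local data to three numbers."""
    vals = list(values)
    return LocalSummary(len(vals), sum(vals), sum(v * v for v in vals))


def combine(summaries: List[LocalSummary]) -> Tuple[float, float]:
    """Runs at the coordinator: global mean and population variance."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no data at any site")
    mean = sum(s.total for s in summaries) / n
    # E[x^2] - (E[x])^2; simple, but can lose precision for large values.
    variance = sum(s.total_sq for s in summaries) / n - mean * mean
    return mean, variance


# Three hypothetical sites, each holding its own local values.
sites = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = combine([summarize(site) for site in sites])
print(mean, variance)
```

Each site transmits a fixed-size summary regardless of how many records it holds, so the communication cost grows with the number of sites rather than with the volume of data.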

The distributed data mining research of the DIADIC Laboratory, University of Maryland at Baltimore County, is directed toward developing this technology. Our current research is focused on the following areas:

  • DDM in mobile and wireless environments
  • DDM for large scale scientific, business, and grid applications
  • Privacy preserving DDM from multi-party data


Our research considers both DDM algorithm and system development. Some of our current and past research contributions are listed below:

    Algorithmic Research

  • Distributed Principal Component Analysis (PCA) from Heterogeneous Data
  • Distributed PCA-based Clustering from Heterogeneous Data
  • Distributed multivariate regression
  • Distributed Bayesian Network Learning
  • Distributed Hierarchical Clustering
  • Distributed classifier learning
  • Distributed privacy-preserving data mining from multi-party data using additive and multiplicative noise
  • Using the Fourier spectrum of decision trees to aggregate decision tree ensembles
  • Distributed representation construction in the biological process of gene expression and its algorithmic understanding
  • Characterizing the power consumption of distributed data mining algorithms and developing data mining algorithms that minimize power consumption

    System Development and Applications

  • PArallel Data Mining Agents (PADMA) System (1997) (a parallel data mining system)
  • BODHI System (1998) (an early general purpose DDM system)
  • Mobile Data Mining (MobiMine) System (for monitoring stock market data streams using mobile devices)
  • VEhicle DAta Stream mining (VEDAS) System (2002) (for monitoring data from mobile sensors in vehicles for distributed fleet management applications)


Publications regarding the work listed above can be found here....