Elsevier

Information Sciences

Volume 176, Issue 14, 22 July 2006, Pages 1952-1985
Information Sciences

Clustering distributed data streams in peer-to-peer environments

https://doi.org/10.1016/j.ins.2005.11.007Get rights and content

Abstract

This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment like sensor networks. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized asynchronous manner by communicating with the neighboring nodes. The paper offers extensive theoretical analysis of the algorithm that bounds the error in the distributed clustering process compared to the centralized approach that requires downloading all the observed data to a single site. Experimental results show that, in contrast to the case when all the data is transmitted to a central location for application of the conventional clustering algorithm, the communication cost (an important consideration in sensor networks which are typically equipped with limited battery power) of the proposed approach is significantly smaller. At the same time, the accuracy of the obtained centroids is high and the number of samples which are incorrectly labeled is also small.

Introduction

Clustering [1] is a well-known and widely used exploratory data analysis technique. Most of the clustering algorithms that are available in the literature deal with data available at a single location. However, there exists many applications where data sources are distributed over a network and collecting the data at a central location before clustering is not a viable option. Sensor networks connected over wireless networks offer one such environment where centralized data clustering is difficult and often not scalable because of various reasons such as limited communication bandwidth and limited battery power supply for running the sensor nodes. Sensor networks communicate in a peer-to-peer (P2P) fashion which allows only local communication among the neighboring sensor nodes. This requires that the data analysis algorithms also communicate in a P2P mode and work in an asynchronous way [2], [3]. No such clustering algorithm has so far been reported in the literature.

This paper offers a P2P clustering algorithm that is designed for environments like sensor networks monitoring continuous data streams at various nodes. The clustering technique, referred to as the P2P K-Means clustering, is based on the principles of the K-Means algorithm, and utilizes certain statistical bounds for estimating the error in computing the centroids of the clusters in a distributed manner vis-a-vis the centralized approach. The algorithm deals with the general scenario where each node contains a subset of the overall data to be clustered and the nodes observe the same set of attributes. Although most sensor network applications deal with continuous data streams, the paper does not directly address that. The P2P K-Means algorithm can be easily extended following the work by [4] in order to handle stream data. The objective of the current work is to develop a P2P version of the K-Means algorithms so that it can be used by the stream version of the K-Means algorithm [4] for clustering stream data in a sensor network.

Section 2 describes some of the clustering applications for sensor networks that motivated this research. Section 3 provides some background discussion on clustering, research issues in sensor networks, and related work in the domain of distributed clustering. The proposed clustering technique is described in Section 4. A detailed theoretical analysis of the proposed algorithm is carried out in Section 5. Section 6 provides the experimental results. Finally, Section 7 concludes the article.

Section snippets

Motivation

Consider a scene segmentation and monitoring application using sensor networks where the sensor nodes are equipped with audio, vibration, temperature and reflectance probes. The sensors are monitoring a given geographical region. The sensors are battery powered. Therefore, in the normal mode they are designed not to be very active. However, as soon as a node detects a possible change in the scenario, sensors must wake up, observe, reason and collaborate with each other in order to track and

Background

As already mentioned, distributed clustering of data streams in an energy efficient manner is an emerging research problem with significant applications in sensor networks. This section briefly reviews some of the existing clustering techniques, with particular emphasis on data stream clustering, distributed clustering, and information processing in sensor networks. Issues in sensor networks, which make the problem unique, are also discussed.

Distributed peer-to-peer K-Means clustering

This section describes the proposed P2P K-Means clustering algorithm in a peer-to-peer homogeneously distributed setting. Since the P2P K-Means clustering technique utilizes the principles of the conventional K-Means algorithm, it is described first. Subsequently, the distributed clustering problem is formally stated, followed by a description of the proposed method. A theoretical analysis of P2P K-Means is provided in the next section.

Analysis of the P2P K-Means algorithm

In this section we provide results which allow a node, at a given iteration, to compute an upper-bound on the centroid error based on current run-time information. Such a bound can be thought of as a “gage” by which the node can measure the degree to which accuracy has been sacrificed at the expense of lowered communication cost.

We analyze the variant of our algorithm which uses a random sampling of nodes to update centroids. Our analysis is an adaption of that provided in [7] where K-Means

Experimental results

We carried out two sets of experiments. First we applied a random sampling-based approach, i.e., in each iteration, every node (site) updates its cluster centroids based on the cluster information from randomly selected nodes over the whole network. Then we experimented with the scenario where random sampling is replaced by the deterministic immediate neighbor-based approach, i.e., each node updates its cluster centroids by only considering the information from its immediate neighbors. The

Discussion and conclusions

This article describes the P2P K-Means algorithm for distributed clustering of data streams in a peer-to-peer sensor network environment. Sensor networks are characterized by low communication and computational capabilities, limited battery power, asynchronous nature and existence of faults. In the P2P K-Means algorithm, computation is performed locally, and communication of the local data models (represented by the corresponding centroids and the cluster counts) is restricted only within a

Acknowledgements

The authors acknowledge supports from the United States National Science Foundation (NSF) CAREER award IIS-0093353, NSF Grant IIS-0329143, and NASA (NRA) NAS2-37143.

References (56)

  • I.F. Akyildiz et al.

    Wireless sensor networks: a survey

    Computer Networks

    (2002)
  • A.K. Jain et al.

    Algorithms for Clustering Data

    (1988)
  • W. Kowalczyk et al.

    Towards data mining in large and fully distributed peer-to-peer overlay networks

  • R. Wolff et al.

    Association rule mining in peer-to-peer systems

    IEEE Trans. Syst. Man Cyberns. Part B

    (2004)
  • S. Guha et al.

    Clustering data streams

  • J.T. Tou et al.

    Pattern Recognition Principles

    (1974)
  • J. Han et al.
  • P. Domingos et al.

    A general method for scaling up machine learning algorithms and its application to clustering

  • C. Aggarwal, J. Han, J. Wang, P. Yu, A framework for clustering evolving data streams, in: Proceedings of the 29th...
  • B. Babcock, M. Datar, R. Motwani, L. O’Callaghan, Maintaining variance and k-medians over data stream windows, in: ACM...
  • P. Domingos et al.

    Mining high speed data streams

  • P. Domingos et al.

    Mining time-changing data streams

  • J. Lin et al.

    Iterative incremental clustering of time series

  • E. Keogh, J. Lin, W. Truppel, Clustering of time series subsequences is meaningless: implications for past and future...
  • I. Dhillon, D. Modha, A data-clustering algorithm on distributed memory multiprocessors, in: Proceedings of the KDD’99...
  • G. Forman et al.

    Distributed data clustering can be efficient and exact

    SIGKDD Explorations

    (2000)
  • H. Kargupta et al.

    Distributed clustering using collective principal component analysis

    Knowledge Inform. Syst. J.

    (2001)
  • M. Klusch, S. Lodi, G. Moro, Distributed clustering based on sampling local density estimates, in: Proceedings of the...
  • A. Hinneburg, D. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of...
  • M. Eisenhardt, W. Muller, A. Henrich, Classifying documents by distributed P2P clustering, in: Proceedings of...
  • S. Merugu, J. Ghosh, Privacy-preserving distributed clustering using generative models, in: Proceedings of the IEEE...
  • E. Johnson et al.

    Collective, hierarchical clustering from distributed, heterogeneous data

  • A. Lazarevic, D. Pokrajac, Z. Obradovic, Distributed clustering and local regression for knowledge discovery in...
  • N. Samatova et al.

    RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets

    Distrib. Parallel Databases

    (2002)
  • A. Strehl et al.

    Cluster ensembles—a knowledge reuse framework for combining multiple partitions

    J. Mach. Learning Res.

    (2002)
  • A. Fred, A. Jain, Data clustering using evidence accumulation, in: Proceedings of the International Conference on...
  • P. Jouve, N. Nicoloyannis, A new method for combining partitions, applications for distributed clustering, in:...
  • E. Januzaj, H.-P. Kriegel, M. Pfeifle, DBDC: density based distributed clustering, in: Proceedings of EDBT in Lecture...
  • Cited by (133)

    • Distributed clustering using multi-tier hierarchical overlay super-peer peer-to-peer network architecture for efficient customer segmentation

      2021, Electronic Commerce Research and Applications
      Citation Excerpt :

      All peers should be synchronized in each iteration; however, this mechanism has resulted in heavy traffic and network congestion. Two other approximate local algorithms, Local Synchronization-based P2P (LSP2P) and Uniform node Sampling-based P2P (USP2P) for P2P K-means clustering, are described in (Eisenhardt et al., 2003) and (Bandyopadhyay et al., 2006). The local synchronization-based P2P K-means distributes the centroids using gossiping.

    • Distributed clustering in peer to peer networks using multi-objective whale optimization

      2020, Applied Soft Computing Journal
      Citation Excerpt :

      The peers have ability to communicate with each other. Distributed clustering is an important tool to extract significant information from the data collected at the peer nodes [1,2]. It avoids the use of a traditional central processor that collects all data from the neighborhood nodes and performs clustering.

    • Distributed robust data clustering in wireless sensor networks using diffusion moth flame optimization

      2020, Engineering Applications of Artificial Intelligence
      Citation Excerpt :

      The points derived from the analysis are highlighted in the conclusion Section 7. Distributed data clustering in peer to peer networks is introduced in 2006 by Bandyopadhyay et al. (2006) which is based on popular K-Means algorithm. This approach has the limitation of global synchronization (as the nodes can perform successive iteration only after completion of the previous one for all the nodes involved in the network).

    • Distributed stream clustering using micro-clusters on Apache Storm

      2017, Journal of Parallel and Distributed Computing
    View all citing articles on Scopus
    1

    On leave from Indian Statistical Institute, Kolkata, India.

    2

    On leave from Jadavpur University, Kolkata, India.

    3

    Also affiliated with Agnik, LLC.

    View full text