Clustering distributed data streams in peer-to-peer environments
Introduction
Clustering [1] is a well-known and widely used exploratory data analysis technique. Most of the clustering algorithms that are available in the literature deal with data available at a single location. However, there exists many applications where data sources are distributed over a network and collecting the data at a central location before clustering is not a viable option. Sensor networks connected over wireless networks offer one such environment where centralized data clustering is difficult and often not scalable because of various reasons such as limited communication bandwidth and limited battery power supply for running the sensor nodes. Sensor networks communicate in a peer-to-peer (P2P) fashion which allows only local communication among the neighboring sensor nodes. This requires that the data analysis algorithms also communicate in a P2P mode and work in an asynchronous way [2], [3]. No such clustering algorithm has so far been reported in the literature.
This paper offers a P2P clustering algorithm that is designed for environments like sensor networks monitoring continuous data streams at various nodes. The clustering technique, referred to as the P2P K-Means clustering, is based on the principles of the K-Means algorithm, and utilizes certain statistical bounds for estimating the error in computing the centroids of the clusters in a distributed manner vis-a-vis the centralized approach. The algorithm deals with the general scenario where each node contains a subset of the overall data to be clustered and the nodes observe the same set of attributes. Although most sensor network applications deal with continuous data streams, the paper does not directly address that. The P2P K-Means algorithm can be easily extended following the work by [4] in order to handle stream data. The objective of the current work is to develop a P2P version of the K-Means algorithms so that it can be used by the stream version of the K-Means algorithm [4] for clustering stream data in a sensor network.
Section 2 describes some of the clustering applications for sensor networks that motivated this research. Section 3 provides some background discussion on clustering, research issues in sensor networks, and related work in the domain of distributed clustering. The proposed clustering technique is described in Section 4. A detailed theoretical analysis of the proposed algorithm is carried out in Section 5. Section 6 provides the experimental results. Finally, Section 7 concludes the article.
Section snippets
Motivation
Consider a scene segmentation and monitoring application using sensor networks where the sensor nodes are equipped with audio, vibration, temperature and reflectance probes. The sensors are monitoring a given geographical region. The sensors are battery powered. Therefore, in the normal mode they are designed not to be very active. However, as soon as a node detects a possible change in the scenario, sensors must wake up, observe, reason and collaborate with each other in order to track and
Background
As already mentioned, distributed clustering of data streams in an energy efficient manner is an emerging research problem with significant applications in sensor networks. This section briefly reviews some of the existing clustering techniques, with particular emphasis on data stream clustering, distributed clustering, and information processing in sensor networks. Issues in sensor networks, which make the problem unique, are also discussed.
Distributed peer-to-peer K-Means clustering
This section describes the proposed P2P K-Means clustering algorithm in a peer-to-peer homogeneously distributed setting. Since the P2P K-Means clustering technique utilizes the principles of the conventional K-Means algorithm, it is described first. Subsequently, the distributed clustering problem is formally stated, followed by a description of the proposed method. A theoretical analysis of P2P K-Means is provided in the next section.
Analysis of the P2P K-Means algorithm
In this section we provide results which allow a node, at a given iteration, to compute an upper-bound on the centroid error based on current run-time information. Such a bound can be thought of as a “gage” by which the node can measure the degree to which accuracy has been sacrificed at the expense of lowered communication cost.
We analyze the variant of our algorithm which uses a random sampling of nodes to update centroids. Our analysis is an adaption of that provided in [7] where K-Means
Experimental results
We carried out two sets of experiments. First we applied a random sampling-based approach, i.e., in each iteration, every node (site) updates its cluster centroids based on the cluster information from randomly selected nodes over the whole network. Then we experimented with the scenario where random sampling is replaced by the deterministic immediate neighbor-based approach, i.e., each node updates its cluster centroids by only considering the information from its immediate neighbors. The
Discussion and conclusions
This article describes the P2P K-Means algorithm for distributed clustering of data streams in a peer-to-peer sensor network environment. Sensor networks are characterized by low communication and computational capabilities, limited battery power, asynchronous nature and existence of faults. In the P2P K-Means algorithm, computation is performed locally, and communication of the local data models (represented by the corresponding centroids and the cluster counts) is restricted only within a
Acknowledgements
The authors acknowledge supports from the United States National Science Foundation (NSF) CAREER award IIS-0093353, NSF Grant IIS-0329143, and NASA (NRA) NAS2-37143.
References (56)
- et al.
Wireless sensor networks: a survey
Computer Networks
(2002) - et al.
Algorithms for Clustering Data
(1988) - et al.
Towards data mining in large and fully distributed peer-to-peer overlay networks
- et al.
Association rule mining in peer-to-peer systems
IEEE Trans. Syst. Man Cyberns. Part B
(2004) - et al.
Clustering data streams
- et al.
Pattern Recognition Principles
(1974) - et al.
- et al.
A general method for scaling up machine learning algorithms and its application to clustering
- C. Aggarwal, J. Han, J. Wang, P. Yu, A framework for clustering evolving data streams, in: Proceedings of the 29th...
- B. Babcock, M. Datar, R. Motwani, L. O’Callaghan, Maintaining variance and k-medians over data stream windows, in: ACM...
Mining high speed data streams
Mining time-changing data streams
Iterative incremental clustering of time series
Distributed data clustering can be efficient and exact
SIGKDD Explorations
Distributed clustering using collective principal component analysis
Knowledge Inform. Syst. J.
Collective, hierarchical clustering from distributed, heterogeneous data
RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets
Distrib. Parallel Databases
Cluster ensembles—a knowledge reuse framework for combining multiple partitions
J. Mach. Learning Res.
Cited by (133)
Distributed clustering using multi-tier hierarchical overlay super-peer peer-to-peer network architecture for efficient customer segmentation
2021, Electronic Commerce Research and ApplicationsCitation Excerpt :All peers should be synchronized in each iteration; however, this mechanism has resulted in heavy traffic and network congestion. Two other approximate local algorithms, Local Synchronization-based P2P (LSP2P) and Uniform node Sampling-based P2P (USP2P) for P2P K-means clustering, are described in (Eisenhardt et al., 2003) and (Bandyopadhyay et al., 2006). The local synchronization-based P2P K-means distributes the centroids using gossiping.
Distributed clustering in peer to peer networks using multi-objective whale optimization
2020, Applied Soft Computing JournalCitation Excerpt :The peers have ability to communicate with each other. Distributed clustering is an important tool to extract significant information from the data collected at the peer nodes [1,2]. It avoids the use of a traditional central processor that collects all data from the neighborhood nodes and performs clustering.
Distributed robust data clustering in wireless sensor networks using diffusion moth flame optimization
2020, Engineering Applications of Artificial IntelligenceCitation Excerpt :The points derived from the analysis are highlighted in the conclusion Section 7. Distributed data clustering in peer to peer networks is introduced in 2006 by Bandyopadhyay et al. (2006) which is based on popular K-Means algorithm. This approach has the limitation of global synchronization (as the nodes can perform successive iteration only after completion of the previous one for all the nodes involved in the network).
Automatic Determination of K in Distributed K-Means Clustering
2019, Procedia Computer ScienceDistributed stream clustering using micro-clusters on Apache Storm
2017, Journal of Parallel and Distributed Computing
- 1
On leave from Indian Statistical Institute, Kolkata, India.
- 2
On leave from Jadavpur University, Kolkata, India.
- 3
Also affiliated with Agnik, LLC.