ABSTRACTS

 

 

Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms

Abstract: This paper considers the problem of change detection using local distributed eigen monitoring algorithms for next generation of astronomy petascale data pipelines such as the Large Synoptic Survey Telescopes (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Change point detection and event classification in such data sets may provide useful insights to unique astronomical phenomenon displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the principal components (PC) of such dynamic data streams. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e. converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Real experiments performed on Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm.

A Communication Efficient Probabilistic Algorithm for Mining Frequent Itemset from Peer-to-Peer Network

Abstract: This paper focuses on developing a communication efficient algorithm for discovering frequent itemsets from data distributed homogeneously in a peer-to-peer(P2P) network. The classic frequent itemset mining algorithm does not work in this setting as it needs to access the data in its entirety available at one location. In a large-scale distributed environment like a P2P network, finding all the network-wide frequent itemsets with exact frequency is computationally difficult and usually requires large amount of communication (often communication per peer reaches order of the network size). Instead, this paper tries to find most of the frequent itemsets with probabilistic guarantee and their approximate frequency using only a bounded communication. The paper takes sampling based approach to identify frequent itemsets from the entire P2P network. It first addresses the challenging problem of collecting an unbiased uniform sample from a P2P network, and shows how to take a uniform sample of nodes and a uniform sample of data from a P2P network using random-walk. It then applies the proposed sampling technique to identify most of the frequent itemsets from a P2P network. Theoretical analysis shows how to bound the sample-size irrespective of the size of the entire data and hence, communication necessary to compute the results. Experimental analysis shows that the proposed algorithm discovers all of the network-wide frequent itemsets using only a bounded communication.

Approximate K-means Clustering Over a Peer-to-peer Network

Abstract: Data intensive peer-to-peer (P2P) networks are finding increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternate choice. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network that can produce clustering by "local" synchronization only. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both the algorithms demonstrate good performance compared to their centralized counterparts at little communication cost.

On the Privacy of Euclidean Distance Preserving Data Perturbation

Abstract: We examine Euclidean distance preserving data perturbation as a tool for privacy-preserving data mining. Such perturbations allow many important data mining algorithms, with only minor modification, to be applied to the perturbed data and produce exactly the same results as if applied to the original data, e.g., hierarchical cluatering and k-means clustering. However, the issue of how well the original data is hidden needs careful study. We take a step in this direction by assuming the role of an attacker armed with two types of prior information regarding the original data. We examine how well the attacker can recover the original data from the perturbed data and prior information. Our results offer insight into the vulnerabilities of Euclidean distance preserving transformations.

A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems

Abstract: In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system’s functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models e.g. decision trees, k-means clustering in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis

A Scalable Local Algorithm for Distributed Multivariate Regression

Abstract: This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm can be used for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the “quality” of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.

Distributed Decision Tree Induction in Peer-to-Peer Systems

Abstract: This paper offers a scalable and robust distributed algorithm for decision tree induction in large Peer-to-Peer (P2P) environments. Computing a decision tree in such large distributed systems using standard centralized algorithms can be very communication-expensive and impractical because of the synchronization requirements. The problem becomes even more challenging in the distributed stream monitoring scenario where the decision tree needs to be updated in response to changes in the data distribution. This paper presents an alternate solution that works in a completely asynchronous manner in distributed environments and offers low communication overhead, a necessity for scalability. It also seamlessly handles changes in data and peer failures. The paper presents extensive experimental results to corroborate the theoretical claims.

An efficient local Algorithm for Distributed Multivariate Regression in Peer-to-Peer Networks

Abstract: This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer environments. The algorithm is designed for distributed inferencing, data compaction, data modeling and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social networking, sensor networks and web mining. Computing a global regression model from data available at the different peer-nodes using a traditional centralized algorithm for regression can be very costly and impractical because of the large number of data sources, the asynchronous nature of the peer-to-peer networks, and dynamic nature of the data/network. This paper proposes a two-step approach to deal with this problem. First, it offers an efficient local distributed algorithm that monitors the “quality” of the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost. Experimental results presented in this paper strongly support the theoretical claims.

Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network

Abstract: Inner product measures how closely two feature vectors are related. It is an important primitive for many popular data mining tasks, e.g., clustering, classification, correlation computation, and decision tree construction. If the entire data set is available at a single site, then computing the inner product matrix and identifying the top (in terms of magnitude) entries is trivial. However, in many real-world scenarios, data is distributed across many locations and transmitting the data to a central server would be quite communication-intensive and not scalable. This paper presents an approximate local algorithm for identifying top-l inner products among pairs of feature vectors in a large asynchronous distributed environment such as a peer-to-peer (P2P) network. We develop a probabilistic algorithm for this purpose using order statistics and Hoeffding bound. We present experimental results to show the effectiveness and scalability of the algorithm. Finally, we demonstrate an application of this technique for interest-based community formation in a P2P environment.

A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods

Abstract: We focus primarily on the use of additive and matrix multiplicative data perturbation techniques in privacy preserving data mining (PPDM). We survey a recent body of research aimed at better understanding the vulnerabilities of these techniques. These researchers assumed the role of an attacker and developed methods for estimating the original data from the perturbed data and any available prior knowledge. Finally, we briefly discuss research aimed at attacking k-anonymization, another data perturbation technique in PPDM.

Privacy-Preserving Data Analysis on Graphs and Social Networks

Abstract: While literature within the field of privacy-preserving data mining (PPDM) has been around for many years, attention has mostly been given to the perturbation and anonymization of tabular data; understanding the role of privacy over graphs and networks is still very much in its infancy. In this chapter, we survey a very recent body of research on privacy-preserving data analysis over graphs and networks in an effort to allow the reader to observe common themes and future directions.

Thoughts on Human Emotions, Breakthroughs in Communication, and the Next Generation of Data Mining

HillolKargupta. 2007 National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation.

Abstract: This paper revisits some of the breakthroughs in the field of communication that changed the human civilization and tries to understand where we are going tomorrow. It argues that while we have become very good in quickly connecting an entity with another entity world-wide as long as the former knows the address of the latter, current technology for taking the message or service from one individual to a large population of willing and interested individuals is still very primitive. We need technology for dealing with the new world of attention-economics. The paper also argues the currently popular centralized client-server model for the Internet applications may not work well in doing so. It offers some alternate thoughts borrowed from the nature and some emerging scalable applications that may offer some directions.

Uniform Data Sampling from a Peer-to-Peer Network

Souptik Datta, HillolKargupta (2007). 2007 IEEE International Conference on Distributed Computing Systems Aceepted for publication.

Abstract: Uniform random sample is often useful in analyzing data. Usually taking a uniform sample is not a problem if the entire data resides in one location. However, if the data is distributed in a peer-to-peer (P2P) network with different amount of data in different peers, collecting a uniform sample of data becomes a challenging task. A random sampling can be performed using random-walk on the P2P network modeled as a graph. However, due to varying degrees of connectivity and different sizes of data owned by each peer, this random walk with a specific length gives a biased sample. In this paper, we propose a random walk based sampling algorithm that can be run to sample datapoints uniformly from a large, unstructured P2P network. We model the random walk as a Markov chain and derive conditions to bound the length of the random walk necessary to achieve uniformity in the sampling procedure. A formal communication analysis shows logarithmic communication cost to discover a uniform data sample. Experiments verify the effectiveness of the technique in achieving uniformity of data.

Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System

Haimonti Datta,, Chris Giannella, Kirk Borne and Hillol Kargupta. Proceedings of the 2007 SIAM International Conference on Data Mining (2007).

Abstract: The design, implementation and archiving of large sky surveys is an important part of astronomy research. The Sloan Digital Sky Survey (SDSS), The Two Micron All Sky Survey (2MASS) are some such surveys producing tera bytes of geographically distributed data which need to be stored, analyzed and queried to enable scientific discoveries. In this paper, we describe the architecture of a system for Distributed Exploration of Massive Astronomy Catalogs (DEMAC) which is built on top of the existing National Virtual Observatory environment. We describe distributed algorithms for doing Principal Component Analysis (PCA) using random projection and sampling based techniques. Using the approximate principal components, we develop a distributed outlier detection algorithm which enables identification of data points that deviate sharply from the ``correlation structure" of the data. We provide simulation results with data obtained from sky-surveys SDSS and 2MASS.

Distributed Probabilistic Inferencing in Sensor Networks using Variational Approximation

Sourav Mukherjee, HillolKargupta (2007).

Abstract: This paper considers the problem of distributed inferencing in a sensor network. It particularly explores the probabilistic inferencing problem in the context of a distributed Boltzmann machine-based framework for monitoring the network. The paper offers a variational mean-field approach to develop communication-efficient local algorithm for Variational Inferencing in Distributed Environments (VIDE). It compares the performance of the proposed approximate variational technique with respect to the exact and centralized techniques. It shows that the VIDE offers a much more communication-efficient solution at very little cost in terms of the accuracy. It also offers experimental results in order to substantiate the scalability of the proposed algorithm.

A Game Theoretic Approach toward Multi-Party Privacy-Preserving Distributed Data Mining

H. Kargupta , K. Das, K. Liu.

Abstract: Analyzing privacy-sensitive data in a multi-party environment often assumes that the parties are well-behaved and they abide by the protocols as expected. Parties compute whatever is needed, communicate correctly following the rules, and do not collude with other parties for exposing third party sensitive data. This paper argues that most of these nice assumptions fall apart in real-life applications of privacy-preserving distributed data mining (PPDM). The paper offers a more realistic formulation of the PPDM problem as a multi-party game where each party tries to maximize its own objectives. It develops a game-theoretic framework and offers a new methodology to develop PPDM algorithms. It also presents equilibrium-analysis of PPDM-games and outlines a game-theoretic solution based on the concept of ``cheap-talk'' borrowed from economics and game theory literature.

Clustering Distributed Data Streams in Peer-to-peer Environments

Sanghamitra Banyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta, Souptik Datta, Kun Liu (2006). Information Sciences , 176(14), 1952-1985,2006

Abstract: This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment like sensor networks. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized asynchronous manner by communicating with the neighboring nodes. The paper offers extensive theoretical anal- ysis of the algorithm that bounds the error in the distributed clustering process com- pared to the centralized approach that requires downloading all the observed data to a single site. Experimental results show that, in contrast to the case when all the data is transmitted to a central location for application of the conventional clustering algo- rithm, the communication cost (an important consideration in sensor networks which are typically equipped with limited battery power) of the proposed approach is significantly smaller.  At the same time, the accuracy of the obtained centroids is high and the number of samples which are incorrectly labeled is also small.

On-board Vehicle Data Stream Monitoring using MineFleet and Fast Resource Constrained Monitoring of Correlation Matrices

Hillol Kargupta, VasundharaPuttagunta, Martin Klein, Kakali Sarkar Next GenerationComputing. , Invited submission for special issue on learning from data streams, volume 25, no. 1, pp. 5--32, 2007

Abstract: This paper considers the problem of monitoring vehicle data streams in a resource-constrained environment. It particularly focuses on a monitoring task that requires frequent computation of correlation matrices using lightweight on-board computing devices. It motivates this problem in the context of the {\em MineFleet Real-Time} system and offers a randomized algorithm for fast monitoring of correlation (FMC), inner product, and Euclidean distance matrices among others. Unlike the existing approaches that compute all the entries of these matrices from a data set, the proposed technique works using a divide-and-conquer approach. This paper presents a probabilistic test for quickly detecting whether or not a subset of coefficients contains a significant one with a magnitude greater than a user given threshold. This test is used for quickly identifying the portions of the space that contain significant coefficients. The proposed algorithm is particularly suitable for monitoring correlation and related matrices computed from continuous data streams.

 

Distributed Data Mining in Peer-to-Peer Networks. 

Souptik. Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, Hillol Kargupta. (2006) Invited submission to the IEEE Internet Computing special issue on Distributed Data Mining, July-August 2006.

Abstract: Peer-to-peer (P2P) networks are gaining popularity in many application s such as file sharing, e-commerce, and social networking, many of which deal with rich, distributed data sources that can benefit from data mining. P2P networks are, infact,well-suited to distributed data mining (DDM), which deals with the problem of data analysis in environments with distributed data,computing nodes,and users. This article offers an overview of DDM applications and algorithms for P2P environments,focusing particularly on local algorithms that perform data analysis by using computing primitives with limited communication overhead. The authors describe both exact and approximate local P2P data mining algorithms that work in a decentralized and communication-efficient manner.

K-Means Clustering over a Large, Dynamic Networks

Souptik Datta, Chris Giannella and Hillol Kargupta (2006 Proc. 2006 SIAM Conf. Data Mining (SDM 06), SIAM Press, 2006, pp. 153-164

Abstract: This paper presents an algorithm for K-means clustering of data distributed over a large, dynamic network.  The network is not assumed to contain any special server nodes ( i.e. is peer-to-peer) and is not assumed to be stable either with respect to the topology or the data held by nodes.  The algorithm requires only local communication and synchronization at each iteration: nodes communicate and synchronize only with their topologically neighboring nodes.  Due to the growing prevalence of peer-to-peer and mobile/wireless sensor networks, data analysis in large, dynamic networks is likely to garner increasing importance in the near future.  To our knowledge, our algorithm represents the first K-means algorithm (a common data analysis/mining technique) to be developed for a large dynamic network.
We tested our algorithm in a simulated environment of up to 1000 nodes on synthetic data.  We examine its behavior in a static environment (no data or network change) and a dynamic environment.  Empirical results show the algorithm demonstrates good accuracy (in both the static and dynamic environment) in that the cluster labels produced are very similar to those produced by K-means run
on centralized data.

On the Privacy Preserving Properties of Random Data Perturbation Techniques.   

Hillol Kargupta, Souptik Datta, Qi Wang, and Krishnamoorthy Sivakumar. (2003).  Proceedings of the Third ICDM IEEE International Conference on Data Mining (p. 99-107)  (ICDM- 2003), Melbourne, FL, November,2003

Abstract:Privacy is becoming an increasingly important issue in many data mining applications. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them use randomized data distortion techniques to mask the data for preserving the privacy of sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values often using additive noise. This paper questions the utility of the random value distortion technique in privacy preservation. The paper notes that random objects (particularly random matrices) have “predictable” structures in the spectral domain and it develops a random matrix-based spectral filtering technique to retrieve original data from the dataset distorted by adding random values. The paper presents the
theoretical foundation of this filtering method and extensive experimental results to demonstrate that in many cases random data distortion preserve very little data privacy.

Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems.
R. Wolff, K. Bhaduri, H. Kargupta.  SIAM International Conference in Data Mining, Bethesda, Maryland, USA. 2006
 
 Abstract : In a large network of computers, wireless sensors,   or mobile devices, each of the components (hence, peers) has some   data about the global status of the system. Many of the functions   of the system, such as routing decisions, search strategies, data   cleansing, and the assignment of mutual trust, depend on the global status.  Therefore, it is essential that the system be able to detect, and   react to, changes in its global status.   Computing global predicates in such systems is usually very costly.   Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost   of communication. The cost further increases when the data changes rapidly   (due to state changes, node failure, etc.) and   computation has to follow these changes. In this paper we describe a two step   approach for dealing with these costs. First, we describe a highly efficient   local algorithm which detect when the L2 norm of the average data surpasses  a threshold. Then, we use this algorithm as a feedback loop for the monitoring  of complex predicates on the data -- such as the data's k-means clustering.   The efficiency of the L2 algorithm guarantees that so long as the clustering   results represent the data (i.e., the data is stationary) few resources are required.   When the  data undergoes   an epoch change -- a change in the underlying distribution --  and the model no longer represents it,   the feedback loop   indicates this and the model is rebuilt. Furthermore, the existence of   a feedback loop allows using approximate and ``best-effort'' methods for   constructing the model; if an ill-fit model is built the feedback loop   would indicate so, and the model would be rebuilt.

 

Identifying Significant Inner Product Elements in a Peer-to-Peer Network.
 K. Das, K. Bhaduri, H. Kargupta. In communication. 2006.

Abstract: Peer-to-peer networks are getting increasing attention in various   applications involving distributed file sharing, sensor networks, and   mobile ad hoc networks. Efficient identification of top few inner   product entries from the entire inner product matrix of features   in a distributed peer-to-peer network is a challenging problem since   centralizing the data from all the nodes in a synchronous,   communication efficient manner may not be an option. Inner product computation is an   important primitive used in many techniques for feature dependency   detection, distance computation, clustering and correlation computation   among others. This paper deals with the problem of identifying significant   inner products among features in a peer-to-peer environment where different   nodes observe a different set of data. It uses an ordinal framework   to develop probabilistic algorithms to find top l elements in the inner product matrix   which can be used for many data mining applications. It also presents experimental results demonstrating scalable performance of   this algorithm for large peer-to-peer networks.

Client-side Web Mining for Community Formation in Peer-to-Peer Environments.

K. Liu, K Bhaduri, K. Das, P. Nguyen, H. Kargupta. SIGKDD workshop on web usage and analysis (WebKDD).
  Philadelphia, Pennsylvania, USA. 2006.


Abstract: In this paper we present a framework for forming interests-based   Peer-to-Peer communities using client-side web browsing history. At the   heart of this framework is the use of an order statistics-based   approach to build communities with hierarchical structure. We have also   carefully considered privacy concerns of the peers and adopted cryptographic   protocols to measure similarity between them without disclosing their   personal profiles. We evaluated our framework on a distributed data   mining platform we have developed. The experimental results show that our   framework could effectively build interests-based communities.

Monitoring Any Data Model in a Large Distributed System.


R. Wolff, K. Bhaduri, H. Kargupta. In communication. 2006.

Abstract: In a large network of computers, wireless sensors, or mobile  devices, each of the components (hence, peers) has some data about  the global status of the system. Many of the functions of the  system, such as routing decisions, search strategies, data  cleansing, and the assignment of mutual trust, rely on modeling the  global status, and many of these model-ling techniques are akin to  data mining. In general, we refer to the outcome of the function  (e.g., the trust assigned to each peer) as the \emph{model} of the  system. Since the state of the system is in constant change, it is  necessary to monitor the suitability of the model to the state, and  react in case it is unsuitable.
 

Computing global predicates in large distributed systems is very  costly. Mainly, this is because of system's scale, and in some cases  (e.g., sensor networks) also because of the high cost of  communication. The cost further increases when the data changes  rapidly. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local  algorithm which can be used to monitor any ordinal function of the  global state, i.e., almost any data mining model. Then, we use this  algorithm as a feedback loop for the monitoring of complex  predicates on the data -- such as the data's k-means clustering. The efficiency of the local algorithm guarantees that so long as the  model represents the data (i.e., the data is stationary) few  resources are required. When the  data undergoes an epoch change --  a change in the underlying distribution --  and the model no longer  represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using  approximate and ``best-effort'' methods for  constructing the model;  if an ill-fit model is built the feedback loop would indicate so,  and the model would be rebuilt.

Random projection-based multiplicative data perturbation for privacy preserving distributed data mining.

K. Liu, H. Kargupta, and J. Ryan. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1):92-106, January 2006.

Abstract: This paper explores the possibility of using multiplicative random projection matrices for privacy preserving distributed data mining. It specifically considers
the problem of computing statistical aggregates like the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix from distributed privacy sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data-mining problems such as clustering, principal component analysis, and classification. This paper makes primary contributions on two different grounds. First, it explores Independent Component Analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Then, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacypreserving data mining applications.

 

An attacker's view of distance preserving maps for privacy preserving data mining.

K. Liu, C. Giannella, and H. Kargupta.In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006.

Abstract: We examine the effectiveness of distance preserving transformations in privacy preserving data mining. These techniques are potentially very useful in that some important data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data e.g. distance-based clustering, k-nearest neighbor classification. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with two types of prior information regarding the original data. We examine how well the attacker can recover the original data from the transformed data and prior information. Our results offer insight into the vulnerabilities of distance preserving transformations.

Communication efficient construction of decision trees over heterogeneously distributed data.

C. Giannella, K. Liu, T. Olsen, and H. Kargupta. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 67-74, Brighton, UK, November 2004.

Abstract: We present an algorithm designed to efficiently construct a decision tree over heterogeneously distributed data without centralizing. We compare our algorithm against a standard centralized decision tree implementation in terms of accuracy as well as the communication complexity. Our experimental results show that by using only 20% of the communication cost necessary to centralize the data we can achieve trees with accuracy at least 80% of the trees produced by the centralized version.

Vedas: A mobile and distributed data stream mining system for real-time vehicle monitoring.

H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy.
In Proceedings of the 2004 SIAM International Data Mining Conference (SDM'04), Orlando, FL, April 2004.

Abstract: This paper presents an overview of an experimental mobile and distributed data stream mining system that allows real time monitoring of vehicle diagnostic data for vehicle-health monitoring and driver characterization. It offers the motivation behind this application, explains the system architecture, discusses many of challenges that the project faced, and shares some of the adopted solutions. The main contribution of the paper is our experience in building one of the very early distributed data stream mining system for wireless applications that performs most of the data analysis related tasks using light-weight onboard computing devices.
This paper points out that the distributed data mining technology can play a key role in solving real-life problems in a mobile application environment where computing, storage, power, and communication resources are limited. The paper also illustrates how privacy-preserving distributed data mining can play an important role in this type of applications.

Privacy sensitive distributed data mining from multi-party data.

H. Kargupta, K. Liu, and J. Ryan. In Proceedings of the First NSF/NIJ Symposium on Intelligence and Security Informatics, Lecture Notes in Computer Science, pages 336-342, Tucson, AZ, June 2003. Springer Berlin/Heidelberg.

Abstract: Privacy is becoming an increasingly important issue in data mining, particularly in security and counter-terrorism-related applications where the data is often sensitive. This paper considers the problem of mining privacy sensitive distributed multi-party data. It specifically considers the problem of computing statistical aggregates like the correlation matrix from privacy sensitive data where the program for computing the aggregates is not trusted by the owner(s) of the data. It presents a brief overview of a random projection-based technique to compute the correlation matrix from a single third-party data site and also multiple homogeneous sites.

Homeland security and privacy sensitive data mining from multi-party distributed resources.

H. Kargupta, K. Liu, S. Datta, J. Ryan, and K. Sivakumar.In Proceedings of the 12th IEEE International Conference on Fuzzy Systems, volume 2, pages 1257-1260, St. Louis, MO, May 2003.

Abstract: Defending the safety of an open society from terrorism or other similar threats requires intelligent but careful ways to monitor different types of activities and transactions in the electronic media. Data mining techniques are playing an increasingly important role in sifting through large amount of data in search of useful patterns that might help us in securing our safety. Although the objective of this class of data mining applications is very well justi
ed, they also open up the possibility of misusing personal information by malicious people with access to the sensitive data. This brings up the following question: Can we design data mining techniques that are sensitive to privacy? Several researchers are currently working on a class of data mining algorithms that work without directly accessing the sensitive data in their original form. This paper considers the problem of mining distributed data in a privacy-sensitive manner. It
first points out the problems of some of the existing privacy-sensitive data mining techniques that make use of additive random noise to hide sensitive information. Next it briefly reviews some new approaches that make use of random projection matrices for computing statistical aggregates from sensitive data.

Orthogonal Decision Trees.

H. Kargupta, B. Park, H. Dutta. Orthogonal Decision Trees. IEEE Transactions on Knowledge and Data Engineering, 2006.

Abstract: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as Bagging, Boosting, Random Forests, and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers a technique to construct such trees based on the Fourier transformation of decision trees and eigen-analysis of the ensemble in the Fourier representation. It offers experimental results to document the performance of orthogonal trees on the grounds of accuracy and model complexity.

Orthogonal Decision Trees.

H. Kargupta and H. Dutta (2004). The Fourth IEEE International Conference on Data Mining. Brighton, UK , pages 487—490.

Abstract: This paper introduces orthogonal decision trees that offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-treeensembles often created by popular techniques such as Bagging, Boosting, Random Forests and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper offers a technique to construct such trees based on eigen-analysis of the ensemble and offers experimental results to document the performance of orthogonal trees on grounds of accuracy and model complexity.

Orthogonal Decision Trees for Resource-Constrained Physiological Data Stream Monitoring using Mobile Devices.

H. Dutta and H. Kargupta and A. Joshi. Lecture Notes in Computer Science (3769) High Performance Computing - HiPC 2005.

Abstract: Several challenging new applications demand the ability to do data mining on resource constrained devices. One such application is that
of monitoring physiological data streams obtained from wearable sensing devices. Such monitoring has applications for pervasive healthcare management, be it for seniors, emergency response personnel, soldiers in the battlefield or atheletes. A key requirement is that the monitoring system be able to run on resouce constrained handheld or wearable devices. Orthogonal decision trees (ODTs) offer an effective way to construct a redundancy-free, accurate, and meaningful representation of large decision-tree-ensembles often created by popular techniques such as Bagging, Boosting, Random Forests and many distributed and data stream mining algorithms. Orthogonal decision trees are functionally orthogonal to each other and they correspond to the principal components of the underlying function space. This paper discusses various properties of the ODTs and their suitability for monitoring physiological data streams in a resource-constrained environment. It offers experimental results to document the performance of orthogonal trees on grounds of accuracy, model complexity, and other characteristics in a resource-constrained mobile environment.

Efficient Kernel Density Estimation Over Distributed Data.

C. Giannella, H. Dutta, S. Mukherjee, H. Kargupta. Proceedings of the 9th International Workshop on High Performance and Distributed Mining, SIAM International Data Mining Conference, Bethesda, USA. April, 2006.

We consider the problem of Kernel Density Estimation (KDE) over distributed data. Our work is motivated by the fact that, in some cases, large data repositories are, in their native form, distributed over many sites. In environments where communication is limited, centralization of the data can be problematic. We examine a framework where the sites build local models which are combined at a central site to produce an approximation to the KDE over the entire dataset. We carry out experiments comparing two realizations of this framework (uniform sampling and linear binning). We compare these two techniques in terms of their accuracy at: directly approximating the centralized KDE (measured by 2-norm function distance), density-based classification, and density-based clustering.

Distributed Data Mining in Astronomy Catalogs.

C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta.  Special Issue of Concurrency and Computation: Practice and Experience, 2006 (In Communication)

Abstract: The design, implementation, and archiving of very large sky surveys is playing an increasingly important role in today's astronomy research.  However, these data archives will necessarily be
geographically distributed.  To fully exploit the potential of this data lode, we believe that capabilities ought to be provided allowing users a more communication-efficient alternative to multiple archive data analysis than first downloading the archives fully to a centralized site. In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs.  The system 
is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully down-loaded to a centralized server. To illustrate the potential effectiveness of our system, we develop communication-efficient distributed algorithms for principal component analysis (PCA) and outlier detection.  Then, we carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.  In particular, PCA enables dimensionality reduction within a set of correlated physical
parameters, such as a reduction of a 3-dimensional data distribution (in astronomer's observed units) to a planar data distribution (in fundamental physical units).  Fundamental physical insights are thereby
enabled through efficient access to distributed multi-dimensional data sets.

Distributed Data Mining for Astronomy Catalogs.

C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. Proceedings of 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International Conference on Data Mining (SDM), 2006.

Abstract: The design, implementation, and archiving of very large sky surveys is playing an increasingly important role in today's astronomy research. However, these data archives will necessarily be geographically distributed. To fully exploit the potential of this data, we believe that capabilities ought to be provided allowing users a more communication-efficient alternative to multiple archive data analysis than first down-loading the archives fully to a centralized site.

In this paper, we describe the architecture of a system, DEMAC, for the distributed mining of massive astronomic catalogs. The system is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully down-loaded to a centralized server. To illustrate the potential effectiveness of DEMAC, we carry out a case
study using distributed principal component analysis (PCA) for detecting fundamental planes of astronomical parameters. In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such as reduction of a 3-dimensional data distribution (in astronomer's observed units) to a planar data distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through efficient access to distributed multi-dimensional data sets.

Privacy Preserving Data Mining and Random Perturbation.

H. Kargupta, H. Dutta, S. Datta , K. Sivakumar. Proceedings of the Workshop on Privacy in the Electronic Society (WPES'03), Washington  DC,October,2003

Abstract: Privacy is becoming an increasingly important issue in many data mining applications, particularly in the security and defense area. This has triggered the development of many privacy-preserving data mining techniques. A large fraction of them uses randomized data distortion techniques to mask the data for preserving the privacy. They attempt to hide the sensitive data by randomly modifying the data values
using additive noise. This paper questions the utility of such randomized data distortion technique for preserving privacy in many cases and urges caution.It notes that random objects (particularly random matrices) have "predictable" structures in the spectral domain and then offers a random matrix-based spectral filtering technique to retrieve original data from the data-set distorted by adding random values. It extends our earlier work questioning the efficacy of random perturbation techniques using additive noise for privacy-preserving data mining in continuous valued domain and presents new results in the discrete domain. It shows that the growing collection of random perturbation-based "privacy-preserving" data mining techniques may need a careful scrutiny in order to prevent privacy breaches through linear transformations. The paper also presents extensive experimental results in order to support this claim.