Distributed and Peer-to-Peer Data Mining for Scalable Analysis of Data from Virtual Observatories

Funding Agency: NASA
Duration: 2007-2010

 

Project Summary
The design, implementation, and archiving of very large sky surveys play a critical role in today's astronomy research. However, astronomers will be unable to tap the riches of this collection of gigabyte, terabyte, and (eventually) petabyte catalogs without a computational backbone that supports queries and data mining across distributed virtual tables of decentralized, joined, and integrated sky survey catalogs. Moreover, local data management systems such as MyDB, MySpace in AstroGrid, and Grid Bricks are becoming increasingly popular for storing and managing users' local data. This opens up the possibility of constructing a Peer-to-Peer (P2P) network for data sharing and mining.


This document proposes research and development for a new generation of scalable data analytic services for the National Virtual Observatory (NVO), based on advanced distributed and P2P data mining capabilities across multiple data repositories. The research will develop technology for supporting web services within the NVO that allow astronomy researchers to analyze data from multiple surveys using fundamentally distributed algorithms. It will also develop several distributed data mining algorithms for analyzing distributed astronomy catalogs without requiring the data to be downloaded and centralized. Specific objectives include the following:

(1) The project will design and implement distributed algorithms for computing statistical primitives, principal component analysis, and outlier detection from distributed astronomy catalogs and their partial images stored in users' local data management systems. These algorithms will be able to analyze data without requiring source catalogs to be downloaded and centralized.

(2) The project will develop a prototype system offering a rich collection of web services based on various distributed data mining (DDM) algorithms. These services offer a novel augmentation to the existing NVO environment and will support a rich variety of data mining tasks that work in a distributed fashion.

(3) The developed system will be tested using specific astronomical research problems. In particular, we will explore the multi-dimensional multi-wavelength parameter space of astrophysical properties of starbursting galaxies. We will search for unusual correlations, outliers, sub-clusters, and fundamental planes within the multi-dimensional parameter space presented by several large surveys.
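
To illustrate the idea behind objective (1), the following is a minimal sketch, not the project's actual algorithms. It assumes every site holds rows with the same numeric columns and shares only small summaries (a row count, column sums, and sums of products), from which a coordinator recovers the global mean, covariance, and leading principal component. The function names (local_stats, merge_stats, top_component) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]
# (row count, per-column sums, matrix of summed products x_i * x_j)
Stats = Tuple[int, List[float], List[List[float]]]


def local_stats(rows: Sequence[Row]) -> Stats:
    """Runs at each site: summarize the local rows without exporting them."""
    d = len(rows[0])
    sums = [0.0] * d
    prods = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            sums[i] += r[i]
            for j in range(d):
                prods[i][j] += r[i] * r[j]
    return len(rows), sums, prods


def merge_stats(parts: Sequence[Stats]) -> Tuple[List[float], List[List[float]]]:
    """Runs at the coordinator: global mean and sample covariance."""
    d = len(parts[0][1])
    n = sum(p[0] for p in parts)
    mean = [sum(p[1][i] for p in parts) / n for i in range(d)]
    cov = [
        [(sum(p[2][i][j] for p in parts) - n * mean[i] * mean[j]) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def top_component(cov: List[List[float]], iters: int = 200) -> List[float]:
    """Leading principal direction by power iteration.

    Assumes cov is nonzero and the all-ones start vector is not
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Toy data: two sites, two columns each (values are made up).
site_a = [(1.0, 2.0), (2.0, 4.0)]
site_b = [(3.0, 6.0), (4.0, 8.0)]

mean, cov = merge_stats([local_stats(site_a), local_stats(site_b)])
direction = top_component(cov)
print(mean, cov, direction)
```

Only the summaries cross the network, so the traffic per site depends on the number of columns rather than the number of rows. The sum-of-products formula can lose precision on large or poorly scaled data, so a production version would likely use a more stable merging scheme.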

This research will enable increased productivity of NASA's Science Mission Directorate (SMD) research endeavors through rapid multi-mission correlative analysis, such as distributed and P2P mining of the large survey catalogs from GALEX, Spitzer, 2MASS, and eventually WISE.
 

Publications and Products

Please visit our publications page.


Other Resources

Distributed Data Mining Bibliography (www.cs.umbc.edu/~hillol/DDMBIB)