Distributed and Peer-to-Peer Data Mining for
Scalable Analysis of Data from Virtual Observatories
Funding Agency: NASA
Duration: 2007-2010
Project Summary
Design, implementation, and archiving of very large sky surveys play a critical role in today's
Astronomy research. However, astronomers will be unable to tap the riches of
this collection of gigabyte, terabyte, and (eventually) petabyte catalogs
without a computational backbone that includes support for queries and data
mining across distributed virtual tables of decentralized, joined, and
integrated sky survey catalogs. Moreover, use of local data management systems
such as MyDB, MySpace in AstroGrid, and Grid Bricks for storing and managing
users' local data is becoming increasingly popular. This is opening up the
possibility of constructing a Peer-to-Peer (P2P) network for data sharing and
mining.
This document proposes research and development for a new generation of scalable
data analytic services for the National Virtual Observatory (NVO) based on advanced distributed and P2P data
mining capabilities across multiple data repositories. This research will
develop technology for supporting web services within the NVO that will allow
astronomy researchers to analyze data from multiple surveys using fundamentally
distributed algorithms. It will also develop several distributed data mining
algorithms for analysis of distributed Astronomy catalogs without requiring the
data to be downloaded and centralized. Specific objectives include the following
items:
(1) The project will design and implement distributed algorithms for computing
statistical primitives, principal component analysis, and outlier detection from
distributed Astronomy catalogs and their partial images stored in users' local
data management systems. These algorithms will be able to analyze data without
requiring source catalogs to be downloaded and centralized.
(2) The project will develop a prototype system that offers a rich
collection of web services based on various distributed data mining (DDM)
algorithms. The system will be a novel augmentation of the existing NVO
environment and will support a wide variety of data mining tasks that run in a
distributed fashion.
(3) The developed system will be tested using specific astronomical research
problems. In particular, we will explore the multi-dimensional multi-wavelength
parameter space of astrophysical properties of starbursting galaxies. We will
search for unusual correlations, outliers, sub-clusters, and fundamental planes
within the multi-dimensional parameter space presented by several large surveys.
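As an illustration of objective (1), the following is a minimal sketch of how statistical primitives, principal component analysis, and outlier detection can be computed without centralizing the source data. It is not the project's algorithm. It assumes the simplest case, in which each site holds different rows of a table with the same numeric columns, so that each site only needs to send a small summary (row count, column sums, and a cross-product matrix). Cross-matched catalogs in which each survey holds different attributes of the same objects would need a different scheme. All function names and the threshold value are hypothetical.

```python
import numpy as np

def local_summary(X):
    """Summary a single site would send: row count, column sums, and X^T X."""
    X = np.asarray(X, dtype=float)
    return X.shape[0], X.sum(axis=0), X.T @ X

def combine(summaries):
    """Merge per-site summaries into the global mean and sample covariance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    scatter = sum(s[2] for s in summaries)
    mean = total / n
    # sum_i (x_i - mean)(x_i - mean)^T = sum_i x_i x_i^T - n * mean mean^T
    cov = (scatter - n * np.outer(mean, mean)) / (n - 1)
    return mean, cov

def principal_components(cov):
    """Eigen-decomposition of the covariance, largest variance first."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def local_outliers(X, mean, cov, threshold=25.0):
    """Indices of local rows whose squared Mahalanobis distance from the
    global mean exceeds `threshold` (an arbitrary illustrative cutoff)."""
    diff = np.asarray(X, dtype=float) - mean
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.flatnonzero(d2 > threshold)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical sites, each with its own rows of a 3-column table.
    sites = [rng.normal(size=(100, 3)) for _ in range(3)]
    mean, cov = combine([local_summary(X) for X in sites])
    variances, components = principal_components(cov)
    print("global mean:", mean)
    print("variance along each principal component:", variances)
```

Only the summaries cross the network, and their size depends on the number of columns rather than the number of rows. Once the global mean and covariance are broadcast back, each site can score its own rows with `local_outliers` without exposing them.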
This research will enable increased productivity of NASA's Science Mission
Directorate (SMD) research endeavors through rapid multi-mission correlative
analysis, such as distributed and P2P mining of the large survey catalogs from
GALEX, Spitzer, 2MASS, and eventually WISE.
Publications and Products
Please visit our publications page.