Gene Expression: The Missing Link of Evolutionary Computation, from Theory to Applications in Distributed Data Mining

Hillol Kargupta
Assistant Professor, Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County, Baltimore,  MD 21250
Voice: (410) 455-3972, Fax : (410) 455-3969
Email: hillol@cs.umbc.edu
PI URL: http://www.cs.umbc.edu/~hillol
Project URL: http://www.cs.umbc.edu/~hillol/GEXP/gene_expression.html
Link to a related project on Genetic Codes (http://www.cs.umbc.edu/~hillol/GEXP/genetic_code_project.html).
Project Award Information
Award Number:  National Science Foundation Grant IIS-0196401
Duration:  September 15, 1998-2003
Title: Gene Expression: The missing Link of Evolutionary Computation, from Theory to Applications in Distributed Data Mining.
Keywords: Gene expression, evolutionary computation, function induction, distributed data mining
Project Summary
Construction of the phenotype from the information coded in the genome is often called the gene expression process in biology. Transformation of the DNA to the protein structure plays an important role in this process. Since the structure of the protein determines the efficacy of the DNA, gene expression can be viewed as the process of computing the mapping of the genotype to the phenotype. For some reason, it does not directly compute this function using the DNA itself. It first transforms the DNA representation to mRNA and subsequently to protein before evaluating its phenotype. Representation transformations play an important role in efficient problem solving in many fields such as physics, mathematics, engineering, machine learning, and optimization. Therefore representation transformations in gene expression allude intriguing possibilities. However, very little is understood about the role of gene expression in evolutionary search. The primary objective of this project is to  understand the representation transformations in gene expression (DNA--> mRNA-->Protein) from the perspective of search and function induction from observed data. The project will also foundations for developing gene expression-based genetic algorithms for scalable optimization, learning, and distributed data mining. This research has produced a new randomized algorithm to detect the decomposed representation of an objective function with n variables. This algorithm can be effectively used for detecting the search space structure in large scale optimization problems. Fourier analysis of genetic code-like transformations provided an intriguing result. It showed that such transformations can construct a Fourier representation of functions with only a polynomial number of terms that are  exponentially more significant than the rest. This is a very important property needed for efficient induction of function representation from data which is a fundamental problem in inductive learning, search, and optimization. Exploration of randomized genetic code-like transformation has produced deep insights about the effect of such transformations on efficient and approximate solving of non-linear search and optimization problems using linear techniques.
Publications and Products
Journals

  1. H. Kargupta, R. Ayyagari, and S. Ghosh. (2003). Learning Functions Using Randomized Expansions: Probabilistic Properties and Experimentations. IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 894-908. (pdf) (A paper on probablistic properties of randomized genetic code-like transformations)
  2. Kargupta, H. and Park, B. (2001).Gene Expression and Fast Construction of Distributed Evolutionary Representation. Journal of Evolutionary Computation. Vol. 9, no.1, pages 1--32.
  3. Kargupta, H. (2000). A Striking Property of Genetic Code-like Transformations. Complex System Journal. Vol. 13, no. 1, pages 1--32. (pdf format)
  4. Kargupta, H. (1999). SEARCH, Computational Processes in Evolution,and Preliminary Development of the Gene Expression Messy Genetic Algorithm. Journal of Complex Systems. Volume 11, issue 4, pp 233-287.
  5. Kargupta, H. (1999). Scalable Evolutionary Computation. Journal of Evolutionary Computation. vol 7. no 4. iii--iv. MIT Press.
  6. Kargupta, H. and Bandyopadhyay, S. (1998). A Perspective On The Foundation And Evolution Of The Linkage Learning Genetic Algorithms. Accepted in the Special Issue on Genetic Algorithms, the Journal of Computer Methods in Applied Mechanics and Engineering (CMAME). Guest Eds: David E. Goldberg and Kalyanmoy Deb.
Conferences
  1. Kargupta, H. (2001). Learning through Genetic Codes. Accepted for publication in theGenetic and Evolutionary Computation Conference Workshop on the Computation in Gene Expression.
  2. Kargupta, H. (2000). Computation in Genetic Code-Like Transformations. Proceedings of the Genetic and Evolutionary Computation Conference. AAAI Press.
  3. Kargupta, H. and Park, B. (1999). Fast Construction of Distributed and Decomposed Evolutionary Representation. Late Breaking Proceedings of the    Genetic and Evolutionary Computation Conference. 139--148. AAAI Press.
  4. Kargupta, H. and Sarkar, K. (1999). Function Induction, Gene Expression, and Evolutionary Representation Construction.  Proceedings of the Genetic and Evolutionary Computation Conference. vol 1. 313--320. AAAI Press.
  5. Kargupta, H. & Stafford, B. (1997). From DNA to Protein: Transformations and Their Possible Role in Linkage Learning. Proceedings Of The International Conference On Genetic Algorithms, Michigan. Morgan Kaufmann Publishers. 409--416.
  6. Kargupta, H. (1996). Computational Processes of Evolution: The SEARCH Perspective. Presented in SIAM Annual Meeting, 1996 as the winner of the 1996 SIAM Annual Best Student Paper Prize.
Book Chapters
  1. Kargupta, H. (2000). Representation Constructions in Gene Expression. Submitted to the Foundation of Genetic Algorithms. Morgan Kaufmann Pubs.
  2. Kargupta, H. (2000). Gene Expression and Scalable Genetic Search. To be published in Theory and application of evolutionary computation: recent trends. Eds:Ashish Ghosh and Shigeyeoshi Tsutsui, Springer-Verlag, UK.
  3. Kargupta, H. and Bandyopadhyay, S. (1998). Further Experimentations on the Scalability of the GEMGA. Lecture Notes in Computer Science. Vol. 1498. 315-324. Springer Verlag.
  4. Kargupta, H. (1998). Gene Expression and Large Scale Evolutionary Optimization. Computational Aerosciences in the 21st Century. Kluwer Academic Publishers.
  5. Kargupta, H. (1997). Gene Expression: The Missing Link in Evolutionary Computation. Chapter 4.  Genetic Algorithms in Engineering and Computer Science. Eds: D. Quagliarella, J. Periaux, C. Poloni and G. Winter. John Wiley & Sons Ltd. 59--83.
Workshops and Special Issues Organized by the PI
1. Journal of Genetic Programming and Evolvable Hardware Journal special issue on Computation in Gene Expression.
2. Complex Systems Journal Special Issue on the Computation in Gene Expression
3. GECCO-2000 Workshop, Gene Expression: Missing Link in Evolutionary Computation
4. Special Issue of Journal of Evolutionary Computation on Scalable Evolutionary Computation (vol 7, Number 4, 1999)
Project Impact
1. Human Resources: Four graduate students are alternately supported by this project with complementary support from the department. Two undergraduate students were employed by the PI and one of them is now a graduate student working on this project. One of the graduate students is an woman.Three of them are working toward Ph.D. and one toward M.S.
2. Development of the Genetic Algorithms course (cpts 549, School of EECS, Washington State University).
3. Development of the School of EECS DIADIC laboratory comprised of four Linux PCs, several Sparc machines, and one HP printer. This grant money is also used for supporting the everyday maintenance of these machines by the School of EECS systems maintenance group.
4. Development of the Genetic Algorithms course (CSEE, University of Maryland Baltimore County).
5. Development of a research laboratory at CSEE, University of Maryland Baltimore County.
Goals, Objectives, and Targeted Activities
The goal of this research project is to develop an efficient gene expression-based scalable evolutionary algorithm for search, optimization, and machine learning. This work will also apply the developed algorithms for distributed data mining. The approach consists of,
 1. abstracting the transcription (DNA->mRNA), translation (mRNA->Protein sequence), and folding (sequence to 3D folded structure of protein)   operations in the light of probabilistic and approximate representation construction;
 2. development of search operators using the developed understanding
 3. experimentation with large scale optimization and incorporation of these mechanisms into an experimental distributed data mining system for discovering knowledge from distributed data.
Upon successful completion, this work will revolutionize the field of evolutionary adaptive algorithms. It will offer a scalable approach for solving large optimization, machine learning, and data mining problems. Moreover, it will offer a fundamental understanding of the natural gene expression process from the perspective of evolutionery computation.
Area Background
Little work has been done that explores the role of gene expression from the perspective of scalable learning and optimization. This work will lay the foundation of the research in this area. For a review of the related literature please see (Kargupta, 1999)
  • H. Kargupta, R. Ayyagari, and S. Ghosh. (2003). Learning Functions Using Randomized Expansions: Probabilistic Properties and Experimentations. IEEE Transactions on Knowledge and Data Engineering, Volume 16, Number 8, pages 894-908. (pdf)

  • Area References
    1. Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. New York
    2. Holland, J. (1975).  Adaptation in Natural and Artificial Systems. University of Michigan Press. Ann Arbor
    3. Kargupta, H. (1997). Gene Expression: The Missing Link In Evolutionary Computation. Chapter 4.  Genetic Algorithms in Engineering and Computer Science. Eds: D. Quagliarella, J. Periaux, C. Poloni and G. Winter. John Wiley & Sons Ltd. 59--83.
    Potential Related Projects:
    Extension of the analysis of the binary genetic code to higher cardinality representations, techniques for adaptively constructing genetic-code like representation transformations, applications of the developed technqiues for large scale optimization and machine learning.