Using Statistical Properties of Text to Create Metadata

Grace Crowder and Charles Nicholas
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
Baltimore, MD 21228-5398
e-mail: crowder@cs.umbc.edu
phone: (410) 455-3963
fax: (410) 455-3969

Given a large, dynamic, distributed text corpus, traditional Information Retrieval (IR) techniques do not scale well. For example, the amount of memory needed to hold an inverted file index and the time required to compute results for a query become unmanageable as the corpus grows into the gigabytes. We propose a solution based on a mediated architecture that uses metadata to direct activities to the appropriate agents. A mediated agent architecture consists of application agents managing local corpora and brokers using metadata to direct search. In our approach, metadata is the key to scalability: it characterizes the data for purposes of query, search, browse, and retrieval. We require that metadata have the following properties:

Effective - if the metadata says the information is [not] relevant, it is probably [not] relevant
Concise - smaller than the data it describes
Generated automatically - no human intervention
Abstractable - metadata of metadata can be created
Interchangeable - the form must be the same for queries, documents, and metadata

We call our mediated architecture CAFE, for Cooperating Agents Find Everything. The system contains application agents and brokers. Application agents communicate with other entities in the CAFE system using KQML (Knowledge Query and Manipulation Language). In addition, each application agent is responsible for its own data space: it provides access to that data by generating and registering metadata, and it performs the more traditional IR tasks of handling queries, corpus updates, and so on.
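This division of labor can be sketched in a few lines of Python. The names here (Broker, register, route) are our own illustration, not part of CAFE, and cosine similarity over term-weight dictionaries stands in for whatever metadata comparison an actual broker would use; treat this as a sketch of the routing idea, not an implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Broker:
    """Hypothetical broker: keeps the metadata registered by application
    agents and directs each incoming query to the best-matching agent."""

    def __init__(self):
        self.registry = {}          # agent name -> metadata (term weights)

    def register(self, agent, metadata):
        """Called by an application agent (or another broker) to register
        metadata describing its data space."""
        self.registry[agent] = metadata

    def route(self, query):
        """Direct the query to the agent whose registered metadata is
        most similar to the query's own representation."""
        return max(self.registry, key=lambda a: cosine(query, self.registry[a]))

broker = Broker()
broker.register("news-agent", {"election": 1.0, "senate": 0.6})
broker.register("bio-agent", {"genome": 1.0, "protein": 0.6})
broker.route({"genome": 1.0})   # routed to "bio-agent"
```

Note that because queries, documents, and metadata share one representation (the interchangeability requirement above), the same similarity function serves for broker routing and for retrieval within an agent's corpus.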
Brokers are similar to application agents, except that the data they manage is metadata, supplied to them by application agents (and other brokers).

An n-gram is a sequence of n characters. Example 5-grams taken from this sentence: examp, xampl, ample. IR systems using n-grams as terms have been shown to be language independent, garble resistant, and effective for corpus querying. A frequency distribution of the n-grams in a document is called its n-gram profile. From prior work we know that a document's n-gram profile serves to characterize its content. The mean of the collected n-gram profiles of a group of documents is their centroid; the whole corpus has a centroid, as does any subset of its documents. Using n-gram profiles as metadata is appealing because, through our work with Telltale, n-grams have proven to be effective, concise, automatically generated, abstractable, and interchangeable.

We are investigating how to combine the statistical properties of text uncovered through n-gram analysis with traditional IR approaches to text processing; this may give us word and phrase weighting in a language-independent system. We are also performing additional performance analysis of Telltale to better understand the use of n-grams in text retrieval.

In conclusion, CAFE is a large-scale document information retrieval system that can handle large, dynamic corpora by using metadata in a mediated agent architecture.