Using Statistical Properties of Text to Create Metadata

Grace Crowder and Charles Nicholas
Department of Computer Science and Electrical Engineering
University of Maryland Baltimore County
Baltimore, MD 21228-5398
e-mail: crowder@cs.umbc.edu
phone: (410) 455-3963
fax: (410) 455-3969

Given a large, dynamic, distributed text corpus, traditional Information Retrieval (IR) techniques do not scale well. For example, the amount of memory needed to hold an inverted file index and the time required to compute results for a query become unmanageable as the corpus grows into the gigabytes. We propose a solution based on a mediated architecture that uses metadata to direct activities to the appropriate agents. A mediated agent architecture consists of application agents managing local corpora and brokers using metadata to direct search. In our approach, metadata is the key to scalability: it characterizes the data for purposes of query, search, browse, and retrieval. We require that metadata have the following properties:

Effective - if the metadata says the information is [not] relevant, it is probably [not] relevant
Concise - smaller than the data it describes
Generated automatically - no human intervention
Abstractable - metadata of metadata can be created
Interchangeable - the form must be the same for queries, documents, and metadata

We call our mediated architecture CAFE, for Cooperating Agents Find Everything. The system contains application agents and brokers. Application agents communicate with other entities in the CAFE system using KQML (Knowledge Query and Manipulation Language). In addition, each application agent is responsible for its own data space: it provides access to that data by generating and registering metadata, and it performs the more traditional IR tasks of handling queries, corpus updates, and so on.
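This division of labor can be sketched in a few lines of Python. The names here (Broker, register, route) are our own illustration, not part of CAFE, and cosine similarity over term-weight dictionaries stands in for whatever metadata comparison an actual broker would use; treat this as a sketch of the routing idea, not an implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Broker:
    """Hypothetical broker: keeps the metadata registered by application
    agents and directs each incoming query to the best-matching agent."""

    def __init__(self):
        self.registry = {}          # agent name -> metadata (term weights)

    def register(self, agent, metadata):
        """Called by an application agent (or another broker) to register
        metadata describing its data space."""
        self.registry[agent] = metadata

    def route(self, query):
        """Direct the query to the agent whose registered metadata is
        most similar to the query's own representation."""
        return max(self.registry, key=lambda a: cosine(query, self.registry[a]))

broker = Broker()
broker.register("news-agent", {"election": 1.0, "senate": 0.6})
broker.register("bio-agent", {"genome": 1.0, "protein": 0.6})
broker.route({"genome": 1.0})   # routed to "bio-agent"
```

Note that because queries, documents, and metadata share one representation (the interchangeability requirement above), the same similarity function serves for broker routing and for retrieval within an agent's corpus.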
Brokers are similar to application agents, except that the data they manage is metadata, supplied to them by application agents (and other brokers).

An n-gram is a sequence of n characters. Example 5-grams taken from this sentence: examp, xampl, ample. IR systems using n-grams as terms have been shown to be language independent, garble resistant, and effective for corpus querying. A frequency distribution of the n-grams in a document is called its n-gram profile. From prior work we know that a document's n-gram profile serves to characterize its content. The mean of the collected n-gram profiles of a group of documents is their centroid; the whole corpus has a centroid, as does any subset of its documents. Using n-gram profiles as metadata is appealing because, through our work with Telltale, n-grams have proven to be effective, concise, automatically generated, abstractable, and interchangeable.

We are investigating how to combine the statistical properties of text uncovered through n-gram analysis with traditional IR approaches to text processing; this may give us word and phrase weighting in a language-independent system. We are also performing additional performance analysis of Telltale to better understand the use of n-grams in text retrieval.

In conclusion, CAFE is a large-scale document information retrieval system that can handle large, dynamic corpora by using metadata in a mediated agent architecture.