Adaptive Access for a Digital Library of Corporate Information


R. Krishnan, The Heinz School, Carnegie Mellon University

David Steier, Price Waterhouse Technology Centre

1. Introduction

Before the National Information Infrastructure (NII) can provide widespread access to large data archives, important storage and retrieval problems need to be addressed. As solutions to these problems are proposed, new standards and technologies will force the structure of the NII and its components to change over time. To cope with such changes, we propose that digital libraries incorporate a capability that we call adaptive access, which relies on the acquisition and use of a wide range of meta-data about the archive being accessed. We are developing generic capabilities for adaptive access in the context of our work on building a digital library of the filings stored in the Securities and Exchange Commission's (SEC) Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system. EDGAR grows by roughly ten million pages of information a year, with data rates approaching 1 gigabyte per hour on peak filing dates. We are addressing the challenges posed by such high data volumes by building a system called RELATE (REal-time Learning Access to EDGAR), which draws on techniques from many areas of computer science, from highly reliable hierarchical storage management to advanced human-computer interfaces. This abstract focuses on our strategy for analyzing access methods that can adapt to changing information sources and needs.

2. The problem: Turning EDGAR into a digital library of corporate information

The EDGAR system is currently being phased into use and by May 1996 will contain all the filings (e.g., annual and quarterly reports, registration and proxy statements) made to the SEC by over 15,000 companies. The information in these filings is of general interest to investors, auditors, litigators and regulators. These users have a variety of requirements, but they fall into two broad categories: those who are attempting to find characteristics of a particular company (e.g., "How much did this company spend on research and development?"), and those who are looking for companies with a particular characteristic (e.g., "Which companies had a write-down of assets prior to a spin-off?").

Apart from cost, there are several problems with getting this information from the printed filings and from current computerized services, such as those provided by Disclosure, Inc.; among them are data overload and missing or incorrect data. Some of these problems (e.g., missing or incorrect data) can be remedied by providing direct access to the original filings in EDGAR, but this trades off against the original data overload problem. Our conjecture is that no conventional approach to information retrieval addresses all of these problems simultaneously. By "conventional", we mean any fixed scheme that depends on abstracting all data in the same way for all users. For example, companies in specific industries may decide to report certain revenue sources or expense categories as individual line items, e.g., "investment management fees" for financial services and "aircraft fuel" for airlines. If users are only allowed to query by the aggregate terms that are available for all companies, e.g., "total costs and expenses", then substantial information will be lost, and the cross-sectional analyses of an industry (for benchmarking) that users are most often interested in will not be possible.

The research challenge, as we see it, is to build systems that provide tailored access efficiently, adapting to user needs and information sources as they evolve, and incorporating new terms for use in future queries.

3. Our approach: Using adaptive techniques to capture meta-information

We now briefly describe how EDGAR filings will be analyzed by the RELATE system in order to illustrate our vision of adaptive access to information. Only the first components of RELATE have been built so far, but we are optimistic about the overall strategy: storing detailed meta-information about the data and user needs, and using this meta-information dynamically to map user queries onto specific data items in a filing.

EDGAR filings will first come into RELATE over the Internet as a series of ASCII text documents. Bill Kornfeld, a consultant to Price Waterhouse, has implemented a pre-processor to decompose the filings before storing them in order to focus the effort of the subsequent content-based analysis. For example, each 10-K (the formal annual report required to be filed with the SEC within 90 days of the end of each fiscal year) must have certain items and schedules, and the set of financial statements that includes the income statements mentioned above will always be in Item 8. Our analysis indicates that finite state grammars (implemented in Prolog) are both sufficiently powerful and efficient to extract this information. We are also making rapid progress in the use of finite state grammars to parse tables from the raw EDGAR text, identifying row and column headers with high accuracy, despite significant variation and noise in the original data. Weak grammars may also be sufficient to find descriptive information such as listings of products and services, and meta-information about accounting policies such as inventory valuation methods or depreciation schedules.
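The decomposition step described above can be illustrated with a small finite-state scanner. This is only a sketch in Python, not the Prolog pre-processor mentioned in the text: the header pattern `ITEM_HEADER`, the function name `split_into_items`, and the two state names are assumptions made for illustration, and real filings would need a far more tolerant pattern.

```python
import re

# Assumed header shape: a line beginning "Item 8." or "ITEM 7A:" and so on.
ITEM_HEADER = re.compile(r"^\s*item\s+(\d+[a-z]?)\s*[.:-]", re.IGNORECASE)


def split_into_items(text):
    """Scan a filing line by line with a two-state machine.

    States: "preamble" (before the first item header) and "in_item".
    Returns a dict mapping item labels such as "8" to the text of that item,
    header line included. Lines in the preamble are discarded.
    """
    state = "preamble"
    current = None
    items = {}
    for line in text.splitlines():
        match = ITEM_HEADER.match(line)
        if match:
            # Transition: any state -> in_item, opening a new section.
            state = "in_item"
            current = match.group(1).upper()
            items.setdefault(current, [])
        if state == "in_item":
            items[current].append(line)
    return {label: "\n".join(lines) for label, lines in items.items()}


# Made-up miniature filing, used only to exercise the scanner.
SAMPLE = """ANNUAL REPORT
Item 7. Management's Discussion
Revenue grew.
Item 8. Financial Statements
Net income 100
"""

if __name__ == "__main__":
    for label, body in split_into_items(SAMPLE).items():
        print(label, "->", body.splitlines()[0])
```

Once a filing is split this way, the more expensive content-based analysis can be pointed at Item 8 alone when financial statements are wanted.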

The next phase of processing the data is more semantically based, and will be more difficult to implement. RELATE will attempt to identify as much structure as possible in the extracted information and provide additional indexes where appropriate. For example, the table representing an income statement may be decomposed into a net-income section and a per-share section, while the net-income section may have subsections for operating income, adjustments, net income, etc. This decomposition is a heuristic process and depends on having detailed models of financial items and their inter-relationships. Our goal is for this stage to identify as much of the explicit structure as possible in the individual filings at storage time. Then, on an as-needed basis at query time, further analysis will derive implicit information (such as performance ratios) and perform comparisons across multiple filings. This requires problem solving to apply knowledge about the repository structure, semantics of the data, and strategies for conflict detection and resolution. We are building such reasoning capabilities using the Soar architecture for building intelligent agents (Laird et al., 1987). Soar responds to difficulties in problem solving by creating a series of subgoals, and caches the results of problem solving for each subgoal using an experience-based learning mechanism called chunking. Each section of the income statements referred to above has corresponding operators within Soar, and a simple induction algorithm is used to learn applicability conditions for each operator. In RELATE, we expect this to lead to more efficient and accurate information access over time.
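The split between storage-time structure and query-time derivation can be sketched as follows. This is a loose Python analogy, not Soar: recursive requests stand in for subgoals, and a plain result cache stands in for chunking, which in Soar learns general rules rather than storing single values. The rule table `RULES`, the class `FilingView`, and all item names and amounts are hypothetical.

```python
# Hypothetical derivation rules: derived item -> (input items, function).
RULES = {
    "gross_profit": (("revenue", "cost_of_sales"), lambda r, c: r - c),
    "gross_margin": (("gross_profit", "revenue"), lambda g, r: g / r),
}


class FilingView:
    """Explicit items are stored up front; derived items are computed on demand."""

    def __init__(self, explicit_items):
        self.explicit = dict(explicit_items)
        self.cache = {}        # results of earlier derivations
        self.derivations = 0   # how many times a rule actually fired

    def get(self, name):
        if name in self.explicit:
            return self.explicit[name]
        if name in self.cache:
            return self.cache[name]
        if name not in RULES:
            raise KeyError(f"no stored value or rule for {name!r}")
        inputs, fn = RULES[name]
        # Each missing input becomes a sub-request, resolved recursively.
        value = fn(*(self.get(i) for i in inputs))
        self.derivations += 1
        self.cache[name] = value
        return value


if __name__ == "__main__":
    # Invented amounts, for illustration only.
    view = FilingView({"revenue": 200.0, "cost_of_sales": 150.0})
    print(view.get("gross_margin"), view.derivations)
    print(view.get("gross_margin"), view.derivations)  # second call hits the cache
```

The first request for `gross_margin` fires two rules; a repeated request fires none, which is the kind of speed-up over time that the caching of subgoal results is meant to provide.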

A final issue we are addressing is how a user will interact with RELATE to offer guidance about how to detect and resolve inconsistencies during data access and integration. In the case of an income statement for an airline, an "aircraft fuel" line item may be listed between the "Operating expenses" header and the "Total operating expenses" line. RELATE could notice that all the line items between the two add up to the total, which would lead it to posit that "aircraft fuel" is a type of operating expense. RELATE could learn to look for such items in the context of income statements from airlines, so that future queries could then quickly compare fuel expenditures across airlines over time. We plan to develop techniques, based on previous research on learning from instruction in Soar, for the user to monitor RELATE's progress and direct its attention to relevant features of the situation.
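The add-up check in the airline example can be written down directly. The sketch below is in Python and makes several assumptions: a parsed table is represented as a list of (label, amount) pairs in document order, the function name `posit_section_members` and the `tolerance` parameter are hypothetical, and the amounts are invented.

```python
def posit_section_members(rows, header, total_label, tolerance=0.5):
    """Return labels of the rows between `header` and `total_label` if their
    amounts sum to the total within `tolerance`; otherwise return [].

    rows: list of (label, amount) pairs in document order; header rows may
    carry None as their amount. Raises ValueError if either label is absent.
    """
    labels = [label for label, _ in rows]
    start = labels.index(header)
    end = labels.index(total_label)
    members = rows[start + 1:end]
    total = rows[end][1]
    if members and abs(sum(amount for _, amount in members) - total) <= tolerance:
        return [label for label, _ in members]
    return []


# Invented figures for a made-up airline income statement.
ROWS = [
    ("Operating expenses", None),
    ("Salaries and benefits", 400.0),
    ("Aircraft fuel", 250.0),
    ("Other", 150.0),
    ("Total operating expenses", 800.0),
]

if __name__ == "__main__":
    print(posit_section_members(ROWS, "Operating expenses",
                                "Total operating expenses"))
```

A non-empty result is only a hypothesis that each returned label is a kind of operating expense; in the interaction described above, the user could confirm or reject it.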

The RELATE project is clearly a long-term effort, and many challenges remain. Yet we have been pleasantly surprised so far by the progress we have been able to make with relatively simple techniques (e.g. finite state grammars for text scanning). Our hope is that our luck will hold out, and relatively simple learning techniques such as Soar's chunking will afford us the same leverage in developing capabilities for adaptive access.

References

J. Laird, A. Newell, P. Rosenbloom (1987), Soar: An Architecture for General Intelligence, Artificial Intelligence, Vol. 33, pp. 1-64.