Adaptive Access for a Digital Library of Corporate Information
R. Krishnan, The Heinz School, Carnegie Mellon University
David Steier, Price Waterhouse Technology Centre
1. Introduction
Before the National Information Infrastructure (NII) can provide
widespread access to large data archives, important storage and
retrieval problems need to be addressed. As solutions to these problems
are proposed, new standards and technologies will force the structure of
the NII and its components to change over time. To cope with such
changes, we propose that digital libraries incorporate a capability that
we call adaptive access, which relies on the acquisition and use of a
wide range of meta-data about the archive being accessed. We are
developing generic capabilities for adaptive access in the context of
our work on building a digital library of the filings stored in the
Securities and Exchange Commission's (SEC) Electronic Data Gathering,
Analysis, and Retrieval (EDGAR) system. EDGAR grows by roughly ten
million pages of
information a year, with data rates approaching 1 gigabyte per hour on
peak filing dates. We are addressing the challenges posed by such high
data volumes by building a system called RELATE (REal-time Learning
Access to EDGAR), which draws on techniques from many areas of computer
science, from highly reliable hierarchical storage management to
advanced human-computer interfaces. This abstract focuses on our
strategy for analyzing access methods that can adapt to changing
information sources and needs.
2. The problem: Turning EDGAR into a digital library of corporate
information
The EDGAR system is currently being phased into use and by May 1996 will
contain all the filings (e.g., annual and quarterly reports,
registration and proxy statements) made to the SEC by over 15,000
companies. The information in these filings is of general interest to
investors, auditors, litigators and regulators. These users have a
variety of requirements, but they fall into two broad categories: those
who are attempting to find characteristics of a particular company
(e.g. "How much did this company spend on research and development?"),
and those who are looking for companies with a particular characteristic
(e.g. "Which companies had a write-down of assets prior to a
spin-off?"). Aside from cost, there are several problems with getting
this information from the printed filings and from current computerized
services, such as those provided by Disclosure, Inc. These include:
- Data overload: Each company annually files anywhere from a dozen to
several hundred documents, with some documents almost a thousand pages
long. Only a tiny fraction of the information in these filings will be
relevant to any one query. The software engineering challenges of
providing rapid, highly reliable access to this volume of data are
compounded by the difficulty of finding the right indices. The standard
indices for corporate data, such as the four-digit Standard Industrial
Classification (SIC) codes, are notorious for errors caused by false
hits and missed data.
- Incomparability of data: Numbers such as reported earnings must be
evaluated within the context of other information, such as
industry-specific conventions and variations in accounting policies,
terminology, and reporting periods. This information is often
reported in the management discussions and the footnotes that accompany
financial statements, but is indexed only in specialized databases such
as NAARS; complex concepts such as "securitization of accounts
receivable" are usually not indexed at all.
- Missing or incorrect data: Computerized databases often have
missing fields, contain old data, and omit information from filings such
as registration statements or 8-Ks. Coverage is also likely to be
incomplete for small firms or those not actively traded. Data entry
errors often creep into the records, which are entered as text without
integrity constraint checking.
Some of these problems (e.g., missing or incorrect data) can be remedied
by providing direct access to the original filings in EDGAR, but this
trades off against the original data overload problem. Our conjecture is
that no conventional approach to information retrieval addresses all the
problems listed above simultaneously. By "conventional", we mean any
fixed scheme that depends on abstracting all data in the same way for
all users. For example, companies in specific industries may decide to
report certain revenue sources or expense categories as individual line
items, e.g. "investment management fees" for financial services and
"aircraft fuel" for airlines. If users are only allowed to query by the
aggregate terms that are available for all companies, e.g. "total costs
and expenses", then substantial information will be lost, and the
cross-sectional analyses of an industry (for benchmarking) that users
are most often interested in will not be possible. The research
challenge, as we see it, is to build systems that provide tailored
access efficiently, adapting to user needs and information sources as
they evolve, incorporating new terms for use in future queries.
3. Our approach: Using adaptive techniques to capture meta-information
We now briefly describe how EDGAR filings will be analyzed by the RELATE
system in order to illustrate our vision of adaptive access to
information. Only the first components of RELATE have been built so
far, but we are optimistic about the overall strategy: storing detailed
meta-information about the data and user needs, and using this
meta-information dynamically to map user queries onto specific data
items in a filing.
EDGAR filings will first come into RELATE over the Internet as a series
of ASCII text documents. Bill Kornfeld, a consultant to Price Waterhouse,
has implemented a pre-processor that decomposes the filings before they
are stored, in order to focus the effort of the subsequent content-based
analysis. For example, each 10-K (the formal annual report required to
be filed with the SEC within 90 days of the end of each fiscal year)
must have certain items and schedules, and the set of financial
statements, including the income statement, will always be in Item 8.
Our analysis indicates that finite state grammars (implemented in
Prolog) are both powerful enough and efficient enough to extract this
information. We are also making rapid progress in the use of finite
state grammars to parse tables from the raw EDGAR text, identifying row
and column headers with high accuracy, despite significant variation and
noise in the original data. Weak grammars may also be sufficient to find
descriptive information such as listings of products and services, and
meta-information about accounting policies such as inventory valuation
methods or depreciation schedules.
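The item-level decomposition described above can be illustrated with a small sketch. It is written in Python rather than the Prolog used in RELATE, and it is not the pre-processor itself: the header pattern, the function name split_items, and the sample text are all assumptions made for illustration. It shows only the general idea of a two-state scanner that is either outside any item or inside the most recently seen one.

```python
import re

# Assumed header shape: a line beginning "Item <number>." (hypothetical pattern,
# real filings vary far more than this).
ITEM_HEADER = re.compile(r"^\s*item\s+(\d+[a-z]?)\s*[.:-]", re.IGNORECASE)

def split_items(filing_text):
    """Split raw filing text into {item_number: body_text} with a two-state scanner."""
    items = {}
    current = None  # state: None = outside any item, otherwise the current item number
    for line in filing_text.splitlines():
        match = ITEM_HEADER.match(line)
        if match:
            # Transition: a header line starts a new item (the header text is not kept).
            current = match.group(1).upper()
            items[current] = []
        elif current is not None:
            # Stay in the current item and accumulate its body.
            items[current].append(line)
    return {number: "\n".join(body).strip() for number, body in items.items()}

# Invented sample text, for illustration only.
sample = """\
ANNUAL REPORT
Item 7. Management's Discussion and Analysis
Revenue increased during the year.
Item 8. Financial Statements and Supplementary Data
Net income      1,200
"""
sections = split_items(sample)
print(sections["8"])
```

With a split of this kind, later content-based analysis can be pointed at Item 8 alone instead of the whole filing.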
The next phase of processing the data is more semantically based and
will be more difficult to implement. RELATE will attempt to identify as
much structure as possible in the extracted information and provide
additional indexes where appropriate. For example, the table
representing an income statement may be decomposed into a net-income
section and a per-share section, while the net-income section may have
subsections for operating income, adjustments, net income, etc.
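One way to picture this decomposition is as a tree of sections whose paths serve as index keys. The following is a minimal sketch under that assumption; the Section class, the paths method, and the slash-separated key format are hypothetical and are not taken from RELATE.

```python
from dataclasses import dataclass, field

@dataclass
class Section:
    """A node in the (assumed) section tree of a financial statement."""
    name: str
    children: list = field(default_factory=list)

    def paths(self, prefix=()):
        """Yield the path to this section and to every subsection below it."""
        here = prefix + (self.name,)
        yield here
        for child in self.children:
            yield from child.paths(here)

# The decomposition named in the text, expressed as a tree.
income_statement = Section("income statement", [
    Section("net-income section", [
        Section("operating income"),
        Section("adjustments"),
        Section("net income"),
    ]),
    Section("per-share section"),
])

# A simple additional index: one key per section, from the root down.
index = {"/".join(path): path for path in income_statement.paths()}
for key in index:
    print(key)
```

A query for operating income could then be resolved through the key "income statement/net-income section/operating income" without rescanning the table.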
This decomposition is a heuristic process and depends on having
detailed models of financial items and their inter-relationships. Our
goal is for this stage to identify as much of the explicit structure as
possible in the individual filings at storage time. Then, on an
as-needed basis at query time, further analysis will derive implicit
information (such as performance ratios) and perform comparisons across
multiple filings. This requires problem solving to apply knowledge
about the repository structure, semantics of the data, and strategies
for conflict detection and resolution. We are building such reasoning
capabilities using the Soar architecture for building intelligent agents
(Laird et al., 1987). Soar responds to difficulties in problem solving
by creating a series of subgoals, and caches the results of problem
solving for each subgoal using an experience-based learning mechanism
called chunking. Each section of income statements referred to above
has corresponding operators within Soar, and a simple induction algorithm
is used to learn applicability conditions for each operator. In RELATE,
this will lead to more efficient and accurate information access over
time.
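The payoff of caching the results of query-time derivations can be suggested with a deliberately simple sketch. This is ordinary memoization in Python, offered only as a loose analogy; it is not Soar's chunking mechanism, and the filings table, the net_margin function, and the figures in it are invented for illustration.

```python
# Hypothetical extracted values, for illustration only.
filings = {
    ("ACME", 1995): {"revenue": 200.0, "net income": 30.0},
}

cache = {}              # results of earlier derivations, keyed by what was asked
stats = {"solved": 0}   # counts how often a ratio was actually derived

def net_margin(company, year):
    """Derive net income / revenue on demand, reusing any earlier result."""
    key = ("net margin", company, year)
    if key not in cache:
        stats["solved"] += 1
        items = filings[(company, year)]
        cache[key] = items["net income"] / items["revenue"]
    return cache[key]

first = net_margin("ACME", 1995)
second = net_margin("ACME", 1995)  # answered from the cache, no new derivation
print(first, stats["solved"])
```

The ratio is stored nowhere in the filing; it is derived the first time it is requested and looked up thereafter.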
A final issue we are addressing is how a user will interact with RELATE
to offer guidance about how to detect and resolve inconsistencies during
data access and integration. In the case of an income statement for an
airline, a line item for "aircraft fuel" may be listed between the
"Operating expenses" header and the "Total operating expenses" line.
RELATE could notice that all the line items between the two add up to
the total, which would lead it to posit that "aircraft fuel" is a type
of operating expense. RELATE could learn to look for such items in the
context of income statements from airlines, so that future queries
could then quickly compare fuel expenditures across airlines over time.
We plan to develop techniques, based on previous research on learning
from instruction in Soar, for the user to monitor RELATE's progress and
direct its attention to relevant features of the situation.
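The additive check in the airline example can be sketched as follows. This is an illustrative Python fragment, not RELATE code: the function name infer_members, the tolerance parameter, and the sample figures are assumptions, and the rows are taken to be already parsed into (label, value) pairs.

```python
def infer_members(rows, header, total_label, tolerance=0.01):
    """Return the labels between header and total if their values add up to the total.

    rows is a list of (label, value) pairs in table order; header rows carry None.
    An empty list means the additive check failed, so nothing is posited.
    """
    labels = [label for label, _ in rows]
    start = labels.index(header)
    end = labels.index(total_label)
    between = rows[start + 1:end]
    if not between:
        return []
    total = rows[end][1]
    if abs(sum(value for _, value in between) - total) <= tolerance:
        return [label for label, _ in between]
    return []

# Invented income-statement fragment for an airline.
rows = [
    ("Operating expenses", None),
    ("Salaries and benefits", 410.0),
    ("Aircraft fuel", 150.0),
    ("Other", 40.0),
    ("Total operating expenses", 600.0),
    ("Operating income", 75.0),
]
members = infer_members(rows, "Operating expenses", "Total operating expenses")
print(members)
```

When the check succeeds, each returned label, "Aircraft fuel" among them, can be hypothesized to be a type of operating expense; a user could then confirm or reject the hypothesis.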
The RELATE project is clearly a long-term effort, and many challenges
remain. Yet we have been pleasantly surprised so far by the progress we
have been able to make with relatively simple techniques (e.g. finite
state grammars for text scanning). Our hope is that our luck will hold
out, and relatively simple learning techniques such as Soar's chunking
will afford us the same leverage in developing capabilities for adaptive
access.
References
J. Laird, A. Newell, P. Rosenbloom (1987), Soar: An Architecture for
General Intelligence, Artificial Intelligence, Vol. 33, pp. 1-64.