basicConcepts

UMBC CMSC 461 Spring '99

Basic Concepts

An index for a file is like a catalog for a book in the library. Cards in the catalog are stored in order with portions of the catalog order by author's name, book title, or subject. Items in the database are catalogued with indices based on keys. When a table is defined, it has a primary key; however, it can have additional keys defined.

Typical databases are too large to search sequentially looking for specific records and more sophisticated indexing techniques are employed. The two basic kinds of indices are:

Ordered indices
Hash indices

Different techniques are evaluated on the basis of several opposing factors:

Access types: The access could be for a specific value or a specified range.

Access time: The amount of time it takes to find a particular data item or set of items.

Insertion time: The time it takes to insert a new data item. This value includes the time it takes to find the correct place to insert the new data items, as well as update the index structure. Obviously, the more indices that must be updated, the longer this will take. (Note: When optimizing performance of a DBMS, it is useful to review all indices to make sure that the primary key is the key used most often and then any other indices are being used often enough to justify the overhead of maintaining the index.)

Deletion time: The time it takes to delete a data item. This includes the time to locate the item and then delete it, as well as updating the index structure. (Note: When optimizing performance of a DBMS, it is useful to review how deletions can cascade and cause deletions in other tables. This is part of the overhead of maintaining the index structure.)

Space overhead: The addition space occupied by an index structure. Provided that the amount of additional space is moderate, it is usually worthwhile to sacrifice the space to achieve improved performance. (With the rapidly dropping prices and increased capacity, the developer must still be limited by the constraint that there is only so much space available and that it still requires more time to maintain the structure. This overhead can still be significant!)

With indexing, there is now a new meaning for the over-used word "key":

Primary key
Candidate key
Superkey
Search key

Records in an indexed file are stored in some type of order (including unordered.). If the file containing the records is order sequentially, the index whose search key specifies the sequential order of the file is the primary key. Primary indices are also called clustering indices. Indices whose search key specifies an order different from the sequential order are called secondary indices, or nonclustering indices.

Primary Index

If we assume that all files are ordered sequentially on some search key (primary index), then such files are called index-sequential files. (The Index-Sequential Access Method or ISAM is one of the oldest index schemes used in database systems.) They are used for applications that require both sequential processing of the entire file and random access to the individual records. An example for a back account would be:

RECORD NR (not in table)	LOCATION (primary key)	ACCOUNT NR	BALANCE	NEXT RECORD
1	Brighton	A-217	750	2
2	Downtown	A-101	500	3
3	Downtown	A-110	600	4
4	Mianus	A-215	700	5
5	Perryridge	A-102	900	6
6	Perryridge	A-201	700	7
7	Perryridge	A-218	700	8
8	Redwood	A-222	700	9
9	Round Hill	A-305	350	NULL

It could be even more complex if the primary key was the combination of the location and the account number. The advantage of this method is that when processing is being done sequentially, this is the fastest access method. However it is expensive in terms of maintaining. If a record is to be added to the table, the appropriate slot must be located, a new slot must be added to the table, and all records must be moved down one slot. Then the new slot is inserted. Then all of the index columns must be updated. The file stored on the disk can be stored as contiguous or non-contiguous data. In terms of the DBMS performance, being able to maintain the file contiguously offers better DBMS performance, however the is probably in conflict with the operating system's allocation method. (Some DBMS systems will allocate the maximum space that a file can have when the file is created to reduce the overhead caused when records are added to the file.) Some systems require space that the operating system can not manage, so that some of these problems are reduced.

Deletions are preformed in the reverse sequence. The appropriate slot must be located and all records below that slot must be moved up. Again, all of the index columns must be update.

Updates on these kind of files are normally batched together to minimize the system overhead, because the cost of updating one record is almost exactly the same as updating one hundred records. This is not particularly suited to interactive updates.

Dense and Sparse Indices

This can be improved by adding another data structure (especially for random access).

Dense Index. An index record (or index entry) appears for every search-key value, containing the value and the location for the first data record with that value. The dense index for the previous table would look like:

Value	Start
Brighton	1
Downtown	2
Mianus	4
Perryridge	5
Redwood	8
Round Hill	9

Sparse Index. An index record is created only for some of the values, which is the only difference between the two versions in terms of the data and its structure. With the sparse index, the system has to locate the largest value in the index that does not exceed the search key. From that point the records must be checked sequentially until a match is found.

Obviously, it is generally faster to locate a record if we have a dense index rather than a spare index, but the maintenance overhead is greater for a dense index in addition to requiring more disk space. This is the tradeoff that the system designer must make. The major advantage to the sparse index is that it will required bring in less blocks, which can be expensive in terms of time, because searching the block once is memory is SO much faster! The obvious answer might not be the best answer.

Multilevel Indices

The above system appears to work well, but it does not scale up well. If the index is small enough to keep in memory this is a really good method, but when the index becomes too large, the problems of searching the index become just like searching the file: It takes too long!

Suppose we have a file with 100,000 records with 10 records stored per block. If we have one index record for each block, the index has 10,000 entries. Since the index entries are smaller, there might be 100 entries per block in 100 blocks that are stored sequentially on the disk. Doing a sequential search would involve up to 100 reads (average would be n/2, or 50 reads).

Binary searches could be used. The cost for b blocks would be log₂(b) blocks to be read. In this example that would be seven reads. If each read is 30 milliseconds access time, we have 210 milliseconds, which by today's standards is long! (That example ignores a number of other time-consuming activities.) A binary search is not possible if the system was built using overflow blocks.

What if we use a sparse index on the index file, which is small enough to keep in memory! That new index is an outer index and the original index is an inner index. The system performs a binary search on the outer index and now only has to make one read for the correct block of the inner index. There can be any number of inner levels, creating a multilevel index system.

Multilevel indices are closely related to tree structures, such as the binary trees used for in-memory indexing.

Index Update

Algorithms for updating single-level indices:

Deletion. To delete a record, look up the record to delete. If the deleted record was the only record for that particular search-key, then the search-key value is also deleted from a dense index. If the deleted record was the first record for a particular search-key value, then the index was be updated to point the next record. Likewise if a sparse index was used, it too may have to be similarly updated.

Insertion. First we perform a lookup using the search-key value of the record to be inserted. The index is dense and the search-key does not appear in the index, the search-key value must be inserted into the index. If the index is sparse, it may not have to be modified.

Insertion and deletion algorithms for multilevel indices are a simple extension of this scheme.

Secondary Indices

A secondary index on a candidate key looks just like a dense primary index, except that the record pointers are not stored in the record itself and must be stored somewhere else. Additionally, the secondary index might not be based on a candidate key. Therefore, it is not sufficient to point to the first occurrence. There must be pointers to every record. The "dense" index points to a structure that has an entry for each occurrence of that value. Obviously, the maintenance of secondary indices is more costly.