CMSC-341 Spring 2006

Project 5

Assigned Wednesday May 3, 2006
Due Tuesday, May 16th, 11:59pm
Updates Monday, May 8th
So that we all get the same results, the use of Horner's rule for hashing strings (as found in the text and in the class notes) is required.


Background

Despite the fact that Red-Black trees offer guaranteed O(lgN) performance per operation, there are many applications where the O(1) performance of hash tables is essential. One such application is query processing by web search engines like Google and Yahoo. In this project you will use hash tables to implement a simple, yet efficient, query engine.


Description

Web search engines have response times of approximately one second for arbitrary queries because they index the entire web. They do this by maintaining a list of the web pages for every word. Then, to figure out what pages contain all of the words in a multiword query you intersect the lists. For example, if the query "project four" is issued, the index is consulted to find out that "project" occurs in, say, pages P1, P4, P5, and P9; the index is consulted again to find out that "four" occurs in, say, pages P4, P7, P9, and P12; and finally the intersection of these two lists of pages is computed yielding P4 and P9 as the only ones that contain both "project" and "four".

You will implement a system like the one described above using a hash table with quadratic probing to store the index.

Here are your tasks:

  1. Write Proj5.cpp. It will validate command line arguments, read commands from a command file, print the results of executing commands, and print statistics about your hash table performance. Your executable must be named Proj5.
  2. Copy QuadProbing.h and HashAux.cpp from Mr. Frey's public directory. Make any necessary changes to these files for this project. These classes must remain generic templates.
  3. Design and implement any necessary classes required, using good Object-Oriented Design/Programming principles.
  4. Create and submit a makefile for this project. The command make Proj5 should compile and link all necessary files.
  5. Answer the questions posed in 341-Spring06-proj5_questions.txt found in Mr. Frey's public directory for this project.

The Command Line

Project 5 will be invoked with a command line that consists of two arguments in this order
  1. The initial hash table size (an integer). If this integer is not prime, your program should round the table size up to the nearest prime.
  2. The name of a command file
Your program should validate the command line, ensure that the specified file can be opened, then read and process each of the commands in the file.

The Command File

The command file format follows. Note that there can be multiple LOAD and QUERY commands in a single command file in any order.

Project Notes, Restrictions, and Miscellaneous Requirements

  1. Mr. Frey's public directory for this project is
    (/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj5).

  2. As stated above, your project must use quadratic probing. In addition, when an attempted insertion of a word fails and when your project terminates
    1. Print the following statistics
      • Hash Table Size
      • Number of unique words inserted
      • The load factor, lambda (2 decimal places)
      • Number of clusters
      • Max cluster size
      • Average cluster size (2 decimal places)
      • Max probes required to successfully insert any word (output the word and file in which it was contained; not counting the word that failed).
      • Average probes per successful insert (not counting the word that failed; 2 decimal places)
    2. resize the hash table (twice as big rounded up to the next prime), rehash all previous words, and then reinsert the word which failed.
      Note that probe and cluster data should be cleared after printing, before rehashing, and be re-accumulated during rehashing.

  3. For this project, we define an insertion "failure" as a second attempt to insert a word into the initial probe location. By definition, the initial probe is always attempted at index h(K, 0). If a later probe ( h(K, i) with i > 0 ) attempts to insert at the initial index, then the insertion has failed.

  4. A cluster is defined as 2 or more consecutive cells which contain a word. If a cluster "wraps around" the bottom of the table to the top of the table, count these as two clusters.

  5. No STL containers may be used for this project. Only those provided by the author may be used.

Project Output

As usual, this project output is provide for formatting purposes only. It is not the result of running any executable and so may be inconsistent with itself.

Your project should echo each command from the command file and print the result as shown here.

LOAD bob.txt
LOADED bob.txt

LOAD myfile.dat
Error - unable to open myfile.dat

QUERY 3 project five mary
project, five, and mary all appear in files
     bob.txt
     readme.txt

QUERY 3 bob mary sue
bob, mary, and sue all appear in files
     NONE

When an insert fails, your project should print a message followed by the required statistics.
Inserting "napkin" from file "manners.dat" failed.

Hash Table Stats
----------------
HashTable Size: 12345
Unique Words Inserted: 4567
Load Factor: 6.66
Max Cluster Size: 42
Number of Clusters: 42
Average Cluster Size: 42.42
Max Probes: 22 for "lawn" in file "YardWorkMadeEasy.doc"
Average Probes: 3.66

Rehashing.....
When your project terminates, print the same "Hash Table Stats" as above.


Files To Be Submitted

You should submit all files needed to build your program and the file containing your answers to the questions.
Submit Tools
There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful
compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your
submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to
make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj5

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o and ii_files), but leaves the
executable in place.

submitrun <class> <project> [command-line args]

Example:  submitrun cs341 Proj5 checkers checkfile.dat

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).


Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.

Your answers to 341-Spring06-proj5_questions.txt are worth 10% of your project grade.

Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.