CMSC 476/676
Homework 3 - work individually
Due: At the beginning of class, Thursday, March 12

The objective of this assignment is to build an inverted file (index) for the documents in your collection. To execute your program, it should be sufficient to type:

index input-directory output-directory where the input-directory contains the original files (unprocessed) and the output-directory contains all generated output files (temporary files created along the way should be deleted to save space). The index command may be written in the language(s) of your choice.

Terms and Term Weights You should use the terms, and term weights, generated by Homework 2. If you are not happy with your Homework 2, you may use any classmate's Homework 2 as your starting point. Mention this in your report.

Output Files The goal is to build a pair of files of fixed length records. These files should be in ASCII for easier debugging. The dictionary records may be in either hash_order or alphabetic order. The dictionary should contain three fields:

the word
the number of documents that contain that word
the location of the first record in the postings file
The second, the postings file, should contain the document id
the normalized weight of the word in the document

Algorithm You may implement either a memory based or sort based or multi-way merge algorithm. Measure the time the index function takes on a varying number of documents (10, 20, 40, 80, etc.) and plot a graph of the indexing time as a function of the number of documents in your report. The timings should be done on the FULL indexing process (starting with the raw documents).

Program Testing
Use the same set of files as in Homeworks 1 and 2.

Program Documentation.
After your internal documentation (comments) are complete, write a report which provides an approximately 4 page summary of your program. In particular, discuss how you extended/improved the preprocessing. Show two examples of input documents and the resulting (surviving) tokens. Discuss how many tokens there were in the unprocessed files and how many there were (unique and total) in the final inverted file. Discuss the efficiency of your approach in terms of your timings on a linux machine (use the time command) as a function of number of documents indexed. The entire document should be 4 pages maximum. We will primarily be grading from the report, so make sure it clearly describes what you did and what your program's output and efficiency.

Hand In
Hardcopy of your code (including shell scripts), the report, and the first and last 200 lines of the dictionary and postings files. I also encourage electronic submission, with the pdf file for your report, as well as other files attached as needed. Send the email to Dr. Nicholas <nicholas@umbc.edu> AND Mr. Dimitroff <dondim1@umbc.edu>.

Late Policy
10% deduction per 24 hours. Assignments turned in during or after the class will result in a 10% deduction.