The objective of this assignment is to build an inverted file (index) for the documents in your collection. To execute your program, it should be sufficient to type:
where the input-directory contains the original files (unprocessed) and the output-directory contains all generated output files (temporary files created along the way should be deleted to save space). The index command may be written in the language(s) of your choice.Terms and Term Weights You should use the terms, and term weights, generated by Homework 2. If you are not happy with your Homework 2, you may use any classmate's Homework 2 as your starting point. Mention this in your report.
Output Files The goal is to build a pair of files of fixed length records. These files should be in ASCII for easier debugging. The dictionary records may be in either hash_order or alphabetic order. The dictionary should contain three fields:
The second, the postings file, should containAlgorithm You may implement either a memory based or sort based or multi-way merge algorithm. Measure the time the index function takes on a varying number of documents (10, 20, 40, 80, etc.) and plot a graph of the indexing time as a function of the number of documents in your report. The timings should be done on the FULL indexing process (starting with the raw documents).
Program Testing
Use the same set of files as in Homeworks 1 and 2.
Program Documentation.
After your internal documentation
(comments) are complete, write a report which provides an approximately 4 page
summary of your program. In particular, discuss how you
extended/improved the preprocessing.
Show two examples of input documents and
the resulting (surviving) tokens. Discuss how many tokens there were in the
unprocessed files
and how many there were (unique and total) in the final inverted file. Discuss
the
efficiency of your approach in terms of your timings
on a linux machine (use the time command) as a function of
number of documents indexed. The entire document should be
4 pages maximum.
We will primarily be grading
from the report, so make sure it clearly describes what you did and what your
program's output and efficiency.
Hand In
Hardcopy of your code (including shell scripts), the
report, and the first and last 200 lines of the dictionary and postings files. I also encourage electronic submission, with the pdf file for your report, as well as other files attached as needed. Send the email to Dr. Nicholas <nicholas@umbc.edu> AND Mr. Dimitroff <dondim1@umbc.edu>.
Late Policy
10% deduction per 24 hours. Assignments turned in
during or after the class will result in a 10% deduction.