This output is provided to help you debug your result sorting. If your word count for each file does not match the expected output, then your result sorting will also mismatch.
Project 2 Partial Search
For this project, you will extend your previous project to support exact search and partial search. In addition to meeting the previous project requirements, your code must be able to track the total number of words found in each text file, parse and stem a query file, generate a sorted list of search results from the inverted index, and support writing those results to a JSON file.
Functionality
The core functionality of your project must satisfy the following requirements:
-
Maintain the functionality of the previous project.
-
Process additional command-line parameters to determine whether to search the inverted index and whether to produce additional JSON ouput. See the Input section for specifics.
-
Parse a query file line-by-line into a normalized and optimized multiple word query matching the processing used to build the inverted index. See the Query Parsing section for specifics.
-
Efficiently return exact search results from your inverted index, such that any word stem in your inverted index that exactly matches a query word is returned.
-
Efficiently return partial search results from your inverted index, such that any word stem in your inverted index that starts with a query word is returned.
-
Sort the search results using a simple term frequency metric to determine the most “useful” search results to output first. See the Result Sorting section for specifics.
-
Produce additional pretty JSON output for the word count of each location and the sorted search results. See the Output section for specifics.
The functionality of your project will be evaluated with the Project2Test.java
group of JUnit tests.
Query Parsing
Search queries will be provided in a text file with one multi-word search query per line. (Eventually, these queries will come from a search box on a webpage instead.) When processing this file, your query parsing code must normalize and optimize the queries as follows:
-
Clean and parse each query line. Perform the same transformations to each line of query words as used when populating your inverted index. This includes cleaning the line of any non-alphabetic characters, converting the remaining characters to lowercase, splitting the cleaned line into words by whitespace, and stemming each word. For example, the query line:
Observers observing 99 HIDDEN capybaras!
…should be processed into the cleaned, parsed, and stemmed words
[observ, observ, hidden, capybara]
after this step. -
Remove duplicate query word stems. There is no need to search for the same stem twice! For example, if the original query included both
observers
andobserving
, the stemmed version of both wordsobserv
should be included in the parsed query only once. Specifically, for the stemmed words:[observ, observ, hidden, capybara]
…the duplicate
observ
should be removed resulting in[observ, hidden, capybara]
after this step. -
Sort the unique words for each query alphabetically. This is less necessary for the search operation, but required for the search result output. For example, if the stemmed words without duplicates are:
[observ, hidden, capybara]
…then the final parsed query words should be
[capybara, hidden, observ]
after this step.
At this stage, the normalized words from the original query line can be used by your inverted index data structure to generate search results and to produce the expected JSON file output.
Result Sorting
Search engines rank their search results such that the most useful results are provided first. For example, PageRank is the well-known algorithm initially used by Google to rank results by estimating the importance of a website by its incoming links.
When returning search results, your code must also return the results in sorted order using a simple version of term frequency and raw count to determine the usefulness of a result. To do this, your code will need to be able to calculate the following:
-
Total Words: The total number of word stems in each text file. You could add this information as a separate data structure in your inverted index that gets updated every time a new word is added. See Location Output for the exact word counts for each text file used in the project tests. For example, there are 6 word stems in the
hello.txt
text file. -
Total Matches: The total number of times any of the matching query words appear in the text file. For exact search, this is the number of times the exact word stems appear. For example, the exact word stem
perfor
appears in the filewords.tExT
a total of 2 times. For partial search, this is the number of times a word stem starts with one of the query words. For example, a word stem starts withperfor
in the filewords.tExT
a total of 11 times.For a query with multiple words, your code should consider all query words that appear at a location. For example, for an exact search of the multiple word query
[narwhal, tarsier]
, the word stemnarwhal
appears 1 time and the wordtarsier
appears 1 time in the fileanimals.text
. Therefore, the number of times any of the matching query words appear in the text file is 1 + 1 = 2 times total.
With this information you can calculate the score of the search result as the percent of words in the file that match the query. This is calculated by dividing the total matches by the total words:
score = total matches / total words
Then, when sorting the search results, compare the results as follows:
-
Score: First, compare search results by their score so results with higher scores appear first (i.e. sort in descending order by total matches / total words).
-
Count: If two search results have the same score, then compare by the raw count so results with higher count appear first (i.e. sort in descending order by total matches) instead.
-
Location: If two search results have the same score and count, then compare by the location instead (i.e. sort alphabetically by path).
You can use Double.compare(…), Integer.compare(…), and String.compareTo(…) for these comparisons and the built-in sort methods in Java.
See the partial search results for antelop observ
for an example of results sorted by score, the results for aar alpaca eleph
for an example of results with the same score and thus sorted by count, and finally the results for capybara hidden
for an example of results with the same score and same count and thus sorted alphabetically by location.
Input
Your main method must be placed in a class named Driver
. The Driver
class should accept the following additional command-line arguments:
-
-search filepath
where-search
indicates the next argument is a path to a text file of queries to be used for search. If this flag is not provided, then no search should be performed. In this case, your code should behave exactly the same as the previous project. -
-exact
where-exact
is an optional flag that indicates all search operations performed should be exact search. If the flag is NOT present, any search operations should use partial search instead. -
-results filepath
where-results
is an optional flag that indicates the next argument is the path to use for the search results output file. This may include partial or exact search results! If thefilepath
argument is not provided, useresults.json
as the default output filename. If the-results
flag is not provided, do not produce an output file of search results but still perform the search operation. -
-locations filepath
where-locations
is an optional flag that indicates the next argument is the path to use to output all of the locations and their word count. If thefilepath
argument is not provided, uselocations.json
as the default output filename. If the-locations
flag is not provided, do not produce an output file of locations.
The command-line flag/value pairs may be provided in any order, and the order provided is not the same as the order you should perform the operations (i.e. always build the index before performing search, even if the flags are provided in the other order).
Your code should support all of the command-line arguments from the previous project as well.
Output
The output of your inverted index should be the same from the previous project. The output of your search results and locations should also be in pretty JSON format (using tabs and newlines) as before. See each subsection for specifics.
Search Result Output
The search results should be output as a JSON array of JSON objects, where each object represents a single line from the original query file (e.g. one search). The query objects should have two keys, queries
and results
such that:
-
The key
queries
should have a quoted text value that is the parsed and sorted query words joined together by a space. For example, if the parsed query words are[capybara, hidden, observ]
then the text output should be"capybara hidden observ"
. -
The key
results
should be an array of nested JSON objects, one per search result. Each individual search result should have, in order, these keys:-
The key
where
with a quoted text value that is the relative path to the text file matching one or more of the query words. -
The key
count
with an integer value that is the raw count (i.e. total matches) within the text file. -
The key
score
with a floating point value with 6 digits after the decimal point of the search result score. You can use aDecimalFormat
object to achieve this output:DecimalFormat FORMATTER = new DecimalFormat("0.000000"); System.out.println(FORMATTER.format(Math.PI));
-
The use of tabs, newline characters, and spaces are the same as before for “pretty” JSON output. Here is an example output snippet of a single search query with a single search result:
[
{
"queries": "lori",
"results": [
{
"where": "text/simple/animals.text",
"count": 3,
"score": 0.272727
}
]
},
You can also find this output in the expected/results-text-exact/results-text-animals-animals.json
file in the project-tests
repository. See the other expected output files for more examples.
Location Output
The locations and their word count should be output as a JSON object with the relative path of the file as the key and the word count (after cleaning, parsing, and stemming) as the value. Here is the expected output for the word count of all the text files associated with this project:
{
"text/gutenberg/1400-0.txt": 187368,
"text/gutenberg/pg1228.txt": 157344,
"text/gutenberg/pg1322.txt": 124370,
"text/gutenberg/pg1661.txt": 107396,
"text/gutenberg/pg37134.txt": 16696,
"text/rfcs/rfc475.txt": 3228,
"text/rfcs/rfc5646.txt": 27075,
"text/rfcs/rfc6797.txt": 12925,
"text/rfcs/rfc6805.txt": 9785,
"text/rfcs/rfc6838.txt": 9367,
"text/simple/.txt/hidden.txt": 1,
"text/simple/a/b/c/d/subdir.txt": 1,
"text/simple/animals.text": 11,
"text/simple/animals_copy.text": 11,
"text/simple/animals_double.text": 22,
"text/simple/capital_extension.TXT": 1,
"text/simple/capitals.txt": 4,
"text/simple/digits.txt": 2,
"text/simple/hello.txt": 6,
"text/simple/position.teXt": 20,
"text/simple/symbols.txt": 10,
"text/simple/words.tExT": 24
}
You can also find this output in the project-tests/expected/index-text-locations.json
file in the project-tests
repository.
Examples
The following are a few examples (non-comprehensive) to illustrate the usage of the command-line arguments. Consider the following example:
java Driver -path ../project-tests/text -locations index-text-locations.json
Also consider the similar example:
java Driver -locations index-text-locations.json -path ../project-tests/text
Both of these are functionally equivalent to the first of the SearchExactTest.testSimple() tests in the project-tests
repository. It will create the inverted index from the text
input directory, and output the locations and word counts to the index-text-locations.json
location file. This is useful for debugging your search result sorting.
Also consider the following:
java Driver -path ../project-tests/text/simple/words.text -search ../project-tests/query/words.txt -results results-text-exact-words-words.json -exact
This is functionally equivalent to LocationsTest.test() in the project-tests
repository. It will create the inverted index from the text/simple/words.text
input text file, perform an exact search of the queries in the query/words.txt
query input file, and output the search results to results-text-exact-words-words.json
.
Hints
It is important to develop the project iteratively. One possible breakdown of tasks are:
-
Add the ability to track locations and their word counts to your inverted index data structure. You can update this information every time you add data to your inverted index.
-
Add the ability to output the locations and their word counts to a JSON file. Make sure that your results match the expected location output before moving on.
-
Add the ability to parse query files. Compare your parsed queries to those in the expected output files. For example, the line
performer performers
should becomeperform
after being parsed, cleaned, stemmed, sorted, and discarding duplicates. -
Create a class that stores a single search result and implements the Comparable interface. You will need data members for each of the sort criteria, including the location, total word count of the location, and number of times the query occurs at that location.
-
Add an exact search method to your inverted index data structure that takes already parsed words from a single line of the query file, and returns a sorted list of search results. Output the results to the console for debugging.
-
Add the ability to output the search results to JSON file. Make sure the exact search results match the expected output.
-
Add a partial search method to your inverted index data structure that takes already parsed words from a single line of the query file, and returns a sorted list of search results.
-
Do not worry about efficient partial search until after you are getting correct results.
The important part will be to test your code as you go. The JUnit tests provided only test the entire project as a whole, not the individual parts. You are responsible for testing the individual parts themselves.