Format: The first element is a term. The second element is a JSON-parsable list of URLs and their term frequencies
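For instance, a line in that format might look like the following (the term, URLs, and counts here are invented, and the exact nesting may differ in your file):

```
dog    [["http://example.com/a.html", 3], ["http://example.com/b.html", 1]]
```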
Evaluation: Each person (even if you are in a group) will take a Eureka quiz (TBD) which will ask 10 questions of the form:
What is the {first...fifth}-most relevant document (by id number) for the query X and what is its relevance score?
Group members may assist each other if they want, but are not required to.
Guides: V1
Here is how I went about solving this:
First set up a project that includes BerkeleyDB and a JSON parser library
Write code to calculate the number of documents in the input file. (This only has to be done once per input file if you store the number somewhere)
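A minimal sketch of that count in Python, assuming each line is the term, whitespace, then the JSON postings list (adjust the split if your file differs):

```python
import json

def count_documents(path):
    # The corpus size N is the number of distinct URLs seen across all
    # posting lists in the input file.
    urls = set()
    with open(path) as f:
        for line in f:
            term, postings_json = line.split(None, 1)
            for url, tf in json.loads(postings_json):
                urls.add(url)
    return len(urls)
```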
Create a BerkeleyDB table to store (url, double) pairs: your document accumulators
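One way to get such a table, assuming the `berkeleydb` Python bindings (the successor to `bsddb3`); BerkeleyDB stores raw bytes, so doubles are packed with `struct`:

```python
import struct
import berkeleydb  # pip install berkeleydb

# Dict-like, disk-backed hash table: url (bytes) -> packed double (bytes).
accumulators = berkeleydb.hashopen("accumulators.db", "c")

def put_double(table, url, value):
    table[url.encode("utf-8")] = struct.pack("d", value)

def get_double(table, url, default=0.0):
    try:
        return struct.unpack("d", table[url.encode("utf-8")])[0]
    except KeyError:
        return default
```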
Write code to calculate the document vector lengths V_d · V_d (this only has to be done once per input file if you store the results in your BerkeleyDB)
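A sketch of that pass, reusing `put_double` from above. The (1 + log10(tf)) * log10(N/df) weighting and the decision to store sqrt(V_d · V_d) are assumptions; use whatever scheme your course specifies:

```python
import json
import math

def compute_doc_lengths(path, n_docs, table):
    # Accumulate each document's sum of squared term weights, then store
    # the vector length (square root of that sum) keyed by URL.
    squared = {}
    with open(path) as f:
        for line in f:
            term, postings_json = line.split(None, 1)
            postings = json.loads(postings_json)
            idf = math.log10(n_docs / len(postings))
            for url, tf in postings:
                w = (1 + math.log10(tf)) * idf
                squared[url] = squared.get(url, 0.0) + w * w
    for url, sq in squared.items():
        put_double(table, url, math.sqrt(sq))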
Write code to ask for a query
Use the efficient cosine similarity accumulator algorithm to calculate V_q · V_d
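A sketch of the accumulator step, under the same assumed weighting as above; `index` is a term -> list of [url, tf] lookup (e.g. the postings file loaded into memory), and query tokenization by whitespace split is also an assumption:

```python
import math
from collections import Counter

def score_query(query, index, n_docs):
    # Term-at-a-time accumulators: only terms that occur in the query are
    # processed, and only documents in those posting lists get touched.
    acc = {}
    for term, tf_q in Counter(query.lower().split()).items():
        postings = index.get(term)
        if not postings:
            continue
        df = len(postings)
        # Query-side weight: N and df bumped by one because the query itself
        # counts as a document (see the note at the end of this guide).
        w_q = (1 + math.log10(tf_q)) * math.log10((n_docs + 1) / (df + 1))
        idf_d = math.log10(n_docs / df)
        for url, tf_d in postings:
            w_d = (1 + math.log10(tf_d)) * idf_d
            acc[url] = acc.get(url, 0.0) + w_q * w_d
    return acc  # raw V_q . V_d scores, not yet normalized
```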
Normalize the similarity scores by the document vector length
Sort the documents by score for the given query. The highest score is the most relevant because it is the cosine of the angle between the query and document vectors (or proportional to it, in my case)
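The normalization and sorting steps together, using `get_double` from the BerkeleyDB sketch above:

```python
def top_documents(acc, lengths_table, k=5):
    # Divide each accumulator by the stored document vector length, then sort
    # descending; the highest cosine comes first.
    scored = [(url, score / get_double(lengths_table, url, 1.0))
              for url, score in acc.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```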
Ask for a new query
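Stitching the sketches above into a small query loop (file names here are hypothetical placeholders):

```python
import json
import berkeleydb

def load_index(path):
    # Read the whole postings file into memory as term -> list of [url, tf].
    index = {}
    with open(path) as f:
        for line in f:
            term, postings_json = line.split(None, 1)
            index[term] = json.loads(postings_json)
    return index

if __name__ == "__main__":
    index = load_index("postings.txt")
    n_docs = count_documents("postings.txt")
    lengths = berkeleydb.hashopen("lengths.db", "c")
    if len(lengths) == 0:  # only needs to happen once per input file
        compute_doc_lengths("postings.txt", n_docs, lengths)
    while True:
        query = input("query> ").strip()
        if not query:
            break
        acc = score_query(query, index, n_docs)
        for rank, (url, score) in enumerate(top_documents(acc, lengths), 1):
            print(f"{rank}. {score:.6f}  {url}")
```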
Here is a screenshot of my output; the movie is included to show the speed.
To get these numbers I used log10. Getting the exact numbers correct is important.
Remember that the query counts as a document, so it bumps up your corpus size and each query term's document frequency by one when computing the elements of V_q
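A small numeric illustration of that bump on the query side (the numbers are made up):

```python
import math

# Hypothetical corpus of 1000 documents and a query term with df = 10.
n_docs, df = 1000, 10
idf_query_side = math.log10((n_docs + 1) / (df + 1))  # query bumps N and df by one
print(round(idf_query_side, 4))  # ~1.959, versus log10(1000/10) = 2.0 without the bump
```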