Mastering Text Processing

  • Goals:
    1. Prepare for the rest of the class:
      • To set up a programming environment
      • To be able to run a program in an environment with enough resources (disk space and CPU cycles) for building a search engine
    2. To write feature detectors for things you might want to detect on the web and make available for users of a search engine.
    3. To encourage you to use a modular architecture by forcing development of components apart from the future infrastructure.
  • Groups: This assignment may be done in groups of 1, 2 or 3.
  • Reusing code: You cannot use code written by your classmates. Use code found over the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use Piazza for general questions whose answers can benefit you and everyone.
  • Project Skeleton: Skeleton
  • Specifications:
    • Write a Java program that runs on Prof. Iba's cluster in a Unix environment.
    • Fill in each method in the skeleton according to its Javadoc specification.
    • Feel free to create additional methods and classes where necessary.
    • Give precise instructions for running your program: which programs are needed, which versions, etc. If the Prof. can’t run your program, your grade will reflect that.
    • We will test your program with our own text files.
    • At points, the assignment may be underspecified. In those cases, make your own assumptions and document them and/or use Piazza for clarification.
    • Given a text file, your program should:
      1. Tokenize (20 pts)
        • Write a method that reads in a text file and returns a list of the tokens in that file.
        • Write a method to print out frequency results.
        • Package: textProcessing.crawler.a
        • File:
        • Method: tokenizeFile
        • Method: printFrequencies
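
The tokenizing step could be sketched as follows. This is only an illustration, not the skeleton's code: the class name `TokenizerSketch`, the helper `tokenizeLine`, the `File` parameter, and the definition of a token (a maximal run of ASCII letters or digits, lower-cased) are all assumptions. The Javadoc in the skeleton is the authority on the real signature and token rules.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical class name; the skeleton defines the real one.
class TokenizerSketch {

    // Assumption: a token is a maximal run of ASCII letters or digits, lower-cased.
    static List<String> tokenizeLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String piece : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) { // split yields "" when the line starts with a delimiter
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Reads the file line by line so the whole file is never held as one string.
    // Assumption: the input is UTF-8 text.
    static List<String> tokenizeFile(File input) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(input.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                tokens.addAll(tokenizeLine(line));
            }
        }
        return tokens;
    }
}
```

Under these assumptions, "It's" becomes the two tokens "it" and "s". If you choose a different rule for apostrophes or non-ASCII letters, document it in your README.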
      2. Word Frequencies (20 pts)
        • Count the total number of words and their frequencies in a token list.
        • Package: textProcessing.crawler.b
        • File:
        • Method: computeWordFrequencies
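
One way to count word frequencies is a hash map keyed by token. The sketch below is an illustration under assumptions: the class name `WordFrequencySketch`, the `Map<String, Integer>` return type, the ordering (most frequent first, ties broken alphabetically), and the output format of `printFrequencies` are not taken from the skeleton, so check its Javadoc before relying on any of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; the skeleton defines the real one.
class WordFrequencySketch {

    // Counts each token, then returns the counts ordered by descending
    // frequency with ties broken alphabetically (an assumed ordering).
    static Map<String, Integer> computeWordFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ordered = new LinkedHashMap<>(); // keeps the sorted order
        for (Map.Entry<String, Integer> entry : entries) {
            ordered.put(entry.getKey(), entry.getValue());
        }
        return ordered;
    }

    // Assumed output format: a total line, then one "item<TAB>count" line per entry.
    static void printFrequencies(Map<String, Integer> frequencies) {
        int total = 0;
        for (int count : frequencies.values()) {
            total += count;
        }
        System.out.println("Total: " + total);
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Counting is linear in the number of tokens; the sort adds a cost that depends only on the number of distinct words.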
      3. 2-grams (30 pts)
        • A 2-gram is a pair of words that occur consecutively in a file. For example, "two words", "words that", and "that occur" are all 2-grams from the previous sentence. Count the total number of 2-grams and their frequencies in a token list.
        • Package: textProcessing.crawler.c
        • File:
        • Method: computeTwoGramFrequencies
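
Counting 2-grams follows the same pattern as counting words, except that each key is a pair of adjacent tokens. A minimal sketch, assuming a `Map<String, Integer>` return type and a single space as the joiner (the class name `TwoGramSketch` and both choices are assumptions, not the skeleton's API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; the skeleton defines the real one.
class TwoGramSketch {

    // Slides a window of two tokens over the list. A list of n tokens
    // contains n - 1 two-grams (none if n < 2).
    static Map<String, Integer> computeTwoGramFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // first-seen order
        String previous = null;
        for (String token : tokens) {
            if (previous != null) {
                counts.merge(previous + " " + token, 1, Integer::sum);
            }
            previous = token;
        }
        return counts;
    }
}
```

For the tokens "two words that occur two words", this yields "two words" with a count of 2 and "words that", "that occur", and "occur two" with a count of 1 each, for 5 two-grams in total.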
      4. Palindromes (30 pts)
        • A palindrome is a word or phrase that reads the same in both directions. For example, "eye" is a palindrome and so is "Do geese see god". Count the total number of palindromes that are longer than 5 characters, not counting whitespace and punctuation. Case should be standardized to lower case.
        • Overlapping palindromes should be output separately. Palindromes don't have to end on word boundaries. Extra credit if you specify which palindromes end on word boundaries.
        • I would like every palindrome of more than 5 characters that has a distinct center point. If two or more palindromes share a center point, report only the longest one. A center point can be a character or the gap between two adjacent characters, and these count as different center points.
        • Package: textProcessing.crawler.c
        • File:
        • Method: computePalindromeFrequencies
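
The center-point rule above maps onto the familiar "expand around each center" technique: normalize the text, then grow outward from every character and every gap between characters, keeping only the longest palindrome per center. The sketch below is one possible reading of the specification, not the skeleton's implementation. The class name `PalindromeSketch`, the `String` parameter, the `Map<String, Integer>` return type, and reporting palindromes in normalized form (lower case, no spaces) are assumptions, and the extra-credit word-boundary check is not attempted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name; the skeleton defines the real one.
class PalindromeSketch {

    static final int MIN_EXCLUSIVE = 5; // palindromes must be longer than this

    // Drops whitespace and punctuation and lower-cases what remains.
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    static Map<String, Integer> computePalindromeFrequencies(String text) {
        String s = normalize(text);
        Map<String, Integer> counts = new LinkedHashMap<>();
        // A string of n characters has 2n - 1 centers: even values of
        // `center` sit on character center/2, odd values sit on the gap
        // between characters center/2 and center/2 + 1.
        for (int center = 0; center < 2 * s.length() - 1; center++) {
            int left = center / 2;
            int right = left + center % 2;
            while (left >= 0 && right < s.length() && s.charAt(left) == s.charAt(right)) {
                left--;
                right++;
            }
            // The loop overshoots by one on each side, so the longest
            // palindrome for this center is s[left + 1 .. right - 1].
            int length = right - left - 1;
            if (length > MIN_EXCLUSIVE) {
                counts.merge(s.substring(left + 1, right), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With this reading, "Do geese see god" yields the single palindrome "dogeeseseegod", and "eye" yields nothing because it is too short. The expansion is quadratic in the worst case (for example, a long run of one repeated letter), which is worth keeping in mind given that efficiency on large inputs is part of the evaluation.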
  • Submitting your assignment
    1. We are going to use Eureka to submit this assignment.
    2. Make the file name <StudentID>-<StudentID>-<StudentID>
    3. Your submission should be a single zip file. It should match the skeleton zip file provided with the assignment, with your implementations of the four sections added and with "skeleton" replaced by "textProcessing" in the package names. If there is anything you wish to communicate to the Prof., such as implementation assumptions or how to run your program, place it in the README.txt file provided in the skeleton zip file.
    4. Sample inputs and outputs and JUnit tests are included in the skeleton file.
  • Evaluation:
    1. Can the Prof. run the version that you turned in?
      1. Is it runnable?
      2. Did you follow instructions?
    2. Correctness:
      • How well does the behavior of the program match the specification?
      • How does your program handle bad input?
    3. Efficiency
      • How quickly does the program work on large inputs?
    4. Aesthetics
      • Is the program clearly documented and well written?
    5. Due date: 09/18
    6. This is an assignment grade
    7. Here is an example grading rubric