Crawling the Bible

  • Goals:
    1. To learn how to use software and infrastructure for crawling webpages.
  • Groups: This assignment may be done in groups of 1 or 2.
  • Reusing code: You may reuse text processing code written by you or by any classmate for the previous assignment. You may not use crawler code written by non-group members. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose its origin. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use Piazza for general questions whose answers can benefit everyone.
  • Write a program to crawl the NIV Bible:
    • You need to run your crawler from Prof. Iba's cluster machine.
    • Use crawler4j as your crawler engine.
    • Follow the instructions at https://github.com/yasserg/crawler4j and create your MyCrawler and Controller classes (a minimal sketch appears after the Specifications below).
    • Remember to set the maximum Java heap size to an acceptable value. For example, if your machine has 2GB of RAM, you may want to assign 1GB to your crawler by adding -Xmx1024M to your java command line parameters.
    • Make sure to use the nohup or screen command. Otherwise your crawl, which may take a long time, will be stopped if your connection drops for even a second. Search the web for how to use these commands.
    • Input: Start your crawl at http://djp3.westmont.edu/classes/2015_09_CS150/tasks/crawl_bible/Bible/bible.html
    • Specifications:
      • VERY IMPORTANT: Set the name of your crawler’s User Agent to “Westmont IR Firstname Lastname: Team <something>”. We will parse your user agent to verify you did this, so get it exactly right, including capitalization.
      • VERY IMPORTANT: wait 100ms between sending page requests. Violating this policy may get your crawler banned for 60 seconds.
      • You should only crawl pages on the http://djp3.westmont.edu/ domain
      • We will verify the execution of your crawler in the web server's logs. If we don’t find log entries for your team, that means either your crawler didn’t perform as it should or you didn’t set its name correctly; in the latter case we can’t verify whether it ran successfully, so we’ll assume it didn’t.
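    • Example: a minimal sketch of MyCrawler with its controller, assuming the crawler4j 4.x API from the README linked above. The storage folder, thread count, and logging are placeholder choices; replace the team name with your own:

        import java.util.Set;

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.CrawlController;
        import edu.uci.ics.crawler4j.crawler.Page;
        import edu.uci.ics.crawler4j.crawler.WebCrawler;
        import edu.uci.ics.crawler4j.fetcher.PageFetcher;
        import edu.uci.ics.crawler4j.parser.HtmlParseData;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
        import edu.uci.ics.crawler4j.url.WebURL;

        public class MyCrawler extends WebCrawler {

            // Only follow links on the allowed domain (see Specifications above).
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                return url.getURL().toLowerCase().startsWith("http://djp3.westmont.edu/");
            }

            // Called once per fetched page; record what the Output questions need.
            @Override
            public void visit(Page page) {
                String url = page.getWebURL().getURL();
                if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData data = (HtmlParseData) page.getParseData();
                    String text = data.getText();               // page text with HTML markup stripped
                    Set<WebURL> links = data.getOutgoingUrls(); // links found in the page content
                    // Record url (question 2), links.size() (question 3), and text (questions 4-7) here.
                    System.out.println("Visited: " + url + " (" + links.size() + " links)");
                }
            }

            public static void main(String[] args) throws Exception {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/tmp/crawl");  // placeholder; any writable folder works
                config.setPolitenessDelay(100);              // wait 100ms between requests (required above)
                config.setUserAgentString("Westmont IR Firstname Lastname: Team <something>");

                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

                controller.addSeed("http://djp3.westmont.edu/classes/2015_09_CS150/tasks/crawl_bible/Bible/bible.html");
                controller.start(MyCrawler.class, 1);  // one crawler thread keeps the 100ms spacing simple
            }
        }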
    • Output: Submit a document with the following information (all of this is in reference to URLs under .../Bible/...):
      1. How much time did it take to crawl the entire domain?
      2. How many unique pages did you find? (Uniqueness is established by the URL.)
      3. How many links did you find in the content of the pages that you crawled?
      4. What is the longest page in terms of number of words? (HTML markup doesn’t count as words)
      5. What are the 25 most common words in this domain? (Ignore these English stop words.) Submit the list of words ordered by frequency.
      6. What are the 25 most common 2-grams? (Again, ignore English stop words.) A 2-gram, in this case, is a sequence of 2 words in which neither is a stop word. Submit the list of 25 2-grams ordered by frequency. (A counting sketch for questions 5 and 6 appears after this list.)
      7. What are the 10 longest palindromes that 1) don't contain 3 or more of the same letter in a row, and 2) are contained in one book of the Bible? What pages do they occur on? Submit your list. (A palindrome-finding sketch appears after this list.)
      8. Extra credit: On which pages does the 2-gram "dolcipator venexus" show up?
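      For questions 4-6, the counting itself is plain Java once you have the markup-free text of each page (HtmlParseData.getText() in the sketch above already strips markup). A sketch, assuming a letters-only tokenizer and a placeholder stop word list (substitute the full English stop word list the assignment refers to):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Map;
        import java.util.Set;

        public class WordStats {

            // Placeholder stop words; substitute the actual English stop word list.
            static final Set<String> STOP_WORDS =
                    new HashSet<>(Arrays.asList("the", "and", "of", "to", "a", "in", "that"));

            public static void main(String[] args) {
                // In the real run, this would be the accumulated text of the crawled pages.
                String pageText = "In the beginning God created the heaven and the earth";

                // Lowercase and split on anything that is not a letter.
                List<String> words = new ArrayList<>();
                for (String w : pageText.toLowerCase().split("[^a-z]+")) {
                    if (!w.isEmpty()) words.add(w);
                }
                // words.size() per page answers question 4 (longest page in words).

                Map<String, Integer> wordCounts = new HashMap<>();
                Map<String, Integer> twoGramCounts = new HashMap<>();
                for (int i = 0; i < words.size(); i++) {
                    if (STOP_WORDS.contains(words.get(i))) continue;
                    wordCounts.merge(words.get(i), 1, Integer::sum);
                    // A 2-gram counts only when neither adjacent word is a stop word.
                    if (i + 1 < words.size() && !STOP_WORDS.contains(words.get(i + 1))) {
                        twoGramCounts.merge(words.get(i) + " " + words.get(i + 1), 1, Integer::sum);
                    }
                }

                // Sort by descending frequency and keep the top 25 (same idea for twoGramCounts).
                wordCounts.entrySet().stream()
                        .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                        .limit(25)
                        .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
            }
        }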
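      For question 7, one workable approach is expand-around-center over the letters of each book, run book by book so every palindrome stays within a single book, while tracking the page each one came from. Treating a palindrome as a letters-only sequence (spaces and punctuation ignored) is an assumption; confirm the intended definition on Piazza:

        import java.util.ArrayList;
        import java.util.List;

        public class PalindromeFinder {

            // True if any letter appears 3 or more times in a row (disallowed by question 7).
            static boolean hasTripleRun(String s) {
                for (int i = 2; i < s.length(); i++) {
                    if (s.charAt(i) == s.charAt(i - 1) && s.charAt(i) == s.charAt(i - 2)) return true;
                }
                return false;
            }

            // Expand around every center, keeping the longest valid palindrome found there.
            static List<String> palindromesIn(String bookText) {
                String s = bookText.toLowerCase().replaceAll("[^a-z]", "");  // letters only (assumption)
                List<String> found = new ArrayList<>();
                for (int center = 0; center < 2 * s.length() - 1; center++) {
                    int lo = center / 2;
                    int hi = lo + center % 2;  // even centers give odd lengths, odd centers even lengths
                    String best = null;
                    while (lo >= 0 && hi < s.length() && s.charAt(lo) == s.charAt(hi)) {
                        String candidate = s.substring(lo, hi + 1);
                        // A triple run stays in every wider expansion, so stop once one appears.
                        if (hasTripleRun(candidate)) break;
                        best = candidate;
                        lo--;
                        hi++;
                    }
                    if (best != null && best.length() >= 4) found.add(best);
                }
                found.sort((a, b) -> b.length() - a.length());  // longest first
                return found;
            }

            public static void main(String[] args) {
                // Run per book, merge the results, and keep the 10 longest overall,
                // remembering which page each palindrome occurred on.
                System.out.println(palindromesIn("Madam, I'm Adam. No lemon, no melon."));
            }
        }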
  • Submitting your assignment
    1. We are going to use Eureka to submit this assignment.
    2. Your submission should be a single PDF file. Include your group members' names, answers to the questions above, and any additional information that you deem pertinent.
  • Evaluation:
    • Correctness:
      1. Did you crawl the domain correctly? This will be verified in the server logs.
      2. Are your answers reasonable? Correct answers without evidence of correct crawling are not valid.
    • Due date: 10/02 11:59pm
    • This is an assignment grade.