Crawling Gutenberg

  • Goals:
    1. To understand the challenges that come with scale.
    2. To crawl a larger set of documents.
  • Groups: This assignment may be done in groups of 1 or 2.
  • Reusing code: You may reuse text-processing or crawling code written by you or any classmate for the previous assignment. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose its origin. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use Piazza for general questions whose answers can benefit everyone.
  • Write a program to crawl a mirror of Project Gutenberg:
    • You need to run your crawler from Prof. Iba's cluster machine: wcpkneel
    • Use crawler4j as your crawler engine.
    • I recommend you use Berkeley DB to keep track of your statistics.
    • Follow the instructions at https://github.com/yasserg/crawler4j to create your MyCrawler and Controller classes (a minimal sketch appears after the Specifications list below).
    • Remember to set Java's maximum heap to a sensible value. For example, if your machine has 2 GB of RAM, you may want to give 1 GB to your crawler by adding -Xmx1024M to your java command-line parameters.
    • Make sure to run your crawler under nohup or screen -- for example, something like nohup java -Xmx1024M Controller &. Otherwise your crawl, which may take a long time, will be killed if your connection drops for even a second. Search the web for how to use these commands.
    • Input: Start your crawl at this seed page
    • Specifications:
      • VERY IMPORTANT: Set your crawler’s User Agent to “Westmont IR Firstname Lastname: Team <something>”. We will be parsing your user agent to verify you did this, so get it exactly right, including capitalization. If you set it correctly, you will show up on the seed page.
      • VERY IMPORTANT: Wait 100 ms between page requests. Violating this policy may get your crawler banned for 60 seconds.
      • You should only crawl pages on the http://djp3.westmont.edu/ domain
      • We will verify the execution of your crawler in the web server’s logs. If we don’t find log entries for your team, either your crawler didn’t perform as it should or you didn’t set its name correctly; in the latter case we can’t verify whether it ran successfully, so we’ll assume it didn’t.
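      As a concrete starting point, here is a minimal sketch of the two classes, assuming crawler4j 4.x (where shouldVisit receives the referring page), closely following the project's README. The crawl storage folder, seed URL, and thread count are illustrative placeholders; the user agent, the 100 ms politeness delay, the domain restriction, and the excluded extensions come from the specifications above and the Output section below:

        // Controller.java -- configures and launches the crawl.
        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.CrawlController;
        import edu.uci.ics.crawler4j.fetcher.PageFetcher;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

        public class Controller {
            public static void main(String[] args) throws Exception {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path
                config.setPolitenessDelay(100);              // 100 ms between requests, per the spec
                config.setUserAgentString("Westmont IR Firstname Lastname: Team <something>");

                PageFetcher fetcher = new PageFetcher(config);
                RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
                CrawlController controller = new CrawlController(config, fetcher, robots);

                controller.addSeed("http://djp3.westmont.edu/");  // replace with the actual seed page
                controller.start(MyCrawler.class, 1);             // number of crawler threads
            }
        }

        // MyCrawler.java -- stays on the assignment domain and skips media files.
        import java.util.regex.Pattern;
        import edu.uci.ics.crawler4j.crawler.Page;
        import edu.uci.ics.crawler4j.crawler.WebCrawler;
        import edu.uci.ics.crawler4j.url.WebURL;

        public class MyCrawler extends WebCrawler {
            private static final Pattern EXCLUDED = Pattern.compile(
                    ".*\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov"
                    + "|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz)$");

            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                String href = url.getURL().toLowerCase();
                // Stay on the assignment domain; skip excluded file types.
                return href.startsWith("http://djp3.westmont.edu/")
                        && !EXCLUDED.matcher(href).matches();
            }

            @Override
            public void visit(Page page) {
                // Extract the page text here and update your (term, doc, count)
                // statistics; see the storage sketch under Output below.
            }
        }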
    • Guides
    • Output:
      1. Store all (term, doc, count) tuples (see the storage sketch after this list).
      2. Submit a document answering the following questions about the URLs under http://djp3.westmont.edu/gutenberg/gutenberg/, excluding filenames that end in any of the following extensions (regardless of case): css, js, bmp, gif, jpg, jpeg, png, tif, tiff, mid, mp2, mp3, mp4, wav, avi, mov, mpeg, ram, m4v, pdf, rm, smil, wmv, swf, wma, zip, rar, gz
        1. How much space do your (term,doc) pairs take up on disk?
        2. How much time did it take to crawl the entire domain?
        3. How many unique pages did you find? (Uniqueness is established by the URL.)
        4. How many unique words did you find? (Simple tokenization of the text.)
        5. On which pages do the following terms appear (regardless of case)?
          1. foolishness
          2. deity
          3. assassination
        6. Extra credit: On which pages does the 2-gram "dolcipator venexus" show up?
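      For the (term, doc, count) tuples, here is one possible sketch using Berkeley DB Java Edition. The environment directory, database name, and key encoding (term and document URL joined by a tab, with the count stored as a decimal string) are assumptions, not requirements; with a single crawler thread, as in the Controller sketch above, the read-modify-write increment below is safe:

        import java.io.File;
        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.DatabaseEntry;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;
        import com.sleepycat.je.LockMode;
        import com.sleepycat.je.OperationStatus;

        public class TupleStore {
            private final Environment env;
            private final Database db;

            public TupleStore(File dir) {
                dir.mkdirs();  // the environment home directory must exist
                EnvironmentConfig envConfig = new EnvironmentConfig();
                envConfig.setAllowCreate(true);
                env = new Environment(dir, envConfig);
                DatabaseConfig dbConfig = new DatabaseConfig();
                dbConfig.setAllowCreate(true);
                db = env.openDatabase(null, "term_doc_counts", dbConfig);
            }

            // Key: "term\tdocURL"; value: the count as a decimal string.
            public void increment(String term, String docUrl) {
                DatabaseEntry key = new DatabaseEntry((term + "\t" + docUrl).getBytes());
                DatabaseEntry value = new DatabaseEntry();
                int count = 0;
                if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                    count = Integer.parseInt(new String(value.getData()));
                }
                db.put(null, key, new DatabaseEntry(Integer.toString(count + 1).getBytes()));
            }

            public void close() {
                db.close();
                env.close();
            }
        }

      Inside MyCrawler.visit(), a simple tokenization (lowercase, split on runs of non-alphanumeric characters) could then feed the store; 'store' here is an assumed shared TupleStore instance:

        // Requires: import edu.uci.ics.crawler4j.parser.HtmlParseData;
        if (page.getParseData() instanceof HtmlParseData) {
            String text = ((HtmlParseData) page.getParseData()).getText();
            String url = page.getWebURL().getURL();
            for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!term.isEmpty()) store.increment(term, url);
            }
        }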
  • Submitting your assignment
    1. We are going to use Eureka to submit this assignment.
    2. Please submit a zip of your code and a PDF.
    3. Your PDF should include your group members' names, answers to the questions above, and any additional information that you deem pertinent.
  • Evaluation:
    • Correctness:
      1. Did you crawl the domain correctly? This will be verified in the server logs.
      2. Are your answers reasonable?
        1. Correct answers without evidence of correct crawling are not valid.
    • Due date: 10/23, 11:59pm
    • This is an assignment grade.