Crawling the Bible
- Goals:
- To learn how to use software and infrastructure for crawling webpages.
- Groups: This assignment may be done in groups of 1 or 2.
- Reusing code: You can use text processing code written by you or any other classmate for the previous assignment. You cannot use crawler code written by non-group members. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose its origin. Concealing the origin of a piece of code is plagiarism.
- Discussion: Use Piazza for general questions whose answers can benefit everyone.
- Write a program to crawl the NIV Bible:
- You need to run your crawler from Prof. Iba's cluster machine.
- Use crawler4j as your crawler engine.
- Follow the instructions at https://github.com/yasserg/crawler4j and create your MyCrawler and Controller classes; sketches of both appear below.
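- Here is a minimal sketch of a MyCrawler class. It assumes the two-argument shouldVisit signature of the crawler4j 4.x API (older versions take only a WebURL), and the FILTERS pattern is only an illustrative guess at file types worth skipping:

    import java.util.Set;
    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        // Skip obvious non-HTML resources (illustrative list; adjust as needed).
        private static final Pattern FILTERS =
                Pattern.compile(".*(\\.(css|js|gif|jpe?g|png|pdf|zip|gz))$");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            String href = url.getURL().toLowerCase();
            // Stay on the djp3.westmont.edu domain, as the specification requires.
            return !FILTERS.matcher(href).matches()
                    && href.startsWith("http://djp3.westmont.edu/");
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL();
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
                String text = htmlParseData.getText();               // page text with HTML markup stripped
                Set<WebURL> links = htmlParseData.getOutgoingUrls();
                // Record url, text, and links here; you will need them for the
                // unique-page, word-count, and link-count questions below.
                System.out.println("Visited " + url + ": "
                        + text.split("\\s+").length + " words, "
                        + links.size() + " outgoing links");
            }
        }
    }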
- Remember to set the maximum Java heap size to an acceptable value. For example, if your machine has 2GB of RAM, you may want to assign 1GB to your crawler. You can add -Xmx1024M to your java command line parameters.
- Make sure to use the nohup or screen command. Otherwise your crawl, which may take a long time, will be stopped if your connection is disconnected for even a second. Search the web for how to use these commands.
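- For example, assuming your compiled classes and the crawler4j jar are on the classpath (the jar name and log file names here are placeholders), a detached, memory-capped run might look like:

    nohup java -Xmx1024M -cp .:crawler4j-with-dependencies.jar Controller > crawl.log 2>&1 &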
- Input: Start your crawl at http://djp3.westmont.edu/classes/2015_09_CS150/tasks/crawl_bible/Bible/bible.html
- Specifications:
- VERY IMPORTANT: Set the name of your crawler’s User Agent to “Westmont IR Firstname Lastname: Team <something>”. We will be parsing your user agent to verify you did this right, so get it exactly correct, including capitalization.
- VERY IMPORTANT: Wait 100ms between sending page requests. Violating this policy may get your crawler banned for 60 seconds.
- You should only crawl pages on the http://djp3.westmont.edu/ domain.
- We will verify the execution of your crawler in the web server’s logs. If we don’t find log entries for your team, that means your crawler didn’t perform as it should or you didn’t set its name correctly; in the latter case we can’t verify whether it ran successfully or not, so we’ll assume it didn’t.
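- A minimal Controller sketch that wires in the required user agent, the 100ms politeness delay, and the seed URL from the Input section is shown below (the storage folder and thread count are placeholders; method names follow the crawler4j 4.x API):

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");      // placeholder path on the cluster machine
            config.setPolitenessDelay(100);                  // required: wait 100ms between requests
            config.setUserAgentString(
                    "Westmont IR Firstname Lastname: Team <something>"); // required user agent, exactly as specified

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer =
                    new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
            CrawlController controller =
                    new CrawlController(config, pageFetcher, robotstxtServer);

            controller.addSeed("http://djp3.westmont.edu/classes/2015_09_CS150/tasks/crawl_bible/Bible/bible.html");
            controller.start(MyCrawler.class, 1);            // a single crawler thread keeps the load polite
        }
    }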
- Output: Submit a document with the following information (all of this is in reference to URLs below .../Bible/...):
- How much time did it take to crawl the entire domain?
- How many unique pages did you find? (Uniqueness is established by the URL)
- How many links did you find in the content of the pages that you crawled?
- What is the longest page in terms of number of words? (HTML markup doesn’t count as words)
- What are the 25 most common words in this domain? (Ignore these English stop words.) Submit the list of common words ordered by frequency.
- What are the 25 most common 2-grams? (Again, ignore English stop words.) A 2-gram, in this case, is a sequence of 2 words in which neither is a stop word. Submit the list of 25 2-grams ordered by frequency.
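- One way to answer the word-frequency and 2-gram questions from the text gathered in visit() is sketched below. It assumes you supply the stop-word set yourself (e.g., from your previous assignment's code) and that a 2-gram means two adjacent words, neither of which is a stop word:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class FrequencyCounter {
        private final Set<String> stopWords;   // assumed: loaded from the provided stop-word list
        private final Map<String, Integer> wordCounts = new HashMap<>();
        private final Map<String, Integer> twoGramCounts = new HashMap<>();

        public FrequencyCounter(Set<String> stopWords) {
            this.stopWords = stopWords;
        }

        // Call once per crawled page with the markup-free text from HtmlParseData.getText().
        public void addPage(String text) {
            String[] tokens = text.toLowerCase().split("[^a-z0-9']+");
            String prev = null;                // previous non-stop word, or null right after a stop word
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                if (stopWords.contains(token)) {
                    prev = null;
                    continue;
                }
                wordCounts.merge(token, 1, Integer::sum);
                if (prev != null) {
                    twoGramCounts.merge(prev + " " + token, 1, Integer::sum);
                }
                prev = token;
            }
        }

        // Top-n entries of a count map, ordered by descending frequency.
        public static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
            entries.sort((a, b) -> b.getValue() - a.getValue());
            return entries.subList(0, Math.min(n, entries.size()));
        }
    }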
- What are the 10 longest palindromes that: 1) don't contain 3 or more of the same letter in a row, and 2) are contained in one book of the Bible? What pages do they occur on? Submit your list.
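- A sketch of the two filter conditions on a candidate string is shown below. How you enumerate candidate substrings within a single book is up to you; ignoring case and non-letter characters when testing is an assumption:

    public class PalindromeFilter {

        // Keep only letters, lower-cased, so spacing and punctuation don't affect the tests.
        private static String lettersOf(String s) {
            return s.toLowerCase().replaceAll("[^a-z]", "");
        }

        // True if the letters of s read the same forwards and backwards.
        static boolean isPalindrome(String s) {
            String letters = lettersOf(s);
            for (int i = 0, j = letters.length() - 1; i < j; i++, j--) {
                if (letters.charAt(i) != letters.charAt(j)) return false;
            }
            return true;
        }

        // True if the same letter appears 3 or more times in a row.
        static boolean hasTripleRun(String s) {
            String letters = lettersOf(s);
            for (int i = 2; i < letters.length(); i++) {
                if (letters.charAt(i) == letters.charAt(i - 1)
                        && letters.charAt(i) == letters.charAt(i - 2)) {
                    return true;
                }
            }
            return false;
        }

        // A candidate qualifies if it is a palindrome with no triple letter run.
        static boolean qualifies(String candidate) {
            return isPalindrome(candidate) && !hasTripleRun(candidate);
        }
    }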
- Extra credit: On which pages does the 2-gram "dolcipator venexus" show up?
- Submitting your assignment
- We are going to use Eureka to submit this assignment.
- Your submission should be a single PDF file. Include your group members' names, answers to the questions above, and any additional information that you deem pertinent.
- Evaluation:
- Correctness:
- Did you crawl the domain correctly? Verified in server logs.
- Are your answers reasonable?
- Correct answers without evidence of correct crawling are not valid.
- Due date: 10/02 11:59pm
- This is an assignment grade.