Using a MapReduce Architecture to make a posting list
- To teach you how to make a posting list
with a distributed MapReduce architecture on
a Hadoop cluster.
- Groups: This can be done in groups of 1 or 2.
- Discussion: Use Piazza for general questions whose answers can benefit you and everyone.
- Write a program to be executed by Hadoop:
- on the distributed file system
there will be files in a directory
called /user/common/big_output.txt. This is
the result of the full crawl that
Prof. Patterson did.
- The format of the file will be
one JSON array per line (one
entry per line). Multiple
entries with the same (url, term)
pair may be present. They should be
summed in your deliverable.
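- For reference, a single line of that file might look something like this (the url, term, and count shown here are made up for illustration, not taken from the real crawl):
["http://www.gutenberg.org/files/1342/1342-0.txt", "pride", 3]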
- Your work:
- Make a posting list with an alphabetized list of
words found in the corpus, the
documents in which they occur and
their frequency in each document.
Alphabetize according to the native Java String ordering
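- One possible shape for a posting-list entry, if you serialize the per-document counts as JSON, is shown below. This layout is only a suggestion, not a required format, and the term, urls, and counts are invented:
term_x	{"http://example.org/doc1.txt": 3, "http://example.org/doc2.txt": 1}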
- Step 1: Get an example to build
- The first step is to get a bit of example code running
- Create a new, empty Eclipse Workspace (File -> Switch Workspace)
- Download the source code for hadoop 2.7.1 from the Apache Hadoop Releases webpage
- Uncompress the hadoop source into a folder
- In Eclipse, "Import" a project -> a "Maven" project -> which is an "Existing Maven Project"
- Select the hadoop folder that you decompressed. Scanning the folder will take a few minutes.
- Also select the top-level pom.xml
and all of its sub pom.xml files
- 11 in total
- Eclipse will ask you to "Setup Maven
Plugin connectors"; the defaults are
fine (including some scary red x's)
- Wait for Eclipse to finish working
(status bar at the bottom)
- Right-Click on the hadoop-common
project and select -> "properties"
-> "Order and Export" and check the
box next to "Maven Dependencies". This
will let your project use the libraries
that hadoop-common uses via Maven.
- Create a new Java project for your
code. It should be a sibling to the 10
hadoop projects. Make sure it is set to
build in Java SE 7 (or 1.7)
- Under the
"hadoop-mapreduce-examples" project find
the WordCount.java file and copy it
to the source directory of your new
project. This will be the code that we
will get running as a test.
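- For orientation, the example's map and reduce classes look roughly like the sketch below (condensed from the stock WordCount example; the imports are already in that file). The mapper emits (word, 1) for every token and the reducer sums the counts per word:
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);          // emit (word, 1) for each token
    }
  }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {     // sum all the 1s for this word
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);          // emit (word, total count)
  }
}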
- For now ignore the build errors in
the hadoop projects... those are due to
Maven configuration problems. We do
need to worry about the build errors in
our project though
- Right-Click on your project,
select -> "properties"
-> "Projects" and add two projects: the
hadoop projects that your code depends on.
This will let you use the code from
those projects in yours.
- Select WordCount.java and then under
the menubar "Source" -> select
"Organize Imports". This will map the
references in your java file to the
hadoop projects if they weren't already mapped
- Now your project should be compiling
- Try and run your project as a Java
Application. It will print a usage message:
Usage: wordcount <in> [<in>...] <out>
and then it will stop,
but this will create a run
configuration that we need
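- That usage message comes from the example's main() method, which also wires up the job. It looks roughly like this (condensed from the stock WordCount example; imports as in that file):
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length < 2) {
    System.err.println("Usage: wordcount <in> [<in>...] <out>");   // what you just saw
    System.exit(2);
  }
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  for (int i = 0; i < otherArgs.length - 1; ++i) {
    FileInputFormat.addInputPath(job, new Path(otherArgs[i]));     // all but the last arg are inputs
  }
  FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}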
- Step 2: Run it on the cluster
- Export the project as a runnable
jar using the run configuration that
you just implicitly created, but for
library handling pick "Copy required
libraries...". We are going to use
the libraries that are on the
cluster to avoid conflicts. Keep
track of where the jar goes.
- On your local machine, download the
2.7.1 binary package of Hadoop from the
Apache Hadoop Releases page
- Transfer nine files to wcpkneel.
This may take a long time if you are
doing it from a slow spot on the network:
- The jar you exported
- The binary hadoop package
that ends in ".tar.gz"
- An input file for you to test your word counting on,
perhaps this one
- on the remote machine uncompress the hadoop binary file. If the binary ends in ".tar.gz" you can use the command "tar xvofz <file_name>" to do that
- This will create a directory called hadoop-2.7.1
- cd into it and make a directory
(mkdir) called "conf" (for configuration)
- put the last 6 configuration files above into that directory
- Now, from the hadoop-2.7.1 directory
you should be able to run
bin/hdfs dfs -ls /
to list what is on the distributed
file system: the one that is split
across all the nodes in the cluster.
- Make your own directory in the
distributed file system with the command:
bin/hdfs dfs -mkdir /user/<user_name>
- Look to see what is in your
distributed file system directory
with the command:
bin/hdfs dfs -ls /user/<user_name>
- You can also see what's in the
distributed file system by looking
at the filesystem in a web browser
- Make your own input directory in the
distributed file system with the command:
bin/hdfs dfs -mkdir /user/<user_name>/input
- Move your input file from
wcpkneel's filesystem into the
distributed file system using this command:
bin/hdfs dfs -copyFromLocal input.txt /user/<user_name>/input
- Check to make sure the file
arrived there by using the -ls
command or the filesystem view in a web browser
- If you need to delete a file, use
the -rm command in place
of -mkdir or -ls
- The files on the dfs may be deleted
at any time if the cluster crashes, so don't keep anything important there.
- If the warning about native
libraries bothers you, you can
replace the files in lib/native with
these that I compiled locally on
wcpkneel. It bugged me, so I did.
- Now run your program on the cluster with the command:
bin/hadoop jar <yourJar>.jar /user/<user_name>/input /user/<user_name>/output
- If everything went okay, then your answer should be in the output directory. View it in the filebrowser
- If you use the sample input file from above, then this should be your output: sample-output.txt
- You can see your program execute here
on the cluster
- Step 3: Change the code
- We are now going to change the code
to process the file on the dfs
as input. Each line of the file is a JSON Array
with the first element a url, the
second element the term and the
third element the count.
- This file is the output of Prof. Patterson's crawl of project gutenberg
- You should not use a Combiner Class.
Comment that out.
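- In the example's main() (sketched in Step 1), that means commenting out the line that registers the combiner, roughly:
// job.setCombinerClass(IntSumReducer.class);   // no Combiner for this assignment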
- For the key and value classes I used
the "Text" Class and represented data in
serialized JSON format rather than trying to get
Hadoop to pass JSON Objects around
- This is the point where you must do the work
to understand what MapReduce does in
the WordCount.java class and make it do
what you want it to do
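- As a very rough sketch of one way to do it (not the required design): the mapper can parse each JSON-array line and emit the term as a Text key with a serialized (url, count) pair as a Text value, and the reducer can sum the counts per url and emit one line per term. This sketch assumes the org.json library is on the classpath (see the -libjars note below) and an invented tab/JSON output layout; it reuses the imports already in WordCount.java (IOException, Text, Mapper, Reducer) and sticks to Java 7:
public static class PostingMapper extends Mapper<Object, Text, Text, Text> {
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input line is a JSON array: [url, term, count]
    org.json.JSONArray entry = new org.json.JSONArray(value.toString());
    String url = entry.getString(0);
    String term = entry.getString(1);
    long count = entry.getLong(2);
    // key = term, value = serialized (url, count) pair
    context.write(new Text(term), new Text(url + "\t" + count));
  }
}

public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text term, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Sum the counts for duplicate (url, term) pairs
    java.util.TreeMap<String, Long> perDoc = new java.util.TreeMap<String, Long>();
    for (Text v : values) {
      String[] parts = v.toString().split("\t");
      long c = Long.parseLong(parts[1]);
      Long prev = perDoc.get(parts[0]);
      perDoc.put(parts[0], prev == null ? c : prev + c);
    }
    // Emit one posting-list line per term: term <TAB> {"url": count, ...}
    context.write(term, new Text(new org.json.JSONObject(perDoc).toString()));
  }
}
If you do something like this, remember to change the output value class registered in main() to Text and to remove the combiner as noted above.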
- The output will be on the hdfs in
the output folder that you indicate
- If you use external libraries to
help you, then when you run your hadoop
job you will need to put them on
wcpkneel also and run the job like this:
bin/hadoop jar <yourJar>.jar -libjars jar_01.jar,jar_02.jar /user/<user_name>/input /user/<user_name>/output
- If you manually set multiple reducer
tasks in your JobConfig object then you
will need to merge all the output files that start with
"part*" into one large file:
LC_ALL=C sort -m part* > big_file.txt
The "-m" tells
sort that the files are already individually
sorted. The "LC_ALL=C" tells the OS to sort
according to byte order
- Step 4: Calculate the answers
- The last thing that you need
is to be prepared to answer the quiz
questions referenced below. If you want
to write a separate Java program that
reads in the "big file" and builds a data
structure from which you can answer the
quiz questions, that is
a viable strategy. Otherwise, if you
think you can move around that file fast
enough, you can also try to answer the
questions manually on the fly.
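- If you go the separate-program route, a minimal sketch is shown below. It assumes the merged big_file.txt uses the invented "term<TAB>{json of url: count}" layout from the Step 3 sketch, and that org.json is available; adjust the parsing to whatever format you actually emitted:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Iterator;
import java.util.TreeMap;
import org.json.JSONObject;

public class PostingListQuery {
  public static void main(String[] args) throws Exception {
    // Load the whole posting list into a sorted map: term -> {url: count}
    TreeMap<String, JSONObject> postings = new TreeMap<String, JSONObject>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      int tab = line.indexOf('\t');
      if (tab < 0) continue;
      postings.put(line.substring(0, tab), new JSONObject(line.substring(tab + 1)));
    }
    in.close();

    String term = args[1];
    JSONObject docs = postings.get(term);
    System.out.println("term before: " + postings.lowerKey(term));
    System.out.println("term after:  " + postings.higherKey(term));
    System.out.println("document frequency: " + (docs == null ? 0 : docs.length()));
    long total = 0;
    if (docs != null) {
      Iterator<String> urls = docs.keys();   // iterate the urls for this term
      while (urls.hasNext()) {
        total += docs.getLong(urls.next());
      }
    }
    System.out.println("term frequency in corpus: " + total);
  }
}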
- Submitting your assignment
- Each individual, even if you are working
in a group, will take a quiz which asks you
to answer 15 questions about the posting
list. Your group-mates should not help you. Each person
should be able to run the required programs
and/or work with the required data
- The quiz will be administered via the
Eureka quiz mechanism.
- The questions will be of the form:
- What term comes alphabetically before term X?
- What term comes alphabetically after term X?
- What is the document frequency of word X?
- (Corrected) What is the term frequency of word X in the corpus?
- Correctness: Did you get the right answer?
- Due date: 11/20 11:59pm
- This is an assignment grade