Using a MapReduce Architecture to make a posting list

  • Goals:
    1. To teach you how to make a posting list with a distributed MapReduce architecture on a Hadoop cluster.
  • Groups: This can be done in groups of 1 or 2.
  • Discussion: Use Piazza for general questions whose answers can benefit you and everyone.
  • Write a program to be executed by Hadoop:
    • Input:
      • On the distributed file system there will be a file called /user/common/big_output.txt. This is the result of the full crawl of Project Gutenberg that Prof. Patterson did.
      • The format of the file will be a JSON array of [<url>,<term>,<term_count>], one entry per line. Multiple entries with the same (url,term) pair may be present; they should be summed in your deliverable.
    • Your work:
      • Make a posting list with an alphabetized list of words found in the corpus, the documents in which they occur and their frequency in each document. Alphabetize according to native Java sorting functions.
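        For example (with made-up values, purely to illustrate the format), the two input lines
          ["http://www.gutenberg.org/files/11/11.txt","alice",12]
          ["http://www.gutenberg.org/files/11/11.txt","alice",3]
        refer to the same (url, term) pair, so your posting list entry for "alice" should record a total count of 15 for that document, alongside the counts for every other document that contains "alice".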
  • Guides
    • Step 1: Get an example to build
      1. The first step is to get a bit of example code running
      2. Create a new, empty Eclipse Workspace (File -> Switch Workspace)
      3. Download the source code for hadoop 2.7.1 from the Apache Hadoop Releases webpage
      4. Uncompress the hadoop source into a folder
      5. In Eclipse, "Import" a project -> a "Maven" project -> which is an "Existing Maven Project"
      6. Select the hadoop folder that you decompressed. Scanning the folder will take a few minutes.
        • Select "hadoop-common-project" -> "hadoop-common/pom.xml"
        • Also select the "hadoop-mapreduce-project" pom.xml and all of its sub pom.xml files
        • 11 in total
      7. Eclipse will ask you to "Setup Maven Plugin connectors"; the defaults are fine (including some scary red x's)
      8. Wait for Eclipse to finish working (status on bottom right)
      9. Right-Click on the hadoop-common project and select -> "properties" -> "Order and Export" and check the box next to "Maven Dependencies". This will let your project use the libraries that hadoop-common uses via Maven.
      10. Create a new Java project for your code. It should be a sibling of the hadoop projects you just imported. Make sure it is set to build with Java SE 7 (or 1.7)
      11. Under the "hadoop-mapreduce-examples" project, find the WordCount.java file and copy it to the source directory of your new project. This will be the code that we will get running as a test.
      12. For now, ignore the build errors in the hadoop projects; those are due to Maven configuration problems. We do need to worry about the build errors in our own project, though.
      13. Right-click on your project, select "Properties" -> "Projects", and add two projects: "hadoop-common" and "hadoop-mapreduce-client-core". This will let you use the code from those hadoop projects.
      14. Select WordCount.java and then, under the "Source" menu, select "Organize Imports". This will map the references in your Java file to the hadoop projects if that wasn't already done.
      15. Now your project should be compiling without error
      16. Try to run your project as a Java Application. It will print an error message:
        Usage: wordcount <in> [<in>...] <out>
        and then it will stop, but this will create a run configuration that we need
    • Step 2: Run it on the cluster
      1. Export the project as a runnable jar using the run configuration that you just implicitly created, but for library handling pick "Copy required libraries...". We are going to use the libraries that are on the cluster to avoid conflicts. Keep track of where the jar goes.
      2. On your local machine, download the 2.7.1 binary package of Hadoop from the Apache Hadoop Releases page
      3. Transfer nine files to wcpkneel
        1. The jar you exported
        2. The binary hadoop package that ends in ".tar.gz"
        3. An input file for you to test your word counting on, perhaps this one
        4. capacity-scheduler.xml
        5. core-site.xml
        6. hadoop-env.sh
        7. mapred-site.xml
        8. yarn-env.sh
        9. yarn-site.xml
        This may take a long time if you are doing it from a slow spot on the net.
      4. On the remote machine, uncompress the hadoop binary package. If it ends in ".tar.gz" you can use the command "tar xvofz <file_name>" to do that
      5. This will create a directory called hadoop-2.7.1
      6. cd into it and make a directory (mkdir) called "conf" (for configuration)
      7. Put the last six files from the list above (the configuration files) into that directory
      8. Now, from the hadoop-2.7.1 directory you should be able to run
        bin/hdfs dfs -ls /
        To list what is on the distributed file system: the one that is split across all the nodes in the cluster.
      9. Make your own directory in the distributed file system with the command:
        bin/hdfs dfs -mkdir /user/<user_name>
      10. Look to see what is in your distributed file system directory with the command:
        bin/hdfs dfs -ls /user/<user_name>
      11. You can also see what's in the distributed file system by looking at the filesystem in a web browser
      12. Make your own input directory in the distributed file system with the command:
        bin/hdfs dfs -mkdir /user/<user_name>/input
      13. Copy your input file from wcpkneel's filesystem into the distributed file system using this command:
        bin/hdfs dfs -copyFromLocal input.txt /user/<user_name>/input
      14. Check to make sure the file arrived there by using the -ls command or the filesystem browser
      15. If you need to delete a file, use the command -rm in place of -mkdir or -ls above
      16. The files on the dfs may be deleted at any time if the cluster crashes, so don't keep anything important there
      17. If the warning about native libraries bothers you, you can replace the files in lib/native with these, which I compiled locally on wcpkneel. It bugged me, so I replaced them.
      18. Now run your program on the cluster
        bin/hadoop jar <yourJar>.jar /user/<user_name>/input /user/<user_name>/output
      19. If everything went okay, then your answer should be in the output directory. View it in the file browser
      20. If you use the sample input file from above, then this should be your output: sample-output.txt
      21. You can see your program execute here on the cluster
    • Step 3: Change the code
      1. We are now going to change the code to process the file /user/common/big_output.txt on the dfs as its input. Each line of the file is a JSON array with the first element a url, the second element the term, and the third element the count.
      2. This file is the output of Prof. Patterson's crawl of Project Gutenberg
      3. You should not use a Combiner Class. Comment that out.
      4. For the key and value classes I used the "Text" Class and represented data in serialized JSON format rather than trying to get Hadoop to pass JSON Objects around
      5. This is the point where you must do the work to understand what MapReduce does and what the WordCount.java class does, and then make it do what you want it to do (a minimal sketch of one possible structure is included at the end of this step)
      6. The output will be on the hdfs in the output folder that you indicate
      7. If you use external libraries to help you, then when you run your hadoop job you will need to put them on wcpkneel also and run the job like this:
        bin/hadoop jar <yourJar>.jar -libjars jar_01.jar,jar_02.jar /user/<user_name>/input /user/<user_name>/output
      8. If you manually set multiple reduce tasks in your job configuration object, then you will need to merge all the output files matching "part*" into one large file.
        LC_ALL=C sort -m part* > big_file.txt
        The "-m" tells sort that the files are already individually sorted. The "LC_ALL=C" tells the OS to sort according to byte order
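      9. For reference, here is a minimal sketch (not the official solution) of one way the map and reduce steps could be structured under the choices above: Text keys and values, data serialized as JSON, and the Gson library for parsing (an assumption; any JSON library shipped via -libjars as in item 7 would work). The class names are placeholders.
        import java.io.IOException;
        import java.util.Map;
        import java.util.TreeMap;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.GenericOptionsParser;

        import com.google.gson.Gson;

        public class PostingListSketch {

          // Map: one input line ["<url>","<term>",<term_count>] becomes
          // key = term, value = the JSON pair ["<url>",<term_count>]
          public static class TermMapper extends Mapper<Object, Text, Text, Text> {
            private final Gson gson = new Gson();
            private final Text term = new Text();
            private final Text urlAndCount = new Text();

            public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
              Object[] entry = gson.fromJson(value.toString(), Object[].class);
              String url = (String) entry[0];
              int count = ((Number) entry[2]).intValue();
              term.set((String) entry[1]);
              urlAndCount.set(gson.toJson(new Object[] { url, count }));
              context.write(term, urlAndCount);
            }
          }

          // Reduce: for one term, sum the counts per url (this is where duplicate
          // (url, term) entries get merged) and emit the posting list as JSON.
          public static class PostingReducer extends Reducer<Text, Text, Text, Text> {
            private final Gson gson = new Gson();
            private final Text posting = new Text();

            public void reduce(Text term, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
              Map<String, Integer> perUrl = new TreeMap<String, Integer>();
              for (Text v : values) {
                Object[] pair = gson.fromJson(v.toString(), Object[].class);
                String url = (String) pair[0];
                int count = ((Number) pair[1]).intValue();
                Integer previous = perUrl.get(url);
                perUrl.put(url, previous == null ? count : previous + count);
              }
              Object[][] list = new Object[perUrl.size()][];
              int i = 0;
              for (Map.Entry<String, Integer> e : perUrl.entrySet()) {
                list[i++] = new Object[] { e.getKey(), e.getValue() };
              }
              posting.set(gson.toJson(list));
              context.write(term, posting);
            }
          }

          // Driver: the same shape as WordCount's main, but with no Combiner,
          // as the assignment requires.
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            Job job = Job.getInstance(conf, "posting list");
            job.setJarByClass(PostingListSketch.class);
            job.setMapperClass(TermMapper.class);
            // no job.setCombinerClass(...) here
            job.setReducerClass(PostingReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }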
    • Step 4: Calculate the answers
      1. The last thing that you need is to be prepared to answer the quiz questions referenced below. Writing a separate Java program that reads in the "big file" and builds a data structure from which you can answer those questions is a viable strategy (see the sketch below). Otherwise, if you think you can move around that file fast enough, you can also try to answer the questions manually on the fly.
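      2. For example, here is a minimal sketch of such a helper program. It assumes the merged file has one line per term consisting of the term, a tab character, and the JSON posting list (which is what the default TextOutputFormat and the sketch at the end of Step 3 would produce), and that the Gson library is on the classpath. The class name is a placeholder.
        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.TreeMap;

        import com.google.gson.Gson;

        public class QuizHelper {
          public static void main(String[] args) throws IOException {
            Gson gson = new Gson();
            TreeMap<String, Object[][]> postings = new TreeMap<String, Object[][]>();

            // args[0] is the merged posting list, e.g. big_file.txt
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
              String[] parts = line.split("\t", 2);
              postings.put(parts[0], gson.fromJson(parts[1], Object[][].class));
            }
            in.close();

            // args[1] is the term that a quiz question asks about
            String term = args[1];
            System.out.println("term before: " + postings.lowerKey(term));
            System.out.println("term after:  " + postings.higherKey(term));

            Object[][] list = postings.get(term);
            int documentFrequency = list.length;  // number of documents containing the term
            long termFrequency = 0;               // total count of the term in the corpus
            for (Object[] urlAndCount : list) {
              termFrequency += ((Number) urlAndCount[1]).longValue();
            }
            System.out.println("document frequency: " + documentFrequency);
            System.out.println("term frequency: " + termFrequency);
          }
        }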
  • Submitting your assignment
    1. Each individual, even if you are working in a group, will take a quiz which asks you to answer 15 questions about the posting list. Your group-mates should not help you. Each person should be able to run the required programs and/or work with the required data individually.
    2. The quiz will be administered via the Eureka quiz mechanism.
    3. The questions will be of the form:
      • What term comes alphabetically before term X?
      • What term comes alphabetically after term X?
      • What is the document frequency of word X?
      • (Corrected) What is the term frequency of word X in the corpus?
  • Evaluation:
    1. Correctness: Did you get the right answer?
    2. Due date: 11/20 11:59pm
    3. This is an assignment grade