First Steps with Pig and Hadoop

By Markus Klems

Today I wanted to do some practical work and try out Pig, the system described in Yahoo’s paper “Pig Latin: A Not-So-Foreign Language for Data Processing” (PDF). Pig is a dataflow programming environment for processing large files, built on MapReduce / Hadoop. I will describe what I did step by step so that you can replicate my results if you feel like it. The basic steps to get started are well documented on the Apache website.

1. First I started one of Eric Hammond’s fantastic Ubuntu EC2 AMIs: ami-0757b26e

2. I logged in via ssh and configured the work environment:

  • apt-get install sun-java6-jdk ant ant-optional subversion
  • svn co http://svn.apache.org/repos/asf/incubator/pig/trunk
  • mv trunk pig
  • export JAVA_HOME=/usr/lib/jvm/java-6-sun
  • export PIGDIR=~/pig
  • cd pig
  • ant
  • cd tutorial
  • ant
  • tar -xzf pigtutorial.tar.gz

The build should now have completed successfully, and we can try out some basic Pig commands. The PigTutorial is a good starting point.

First: Tutorial Pig Script in Local Mode

The Query Phrase Popularity script [...] processes a search query log file from the Excite search engine and finds search phrases that occur with particular high frequency during certain times of the day.

Move to $PIGDIR/tutorial/pigtmp (the directory created by unpacking pigtutorial.tar.gz) and execute

  • java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig

After running the script, the file script1-local-results.txt was successfully created and shows some results. Let’s move on and do something on our own.

Second: Analyze Wikipedia Logs (again local mode)

I created the directory $PIGDIR/wikianalysis and copied some user log data from Wikipedia into the file 'stats'. Make sure that the 'stats' file does not contain any empty lines: each empty line becomes an empty tuple when the data is loaded into an alias, and you will end up with error messages like

ERROR org.apache.pig.tools.grunt.GruntParser – java.io.IOException: Unable to open iterator for alias [...] Caused by: java.lang.IndexOutOfBoundsException: Requested index 1 from tuple () [...]
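
One way to avoid this error is to strip empty lines from the file before loading it. The following is only a minimal sketch, assuming a POSIX shell with grep; the function name strip_blank_lines and the file name stats.raw are made up for illustration, and the sample rows are just placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a file without empty or whitespace-only lines.
strip_blank_lines() {
    grep -v '^[[:space:]]*$' "$1"
}

# Demo on a throwaway sample file (illustrative data, not the real log).
workdir=$(mktemp -d)
printf '10/16/2002\t1102\n\n6/28/2006\t1005\n   \n' > "$workdir/stats.raw"

# Write the cleaned copy that would then be loaded into Pig.
strip_blank_lines "$workdir/stats.raw" > "$workdir/stats"
cat "$workdir/stats"
```

Writing to a separate file keeps the raw copy around in case the cleanup removes more than expected.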

Then I started the Pig command line via

  • java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local

Now I can see the command line prompt “grunt>”. With the first command I load the 'stats' file and assign the names 'week' and 'creations' to the first two fields of each tuple. The Wikipedia data is tab-delimited, so PigStorage('\t') should be the way to go. By the way, tab is the default delimiter, so you could leave out the PigStorage('\t') part. You can also access the fields by position via $0, $1, etc. instead of by name.

  • table = LOAD 'stats' USING PigStorage('\t') AS (week, creations);

Next, I want to apply one of the basic Pig Latin commands: FILTER. Let’s find the weeks where more than 1000 page creations took place (remark: the user log data in this example is only a small fraction of the total Wikipedia user data).

  • heavy_week = FILTER table BY creations > '1000';

Let’s have a look at the results:

  • dump heavy_week;

(10/16/2002, 1102)
(6/28/2006, 1005)
(8/23/2006, 1037)
(7/18/2007, 1164)
(7/25/2007, 1120)
(8/22/2007, 1053)
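
As a sanity check outside of Pig, the same filter can be approximated with awk on the tab-delimited file. This is only a sketch under stated assumptions: filter_heavy_weeks is a hypothetical helper name, the 950 row is made-up sample data, and awk compares the second field as a number, which may not match Pig's comparison semantics in every case.

```shell
#!/bin/sh
# Hypothetical cross-check: print rows whose second tab-separated field
# is numerically greater than the given threshold.
filter_heavy_weeks() {
    awk -F '\t' -v min="$2" '($2 + 0) > (min + 0)' "$1"
}

# Demo on a throwaway sample file; the 950 row is invented for illustration.
workdir=$(mktemp -d)
printf '10/16/2002\t1102\n10/23/2002\t950\n6/28/2006\t1005\n' > "$workdir/stats"

# Only rows with more than 1000 creations should be printed.
filter_heavy_weeks "$workdir/stats" 1000
```

If the rows printed here differ from what dump shows, the file format or the comparison in the FILTER statement is the first thing to check.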

Future plans

So far my little experiment has only run in Pig local mode, which is not very interesting because nothing is distributed. Next time I will try to run it on a Hadoop cluster.

Hint

In order to use the Wikipedia user log data provided by “Dragon flight”, I clicked on “edit this page” and copied the text table. Then I pasted it into an Excel spreadsheet and copied the first two columns (week and #creations) into my vi editor on the AMI instance. Perhaps I made this more complicated than necessary, but on the first run I had some issues with the formatting of my file and wanted to be sure on the second attempt.
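
Formatting issues of this kind can be spotted before starting Pig by checking that every line has exactly two tab-separated fields. The following is a rough sketch, assuming a POSIX shell with awk; find_malformed_lines is a hypothetical helper name and the sample file is deliberately broken for illustration.

```shell
#!/bin/sh
# Hypothetical check: report every line that does not consist of exactly
# two tab-separated fields (for example a space instead of a tab, or an
# empty line left over from copy and paste).
find_malformed_lines() {
    awk -F '\t' 'NF != 2 { printf "line %d: %s\n", NR, $0 }' "$1"
}

# Demo: line 2 uses a space instead of a tab, line 3 is empty.
workdir=$(mktemp -d)
printf '10/16/2002\t1102\n6/28/2006 1005\n\n' > "$workdir/stats"

# Prints nothing when the file is clean.
find_malformed_lines "$workdir/stats"
```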
