Saturday 21 June 2014

Using Apache Pig to Perform Hadoop MapReduce for the WordCount Problem

1. Download the Apache Hadoop and Apache Pig packages along with JDK 1.7, install them at specific locations, and add them to your path by editing /etc/profile (> sudo gedit /etc/profile); a sketch of typical entries follows this step. For the Pig installation, refer to http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.1/CDH4-Installation-Guide/cdh4ig_topic_16_2.html.
I have kept my wordcount files inside a pig-wordcount directory, as shown above.
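
The exact install locations vary from machine to machine; the lines below are a minimal sketch assuming Hadoop, Pig, and the JDK were unpacked under /usr/local and /usr/lib/jvm (replace these paths with your own):

    # illustrative paths - adjust to your actual install locations
    export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
    export HADOOP_HOME=/usr/local/hadoop
    export PIG_HOME=/usr/local/pig
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PIG_HOME/bin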

2. There are two files inside the directory: one is the input file and the other is the Pig Latin code file for wordcount.

3. As seen above, the input file to the Pig script consists of different sentences, or more precisely, words.
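
Since the screenshot is not reproduced here, a hypothetical input file of the same shape might look like:

    hello pig
    hello hadoop
    pig runs on hadoop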

4. This is the heart of the example, i.e. the Pig Latin wordcount code, which is much simpler and shorter than the equivalent Java code. It counts the occurrences of each word.
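
The original script is shown in the screenshot above; a minimal Pig Latin wordcount matching this description (the input file name and relation names here are my assumptions) would be:

    -- load each line of the input file as a single chararray
    lines = LOAD 'input.txt' AS (line:chararray);
    -- split every line into individual words, one word per record
    words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- group identical words together
    grouped = GROUP words BY word;
    -- count how many times each word occurs
    counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
    -- write the result into the wordcount output directory
    STORE counts INTO 'wordcount';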

5. Now open a terminal and go to the location where the files are stored.
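
Assuming the pig-wordcount directory from step 1 sits in your home directory, this would be:

    > cd ~/pig-wordcount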

6. Check whether Pig is working properly by running the 'pig' command.
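
Running pig with no arguments should drop you into Pig's interactive Grunt shell, which you can leave with quit:

    > pig
    grunt> quit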

7. Now run the Pig script with the following command:
    > pig wordcount.pig
where wordcount.pig is the name of the Pig script file, with the .pig extension.
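
As a side note, if you want to try the script without a running Hadoop cluster, Pig can also execute it in local mode:

    > pig -x local wordcount.pig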

8. This is how the MapReduce job starts running to perform the operation and get the desired result.

9. After completion you will get the above message, and you will also observe that a wordcount directory has been created which contains the output, as seen above.
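
You can inspect the result from the terminal (the part-file name may differ on your setup):

    > cat wordcount/part-r-00000

or, if the job wrote its output to HDFS:

    > hadoop fs -cat wordcount/part-r-00000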

10. So the final output is the count of each word from the input file.
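
For the hypothetical input shown earlier, the output would be tab-separated word/count pairs resembling (order may vary):

    hello   2
    pig     2
    hadoop  2
    runs    1
    on      1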

So what's the big deal? We could achieve this using any normal programming language!
Firstly, this is not a common programming language like Java, PHP, ASP, or any other. It is 'Pig Latin', a language used along with Apache Pig for processing and analyzing very large datasets containing billions of records. It processes the data by using Apache Hadoop's MapReduce functionality to get the output.
For now we have considered a very small dataset, so you might not observe its beauty; it works great for processing very large datasets (in terabytes or petabytes). Many companies like Facebook, Twitter, or Google use this architecture for processing their data.
Even in my last blog, where I showed data analysis using Microsoft Power BI, processing and getting the desired result took a huge amount of time. That is the benefit of using Hadoop.
Thank you!