Pig Latin Operators

Pig Latin provides a number of operators which filter, join, or otherwise organize data.

FOREACH: The FOREACH command operates on each element of a data bag. This is useful, for instance, for processing each input record in a bag returned by a LOAD statement.

FOREACH bagname GENERATE expression, expression...

This statement iterates over the contents of a bag. It applies the expressions on the right of the GENERATE keyword to the data provided by the current record emitted from the bag. The expressions may be, for example, the names of fields. So to extract the names of all users who accessed the site (based on the query_log.txt example shown above), we could write a query like:

FOREACH queries GENERATE userId;

In the FOREACH statement, each element of the bag is considered independently. No expression can reference more than one element of the bag at a time; this allows the statement to be processed in parallel using Hadoop MapReduce.

Expressions emitted by the GENERATE element are not limited to the names of fields; they can be fields (by name like userId or by position like $0), constants, algebraic operations, map lookups, conditional expressions, or FLATTEN expressions, described below.
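
The record-at-a-time behavior of FOREACH ... GENERATE can be modeled outside Pig. The following Python sketch is only an illustration of the semantics, not how Pig executes the statement; the helper foreach_generate and the fields queryString and timestamp are assumptions made for the example.

```python
# Hypothetical records standing in for the bag loaded from query_log.txt.
# Only userId appears in the text; the other field names are assumptions.
queries = [
    {"userId": "alice", "queryString": "hadoop pig", "timestamp": 1200},
    {"userId": "bob", "queryString": "mapreduce", "timestamp": 1260},
]


def foreach_generate(bag, *expressions):
    """Model of FOREACH bag GENERATE expr, expr...

    Each expression sees exactly one record, so every output tuple
    depends on a single input record and records could be processed
    in parallel.
    """
    return [tuple(expr(record) for expr in expressions) for record in bag]


rows = foreach_generate(
    queries,
    lambda r: r["userId"],           # a field referenced by name
    lambda r: 1,                     # a constant
    lambda r: r["timestamp"] // 60,  # an algebraic operation on a field
)
print(rows)
```

Because each expression receives one record and nothing else, the order in which records are visited does not affect any individual output tuple.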

Finally, these expressions may also call user-provided functions written in Java. These functions have access to the entire current record through a Pig library. In this way, Pig can do the heavy lifting of record-by-record mapping while an application-specific Java function performs tricky parsing or evaluation logic. Pig also provides several of the most commonly needed functions, such as COUNT, AVG, MIN, MAX, and SUM.

FLATTEN is an expression which will eliminate a level of nesting. Given a tuple which contains a bag, FLATTEN will emit several tuples each of which contains one record from the bag. For example, if we had a bag of records containing a person's name and a list of types of pets they own:

(Alice, { turtle, goldfish, cat })
(Bob, { dog, cat })

A FLATTEN command would eliminate the inner bags like so:

(Alice, turtle)
(Alice, goldfish)
(Alice, cat)
(Bob, dog)
(Bob, cat)
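
As a rough model of what FLATTEN does to the data above, here is a short Python sketch. Lists stand in for Pig bags (which are unordered in Pig itself), and the function name flatten is chosen for the illustration only.

```python
# Each record pairs a name with a nested "bag" of pets.
people_pets = [
    ("Alice", ["turtle", "goldfish", "cat"]),
    ("Bob", ["dog", "cat"]),
]


def flatten(records):
    """Emit one (name, pet) tuple per element of each inner bag."""
    return [(name, pet) for name, pets in records for pet in pets]


flat = flatten(people_pets)
for row in flat:
    print(row)
```

A record whose inner bag is empty contributes no output tuples in this model.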

FILTER statements iterate over a bag and return a new bag containing all elements which pass a conditional expression, e.g.:

adults = FILTER people BY age > 21;
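
In Python terms, the FILTER statement above behaves roughly like a list comprehension with a predicate. The sample people records and the helper filter_by are hypothetical, made up for this sketch.

```python
# Hypothetical "people" bag; the names and ages are invented for illustration.
people = [
    {"name": "Ann", "age": 34},
    {"name": "Ben", "age": 21},
    {"name": "Cy", "age": 17},
]


def filter_by(bag, predicate):
    """Return a new bag holding only the records that pass the predicate."""
    return [record for record in bag if predicate(record)]


# Mirrors: adults = FILTER people BY age > 21;
adults = filter_by(people, lambda r: r["age"] > 21)
print(adults)
```

Note that the comparison is strict, so a record with age exactly 21 is not kept.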

The COGROUP and JOIN operations perform similar functions: they unite related data elements from multiple data sets. The difference is that JOIN acts like the SQL JOIN statement, creating a flat set of output records containing the joined cross-product of the input records. The COGROUP operator, on the other hand, groups the elements by their common field and returns a set of records each containing two separate bags. The first bag is the records of the first data set with the common field, and the second bag is the records of the second data set containing the common field.

To illustrate the difference, suppose we had the flattened data set mapping people to their pets, and another flattened data set mapping people to their friends. We could combine these into a "pets of friends" data set. Here are the input data sets:

pets: (owner, pet)
----------------------
(Alice, turtle)
(Alice, goldfish)
(Alice, cat)
(Bob, dog)
(Bob, cat)

friends: (friend1, friend2)
----------------------
(Cindy, Alice)
(Mark, Alice)
(Paul, Bob)

Here is what is returned by COGROUP:

COGROUP pets BY owner, friends BY friend2; returns:

( Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)},
{(Cindy, Alice), (Mark, Alice)} )
( Bob, {(Bob, dog), (Bob, cat)}, {(Paul, Bob)} )

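Contrast this with the more familiar, non-hierarchical JOIN operator. (Pig's actual JOIN output keeps every field from both inputs, so each record would also carry friend2; the listing below omits that repeated key for readability.)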

JOIN pets BY owner, friends BY friend2; returns:

(Alice, turtle, Cindy)
(Alice, turtle, Mark)
(Alice, goldfish, Cindy)
(Alice, goldfish, Mark)
(Alice, cat, Cindy)
(Alice, cat, Mark)
(Bob, dog, Paul)
(Bob, cat, Paul)
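
The relationship between the two operators can be sketched in Python: a join is the per-key cross product of the two bags that a cogroup produces. This is a simplified model under the assumption that lists stand in for bags; the functions cogroup and join are illustrative helpers, not Pig APIs. The joined rows here keep all four fields, and a final projection drops the repeated key to match the three-field listing above.

```python
from collections import defaultdict

pets = [
    ("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
    ("Bob", "dog"), ("Bob", "cat"),
]
friends = [("Cindy", "Alice"), ("Mark", "Alice"), ("Paul", "Bob")]


def cogroup(left, left_key, right, right_key):
    """Group both inputs by key; return (key, left_bag, right_bag) tuples."""
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[record[left_key]][0].append(record)
    for record in right:
        groups[record[right_key]][1].append(record)
    ordered = sorted(groups.items(), key=lambda kv: kv[0])
    return [(key, bags[0], bags[1]) for key, bags in ordered]


def join(left, left_key, right, right_key):
    """Inner join: the cross product of the two bags within each cogroup."""
    return [
        l + r
        for _key, ls, rs in cogroup(left, left_key, right, right_key)
        for l in ls
        for r in rs
    ]


cogrouped = cogroup(pets, 0, friends, 1)
joined = join(pets, 0, friends, 1)
# Drop the repeated key (friend2) to get (owner, pet, friend1).
pets_of_friends = [(owner, pet, f1) for owner, pet, f1, _f2 in joined]
print(pets_of_friends)
```

A key that appears in only one input still shows up in the cogroup result with one empty bag, but it produces no rows in the join.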

In general, the COGROUP command supports grouping on as many data sets as desired, so three or more data sets can be combined in this fashion. It is also possible to group the elements of a single data set; this is supported through an alternate keyword, GROUP.

A GROUP ... BY statement organizes a bag of records into bags of related items based on the field identified as their common key. For example, the pets bag from the previous example could be grouped with:

GROUP pets BY owner; returns:

( Alice, {(Alice, turtle), (Alice, goldfish), (Alice, cat)} )
( Bob, {(Bob, dog), (Bob, cat)} )

In this way, GROUP and FLATTEN are effectively inverses of one another.
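
The inverse relationship can be checked with a small Python model. The helpers group_by and flatten_groups are names invented for this sketch, and lists are used in place of bags, so the round trip here preserves order; in Pig, bags are unordered and the round trip holds only up to ordering.

```python
pets = [
    ("Alice", "turtle"), ("Alice", "goldfish"), ("Alice", "cat"),
    ("Bob", "dog"), ("Bob", "cat"),
]


def group_by(bag, key_index):
    """Model of GROUP bag BY field: return (key, bag_of_records) tuples."""
    groups = {}
    for record in bag:
        groups.setdefault(record[key_index], []).append(record)
    return list(groups.items())


def flatten_groups(grouped):
    """Undo the grouping by emitting every record from every inner bag."""
    return [record for _key, records in grouped for record in records]


grouped = group_by(pets, 0)
round_trip = flatten_groups(grouped)
print(grouped)
print(round_trip == pets)
```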

More complicated statements can be built as well: an operation which expects a data set as input does not need an explicitly named data set; it can use one generated "inline" by another FILTER, GROUP, or other statement.

When the final data set has been created by a Pig Latin script, the output can be saved to a file with the STORE command, which follows the form:

STORE dataset INTO 'filename' USING function();

The provided function specifies how to serialize the data to the file; if it is omitted, then a default serializer will write plain-text tab-delimited files.
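
To make the default output format concrete, here is a Python sketch of a tab-delimited serializer. It only imitates the shape of the plain-text output described above; the store function is a hypothetical helper, and an in-memory buffer is used in place of a real file.

```python
import io


def store(bag, out, delimiter="\t"):
    """Write one line per record, with fields separated by the delimiter."""
    for record in bag:
        out.write(delimiter.join(str(field) for field in record) + "\n")


buffer = io.StringIO()
store([("Alice", "turtle"), ("Bob", "dog")], buffer)
text = buffer.getvalue()
print(text, end="")
```

Passing a different delimiter plays the role of supplying an alternate serialization function in this model.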

A number of additional operators exist for the purposes of removing duplicate records, sorting records, etc. This paper explains the additional operators and expression syntaxes in greater detail.
