Pig Tutorial: Introduction

Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.

Pig Latin programs execute in a distributed fashion on a cluster. Our current implementation compiles Pig Latin programs into Map/Reduce jobs, and executes them using Hadoop on Kryptonite.

Thur purpose of Pig is to answer queries over semi-structured data such as log files. Large volumes of data are in mostly-organized formats such as log files, which define a set of standard fields for each entry. While the MapReduce programming model on top of Hadoop provides a convenient mechanism for delivering a large volume of log-structured information to an analysis program, writing analyses of mostly-structured information involves writing a large amount of tedious processing code.

Pig is a high-level language for writing queries over this sort of data. A query planner then compiles queries written in this language (called "Pig Latin") into maps and reduces which are then executed on a Hadoop cluster.

Anything which could be written in Pig can also be implemented as straight Java-based Hadoop MapReduce. But while individual programmers could develop their own suite of data analysis functions such as join, order by, etc., this requires individual programmers to develop their own (non-standard) libraries, and test them. Pig provides a tested and supported suite of the most common data-aggregation functions. It also allows programmers to provide their own application-specific code for purposes of loading and saving data, as well as for performing more complicated record-by-record evaluations.

Introduction

1 comment: