Loading Data Into Pig

The first step in using Pig is to load data into a program. Pig provides a LOAD statement for this purpose. Its format is: result = LOAD 'filename' USING fn() AS (field1, field2, ...).

This statement returns a bag containing all the data in the named file. Each record in the bag is a tuple whose fields are named field1, field2, and so on. fn() is a user-provided function that reads in the data. Pig supports user-provided Java code throughout to handle the application-specific parts of parsing. Pig Latin itself is the "glue" that holds these application-specific functions together, routing records and other data between them.

An example data-loading command (taken from a paper on Pig) is:

queries = LOAD 'query_log.txt'
    USING myLoad()
    AS (userId, queryString, timestamp);

A user-defined load function such as myLoad() is optional. If the USING clause is omitted, a default loader parses the input as tab-delimited records. If the programmer does not specify field names in the AS clause, the fields are addressed by position: $0, $1, and so forth.
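The behavior described above can be modeled outside of Pig. The following Python sketch is only an illustration, not Pig's implementation: the names load, tab_loader, using, and as_fields are hypothetical, and the sample records are made up. It shows a loader function turning lines into tuples, with fields addressed either by name or by position.

```python
from typing import Callable, Iterable, Optional, Sequence

# Hypothetical stand-in for a loader function: one text line in, one tuple out.
Loader = Callable[[str], tuple]

def tab_loader(line: str) -> tuple:
    """Split a line on tabs, mimicking a tab-delimited default loader."""
    return tuple(line.rstrip("\n").split("\t"))

def load(lines: Iterable[str],
         using: Loader = tab_loader,
         as_fields: Optional[Sequence[str]] = None) -> list:
    """Model of LOAD: return a 'bag' (here a list) of records.

    With as_fields, each record is a dict keyed by field name.
    Without it, each record is a plain tuple addressed by position,
    so record[0] plays the role of $0, record[1] of $1, and so on.
    """
    bag = []
    for line in lines:
        fields = using(line)
        if as_fields is None:
            bag.append(fields)
        else:
            bag.append(dict(zip(as_fields, fields)))
    return bag

# Made-up sample data standing in for the contents of a query log.
sample = ["u1\tpig latin\t1001\n", "u2\thadoop\t1002\n"]
named = load(sample, as_fields=("userId", "queryString", "timestamp"))
positional = load(sample)

print(named[0]["queryString"])   # field addressed by name
print(positional[1][0])          # field addressed by position, like $0
```

In this model, supplying a different function for using corresponds to writing a custom loader, while leaving it out falls back to tab-delimited parsing.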

The default loader is called PigStorage(). This loader can read files containing character-delimited tuple records. These tuples must contain only atomic values; e.g., cat, turtle, fish. Other loaders are listed in the PigBuiltins page of the Pig wiki. PigStorage() takes as an argument the character to use to delimit fields. For example, to load a table of three tab-delimited fields, the following statement can be used:

data = LOAD 'tab_delim_data.txt' USING PigStorage('\t') AS (user, time, query);

Passing a different argument, such as ',' or ' ', makes PigStorage() read comma- or space-delimited fields instead.
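As a rough illustration of a loader parameterized by its delimiter, the Python sketch below builds a line parser from a delimiter character. The helper make_delimited_loader is hypothetical and is not part of Pig; it only mirrors the idea of passing the delimiter as an argument.

```python
def make_delimited_loader(delimiter="\t"):
    """Return a parser that splits one line into a tuple of atomic string fields.

    Hypothetical helper for illustration; the delimiter defaults to a tab.
    """
    def parse(line):
        return tuple(line.rstrip("\n").split(delimiter))
    return parse

tab_parse = make_delimited_loader("\t")
comma_parse = make_delimited_loader(",")
space_parse = make_delimited_loader(" ")

# The same three fields (user, time, query) read from differently delimited lines.
print(tab_parse("alice\t12:00\tpig latin"))
print(comma_parse("alice,12:00,pig latin"))
print(space_parse("alice 12:00 pig"))
```

Note that with a space delimiter a multi-word query would be split into several fields, which is one reason tab is a common choice for records containing free text.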

Pig Latin Data Types

Values in Pig Latin can be expressed by four basic data types:

* An atom is any atomic value; e.g., "fish".
* A tuple is a record of multiple values with fixed arity; e.g., ("dog", "sparky").
* A data bag is a collection of an arbitrary number of values; e.g., {("dog", "sparky"), ("fish", "goldie")}. Data bags support a scan operation for iterating through their contents.
* A data map is a collection with a lookup function translating keys to values; e.g., ["age" : 25].

All data types are fully nestable: bags may contain tuples, maps may contain bags or other maps, and so on. This differs from a traditional database model, where data must be normalized into lists of atoms. Composing data types in this manner lets Pig queries line up better with the programmer's conceptual model of the data. Data types may also be heterogeneous. For example, the fields of a tuple may each have a different type; some may be atoms, others may be nested tuples, and so on. The values in a bag may have different types, as may the values in a data map, and these types can vary from one record to the next in the bag. Data map keys must be atoms, for efficiency reasons.
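To make the nesting rules concrete, here is a small Python sketch that models the four types with built-in Python values. The mapping (atom as str/int/float, tuple as tuple, bag as list, map as dict) and the helper names are assumptions chosen for illustration; they do not reflect how Pig represents data internally.

```python
# Python stand-ins (assumptions for illustration, not Pig's implementation):
# atom -> str/int/float, tuple -> tuple, bag -> list, map -> dict
ATOM_TYPES = (str, int, float)

def is_atom(value):
    """True if the value is an atomic (non-nested) value in this model."""
    return isinstance(value, ATOM_TYPES)

def check_map_keys(value):
    """Recursively confirm that every map key in a nested value is an atom."""
    if isinstance(value, dict):
        return all(is_atom(k) and check_map_keys(v) for k, v in value.items())
    if isinstance(value, (tuple, list)):
        return all(check_map_keys(v) for v in value)
    return True

pets = [("dog", "sparky"), ("fish", "goldie")]       # bag of tuples
owner = ("alice", pets, {"age": 25, "pets": pets})   # tuple holding an atom, a bag, and a map
mixed_bag = [("dog", "sparky"), ("cat",), "fish"]    # values in one bag may differ in shape

print(check_map_keys(owner))
# Scan the nested bag, as a data bag's scan operation would.
for species, name in owner[1]:
    print(species, name)
```

The check_map_keys helper reflects the one restriction stated above: values can nest freely, but map keys must stay atomic.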