The first step in using Pig is to load data into a program. Pig provides a LOAD statement for this purpose. Its general form is: result = LOAD 'filename' USING fn() AS (field1, field2, ...);
This statement returns a bag containing every record in the named file. Each record in the bag is a tuple whose fields are named field1, field2, and so on. fn() is a user-provided function that reads the data. Pig supports user-provided Java code throughout to handle application-specific parsing. Pig Latin itself is the "glue" that holds these application-specific functions together, routing records and other data between them.
An example data loading command (taken from this paper on Pig) is:
queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);
A user-defined load function such as myLoad() is optional. If the USING clause is omitted, Pig falls back to a default loader that parses tab-delimited records. Likewise, if the programmer omits the field names in the AS clause, the fields are addressed by position: $0, $1, and so forth.
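To make positional addressing concrete, here is a minimal sketch that loads the same log with neither a USING nor an AS clause. It assumes query_log.txt is tab-delimited with the user ID in the first column and the query string in the second; the alias name user_queries is made up for illustration.

```pig
-- No USING clause: Pig applies its default tab-delimited loader.
-- No AS clause: fields have no names and are referenced by position.
queries = LOAD 'query_log.txt';

-- $0 is the first field of each tuple, $1 the second.
user_queries = FOREACH queries GENERATE $0, $1;

-- Print the resulting tuples to the console.
DUMP user_queries;
```

Named fields are usually easier to read than $0 and $1, so positional references are best kept for quick exploration or for files whose layout is not known ahead of time.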
The default loader is called PigStorage(). It reads files of character-delimited tuple records. These tuples must contain only atomic values; e.g., cat, turtle, fish. Other loaders are listed in the PigBuiltins page of the Pig wiki. PigStorage() takes the field delimiter character as its argument. For example, to load a table of three tab-delimited fields, the following statement can be used:
data = LOAD 'tab_delim_data.txt' USING PigStorage('\t') AS (user, time, query);
A different argument could be passed to PigStorage() to read comma- or space-delimited fields.
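As a sketch of what that looks like, the statements below change only the delimiter argument. The file names comma_delim_data.txt and space_delim_data.txt and the aliases are hypothetical; the field names are reused from the tab-delimited example.

```pig
-- Comma-delimited input: pass ',' as the field delimiter.
csv_data = LOAD 'comma_delim_data.txt' USING PigStorage(',') AS (user, time, query);

-- Space-delimited input: pass a single space instead.
space_data = LOAD 'space_delim_data.txt' USING PigStorage(' ') AS (user, time, query);

-- DESCRIBE prints the schema Pig has associated with an alias.
DESCRIBE csv_data;
```

Note that a single-character delimiter splits on every occurrence, so a query string that itself contains spaces or commas would be broken into extra fields; a tab delimiter is often the safer choice for free-text columns.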