org.kiji.mapreduce.produce
Class KijiProducer

java.lang.Object
  extended by org.kiji.mapreduce.produce.KijiProducer
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, KeyValueStoreClient

@ApiAudience.Public
@ApiStability.Stable
@Inheritance.Extensible
public abstract class KijiProducer
extends Object
implements org.apache.hadoop.conf.Configurable, KeyValueStoreClient

Base class for all Kiji Producers used to generate per-row derived entity data.

Lifecycle:

Instances are created using ReflectionUtils, so the Configuration is set immediately after instantiation with a call to setConf(). To initialize internal state before any other methods are called, override setConf().

As a KeyValueStoreClient, KijiProducers will have access to all stores defined by KeyValueStoreClient.getRequiredStores(). Readers for these stores are surfaced in the setup(), produce(), and cleanup() methods via the Context provided to each by calling KijiContext.getStore(String).

Once the internal state is set, functions may be called in any order, except for restrictions on setup(), produce(), and cleanup().

setup() is called once at the beginning of the map phase, followed by a call to produce() for each input row. Once all of these produce() calls have completed, cleanup() is called exactly once. This setup-produce-cleanup cycle may repeat any number of times.

A final guarantee is that setup(), produce(), and cleanup() will be called after getDataRequest() and getOutputColumn() have each been called at least once.
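
The ordering guarantees above can be sketched with a small stand-in. SketchProducer and LifecycleSketch are hypothetical names used only for illustration, strings stand in for the real Configuration, KijiDataRequest, and KijiRowData types, and the column names are made up. This is not how the Kiji framework is implemented; it only records one permitted call order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the producer contract; not the real KijiProducer.
abstract class SketchProducer {
  // Records the name of each lifecycle method as it is invoked.
  final List<String> calls = new ArrayList<>();

  void setConf(String conf) { calls.add("setConf"); }
  abstract String getDataRequest();
  abstract String getOutputColumn();
  void setup() { calls.add("setup"); }
  abstract void produce(String row);
  void cleanup() { calls.add("cleanup"); }
}

final class LifecycleSketch {
  private LifecycleSketch() {}

  /** Drives a single setup-produce-cleanup cycle in an order the contract allows. */
  static List<String> runOnce(SketchProducer producer, List<String> rows) {
    producer.setConf("conf");       // first call after instantiation
    producer.getDataRequest();      // called at least once before setup()
    producer.getOutputColumn();     // called at least once before setup()
    producer.setup();
    for (String row : rows) {
      producer.produce(row);        // once per input row
    }
    producer.cleanup();             // exactly once per cycle
    return producer.calls;
  }

  /** A producer that only records which methods were called. */
  static SketchProducer recordingProducer() {
    return new SketchProducer() {
      @Override String getDataRequest() { calls.add("getDataRequest"); return "info:email"; }
      @Override String getOutputColumn() { calls.add("getOutputColumn"); return "derived:domain"; }
      @Override void produce(String row) { calls.add("produce"); }
    };
  }
}
```

Because the cycle may repeat, a real producer should treat setup() and cleanup() as re-entrant rather than assuming they run once per object.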

Skeleton:

Any concrete implementation of a KijiProducer must implement the getDataRequest(), produce(org.kiji.schema.KijiRowData, org.kiji.mapreduce.produce.ProducerContext), and getOutputColumn() methods. An example of a produce method that extracts the domains from the email field of each row:


   public void produce(KijiRowData input, ProducerContext context)
       throws IOException {
     // Skip rows that have no email address.
     if (!input.containsColumn("info", "email")) {
       return;
     }
     String email = input.getMostRecentValue("info", "email").toString();
     int atSymbol = email.indexOf("@");
     if (atSymbol < 0) {
       // No '@' in the address; skip it rather than emit the whole string.
       return;
     }
     String domain = email.substring(atSymbol + 1);
     context.put(domain);
   }
 
For the entire code for this producer, check out EmailDomainProducer in KijiMR Lib.
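
The string handling in the example can be isolated into a plain helper so it can be unit-tested without a Kiji table. The sketch below is an illustration only: EmailDomains and domainOf are hypothetical names, and the choice to return an empty Optional for malformed input is an assumption, not something EmailDomainProducer is documented to do.

```java
import java.util.Optional;

// Hypothetical helper; not part of KijiMR or KijiMR Lib.
final class EmailDomains {
  private EmailDomains() {}

  /**
   * Returns the text after the first '@', or empty if the input is null,
   * has no '@', or has nothing after the '@'.
   */
  static Optional<String> domainOf(String email) {
    if (email == null) {
      return Optional.empty();
    }
    int atSymbol = email.indexOf('@');
    if (atSymbol < 0 || atSymbol == email.length() - 1) {
      return Optional.empty();
    }
    return Optional.of(email.substring(atSymbol + 1));
  }
}
```

A produce() implementation could then call such a helper and write a value only when one is present.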


Constructor Summary
KijiProducer()
          Your subclass of KijiProducer must have a default constructor if it is to be used in a KijiProduceJob.
 
Method Summary
 void cleanup(KijiContext context)
          Called once to clean up this producer after all produce(KijiRowData, ProducerContext) calls are made.
 org.apache.hadoop.conf.Configuration getConf()
          
abstract  org.kiji.schema.KijiDataRequest getDataRequest()
          Returns a KijiDataRequest that describes which input columns need to be available to the producer.
abstract  String getOutputColumn()
          Return the name of the output column.
 Map<String,KeyValueStore<?,?>> getRequiredStores()
          Returns a mapping that specifies the names of all key-value stores that must be loaded to execute this component, and default KeyValueStore definitions that can be used if the user does not specify alternate locations/implementations.
abstract  void produce(org.kiji.schema.KijiRowData input, ProducerContext context)
          Called to compute derived data for a single entity.
 void setConf(org.apache.hadoop.conf.Configuration conf)
          Sets the Configuration for this KijiProducer to use.
 void setup(KijiContext context)
          Called once to initialize this producer before any calls to produce(KijiRowData, ProducerContext).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KijiProducer

public KijiProducer()
Your subclass of KijiProducer must have a default constructor if it is to be used in a KijiProduceJob. The constructor should be lightweight, since the framework is free to create KijiProducers at any time.

Method Detail

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Sets the Configuration for this KijiProducer to use. This function is guaranteed to be called immediately after instantiation. Override this method to initialize internal state from a configuration.

If you override this method for your producer, you must call super.setConf(conf); otherwise the configuration will not be saved properly.

Specified by:
setConf in interface org.apache.hadoop.conf.Configurable
Parameters:
conf - The Configuration to read.
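
To illustrate why the super call matters, here is a minimal sketch using stand-in types. SketchBase, GoodProducer, ForgetfulProducer, and the "sketch.threshold" key are all hypothetical, and java.util.Properties stands in for Hadoop's Configuration so the example runs without Hadoop on the classpath.

```java
import java.util.Properties;

// Hypothetical stand-in for a base class that stores its configuration.
class SketchBase {
  private Properties conf;

  public void setConf(Properties conf) { this.conf = conf; }
  public Properties getConf() { return conf; }
}

// Calls super.setConf(conf) first, then derives its own state.
class GoodProducer extends SketchBase {
  int threshold;

  @Override
  public void setConf(Properties conf) {
    super.setConf(conf);  // lets the base class save the configuration
    threshold = Integer.parseInt(conf.getProperty("sketch.threshold", "10"));
  }
}

// Omits the super call, so the base class never sees the configuration.
class ForgetfulProducer extends SketchBase {
  int threshold;

  @Override
  public void setConf(Properties conf) {
    // Bug: super.setConf(conf) is missing; getConf() will return null.
    threshold = Integer.parseInt(conf.getProperty("sketch.threshold", "10"));
  }
}
```

In this sketch, ForgetfulProducer still reads its own setting correctly, which is what makes the mistake easy to miss: the failure only shows up later, when other code asks getConf() for the configuration.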

getConf

public org.apache.hadoop.conf.Configuration getConf()

Overriding this method without returning super.getConf() may cause undesired behavior.

Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

getDataRequest

public abstract org.kiji.schema.KijiDataRequest getDataRequest()
Returns a KijiDataRequest that describes which input columns need to be available to the producer. This method may be called multiple times, perhaps before setup(KijiContext).

Returns:
a Kiji data request.

getRequiredStores

public Map<String,KeyValueStore<?,?>> getRequiredStores()

Returns a mapping that specifies the names of all key-value stores that must be loaded to execute this component, and default KeyValueStore definitions that can be used if the user does not specify alternate locations/implementations. It is an error for any of these default implementations to be null. If you want to defer KeyValueStore definition to runtime, bind a name to the UnconfiguredKeyValueStore instead.

Note that this method returns default mappings from store names to concrete implementations. Users may override these mappings, e.g. in MapReduce job configuration. You should not open a store returned by getRequiredStores() directly; you should look to a Context object or similar mechanism exposed by the Kiji framework to determine the actual KeyValueStoreReader instance to use.

Specified by:
getRequiredStores in interface KeyValueStoreClient
Returns:
a map from store names to default KeyValueStore implementations.
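
The relationship between default bindings and user overrides can be sketched as a simple map merge. This is an illustration under stated assumptions, not the framework's actual resolution logic: StoreBindingSketch is a hypothetical name, strings stand in for KeyValueStore definitions, and ignoring overrides for undeclared store names is a choice made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of default store bindings plus user overrides.
final class StoreBindingSketch {
  private StoreBindingSketch() {}

  /**
   * Starts from the defaults declared by the component, rejects null defaults,
   * then applies overrides for names the component actually declared.
   */
  static Map<String, String> resolve(
      Map<String, String> defaults, Map<String, String> overrides) {
    Map<String, String> resolved = new HashMap<>();
    for (Map.Entry<String, String> entry : defaults.entrySet()) {
      if (entry.getValue() == null) {
        throw new IllegalArgumentException("Null default for store: " + entry.getKey());
      }
      resolved.put(entry.getKey(), entry.getValue());
    }
    for (Map.Entry<String, String> entry : overrides.entrySet()) {
      // Assumption of this sketch: overrides for undeclared names are ignored.
      if (resolved.containsKey(entry.getKey())) {
        resolved.put(entry.getKey(), entry.getValue());
      }
    }
    return resolved;
  }
}
```

The point of the indirection is that the binding in effect at runtime may differ from the declared default, which is why the store should be obtained through the Context rather than opened from getRequiredStores() directly.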

getOutputColumn

public abstract String getOutputColumn()
Returns the name of the output column. An output column is of the form "family" or "family:qualifier". A "family" column can store key/value pairs. A "family:qualifier" column may only contain a single piece of data.

Returns:
the output column name.
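
The two accepted forms, "family" and "family:qualifier", can be told apart by the presence of a colon. The parser below is a hedged sketch for illustration; OutputColumnSketch is a hypothetical class, not Kiji's own column-name handling, and rejecting empty parts is an assumption of this sketch.

```java
// Hypothetical parser for the "family" / "family:qualifier" forms.
final class OutputColumnSketch {
  final String family;
  final String qualifier;  // null when the name targets a whole family

  private OutputColumnSketch(String family, String qualifier) {
    this.family = family;
    this.qualifier = qualifier;
  }

  /** Splits a column name at the first ':'; rejects null or empty parts. */
  static OutputColumnSketch parse(String name) {
    if (name == null || name.isEmpty()) {
      throw new IllegalArgumentException("Column name must not be empty");
    }
    int colon = name.indexOf(':');
    if (colon < 0) {
      return new OutputColumnSketch(name, null);
    }
    String family = name.substring(0, colon);
    String qualifier = name.substring(colon + 1);
    if (family.isEmpty() || qualifier.isEmpty()) {
      throw new IllegalArgumentException("Malformed column name: " + name);
    }
    return new OutputColumnSketch(family, qualifier);
  }

  /** True for "family:qualifier", false for a bare "family". */
  boolean isFullyQualified() {
    return qualifier != null;
  }
}
```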

setup

public void setup(KijiContext context)
           throws IOException
Called once to initialize this producer before any calls to produce(KijiRowData, ProducerContext).

Parameters:
context - The KijiContext providing access to KeyValueStores, Counters, etc.
Throws:
IOException - on I/O error.

produce

public abstract void produce(org.kiji.schema.KijiRowData input,
                             ProducerContext context)
                      throws IOException
Called to compute derived data for a single entity. The input that is included is controlled by the KijiDataRequest returned in getDataRequest().

Parameters:
input - The requested input data for the entity.
context - The producer context, used to output derived data.
Throws:
IOException - on I/O error.

cleanup

public void cleanup(KijiContext context)
             throws IOException
Called once to clean up this producer after all produce(KijiRowData, ProducerContext) calls are made.

Parameters:
context - The KijiContext providing access to KeyValueStores, Counters, etc.
Throws:
IOException - on I/O error.


Copyright © 2012-2014 WibiData, Inc. All Rights Reserved.