Parsing command line arguments and passing them around in your Flink application
Almost all Flink applications, both batch and streaming, rely on external configuration parameters.
Examples include input and output locations (such as paths or addresses), system parameters (parallelism, runtime configuration), and application-specific parameters (often used within the user functions).
Since version 0.9, Flink provides a simple utility called ParameterTool that offers basic tooling for solving these problems.
Please note that you don’t have to use the ParameterTool explained here. Other frameworks such as Commons CLI and argparse4j also work well with Flink.
Getting your configuration values into the ParameterTool
The ParameterTool provides a set of predefined static methods for reading the configuration. Internally, the tool expects a Map<String, String>, so it’s very easy to integrate it with your own configuration style.
From .properties files
The static method ParameterTool.fromPropertiesFile reads a Properties file and provides the key/value pairs.
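Independently of Flink, the key/value form of a .properties file can be sketched with the JDK alone. The following is a minimal illustration of how such a file maps onto a Map<String, String>, not the ParameterTool implementation; the class and method names (PropertiesToMap, load) are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of Flink).
final class PropertiesToMap {

    /** Reads .properties-formatted text and copies every entry into a String map. */
    static Map<String, String> load(Reader reader) {
        Properties props = new Properties();
        try {
            props.load(reader);
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        Map<String, String> result = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // In a real program the Reader would come from a file on disk.
        Map<String, String> config =
                load(new StringReader("input=hdfs:///mydata\nelements=42\n"));
        System.out.println(config);
    }
}
```

A map built this way has the shape that, according to the section above, the ParameterTool expects internally.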
From the command line arguments
ParameterTool.fromArgs(args) reads arguments such as --input hdfs:///mydata --elements 42 from the command line.
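To make the --key value convention concrete, here is a small JDK-only sketch that turns such an argument array into a Map<String, String>. It is a simplified stand-in, not the ParameterTool's own parser: the names ArgsToMap and parse are hypothetical, and the sketch assumes every key is followed by exactly one value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Flink).
final class ArgsToMap {

    /** Parses ["--input", "hdfs:///mydata", "--elements", "42"] into key/value pairs. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < args.length) {
            String token = args[i];
            if (!token.startsWith("--") || token.length() == 2) {
                throw new IllegalArgumentException("Expected --key but found: " + token);
            }
            String key = token.substring(2);
            // Assumption of this sketch: a key without a following value is an error.
            if (i + 1 >= args.length || args[i + 1].startsWith("--")) {
                throw new IllegalArgumentException("Missing value for --" + key);
            }
            result.put(key, args[i + 1]);
            i += 2;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"--input", "hdfs:///mydata", "--elements", "42"}));
    }
}
```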
From system properties
When starting a JVM, you can pass system properties to it: -Dinput=hdfs:///mydata. You can also initialize the ParameterTool from these system properties with ParameterTool.fromSystemProperties().
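For readers unfamiliar with JVM system properties, the sketch below shows how a value passed as -Dinput=hdfs:///mydata becomes visible to the program and can be copied into a Map<String, String>. It uses only the JDK; the class and method names (SystemPropertiesToMap, snapshot) are hypothetical, and main sets the property programmatically only to simulate the -D flag.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, for illustration only (not part of Flink).
final class SystemPropertiesToMap {

    /** Copies the current JVM system properties into a String map. */
    static Map<String, String> snapshot() {
        Properties props = System.getProperties();
        Map<String, String> result = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            result.put(name, props.getProperty(name));
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulates starting the JVM with -Dinput=hdfs:///mydata
        System.setProperty("input", "hdfs:///mydata");
        System.out.println(snapshot().get("input"));
    }
}
```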
Using the parameters in your Flink program
Now that we’ve got the parameters from somewhere (see above) we can use them in various ways.
Directly from the ParameterTool
The ParameterTool itself has methods for accessing the values.
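The typical access patterns are a required lookup, a lookup with a default, and a typed lookup. The JDK-only sketch below illustrates these three patterns over a plain map. ParameterLookup is a hypothetical stand-in written for this example; the real ParameterTool may differ in details such as the exception types it throws.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only (not part of Flink).
final class ParameterLookup {
    private final Map<String, String> values;

    ParameterLookup(Map<String, String> values) {
        this.values = new HashMap<>(values);
    }

    /** Returns the value for key, failing loudly if it is absent. */
    String getRequired(String key) {
        String value = values.get(key);
        if (value == null) {
            throw new IllegalArgumentException("No value for required key '" + key + "'");
        }
        return value;
    }

    /** Returns the value for key, or defaultValue if it is absent. */
    String get(String key, String defaultValue) {
        return values.getOrDefault(key, defaultValue);
    }

    /** Returns the value for key parsed as a long, or defaultValue if it is absent. */
    long getLong(String key, long defaultValue) {
        String value = values.get(key);
        return value == null ? defaultValue : Long.parseLong(value);
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("input", "hdfs:///mydata");
        raw.put("elements", "42");
        ParameterLookup parameters = new ParameterLookup(raw);
        System.out.println(parameters.getRequired("input"));
        System.out.println(parameters.get("output", "myDefaultValue"));
        System.out.println(parameters.getLong("elements", -1L));
    }
}
```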
You can use the return values of these methods directly in the main() method (that is, in the client submitting the application).
For example, you could read an integer parameter with getInt and pass it to an operator’s setParallelism method.
Since the ParameterTool is serializable, you can pass it to the functions themselves, for example as a constructor argument, and then use it inside the function to get values from the command line.
Passing it as a Configuration object to single functions
The example below shows how to pass the parameters as a Configuration object to a user defined function.
In the Tokenizer, the object is now accessible in the open(Configuration conf) method:
Register the parameters globally
Parameters registered as a global job parameter at the ExecutionConfig allow you to access the configuration values from the JobManager web interface and all functions defined by the user.
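Register the parameters globally by calling env.getConfig().setGlobalJobParameters(parameters), then access them in any rich user function through getRuntimeContext().getExecutionConfig().getGlobalJobParameters().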
Naming large TupleX types
It is recommended to use POJOs (Plain old Java objects) instead of TupleX for data types with many fields.
Also, POJOs can be used to give large Tuple-types a name.
Instead of using:
It is much easier to create a custom type extending from the large Tuple type.
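As a sketch of the POJO recommendation above (not of the Tuple-extending variant), the following shows a type with named fields. The type and field names are hypothetical. The layout (public class, public no-argument constructor, public fields) follows Flink's POJO rules as commonly documented; please verify them against the documentation for your Flink version.

```java
// Holder class used only to keep this example in a single file;
// in a real job the POJO is often nested in the (public) job class.
final class PojoSketch {

    /** Hypothetical record type: named fields instead of positional f0, f1, ... */
    public static class CustomerOrder {
        public String customerName;
        public String city;
        public long orderId;
        public double amount;

        // Public no-argument constructor, needed so the type can be instantiated reflectively.
        public CustomerOrder() {}

        public CustomerOrder(String customerName, String city, long orderId, double amount) {
            this.customerName = customerName;
            this.city = city;
            this.orderId = orderId;
            this.amount = amount;
        }

        @Override
        public String toString() {
            return customerName + " (" + city + "): order " + orderId + ", amount " + amount;
        }
    }

    public static void main(String[] args) {
        CustomerOrder order = new CustomerOrder("Alice", "Berlin", 1001L, 25.0);
        // Fields are read by name, which is easier to follow than order.f3
        System.out.println(order.customerName + " spent " + order.amount);
    }
}
```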
Register a custom serializer for your Flink program
If you use a custom type in your Flink program which cannot be serialized by the
Flink type serializer, Flink falls back to using the generic Kryo
serializer. You may register your own serializer or a serialization system like
Google Protobuf or Apache Thrift with Kryo. To do that, simply register the type
class and the serializer in the ExecutionConfig of your Flink program.
Note that your custom serializer has to extend Kryo’s Serializer class. In the
case of Google Protobuf or Apache Thrift, this has already been done for us.
For the above example to work, you need to include the necessary dependencies in
your Maven project file (pom.xml). In the dependency section, add the following
for Apache Thrift:
For Google Protobuf you need the following Maven dependency:
Please adjust the versions of both libraries as needed.
Using Logback instead of Log4j
Note: This tutorial is applicable starting from Flink 0.10
Apache Flink uses slf4j as the logging abstraction in its code. Users are advised to use slf4j in their user functions as well.
Slf4j is a compile-time logging interface that can use different logging implementations at runtime, such as log4j or Logback.
Flink depends on Log4j by default. This page describes how to use Flink with Logback. Users have reported that they were also able to set up centralized logging with Graylog using this tutorial.
To get a logger instance in the code, use the following code:
Use Logback when running Flink out of the IDE / from a Java application
In all cases where classes are executed with a classpath created by a dependency manager such as Maven, Flink will pull log4j into the classpath.
Therefore, you will need to exclude log4j from Flink’s dependencies. The following description will assume a Maven project created from a Flink quickstart.
Change your project’s pom.xml file like this:
The following changes were made in the <dependencies> section:
Exclude all log4j dependencies from all Flink dependencies: This causes Maven to ignore Flink’s transitive dependencies to log4j.
Exclude the slf4j-log4j12 artifact from Flink’s dependencies: Since we are going to use the slf4j to logback binding, we have to remove the slf4j to log4j binding.
Add the Logback dependencies: logback-core and logback-classic
Add dependencies for log4j-over-slf4j. log4j-over-slf4j is a tool which allows legacy applications which are directly using the Log4j APIs to use the Slf4j interface. Flink depends on Hadoop which is directly using Log4j for logging. Therefore, we need to redirect all logger calls from Log4j to Slf4j which is in turn logging to Logback.
Please note that you need to manually add the exclusions to all new Flink dependencies you are adding to the pom file.
You may also need to check if other dependencies (non Flink) are pulling in log4j bindings. You can analyze the dependencies of your project with mvn dependency:tree.
Use Logback when running Flink on a cluster
This tutorial is applicable when running Flink on YARN or as a standalone cluster.
In order to use Logback instead of Log4j with Flink, you need to remove the log4j-1.2.xx.jar and slf4j-log4j12-xxx.jar files from the lib/ directory.
Next, you need to put the following jar files into the lib/ folder:
log4j-over-slf4j.jar: This bridge needs to be present in the classpath for redirecting logging calls from Hadoop (which is using Log4j) to Slf4j.
Note that you need to explicitly set the lib/ directory when using a per job YARN cluster.
The command to submit Flink on YARN with a custom logger is: ./bin/flink run -yt $FLINK_HOME/lib <... remaining arguments ...>