Important
These are just some commonly-used options; any option that Beam and Dataflow support is also supported here. More information can be found in the official Beam docs.
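These options live under the pipeline_options block of a job's configuration file (typically klio-job.yaml). Below is a minimal sketch for a Dataflow job; the project, region, and bucket values are placeholders:

    pipeline_options:
      project: my-gcp-project
      region: europe-west1
      runner: DataflowRunner
      streaming: True
      staging_location: gs://my-gcp-project-bucket/staging
      temp_location: gs://my-gcp-project-bucket/temp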
pipeline_options.autoscaling_algorithm
STR
The algorithm to use when autoscaling your job. See Dataflow’s autoscaling documentation for more information.
Options: NONE, THROUGHPUT_BASED
pipeline_options.disk_size_gb
INT
Configure the amount of disk storage, in gigabytes, available to each worker running on Dataflow.
Default: 32
pipeline_options.experiments
LIST(STR)
Flags that enable Beam to run experimental features.
Set this value to an empty list if the default experiments are not needed.
Default: enable_stackdriver_agent_metrics, beam_fn_api
Note
Experiments are specific to particular runners. Valid experiment values for the Dataflow runner can be found in its documentation, though valid values are not limited to those listed there.
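A sketch that overrides the default experiments to keep only beam_fn_api (any experiment valid for the configured runner could be listed instead):

    pipeline_options:
      # Replaces the default experiments list; set to [] to disable all of them.
      experiments:
        - beam_fn_api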
pipeline_options.max_num_workers
Configure the maximum number of workers that Dataflow will use to run your job at any given time.
Default: 2
Only relevant when pipeline_options.autoscaling_algorithm is not NONE.
pipeline_options.num_workers
The number of Dataflow workers your job will run on.
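A sketch of how these worker-count options relate; the worker counts are illustrative:

    pipeline_options:
      # Autoscale based on throughput, capped at 10 workers...
      autoscaling_algorithm: THROUGHPUT_BASED
      max_num_workers: 10
      # ...or disable autoscaling and pin the job to a fixed number of workers:
      # autoscaling_algorithm: NONE
      # num_workers: 4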
pipeline_options.project
GCP project within which this job will be run.
Runner: Dataflow (Required)
pipeline_options.region
GCP region where this job will be run on Dataflow (supported regions).
Default: europe-west1
pipeline_options.runner
Specify which runner should be used to execute the job.
Options: DirectRunner, DataflowRunner
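For example, a minimal sketch for executing a job locally rather than on Dataflow:

    pipeline_options:
      runner: DirectRunner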
pipeline_options.setup_file
Path to the setup.py file, relative to a Klio job’s run.py file.
This configuration attribute is mutually exclusive with the beam_fn_api experiment.
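A sketch, assuming the default experiments listed earlier; because setup_file is mutually exclusive with the beam_fn_api experiment, the experiments list is overridden so that beam_fn_api is dropped:

    pipeline_options:
      # Relative to the Klio job's run.py file.
      setup_file: setup.py
      # Override the default experiments so beam_fn_api is not included.
      experiments:
        - enable_stackdriver_agent_metrics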
pipeline_options.staging_location
A Cloud Storage path for Cloud Dataflow to stage code packages needed by workers executing the job. Must be a valid Cloud Storage URL, beginning with gs://.
If not set, defaults to a staging directory within temp_location. At least one of temp_location or staging_location must be specified.
Runner: Dataflow
The commands klio job create and klio job verify --create-resources will create this bucket for you.
pipeline_options.streaming
BOOL
If True, the pipeline reads from an unbounded source (e.g. Pub/Sub) and will always be “up”: a streaming job processes data from its source and continues running until it is shut down. If False, the job is designated as a batch job.
Options: True, False
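For example, a minimal sketch designating a job as a batch job:

    pipeline_options:
      streaming: False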
pipeline_options.temp_location
A Cloud Storage path for Cloud Dataflow to stage temporary job files created during the execution of the pipeline. Must be a valid Cloud Storage URL, beginning with gs://.
If not set, defaults to the value of staging_location. At least one of temp_location or staging_location must be specified.
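Since at least one of the two must be set, a minimal sketch that specifies only temp_location and lets staging_location take its default (the bucket name is a placeholder):

    pipeline_options:
      temp_location: gs://my-gcp-project-bucket/temp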
pipeline_options.worker_harness_container_image
Regardless of the configured runner, Klio will build an image and tag it with the value configured here.
For all runners, Klio uses this image as the “driver” of a pipeline (i.e. to start/launch the pipeline).
When running a pipeline with DirectRunner, the entire execution model (the “runner” and the “worker”) runs within this image as well. Essentially the driver, runner, and worker all run on one container.
With Dataflow, the workers (synonymous with “instance” or “host”) will download the image from Google Container Registry (GCR) to then use as the runtime environment for the pipeline’s transforms. Each worker will then run as many containers as configured (the default is equal to the number of the worker’s CPUs).
When using Dataflow, the name of the image must be a URI for Google Container Registry.
For example: gcr.io/my-project/my-klio-job-image.
Image version tags may be included here (e.g. my-klio-job-image:v1), but by default Klio takes care of image tags via the klio-cli when uploading and deploying.
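A minimal sketch using the GCR URI form described above; the project and image names are placeholders:

    pipeline_options:
      # No version tag: the klio-cli manages tags when uploading and deploying.
      worker_harness_container_image: gcr.io/my-project/my-klio-job-image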
pipeline_options.worker_disk_type
Configure the disk type for a worker running on Dataflow.
Options: pd-standard, pd-ssd, local-ssd
pipeline_options.worker_machine_type
Configure the machine type for a worker running on Dataflow. See available machine types.
Default: n1-standard-2
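A sketch combining the worker resource options; the values are illustrative, and pipeline_options.disk_size_gb from earlier in this section is included alongside them:

    pipeline_options:
      worker_machine_type: n1-standard-4
      worker_disk_type: pd-ssd
      disk_size_gb: 64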