Converting a Streaming Pipeline to Batch

Batch Klio pipelines are useful for backfills or for work that can run on a cadence. They can also simplify local testing, since no Pub/Sub resources need to be set up to kick off a job.

Config Changes

Two settings in klio-job.yaml must change to convert a streaming job to a batch job: the streaming field, and the event input and output configurations under job_config.events.


Set streaming to False:

    name: my-stream-job-that-i-want-to-be-batch
    pipeline_options:
      streaming: False
      <-- snip -->

Event I/O

Currently, Google Cloud Pub/Sub is the only supported event input and output for streaming jobs. Batch mode, however, supports several event configurations, the simplest of which is a text file stored locally or in GCS. Similarly, event outputs can be written to a GCS file in batch mode by setting job_config.events.outputs.
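
As a rough sketch of preparing such a text file, the Python snippet below writes one element ID per line to a local file. The one-ID-per-line layout, the element IDs, and the write_event_input helper are assumptions made for illustration, not part of Klio; check the event input documentation for the exact format your Klio version expects.

```python
import tempfile
from pathlib import Path


def write_event_input(element_ids, path):
    """Write one element ID per line (assumed layout) and return the file path."""
    out = Path(path)
    out.write_text("".join(f"{element_id}\n" for element_id in element_ids))
    return out


# Hypothetical element IDs and file name, for illustration only.
target = Path(tempfile.mkdtemp()) / "my-input-elements.txt"
input_file = write_event_input(["track-001", "track-002", "track-003"], target)
print(input_file.read_text())
```

The resulting file can be used as a local event input as is, or copied to a GCS bucket first.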

An example of the changes needed to read from and write to a GCS file is shown below:

    name: my-stream-job-that-i-want-to-be-batch
    pipeline_options:
      streaming: False
      <-- snip -->
    job_config:
      events:
        inputs:
          - type: gcs
            location: gs://my-event-input/my-input-elements.txt
        outputs:
          - type: gcs
            location: gs://my-event-output/
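
The same change can be sketched programmatically. The snippet below operates on a plain Python dict standing in for a parsed klio-job.yaml. It assumes that streaming sits under pipeline_options and that event I/O sits under job_config.events; the to_batch_config helper and the Pub/Sub topic and subscription values are hypothetical, and nothing here is a Klio API.

```python
import copy


def to_batch_config(config, input_location, output_location):
    """Return a copy of a parsed klio-job.yaml dict switched to batch mode."""
    batch = copy.deepcopy(config)  # leave the caller's config untouched
    # Assumed nesting: pipeline_options.streaming
    batch.setdefault("pipeline_options", {})["streaming"] = False
    # Assumed nesting: job_config.events.inputs / job_config.events.outputs
    events = batch.setdefault("job_config", {}).setdefault("events", {})
    # Replace the Pub/Sub event I/O with file locations.
    events["inputs"] = [{"type": "gcs", "location": input_location}]
    events["outputs"] = [{"type": "gcs", "location": output_location}]
    return batch


# Hypothetical streaming config; topic and subscription are placeholders.
streaming_config = {
    "name": "my-stream-job-that-i-want-to-be-batch",
    "pipeline_options": {"streaming": True},
    "job_config": {
        "events": {
            "inputs": [
                {"type": "pubsub", "topic": "my-topic", "subscription": "my-sub"}
            ],
            "outputs": [{"type": "pubsub", "topic": "my-output-topic"}],
        }
    },
}

batch_config = to_batch_config(
    streaming_config,
    "gs://my-event-input/my-input-elements.txt",
    "gs://my-event-output/",
)
print(batch_config["pipeline_options"]["streaming"])
print(batch_config["job_config"]["events"]["inputs"])
```

Working on a deep copy keeps the original streaming configuration intact, which is convenient when both variants of a job need to exist side by side.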


A batch job can also be converted into a streaming job in a similar manner. However, any missing resources, such as Pub/Sub topics and subscriptions, will need to be created with the command klio job verify --create-resources.