A typical streaming Klio job is a Python Dataflow job running Dockerized workers that:
reads messages from a Pub/Sub input subscription (potentially subscribed to the output topic of another job) where each message has a reference to binary data;
downloads that binary data from a GCS bucket;
processes the downloaded data by applying transforms (which may include running Python libraries with C extensions, Fortran code, etc.);
uploads the derived data to another GCS bucket;
writes a message to a single Pub/Sub output topic so downstream Klio jobs may process the derived data.
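The per-message flow above can be sketched in plain Python. This is an illustrative stand-in only, not Klio's actual API: a real job expresses these steps as Apache Beam transforms and uses the Google Cloud client libraries, whereas here `download`, `transform`, `upload`, and `publish` are hypothetical callables injected for clarity.

```python
import json

def process_message(message: bytes, download, transform, upload, publish):
    """Walk one Pub/Sub message through the typical streaming steps.

    `download`, `transform`, `upload`, and `publish` are stand-ins for the
    GCS and Pub/Sub operations a real job performs via client libraries.
    """
    payload = json.loads(message)        # the message carries a reference, not the data itself
    raw = download(payload["element"])   # fetch the referenced bytes from the input bucket
    derived = transform(raw)             # apply the job's transforms
    upload(payload["element"], derived)  # write derived bytes to the output bucket
    # signal downstream jobs that derived data is ready
    publish(json.dumps({"element": payload["element"]}).encode())
    return derived

# In-memory stand-ins for the GCS buckets and the output topic:
input_bucket = {"track-123": b"raw-audio-bytes"}
output_bucket = {}
output_topic = []

derived = process_message(
    json.dumps({"element": "track-123"}).encode(),
    download=input_bucket.__getitem__,
    transform=lambda raw: raw.upper(),   # trivial placeholder transform
    upload=output_bucket.__setitem__,
    publish=output_topic.append,
)
```

Note that only a small reference (here, `"track-123"`) travels through Pub/Sub; the binary payload itself stays in GCS.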
Here’s an overview diagram of how this works:
The architecture overview above mentions a few Google Cloud resources that a typical streaming Klio job needs. While Dataflow handles the execution of a job, Klio uses Pub/Sub topics and GCS buckets to string job dependencies together into a DAG, allowing for both top-down and bottom-up execution.
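To make the DAG idea concrete, here is a minimal sketch of how chaining one job's output to the next job's input supports bottom-up execution: a downstream job that finds its input data missing first triggers its parent. All job names, the `jobs` structure, and `run_job` are hypothetical; this is a conceptual model, not Klio's implementation.

```python
def run_job(element, data_store, jobs, job_name):
    """Process `element` in `job_name`, recursing to the parent job first
    if the required input data does not yet exist (bottom-up execution)."""
    job = jobs[job_name]
    input_key = (job["input_bucket"], element)
    if input_key not in data_store and job.get("parent"):
        # Input missing: trigger the upstream job to produce it on demand.
        run_job(element, data_store, jobs, job["parent"])
    raw = data_store[input_key]
    data_store[(job["output_bucket"], element)] = job["transform"](raw)

# A two-job DAG: "classify" consumes the output of "featurize".
jobs = {
    "featurize": {"input_bucket": "audio", "output_bucket": "features",
                  "transform": lambda b: b + b"-features", "parent": None},
    "classify": {"input_bucket": "features", "output_bucket": "labels",
                 "transform": lambda b: b + b"-label", "parent": "featurize"},
}
store = {("audio", "track-1"): b"raw"}

# Bottom-up: trigger only the downstream job; the missing upstream output
# is produced first, then consumed.
run_job("track-1", store, jobs, "classify")
```

Top-down execution is the simpler direction: start at the top of the DAG and let each job's output messages flow to its children's input subscriptions.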
A typical batch Klio job is the batch analogue of a Klio streaming job.
It is a Python Dataflow job running Dockerized workers that:
reads an item of input from a batch source (such as a text file, a GCS file, or a BigQuery table), where each input is a reference to binary data;
downloads, processes, and uploads the referenced data just as a streaming job does; and writes a message to a sink (such as a text file, a GCS file, a BigQuery table, or even a Pub/Sub output topic).
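The batch flow can be sketched the same way: read element references from a file-based source, run the download/transform/upload steps per element, and write a record per element to a sink. As before, the callables and the file layout are hypothetical stand-ins, not Klio's actual batch I/O.

```python
import json
import tempfile

def run_batch(source_path, sink_path, download, transform, upload):
    """Read element references line-by-line from a text source, process
    each one, and write a result record per element to a text sink."""
    with open(source_path) as source, open(sink_path, "w") as sink:
        for line in source:
            element = line.strip()  # each input line references binary data
            if not element:
                continue
            derived = transform(download(element))
            upload(element, derived)
            sink.write(json.dumps({"element": element}) + "\n")

# In-memory stand-ins for GCS buckets, plus temp files as source/sink:
bucket_in = {"a": b"x", "b": b"y"}
bucket_out = {}
src = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
src.write("a\nb\n")
src.close()
dst = src.name + ".out"
run_batch(src.name, dst, bucket_in.__getitem__, bytes.upper, bucket_out.__setitem__)
```

A bounded source like this gives the job a natural end, which is the key operational difference from the streaming case.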
Note that top-down and bottom-up execution were built to support streaming jobs only. For creating and coordinating a DAG of batch Klio jobs, we recommend using a separate orchestration framework, such as Luigi.
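What such an orchestrator provides can be reduced to one idea: run each batch job only after its prerequisites have finished. The sketch below illustrates that with a minimal dependency-ordered runner (no cycle detection, no retries); the job names and graph are made up, and a real deployment would use Luigi's `Task`/`requires` machinery instead.

```python
def run_dag(jobs, deps, run):
    """Run each job in `jobs` after all of its prerequisites in `deps`
    (a mapping of job -> list of prerequisite jobs) have run."""
    done, order = set(), []

    def visit(job):
        if job in done:
            return
        for dep in deps.get(job, []):
            visit(dep)  # prerequisites first
        run(job)
        done.add(job)
        order.append(job)

    for job in jobs:
        visit(job)
    return order

executed = []
order = run_dag(
    jobs=["train", "featurize", "ingest"],
    deps={"featurize": ["ingest"], "train": ["featurize"]},
    run=executed.append,
)
```

Regardless of the order jobs are listed in, each one runs exactly once and only after its dependencies.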