The Klio Pipeline

A Beam pipeline encapsulates the steps of a Klio job, from reading input data, through transforming it, to writing output data. Klio pipelines offer a Pythonic interface built on top of Beam pipelines, allowing large-scale data processing on Docker and Google Dataflow.

A typical Klio job is a Python Dataflow job that reads elements, in the form of Klio messages, from an input source. These elements are references to binary data, usually audio files. The corresponding audio file is downloaded, and the binary data can then be processed by applying transforms to it. The output of the job can be uploaded to another Google Cloud Storage (GCS) location. Klio aims to simplify the code involved in these data pipelines.
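To make this flow concrete, here is a minimal sketch of a single Klio transform, assuming the message element is the GCS object name of an audio file; the bucket names and the _analyze helper are hypothetical placeholders, not part of Klio's API.

```python
import apache_beam as beam
from google.cloud import storage
from klio.transforms import decorators


class ProcessAudio(beam.DoFn):
    """Download an audio file, process it, and upload the result.

    The bucket names and the _analyze helper are hypothetical; a real
    job would typically read such values from its configuration.
    """

    @decorators.handle_klio
    def process(self, data):
        client = storage.Client()
        # The Klio message element is a reference to the binary data,
        # assumed here to be the GCS object name of an audio file.
        filename = data.element.decode("utf-8")

        in_blob = client.bucket("my-input-bucket").blob(filename)
        audio_bytes = in_blob.download_as_bytes()

        result = self._analyze(audio_bytes)  # hypothetical processing step

        out_blob = client.bucket("my-output-bucket").blob(filename + ".out")
        out_blob.upload_from_string(result)

        yield data

    def _analyze(self, audio_bytes):
        # Placeholder for real audio processing logic.
        return audio_bytes[::-1]
```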

Klio pipelines are organized into three structures, sketched below:

- the Klio Message: a unit of work passed around as a trigger for the pipeline;
- the run function: the entry point for the pipeline, which encapsulates the various DAG-like steps of the pipeline;
- the transforms: the individual steps of the overall Klio pipeline.
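As a hedged sketch of how these three pieces typically fit together, the following combines a minimal transforms.py and run.py; MyTransform is a hypothetical stand-in for real job logic.

```python
# transforms.py
import apache_beam as beam
from klio.transforms import decorators


class MyTransform(beam.DoFn):
    @decorators.handle_klio
    def process(self, data):
        # `data` wraps the Klio message; `data.element` is the unit of
        # work (e.g. a reference to an audio file).
        self._klio.logger.info("Processing %s", data.element)
        yield data


# run.py
def run(input_pcol, config):
    # Entry point: wire together the DAG-like steps of the pipeline.
    return input_pcol | beam.ParDo(MyTransform())
```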

Top-Down and Bottom-Up Execution

Streaming Klio jobs are structured as directed acyclic graphs (DAGs) in which parent jobs can trigger dependent child jobs. Klio supports two modes of execution: top-down and bottom-up. In top-down execution, every step of the DAG runs for every received Klio message. In bottom-up execution, a message is sent to a single job for a file, and any missing upstream dependencies are recursively created first, as illustrated in the sketch below.
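The sketch below illustrates the bottom-up recursion conceptually; it is not Klio's internal implementation, and the Job type and its fields are hypothetical. Klio performs the equivalent existence checks and upstream triggering for you.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    # Hypothetical stand-in for one job (node) in the Klio DAG.
    name: str
    parent: Optional["Job"]
    input_exists: Callable[[bytes], bool]
    process: Callable[[bytes], None]


def run_bottom_up(element: bytes, job: Job) -> None:
    """Conceptual illustration of bottom-up execution (not Klio internals)."""
    if not job.input_exists(element):
        if job.parent is None:
            raise RuntimeError("no upstream job can create the missing input")
        # Missing input: recursively have the upstream job create it first.
        run_bottom_up(element, job.parent)
    # The input now exists; run only this job's work for the element.
    job.process(element)
```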

Further Documentation