setup.py
Using a setup.py file to specify the dependencies to install on workers was the original way for a Python Beam job developer to customize the environment in which their job ran. You can read more about it in the Apache Beam documentation on managing Python pipeline dependencies.
To configure a Klio job to run using setup.py packaging, the user needs to provide the required files (see below) and configure the job to use them by filling in the pipeline_options.setup_file field in their klio-job.yaml file:
job_name: my-job
pipeline_options:
  setup_file: setup.py
  # <-- snip -->
Note that this method of setup is mutually exclusive with using the FnAPI for packaging.
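The configuration rule above can be checked before launching a job. The following is a minimal sketch, not part of Klio: check_packaging_config is a hypothetical helper that takes the already-parsed pipeline_options mapping from klio-job.yaml, and it assumes that FnAPI packaging is enabled through a beam_fn_api entry under pipeline_options.experiments.

```python
def check_packaging_config(pipeline_options):
    """Return a list of problems found in a job's pipeline_options mapping."""
    problems = []
    setup_file = pipeline_options.get("setup_file")
    experiments = pipeline_options.get("experiments") or []

    if not setup_file:
        problems.append("pipeline_options.setup_file is not set")

    # Assumption: FnAPI packaging is switched on through the
    # "beam_fn_api" experiment, so it must not appear with setup_file.
    if setup_file and "beam_fn_api" in experiments:
        problems.append("setup_file and the beam_fn_api experiment are both set")

    return problems


# The parsed equivalent of the klio-job.yaml snippet above.
config = {"job_name": "my-job", "pipeline_options": {"setup_file": "setup.py"}}
print(check_packaging_config(config["pipeline_options"]))
```

An empty list means the mapping is consistent with setup.py packaging as described here.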
A setup.py file is needed in the root of your job’s directory. It partly replaces the need for a worker image by installing any non-Python dependencies via a child process, and by explicitly including non-Python files needed for a job (e.g. a model, a JSON schema, etc).
Tip
The setup.py must declare the system-level dependencies, Python dependencies, and non-Python files (e.g. ML models, JSON schemas, etc) that your job requires to run.
Minimal Example setup.py
The following is an example with a third-party Python dependency available on PyPI (but no non-public Python package dependencies), a non-Python file (an ML model, my-model.h5), and no OS-level dependencies.
import setuptools

setuptools.setup(
    name="my-example-job",  # required
    version="0.0.1",  # required
    author="klio-devs",  # optional
    author_email="hello@example.com",  # optional
    description="My example job using setup.py",  # optional
    install_requires=["tensorflow"],  # optional
    data_files=[  # required
        (".", ["klio-job-run-effective.yaml", "my-model.h5"]),
    ],
    include_package_data=True,  # required
    py_modules=["run", "transforms"],  # required
)
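Because data_files and py_modules name files that must exist in the job directory, a quick pre-flight check can catch a typo or a forgotten model file before anything is packaged. This is a rough sketch only; missing_job_files is a hypothetical helper, not something Klio or setuptools provides, and it assumes every py_modules entry is a top-level module stored as <name>.py.

```python
from pathlib import Path


def missing_job_files(job_dir, data_files, py_modules):
    """Return declared files that do not exist under job_dir."""
    job_dir = Path(job_dir)
    # Assumption: each py_modules entry maps to "<name>.py" in job_dir.
    declared = [f"{module}.py" for module in py_modules]
    # data_files is a list of (install_dir, [filenames]) tuples.
    for _install_dir, filenames in data_files:
        declared.extend(filenames)
    return [name for name in declared if not (job_dir / name).is_file()]


# Same values as the setup() call above, checked against the current directory.
missing = missing_job_files(
    ".",
    data_files=[(".", ["klio-job-run-effective.yaml", "my-model.h5"])],
    py_modules=["run", "transforms"],
)
print("Missing files:", missing or "none")
```

If any listed file is generated at launch time rather than checked in, expect it to be reported until it exists.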
Example setup.py with internal & OS-level dependencies
The following is a minimal example that shows a method for including internal & OS-level dependencies, in addition to what the minimal example above covers.
import subprocess

import setuptools
from distutils.command.build import build as _build


# NOTE: This class is required when using custom commands (i.e.
# installing OS-level deps and/or deps from internal PyPI)
class build(_build):
    """A build command class that will be invoked during package install.

    The package built using the current setup.py will be staged and
    later installed in the worker using `pip install package'. This
    class will be instantiated during install for this specific scenario
    and will trigger running the custom commands specified.
    """

    sub_commands = _build.sub_commands + [('CustomCommands', None)]


# `APT_COMMANDS` and `REQUIREMENTS_COMMANDS` are custom commands that
# will run during setup that are required for this Klio job. Each
# command will spawn a child process.
APT_COMMANDS = [
    ["apt-get", "update"],
    # Debian-packaged dependencies (or otherwise OS-level requirements)
    # `--assume-yes` to avoid interactive confirmation
    ["apt-get", "install", "--assume-yes", "libsndfile1"],
]
REQUIREMENTS_COMMANDS = [
    [
        "pip3",
        "install",
        "--default-timeout=120",
        # --index-url will not work on Dataflow, but `--extra-index-url` will
        "--extra-index-url",
        # point to your internal PyPI
        "pypi.internal.net",
        "--requirement",
        "job-requirements.txt",  # Must also be included in MANIFEST.in
    ]
]


# NOTE: This class is required when using custom commands (i.e.
# installing OS-level deps and/or deps from internal PyPI)
class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        # Method must be defined when implementing custom Commands
        pass

    def finalize_options(self):
        # Method must be defined when implementing custom Commands
        pass

    def RunCustomCommand(self, command_list):
        # NOTE: Output from the custom commands are missing from the
        # logs. The output of custom commands (including failures) will
        # be logged in the worker-startup log. (BEAM-3237)
        print("Running command: %s" % " ".join(command_list))
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        stdout_data, _ = p.communicate()
        print("Command output: %s" % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                "Command {} failed: exit code: {}".format(
                    command_list, p.returncode
                )
            )

    def run(self):
        for command in APT_COMMANDS:
            self.RunCustomCommand(command)
        for command in REQUIREMENTS_COMMANDS:
            self.RunCustomCommand(command)


# NOTE: `version` does not particularly mean anything here since we're
# not publishing to PyPI; it's just a required field
setuptools.setup(
    name="my-example-job",  # required
    version="0.0.1",  # required
    author="klio-devs",  # optional
    author_email="hello@example.com",  # optional
    description="My example job using setup.py",  # optional
    install_requires=[],  # optional if using the above REQUIREMENTS_COMMANDS
    data_files=[  # required
        # tuple(
        #   str(dir where to install files, relative to Python modules),
        #   list(str(non-Python filenames))
        # )
        (".", ["klio-job-run-effective.yaml", "my-model.h5"]),
    ],
    include_package_data=True,  # required
    py_modules=["run", "transforms"],  # required
    # NOTE: required when using custom commands (i.e. installing OS-level
    # deps and/or deps from internal PyPI)
    cmdclass={  # optional
        # Command class instantiated and run during pip install scenarios.
        "build": build,
        "CustomCommands": CustomCommands,
    },
)
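The child-process step in RunCustomCommand can be tried locally, outside of setuptools, to see how a failing command surfaces. The following is a sketch of the same idea using subprocess.run instead of Popen; run_custom_command is a hypothetical standalone function, and it differs from the example above in that it closes stdin (subprocess.DEVNULL) and decodes output as text.

```python
import subprocess
import sys


def run_custom_command(command_list):
    """Run one command in a child process and return its combined output."""
    print("Running command: %s" % " ".join(command_list))
    result = subprocess.run(
        command_list,
        stdin=subprocess.DEVNULL,   # the command must not wait for input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into stdout, as above
        text=True,
    )
    print("Command output: %s" % result.stdout)
    if result.returncode != 0:
        raise RuntimeError(
            "Command {} failed: exit code: {}".format(
                command_list, result.returncode
            )
        )
    return result.stdout


# A harmless stand-in for an apt-get or pip3 command.
run_custom_command([sys.executable, "-c", "print('hello from a child process')"])
```

Raising on a non-zero exit code is what makes a failed apt-get or pip3 step abort the package install rather than pass silently.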
MANIFEST.in
A file called MANIFEST.in is needed in the root of your job’s directory with the line include job-requirements.txt:
# cat MANIFEST.in
include job-requirements.txt
Why is this needed?
The MANIFEST.in file must include any file required to install your job as a Python package (but not needed to run your job; those files are declared under data_files in setup.py, as described above).
When Klio launches the job for Dataflow, Dataflow will locally create a source distribution of your job by running python setup.py sdist. When running this, Python will tar together the files declared in setup.py as well as any non-Python files defined in MANIFEST.in into a file called workflow.tar.gz (as named by Dataflow to then be uploaded).
Then, on the worker, Dataflow will run pip install workflow.tar.gz. pip will actually build a wheel, installing packages defined in job-requirements.txt (and running any other custom commands defined in setup.py). After the installation of the package via pip install workflow.tar.gz, job-requirements.txt will effectively be gone and inaccessible to the job’s code. Building a wheel ignores MANIFEST.in, but includes all the files declared in setup.py, the ones actually needed for running the Klio job.
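Since job-requirements.txt only reaches the worker if it is inside the source distribution, it can be worth inspecting a locally built sdist before launching. The sketch below is one way to do that with the standard library; sdist_member_names and missing_from_sdist are hypothetical helpers, and the code assumes the usual sdist layout in which every file sits under a single top-level directory.

```python
import tarfile


def sdist_member_names(sdist_path):
    """Return file paths inside an sdist, relative to its top-level directory."""
    with tarfile.open(sdist_path, "r:gz") as archive:
        names = archive.getnames()
    # Assumption: members look like "<name>-<version>/<relative path>".
    return {name.split("/", 1)[1] for name in names if "/" in name}


def missing_from_sdist(sdist_path, required):
    """Return the required relative paths that the sdist does not contain."""
    present = sdist_member_names(sdist_path)
    return sorted(set(required) - present)
```

After running python setup.py sdist locally, calling missing_from_sdist with the path of the generated archive and ["job-requirements.txt"] should return an empty list when MANIFEST.in is set up as described.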
The pipeline_options.requirements_file configuration for pipeline dependencies will not work for Klio jobs. While Klio will pass that configuration value through for Dataflow to pick up, requirements must be declared in setup.py because a Klio job inherently has multiple Python files.
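If a job already maintains a requirements file, one possible way to keep a single list is to read it inside setup.py and pass the result to install_requires. This is only a sketch under stated assumptions: parse_requirements is a hypothetical helper, it handles plain requirement specifiers only (option lines such as --extra-index-url are skipped, not honored), and the file would still need to be listed in MANIFEST.in so that it exists when the package is installed.

```python
def parse_requirements(text):
    """Extract plain requirement specifiers from requirements-file text."""
    requirements = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, full-line comments, and pip option lines.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        # Drop a trailing inline comment, if any.
        line = line.split(" #", 1)[0].strip()
        requirements.append(line)
    return requirements


example = """\
# Python dependencies for the job
tensorflow
--extra-index-url pypi.internal.net
numpy>=1.20  # inline comment
"""
print(parse_requirements(example))
```

In setup.py, the returned list could then be passed as install_requires after reading the file's contents.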
While Klio will still upload the worker image to Google Container Registry when running/deploying a job, Dataflow will not use the image. It is good practice to upload the worker image to ensure repeatable builds, but in the future, an option will be added to skip the upload.