Shell-tasks

Command-line templates

Shell task specs can be defined using string templates that resemble the command-line usage examples typically found in inline help. They can therefore be a quick and intuitive way to specify a shell task. For example, a simple spec for the copy command cp that omits optional flags:

[1]:
from pydra.compose import shell

Cp = shell.define("cp <in_file> <out|destination>")

Input and output fields are both specified by placing the name of the field within enclosing < and >. Outputs are differentiated by the out| prefix.
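
To make the field syntax concrete, here is a minimal sketch of how such a template could be split into input and output field names. It uses only the standard library. The helper name split_template_fields is hypothetical, and this is not Pydra's own parser, which also handles types, flags and defaults.

```python
import re

# Matches "<name...>" or "<out|name...>"; anything after the name
# (types, defaults, modifiers) is ignored in this sketch.
FIELD_RE = re.compile(r"<(out\|)?([A-Za-z_]\w*)[^>]*>")


def split_template_fields(template: str) -> tuple[list[str], list[str]]:
    """Return (input_names, output_names) found in a command-line template.

    Hypothetical helper for illustration only, not part of Pydra.
    """
    inputs, outputs = [], []
    for out_prefix, name in FIELD_RE.findall(template):
        # A non-empty "out|" prefix marks the field as an output
        (outputs if out_prefix else inputs).append(name)
    return inputs, outputs


inputs, outputs = split_template_fields("cp <in_file> <out|destination>")
print(f"inputs={inputs}, outputs={outputs}")
```

Everything outside the angle brackets (here, the executable cp) is left alone, which is what lets the template read like an ordinary usage line.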

This shell task can then be run just as a Python task would be: first parameterise it, then execute it.

[2]:
from pathlib import Path
from tempfile import mkdtemp

# Make a test file to copy
test_dir = Path(mkdtemp())
test_file = test_dir / "in.txt"
with open(test_file, "w") as f:
    f.write("Contents to be copied")

# Parameterise the task
cp = Cp(in_file=test_file, destination=test_dir / "out.txt")

# Print the cmdline to be run to double check
print(f"Command-line to be run: {cp.cmdline}")

# Run the shell-command task
outputs = cp()

print(
    f"Contents of copied file ('{outputs.destination}'): "
    f"'{Path(outputs.destination).read_text()}'"
)
Command-line to be run: cp /tmp/tmpb4iwy0x2/in.txt /tmp/tmpb4iwy0x2/out.txt
Contents of copied file ('/tmp/tmpb4iwy0x2/out.txt'): 'Contents to be copied'

If paths to output files are not provided in the parameterisation, they default to the name of the field:

[3]:
cp = Cp(in_file=test_file)
print(cp.cmdline)
cp /tmp/tmpb4iwy0x2/in.txt /home/runner/work/pydra/pydra/docs/source/tutorial/destination

Defining input/output types

By default, shell-command fields are considered to be of fileformats.generic.FsObject type. However, more specific file formats or built-in Python types can be specified by appending the type to the field name after a :.

File formats are specified by their MIME type or "MIME-like" strings (see the FileFormats docs for details).

[4]:
from fileformats.image import Png

TrimPng = shell.define("trim-png <in_image:image/png> <out|out_image:image/png>")

trim_png = TrimPng(in_image=Png.mock(), out_image="/path/to/output.png")

print(trim_png.cmdline)
trim-png /mock/png.png /path/to/output.png

Flags and options

Command-line flags can also be added to the shell template, in either the single- or double-hyphen form. The field template immediately following the flag will be associated with that flag. If there is no space between the flag and the field template, the field is assumed to be a boolean; otherwise it is assumed to be of type string unless otherwise specified.

If a field is optional, the field template should end with a ?. Tuple fields are specified by comma-separated types. The ellipsis (...) signifies a tuple type with a variable number of items. Arguments and options that can be repeated are specified by appending a + (at least one must be provided) or * (defaults to an empty list). Note that for options, this means the flag itself is printed multiple times, e.g. my-command --multi-opt 1 2 --multi-opt 1 5.

[5]:
from pydra.utils import print_help

Cp = shell.define(
    "cp <in_fs_objects:fs-object+> <out|out_dir:directory> "
    "-R<recursive> "
    "--text-arg <text_arg?> "
    "--int-arg <int_arg:int?> "
    "--tuple-arg <tuple_arg:int,str*> "
)

print_help(Cp)
------------------------
Help for Shell task 'cp'
------------------------

Inputs:
- executable: str | Sequence[str]; default = 'cp'
    the first part of the command, can be a string, e.g. 'ls', or a list, e.g.
    ['ls', '-l', 'dirname']
- in_fs_objects: MultiInputObj[generic/fs-object]
- out_dir: Path | bool; default = True
    The path specified for the output file, if True, the default 'path
    template' will be used.
- recursive: bool; default = False ('-R')
- text_arg: str | None; default = None ('--text-arg')
- int_arg: int | None; default = None ('--int-arg')
- tuple_arg: MultiInputObj[tuple[int, str]]; default-factory = list() ('--tuple-arg')
- append_args: list[str | generic/file]; default-factory = list()
    Additional free-form arguments to append to the end of the command.

Outputs:
- out_dir: generic/directory
- return_code: int
    The process' exit code.
- stdout: str
    The standard output stream produced by the command.
- stderr: str
    The standard error stream produced by the command.
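
The repeated-option behaviour described above (the flag is printed once per item) can be pictured with a small standard-library sketch. The function render_repeated_option is a hypothetical name used only for illustration. It is not how Pydra builds its command line internally, only a model of the resulting tokens.

```python
def render_repeated_option(flag: str, values: list[tuple]) -> list[str]:
    """Emit the flag once per item, followed by that item's elements.

    Hypothetical helper for illustration only, not part of Pydra.
    """
    tokens: list[str] = []
    for item in values:
        tokens.append(flag)
        tokens.extend(str(element) for element in item)
    return tokens


args = render_repeated_option("--multi-opt", [(1, 2), (1, 5)])
print(" ".join(["my-command", *args]))
```

An empty list produces no tokens at all, which mirrors the * form defaulting to an empty list.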

Defaults

Defaults can be specified by appending them to the field template after an =:

[6]:
from pydra.utils import task_fields

Cp = shell.define(
    "cp <in_fs_objects:fs-object+> <out|out_dir:directory> "
    "-R<recursive=True> "
    "--text-arg <text_arg='foo'> "
    "--int-arg <int_arg:int=99> "
    "--tuple-arg <tuple_arg:int,str=(1,'bar')> "
)

print(f"'--int-arg' default: {task_fields(Cp).int_arg.default}")
'--int-arg' default: 99
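
The defaults in the template above are written as Python literals. As a rough illustration of how such a suffix could be separated from the rest of a field template and turned into a value, the sketch below uses ast.literal_eval from the standard library. The names split_default and NO_DEFAULT are hypothetical, and the sketch assumes the default is the text after the first =. Pydra's actual parsing may differ.

```python
import ast

NO_DEFAULT = object()  # hypothetical sentinel meaning "no '=' part present"


def split_default(field_spec: str):
    """Split 'name[:type]=default' into (spec_without_default, default_value).

    Hypothetical helper for illustration only, not part of Pydra.
    """
    spec, sep, raw_default = field_spec.partition("=")
    if not sep:
        return spec, NO_DEFAULT
    # literal_eval only accepts Python literals (numbers, strings, tuples, ...)
    return spec, ast.literal_eval(raw_default)


for field_spec in [
    "recursive=True",
    "text_arg='foo'",
    "int_arg:int=99",
    "tuple_arg:int,str=(1,'bar')",
]:
    spec, default = split_default(field_spec)
    print(f"{spec!r} -> {default!r}")
```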

Path templates for output files

By default, when an output file argument is defined, a path_template attribute will be assigned to the field based on its name and extension (if applicable). For example, the out_file output field in the following Gzip command will be assigned a path_template of out_file.gz:

[7]:
from pydra.compose import shell
from fileformats.generic import File

Gzip = shell.define("gzip <out|out_file:application/gzip> <in_files+>")
gzip = Gzip(in_files=File.mock("/a/file.txt"))
print(gzip.cmdline)
gzip /home/runner/work/pydra/pydra/docs/source/tutorial/out_file.gz /a/file.txt

However, the path template can be set explicitly with the $ operator, e.g.

[8]:
Gzip = shell.define("gzip <out|out_file:application/gzip$zipped.gz> <in_files+>")
gzip = Gzip(in_files=File.mock("/a/file.txt"))
print(gzip.cmdline)
gzip /home/runner/work/pydra/pydra/docs/source/tutorial/zipped.gz /a/file.txt

This gives the field a path_template of zipped.gz when it is written on the command line. Note that this value can always be overridden when the task is initialised, e.g.

[9]:
gzip = Gzip(in_files=File.mock("/a/file.txt"), out_file="/path/to/archive.gz")
print(gzip.cmdline)
gzip /path/to/archive.gz /a/file.txt
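
The three cases shown in this section (default name plus extension, an explicit path template, and a value passed at initialisation) suggest an order of precedence. The sketch below models it with the standard library only. The function resolve_output_path and its parameters are hypothetical, and resolving relative names against the current working directory is an assumption based on the printed command lines above, not a statement of Pydra's internals.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_output_path(
    field_name: str,
    extension: str = "",
    path_template: Optional[str] = None,
    explicit: Union[str, Path, None] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Pick an output path: explicit value, then template, then field name.

    Hypothetical helper for illustration only, not part of Pydra.
    """
    if explicit is not None:
        return Path(explicit)  # a value given at initialisation always wins
    base = cwd if cwd is not None else Path.cwd()
    if path_template is not None:
        return base / path_template
    return base / f"{field_name}{extension}"


work_dir = Path("/work")
print(resolve_output_path("out_file", ".gz", cwd=work_dir))
print(resolve_output_path("out_file", ".gz", path_template="zipped.gz", cwd=work_dir))
print(resolve_output_path("out_file", ".gz", explicit="/path/to/archive.gz", cwd=work_dir))
```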

Additional field attributes

Additional attributes of the fields in the template can be specified by providing shell.arg or shell.outarg fields to the inputs and outputs keyword arguments of shell.define:

[10]:
Cp = shell.define(
    (
        "cp <in_fs_objects:fs-object,...> <out|out_dir:directory> <out|out_file:file?> "
        "-R<recursive> "
        "--text-arg <text_arg> "
        "--int-arg <int_arg:int?> "
        "--tuple-arg <tuple_arg:int,str> "
    ),
    inputs={
        "recursive": shell.arg(
            help=(
                "If source_file designates a directory, cp copies the directory and "
                "the entire subtree connected at that point."
            )
        )
    },
    outputs={
        "out_dir": shell.outarg(position=-2),
        "out_file": shell.outarg(position=-1),
    },
)


print_help(Cp)
------------------------
Help for Shell task 'cp'
------------------------

Inputs:
- executable: str | Sequence[str]; default = 'cp'
    the first part of the command, can be a string, e.g. 'ls', or a list, e.g.
    ['ls', '-l', 'dirname']
- in_fs_objects: tuple[generic/fs-object, Ellipsis]
- recursive: bool; default = False ('-R')
    If source_file designates a directory, cp copies the directory and the
    entire subtree connected at that point.
- text_arg: str ('--text-arg')
- int_arg: int | None; default = None ('--int-arg')
- tuple_arg: tuple[int, str] ('--tuple-arg')
- out_file: Path | bool | None; default = None
    The path specified for the output file, if True, the default 'path
    template' will be used.If False or None, the output file will not be
    saved.
- out_dir: Path | bool; default = True
    The path specified for the output file, if True, the default 'path
    template' will be used.
- append_args: list[str | generic/file]; default-factory = list()
    Additional free-form arguments to append to the end of the command.

Outputs:
- out_dir: generic/directory
- out_file: generic/file | None; default = None
- return_code: int
    The process' exit code.
- stdout: str
    The standard output stream produced by the command.
- stderr: str
    The standard error stream produced by the command.

Callable outputs

In addition to outputs that are passed to the tool on the command line, outputs can be derived from the results of the tool by providing a Python function that takes the output directory and inputs as arguments and returns the output value. Callables can be specified either in the callable attribute of the shell.out field, or in a dictionary mapping the output name to the callable:

[11]:
import os
from pydra.compose import shell
from pathlib import Path
from fileformats.generic import File


# Arguments to the callable are matched by name; here the out_file path is requested
def get_file_size(out_file: Path) -> int:
    """Calculate the file size"""
    result = os.stat(out_file)
    return result.st_size


CpWithSize = shell.define(
    "cp <in_file:file> <out|out_file:file>",
    outputs={"out_file_size": get_file_size},
)

# Parameterise the task
cp_with_size = CpWithSize(in_file=File.sample())

# Run the command
outputs = cp_with_size()


print(f"Size of the output file is: {outputs.out_file_size}")
Size of the output file is: 256

The callable can take any combination of the following arguments, which will be passed to it when it is called:

  • field: the Field object to be provided a value, useful when writing generic callables

  • cache_dir: a Path object referencing the working directory the command was run within

  • inputs: a dictionary containing all the resolved inputs to the task

  • stdout: the standard output stream produced by the command

  • stderr: the standard error stream produced by the command

  • name of an input: the name of any of the input arguments to the task, including output args that are part of the command line (i.e. output files)
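
Passing a callable only the arguments it names can be done by inspecting its signature. The following is a minimal sketch of that general technique using the standard inspect module. The helper call_with_requested_args and the example callable count_stdout_lines are hypothetical, and nothing here is claimed to be Pydra's actual implementation.

```python
import inspect
from typing import Any, Callable


def call_with_requested_args(func: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call func with only those entries of `available` that it names.

    Hypothetical helper for illustration only, not part of Pydra.
    """
    requested = inspect.signature(func).parameters
    missing = [name for name in requested if name not in available]
    if missing:
        raise TypeError(f"No value available for argument(s): {missing}")
    return func(**{name: available[name] for name in requested})


def count_stdout_lines(stdout: str) -> int:
    """Example callable that only asks for the standard output stream."""
    return len(stdout.splitlines())


available = {
    "stdout": "line one\nline two\n",
    "stderr": "",
    "inputs": {"in_file": "/a/file.txt"},
}
print(call_with_requested_args(count_stdout_lines, available))
```

Because matching is by parameter name, a callable such as get_file_size(out_file) above only needs to declare the one value it uses.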

To make workflows that use the interface type-checkable, the canonical form of a shell task dataclass should inherit from shell.Task parameterized by its nested Outputs class, and the nested Outputs class should inherit from shell.Outputs. Arguments that are given None values are not included in the command line, so optional arguments should be typed using one of the equivalent forms ty.Union[T, None], ty.Optional[T] or T | None, and have a default of None.

[12]:
from pydra.utils.typing import MultiInputObj
from fileformats.generic import FsObject, Directory


@shell.define
class Cp(shell.Task["Cp.Outputs"]):

    executable = "cp"

    in_fs_objects: MultiInputObj[FsObject]
    recursive: bool = shell.arg(argstr="-R", default=False)
    text_arg: str = shell.arg(argstr="--text-arg")
    int_arg: int | None = shell.arg(argstr="--int-arg", default=None)
    tuple_arg: tuple[int, str] | None = shell.arg(argstr="--tuple-arg", default=None)

    class Outputs(shell.Outputs):
        out_dir: Directory = shell.outarg(path_template="{out_dir}")
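
In the class above, int_arg and tuple_arg default to None, so they are left off the command line unless a value is supplied. The standard-library sketch below models that rule for flag/value pairs. The function build_option_tokens is a hypothetical name, and the sketch ignores boolean flags and positional arguments, so it is an illustration of the rule rather than Pydra's command-line builder.

```python
from typing import Any, Optional


def build_option_tokens(options: dict[str, Optional[Any]]) -> list[str]:
    """Turn {flag: value} into command-line tokens, skipping None values.

    Hypothetical helper for illustration only, not part of Pydra.
    """
    tokens: list[str] = []
    for flag, value in options.items():
        if value is None:
            continue  # unset optional argument: omit the flag entirely
        tokens.append(flag)
        if isinstance(value, tuple):
            tokens.extend(str(element) for element in value)
        else:
            tokens.append(str(value))
    return tokens


print(build_option_tokens({"--text-arg": "foo", "--int-arg": None, "--tuple-arg": None}))
print(build_option_tokens({"--text-arg": "foo", "--int-arg": 99, "--tuple-arg": (1, "bar")}))
```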

Dynamic definitions

In some cases, the definition for a task needs to be generated dynamically. This can be done by providing just the executable to shell.define and specifying all inputs and outputs explicitly:

[13]:
from fileformats.generic import File
from pydra.utils import print_help

ACommand = shell.define(
    "a-command",
    inputs={
        "in_file": shell.arg(type=File, help="input file", argstr="", position=-2)
    },
    outputs={
        "out_file": shell.outarg(type=File, help="output file", argstr="", position=-1),
        "out_file_size": {
            "type": int,
            "help": "size of the output file",
            "callable": get_file_size,
        },
    },
)

print_help(ACommand)
-------------------------------
Help for Shell task 'a_command'
-------------------------------

Inputs:
- executable: str | Sequence[str]; default = 'a-command'
    the first part of the command, can be a string, e.g. 'ls', or a list, e.g.
    ['ls', '-l', 'dirname']
- out_file: Path | bool; default = True
    The path specified for the output file, if True, the default 'path
    template' will be used.
- in_file: generic/file
    input file
- append_args: list[str | generic/file]; default-factory = list()
    Additional free-form arguments to append to the end of the command.

Outputs:
- out_file: generic/file
    output file
- out_file_size: int
    size of the output file
- return_code: int
    The process' exit code.
- stdout: str
    The standard output stream produced by the command.
- stderr: str
    The standard error stream produced by the command.