Index¶

Classes¶

AprunLauncher¶

class AprunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Cobalt’s aprun.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration.

BatchSchedulerExecutor¶

class BatchSchedulerExecutor(url=None, config=None)[source]

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated using generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. Concrete implementations of a batch scheduler executor can instantiate this class and delegate submit script generation to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file to which the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <list of str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.

Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
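
The two required actions can be illustrated with a minimal, standard-library-only sketch of a script generator. It is not the TemplatedScriptGenerator implementation: the function body, the NATIVE_ID shell variable (a placeholder for whatever variable a given scheduler uses to expose the job id), and the sample context values are all assumptions made for illustration.

```python
import io
import shlex


def generate_submit_script(job, context, submit_file):
    """Write a minimal bash submit script to an already opened file.

    `job` is unused in this sketch; a real generator would read the job
    specification from it in order to render scheduler directives.
    """
    psij = context['psij']
    # Assumption: NATIVE_ID is a placeholder shell variable; the actual
    # variable that holds the native id depends on the scheduler.
    out_file = '{}/$NATIVE_ID.out'.format(psij['script_dir'])
    ec_file = '{}/$NATIVE_ID.ec'.format(psij['script_dir'])
    # Quote each element of the launch command for safe use in a shell.
    command = shlex.join(psij['launch_command'])
    submit_file.write('#!/bin/bash\n')
    # 1. Redirect the output of the executable part to the script output file.
    submit_file.write('exec &> "{}"\n'.format(out_file))
    submit_file.write(command + '\n')
    # 2. Store the exit code of the launch command in the exit code file.
    submit_file.write('echo "$?" > "{}"\n'.format(ec_file))


# Hypothetical context values, following the structure described above.
context = {
    'job': None,
    'psij': {
        'lib': {},
        'launch_command': ['/bin/echo', 'hello world'],
        'script_dir': '/home/user/.psij/work',
    },
}
buffer = io.StringIO()
generate_submit_script(context['job'], context, buffer)
script = buffer.getvalue()
print(script)
```

A real generator would also emit scheduler directives and, where appropriate, set PSIJ_NODEFILE; both are omitted here to keep the sketch short.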

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
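
The handling of jobs that have disappeared from the queue can be sketched as follows. This is a simplified illustration rather than the executor's actual code: the function name, the returned tuple, and the use of plain strings for states are assumptions.

```python
import tempfile
from pathlib import Path


def status_of_missing_job(script_dir, native_id):
    """Infer the final state of a job that no longer appears in the queue."""
    ec_path = Path(script_dir) / (native_id + '.ec')
    out_path = Path(script_dir) / (native_id + '.out')
    # The exit code file is written by the submit script.
    exit_code = int(ec_path.read_text().strip())
    if exit_code == 0:
        return 'COMPLETED', exit_code, None
    # Non-zero exit code: try to read an error message from the output file.
    message = out_path.read_text().strip() if out_path.exists() else None
    return 'FAILED', exit_code, message


# Simulate the auxiliary files left behind by two finished jobs.
with tempfile.TemporaryDirectory() as script_dir:
    Path(script_dir, '101.ec').write_text('0\n')
    Path(script_dir, '102.ec').write_text('1\n')
    Path(script_dir, '102.out').write_text('command not found\n')
    ok = status_of_missing_job(script_dir, '101')
    failed = status_of_missing_job(script_dir, '102')

print(ok)
print(failed)
```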

Parameters

url (Optional[str]) – An optional URL pointing to a specific backend
config (Optional[BatchSchedulerExecutorConfig]) – An optional configuration for this executor instance; if none is specified, a default configuration is used.

BatchSchedulerExecutorConfig¶

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Parameters

launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

CobaltExecutorConfig¶

class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Cobalt executor.

Parameters

launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

CobaltJobExecutor¶

class CobaltJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.
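
A minimal sketch of this rendering rule is shown below. The function name is hypothetical and the code is not taken from the executor; it only illustrates how a cobalt. prefix can be mapped to a long-form directive while attributes meant for other executors are ignored.

```python
def render_cobalt_directives(custom_attributes):
    """Render 'cobalt.'-prefixed attributes as long-form #COBALT directives."""
    prefix = 'cobalt.'
    lines = []
    for key, value in custom_attributes.items():
        # Attributes without the prefix are not meant for this executor.
        if key.startswith(prefix):
            lines.append('#COBALT --{}={}'.format(key[len(prefix):], value))
    return lines


directives = render_cobalt_directives({'cobalt.m': 'co', 'pbs.c': 'n'})
print(directives)
```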

Parameters

url (Optional[str]) – This parameter is not used and is only provided for compatibility reasons.
config (Optional[CobaltExecutorConfig]) – An optional configuration for this executor.

Return type

None

Descriptor¶

class Descriptor(name, version, cls, aliases=None, nice_name=None)[source]

Bases: object

This class is used to enable PSI/J to discover and register executors and/or launchers.

Executors or launchers wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__, respectively, in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a Python file should be created inside a psij-descriptors package, such as:

<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py

It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.

The contents of descriptors_for_project.py could then be as follows:

from packaging.version import Version
from psij.descriptor import Descriptor

__PSI_J_EXECUTORS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

__PSI_J_LAUNCHERS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.

Parameters

name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().
version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.
cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.
aliases (Optional[List[str]]) – An optional list of alternative names under which the executor or launcher is also made available, as if each alias were its name.
nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.

Return type

None

FunctionJobStatusCallback¶

class FunctionJobStatusCallback(fn)[source]¶

Bases: JobStatusCallback

A JobStatusCallback that wraps a function.

Initializes a _FunctionJobStatusCallback.

Parameters: fn (Callable[[Job, JobStatus], None]) –
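
The wrapping pattern can be sketched with stand-in classes. The classes below are not the PSI/J classes; in particular, the job_status_changed method name is an assumption about the callback interface, and plain strings stand in for Job and JobStatus objects.

```python
from abc import ABC, abstractmethod


class StatusCallback(ABC):
    """Stand-in for JobStatusCallback; the method name is an assumption."""

    @abstractmethod
    def job_status_changed(self, job, job_status):
        """Called when the status of a job changes."""


class FunctionStatusCallback(StatusCallback):
    """Adapts a plain function to the callback interface."""

    def __init__(self, fn):
        self.fn = fn

    def job_status_changed(self, job, job_status):
        # Delegate to the wrapped function.
        self.fn(job, job_status)


seen = []
callback = FunctionStatusCallback(lambda job, status: seen.append((job, status)))
callback.job_status_changed('job-1', 'QUEUED')
callback.job_status_changed('job-1', 'ACTIVE')
print(seen)
```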

GenericPBSJobExecutor¶

class GenericPBSJobExecutor(generator, url=None, config=None)[source]¶

Bases: BatchSchedulerExecutor

A generic JobExecutor for PBS-type schedulers.

PBS, originally developed by NASA, is one of the oldest resource managers still in use. A number of variations are available: PBS Pro, OpenPBS, and TORQUE.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #PBS directives when submitting a job.

Custom attributes prefixed with pbs. are rendered as directives in the script. For example, setting custom_attributes['pbs.c'] = 'n' results in the #PBS -c n directive being placed in the submit script, which disables checkpointing.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[PBSExecutorConfig]) – An optional configuration for this executor.
generator (TemplatedScriptGenerator) –

Return type

None

InvalidJobException¶

class InvalidJobException(message, exception=None)[source]

Bases: Exception

An exception describing a problem with a job specification.

Parameters

message (str) – see the message property
exception (Optional[Exception]) – see the exception property

Return type

None

InvalidJobStateError¶

class InvalidJobStateError[source]

Bases: Exception

An exception that signals that a job cannot be cancelled due to it being already done.

Job¶

class Job(spec=None)[source]¶

Bases: object

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

When constructed, a job is in the NEW state.

Parameters: spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.
Return type: None

JobAttributes¶

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]¶

Bases: object

A class containing ancillary job information that describes how a job is to be run.

Parameters

duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.
queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.
account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the executor must map this attribute to the accounting/billing option in the underlying execution mechanism.
reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.
custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).
project_name (Optional[str]) – Deprecated. Please use the account attribute.

Return type

None

All constructor parameters are accessible as properties.

JobExecutor¶

class JobExecutor(url=None, config=None)[source]¶

Bases: ABC

An abstract base class for all JobExecutor implementations.

Parameters

url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.
config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional so that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

JobExecutorConfig¶

class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]¶

Bases: object

An abstract configuration class for JobExecutor instances.

Parameters

launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.
work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

Return type

None

JobSpec¶

class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)[source]¶

Bases: object

A class that describes the details of a job.

Parameters

executable (Optional[str]) – An executable, such as “/bin/date”.
arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.
directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.
name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.
inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by the environment property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.
environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.
stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.
stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.
stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.
resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.
attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.
pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.
post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.
launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.

All constructor parameters are accessible as properties.
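
The interaction between inherit_environment and environment described above can be sketched as follows. The function is hypothetical and only models the documented behavior; the assumption that explicitly specified variables take precedence over inherited ones, and that values are converted to strings, is made for illustration.

```python
def build_environment(inherit_environment, environment, parent_environment):
    """Compute the environment a job would see, as a dict of strings."""
    # Start from the parent environment only if inheritance is requested.
    env = dict(parent_environment) if inherit_environment else {}
    for name, value in (environment or {}).items():
        # Values may be str or int in the spec; processes see strings.
        env[name] = str(value)
    return env


parent = {'PATH': '/usr/bin', 'HOME': '/home/foo'}
inherited = build_environment(True, {'OMP_NUM_THREADS': 4}, parent)
clean = build_environment(False, {'OMP_NUM_THREADS': 4}, parent)
print(inherited)
print(clean)
```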

Note

A note about paths.

It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.

Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.

For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.

When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt on the submission side. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.

Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
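
The effect of the working directory on relative paths can be reproduced with pathlib alone. The sketch below uses the /home/foo and /tmp directories from the example above; PurePosixPath is used so that no filesystem access is needed.

```python
from pathlib import PurePosixPath

stdout_path = PurePosixPath('bar.txt')

# The same relative path resolves differently depending on the working
# directory of the process that interprets it.
submit_side = PurePosixPath('/home/foo') / stdout_path
compute_side = PurePosixPath('/tmp') / stdout_path

# An absolute path is unaffected by the working directory.
absolute = PurePosixPath('/tmp') / PurePosixPath('/home/foo/bar.txt')

print(submit_side)
print(compute_side)
print(absolute)
```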

JobState¶

class JobState(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶

Bases: bytes, Enum

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.

JobStateOrder¶

class JobStateOrder[source]¶

Bases: object

A class that can be used to reconstruct missing states.

JobStatus¶

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]¶

Bases: object

A class containing details about job transitions to new states.

Parameters

state (JobState) – The JobState of this status.
time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.
message (Optional[str]) – An optional message associated with the transition.
exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.
metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None

All constructor parameters are accessible as properties.

JobStatusCallback¶

class JobStatusCallback[source]¶

Bases: ABC

An interface used to listen to job status change events.

JsrunLauncher¶

class JsrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using LSF’s jsrun.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration.

Launcher¶

class Launcher(config=None)[source]¶

Bases: ABC

An abstract base class for all launchers.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, JobExecutorConfig.DEFAULT is used.
Return type: None

LocalJobExecutor¶

class LocalJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs locally using subprocess.Popen.

This job executor is intended to be used either to run jobs directly on the same machine as the PSI/J library or for testing purposes.

Note

In Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.

Warning

Instantiation of a local executor from both a parent process and a fork()-ed process is not guaranteed to work. In general, using fork() and multi-threading in Linux is unsafe, as suggested by the fork() man page. While PSI/J attempts to minimize problems that can arise when fork() is combined with threads (which are used by PSI/J), no guarantees can be made and the chances of unexpected behavior are high. Please do not use PSI/J with fork(). If you do, please be mindful that support for using PSI/J with fork() will be limited.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.

Return type

None

LsfExecutorConfig¶

class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the LSF executor.

Parameters

launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

LsfJobExecutor¶

class LsfJobExecutor(url, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the bsub, bjobs, and bkill commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[LsfExecutorConfig]) – An optional configuration for this executor.

MPILauncher¶

class MPILauncher(config=None)[source]

Bases: MultipleLauncher

Launches jobs using mpirun.

mpirun is a tool provided by MPI implementations, such as Open MPI.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration.

MultipleLauncher¶

class MultipleLauncher(script_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/psi-j-python/checkouts/latest/src/psij/launchers/scripts/multi_launch.sh'), config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches multiple identical copies of the executable.

The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
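
The exit code rule can be expressed in a few lines. The function below is a hypothetical illustration, not the launcher script's code; it assumes the exit codes are supplied in the order in which they are encountered.

```python
def combined_exit_code(exit_codes):
    """Return the first non-zero exit code, or zero if all copies succeeded."""
    for code in exit_codes:
        if code != 0:
            return code
    return 0


print(combined_exit_code([0, 3, 0, 7]))
print(combined_exit_code([0, 0, 0]))
```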

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.
script_path (Path) –

PBSClassicJobExecutor¶

class PBSClassicJobExecutor(url=None, config=None)[source]

Bases: GenericPBSJobExecutor

A JobExecutor for classic PBS systems.

This executor uses resource specifications specific to Open PBS. Specifically, this executor uses the -l nodes=n:ppn=m way of specifying nodes, which differs from the scheme used by PBS Pro.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[PBSExecutorConfig]) – An optional configuration for this executor.

PBSExecutorConfig¶

class PBSExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]¶

Bases: BatchSchedulerExecutorConfig

A generic configuration class for PBS-type executors.

Parameters

launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that submit only a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

PBSJobExecutor¶

class PBSJobExecutor(url=None, config=None)[source]

Bases: GenericPBSJobExecutor

A JobExecutor for PBS Pro and friends.

This executor uses resource specifications specific to PBS Pro.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[PBSExecutorConfig]) – An optional configuration for this executor.

RPJobExecutor¶

class RPJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs via the RADICAL Pilot system.

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[JobExecutorConfig]) – The RPJobExecutor does not have any configuration options.

Return type

None

ResourceSpec¶

class ResourceSpec[source]¶

Bases: ABC

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

ResourceSpecV1¶

class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=True)[source]¶

Bases: ResourceSpec

This class implements V1 of the PSI/J resource specification.

Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.

Parameters

node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.
process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.
processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.
cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.
gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.
exclusive_node_use (bool) – If this parameter is set to True, which is the default according to the constructor signature, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs totals less than the amount available on the node.

Return type

None

All constructor parameters are accessible as properties.
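
The constraint process_count = node_count * processes_per_node and the inference of a missing value can be sketched as below. This is a hypothetical, standalone function rather than the ResourceSpecV1 implementation; in particular, rejecting values that do not divide evenly is a choice made for this sketch.

```python
def infer_counts(node_count=None, process_count=None, processes_per_node=None):
    """Return (node_count, process_count, processes_per_node) after inference."""
    if None not in (node_count, process_count, processes_per_node):
        # All three values are given: they must satisfy the constraint.
        if process_count != node_count * processes_per_node:
            raise ValueError('process_count != node_count * processes_per_node')
    elif node_count is not None and processes_per_node is not None:
        process_count = node_count * processes_per_node
    elif process_count is not None and processes_per_node is not None:
        node_count, remainder = divmod(process_count, processes_per_node)
        if remainder != 0:
            raise ValueError('process_count is not a multiple of processes_per_node')
    elif process_count is not None and node_count is not None:
        processes_per_node, remainder = divmod(process_count, node_count)
        if remainder != 0:
            raise ValueError('process_count is not a multiple of node_count')
    # With fewer than two values given, nothing can be inferred.
    return node_count, process_count, processes_per_node


print(infer_counts(node_count=2, processes_per_node=4))
print(infer_counts(process_count=8, processes_per_node=4))
```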

ScriptBasedLauncher¶

class ScriptBasedLauncher(script_path, config=None)[source]

Bases: Launcher

A launcher that uses a script to start the job, possibly by wrapping it in other tools.

This launcher is an abstract base class for launchers that wrap the job in a script. The script must be a bash script and is invoked with the first four parameters as:

the job ID
a launcher log file, which is taken from the launcher_log_file configuration setting and defaults to /dev/null
the pre- and post- launcher scripts, or empty strings if they are not specified

Additional positional arguments to the script can be specified by subclasses by overriding the get_additional_args() method.

The remaining arguments to the script are the job executable and arguments.
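
The argument order described above can be sketched as a function that assembles the script invocation. The function name, its parameters, and the sample values are hypothetical; only the ordering of the arguments follows the description.

```python
def build_launcher_command(script_path, job_id, log_file, pre_launch,
                           post_launch, additional_args, executable, arguments):
    """Assemble the argument list used to invoke a launcher script."""
    return [
        script_path,
        job_id,                    # 1: the job ID
        log_file or '/dev/null',   # 2: the launcher log file
        pre_launch or '',          # 3: pre-launch script, or empty string
        post_launch or '',         # 4: post-launch script, or empty string
        *additional_args,          # launcher-specific positional arguments
        executable,                # the job executable...
        *arguments,                # ...followed by its arguments
    ]


command = build_launcher_command(
    'multi_launch.sh', 'job-1', None, '/home/foo/pre.sh', None,
    ['4'], '/bin/echo', ['hello'])
print(command)
```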

A simple script library is provided in scripts/launcher_lib.sh. Its use is optional and it is intended to be included at the beginning of a main launcher script using source $(dirname "$0")/launcher_lib.sh. It does the following:

sets ‘-e’ mode (exit on error)
sets the variables _PSI_J_JOB_ID, _PSI_J_LOG_FILE, _PSI_J_PRE_LAUNCH, and _PSI_J_POST_LAUNCH from the first arguments, as specified above.
saves the current stdout and stderr in descriptors 3 and 4, respectively
redirects stdout and stderr to the log file, while prepending a timestamp and the job ID to each line
defines the commands “pre_launch” and “post_launch”, which can be invoked by the main script.

When invoking the job executable (either directly or through a launch command), it is recommended that the stdout and stderr of the job process be redirected to descriptors 3 and 4, respectively, such that they can be captured by the entity invoking the launcher rather than ending up in the launcher log file.

The launcher should signal successful completion by printing the string “_PSI_J_LAUNCHER_DONE” to stdout. The launcher can then exit with the exit code returned by the launched command. This allows the executor to distinguish between a non-zero exit code due to application failure and one due to a premature launcher failure.
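
One way an invoking entity could use the marker is sketched below. The function classify_launcher_result and its return values are hypothetical, and the assumption that the marker appears on a line of its own is not confirmed by this page.

```python
LAUNCHER_DONE = '_PSI_J_LAUNCHER_DONE'


def classify_launcher_result(exit_code: int, stdout: str) -> str:
    """Hypothetical classification of a launcher run from its exit code and stdout."""
    # Assumption: the marker is printed on a line of its own.
    launcher_completed = any(
        line.strip() == LAUNCHER_DONE for line in stdout.splitlines()
    )
    if not launcher_completed:
        # The launcher stopped before reaching its normal end.
        return 'launcher_failed'
    if exit_code != 0:
        # The launcher finished, so the exit code comes from the application.
        return 'application_failed'
    return 'completed'


print(classify_launcher_result(1, 'some output\n_PSI_J_LAUNCHER_DONE\n'))
```

The point of the marker is visible in the second branch: a non-zero exit code is attributed to the application only when the launcher itself is known to have run to completion.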

The actual launcher scripts, as well as the library, are deployed at run-time into the work directory, where submit scripts are also generated. This directory is meant to be accessible by both the node submitting the job as well as the node launching the job.

Parameters

script_path (Path) – A path to a script that is invoked as described above.
config (Optional[JobExecutorConfig]) – An optional configuration.

Return type

None

SingleLauncher¶

class SingleLauncher(config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches a single copy of the executable. This is the default launcher.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration.

SingletonThread¶

class SingletonThread(name=None, daemon=False)[source]

Bases: Thread

A convenience class to return a thread that is guaranteed to be unique to this process.

This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.

This class is meant as an abstract class and should be used by subclassing and implementing the run method.

Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.

Parameters

name (Optional[str]) – An optional name for this thread.
daemon (bool) – A daemon thread does not prevent the process from exiting.

Return type

None
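
The idea of associating at most one thread with each os.getpid() value can be sketched as below. The class PerProcessThread is hypothetical and is not the library's implementation. In particular, whether the real get_instance() also starts the thread is not stated here, so this sketch leaves the thread unstarted.

```python
import os
import threading
from typing import Dict, Tuple


class PerProcessThread(threading.Thread):
    """Hypothetical thread class with one instance per (class, process ID)."""

    _instances: Dict[Tuple[type, int], 'PerProcessThread'] = {}
    _lock = threading.RLock()

    @classmethod
    def get_instance(cls) -> 'PerProcessThread':
        # Keying on the PID means a forked child creates its own instance
        # instead of reusing an object whose thread only exists in the parent.
        key = (cls, os.getpid())
        with cls._lock:
            instance = cls._instances.get(key)
            if instance is None:
                instance = cls(name=cls.__name__, daemon=True)
                cls._instances[key] = instance
            return instance


class Poller(PerProcessThread):
    def run(self) -> None:
        pass  # a subclass would do its work here


print(Poller.get_instance() is Poller.get_instance())
```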

SlurmExecutorConfig¶

class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Slurm executor.

Parameters

launcher_log_file (Optional[Path]) – See JobExecutorConfig.
work_directory (Optional[Path]) – See JobExecutorConfig.
queue_polling_interval (int) – An interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.
initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.
queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.
keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

SlurmJobExecutor¶

class SlurmJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.
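
The mapping from prefixed custom attributes to directives can be sketched as follows. The function render_slurm_directives is hypothetical. The executor itself generates its submit script from a template, so this only illustrates the naming rule, not the actual rendering code.

```python
from typing import Dict, List


def render_slurm_directives(custom_attributes: Dict[str, object]) -> List[str]:
    """Hypothetical helper turning 'slurm.<name>' attributes into #SBATCH lines."""
    prefix = 'slurm.'
    directives = []
    for name, value in custom_attributes.items():
        if name.startswith(prefix):
            # Strip the prefix and use the rest as a long-form option name.
            directives.append(f'#SBATCH --{name[len(prefix):]}={value}')
    return directives


print(render_slurm_directives({'slurm.qos': 'debug'}))
```

Attributes without the slurm. prefix are ignored in this sketch.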

Parameters

url (Optional[str]) – Not used, but required by the spec for automatic initialization.
config (Optional[SlurmExecutorConfig]) – An optional configuration for this executor.

SrunLauncher¶

class SrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Slurm’s srun.

See the Slurm Workload Manager.

Parameters: config (Optional[JobExecutorConfig]) – An optional configuration.

SubmitException¶

class SubmitException(message, exception=None, transient=False)[source]

Bases: Exception

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Parameters

message (str) – see message
exception (Optional[Exception]) – see exception
transient (bool) – see transient

Return type

None

SubmitScriptGenerator¶

class SubmitScriptGenerator(config)[source]

Bases: ABC

A base class representing a submit script generator.

A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.

Parameters: config (JobExecutorConfig) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.
Return type: None

TemplatedScriptGenerator¶

class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]

Bases: SubmitScriptGenerator

A submit script generator based on Mustache templates.

This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).

Parameters

config (JobExecutorConfig) – A configuration, which is passed to the base class.
template_path (Path) – The path to a Mustache template.
escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.

Return type

None

Functions¶

bash_escape¶

bash_escape(o)[source]

Escape object to bash string.

Renders and escapes an object to a string such that its value is preserved when substituted in a bash script between double quotes. Numeric values are simply rendered without any escaping. Path objects are converted to absolute paths and escaped. All other objects are converted to strings and escaped.

Parameters: o (object) – The object to escape.
Returns: An escaped representation of the object that can be substituted in bash scripts.
Return type: str
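
The behaviour described above can be approximated with the sketch below. The function bash_escape_sketch is hypothetical and is not the library's implementation. The set of escaped characters (backslash, double quote, dollar sign, and backtick) is an assumption based on what bash treats specially inside double quotes.

```python
from pathlib import Path

# Characters that keep a special meaning inside bash double quotes (assumed set).
_SPECIAL = '\\"$`'


def _escape(text: str) -> str:
    return ''.join('\\' + ch if ch in _SPECIAL else ch for ch in text)


def bash_escape_sketch(o: object) -> str:
    """Hypothetical approximation of the escaping rules described above."""
    if isinstance(o, (int, float)):
        # Numeric values are rendered as-is.
        return str(o)
    if isinstance(o, Path):
        # Paths are made absolute before escaping.
        return _escape(str(o.absolute()))
    return _escape(str(o))


print(bash_escape_sketch('say "hi" for $5'))
```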

check_status_exit_code¶

check_status_exit_code(command, exit_code, out)[source]

Check if exit_code is nonzero and, if so, raise a RuntimeError.

This function produces a somewhat user-friendly exception message that combines the command that was run with its output.

Parameters

command (str) – The command that was run. This is only used to format the error message.
exit_code (int) – The exit code returned by running the command.
out (str) – The output produced by command.

Return type

None
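
A function with this behaviour can be sketched as below. The name check_exit_code_sketch and the exact message format are assumptions. Only the decision to raise RuntimeError on a non-zero exit code comes from the description above.

```python
def check_exit_code_sketch(command: str, exit_code: int, out: str) -> None:
    """Hypothetical equivalent: raise RuntimeError if exit_code is non-zero."""
    if exit_code != 0:
        # The message format is illustrative only.
        raise RuntimeError(
            f'Command "{command}" failed with exit code {exit_code}. Output: {out}'
        )


check_exit_code_sketch('squeue', 0, '')  # a zero exit code does not raise
try:
    check_exit_code_sketch('squeue', 1, 'connection refused')
except RuntimeError as err:
    print(err)
```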

walltime_to_minutes¶

walltime_to_minutes(walltime)[source]

Converts a walltime object to a number of minutes.

The walltime can either be a Python timedelta, an integer, in which case it is interpreted directly as a number of minutes, or a string with a format of either HH:MM:SS, HH:MM, or MM.

Parameters: walltime (Union[timedelta, int, str]) – The walltime to convert.
Returns: The number of minutes represented by the walltime parameter.
Return type: int
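
The accepted input forms can be illustrated with the sketch below. The function walltime_to_minutes_sketch is hypothetical. How the library treats a non-zero seconds field or a timedelta that is not a whole number of minutes is not stated on this page, so this sketch simply drops the remainder and labels that as an assumption.

```python
from datetime import timedelta
from typing import Union


def walltime_to_minutes_sketch(walltime: Union[timedelta, int, str]) -> int:
    """Hypothetical conversion of the documented walltime forms to minutes."""
    if isinstance(walltime, timedelta):
        # Assumption: partial minutes are truncated.
        return int(walltime.total_seconds() // 60)
    if isinstance(walltime, int):
        # Integers are already a number of minutes.
        return walltime
    if isinstance(walltime, str):
        parts = [int(part) for part in walltime.split(':')]
        if len(parts) == 3:      # HH:MM:SS
            hours, minutes, _seconds = parts  # assumption: seconds are dropped
            return hours * 60 + minutes
        if len(parts) == 2:      # HH:MM
            hours, minutes = parts
            return hours * 60 + minutes
        if len(parts) == 1:      # MM
            return parts[0]
    raise ValueError(f'Unsupported walltime: {walltime!r}')


print(walltime_to_minutes_sketch('01:30:00'))
```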