PSI/J Core

Job

class Job(spec=None)[source]

Bases: object

This class represents a PSI/J job.

It encapsulates all of the information needed to run a job as well as the job’s state.

When constructed, a job is in the NEW state.

Parameters

spec (Optional[JobSpec]) – an optional JobSpec that describes the details of the job.

Return type

None

cancel()[source]

Cancels this job.

The job is canceled by calling cancel() on the job executor that was used to submit this job.

Raises

SubmitException – if the job has not yet been submitted.

Return type

None

property id: str

A read-only property containing the PSI/J job ID.

The ID is assigned automatically by the implementation when this Job object is constructed. The ID is guaranteed to be unique on the machine on which the Job object was instantiated. The ID does not have to match the ID of the underlying LRM job, but is used to identify Job instances as seen by a client application.

property native_id: Optional[str]

A read-only property containing the native ID of the job.

The native ID is the ID assigned to the job by the underlying implementation. The native ID may not be available until after the job is submitted to a JobExecutor, in which case the value of this property is None.

set_job_status_callback(cb)[source]

Registers a status callback with this job.

The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.

The callback is invoked whenever a status change occurs for this job, independent of any callback registered on the job’s JobExecutor. The callback can be removed by passing None to this method.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters, job of type Job, job_status of type JobStatus, and returning nothing.

Return type

None
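The two accepted callback forms can be handled uniformly by a small dispatcher. The following is a minimal sketch that uses stand-in classes rather than PSI/J itself; the class StatusCallback, its method name job_status_changed, and the helper notify are assumptions made for illustration only.

```python
from typing import Any, Callable, List, Tuple, Union


class StatusCallback:
    """Stand-in for a callback base class (hypothetical, not the PSI/J class)."""

    def job_status_changed(self, job: Any, status: Any) -> None:
        raise NotImplementedError


def notify(cb: Union[StatusCallback, Callable[[Any, Any], None], None],
           job: Any, status: Any) -> None:
    """Invoke either form of callback; do nothing if the callback is None."""
    if cb is None:
        return
    if isinstance(cb, StatusCallback):
        cb.job_status_changed(job, status)
    else:
        cb(job, status)


class Recorder(StatusCallback):
    """Object-style callback that records every notification it receives."""

    def __init__(self) -> None:
        self.seen: List[Tuple[Any, Any]] = []

    def job_status_changed(self, job: Any, status: Any) -> None:
        self.seen.append((job, status))


recorder = Recorder()
calls: List[Tuple[Any, Any]] = []

notify(recorder, "job-1", "QUEUED")                           # object form
notify(lambda j, s: calls.append((j, s)), "job-1", "ACTIVE")  # callable form
notify(None, "job-1", "COMPLETED")                            # no callback set
```

Plain strings stand in for the job and status objects here so that the sketch runs without any library installed.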

spec

The job specification of this job.

property status: JobStatus

Contains the current status of the job.

It is guaranteed that the status returned by this method is monotonic in time with respect to the partial ordering of JobStatus types. That is, if job_status_1.state and job_status_2.state are comparable and job_status_1.state < job_status_2.state, then it is impossible for job_status_2 to be returned by a call placed prior to a call that returns job_status_1 if both calls are placed from the same thread or if a proper memory barrier is placed between the calls. Furthermore the job is guaranteed to go through all intermediate states in the state model before reaching a particular state.

Returns

the current state of this job

wait(timeout=None, target_states=None)[source]

Waits for the job to reach certain states.

This method returns when the job reaches one of the target_states, a state following one of the target_states, or a final state, or when the amount of time indicated by the timeout parameter, if specified, has passed. It returns the JobStatus object that has one of the desired states, or None if the timeout is reached. For example, wait(target_states=[JobState.QUEUED]) waits until the job is in any of the QUEUED, ACTIVE, COMPLETED, FAILED, or CANCELED states.

Parameters
  • timeout (Optional[timedelta]) – An optional timeout after which this method returns even if none of the target_states was reached. If not specified, wait indefinitely.

  • target_states (Optional[Union[JobState, Sequence[JobState]]]) – A set of states to wait for. If not specified, wait for any of the final states.

Returns

The JobStatus object that caused this call to complete, or None if the timeout is specified and reached.

Return type

Optional[JobStatus]
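The rule for which states end a wait can be modeled in a few lines. This is a sketch under stated assumptions, not the PSI/J implementation: State is a stand-in enumeration, and ends_wait, FINAL, and ORDER are hypothetical names introduced here.

```python
from enum import Enum
from typing import Iterable, Optional


class State(Enum):
    """Stand-in for the job states (not the PSI/J enum)."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5


FINAL = {State.COMPLETED, State.FAILED, State.CANCELED}
# Position of each non-final state in the NEW, QUEUED, ACTIVE chain.
ORDER = {State.NEW: 0, State.QUEUED: 1, State.ACTIVE: 2}


def ends_wait(state: State, targets: Optional[Iterable[State]] = None) -> bool:
    """Return True if reaching `state` would end a wait on `targets`."""
    if state in FINAL:
        return True            # a final state always ends the wait
    if targets is None:
        return False           # default: wait for final states only
    for target in targets:
        if target in ORDER and ORDER[state] >= ORDER[target]:
            return True        # the target itself or a state following it
    return False


ending = [s.name for s in State if ends_wait(s, [State.QUEUED])]
print(ending)
```

With QUEUED as the only target, every state except NEW ends the wait, which matches the example given in the method description.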

JobSpec

class JobSpec(executable=None, arguments=None, directory=None, name=None, inherit_environment=True, environment=None, stdin_path=None, stdout_path=None, stderr_path=None, resources=None, attributes=None, pre_launch=None, post_launch=None, launcher=None)[source]

Bases: object

A class that describes the details of a job.

Parameters
  • executable (Optional[str]) – An executable, such as “/bin/date”.

  • arguments (Optional[List[str]]) – The argument list to be passed to the executable. Unlike with execve(), the first element of the list will correspond to argv[1] when accessed by the invoked executable.

  • directory (Union[str, Path, None]) – The directory, on the compute side, in which the executable is to be run.

  • name (Optional[str]) – A name for the job. The name plays no functional role except that JobExecutor implementations may attempt to use the name to label the job as presented by the underlying implementation.

  • inherit_environment (bool) – If this flag is set to False, the job starts with an empty environment. The only environment variables that will be accessible to the job are the ones specified by this property. If this flag is set to True, which is the default, the job will also have access to variables inherited from the environment in which the job is run.

  • environment (Optional[Dict[str, Union[str, int]]]) – A mapping of environment variable names to their respective values.

  • stdin_path (Union[str, Path, None]) – Path to a file whose contents will be sent to the job’s standard input.

  • stdout_path (Union[str, Path, None]) – A path to a file in which to place the standard output stream of the job.

  • stderr_path (Union[str, Path, None]) – A path to a file in which to place the standard error stream of the job.

  • resources (Optional[ResourceSpec]) – The resource requirements specify the details of how the job is to be run on a cluster, such as the number and type of compute nodes used, etc.

  • attributes (Optional[JobAttributes]) – Job attributes are details about the job, such as the walltime, that are descriptive of how the job behaves. Attributes are, in principle, non-essential in that the job could run even though no attributes are specified. In practice, specifying a walltime is often necessary to prevent LRMs from prematurely terminating a job.

  • pre_launch (Union[str, Path, None]) – An optional path to a pre-launch script. The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

  • post_launch (Union[str, Path, None]) – An optional path to a post-launch script. The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

  • launcher (Optional[str]) – The name of a launcher to use, such as “mpirun”, “srun”, “single”, etc. For a list of available launchers, see Available Launchers.

All constructor parameters are accessible as properties.
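The interaction between inherit_environment and environment can be illustrated with a small model. This is a sketch, not the library's code: the function effective_environment is hypothetical, and it assumes that values from the specification override inherited ones and that integer values are converted to strings.

```python
from typing import Dict, Mapping, Optional, Union


def effective_environment(inherit: bool,
                          environment: Optional[Mapping[str, Union[str, int]]],
                          compute_side_env: Mapping[str, str]) -> Dict[str, str]:
    """Combine inherited variables with the ones set in the specification."""
    # Start empty unless the job inherits the environment it is run in.
    result: Dict[str, str] = dict(compute_side_env) if inherit else {}
    for name, value in (environment or {}).items():
        result[name] = str(value)   # assumption: integer values become strings
    return result


base = {"PATH": "/usr/bin", "HOME": "/home/foo"}
spec_env = {"OMP_NUM_THREADS": 4, "PATH": "/opt/bin"}

clean = effective_environment(False, spec_env, base)
merged = effective_environment(True, spec_env, base)
```

In this model, clean contains only the two variables from spec_env, while merged also keeps HOME from the inherited environment.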

Note

A note about paths.

It is strongly recommended that paths to std*_path, directory, etc. be specified as absolute. While paths can be relative, and there are cases when it is desirable to specify them as relative, it is important to understand what the implications are.

Paths in a specification refer to paths that are accessible to the machine where the job is running. In most cases, that will be different from the machine on which the job is launched (i.e., where PSI/J is invoked from). This means that a given path may or may not point to the same file in both the location where the job is running and the location where the job is launched from.

For example, if launching jobs from a login node of a cluster, the path /tmp/foo.txt will likely refer to locally mounted drives on both the login node and the compute node(s) where the job is running. However, since they are local mounts, the file /tmp/foo.txt written by a job running on the compute node will not be visible by opening /tmp/foo.txt on the login node. If an output file written on a compute node needs to be accessed on a login node, that file should be placed on a shared filesystem. However, even by doing so, there is no guarantee that the shared filesystem is mounted under the same mount point on both login and compute nodes. While this is an unlikely scenario, it remains a possibility.

When relative paths are specified, even when they point to files on a shared filesystem as seen from the submission side (i.e., login node), the job working directory may be different from the working directory of the application that is launching the job. For example, an application that uses PSI/J to launch jobs on a cluster may be invoked from (and have its working directory set to) /home/foo, where /home is a mount point for a shared filesystem accessible by compute nodes. The launched job may specify stdout_path=Path('bar.txt'), which would resolve to /home/foo/bar.txt. However, the job may start in /tmp on the compute node, and its standard output will be redirected to /tmp/bar.txt.

Relative paths are useful when there is a need to refer to the job directory that the scheduler chooses for the job, which is not generally known until the job is started by the scheduler. In such a case, one must leave the spec.directory attribute empty and refer to files inside the job directory using relative paths.
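The effect of the working directory on a relative path can be reproduced with pathlib alone. The sketch below mirrors the /home/foo and /tmp scenario from this note; the helper resolve is a hypothetical name and no PSI/J code is involved.

```python
from pathlib import PurePosixPath


def resolve(working_directory: str, path: str) -> PurePosixPath:
    """Resolve `path` against a working directory, as a process would."""
    return PurePosixPath(working_directory) / path


stdout_path = "bar.txt"

# The launching application runs in /home/foo, the job starts in /tmp.
seen_by_launcher = resolve("/home/foo", stdout_path)
seen_by_job = resolve("/tmp", stdout_path)

# An absolute path is unaffected by the working directory.
absolute = resolve("/tmp", "/home/foo/bar.txt")
```

The same relative string names two different files, whereas the absolute path names the same location regardless of where the job starts.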

property directory: Optional[Path]

The directory, on the compute side, in which the executable is to be run.

property environment: Optional[Dict[str, str]]

Return the environment dict.

property name: Optional[str]

Returns the name of the job.

property post_launch: Optional[Path]

An optional path to a post-launch script.

The post-launch script is sourced after all the ranks of the job executable complete and is sourced on the same node as the pre-launch script.

property pre_launch: Optional[Path]

An optional path to a pre-launch script.

The pre-launch script is sourced before the launcher is invoked. It, therefore, runs on the service node of the job rather than on all of the compute nodes allocated to the job.

property stderr_path: Optional[Path]

A path to a file in which to place the standard error stream of the job.

property stdin_path: Optional[Path]

A path to a file whose contents will be sent to the job’s standard input.

property stdout_path: Optional[Path]

A path to a file in which to place the standard output stream of the job.

JobAttributes

class JobAttributes(duration=datetime.timedelta(seconds=600), queue_name=None, account=None, reservation_id=None, custom_attributes=None, project_name=None)[source]

Bases: object

A class containing ancillary job information that describes how a job is to be run.

Parameters
  • duration (timedelta) – Specifies the duration (walltime) of the job. A job whose execution exceeds its walltime can be terminated forcefully.

  • queue_name (Optional[str]) – If a backend supports multiple queues, this parameter can be used to instruct the backend to send this job to a particular queue.

  • account (Optional[str]) – An account to use for billing purposes. Please note that the executor implementation (or batch scheduler) may use a different term for the option used for accounting/billing purposes, such as project. However, the executor must map this attribute to the accounting/billing option in the underlying execution mechanism.

  • reservation_id (Optional[str]) – Allows specifying an advanced reservation ID. Advanced reservations enable the pre-allocation of a set of resources/compute nodes for a certain duration such that jobs can be run immediately, without waiting in the queue for resources to become available.

  • custom_attributes (Optional[Dict[str, object]]) – Specifies a dictionary of custom attributes. Implementations of JobExecutor define and are responsible for interpreting custom attributes. The typical usage scenario for custom attributes is to pass information to the executor or underlying job execution mechanism that cannot otherwise be passed using the classes and properties provided by PSI/J. A specific example is that of the subclasses of BatchSchedulerExecutor, which look for custom attributes prefixed with their name and a dot (e.g., slurm.constraint, pbs.c, lsf.core_isolation) and translate them into the corresponding batch scheduler directives (e.g., #SBATCH --constraint=…, #PBS -c …, #BSUB -core_isolation …).

  • project_name (Optional[str]) – Deprecated. Please use the account attribute.

Return type

None

All constructor parameters are accessible as properties.
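The translation of prefixed custom attributes into batch script directives can be sketched as follows. This is an illustration only, not the executors' code: render_directives and FORMATS are hypothetical names, the directive shapes follow the #SBATCH and #BSUB forms described in the Slurm and LSF executor sections, and the attribute values are made up.

```python
from typing import Dict, List

# Directive templates per executor name (assumed shapes, for illustration).
FORMATS = {
    "slurm": "#SBATCH --{name}={value}",
    "lsf": "#BSUB -{name} {value}",
}


def render_directives(executor: str,
                      custom_attributes: Dict[str, object]) -> List[str]:
    """Render the attributes carrying this executor's prefix as directives."""
    template = FORMATS[executor]
    prefix = executor + "."
    return [template.format(name=key[len(prefix):], value=value)
            for key, value in custom_attributes.items()
            if key.startswith(prefix)]


attributes = {
    "slurm.constraint": "knl",      # illustrative value
    "lsf.core_isolation": "0",
}

slurm_lines = render_directives("slurm", attributes)
lsf_lines = render_directives("lsf", attributes)
```

Each executor picks out only the attributes that carry its own prefix and ignores the rest, so a single specification can carry hints for several schedulers.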

property custom_attributes: Optional[Dict[str, object]]

Returns a dictionary with the custom attributes.

get_custom_attribute(name)[source]

Retrieves the value of a custom attribute.

Parameters

name (str) –

Return type

Optional[object]

static parse_walltime(walltime)[source]

Parses a walltime string into a timedelta.

The accepted walltime string formats are:
  • hh:mm:ss

  • hh:mm

  • mm

  • ns*[y|M|d|h|ms]

Parameters

walltime (str) – A string in one of the above formats representing a time duration

Returns

A timedelta representing the same time duration as the walltime parameter.

Return type

timedelta
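A parser for the colon-separated formats can be sketched in a few lines. This is not the library's implementation: parse_colon_walltime is a hypothetical function that handles only the hh:mm:ss, hh:mm, and mm forms and leaves the unit-suffixed form aside.

```python
from datetime import timedelta


def parse_colon_walltime(walltime: str) -> timedelta:
    """Parse 'hh:mm:ss', 'hh:mm', or 'mm' into a timedelta."""
    parts = [int(p) for p in walltime.strip().split(":")]
    if len(parts) == 3:
        hours, minutes, seconds = parts
    elif len(parts) == 2:
        hours, minutes, seconds = parts[0], parts[1], 0
    elif len(parts) == 1:
        # A bare number is interpreted as minutes.
        hours, minutes, seconds = 0, parts[0], 0
    else:
        raise ValueError(f"unsupported walltime format: {walltime!r}")
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


full = parse_colon_walltime("01:30:00")
short = parse_colon_walltime("02:15")
minutes_only = parse_colon_walltime("90")
```

Note that a bare number is read as minutes, so "90" and "01:30:00" describe the same duration.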

property project_name: Optional[str]

Deprecated. Please use the account attribute.

set_custom_attribute(name, value)[source]

Sets a custom attribute.

Parameters
Return type

None

ResourceSpec

class ResourceSpec[source]

Bases: ABC

A base class for resource specifications.

The ResourceSpec class is an abstract base class for all possible resource specification classes in PSI/J.

static get_instance(version)[source]

Creates an instance of a ResourceSpec of the specified version.

Parameters

version (int) – The version of ResourceSpec to instantiate. For example, if version == 1, this method will return a new instance of ResourceSpecV1.

Return type

ResourceSpec

abstract property version: int

Returns the version of this resource specification class.

ResourceSpecV1

class ResourceSpecV1(node_count=None, process_count=None, processes_per_node=None, cpu_cores_per_process=None, gpu_cores_per_process=None, exclusive_node_use=True)[source]

Bases: ResourceSpec

This class implements V1 of the PSI/J resource specification.

Some of the properties of this class are constrained. Specifically, process_count = node_count * processes_per_node. Specifying all constrained properties in a way that does not satisfy the constraint will result in an error. Specifying some of the constrained properties will result in the remaining one being inferred based on the constraint. This inference is done by this class. However, executor implementations may choose to delegate this inference to an underlying implementation and ignore the values inferred by this class.

Parameters
  • node_count (Optional[int]) – If specified, request that the backend allocate this many compute nodes for the job.

  • process_count (Optional[int]) – If specified, instruct the backend to start this many process instances. This defaults to 1.

  • processes_per_node (Optional[int]) – Instruct the backend to run this many process instances on each node.

  • cpu_cores_per_process (Optional[int]) – Request this many CPU cores for each process instance. This property is used by a backend to calculate the number of nodes from the process_count.

  • gpu_cores_per_process (Optional[int]) – Request this many GPU cores for each process instance.

  • exclusive_node_use (bool) – If this parameter is set to True, the LRM is instructed to allocate to this job only nodes that are not running any other jobs, even if this job is requesting fewer cores than the total number of cores on a node. With this parameter set to False, which is the default, the LRM is free to co-schedule multiple jobs on a given node if the number of cores requested by those jobs total less than the amount available on the node.

Return type

None

All constructor parameters are accessible as properties.
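The constraint process_count = node_count * processes_per_node and the inference of a missing value can be modeled as below. This is a sketch rather than the class's own logic: infer_counts is a hypothetical function, it assumes positive values, and rejecting combinations that do not divide evenly is an assumption made here.

```python
from typing import Optional, Tuple

Triple = Tuple[Optional[int], Optional[int], Optional[int]]


def infer_counts(node_count: Optional[int] = None,
                 process_count: Optional[int] = None,
                 processes_per_node: Optional[int] = None) -> Triple:
    """Apply process_count = node_count * processes_per_node."""
    n, p, ppn = node_count, process_count, processes_per_node
    if n is not None and p is not None and ppn is not None:
        # All three given: they must satisfy the constraint.
        if p != n * ppn:
            raise ValueError("process_count != node_count * processes_per_node")
    elif n is not None and ppn is not None:
        p = n * ppn
    elif p is not None and ppn is not None:
        if p % ppn != 0:
            raise ValueError("process_count is not a multiple of processes_per_node")
        n = p // ppn
    elif p is not None and n is not None:
        if p % n != 0:
            raise ValueError("process_count is not a multiple of node_count")
        ppn = p // n
    # With fewer than two values given, nothing can be inferred.
    return n, p, ppn


from_nodes = infer_counts(node_count=2, processes_per_node=4)
from_processes = infer_counts(process_count=8, processes_per_node=4)
underspecified = infer_counts(process_count=3)
```

The function returns the triple (node_count, process_count, processes_per_node) with any value it could not infer left as None.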

property computed_node_count: int

Returns or calculates a node count.

If the node_count property is specified, this method returns it. If not, a node count is calculated from process_count and processes_per_node.

Returns

An integer value with the specified or calculated node count.

property computed_process_count: int

Returns or calculates a process count.

If the process_count property is specified, this method returns it, otherwise it returns 1.

Returns

An integer value with either the value of process_count or one if the former is not specified.

property computed_processes_per_node: int

Returns or calculates the number of processes per node.

If the processes_per_node property is specified, this method returns it, otherwise calculates it based on process_count and node_count if possible, or defaults to 1.

Returns

An integer value with either the value of processes_per_node or one if the former cannot be determined.

property version: int

Returns the version of this ResourceSpec, which is 1 for this class.

JobStatus

class JobStatus(state, time=None, message=None, exit_code=None, metadata=None)[source]

Bases: object

A class containing details about job transitions to new states.

Parameters
  • state (JobState) – The JobState of this status.

  • time (Optional[float]) – The time, as would be returned by time.time(), at which the transition to the new state occurred. If not specified, the time when this JobStatus was instantiated will be used.

  • message (Optional[str]) – An optional message associated with the transition.

  • exit_code (Optional[int]) – An optional exit code for the job, if the job has completed.

  • metadata (Optional[Dict[str, object]]) – Optional metadata provided by the JobExecutor.

Return type

None

All constructor parameters are accessible as properties.

property final: bool

Returns the final property of the underlying state.

Returns

True if the state is final and False otherwise.

JobState

class JobState(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: bytes, Enum

An enumeration holding the possible job states.

The possible states are: NEW, QUEUED, ACTIVE, COMPLETED, FAILED, and CANCELED.

ACTIVE = 2

This state represents an actively running job.

CANCELED = 5

Represents a job that was canceled by a call to cancel().

COMPLETED = 3

This state represents a job that has completed successfully (i.e., with a zero exit code). In other words, a job with the executable set to /bin/false cannot enter this state.

FAILED = 4

Represents a job that has either completed unsuccessfully (with a non-zero exit code) or a job whose handling and/or execution by the backend has failed in some way.

NEW = 0

This is the state of a job immediately after the Job object is created and before being submitted to a JobExecutor.

QUEUED = 1

This is the state of the job after being accepted by a backend for execution, but before the execution of the job begins.

property final: bool

Returns True if this state is final.

A state is final when no other state transition can occur after that state has been reached.

Returns

True if this is a final state and False otherwise

is_greater_than(other)[source]

Defines a (strict) partial ordering on the states.

Not all states are comparable. State transitions cannot violate this ordering.

Parameters

other (JobState) – the other JobState to compare to

Returns

if this state is comparable with other, this method returns True or False depending on the relative order between this state and other. That is, True is returned if and only if this state can come after other. If this state is not comparable with other, this method returns None.

Return type

Optional[bool]
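A strict partial ordering with a three-valued result can be sketched with a stand-in enumeration. This is not the PSI/J implementation: it assumes that the non-final states form the chain NEW, QUEUED, ACTIVE, that any final state can follow any non-final state, and that two distinct final states are not comparable.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Stand-in for the job states (not the PSI/J enum)."""
    NEW = 0
    QUEUED = 1
    ACTIVE = 2
    COMPLETED = 3
    FAILED = 4
    CANCELED = 5

    @property
    def final(self) -> bool:
        return self in (State.COMPLETED, State.FAILED, State.CANCELED)

    def is_greater_than(self, other: "State") -> Optional[bool]:
        """Return True/False for comparable states and None otherwise."""
        if self is other:
            return False              # strict ordering: never greater than itself
        if self.final and other.final:
            return None               # assumption: distinct final states are incomparable
        if self.final:
            return True               # a final state follows any non-final state
        if other.final:
            return False
        return self.value > other.value


later = State.ACTIVE.is_greater_than(State.QUEUED)
earlier = State.QUEUED.is_greater_than(State.ACTIVE)
incomparable = State.FAILED.is_greater_than(State.COMPLETED)
```

Callers need to test the result with "is None" rather than by truthiness, since None and False mean different things here.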

Serialization

psij.serialize.Export

psij.serialize.Import

Miscellaneous

psij.utils module

class SingletonThread(name=None, daemon=False)[source]

Bases: Thread

A convenience class to return a thread that is guaranteed to be unique to this process.

This is intended to work with fork() to ensure that each os.getpid() value is associated with at most one thread. This is not safe. The safe thing, as pointed out by the fork() man page, is to not use fork() with threads. However, this is here in an attempt to make it slightly safer for when users really really want to take the risk against all advice.

This class is meant as an abstract class and should be used by subclassing and implementing the run method.

Instantiation of this class or one of its subclasses should be done through the get_instance() method rather than directly.

Parameters
  • name (Optional[str]) – An optional name for this thread.

  • daemon (bool) – A daemon thread does not prevent the process from exiting.

Return type

None

classmethod get_instance()[source]

Returns a started instance of this thread.

The instance is guaranteed to be unique for this process. This method also guarantees that a forked process will get a separate instance of this thread from the parent.

Return type

SingletonThread
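The per-process uniqueness described above can be obtained by keying instances on os.getpid(). The following is a minimal sketch of that pattern, not the PSI/J class: PerProcessThread, its registry attributes, and the stop method are hypothetical names, and the caveats about combining fork() with threads still apply.

```python
import os
import threading
from typing import Dict


class PerProcessThread(threading.Thread):
    """Thread with at most one started instance per process ID."""

    _registry: Dict[int, "PerProcessThread"] = {}
    _registry_lock = threading.RLock()

    def __init__(self) -> None:
        super().__init__(daemon=True)
        self._stop_event = threading.Event()

    def run(self) -> None:
        # Placeholder body: idle until asked to stop.
        self._stop_event.wait()

    def stop(self) -> None:
        self._stop_event.set()

    @classmethod
    def get_instance(cls) -> "PerProcessThread":
        """Return the started instance registered for the current process."""
        pid = os.getpid()
        with cls._registry_lock:
            instance = cls._registry.get(pid)
            if instance is None:
                # First call in this process: create and start a new thread.
                instance = cls()
                cls._registry[pid] = instance
                instance.start()
            return instance


first = PerProcessThread.get_instance()
second = PerProcessThread.get_instance()
```

A forked child sees a different value from os.getpid(), so its first call finds no registered instance and creates its own thread instead of reusing the parent's entry.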

psij.version module

This module stores the current version of this library.

Descriptor

class Descriptor(name, version, cls, aliases=None, nice_name=None)[source]

Bases: object

This class is used to enable PSI/J to discover and register executors and/or launchers.

Executors wanting to register with PSI/J must place an instance of this class in a global module list named __PSI_J_EXECUTORS__ or __PSI_J_LAUNCHERS__ in a module placed in the psij-descriptors namespace package. In other words, in order to automatically register an executor or launcher, a python file should be created inside a psij-descriptors package, such as:

<project_root>/
    src/
        psij-descriptors/
            descriptors_for_project.py

It is essential that the psij-descriptors package not contain an __init__.py file in order for Python to treat the package as a namespace package. This allows Python to combine multiple psij-descriptors directories into one, which, in turn, allows PSI/J to detect and load all descriptors that can be found in Python’s library search path.

The contents of descriptors_for_project.py could then be as follows:

from packaging.version import Version
from psij.descriptor import Descriptor

__PSI_J_EXECUTORS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

__PSI_J_LAUNCHERS__ = [
    Descriptor(name=<name>, version=Version(<version_str>),
               cls=<fqn_str>),
    ...
]

where <name> stands for the name used to instantiate the executor or launcher, <version_str> is a version string such as 1.0.2, and <fqn_str> is the fully qualified class name that implements the executor or launcher such as psij.executors.local.LocalJobExecutor.

Parameters
  • name (str) – The name of the executor or launcher. The automatic registration system will register the executor or launcher using this name. That is, the executor or launcher represented by this descriptor will be available for instantiation using either JobExecutor.get_instance() or Launcher.get_instance().

  • version (Version) – The version of the executor/launcher. Multiple versions can be registered under a single name.

  • cls (str) – A fully qualified name pointing to the class implementing an executor or launcher.

  • aliases (Optional[List[str]]) – An optional set of alternative names to make the executor available under as if its name was the alias.

  • nice_name (Optional[str]) – An optional string to use whenever a user-friendly name needs to be displayed to a user. For example, a nice name for pbs would be PBS or Portable Batch System. If not specified, the nice_name defaults to the value of the name parameter.

Return type

None

Exceptions

psij.exceptions module

A collection of exceptions used by PSI/J.

exception InvalidJobException(message, exception=None)[source]

Bases: Exception

An exception describing a problem with a job specification.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

exception SubmitException(message, exception=None, transient=False)[source]

Bases: Exception

An exception representing job submission issues.

This exception is thrown when the submit() call fails for a reason that is independent of the job that is being submitted.

Parameters
Return type

None

exception

Returns an optional underlying exception that can potentially be used for debugging purposes, but which should not, in general, be presented to an end-user.

message

Retrieves the message associated with this exception. This is a descriptive message that is sufficiently clear to be presented to an end-user.

transient

Returns True if the underlying condition that triggered this exception is transient. Jobs that cannot be submitted due to a transient exceptional condition have a chance of being successfully re-submitted at a later time, which is a suggestion to client code that it could re-attempt the operation that triggered this exception. However, the exact chances of success depend on many factors and are not guaranteed in any particular case. For example, a DNS resolution failure while attempting to connect to a remote service is a transient error since it can be reasonably assumed that DNS resolution is a persistent feature of an Internet-connected network. By contrast, an authentication failure due to an invalid username/password combination would not be a transient failure. While it may be possible for a temporary defect in a service to cause such a failure, under normal operating conditions such an error would persist across subsequent retries until correct credentials are used.
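Client code can use the transient flag to decide whether a retry is worthwhile. The sketch below shows one possible retry loop under stated assumptions: SubmitError is a stand-in exception rather than the PSI/J class, and submit_with_retries and flaky_submit are hypothetical helpers.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SubmitError(Exception):
    """Stand-in exception carrying a `transient` flag (not the PSI/J class)."""

    def __init__(self, message: str, transient: bool = False) -> None:
        super().__init__(message)
        self.transient = transient


def submit_with_retries(submit: Callable[[], T], attempts: int = 3,
                        delay: float = 0.0) -> T:
    """Call `submit`, retrying only when the failure is marked transient."""
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except SubmitError as error:
            # Give up on permanent errors and on the last attempt.
            if not error.transient or attempt == attempts:
                raise
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


# Two transient failures followed by a success.
outcomes = [SubmitError("name lookup failed", transient=True),
            SubmitError("name lookup failed", transient=True),
            "submitted"]


def flaky_submit() -> str:
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = submit_with_retries(flaky_submit, attempts=3)
```

A non-transient error is re-raised on the first occurrence, so invalid credentials are not retried in this model.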

Executors

The concrete executor implementations provided by this version of PSI/J Python are:

Cobalt

class CobaltJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Cobalt Workload Manager.

The Cobalt HPC Job Scheduler is used by Argonne’s ALCF systems.

Uses the qsub, qstat, and qdel commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #COBALT directives when submitting a job.

Custom attributes prefixed with cobalt. are rendered as long-form directives in the script. For example, setting custom_attributes['cobalt.m'] = 'co' results in the #COBALT --m=co directive being placed in the submit script.

Parameters
Return type

None

CobaltExecutorConfig

class CobaltExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Cobalt executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
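The role of queue_polling_error_threshold can be illustrated by counting consecutive failed polls. This is a sketch of the idea only: first_reported_failure is a hypothetical function, and treating the threshold as "at least this many consecutive failures" is an assumption, not a statement about the executor's code.

```python
from typing import Iterable, Optional


def first_reported_failure(poll_results: Iterable[bool],
                           error_threshold: int = 2) -> Optional[int]:
    """Return the index of the poll at which failures would be reported.

    `poll_results` holds True for a successful poll and False for a failed one.
    Returns None if the threshold is never reached.
    """
    consecutive = 0
    for index, ok in enumerate(poll_results):
        if ok:
            consecutive = 0          # a success resets the streak
            continue
        consecutive += 1
        if consecutive >= error_threshold:
            return index
    return None


isolated = first_reported_failure([False, True, False])
streak = first_reported_failure([True, False, True, False, False])
```

Isolated polling errors are absorbed in this model, and only a streak as long as the threshold is surfaced.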

Flux

class FluxJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A JobExecutor for the Flux scheduler.

The Flux resource manager framework is deployed and used on a per-user basis at many sites, and is slated to become the system-level resource manager at LLNL.

Uses Flux’s python library/bindings to submit, monitor, and manipulate jobs.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The FluxJobExecutor does not have any configuration options.

Return type

None

Local

class LocalJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs locally using subprocess.Popen.

This job executor is intended to be used either to run jobs directly on the same machine as the PSI/J library or for testing purposes.

Note

In Linux, attached jobs always appear to complete with a zero exit code regardless of the actual exit code.

Warning

Instantiation of a local executor from both parent process and a fork()-ed process is not guaranteed to work. In general, using fork() and multi-threading in Linux is unsafe, as suggested by the fork() man page. While PSI/J attempts to minimize problems that can arise when fork() is combined with threads (which are used by PSI/J), no guarantees can be made and the chances of unexpected behavior are high. Please do not use PSI/J with fork(). If you do, please be mindful that support for using PSI/J with fork() will be limited.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (JobExecutorConfig) – The LocalJobExecutor does not have any configuration options.

Return type

None

LSF

class LsfJobExecutor(url, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the LSF Workload Manager.

The IBM Spectrum LSF workload manager is the system resource manager on LLNL’s Sierra and Lassen, and ORNL’s Summit.

Uses the ‘bsub’, ‘bjobs’, and ‘bkill’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #BSUB directives when submitting a job.

Renders all custom attributes of the form lsf.<name> into the corresponding LSF directive. For example, setting job.spec.attributes.custom_attributes['lsf.core_isolation'] = '0' results in a #BSUB -core_isolation 0 directive being placed in the submit script.

Parameters

LsfExecutorConfig

class LsfExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the LSF executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of times consecutive queue polls have to fail in order for the executor to report them as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

PBS Pro

class PBSJobExecutor(url=None, config=None)[source]

Bases: GenericPBSJobExecutor

A JobExecutor for PBS Pro and friends.

This executor uses resource specifications specific to PBS Pro.

Parameters

PBS Classic

class PBSClassicJobExecutor(url=None, config=None)[source]

Bases: GenericPBSJobExecutor

A JobExecutor for classic PBS systems.

This executor uses resource specifications specific to Open PBS. Specifically, this executor uses the -l nodes=n:ppn=m way of specifying nodes, which differs from the scheme used by PBS Pro.

Parameters

Radical Pilot

class RPJobExecutor(url=None, config=None)[source]

Bases: JobExecutor

A job executor that runs jobs via the RADICAL Pilot system.

Parameters
  • url (Optional[str]) – Not used, but required by the spec for automatic initialization.

  • config (Optional[JobExecutorConfig]) – The RPJobExecutor does not have any configuration options.

Return type

None

Slurm

class SlurmJobExecutor(url=None, config=None)[source]

Bases: BatchSchedulerExecutor

A JobExecutor for the Slurm Workload Manager.

The Slurm Workload Manager is a widely used resource manager running on machines such as NERSC’s Perlmutter, as well as a variety of LLNL machines.

Uses the ‘sbatch’, ‘squeue’, and ‘scancel’ commands, respectively, to submit, monitor, and cancel jobs.

Creates a batch script with #SBATCH directives when submitting a job.

Renders all custom attributes set on a job’s attributes with a slurm. prefix into corresponding Slurm directives with long-form parameters. For example, job.spec.attributes.custom_attributes['slurm.qos'] = 'debug' causes a directive #SBATCH --qos=debug to be placed in the submit script.
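
The rendering rule described above can be sketched as follows. This is only an illustration of the mapping, not the executor's actual code; the function name render_slurm_directives is hypothetical.

```python
from typing import Dict, List


def render_slurm_directives(custom_attributes: Dict[str, object]) -> List[str]:
    """Turn 'slurm.<name>' custom attributes into '#SBATCH --<name>=<value>' lines."""
    prefix = 'slurm.'
    directives = []
    for key, value in custom_attributes.items():
        # Attributes without the 'slurm.' prefix are not meant for this executor.
        if key.startswith(prefix):
            directives.append(f'#SBATCH --{key[len(prefix):]}={value}')
    return directives


print(render_slurm_directives({'slurm.qos': 'debug', 'pbs.queue': 'batch'}))
```

Only the attribute carrying the slurm. prefix produces a directive; the other entry is ignored.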

SlurmExecutorConfig

class SlurmExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: BatchSchedulerExecutorConfig

A configuration class for the Slurm executor.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, a short delay can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that must fail before the executor reports the polling errors as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.

Executor Infrastructure

JobExecutor

class JobExecutor(url=None, config=None)[source]

Bases: ABC

An abstract base class for all JobExecutor implementations.

Parameters
  • url (Optional[str]) – The URL is a string that a JobExecutor implementation can interpret as the location of a backend.

  • config (Optional[JobExecutorConfig]) – A configuration specific to each JobExecutor implementation. This parameter is marked as optional so that concrete JobExecutor classes can be instantiated with no config parameter. However, concrete JobExecutor classes must pass a default configuration up the inheritance tree and ensure that the config parameter of the ABC constructor is non-null.

abstract attach(job, native_id)[source]

Attaches a job to a native job.

Parameters
  • job (Job) – A job to attach. The job must be in the NEW state.

  • native_id (str) – The native ID to attach to as returned by native_id.

Return type

None

abstract cancel(job)[source]

Cancels a job that has been submitted to the underlying executor implementation.

A successful return of this method only indicates that the request for cancellation has been communicated to the underlying implementation. The job will then be canceled at the discretion of the implementation, which may be at some later time. A successful cancellation is reflected in a change of status of the respective job to CANCELED. User code can synchronously wait until the CANCELED state is reached using job.wait(JobState.CANCELED) or even job.wait(), since the latter would wait for all final states, including JobState.CANCELED. In fact, it is recommended that job.wait() be used because it is entirely possible for the job to complete before the cancellation is communicated to the underlying implementation and before the client code receives the completion notification. In such a case, the job will never enter the CANCELED state and job.wait(JobState.CANCELED) would hang indefinitely.

Parameters

job (Job) – The job to be canceled.

Raises

SubmitException – Thrown if the request cannot be sent to the underlying implementation.

Return type

None
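
The race described above can be illustrated with a small simulation. ToyJob and ToyState are hypothetical stand-ins written for this sketch and do not reproduce the PSI/J classes or their method signatures; they only model a job whose final state cannot change once reached.

```python
import threading
from enum import Enum
from typing import Optional, Set


class ToyState(Enum):
    # Hypothetical stand-in for the job states relevant to this example.
    ACTIVE = 'ACTIVE'
    COMPLETED = 'COMPLETED'
    FAILED = 'FAILED'
    CANCELED = 'CANCELED'


FINAL_STATES = {ToyState.COMPLETED, ToyState.FAILED, ToyState.CANCELED}


class ToyJob:
    """Hypothetical job that tracks a state and lets callers wait on it."""

    def __init__(self) -> None:
        self.state = ToyState.ACTIVE
        self._cond = threading.Condition()

    def set_state(self, state: ToyState) -> None:
        with self._cond:
            # Once a final state is reached, later transitions are ignored.
            if self.state not in FINAL_STATES:
                self.state = state
                self._cond.notify_all()

    def wait(self, targets: Optional[Set[ToyState]] = None,
             timeout: Optional[float] = None) -> Optional[ToyState]:
        """Return the state once it is in targets (default: any final state), or None on timeout."""
        wanted = targets or FINAL_STATES
        with self._cond:
            reached = self._cond.wait_for(lambda: self.state in wanted, timeout)
            return self.state if reached else None


job = ToyJob()
job.set_state(ToyState.COMPLETED)  # the job finishes before the cancel request lands
job.set_state(ToyState.CANCELED)   # the late cancellation has no effect

# Waiting only for CANCELED times out; without a timeout it would block forever.
print(job.wait({ToyState.CANCELED}, timeout=0.1))
# Waiting for any final state returns immediately.
print(job.wait())
```

The first wait returns None after the timeout, while the second returns the COMPLETED state, which is why waiting for all final states is the safer choice after a cancel request.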

static get_executor_names()[source]

Returns a set of registered executor names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of executor names corresponding to the known executors.

Return type

Set[str]

static get_instance(name, version_constraint=None, url=None, config=None)[source]

Returns an instance of a JobExecutor.

Parameters
  • name (str) – The name of the executor to return. This must be one of the values returned by get_executor_names(). If the value of the name parameter is not one of the valid values returned by get_executor_names(), ValueError is raised.

  • version_constraint (Optional[str]) – A version constraint for the executor in the form ‘(’ <op> <version>[, <op> <version>[, …]] ‘)’, such as “( > 0.0.2, != 0.0.4)”.

  • url (Optional[str]) – An optional URL to pass to the JobExecutor instance.

  • config (Optional[JobExecutorConfig]) – An optional configuration to pass to the instance.

Returns

A JobExecutor.

Return type

JobExecutor
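
To make the version_constraint format concrete, the following sketch parses a constraint string and checks a version against it. It is an illustration only and not how get_instance() evaluates constraints; the helper names are hypothetical, and it assumes plain dotted integer versions with the same number of components.

```python
import operator
import re
from typing import Callable, List, Tuple

Version = Tuple[int, ...]

_OPS = {
    '==': operator.eq, '!=': operator.ne,
    '>=': operator.ge, '<=': operator.le,
    '>': operator.gt, '<': operator.lt,
}
# Two-character operators are listed first so that '>=' is not read as '>'.
_CLAUSE = re.compile(r'\s*(==|!=|>=|<=|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*')


def _parse_version(text: str) -> Version:
    # Only plain dotted integers such as '0.0.2' are handled in this sketch.
    return tuple(int(part) for part in text.split('.'))


def parse_constraint(constraint: str) -> List[Tuple[Callable, Version]]:
    """Parse '( > 0.0.2, != 0.0.4)' into a list of (comparison, version) pairs."""
    text = constraint.strip()
    if not (text.startswith('(') and text.endswith(')')):
        raise ValueError(f'Constraint must be parenthesized: {constraint!r}')
    clauses = []
    for clause in text[1:-1].split(','):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f'Invalid clause: {clause!r}')
        clauses.append((_OPS[match.group(1)], _parse_version(match.group(2))))
    return clauses


def satisfies(version: str, constraint: str) -> bool:
    """Return True if the version meets every clause of the constraint."""
    parsed = _parse_version(version)
    return all(op(parsed, wanted) for op, wanted in parse_constraint(constraint))


print(satisfies('0.0.3', '( > 0.0.2, != 0.0.4)'))
print(satisfies('0.0.4', '( > 0.0.2, != 0.0.4)'))
```

All clauses must hold at once, so 0.0.3 is accepted while 0.0.4 is rejected by the != clause.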

abstract list()[source]

List native IDs of all jobs known to the backend.

This method is meant to return a list of native IDs for jobs submitted to the backend by any means, not necessarily through this executor or through PSI/J.

Return type

List[str]

property name: str

Returns the name of this executor.

static register_executor(desc, root)[source]

Registers a JobExecutor class through a Descriptor.

The class can then be later instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the executor to be registered.

  • root (str) – A filesystem path from which the implementation of the executor is to be loaded. Executors from other locations, even if under the correct package, will not be registered by this method. If an executor implementation is only available under a different root path, this method will throw an exception.

Return type

None

set_job_status_callback(cb)[source]

Registers a status callback with this executor.

The callback can either be a subclass of JobStatusCallback or a procedure accepting two arguments: a Job and a JobStatus.

The callback will be invoked whenever a status change occurs for any of the jobs submitted to this job executor, whether they were submitted with an individual job status callback or not. To remove the callback, set it to None.

Parameters

cb (Union[JobStatusCallback, Callable[[Job, JobStatus], None]]) – An instance of JobStatusCallback or a callable with two parameters: job of type Job and job_status of type JobStatus.

Return type

None
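
The relationship between executor-level and job-level callbacks can be sketched with a toy dispatcher. ToyExecutor is a hypothetical stand-in, jobs and statuses are represented by plain strings, and the order in which the two callbacks fire is an assumption of this sketch.

```python
from typing import Callable, Optional

StatusCallback = Callable[[str, str], None]


class ToyExecutor:
    """Hypothetical stand-in showing how the two callback levels coexist."""

    def __init__(self) -> None:
        self._cb: Optional[StatusCallback] = None

    def set_job_status_callback(self, cb: Optional[StatusCallback]) -> None:
        # Passing None removes the executor-level callback.
        self._cb = cb

    def notify(self, job_id: str, status: str,
               job_cb: Optional[StatusCallback] = None) -> None:
        # A job's own callback does not suppress the executor-level one.
        if job_cb is not None:
            job_cb(job_id, status)
        if self._cb is not None:
            self._cb(job_id, status)


events = []
executor = ToyExecutor()
executor.set_job_status_callback(lambda job, status: events.append(('executor', job, status)))
executor.notify('job-1', 'ACTIVE', job_cb=lambda job, status: events.append(('job', job, status)))
executor.set_job_status_callback(None)
executor.notify('job-1', 'COMPLETED')
print(events)
```

After the callback is set to None, the second status change is no longer reported at the executor level.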

abstract submit(job)[source]

Submits a Job to the underlying implementation.

Successful return of this method indicates that the job has been sent to the underlying implementation and all changes in the job status, including failures, are reported using notifications. Conversely, if one of the two possible exceptions is thrown, then the job has not been successfully sent to the underlying implementation, the job status remains unchanged, and no status notifications about the job will be fired.

A successful return of this method guarantees that the job’s native_id property is set.

Raises
  • InvalidJobException – Thrown if the job specification cannot be understood. This exception is fatal in that submitting another job with the exact same details will also fail with an InvalidJobException. In principle, the underlying implementation / LRM is the entity ultimately responsible for interpreting a specification and reporting any errors associated with it. However, in many cases, this reporting may come after a significant delay. In the interest of failing fast, library implementations should make an effort to validate specifications early and throw this exception as soon as possible if that validation fails.

  • SubmitException – Thrown if the request cannot be sent to the underlying implementation. Unlike InvalidJobException, this exception can occur for reasons that are transient.

Parameters

job (Job) – The job to be submitted.

Return type

None
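
The difference between the two exceptions suggests a simple client-side policy: retry on SubmitException, which may be transient, and give up immediately on InvalidJobException. The sketch below uses local stand-in exception classes so that it runs without PSI/J installed; submit_with_retry and its parameters are hypothetical.

```python
import time
from typing import Callable


class InvalidJobException(Exception):
    """Local stand-in: the job specification itself is wrong."""


class SubmitException(Exception):
    """Local stand-in: the request could not be sent, possibly for transient reasons."""


def submit_with_retry(submit: Callable[[], None], attempts: int = 3,
                      delay: float = 1.0) -> None:
    """Call submit(), retrying only when a SubmitException is raised."""
    for attempt in range(1, attempts + 1):
        try:
            submit()
            return
        except InvalidJobException:
            # Resubmitting the same specification would fail again, so give up.
            raise
        except SubmitException:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

In real code, the submit argument would wrap a call such as executor.submit(job).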

property version: packaging.version.Version

Returns the version of this executor.

JobExecutorConfig

class JobExecutorConfig(launcher_log_file=None, work_directory=None)[source]

Bases: object

An abstract configuration class for JobExecutor instances.

Parameters
  • launcher_log_file (Optional[Path]) – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.

  • work_directory (Optional[Path]) – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

Return type

None

DEFAULT: JobExecutorConfig = <psij.job_executor_config.JobExecutorConfig object>

A default JobExecutorConfig used when none is specified.

DEFAULT_WORK_DIRECTORY = PosixPath('/home/docs/.psij/work')

The default work directory when a work directory is not explicitly specified.

property launcher_log_file: Optional[Path]

Configure the executor’s launcher log file.

Parameters

launcher_log_file – If specified, log messages from launcher scripts (including output from pre- and post- launch scripts) will be directed to this file.

property work_directory: Path

Configure the executor’s work directory.

Parameters

work_directory – A directory where submit scripts and auxiliary job files will be generated. In a cluster, this directory needs to point to a directory on a shared filesystem. This is so that the exit code file, likely written on a service node, can be accessed by PSI/J, likely running on a head node.

psij.executors.batch.batch_scheduler_executor module

class BatchSchedulerExecutor(url=None, config=None)[source]

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated using generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. Concrete implementations of a batch scheduler executor can instantiate this class and delegate the submit script generation to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file to which the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>,
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <list of str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.

Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.
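
The two essential actions can be sketched with a hand-written generator. This is not the TemplatedScriptGenerator and not any shipped template; it assumes a Slurm-like scheduler that exports the native job id as SLURM_JOB_ID inside the running script, and the work directory shown is hypothetical.

```python
import io
import shlex
from typing import Any, Dict, IO


def generate_submit_script(job: Any, context: Dict[str, Any], submit_file: IO[str]) -> None:
    """Write a minimal bash submit script that performs the two required actions."""
    psij = context['psij']
    script_dir = psij['script_dir']
    # The launch command is a list of strings; quote each part for the shell.
    command = shlex.join(psij['launch_command'])
    submit_file.write('#!/bin/bash\n')
    # 1. Send the output of the executable part to <script_dir>/<native_id>.out.
    submit_file.write(f'{command} > "{script_dir}/$SLURM_JOB_ID.out" 2>&1\n')
    # 2. Store the exit code of the launch command in <script_dir>/<native_id>.ec.
    submit_file.write(f'echo "$?" > "{script_dir}/$SLURM_JOB_ID.ec"\n')


context = {
    'job': None,  # a real Job would go here
    'psij': {
        'lib': {},
        'launch_command': ['/bin/echo', 'hello world'],
        'script_dir': '/tmp/psij-work',  # hypothetical directory
    },
}
buffer = io.StringIO()
generate_submit_script(None, context, buffer)
print(buffer.getvalue())
```

The exit code is captured on the line immediately after the launch command so that no other command can overwrite it.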

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.
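
The handling of a job that has disappeared from the queue can be sketched as below. The function name and the use of plain strings for states are assumptions of this sketch; the base class performs the equivalent steps internally, and this example does not handle a missing exit code file.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def final_state_from_files(script_dir: Path, native_id: str) -> Tuple[str, Optional[str]]:
    """Return ('COMPLETED', None) or ('FAILED', message) for a job absent from the queue."""
    ec_file = script_dir / f'{native_id}.ec'
    out_file = script_dir / f'{native_id}.out'
    exit_code = int(ec_file.read_text().strip())
    if exit_code == 0:
        return 'COMPLETED', None
    # A nonzero exit code means failure; the script output may explain why.
    message = out_file.read_text().strip() if out_file.exists() else None
    return 'FAILED', message


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    (work / '42.ec').write_text('1\n')
    (work / '42.out').write_text('segmentation fault\n')
    print(final_state_from_files(work, '42'))
```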

attach(job, native_id)[source]

Attaches a job to a native job.

Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.

Parameters
  • job (Job) – The PSI/J job to attach.

  • native_id (str) – The id of the batch scheduler job to attach to.

Return type

None

cancel(job)[source]

Cancels a job if it has not otherwise completed.

A command is constructed using get_cancel_command() and executed in order to cancel the job. Also see JobExecutor.cancel().

Parameters

job (Job) – The job to be canceled.

Return type

None

abstract generate_submit_script(job, context, submit_file)[source]

Called to generate a submit script for a job.

Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.

Parameters
  • job (Job) – The job to be submitted.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.

  • submit_file (IO[str]) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

abstract get_cancel_command(native_id)[source]

Constructs a command to cancel a batch scheduler job.

Concrete implementations of batch scheduler executors must override this method.

Parameters

native_id (str) – The native id of the job being canceled.

Returns

A list of strings representing the command and arguments to execute in order to cancel the job, such as [‘qdel’, native_id].

Return type

List[str]

abstract get_list_command()[source]

Constructs a command to retrieve the list of jobs known to the LRM for the current user.

Concrete implementations of batch scheduler executors must override this method. Upon running the command, the output can be parsed with parse_list_output().

Returns

A list of strings representing the executable and arguments to invoke in order to obtain the list of jobs the LRM knows for the current user.

Return type

List[str]

abstract get_status_command(native_ids)[source]

Constructs a command to retrieve the status of a list of jobs.

Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queueing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by parse_status_output().

Parameters
  • native_ids (Collection[str]) – A collection of native ids corresponding to the jobs whose status is sought.

Returns

A list of strings representing the command and arguments to execute in order to get the status of the jobs.

Return type

List[str]
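
A status command restricted to the registered jobs might look like the sketch below for a Slurm-like system. It is written as a standalone function for illustration and is not the command used by the shipped Slurm executor; it relies on the squeue options --noheader, --format, and --jobs.

```python
from typing import Collection, List


def get_status_command(native_ids: Collection[str]) -> List[str]:
    """Build a squeue command that queries only the given job ids."""
    # Sorting keeps the command deterministic regardless of the collection type.
    ids = ','.join(sorted(native_ids))
    # %i is the job id and %t is the compact job state.
    return ['squeue', '--noheader', '--format=%i %t', f'--jobs={ids}']


print(get_status_command({'1002', '1001'}))
```

Limiting the query with --jobs avoids listing the entire queue on every poll.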

abstract get_submit_command(job, submit_file_path)[source]

Constructs a command to submit a job to a batch scheduler.

Concrete implementations of batch scheduler executors must override this method.

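Parameters
  • job (Job) – The job to be submitted.

  • submit_file_path (Path) – The path to the generated submit script.
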
Returns

A list of strings representing the command and arguments to execute in order to submit the job, such as [‘qsub’, str(submit_file_path)].

Return type

List[str]

abstract job_id_from_submit_output(out)[source]

Extracts a native job id from the output of the submit command.

Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.

Parameters

out (str) – The output from the submit command.

Returns

A string representing the native id of the newly submitted job.

Return type

str
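
As an illustration, the sketch below extracts the id from submit output of the form "Submitted batch job 12345", which is the message printed by Slurm's sbatch by default. It is a standalone function, not the shipped implementation, and the ValueError for unrecognized output is a choice made for this sketch.

```python
import re


def job_id_from_submit_output(out: str) -> str:
    """Extract the native job id from output such as 'Submitted batch job 12345'."""
    match = re.search(r'Submitted batch job (\S+)', out)
    if match is None:
        raise ValueError(f'Unrecognized submit output: {out!r}')
    return match.group(1)


print(job_id_from_submit_output('Submitted batch job 12345\n'))
```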

list()[source]

Returns a list of jobs known to the underlying implementation.

See JobExecutor.list(). The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.

Return type

List[str]

parse_list_output(out)[source]

Parses the output of the command obtained from get_list_command().

The default implementation of this method assumes that the output has no header and consists of native IDs, one per line, possibly surrounded by whitespace. Concrete implementations should override this method if a different format is expected.

Parameters

out (str) – The output from the “list” command as returned by get_list_command().

Returns

A list of strings representing the native IDs of the jobs known to the LRM for the current user.

Return type

List[str]
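
The default behavior described above amounts to a few lines. The sketch below is an approximation written for illustration; skipping blank lines is an assumption and may differ from the actual default implementation.

```python
from typing import List


def parse_list_output(out: str) -> List[str]:
    """Return one native id per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in out.splitlines() if line.strip()]


print(parse_list_output('  1001\n1002  \n\n'))
```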

abstract parse_status_output(exit_code, out)[source]

Parses the output of a job status command.

Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by get_status_command().

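Parameters
  • exit_code (int) – The exit code from the status command.

  • out (str) – The output from the status command.
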
Returns

A dictionary mapping native job ids to JobStatus objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.

Return type

Dict[str, JobStatus]
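
A parser for status output with one "<id> <state>" pair per line could be sketched as follows. Plain strings stand in for JobStatus objects so that the example runs without PSI/J, the state names other than CANCELED are assumptions, and the mapping covers only a few of Slurm's compact state codes.

```python
from typing import Dict

# Partial, illustrative mapping from Slurm compact state codes to state names.
_STATE_MAP = {
    'PD': 'QUEUED',
    'R': 'ACTIVE',
    'CD': 'COMPLETED',
    'F': 'FAILED',
    'CA': 'CANCELED',
}


def parse_status_output(exit_code: int, out: str) -> Dict[str, str]:
    """Map each native id in the status output to a state name."""
    if exit_code != 0:
        raise RuntimeError(f'Status command failed with exit code {exit_code}: {out}')
    statuses = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        native_id, code = fields
        # Unknown codes are passed through unchanged rather than guessed at.
        statuses[native_id] = _STATE_MAP.get(code, code)
    return statuses


print(parse_status_output(0, '1001 R\n1002 PD\n'))
```

Jobs that do not appear in the output are simply absent from the result, leaving the base class to consult the exit code file.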

abstract process_cancel_command_output(exit_code, out)[source]

Handle output from a failed cancel command.

The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:

1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a SubmitException, since it is only invoked on “other” errors.

2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.

Parameters
  • exit_code (int) – The exit code from the cancel command.

  • out (str) – The output from the cancel command.

Raises
  • InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.

  • SubmitException – Raised for all other reasons.

Return type

None
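
The second option can be sketched as below. The exception classes are local stand-ins and the marker strings are hypothetical; a real implementation would match the messages actually printed by its scheduler's cancel command.

```python
class InvalidJobStateError(Exception):
    """Local stand-in: the job had already finished when cancel was invoked."""


class SubmitException(Exception):
    """Local stand-in: the cancel request failed for some other reason."""


# Hypothetical fragments of scheduler messages that indicate a finished job.
_INVALID_STATE_MARKERS = ('already completed', 'invalid job state')


def process_cancel_command_output(exit_code: int, out: str) -> None:
    """Raise the exception that matches the output of a failed cancel command."""
    lowered = out.lower()
    if any(marker in lowered for marker in _INVALID_STATE_MARKERS):
        raise InvalidJobStateError(out)
    raise SubmitException(f'Cancel command failed with exit code {exit_code}: {out}')
```

Because this method is only called after a failed cancel command, it always raises one of the two exceptions.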

submit(job)[source]

See JobExecutor.submit().

Parameters

job (Job) – The job to be submitted.

Return type

None

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – The interval, in seconds, at which the batch scheduler queue is polled for updates to jobs.

  • initial_queue_polling_delay (int) – The time to wait before polling the queue for the first time. For quick tests that only submit a short job that completes nearly instantly, or for jobs that fail very quickly, a short delay can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that must fail before the executor reports the polling errors as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
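
The role of queue_polling_error_threshold can be illustrated with a small counter. PollFailureTracker is hypothetical, and whether the real executor reports failures exactly when the count reaches the threshold is an assumption of this sketch.

```python
class PollFailureTracker:
    """Hypothetical counter of consecutive failed queue polls."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive = 0

    def record(self, poll_ok: bool) -> bool:
        """Record a poll result; return True when failures should be reported."""
        if poll_ok:
            # Any successful poll resets the count.
            self.consecutive = 0
            return False
        self.consecutive += 1
        return self.consecutive >= self.threshold


tracker = PollFailureTracker(threshold=2)
print([tracker.record(ok) for ok in (False, True, False, False)])
```

A single failed poll followed by a success is tolerated; only an unbroken run of failures reaches the threshold.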

exception InvalidJobStateError[source]

Bases: Exception

An exception that signals that a job cannot be canceled because it has already completed.

check_status_exit_code(command, exit_code, out)[source]

Check if exit_code is nonzero and, if so, raise a RuntimeError.

This function produces a somewhat user-friendly exception message that combines the command that was run with its output.

Parameters
  • command (str) – The command that was run. This is only used to format the error message.

  • exit_code (int) – The exit code returned by running the command.

  • out (str) – The output produced by command.

Return type

None
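
The behavior described above can be approximated as follows. The exact wording of the error message is an assumption of this sketch and may differ from the message produced by the library.

```python
def check_status_exit_code(command: str, exit_code: int, out: str) -> None:
    """Raise a RuntimeError that combines the command and its output if it failed."""
    if exit_code != 0:
        raise RuntimeError(
            f'Status command "{command}" failed with exit code {exit_code}. Output:\n{out}'
        )


check_status_exit_code('squeue', 0, '')  # a zero exit code raises nothing
try:
    check_status_exit_code('squeue', 1, 'connection timed out')
except RuntimeError as error:
    print(error)
```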

BatchSchedulerExecutor

class BatchSchedulerExecutor(url=None, config=None)[source]

Bases: JobExecutor

A base class for batch scheduler executors.

This class implements a generic JobExecutor that interacts with batch schedulers. There are two main components to the executor: job submission and queue polling. Submission is implemented by generating a submit script which is then fed to the queuing system submit command.

The submit script is generated using a generate_submit_script(). An implementation of this functionality based on Mustache/Pystache (see https://mustache.github.io/ and https://pypi.org/project/pystache/) exists in TemplatedScriptGenerator. This class can be instantiated by concrete implementations of a batch scheduler executor and the submit script generation can be delegated to that instance, which has a method whose signature matches that of generate_submit_script(). Besides an opened file which points to where the contents of the submit script are to be written, the parameters to generate_submit_script() are the Job that is being submitted and a context, which is a dictionary with the following structure:

{
    'job': <the job being submitted>
    'psij': {
        'lib': <dict; function library>,
        'launch_command': <str; launch command>,
        'script_dir': <str; directory where the submit script is generated>
    }
}

The script directory is a directory (typically ~/.psij/work) where submit scripts are written; it is also used for auxiliary files, such as the exit code file (see below) or the script output file.

The launch command is a list of strings which the script generator should render as the command to execute. It wraps the job executable in the proper Launcher.

The function library is a dictionary mapping function names to functions for all public functions in the template_function_library module.

The submit script must perform two essential actions:

1. redirect the output of the executable part of the script to the script output file, which is a file in <script_dir> named <native_id>.out, where <native_id> is the id given to the job by the queuing system.

2. store the exit code of the launch command in the exit code file named <native_id>.ec, also inside <script_dir>.

Additionally, where appropriate, the submit script should set the environment variable named PSIJ_NODEFILE to point to a file containing a list of nodes that are allocated for the job, one per line, with a total number of lines matching the process count of the job.

Once the submit script is generated, the executor renders the submit command using get_submit_command() and executes it. Its output is then parsed using job_id_from_submit_output() to retrieve the native_id of the job. Subsequently, the job is registered with the queue polling thread.

The queue polling thread regularly polls the batch scheduler queue for updates to job states. It builds the command for polling the queue using get_status_command(), which takes a list of native_id strings corresponding to all registered jobs. Implementations are strongly encouraged to restrict the query of job states to the specified jobs in order to reduce the load on the queuing system. The output of the status command is then parsed using parse_status_output() and the status of each job is updated accordingly. If the status of a registered job is not found in the output of the queue status command, it is assumed completed (or failed, depending on its exit code), since most queuing systems automatically purge completed jobs from their databases after a short period of time. The exit code is read from the exit code file, as described above. If the exit code value is not zero, the job is assumed failed and an attempt is made to read an error message from the script output file.

Parameters
attach(job, native_id)[source]

Attaches a job to a native job.

Attempts to connect job to a native job with native_id such that the job correctly reflects updates to the status of the native job. If the native job was previously submitted using this executor (hence having an exit code file and a script output file), the executor will attempt to retrieve the exit code and errors from the job. Otherwise, it may be impossible for the executor to distinguish between a failed and successfully completed job.

Parameters
  • job (Job) – The PSI/J job to attach.

  • native_id (str) – The id of the batch scheduler job to attach to.

Return type

None

cancel(job)[source]

Cancels a job if it has not otherwise completed.

A command is constructed using get_cancel_command() and executed in order to cancel the job. Also see cancel().

Parameters

job (Job) –

Return type

None

abstract generate_submit_script(job, context, submit_file)[source]

Called to generate a submit script for a job.

Concrete implementations of batch scheduler executors must override this method in order to generate a submit script for a job.

Parameters
  • job (Job) – The job to be submitted.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see the description of this class.

  • submit_file (IO[str]) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

abstract get_cancel_command(native_id)[source]

Constructs a command to cancel a batch scheduler job.

Concrete implementations of batch scheduler executors must override this method.

Parameters

native_id (str) – The native id of the job being cancelled.

Returns

A list of strings representing the command and arguments to execute in order to cancel the job, such as, e.g., [‘qdel’, native_id].

Return type

List[str]

abstract get_list_command()[source]

Constructs a command to retrieve the list of jobs known to the LRM for the current user.

Concrete implementations of batch scheduler executors must override this method. Upon running the command, the output can be parsed with parse_list_output().

Returns

A list of strings representing the executable and arguments to invoke in order to obtain the list of jobs the LRM knows for the current user.

Return type

List[str]

abstract get_status_command(native_ids)[source]

Constructs a command to retrieve the status of a list of jobs.

Concrete implementations of batch scheduler executors must override this method. In order to prevent overloading the queueing system, concrete implementations are strongly encouraged to return a command that only queries for the status of the indicated jobs. The command returned by this method should produce an output that is understood by parse_status_output().

Parameters
  • jobs – A collection of native ids corresponding to the jobs whose status is sought.

  • native_ids (Collection[str]) –

Returns

A list of strings representing the command and arguments to execute in order to get the status of the jobs.

Return type

List[str]

abstract get_submit_command(job, submit_file_path)[source]

Constructs a command to submit a job to a batch scheduler.

Concrete implementations of batch scheduler executors must override this method.

Parameters
Returns

A list of strings representing the command and arguments to execute in order to submit the job, such as [‘qsub’, str(submit_file_path)].

Return type

List[str]

abstract job_id_from_submit_output(out)[source]

Extracts a native job id from the output of the submit command.

Concrete implementations of batch scheduler executors must override this method. This method is only invoked if the submit command completes with a zero exit code, so implementations of this method do not need to determine whether the output reflects an error from the submit command.

Parameters

out (str) – The output from the submit command.

Returns

A string representing the native id of the newly submitted job.

Return type

str

list()[source]

Returns a list of jobs known to the underlying implementation.

See list(). The returned list is a list of native_id strings representing jobs known to the underlying batch scheduler implementation, whether submitted through this executor or not. Implementations are encouraged to restrict the results to jobs accessible by the current user.

Return type

List[str]

parse_list_output(out)[source]

Parses the output of the command obtained from get_list_command().

The default implementation of this method assumes that the output has no header and consists of native IDs, one per line, possibly surrounded by whitespace. Concrete implementations should override this method if a different format is expected.

Parameters

out (str) – The output from the “list” command as returned by get_list_command().

Returns

A list of strings representing the native IDs of the jobs known to the LRM for the current user.

Return type

List[str]

abstract parse_status_output(exit_code, out)[source]

Parses the output of a job status command.

Concrete implementations of batch scheduler executors must override this method. The output is meant to have been produced by the command generated by get_status_command().

Parameters
Returns

A dictionary mapping native job ids to JobStatus objects. The implementation of this method need not process the exit code file or the script output file since it is done by the base BatchSchedulerExecutor implementation.

Return type

Dict[str, JobStatus]

abstract process_cancel_command_output(exit_code, out)[source]

Handle output from a failed cancel command.

The main purpose of this method is to help distinguish between the cancel command failing due to an invalid job state (such as the job having completed before the cancel command was invoked) and other types of errors. Since job state errors are ignored, there are two options:

1. Instruct the cancel command to not fail on invalid state errors and have this method always raise a SubmitException, since it is only invoked on “other” errors.

2. Have the cancel command fail on both invalid state errors and other errors and interpret the output from the cancel command to distinguish between the two and raise the appropriate exception.

Parameters
  • exit_code (int) – The exit code from the cancel command.

  • out (str) – The output from the cancel command.

Raises
  • InvalidJobStateError – Raised if the job cancellation has failed because the job was in a completed or failed state at the time when the cancellation command was invoked.

  • SubmitException – Raised for all other reasons.

Return type

None
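
The second option above can be sketched as follows. The exception classes are local stand-ins so that the example is self-contained, and the matched message text is hypothetical; a real executor would import the PSI/J exceptions and match whatever its scheduler actually prints.

```python
class InvalidJobStateError(Exception):
    """Local stand-in for the PSI/J exception of the same name."""


class SubmitException(Exception):
    """Local stand-in for the PSI/J exception of the same name."""


def process_cancel_command_output_sketch(exit_code: int, out: str) -> None:
    """Translate the output of a failed cancel command into an exception."""
    # Hypothetical wording; each scheduler reports invalid states differently.
    if 'job already finished' in out.lower():
        raise InvalidJobStateError()
    # Anything else is treated as a genuine error.
    raise SubmitException(
        'cancel failed (exit code %d): %s' % (exit_code, out))


try:
    process_cancel_command_output_sketch(1, 'Error: Job already finished')
except InvalidJobStateError:
    print('job state error; would be ignored by the caller')
```

With the first option, the function body would reduce to the final raise statement, since the method would never see invalid state errors.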

submit(job)[source]

See submit().

Parameters

job (Job) –

Return type

None

BatchSchedulerExecutorConfig

class BatchSchedulerExecutorConfig(launcher_log_file=None, work_directory=None, queue_polling_interval=30, initial_queue_polling_delay=2, queue_polling_error_threshold=2, keep_files=False)[source]

Bases: JobExecutorConfig

A base configuration class for BatchSchedulerExecutor implementations.

When subclassing BatchSchedulerExecutor, specific configuration classes inheriting from this class should be defined, even if empty.

Parameters
  • launcher_log_file (Optional[Path]) – See JobExecutorConfig.

  • work_directory (Optional[Path]) – See JobExecutorConfig.

  • queue_polling_interval (int) – an interval, in seconds, at which the batch scheduler queue will be polled for updates to jobs.

  • initial_queue_polling_delay (int) – the time to wait before polling the queue for the first time; for quick tests that only submit a short job that completes nearly instantly or for jobs that fail very quickly, this can dramatically reduce the time taken to get the necessary job status update.

  • queue_polling_error_threshold (int) – The number of consecutive queue polls that have to fail before the executor reports the failures as job failures.

  • keep_files (bool) – Whether to keep submit files and auxiliary job files (exit code and output files) after a job has completed.
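
The role of queue_polling_error_threshold can be pictured with a small counter. This is only a sketch under stated assumptions: the class PollFailureCounter is hypothetical, and whether the executor reports on reaching or on exceeding the threshold is not specified here, so the comparison used below is a guess.

```python
class PollFailureCounter:
    """Hypothetical helper that tracks consecutive failed queue polls."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record a poll result; return True if failures should be reported."""
        if success:
            # A successful poll resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Assumption: report once the streak reaches the threshold.
        return self.consecutive_failures >= self.threshold


counter = PollFailureCounter(threshold=2)
results = [counter.record(ok) for ok in (False, False, True, False)]
print(results)
```

A higher threshold makes the executor more tolerant of transient scheduler errors at the cost of reporting real failures later.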

check_status_exit_code

check_status_exit_code(command, exit_code, out)[source]

Check if exit_code is nonzero and, if so, raise a RuntimeError.

This function produces a somewhat user-friendly exception message that combines the command that was run with its output.

Parameters
  • command (str) – The command that was run. This is only used to format the error message.

  • exit_code (int) – The exit code returned by running the command.

  • out (str) – The output produced by command.

Return type

None
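
A function with this contract could look roughly like the sketch below. The function name and the exact wording of the error message are assumptions; only the behaviour (raise RuntimeError on a nonzero exit code, mentioning the command and its output) follows the description above.

```python
def check_status_exit_code_sketch(command: str, exit_code: int, out: str) -> None:
    """Raise RuntimeError if exit_code is nonzero; otherwise do nothing."""
    if exit_code != 0:
        # Combine the command and its output into a single message.
        raise RuntimeError(
            'Command "%s" failed with exit code %d. Output: %s'
            % (command, exit_code, out))


check_status_exit_code_sketch('qstat', 0, '')  # returns silently
try:
    check_status_exit_code_sketch('qstat', 1, 'connection refused')
except RuntimeError as ex:
    message = str(ex)
    print(message)
```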

SubmitScriptGenerator

class SubmitScriptGenerator(config)[source]

Bases: ABC

A base class representing a submit script generator.

A submit script generator is used to render a Job (together with all its properties, including JobSpec, ResourceSpec, etc.) into a submit script specific to a certain batch scheduler.

Parameters

config (JobExecutorConfig) – An executor configuration containing configuration properties for the executor that is attempting to use this generator. Submit script generators are meant to work in close cooperation with batch scheduler job executors, hence the sharing of a configuration mechanism.

Return type

None

generate_submit_script(job, context, out)[source]

Generates a job submit script.

Concrete implementations of submit script generators must implement this method. Its purpose is to generate the content of the submit script. For an extensive explanation of the mechanism behind this process, see BatchSchedulerExecutor.

Parameters
  • job (Job) – The job for which the submit script is to be generated.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted. For details, see BatchSchedulerExecutor.

  • out (IO[str]) – An opened file-like object to which the contents of the submit script should be written.

Return type

None

TemplatedScriptGenerator

class TemplatedScriptGenerator(config, template_path, escape=<function bash_escape>)[source]

Bases: SubmitScriptGenerator

A submit script generator based on Mustache templates.

This script generator uses Pystache (https://pypi.org/project/pystache/), which is a Python implementation of the Mustache templating language (https://mustache.github.io/).

Parameters
  • config (JobExecutorConfig) – A configuration, which is passed to the base class.

  • template_path (Path) – The path to a Mustache template.

  • escape (Callable[[object], str]) – An escape function to use for escaping values. By default, a function that escapes strings for use in bash scripts is used.

Return type

None

generate_submit_script(job, context, out)[source]

See generate_submit_script().

Renders a submit script using the template specified when this generator was constructed.

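Parameters
  • job (Job) – The job for which the submit script is to be generated.

  • context (Dict[str, object]) – A dictionary containing information about the context in which the job is being submitted.

  • out (IO[str]) – An opened file-like object to which the contents of the submit script should be written.
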
Return type

None

psij.executors.batch.template_function_library module

ALL: Dict[str, Callable[[...], Any]] = {'walltime_to_minutes': <function walltime_to_minutes>}

A dictionary of all template-accessible functions for the batch executor templating mechanism.

The dictionary maps function names to their implementations. All public functions in this module are present in this dictionary, and their keys are the same as their names.

walltime_to_minutes(walltime)[source]

Converts a walltime object to a number of minutes.

The walltime can be a Python timedelta, an integer (interpreted directly as a number of minutes), or a string in one of the formats HH:MM:SS, HH:MM, or MM.

Parameters

walltime (Union[timedelta, int, str]) – the walltime to convert

Returns

The number of minutes represented by the walltime parameter.

Return type

int
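
The accepted input forms can be illustrated with the following sketch. It is not the module's implementation: the function name is hypothetical, and discarding leftover seconds (rather than rounding) is an assumption, since the rounding rule is not stated here.

```python
from datetime import timedelta
from typing import Union


def walltime_to_minutes_sketch(walltime: Union[timedelta, int, str]) -> int:
    """Convert a timedelta, an int, or an HH:MM:SS / HH:MM / MM string."""
    if isinstance(walltime, timedelta):
        # Whole minutes only; dropping leftover seconds is an assumption.
        return int(walltime.total_seconds()) // 60
    if isinstance(walltime, int):
        # Integers are already a number of minutes.
        return walltime
    parts = [int(p) for p in walltime.split(':')]
    if len(parts) == 1:
        # MM
        return parts[0]
    if len(parts) in (2, 3):
        # HH:MM or HH:MM:SS; the seconds field, if present, is discarded.
        return parts[0] * 60 + parts[1]
    raise ValueError('unsupported walltime format: ' + walltime)


for value in (timedelta(hours=1, minutes=5), 30, '01:30:00', '02:15', '45'):
    print(repr(value), walltime_to_minutes_sketch(value))
```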

Launchers

aprun

class AprunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Cobalt’s aprun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

jsrun

class JsrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using LSF’s jsrun.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

mpirun

class MPILauncher(config=None)[source]

Bases: MultipleLauncher

Launches jobs using mpirun.

mpirun is a tool provided by MPI implementations, such as Open MPI.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

multiple

class MultipleLauncher(script_path=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/psi-j-python/checkouts/latest/src/psij/launchers/scripts/multi_launch.sh'), config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches multiple identical copies of the executable.

The exit code of the job corresponds to the first non-zero exit code encountered in one of the executable copies or zero if all invocations of the executable succeed.
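
The exit code rule stated above amounts to the following, shown as a sketch. The function name combined_exit_code is hypothetical, and the actual launcher applies this rule inside its launch script rather than in Python.

```python
from typing import Iterable


def combined_exit_code(exit_codes: Iterable[int]) -> int:
    """Return the first non-zero exit code, or zero if there is none."""
    # exit_codes is assumed to be in the order the codes were encountered.
    return next((code for code in exit_codes if code != 0), 0)


print(combined_exit_code([0, 3, 0, 7]))
print(combined_exit_code([0, 0, 0]))
```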

Parameters

single

class SingleLauncher(config=None)[source]

Bases: ScriptBasedLauncher

A launcher that launches a single copy of the executable. This is the default launcher.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

srun

class SrunLauncher(config=None)[source]

Bases: MultipleLauncher

Launches a job using Slurm’s srun.

See the Slurm Workload Manager.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration.

Launcher Infrastructure

Launcher

class Launcher(config=None)[source]

Bases: ABC

An abstract base class for all launchers.

Parameters

config (Optional[JobExecutorConfig]) – An optional configuration. If not specified, DEFAULT is used.

Return type

None

static get_instance(name, version_constraint=None, config=None)[source]

Returns an instance of a launcher optionally configured using a certain configuration.

The returned instance may or may not be a singleton object.

Parameters
Returns

A launcher instance.

Return type

Launcher

abstract get_launch_command(job)[source]

Constructs a command to launch a job given a job specification.

Parameters

job (Job) – The job to launch.

Returns

A list of strings representing the launch command and all of its arguments.

Return type

List[str]

abstract get_launcher_failure_message(output)[source]

Extracts the launcher error message from the output of this launcher’s invocation.

It is understood that the value of the output parameter is such that is_launcher_failure() returns True on it.

Parameters

output (str) – The output (combined stdout/stderr) from an invocation of the launcher command.

Returns

A string representing the part of the launcher output that describes the launcher error.

Return type

str

static get_launcher_names()[source]

Returns a set of registered launcher names.

Names returned by this method can be passed to get_instance() as the name parameter.

Returns

A set of launcher names corresponding to the known launchers.

Return type

Set[str]

abstract is_launcher_failure(output)[source]

Determines whether the launcher invocation output contains a launcher failure or not.

Parameters

output (str) – The output (combined stdout/stderr) from an invocation of the launcher command

Returns

Returns True if the output parameter contains a string that represents a launcher failure.

Return type

bool

static register_launcher(desc, root)[source]

Registers a launcher class.

The registered class can then be instantiated using get_instance().

Parameters
  • desc (Descriptor) – A Descriptor with information about the launcher to register.

  • root (str) – A filesystem path from which the implementation of the launcher is to be loaded. Launchers from other locations, even if under the correct package, will not be registered by this method. If a launcher implementation is only available under a different root path, this method will throw an exception.

Return type

None

ScriptBasedLauncher

class ScriptBasedLauncher(script_path, config=None)[source]

Bases: Launcher

A launcher that uses a script to start the job, possibly by wrapping it in other tools.

This launcher is an abstract base class for launchers that wrap the job in a script. The script must be a bash script. It is invoked with the following as its first four parameters:

  • the job ID

  • a launcher log file, which is taken from the launcher_log_file configuration setting and defaults to /dev/null

  • the pre- and post- launcher scripts, or empty strings if they are not specified

Additional positional arguments to the script can be specified by subclasses by overriding the get_additional_args() method.

The remaining arguments to the script are the job executable and arguments.

A simple script library is provided in scripts/launcher_lib.sh. Its use is optional and it is intended to be included at the beginning of a main launcher script using source $(dirname "$0")/launcher_lib.sh. It does the following:

  • sets ‘-e’ mode (exit on error)

  • sets the variables _PSI_J_JOB_ID, _PSI_J_LOG_FILE, _PSI_J_PRE_LAUNCH, and _PSI_J_POST_LAUNCH from the first arguments, as specified above.

  • saves the current stdout and stderr in descriptors 3 and 4, respectively

  • redirects stdout and stderr to the log file, while prepending a timestamp and the job ID to each line

  • defines the commands “pre_launch” and “post_launch”, which can be invoked by the main script.

When invoking the job executable (either directly or through a launch command), it is recommended that the stdout and stderr of the job process be redirected to descriptors 3 and 4, respectively, so that they can be captured by the entity invoking the launcher rather than ending up in the launcher log file.

The launcher should signal successful completion by printing the string “_PSI_J_LAUNCHER_DONE” to stdout. The launcher can then exit with the exit code returned by the launched command. This allows the executor to distinguish between a non-zero exit code caused by an application failure and one caused by a premature launcher failure.

The actual launcher scripts, as well as the library, are deployed at run-time into the work directory, where submit scripts are also generated. This directory is meant to be accessible by both the node submitting the job as well as the node launching the job.

Parameters
  • script_path (Path) – A path to a script that is invoked as described above.

  • config (Optional[JobExecutorConfig]) – An optional configuration.

Return type

None
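
The argument layout described above can be sketched as a function that assembles the arguments passed to the launcher script. The function name and its parameters are hypothetical; how the real launcher prefixes the script path or interpreter is not shown here.

```python
from typing import List, Optional


def build_script_args_sketch(job_id: str,
                             log_file: Optional[str],
                             pre_launch: Optional[str],
                             post_launch: Optional[str],
                             additional_args: List[str],
                             executable: str,
                             arguments: List[str]) -> List[str]:
    """Order the script arguments as described for ScriptBasedLauncher."""
    return [
        job_id,                   # 1: the job ID
        log_file or '/dev/null',  # 2: launcher log file
        pre_launch or '',         # 3: pre-launch script, or empty string
        post_launch or '',        # 4: post-launch script, or empty string
        *additional_args,         # extras from get_additional_args()
        executable,               # the job executable...
        *arguments,               # ...followed by its arguments
    ]


args = build_script_args_sketch(
    'job-1', None, None, None, ['4'], '/bin/echo', ['hello'])
print(args)
```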

get_additional_args(job)[source]

Returns any additional arguments to be passed to the script after the first four mandatory ones.

Parameters

job (Job) – The job that is being launched.

Return type

List[str]

get_launch_command(job, log_file=None)[source]

See get_launch_command().

Parameters
Return type

List[str]

get_launcher_failure_message(output)[source]

See get_launcher_failure_message().

Parameters

output (str) –

Return type

str

is_launcher_failure(output)[source]

See is_launcher_failure().

Parameters

output (str) –

Return type

bool
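
Given the completion marker described for ScriptBasedLauncher, a failure check could be sketched as below. This is an assumption about one reasonable way to perform the check, not the class's actual logic; the function name is hypothetical.

```python
LAUNCHER_DONE_MARKER = '_PSI_J_LAUNCHER_DONE'


def is_launcher_failure_sketch(output: str) -> bool:
    """Treat output lacking the completion marker as a launcher failure."""
    # Assumption: the marker appears on a line of its own somewhere in
    # the combined stdout/stderr of the launcher invocation.
    lines = [line.strip() for line in output.splitlines()]
    return LAUNCHER_DONE_MARKER not in lines


print(is_launcher_failure_sketch('hello\n_PSI_J_LAUNCHER_DONE\n'))
print(is_launcher_failure_sketch('bash: srun: command not found\n'))
```

When the check reports a failure, get_launcher_failure_message() is the place to extract the relevant part of the output.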