Concepts

molq is built around a small number of orthogonal abstractions. Knowing which one owns what saves you from guessing where to put configuration or where a bug lives.

The big picture

                         ┌───────────────────────────┐
                         │         Submitor          │   lifecycle engine
                         │  (store, monitor, events) │   "how jobs are tracked"
                         └────────────┬──────────────┘
                                      │ target=
                         ┌───────────────────────────┐
                         │          Cluster          │   destination
                         │  scheduler kind + options │   "where jobs run"
                         └─────┬───────────────┬─────┘
                               │               │
               ┌───────────────▼─┐       ┌─────▼─────────────┐
               │    Scheduler    │       │     Transport     │
               │   (protocol)    │       │    (protocol)     │
               │ Shell / Slurm / │       │  Local / SSH      │
               │   PBS  / LSF    │       │                   │
               └─────────────────┘       └───────────────────┘
                  HOW to talk                WHERE commands and
                  to the scheduler           file ops execute
                  (sbatch / qsub / bsub)     (subprocess vs ssh+rsync)

The point of the split: Scheduler and Transport are independent axes. You can drive a remote SLURM cluster over SSH, run a local "no batch system" cluster via subprocess, or run jobs on a remote workstation by pairing scheduler="local" with host="...", all without touching Submitor or Cluster code.

Cluster — where jobs run

A Cluster is a destination spec. It owns:

  • a name (used to scope persisted records)
  • a scheduler kind ("local", "slurm", "pbs", "lsf")
  • a Transport (defaults to LocalTransport; pass host="user@host" to use SSH)
  • optional scheduler options (SlurmSchedulerOptions, etc.)

scheduler="local" is the no-batch-system backend; the transport decides where it runs. With LocalTransport jobs run on this host; with SshTransport (or the host= shortcut) they run on a remote workstation that has no queue manager.

A Cluster has no lifecycle state: no store, no monitor, no event bus. It is cheap to construct. Multiple Submitors can share a Cluster, and a Cluster can outlive any Submitor.

import molq as mq

local = mq.Cluster("dev", "local")                               # no batch system, runs on this host
hpc   = mq.Cluster("hpc", "slurm", host="user@hpc.example.com")  # SLURM, reached over SSH

Cluster exposes only destination-side reads:

  • cluster.get_queue() — snapshot of squeue --me / qstat -u $USER / bjobs (empty for local)
  • cluster.get_workspace(name, path=...) — handle to a remote directory
  • cluster.get_project(name, workspace=...) — sub-namespace under a workspace

See Cluster in the API reference.

Submitor — how jobs are tracked

A Submitor is the lifecycle engine. It owns:

  • the JobStore (SQLite at ~/.molq/jobs.db by default)
  • the JobReconciler (syncs persisted state with the scheduler)
  • the JobMonitor (blocking waits, polling strategies)
  • the EventBus (lifecycle event pub/sub)
  • per-job defaults, retry policy, retention policy

Each Submitor is bound to one Cluster as its target at construction. All lifecycle ops are implicitly scoped to that target's name, so two Submitors targeting different Clusters can share a JobStore without seeing each other's records.

submitor = mq.Submitor(target=hpc)

handle = submitor.submit_job(argv=["python", "train.py"])
records = submitor.list_jobs()
submitor.cancel_job(handle.job_id)

The Submitor surface is verb_noun: submit_job, list_jobs, get_job, cancel_job, watch_jobs, refresh_jobs, cleanup_jobs, run_daemon, on_event, off_event. See API reference.

Multi-cluster

Multi-cluster in one process is just multiple Submitors:

sub_local = mq.Submitor(target=mq.Cluster("dev", "local"))
sub_hpc   = mq.Submitor(target=mq.Cluster("hpc", "slurm", host="..."))

# Each Submitor's list_jobs() only sees its own target's records,
# even though they share the same JobStore file.
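
One way to picture this scoping is a single table with a cluster-name column that every query filters on. The sketch below uses Python's sqlite3 with an in-memory database. The table layout, the column names, and the helpers add_job and list_jobs are hypothetical illustrations, not molq's actual schema or API.

```python
from __future__ import annotations

import sqlite3

# Hypothetical schema: molq's real JobStore layout is not shown on this page.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, cluster_name TEXT, state TEXT)"
)


def add_job(job_id: str, cluster_name: str, state: str = "submitted") -> None:
    """Persist one record, tagged with the name of the cluster it targets."""
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO jobs VALUES (?, ?, ?)", (job_id, cluster_name, state)
        )


def list_jobs(cluster_name: str) -> list[str]:
    """Return only the job ids recorded under the given cluster name."""
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE cluster_name = ? ORDER BY job_id",
        (cluster_name,),
    )
    return [row[0] for row in rows]


# Two targets write into the same store...
add_job("a1", "dev")
add_job("b1", "hpc")
add_job("b2", "hpc")

# ...but each view is filtered by its own cluster name.
dev_jobs = list_jobs("dev")
hpc_jobs = list_jobs("hpc")
```

Because the filter key is the Cluster name, giving two different destinations the same name would presumably merge their views, so distinct names per destination are the safe choice.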

Scheduler — the protocol behind the kind string

The Scheduler protocol is internal: users don't construct a Scheduler directly. It is the abstract interface that ShellScheduler (the backend for scheduler="local"), SlurmScheduler, PBSScheduler, and LSFScheduler implement, with methods:

  • submit(spec, job_dir) — translate a JobSpec into a scheduler submission (writes run_slurm.sh, calls sbatch, etc.)
  • poll_many(ids) — batch query for current state
  • cancel(id) — cancel a job
  • resolve_terminal(id) — determine how a vanished job ended
  • list_queue(user=None) — snapshot the scheduler's current queue
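
To make the shape of this interface concrete, here is a minimal sketch written as a typing.Protocol. The method names and parameters follow the list above. Every type hint, the state strings, and the InMemoryScheduler toy backend are assumptions made for illustration, not molq's real definitions.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Scheduler(Protocol):
    """Method names come from the list above; all type hints are assumptions."""

    def submit(self, spec: Any, job_dir: str) -> str: ...
    def poll_many(self, ids: list[str]) -> dict[str, str]: ...
    def cancel(self, id: str) -> None: ...
    def resolve_terminal(self, id: str) -> str: ...
    def list_queue(self, user: str | None = None) -> list[dict[str, str]]: ...


class InMemoryScheduler:
    """Hypothetical toy backend that keeps job state in dicts."""

    def __init__(self) -> None:
        self._jobs: dict[str, str] = {}      # jobs still in the queue
        self._finished: dict[str, str] = {}  # jobs that have left the queue

    def submit(self, spec, job_dir):
        job_id = str(len(self._jobs) + len(self._finished) + 1)
        self._jobs[job_id] = "running"
        return job_id

    def poll_many(self, ids):
        # Jobs that have left the queue are simply absent from the result.
        return {i: self._jobs[i] for i in ids if i in self._jobs}

    def cancel(self, id):
        if id in self._jobs:
            del self._jobs[id]
            self._finished[id] = "cancelled"

    def resolve_terminal(self, id):
        # Asked only about jobs that vanished from the queue.
        return self._finished.get(id, "unknown")

    def list_queue(self, user=None):
        return [{"scheduler_job_id": i, "state": s} for i, s in self._jobs.items()]


sched = InMemoryScheduler()
first = sched.submit({"argv": ["true"]}, "/tmp/job-1")
second = sched.submit({"argv": ["true"]}, "/tmp/job-2")
sched.cancel(first)

polled = sched.poll_many([first, second])   # only the job still queued
ended = sched.resolve_terminal(first)       # how the vanished job ended
```

The split between poll_many and resolve_terminal mirrors the list above: a batch poll reports what is still visible, and a separate call explains jobs that are no longer there.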

You configure schedulers indirectly by passing scheduler_options=... to Cluster. See Schedulers for option types.

Transport — where commands physically run

The Transport protocol is also internal but worth understanding because it is what makes "remote SLURM" work without any new dependencies:

  • LocalTransport — runs commands via subprocess, file ops via pathlib
  • SshTransport — shells out to OpenSSH / rsync; inherits your ~/.ssh/config, agents, ProxyJump, ControlMaster, Kerberos

Schedulers use self._transport.run(...) for every shell call, so a SLURM scheduler with an SshTransport runs sbatch, squeue, scancel, and sacct over SSH — automatically.
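
The indirection can be sketched in a few lines. The Toy* classes below are hypothetical stand-ins, not molq's classes, and the exact ssh argv that molq builds is an assumption. Commands are recorded instead of executed so the sketch has no side effects.

```python
from __future__ import annotations

import shlex
from typing import Callable

# A runner receives the final argv. A real transport would execute it;
# here it is only recorded.
Runner = Callable[[list], str]


class ToyLocalTransport:
    def __init__(self, runner: Runner) -> None:
        self._runner = runner

    def run(self, argv: list[str]) -> str:
        # Local case: the command is passed through unchanged.
        return self._runner(argv)


class ToySshTransport:
    def __init__(self, host: str, runner: Runner) -> None:
        self._host = host
        self._runner = runner

    def run(self, argv: list[str]) -> str:
        # Remote case: quote the command and hand it to the ssh client.
        return self._runner(["ssh", self._host, shlex.join(argv)])


class ToySlurmScheduler:
    """Knows which command to run, not where it runs."""

    def __init__(self, transport) -> None:
        self._transport = transport

    def cancel(self, job_id: str) -> None:
        self._transport.run(["scancel", job_id])


calls: list[list[str]] = []


def record(argv: list[str]) -> str:
    calls.append(argv)
    return ""


ToySlurmScheduler(ToyLocalTransport(record)).cancel("42")
ToySlurmScheduler(ToySshTransport("user@hpc.example.com", record)).cancel("42")
```

The scheduler code is identical in both calls; only the transport handed to it changes what ends up being executed.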

You normally pick a Transport implicitly:

mq.Cluster("hpc", "slurm")                            # LocalTransport
mq.Cluster("hpc", "slurm", host="user@hpc.example")   # SshTransport (shortcut)

# Or explicitly, when you need custom SSH options:
from molq.options import SshTransportOptions
from molq.transport import SshTransport

ssh = SshTransport(SshTransportOptions(
    host="user@bastion",
    identity_file="~/.ssh/hpc_key",
    ssh_opts=("-o", "ProxyJump=jump.example.com"),
))
mq.Cluster("hpc", "slurm", transport=ssh)

host= and transport= are mutually exclusive.
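
A minimal sketch of how such a rule could be enforced. resolve_transport is a hypothetical helper, and the exception type molq actually raises is not stated on this page, so ValueError here is an assumption.

```python
from __future__ import annotations


def resolve_transport(host: str | None = None, transport: object | None = None) -> str:
    """Hypothetical helper: describe which transport a Cluster would end up with."""
    if host is not None and transport is not None:
        # Assumed behaviour: reject ambiguous configuration up front.
        raise ValueError("pass either host= or transport=, not both")
    if transport is not None:
        return "explicit"
    if host is not None:
        return f"ssh:{host}"
    return "local"


default_choice = resolve_transport()
shortcut_choice = resolve_transport(host="user@hpc.example")

try:
    resolve_transport(host="user@hpc.example", transport=object())
    rejected = False
except ValueError:
    rejected = True
```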

Workspace and Project — remote directories

A Workspace is a base directory on the cluster's filesystem. A Project is a sub-namespace under a workspace. Both share a tiny file-ops surface that goes through the cluster's Transport, so the same code works against a local filesystem or a remote cluster over SSH.

ws   = cluster.get_workspace("scratch", path="/scratch/$USER")
proj = ws.get_project("alphafold")        # /scratch/$USER/alphafold

proj.ensure()                             # mkdir -p
proj.upload("./inputs", recursive=True)   # rsync local → cluster
handle = proj.submit_job(submitor, argv=["python", "run.py"])
proj.download("results.csv", "./out.csv")

Project.submit_job is sugar that overrides JobExecution.cwd to the project path before forwarding to submitor.submit_job(...).
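
The override can be pictured as a copy-with-one-field-changed before forwarding. In the sketch below, Execution stands in for JobExecution with only cwd modelled, and ToyProject, fake_submit, and the execution= keyword are hypothetical names used for illustration.

```python
from __future__ import annotations

import posixpath
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Execution:
    """Hypothetical stand-in for JobExecution; only cwd is modelled."""

    cwd: str | None = None


@dataclass(frozen=True)
class ToyProject:
    workspace_path: str
    name: str

    @property
    def path(self) -> str:
        # Remote paths are POSIX-style regardless of the local platform.
        return posixpath.join(self.workspace_path, self.name)

    def submit_job(self, submit, argv, execution: Execution | None = None):
        # Force cwd to the project directory, then forward everything else as-is.
        execution = replace(execution or Execution(), cwd=self.path)
        return submit(argv=argv, execution=execution)


def fake_submit(argv, execution):
    """Stands in for submitor.submit_job; returns what it was given."""
    return (argv, execution.cwd)


proj = ToyProject("/scratch/alice", "alphafold")
result = proj.submit_job(fake_submit, ["python", "run.py"], Execution(cwd="/tmp"))
```

Note that a caller-supplied cwd is replaced, not merged, which matches the "overrides" wording above.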

Workspace and Project are deliberately thin. They do not auto-stage local files referenced in argv — call proj.upload(...) explicitly.

Job objects

These flow back from a submission:

  • JobHandle — lightweight handle returned by submitor.submit_job(...). Methods: status(), wait(timeout), cancel(), refresh(). Fields: job_id, cluster_name, scheduler, scheduler_job_id.
  • JobRecord — immutable snapshot of a job's full lifecycle state. Returned by submitor.get_job(...) and handle.wait().
  • QueueEntry — one row from cluster.get_queue(). Fields include scheduler_job_id, name, user, state, partition, submit_time, start_time. Distinct from JobRecord: JobRecord is molq's view of a job; QueueEntry is what the scheduler client sees, including jobs submitted outside molq.
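
The JobRecord / QueueEntry distinction can be illustrated by partitioning a queue snapshot on scheduler_job_id. Entry is a reduced stand-in for QueueEntry, and split_queue is a hypothetical helper. In real use the tracked ids would presumably come from the scheduler_job_id values of your own handles or records.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Stand-in for QueueEntry, reduced to two of the fields listed above."""

    scheduler_job_id: str
    user: str


def split_queue(entries: list[Entry], tracked_ids: set[str]):
    """Partition a queue snapshot into molq-tracked and external jobs."""
    tracked = [e for e in entries if e.scheduler_job_id in tracked_ids]
    external = [e for e in entries if e.scheduler_job_id not in tracked_ids]
    return tracked, external


# Illustrative snapshot: two jobs are known to molq, one was submitted elsewhere.
queue = [Entry("101", "alice"), Entry("102", "bob"), Entry("103", "alice")]
tracked, external = split_queue(queue, tracked_ids={"101", "103"})
```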

Cheat sheet

| Question | Answer |
|----------|--------|
| Where do I configure SSH? | Cluster(host="user@host") or Cluster(transport=SshTransport(...)) |
| Where do I configure the SLURM partition for a single job? | submit_job(scheduling=JobScheduling(partition="gpu")) |
| Where do I set per-cluster defaults? | Submitor(target=cluster, defaults=SubmitorDefaults(...)) |
| Where do I customize the SQLite path? | Submitor(target=cluster, store=JobStore(path)) |
| Where do retries live? | Submitor(target=cluster, default_retry_policy=...), or per-call submit_job(retry=...) |
| Where do I see jobs other people submitted? | cluster.get_queue() (live scheduler snapshot) |
| Where do I see jobs I submitted via molq? | submitor.list_jobs() (persisted records) |
| Why is my SLURM --partition flag missing? | You passed JobScheduling(queue=...); the field is now partition (legacy queue still loads from disk for one release) |