Monitoring¶

molq separates submission from observation. Jobs are persisted in SQLite, refreshed by reconciliation, and surfaced through handles, record lists, CLI commands, and a Rich dashboard. Reconciliation, polling, and events all live on Submitor; live scheduler introspection lives on Cluster.

Two views of "what is running"¶

View	Source	Includes other users?
`submitor.list_jobs()`	molq's persisted SQLite records	No — only your jobs
`cluster.get_queue()`	live `squeue --me` / `qstat` / `bjobs`	Scheduler-dependent

Use list_jobs when you want molq's own state machine view (with retries, transitions, dependencies). Use get_queue when you want what the scheduler client itself shows.

Job Lifecycle¶

Jobs move through a terminal-aware state enum:

from molq import JobState

JobState.CREATED
JobState.SUBMITTED
JobState.QUEUED
JobState.RUNNING
JobState.SUCCEEDED
JobState.FAILED
JobState.CANCELLED
JobState.TIMED_OUT
JobState.LOST

All terminal states report True from is_terminal.

Persistence¶

JobStore persists all job records in SQLite with WAL mode enabled. By default the database is stored at ~/.molq/jobs.db. Multiple Submitors (across multiple Clusters) share this store and filter by their target's name.

Each record contains:

stable job_id
cluster_name (the Submitor's target name)
scheduler (the scheduler kind)
scheduler identity (scheduler_job_id)
cached state
timestamps such as submitted_at, started_at, and finished_at
exit code and failure reason
execution metadata including default stdout and stderr paths

Reconciliation¶

JobReconciler is responsible for syncing persisted state with the scheduler. Its main loop is:

Load active jobs for a cluster.
Batch-query the scheduler (via the Cluster's Transport).
Compare cached state with reported state.
Persist transitions and updated timestamps.

submitor.refresh_jobs() runs a reconciliation pass for the current target Cluster.

Blocking waits¶

JobMonitor provides higher-level waiting primitives on top of reconciliation.

handle = submitor.submit_job(argv=["python", "train.py"])
record = handle.wait(timeout=3600)

For multiple jobs:

records = submitor.watch_jobs([h.job_id for h in handles], timeout=7200)

Polling strategies¶

molq ships three public polling strategies:

Strategy	Use case
`FixedStrategy`	Simple fixed-interval polling
`ExponentialBackoffStrategy`	Lower scheduler pressure for long jobs
`AdaptiveStrategy`	Polling cadence based on expected runtime

These strategies live in molq.strategies and back the monitor layer.

Logs¶

Each submitted job gets default stdout and stderr paths unless you override them in JobExecution. By default these files live under the submission working directory at .molq/jobs/<job-id>/.

At runtime, those resolved paths are stored in JobRecord.metadata under:

molq.stdout_path
molq.stderr_path
molq.job_dir

The CLI exposes them through:

molq logs <job-id> local --stream stdout
molq logs <job-id> local --stream stderr --tail 50

Lifecycle events¶

Submitor emits typed events through an EventBus. Subscribe with on_event / unsubscribe with off_event:

from molq import EventType

def on_terminated(payload):
    print(f"{payload.job_id}: {payload.transition.new_state.value}")

submitor.on_event(EventType.JOB_COMPLETED, on_terminated)
submitor.on_event(EventType.JOB_FAILED, on_terminated)

# Later:
submitor.off_event(EventType.JOB_COMPLETED, on_terminated)

Background reconciliation¶

submitor.run_daemon(once=False, interval=5.0, run_cleanup=True) runs a lightweight loop that calls refresh_jobs (and optionally cleanup_jobs) on a fixed interval. Useful in long-running services that want fresh state without explicit polling at every read.

Full-screen dashboard¶

The Rich dashboard is available in two forms:

RunDashboard for a focused dashboard experience
MolqMonitor for a full-screen view across all persisted jobs

From the CLI:

molq monitor
molq monitor --all --limit 500 --refresh 1.5

Closing the dashboard does not cancel running jobs.