The Tandemn System control plane exposes a REST API at http://localhost:26336 by default. Most users should use the CLI; use the REST API for integrations, dashboards, and operational tooling.
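For integrations, a thin client wrapper is usually enough. The following is a minimal sketch using only the Python standard library. The base URL is the default stated above; the helper names `build_request` and `call`, the JSON `Content-Type` header, and the assumption that responses are JSON are illustrative and not confirmed by this page.

```python
import json
import urllib.request

BASE_URL = "http://localhost:26336"  # default control-plane address from the text above


def build_request(method, path, body=None, base_url=BASE_URL):
    """Build (but do not send) a request for the control plane.

    Assumption: POST bodies are sent as JSON.
    """
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(method, path, body=None, base_url=BASE_URL, timeout=10.0):
    """Send the request and decode the response as JSON (assumed format)."""
    req = build_request(method, path, body, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a control plane running locally, `call("GET", "/jobs")` would list jobs under these assumptions.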

Endpoint reference

| Endpoint | Method | Description |
| --- | --- | --- |
| /submit/batch | POST | Submit a batch inference job. |
| /test/placement | POST | Run solver only without launching. |
| /jobs | GET | List all jobs. |
| /job/{id} | GET | Job status and progress. |
| /job/{id}/phase | POST | Update job lifecycle phase. |
| /job/{id}/metrics | GET | Latest aggregated metrics snapshot. |
| /job/{id}/metrics/stream | GET | SSE metrics stream. |
| /job/{id}/metrics/ingest | POST | Sidecar metrics ingest from replicas. |
| /job/{id}/metrics/summary | POST | Per-replica build metrics summary. |
| /job/{id}/throughput | GET | Sustained throughput over the rolling window. |
| /job/{id}/replicas | GET | Per-replica state, phase, region, and metrics availability. |
| /job/{id}/replicas/{rid}/metrics | GET | Metrics for a specific replica. |
| /job/{id}/replicas/summaries | GET | Per-replica completion summaries. |
| /job/{id}/scale | POST | Add replicas to a running job. |
| /job/{id}/kill | POST | Kill specific replicas. |
| /job/{id}/swap | POST | Hot-swap replicas to a new GPU configuration. |
| /job/{id}/chunks/progress | GET | Chunk-level progress. |
| /job/{id}/chunks/pull | POST | Pull the next chunk. Replica-facing. |
| /job/{id}/chunks/complete | POST | Mark a chunk complete. |
| /job/{id}/chunks/renew | POST | Renew a chunk lease. |
| /dashboard | GET | Web dashboard HTML. |
| /dashboard/poll | GET | Dashboard JSON payload for polling fallback. |
| /dashboard/stream | GET | Real-time dashboard SSE stream. |
| /analytics/runs | GET | List completed runs. |
| /analytics/runs/{id} | GET | Full completed run report. |
| /analytics/runs/{id}/timeseries | GET | Scheduler timeseries for a run. |
| /quota/status | GET | Quota usage across AWS regions. |
| /resources | GET | Instance catalog and quota pools. |
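The two stream endpoints are listed as SSE (server-sent events). A minimal sketch of decoding an SSE stream line by line follows. It implements only the generic `data:` framing of the SSE format; the function name `iter_sse_data` is hypothetical, and what Tandemn puts inside each event (for example, a JSON snapshot) is an assumption this page does not confirm.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Generic SSE framing: "data:" lines accumulate, a blank line ends the event,
    and lines starting with ":" are comments (often used as keep-alives).
    Other fields (event, id, retry) are ignored in this sketch.
    """
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buf:
                yield "\n".join(buf)
                buf = []
            continue
        if line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buf.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is dropped, as in the SSE format.
```

Any iterable of text lines works as input, for example the decoded lines of an open HTTP response to `/job/{id}/metrics/stream`.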

Example requests and responses

GET /job/{id}/metrics

{
  "job_id": "mo-qwen7b-a1b2",
  "timestamp": 1711612800.0,
  "avg_generation_throughput_toks_per_s": 1450.5,
  "avg_prompt_throughput_toks_per_s": 320.0,
  "gpu_cache_usage_perc": 0.42,
  "num_requests_running": 64,
  "num_requests_waiting": 0,
  "generation_tokens_total": 2800000,
  "prompt_tokens_total": 350000,
  "gpu_sm_util_pct": 95.2,
  "gpu_mem_bw_util_pct": 61.0,
  "ttft_ms_p50": 45.0,
  "ttft_ms_p95": 120.0,
  "tpot_ms_p50": 8.5,
  "tpot_ms_p95": 15.0
}
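As a sketch of how a dashboard or alerting script might consume this snapshot, the helper below derives a few summary values from the fields shown. The function name `summarize_metrics` is hypothetical. It also assumes, based only on the sample values, that `gpu_cache_usage_perc` is a 0 to 1 fraction while the `_pct` fields are already on a 0 to 100 scale.

```python
def summarize_metrics(snapshot: dict) -> dict:
    """Derive a compact summary from a /job/{id}/metrics snapshot."""
    # Assumption: gpu_cache_usage_perc is a fraction (0.42 in the sample),
    # unlike gpu_sm_util_pct, which looks like a percentage (95.2).
    kv_cache_pct = round(snapshot["gpu_cache_usage_perc"] * 100.0, 1)
    total_tokens = snapshot["generation_tokens_total"] + snapshot["prompt_tokens_total"]
    return {
        "job_id": snapshot["job_id"],
        "total_tokens": total_tokens,
        "kv_cache_pct": kv_cache_pct,
        # Requests waiting while others run suggests the replicas are saturated.
        "queue_backed_up": snapshot["num_requests_waiting"] > 0,
    }
```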

GET /job/{id}/chunks/progress

{
  "total": 10,
  "pending": 3,
  "inflight": 2,
  "completed": 5,
  "failed": 0,
  "all_done": false
}
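A common use of this payload is polling until the job finishes. The sketch below is one way to do that; `completion_fraction` and `wait_until_done` are hypothetical helper names, and `fetch_progress` stands in for whatever function performs the `GET /job/{id}/chunks/progress` call. The polling interval is an arbitrary choice, not a documented recommendation.

```python
import time
from typing import Callable, Optional


def completion_fraction(progress: dict) -> float:
    """Fraction of chunks completed; 0.0 when the job reports no chunks."""
    total = progress["total"]
    return progress["completed"] / total if total else 0.0


def wait_until_done(
    fetch_progress: Callable[[], dict],
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the payload reports all_done, then return the final payload."""
    polls = 0
    while True:
        progress = fetch_progress()
        polls += 1
        if progress["all_done"]:
            return progress
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job not done after {polls} polls")
        sleep(interval_s)
```

In the sample above, `completion_fraction` gives 0.5 (5 of 10 chunks completed).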

POST /job/{id}/scale

{
  "count": 2,
  "gpu_type": "L40S",
  "tp_size": 4,
  "pp_size": 1,
  "on_demand": false,
  "force": false
}

gpu_type, tp_size, and pp_size are optional. If omitted, Tandemn inherits them from the existing job.
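The optional fields can be handled by leaving them out of the request body entirely. Below is a minimal sketch of a body builder; `build_scale_body` is a hypothetical helper, and two points are assumptions rather than documented behavior: that `on_demand` and `force` are always sent (they are not listed as optional), and that `count` must be at least 1.

```python
from typing import Optional


def build_scale_body(
    count: int,
    gpu_type: Optional[str] = None,
    tp_size: Optional[int] = None,
    pp_size: Optional[int] = None,
    on_demand: bool = False,
    force: bool = False,
) -> dict:
    """Build a POST /job/{id}/scale body, omitting unset optional fields."""
    if count < 1:
        # Assumption: scaling adds replicas, so a non-positive count is rejected here.
        raise ValueError("count must be at least 1")
    body = {"count": count, "on_demand": on_demand, "force": force}
    optional = {"gpu_type": gpu_type, "tp_size": tp_size, "pp_size": pp_size}
    # Omitted keys let the server inherit the values from the existing job.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

Calling `build_scale_body(2)` produces a body with no GPU or parallelism fields, so the job's current configuration would be inherited.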

POST /job/{id}/swap

{
  "gpu_type": "H100",
  "tp_size": 4,
  "num_replicas": 2,
  "ready_threshold": 1,
  "force": false
}

New replicas launch first. Old replicas are killed only after ready_threshold of the new replicas have begun processing.
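The hand-over rule can be read as a simple threshold check. The sketch below illustrates that reading only; it is not the control plane's implementation. The function name `old_replicas_can_be_killed` and the phase label `"processing"` are hypothetical.

```python
from typing import Iterable


def old_replicas_can_be_killed(
    new_replica_phases: Iterable[str],
    ready_threshold: int,
    processing_phase: str = "processing",  # hypothetical phase label
) -> bool:
    """Return True once enough new replicas have begun processing."""
    ready = sum(1 for phase in new_replica_phases if phase == processing_phase)
    return ready >= ready_threshold
```

With `num_replicas` set to 2 and `ready_threshold` set to 1, as in the example body, the old replicas would be retired as soon as the first new replica starts processing, while the second may still be launching.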