Tandemn System uses a client-server architecture. Administrators operate the self-hosted control plane, and users submit inference jobs through the CLI.

Components

Control plane

The server that receives job requests, runs placement planning, tracks quota, launches replicas, and coordinates orchestration.

Tandemn CLI

The command-line interface that users install locally to check connectivity and submit workloads.

GPU resources

AWS instances launched by SkyPilot to execute inference workloads.

Batch inputs

OpenAI-style JSONL files that describe the work users want Tandemn to run.
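
For example, an input file holds one request per line. The snippet below writes a minimal file following the OpenAI batch JSONL convention; the model name, file path, and field values are illustrative assumptions, so confirm the exact schema your Tandemn deployment expects.

```python
import json

# Illustrative only: fields follow the OpenAI batch JSONL convention
# (custom_id, method, url, body); confirm the schema Tandemn expects.
prompts = ["Summarize the Iliad in one sentence.", "Explain TCP slow start."]
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"request-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
            },
        }
        f.write(json.dumps(request) + "\n")
```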

Request flow

1. A user submits a job
The user runs tandemn deploy with a model, JSONL input file, and SLO deadline.

2. The server receives the job
The control plane validates the request, uploads local input to S3, and evaluates available resources.

3. Tandemn builds an execution plan
The placement solver chooses GPU type, region, tensor parallelism, pipeline parallelism, replica count, and launch mode (a sketch of these fields appears after this list).

4. The workload runs
SkyPilot launches replicas, Redis coordinates chunks, vLLM runs inference, and outputs are written to S3.
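
The plan produced in step 3 can be pictured as a small record of those choices. This is a minimal sketch; the field names, types, and example values are assumptions for illustration, not Tandemn's actual data model.

```python
from dataclasses import dataclass

# Field names are illustrative assumptions mirroring the choices listed in
# step 3; they are not Tandemn's actual data model.
@dataclass
class ExecutionPlan:
    gpu_type: str              # e.g. "A100-80GB"
    region: str                # e.g. "us-east-1"
    tensor_parallelism: int    # GPUs sharing one model shard within a replica
    pipeline_parallelism: int  # pipeline stages per replica
    replica_count: int         # number of independent replica clusters
    launch_mode: str           # e.g. "on-demand" or "spot" (assumed values)

plan = ExecutionPlan("A100-80GB", "us-east-1", 4, 1, 2, "spot")
```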

Execution model

Input is split into chunks and queued in Redis. Replica clusters pull chunks, run inference through vLLM, upload per-chunk outputs, and mark chunks complete. If a replica dies, its in-flight chunks can be reclaimed and returned to the queue. tandemn swap keeps the same queue while launching replacement replicas with a different GPU configuration. Old replicas are removed after the new replicas begin processing.
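
The chunk-queue pattern can be sketched as a reliable Redis queue: a replica atomically moves a chunk id from the pending list to its own in-flight list, processes it, and removes it on success; a reclaimer pushes a dead replica's in-flight chunks back to pending. This is a minimal illustration of the pattern under stated assumptions, not Tandemn's implementation; the key names and the process_chunk callback are invented for the example.

```python
import redis

r = redis.Redis()  # assumed connection settings

PENDING = "chunks:pending"  # assumed key names, for illustration only

def inflight_key(replica_id: str) -> str:
    return f"chunks:inflight:{replica_id}"

def worker_loop(replica_id: str, process_chunk) -> None:
    """Pull chunks until the queue drains. process_chunk stands in for
    running vLLM inference and uploading the per-chunk output to S3."""
    inflight = inflight_key(replica_id)
    while True:
        # Atomically move one chunk id from pending to this replica's in-flight list.
        chunk_id = r.brpoplpush(PENDING, inflight, timeout=5)
        if chunk_id is None:
            break                          # queue drained
        process_chunk(chunk_id)            # run inference, upload output
        r.lrem(inflight, 1, chunk_id)      # mark the chunk complete

def reclaim(dead_replica_id: str) -> None:
    """Return a dead replica's in-flight chunks to the pending queue."""
    inflight = inflight_key(dead_replica_id)
    while r.rpoplpush(inflight, PENDING) is not None:
        pass
```

tandemn swap fits the same pattern: replacement replicas run the worker loop against the same pending queue, so the old replicas can be removed once the new ones begin pulling chunks.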

Optional Koi

Orca standalone is the default path: leave KOI_SERVICE_URL unset and the control plane handles placement, launch, chunked execution, monitoring, recovery, and output assembly itself. If KOI_SERVICE_URL is set, Tandemn can call Koi for an additional placement recommendation and send lifecycle callbacks. If KOI_SERVICE_URL is unset, or Koi is unavailable, times out, or returns invalid data, Tandemn continues through the standalone path.
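
The fallback behavior amounts to a guarded optional call around the standalone planner. The sketch below shows the shape of that logic; the /recommend endpoint, request payload, and response fields are assumptions made for illustration, not Koi's documented API.

```python
import os
import requests

def plan_placement(job: dict, standalone_planner, timeout_s: float = 5.0):
    """Always compute a standalone plan; consult Koi only if configured.

    The endpoint path, payload, and response schema are illustrative
    assumptions, not Koi's actual API."""
    plan = standalone_planner(job)              # standalone path always yields a plan
    koi_url = os.environ.get("KOI_SERVICE_URL")
    if not koi_url:
        return plan                             # Koi not configured: use standalone result
    try:
        resp = requests.post(f"{koi_url}/recommend", json={"job": job}, timeout=timeout_s)
        resp.raise_for_status()
        recommendation = resp.json()
        if isinstance(recommendation, dict) and "plan" in recommendation:
            return recommendation["plan"]       # accept Koi's recommendation
    except (requests.RequestException, ValueError):
        pass                                    # unavailable, timed out, or bad data
    return plan                                 # fall back to the standalone plan
```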

Why this model matters

This separation lets infrastructure teams manage hardware and cluster policy centrally while application and ML teams get a simpler job submission interface.
Treat the Tandemn server as shared infrastructure. Keep ownership, deployment, and monitoring responsibilities clear before inviting users onto the cluster.