Components
Control plane
The server that receives job requests, runs placement planning, tracks quota, launches replicas, and coordinates orchestration.
Tandemn CLI
The command line interface users install locally to check connectivity and submit workloads.
GPU resources
AWS instances launched by SkyPilot to execute inference workloads.
Batch inputs
OpenAI-style JSONL files that describe the work users want Tandemn to run.
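A batch input is one JSON object per line in the OpenAI batch style. A minimal sketch of building and reading such a file (the field names follow the OpenAI batch format; the model name is illustrative):

```python
import json

# One request per line: custom_id identifies the item, method/url name the
# API call, and body carries the actual inference request.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "example-model",  # illustrative model name
            "messages": [{"role": "user", "content": f"Summarize item {i}"}],
        },
    }
    for i in range(3)
]

jsonl = "\n".join(json.dumps(r) for r in requests)

# Reading the file back is the inverse: parse each line independently.
parsed = [json.loads(line) for line in jsonl.splitlines()]
print(parsed[0]["custom_id"])
```

Because each line is self-contained, the file can be split into chunks at any line boundary, which is what makes the chunked execution model below possible.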
Request flow
The server receives the job
The control plane validates the request, uploads local input to S3, and evaluates available resources.
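The intake step can be sketched as a small pipeline. This is a sketch only: the function names and the shape of the request are assumptions for illustration, not Tandemn's actual API.

```python
# Hypothetical control-plane intake: validate, upload, then size resources.
def receive_job(request, validate, upload_to_s3, evaluate_resources):
    errors = validate(request)
    if errors:
        raise ValueError(f"invalid job request: {errors}")
    input_uri = upload_to_s3(request["input_path"])   # local file -> S3 URI
    resources = evaluate_resources(request)           # what GPUs are available?
    return {"input_uri": input_uri, "resources": resources}

# Stand-in callables so the sketch runs without AWS credentials.
job = receive_job(
    {"input_path": "batch.jsonl", "model": "example-model"},
    validate=lambda r: [] if "model" in r else ["missing model"],
    upload_to_s3=lambda p: f"s3://bucket/{p}",        # stand-in for the real upload
    evaluate_resources=lambda r: {"gpus_available": 8},
)
print(job["input_uri"])
```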
Tandemn builds an execution plan
The placement solver chooses GPU type, region, tensor parallelism, pipeline parallelism, replica count, and launch mode.
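The plan the solver produces can be pictured as a record of those six decisions. The field names below are illustrative, not Tandemn's actual schema:

```python
from dataclasses import dataclass

# Hypothetical shape of an execution plan, mirroring the solver's decisions.
@dataclass(frozen=True)
class ExecutionPlan:
    gpu_type: str           # e.g. "A10G"
    region: str             # e.g. "us-east-1"
    tensor_parallel: int    # GPUs sharding each layer
    pipeline_parallel: int  # model stages in sequence
    replicas: int           # independent clusters pulling from the queue
    launch_mode: str        # e.g. "on-demand" or "spot"

    def gpus_per_replica(self) -> int:
        return self.tensor_parallel * self.pipeline_parallel

plan = ExecutionPlan("A10G", "us-east-1", 2, 2, 3, "spot")
print(plan.gpus_per_replica() * plan.replicas)  # total GPU count: 12
```

The useful property is that parallelism and replica count multiply: 2-way tensor parallel times 2-way pipeline parallel times 3 replicas is 12 GPUs total.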
Execution model
Input is split into chunks and queued in Redis. Replica clusters pull chunks, run inference through vLLM, upload per-chunk outputs, and mark chunks complete. If a replica dies, its in-flight chunks can be reclaimed and returned to the queue.

tandemn swap keeps the same queue while launching replacement replicas with a different GPU configuration. Old replicas are removed after the new replicas begin processing.
Optional Koi
Orca standalone is the default path. Leave KOI_SERVICE_URL unset and the control plane handles placement, launch, chunked execution, monitoring, recovery, and output assembly itself.
If KOI_SERVICE_URL is set, Tandemn can call Koi for an additional recommendation and send lifecycle callbacks. If Koi is unset, unavailable, times out, or returns bad data, Tandemn continues through the standalone path.
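The fallback logic amounts to: try Koi if configured, otherwise (or on any failure) use the standalone plan. A sketch of that decision, with assumed names and a stand-in client rather than Tandemn's real interfaces:

```python
import os

# Hypothetical decision function: behavior is based on the description above.
def choose_plan(standalone_plan, koi_client=None, timeout_s=5.0):
    koi_url = os.environ.get("KOI_SERVICE_URL")
    if not koi_url or koi_client is None:
        return standalone_plan          # default: standalone path
    try:
        suggestion = koi_client(koi_url, timeout=timeout_s)
        if suggestion:                  # empty/bad data -> fall back
            return suggestion
    except Exception:                   # unavailable or timed out -> fall back
        pass
    return standalone_plan

# With KOI_SERVICE_URL unset, the standalone plan is always chosen.
os.environ.pop("KOI_SERVICE_URL", None)
print(choose_plan({"gpu_type": "A10G"}))
```

The key invariant is that every failure mode of the optional service resolves to the standalone plan, so enabling Koi can never make a job unrunnable.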
Why this model matters
This separation lets infrastructure teams manage hardware and cluster policy centrally while application and ML teams get a simpler job submission interface.

Treat the Tandemn server as shared infrastructure. Keep ownership, deployment, and monitoring responsibilities clear before inviting users onto the cluster.

