Tandemn System helps teams run inference workloads across accelerated infrastructure without hand-tuning every model, GPU, and batch job. It provides a self-hosted control plane that manages orchestration and a CLI for submitting inference jobs. Instead of sending every workload to the newest and most expensive accelerator, Tandemn can route jobs across heterogeneous GPU pools and use the hardware that best matches the workload’s latency and throughput requirements.

Run your first job

Install the server, connect the CLI, and submit a model inference job.

Understand the architecture

Learn how the server, users, GPU workers, and job scheduler fit together.

Set up a cluster

Install the control plane, configure AWS, and start the server.

Use the CLI

Check connectivity and submit jobs from a local Python environment.

What Tandemn is for

Tandemn System is designed for teams that already operate, or plan to operate, accelerated compute and want a simpler way to run inference workloads across that capacity.
  • Infrastructure teams can expose a single service for users instead of asking each team to manage hardware placement.
  • ML and application teams can submit jobs through the CLI without deciding which machine or GPU should run them.
  • Organizations with mixed GPU supply can use available capacity more efficiently across different machines and accelerator types.

How the workflow fits together

1. An administrator starts the Tandemn server

The control plane is deployed on a machine that users and EC2 replicas can reach over the network. It manages cluster state and receives inference job requests.
2. Users install the Tandemn CLI

Users install the tandemn Python package, set the TD_SERVER_URL environment variable, and verify that the CLI can reach the server.
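As a sketch, a connectivity check might look like the following. The /health endpoint and the helper names here are assumptions for illustration, not the CLI's actual interface; consult the CLI reference for the real command.

```python
import os
import urllib.request


def health_url() -> str:
    """Build a health-check URL from the TD_SERVER_URL environment variable."""
    # Default is a placeholder; in practice TD_SERVER_URL should be set.
    base = os.environ.get("TD_SERVER_URL", "http://localhost:8000").rstrip("/")
    return f"{base}/health"  # assumed endpoint path


def check_server(timeout: float = 5.0) -> bool:
    """Return True if the server answers the (assumed) health check."""
    try:
        with urllib.request.urlopen(health_url(), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

A user would set TD_SERVER_URL to the address the administrator published, then call check_server() before submitting any jobs.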
3. Users submit inference jobs

A user provides a model, a JSONL prompt file, and a service-level objective. Tandemn schedules the job across available accelerated resources.
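A JSONL prompt file holds one JSON object per line. The sketch below builds such a file; the {"prompt": ...} key is an assumed schema, so check the CLI reference for the exact field names Tandemn expects.

```python
import json
from pathlib import Path


def write_prompts(path: Path, prompts: list[str]) -> None:
    """Write one JSON object per line (JSONL). The key name is assumed."""
    with path.open("w", encoding="utf-8") as f:
        for text in prompts:
            f.write(json.dumps({"prompt": text}) + "\n")


write_prompts(Path("prompts.jsonl"), [
    "Summarize the quarterly report.",
    "Translate 'hello' into French.",
])
```

The resulting prompts.jsonl is what a user would pass to the CLI alongside the model name and service-level objective.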
4. Tandemn chooses the execution plan

The orchestration layer selects an efficient hardware mix for the workload so users can focus on the job rather than the cluster.

Next step

Start with the Quickstart if you want the shortest path from a new environment to a submitted inference job.