Tandemn users submit jobs with a model identifier, a JSONL workload, and a deadline. The control plane uses the deployment’s available resources and configuration to decide how that workload should run.
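For example, a minimal workload file might look like the lines below. The exact JSONL schema depends on your deployment; the prompt field here is illustrative, not a documented contract:

{"prompt": "Summarize the attached release notes in three bullet points."}
{"prompt": "Classify the sentiment of this review as positive, negative, or neutral."}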

Model identifiers

The quickstart uses a Hugging Face-style model identifier:
tandemn deploy Qwen/Qwen2.5-7B-Instruct prompts.jsonl --slo 4
Use model identifiers that your Tandemn deployment supports. Tandemn can run Hugging Face models that are compatible with vLLM, and for models outside the performance database you can override placement manually:
tandemn deploy <any-hf-model> input.jsonl --gpu A10G --tp 1

Placement solvers

tandemn plan and tandemn deploy can show recommendations from two built-in solvers:
Solver           Description
LLM Advisor      Uses the performance database and an LLM reasoning layer to rank placements by cost, throughput, and SLO feasibility.
Roofline solver  Uses GPU bandwidth, TFLOPS, memory, and model constraints. No API key is required.
If the advisor is unavailable, Tandemn falls back to the roofline solver. If KOI_SERVICE_URL is set, Tandemn can also show an optional Koi recommendation.
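For intuition, the roofline idea can be sketched in a few lines of Python. This is an illustration of the concept, not Tandemn's implementation, and the A10G figures (~600 GB/s memory bandwidth, ~125 fp16 TFLOPS) are approximate spec-sheet values:

# Roofline sketch: decode throughput is bounded by whichever is slower,
# streaming the weights through memory or performing the matmul FLOPs.
PARAM_BYTES = 2  # fp16/bf16 weights

def roofline_tokens_per_sec(params_b, gpu_bw_gbs, gpu_tflops, batch=1):
    weight_bytes = params_b * 1e9 * PARAM_BYTES
    memory_bound = gpu_bw_gbs * 1e9 / weight_bytes * batch            # weights re-read per token
    compute_bound = gpu_tflops * 1e12 / (2 * params_b * 1e9) * batch  # ~2 FLOPs/param/token
    return min(memory_bound, compute_bound)

print(roofline_tokens_per_sec(7, 600, 125))  # 7B model on one A10G: ~43 tokens/s

At batch size 1 the memory bound dominates, which is why a solver of this kind weighs bandwidth heavily for decode-dominated workloads.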

Routing goals

Tandemn’s routing layer is designed to reduce the amount of manual placement work users need to do. Instead of asking each user to pick a specific machine, Tandemn can evaluate the job and choose an appropriate hardware mix.

What affects placement

The exact placement decision depends on deployment-specific configuration, but these are the common inputs to reason about (the sketch after the list shows how a few of them combine):
  • Model size and runtime requirements
  • Prompt file size
  • Requested SLO
  • Available GPUs
  • Current cluster load
  • AWS quota and capacity
  • Spot or on-demand launch mode
  • Tensor and pipeline parallelism settings
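As a rough illustration, the check below combines workload size, an estimated throughput, and the requested SLO into a feasibility test. All names and numbers are invented for illustration and are not Tandemn's API:

def meets_slo(num_prompts, avg_tokens, tokens_per_sec, replicas, slo_hours):
    # Total generation work divided by aggregate throughput vs. the deadline.
    seconds_needed = num_prompts * avg_tokens / (tokens_per_sec * replicas)
    return seconds_needed <= slo_hours * 3600

# 10,000 prompts x ~500 output tokens, two replicas at ~43 tokens/s, 4-hour SLO:
print(meets_slo(10_000, 500, 43, 2, 4))  # False: more capacity or faster GPUs needed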

Supported hardware

GPU          AWS instances                              VRAM
A100 80GB    p4d.24xlarge, p4de.24xlarge                8 x 80GB
H100 80GB    p5.48xlarge                                8 x 80GB
L40S 48GB    g6e.12xlarge, g6e.24xlarge, g6e.48xlarge   4 x / 4 x / 8 x 48GB
A10G 24GB    g5.12xlarge, g5.48xlarge                   4 x / 8 x 24GB
The solver searches across GPU types and parallelism configurations to find a placement that fits the model in memory and meets the requested deadline.
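The memory-fit half of that search can be sketched as follows. This is a simplification with invented names: real solvers also account for KV cache growth, quantization, and pipeline parallelism, and the 20% overhead factor is only a rough rule of thumb:

def fits(params_b, vram_gb, tp, overhead=1.2):
    # fp16 weights (2 bytes/param) plus headroom for KV cache and activations,
    # sharded across tp GPUs.
    return params_b * 2 * overhead <= vram_gb * tp

GPUS = {"A10G": 24, "L40S": 48, "A100": 80, "H100": 80}  # VRAM per GPU, GB

def candidates(params_b):
    for gpu, vram in GPUS.items():
        for tp in (1, 2, 4, 8):
            if fits(params_b, vram, tp):
                yield gpu, tp
                break  # keep the smallest tp that fits for each GPU type

print(list(candidates(14)))  # 14B model: [('A10G', 2), ('L40S', 1), ('A100', 1), ('H100', 1)]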
If a model cannot be scheduled, the first thing to check is whether the model is enabled in the deployment and whether compatible resources are available.