Tandemn users submit jobs with a model identifier, a JSONL workload, and a deadline. The control plane uses the deployment’s available resources and configuration to decide how that workload should run.

Model identifiers

The quickstart uses a Hugging Face-style model identifier:
tandemn deploy Qwen/Qwen2.5-7B-Instruct prompts.jsonl --slo 4
Use model identifiers that your Tandemn deployment supports. Tandemn can run Hugging Face models that are compatible with vLLM. For models outside the performance database, override placement manually:
tandemn deploy <any-hf-model> input.jsonl --gpu A10G --tp 1

Placement solvers

tandemn plan and tandemn deploy can show recommendations from two built-in solvers:
| Solver | Description |
| --- | --- |
| LLM Advisor | Uses the performance database and an LLM reasoning layer to rank placements by cost, throughput, and SLO feasibility. |
| Roofline solver | Uses GPU bandwidth, TFLOPS, memory, and model constraints. No API key is required. |
The LLM Advisor requires ANTHROPIC_API_KEY and the performance database. If either is missing, or the advisor is otherwise unavailable, tandemn plan and tandemn deploy fall back to the roofline solver. If KOI_SERVICE_URL is set, Tandemn can also show an optional Koi recommendation.
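The availability rules above can be summarized in a few lines of Python. This is only a sketch of the documented behavior, not Tandemn's implementation: the function name `available_solvers` and its `db_present` parameter are hypothetical, and only the environment variable names and the database path come from this page.

```python
import os
from pathlib import Path


def available_solvers(env, db_present):
    """Hypothetical helper: list the solvers that could show a recommendation."""
    solvers = []
    # The LLM Advisor needs both an API key and the performance database.
    if env.get("ANTHROPIC_API_KEY") and db_present:
        solvers.append("LLM Advisor")
    # The roofline solver needs no API key, so it is always available.
    solvers.append("Roofline solver")
    # Koi is optional and only applies when a service URL is configured.
    if env.get("KOI_SERVICE_URL"):
        solvers.append("Koi")
    return solvers


# Path taken from the download instructions on this page.
db = Path("LLM_placement_solver/llm_advisor/data/aiconfigurator/data.csv")
print(available_solvers(os.environ, db.is_file()))
```

Run from the server repository, this prints which recommendations to expect given the current environment.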

Performance database

The LLM Advisor uses a performance database of profiled vLLM runs. Download it into the server repository:
curl -L https://github.com/Tandemn-Labs/LLM_placement_solver/releases/download/aiconfigurator-v1/data.csv \
  -o LLM_placement_solver/llm_advisor/data/aiconfigurator/data.csv
The dataset contains 103K profiled vLLM runs across A100, H100, H200, B200, GB200, and L40S GPUs, sourced from NVIDIA Dynamo AIConfigurator.
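After the download, a quick sanity check can confirm that the file is in place and roughly the expected size. The sketch below assumes only that the file is a standard CSV with a header row. The column names are not listed on this page, so it prints whatever header it finds. The helper name `summarize` is hypothetical.

```python
import csv
from pathlib import Path

# Path taken from the curl command above.
DATA = Path("LLM_placement_solver/llm_advisor/data/aiconfigurator/data.csv")


def summarize(lines):
    """Return (header, data_row_count) for CSV text given as an iterable of lines."""
    reader = csv.reader(lines)
    header = next(reader, [])       # first row is assumed to be the header
    rows = sum(1 for _ in reader)   # count the remaining data rows
    return header, rows


if DATA.is_file():
    with DATA.open(newline="") as f:
        header, rows = summarize(f)
    print(f"{rows} rows, columns: {header}")
else:
    print(f"{DATA} not found; run the curl command first")
```

A row count near 103K is consistent with the dataset size quoted above.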

Profiled models

| Model | Params | Type | Profiled GPUs |
| --- | --- | --- | --- |
| Meta-Llama-3.1-8B | 8B | Dense | A100, H100, H200, B200, GB200, L40S |
| Meta-Llama-3.1-70B | 70B | Dense | A100, H100, H200, B200, GB200, L40S |
| Llama-3.1-70B-Instruct-FP8 | 70B | Dense FP8 | A100, H100, H200, B200, GB200, L40S |
| Meta-Llama-3.1-405B | 405B | Dense | A100, H100, H200, B200, GB200 |
| Nemotron-Super-49B | 49B | Dense | A100, H100, H200, B200, GB200, L40S |
| Nemotron-H-56B | 56B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-8B | 8B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-32B | 32B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-32B-FP8 | 32B | Dense FP8 | A100, H100, H200, B200, GB200, L40S |
| Qwen3-30B-A3B | 30B | MoE, 3B active | A100, H100, H200, B200, GB200, L40S |
| Qwen3-235B-A22B | 235B | MoE, 22B active | A100, H100, H200, B200, GB200 |
| Qwen3-235B-A22B-FP8 | 235B | MoE, 22B active, FP8 | A100, H100, H200, B200, GB200 |
| Qwen3-235B-A22B-NVFP4 | 235B | MoE, 22B active, FP4 | A100, H100, H200, B200, GB200 |
| Qwen3-Coder-480B-A35B | 480B | MoE, 35B active | H100, H200, B200, GB200 |
| Nemotron-3-Nano-30B-A3B | 30B | MoE, 3B active | A100, H100, H200, B200, GB200, L40S |
For models not in the database, the advisor estimates throughput by matching model family, size, and I/O profile. Use --gpu, --tp, and --pp when you want to override placement manually.
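The advisor's actual matching logic is not documented on this page. Purely as an illustration of what matching by family and size could look like, the sketch below picks the profiled model in the same family with the closest parameter count. The `PROFILED` list, the family labels, and the function name `closest_profile` are all assumptions made for this example, and the I/O profile is ignored.

```python
# Illustrative only: NOT the advisor's real matching logic.
# A few rows from the profiled-models table: (name, family, params in billions).
# The family labels are assumed for this example.
PROFILED = [
    ("Meta-Llama-3.1-8B", "llama", 8),
    ("Meta-Llama-3.1-70B", "llama", 70),
    ("Qwen3-8B", "qwen", 8),
    ("Qwen3-32B", "qwen", 32),
]


def closest_profile(family, params_b):
    """Pick the profiled model in the same family with the nearest size."""
    same_family = [p for p in PROFILED if p[1] == family]
    if not same_family:
        return None  # no family match; a real advisor would need another fallback
    return min(same_family, key=lambda p: abs(p[2] - params_b))[0]


# A 7B Qwen-family model is not in the table; the nearest profiled entry is Qwen3-8B.
print(closest_profile("qwen", 7))
```

When no reasonable match exists, overriding placement with --gpu, --tp, and --pp is the more predictable option.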

Routing goals

Tandemn’s routing layer is designed to reduce the amount of manual placement work users need to do. Instead of asking each user to pick a specific machine, Tandemn can evaluate the job and choose an appropriate hardware mix.

What affects placement

The exact placement decision depends on deployment-specific configuration, but these are the common inputs to reason about:
  • Model size and runtime requirements
  • Prompt file size
  • Requested SLO
  • Available GPUs
  • Current cluster load
  • AWS quota and capacity
  • Spot or on-demand launch mode
  • Tensor and pipeline parallelism settings

Supported hardware

| GPU | AWS instance | VRAM |
| --- | --- | --- |
| A100 80GB | p4d.24xlarge, p4de.24xlarge | 8 x 80GB |
| H100 80GB | p5.48xlarge | 8 x 80GB |
| L40S 48GB | g6e.12xlarge, g6e.24xlarge, g6e.48xlarge | 4 x / 4 x / 8 x 48GB |
| A10G 24GB | g5.12xlarge, g5.48xlarge | 4 x / 8 x 24GB |
The solver searches across GPU types and parallelism configurations to find a placement that fits the model in memory and meets the requested deadline.
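The two conditions in that sentence, fitting in memory and meeting the deadline, can be approximated with back-of-the-envelope arithmetic. The sketch below is not the solver's logic. It assumes 2 bytes per parameter (FP16/BF16 weights), reserves an assumed 20% of VRAM for KV cache and activations, and treats the deadline as a number of hours, a unit this page does not state. All function names are hypothetical.

```python
# Per-GPU VRAM in GB, taken from the supported-hardware table.
VRAM_GB = {"A100": 80, "H100": 80, "L40S": 48, "A10G": 24}


def weights_gb(params_b, bytes_per_param=2):
    """Approximate weight size in GB; 2 bytes per parameter assumes FP16/BF16."""
    return params_b * bytes_per_param


def fits_in_memory(params_b, gpu, tp=1, pp=1, headroom=0.8):
    """Rough check that the weights fit in a fraction of the combined VRAM.

    `headroom` is an assumed fraction available for weights; the rest is
    left for KV cache and activations.
    """
    total_vram = VRAM_GB[gpu] * tp * pp
    return weights_gb(params_b) <= total_vram * headroom


def required_tokens_per_second(total_tokens, deadline_hours):
    """Throughput needed to finish the workload before the deadline."""
    return total_tokens / (deadline_hours * 3600)


print(fits_in_memory(7, "A10G", tp=1))    # 14 GB of weights vs 19.2 GB usable
print(fits_in_memory(70, "A10G", tp=4))   # 140 GB of weights vs 76.8 GB usable
print(required_tokens_per_second(14_400_000, 4))
```

Under these assumptions a 7B model fits on a single A10G, while a 70B model does not fit on four of them and needs larger GPUs, more parallelism, or a quantized variant.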
If a model cannot be scheduled, first check that the model is enabled in the deployment and that compatible resources are available.