Tandemn users submit jobs with a model identifier, a JSONL workload, and a deadline. The control plane uses the deployment’s available resources and configuration to decide how that workload should run.
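Because the workload is a JSONL file (one JSON value per line), a quick local check before submitting can catch malformed lines early. The sketch below is illustrative only: the helper name `count_jsonl_records` is hypothetical, and it assumes each record is a JSON object. It does not assume any particular field names, since this page does not define the record schema.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Return the number of records in a JSONL file, raising on malformed lines."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
            # Assumption: each record is a JSON object, not a bare string or list.
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            count += 1
    return count
```

Calling `count_jsonl_records("prompts.jsonl")` returns the record count, which is also a useful input when estimating how much work the deadline has to cover.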
Model identifiers
The quickstart uses a Hugging Face-style model identifier:
tandemn deploy Qwen/Qwen2.5-7B-Instruct prompts.jsonl --slo 4
Use model identifiers that your Tandemn deployment supports. Tandemn can run Hugging Face models that are compatible with vLLM. For models outside the performance database, you can override placement manually:
tandemn deploy <any-hf-model> input.jsonl --gpu A10G --tp 1
Placement solvers
tandemn plan and tandemn deploy can show recommendations from two built-in solvers:
| Solver | Description |
|---|---|
| LLM Advisor | Uses the performance database and an LLM reasoning layer to rank placements by cost, throughput, and SLO feasibility. |
| Roofline solver | Uses GPU bandwidth, TFLOPS, memory, and model constraints. No API key is required. |
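To make the roofline idea concrete, here is a minimal sketch of a roofline-style lower bound for one decode step. It is not Tandemn's solver: the names `GpuSpec` and `decode_step_seconds` are hypothetical, the GPU numbers are round illustrative values rather than vendor specifications, and the model ignores KV-cache traffic and inter-GPU communication. It assumes roughly 2 FLOPs per parameter per generated token and that tensor parallelism divides both memory traffic and compute evenly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    mem_bandwidth_gbs: float  # memory bandwidth in GB/s
    peak_tflops: float        # peak dense half-precision TFLOPS


def decode_step_seconds(params_billion, batch_size, gpu, tp=1, bytes_per_param=2.0):
    """Roofline lower bound for one decode step (one new token per sequence)."""
    # Every step streams all weights from GPU memory once.
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Approximation: ~2 FLOPs per parameter per generated token.
    flops = 2.0 * params_billion * 1e9 * batch_size
    memory_time = weight_bytes / (gpu.mem_bandwidth_gbs * 1e9 * tp)
    compute_time = flops / (gpu.peak_tflops * 1e12 * tp)
    # The slower of the two resources bounds the step time.
    return max(memory_time, compute_time)


def decode_tokens_per_second(params_billion, batch_size, gpu, tp=1, bytes_per_param=2.0):
    """Aggregate decode throughput implied by the step-time bound."""
    step = decode_step_seconds(params_billion, batch_size, gpu, tp, bytes_per_param)
    return batch_size / step


# Illustrative, made-up GPU: 2000 GB/s and 300 TFLOPS.
example_gpu = GpuSpec(mem_bandwidth_gbs=2000.0, peak_tflops=300.0)
```

At small batch sizes the memory term dominates, so throughput grows almost linearly with batch size until the compute term takes over. This is why a bandwidth and TFLOPS model can rank placements without profiling data.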
If the advisor is unavailable, Tandemn falls back to the roofline solver. If KOI_SERVICE_URL is set, Tandemn can also show an optional Koi recommendation.
The LLM Advisor requires ANTHROPIC_API_KEY and the performance database. Without those, tandemn plan and tandemn deploy can still use the roofline solver.
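The availability rules above can be summarized in a few lines. This is a sketch of the described behavior, not Tandemn's implementation: the function name `available_solvers` and the returned labels are hypothetical, and `db_path` stands for wherever the performance database lives in your checkout.

```python
import os
from pathlib import Path


def available_solvers(db_path, env=None):
    """List the solvers that could produce a recommendation in this environment."""
    env = os.environ if env is None else env
    solvers = []
    # The LLM Advisor needs both an API key and the performance database.
    if env.get("ANTHROPIC_API_KEY") and Path(db_path).is_file():
        solvers.append("llm_advisor")
    # The roofline solver needs no API key, so it is always available.
    solvers.append("roofline")
    # Koi is optional and only shown when a service URL is configured.
    if env.get("KOI_SERVICE_URL"):
        solvers.append("koi")
    return solvers
```

Passing `env` explicitly makes the check easy to test. Leaving it as `None` reads the real process environment.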
The LLM placement advisor uses a performance database of profiled vLLM runs. Download it into the server repository:
curl -L https://github.com/Tandemn-Labs/LLM_placement_solver/releases/download/aiconfigurator-v1/data.csv \
-o LLM_placement_solver/llm_advisor/data/aiconfigurator/data.csv
The dataset contains 103K profiled vLLM runs across A100, H100, H200, B200, GB200, and L40S GPUs, sourced from NVIDIA Dynamo AIConfigurator.
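After downloading, a quick sanity check is to read the header and count the rows. The sketch below makes no assumptions about column names, which this page does not list. The helper name `summarize_csv` is hypothetical.

```python
import csv


def summarize_csv(path):
    """Return (header, data_row_count) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first row is treated as the header
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count
```

For the file above, the row count should be on the order of the 103K runs mentioned. A much smaller number may indicate a truncated download.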
Profiled models
| Model | Params | Type | Profiled GPUs |
|---|---|---|---|
| Meta-Llama-3.1-8B | 8B | Dense | A100, H100, H200, B200, GB200, L40S |
| Meta-Llama-3.1-70B | 70B | Dense | A100, H100, H200, B200, GB200, L40S |
| Llama-3.1-70B-Instruct-FP8 | 70B | Dense FP8 | A100, H100, H200, B200, GB200, L40S |
| Meta-Llama-3.1-405B | 405B | Dense | A100, H100, H200, B200, GB200 |
| Nemotron-Super-49B | 49B | Dense | A100, H100, H200, B200, GB200, L40S |
| Nemotron-H-56B | 56B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-8B | 8B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-32B | 32B | Dense | A100, H100, H200, B200, GB200, L40S |
| Qwen3-32B-FP8 | 32B | Dense FP8 | A100, H100, H200, B200, GB200, L40S |
| Qwen3-30B-A3B | 30B | MoE, 3B active | A100, H100, H200, B200, GB200, L40S |
| Qwen3-235B-A22B | 235B | MoE, 22B active | A100, H100, H200, B200, GB200 |
| Qwen3-235B-A22B-FP8 | 235B | MoE, 22B active, FP8 | A100, H100, H200, B200, GB200 |
| Qwen3-235B-A22B-NVFP4 | 235B | MoE, 22B active, FP4 | A100, H100, H200, B200, GB200 |
| Qwen3-Coder-480B-A35B | 480B | MoE, 35B active | H100, H200, B200, GB200 |
| Nemotron-3-Nano-30B-A3B | 30B | MoE, 3B active | A100, H100, H200, B200, GB200, L40S |
For models not in the database, the advisor estimates throughput by matching model family, size, and I/O profile. Use --gpu, --tp, and --pp when you want to override placement manually.
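One way to picture the matching step is a nearest-neighbor lookup over the profiled models. The sketch below is an assumption about how such matching could work, not the advisor's actual logic: it matches on family, dense versus MoE, and parameter count only, and leaves out the I/O profile. The names `ProfiledModel` and `closest_profiled` are hypothetical, and the list is a subset of the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfiledModel:
    name: str
    family: str
    params_b: float  # total parameters, in billions
    is_moe: bool


# Subset of the profiled-models table, for illustration.
PROFILED = [
    ProfiledModel("Meta-Llama-3.1-8B", "llama", 8, False),
    ProfiledModel("Meta-Llama-3.1-70B", "llama", 70, False),
    ProfiledModel("Meta-Llama-3.1-405B", "llama", 405, False),
    ProfiledModel("Qwen3-8B", "qwen", 8, False),
    ProfiledModel("Qwen3-32B", "qwen", 32, False),
    ProfiledModel("Qwen3-30B-A3B", "qwen", 30, True),
    ProfiledModel("Qwen3-235B-A22B", "qwen", 235, True),
]


def closest_profiled(family, params_b, is_moe=False, candidates=PROFILED):
    """Pick the profiled model most similar to an unprofiled one."""
    # Prefer the same family and architecture type.
    pool = [m for m in candidates if m.family == family and m.is_moe == is_moe]
    # Otherwise fall back to any model of the same architecture type, then to all.
    if not pool:
        pool = [m for m in candidates if m.is_moe == is_moe] or list(candidates)
    # Within the pool, choose the closest parameter count.
    return min(pool, key=lambda m: abs(m.params_b - params_b))
```

For example, `closest_profiled("qwen", 7).name` returns `"Qwen3-8B"`, and an unprofiled dense family such as `closest_profiled("mistral", 24)` falls back to the nearest dense model by size.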
Routing goals
Tandemn’s routing layer is designed to reduce the amount of manual placement work users need to do. Instead of asking each user to pick a specific machine, Tandemn can evaluate the job and choose an appropriate hardware mix.
What affects placement
The exact placement decision depends on deployment-specific configuration, but these are the common inputs to reason about:
- Model size and runtime requirements
- Prompt file size
- Requested SLO
- Available GPUs
- Current cluster load
- AWS quota and capacity
- Spot or on-demand launch mode
- Tensor and pipeline parallelism settings
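Several of these inputs combine into one number: the aggregate throughput a placement must sustain to finish on time. The sketch below shows that arithmetic. It assumes the deadline is expressed in hours, which this page does not state explicitly, and the function name `required_tokens_per_second` is hypothetical.

```python
def required_tokens_per_second(num_prompts, avg_tokens_per_prompt, deadline_hours):
    """Minimum sustained throughput needed to finish the workload by the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    total_tokens = num_prompts * avg_tokens_per_prompt
    return total_tokens / (deadline_hours * 3600)
```

For instance, 100,000 prompts averaging 1,440 tokens each with a 4-hour deadline need 144,000,000 / 14,400 = 10,000 tokens per second. A tighter deadline or a larger prompt file raises that figure and pushes the choice toward faster GPUs or more parallelism.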
Supported hardware
| GPU | AWS instance | VRAM |
|---|---|---|
| A100 40GB / 80GB | p4d.24xlarge, p4de.24xlarge | 8 x 40GB / 8 x 80GB |
| H100 80GB | p5.48xlarge | 8 x 80GB |
| L40S 48GB | g6e.12xlarge, g6e.24xlarge, g6e.48xlarge | 4 x / 4 x / 8 x 48GB |
| A10G 24GB | g5.12xlarge, g5.48xlarge | 4 x / 8 x 24GB |
The solver searches across GPU types and parallelism configurations to find a placement that fits the model in memory and meets the requested deadline.
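The memory-fit part of that search can be sketched as a loop over GPU types and tensor-parallel degrees. This is a simplified illustration, not the solver itself: the per-GPU VRAM values follow the table above (using the 80GB A100 variant), while `weight_fraction` is a hypothetical headroom factor that reserves part of each GPU for KV cache and activations. The names `GPU_VRAM_GB`, `TP_CHOICES`, and `feasible_placements` are also hypothetical, and the deadline check is omitted.

```python
# Per-GPU VRAM in GB, taken from the supported-hardware table.
GPU_VRAM_GB = {"A100": 80, "H100": 80, "L40S": 48, "A10G": 24}
# Assumption: tensor-parallel degrees up to the 8 GPUs of the largest instances.
TP_CHOICES = (1, 2, 4, 8)


def weight_gb(params_billion, bytes_per_param=2.0):
    """Approximate weight footprint in GB (1e9 params x bytes, per 1e9 bytes)."""
    return params_billion * bytes_per_param


def feasible_placements(params_billion, bytes_per_param=2.0, weight_fraction=0.7):
    """Return (gpu, tp) pairs with the smallest TP that fits the weights per GPU type."""
    needed = weight_gb(params_billion, bytes_per_param)
    placements = []
    for gpu, vram in GPU_VRAM_GB.items():
        for tp in TP_CHOICES:
            # Weights are sharded across tp GPUs; keep headroom on each one.
            if needed <= tp * vram * weight_fraction:
                placements.append((gpu, tp))
                break  # smallest fitting TP for this GPU type
    return placements
```

Under these assumptions a 7B model in 16-bit precision fits on a single A10G, which is consistent with the `--gpu A10G --tp 1` override shown earlier, while a 70B model needs at least 4 x 80GB GPUs or 8 x L40S. A full solver would then filter these candidates by the requested deadline and rank them by cost.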
If a model cannot be scheduled, first check that the model is enabled in the deployment and that compatible resources are available.