Use tandemn plan to preview placement without launching, and tandemn deploy to submit a batch inference job. Both commands take a model name, an input JSONL file or S3 URI, and an SLO deadline.

Usage

tandemn plan <model> <input> [options]
tandemn deploy <model> <input> [options]

Examples

tandemn plan Qwen/Qwen2.5-7B-Instruct examples/workloads/demo_batch.jsonl --slo 4
tandemn deploy Qwen/Qwen2.5-7B-Instruct examples/workloads/demo_batch.jsonl --slo 4
The solver automatically chooses a GPU type, tensor parallelism, pipeline parallelism, region, and launch mode unless you override those values.

Manual placement example:

tandemn deploy Qwen/Qwen2.5-7B-Instruct input.jsonl --gpu A10G --tp 1

Multi-replica chunked execution example:

tandemn deploy Qwen/Qwen2.5-7B-Instruct input.jsonl --gpu A10G --tp 1 --replicas 2 --chunk-size 100
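For scale: with --chunk-size 100, a 1,000-line input is split into 10 chunks that the two replica clusters work through in parallel (the exact chunk-to-replica scheduling is handled by Tandemn).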

Arguments

Argument | Required | Description
--- | --- | ---
model | Yes | Model identifier to run. Use a model supported by your Tandemn deployment.
input | Yes | Local JSONL file or s3://... URI containing the batch workload. Local files are uploaded to S3 automatically.
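For illustration, these two invocations behave the same once the local file has been uploaded (the bucket name below is a placeholder, not a real default):

tandemn plan Qwen/Qwen2.5-7B-Instruct prompts.jsonl --slo 4
tandemn plan Qwen/Qwen2.5-7B-Instruct s3://your-bucket/prompts.jsonl --slo 4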

Options

Flag | Description | Default
--- | --- | ---
--slo <hours> | Deadline. Accepts plain hours (4), fractional hours (0.5h), or minutes (30m). | 4
--max-output-tokens N | Maximum tokens per response. | 1024
--gpu <type> | Override GPU type, such as A100, H100, L40S, or A10G. | Solver-selected
--tp N | Override tensor parallelism. | Solver-selected
--pp N | Override pipeline parallelism. | Solver-selected
--replicas N | Number of replica clusters. | 1
--chunk-size N | Lines per chunk for multi-replica jobs. | 1000
--no-advisor | Skip the LLM advisor and use the roofline solver only. | Disabled
--skip-dangerously | Skip the interactive solver choice and auto-pick the advisor recommendation. | Disabled
--force | Skip feasibility checks and launch anyway. | Disabled
--persist | Keep clusters alive after the job completes. | Disabled
--on-demand | Use on-demand instances instead of spot instances. | Disabled
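For example, these --slo spellings set a four-hour deadline and two equivalent 30-minute deadlines:

tandemn plan Qwen/Qwen2.5-7B-Instruct input.jsonl --slo 4
tandemn plan Qwen/Qwen2.5-7B-Instruct input.jsonl --slo 0.5h
tandemn plan Qwen/Qwen2.5-7B-Instruct input.jsonl --slo 30m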

Placement behavior

Two placement solvers can run when you call tandemn plan or tandemn deploy:
  • LLM Advisor: architecture-aware recommendation over the performance database, with an LLM reasoning layer that ranks candidates by cost, throughput, and SLO feasibility.
  • Roofline solver: deterministic analytical placement based on GPU bandwidth, TFLOPS, memory, and model constraints.
When both solvers run, Tandemn shows their recommendations side by side so you can compare cost, throughput, and SLO feasibility. If the advisor is unavailable, Tandemn falls back to the roofline solver. If KOI_SERVICE_URL is set, Tandemn can also show an optional Koi recommendation alongside the built-in solvers.
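For example, to include the optional Koi recommendation during planning (the URL below is a placeholder for your Koi service endpoint, not a real default):

export KOI_SERVICE_URL=https://koi.example.internal
tandemn plan Qwen/Qwen2.5-7B-Instruct examples/workloads/demo_batch.jsonl --slo 4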

Supported overrides

Use --gpu, --tp, and --pp when you want to force a specific hardware plan. This is useful for models outside the performance database or when you already know the target fleet.
tandemn deploy <any-hf-model> input.jsonl --gpu A10G --tp 1
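The overrides can be combined; the sketch below pins a tensor- and pipeline-parallel layout (the specific values are illustrative, not a tuned recommendation):

tandemn deploy <any-hf-model> input.jsonl --gpu L40S --tp 2 --pp 2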

Prompt file format

Use OpenAI-style batch JSONL. See Input format for the full schema.
prompts.jsonl
{"custom_id":"req-1","method":"POST","url":"/v1/chat/completions","body":{"model":"placeholder","messages":[{"role":"user","content":"What is Tandemn?"}],"max_tokens":256}}

Before submitting

  • Run tandemn check.
  • Confirm the prompt file exists.
  • Confirm the file is valid JSONL.
  • Confirm the model is supported by your deployment.
  • Confirm the S3 upload bucket is configured on the server.
  • Start with a small file before scaling up (a combined pre-flight sketch follows this list).
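A minimal pre-flight sketch tying the local checks together (the filenames and the 10-line smoke slice are illustrative; the S3 bucket and model-support checks are exercised server-side by the plan itself):

tandemn check                                  # environment and connectivity
test -f prompts.jsonl                          # prompt file exists
jq -c . < prompts.jsonl > /dev/null            # file is valid JSONL
head -n 10 prompts.jsonl > smoke.jsonl         # start with a small slice
tandemn plan Qwen/Qwen2.5-7B-Instruct smoke.jsonl --slo 4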