Batch inference is the process of running many inference requests as a workload instead of serving one interactive request at a time. In Tandemn, a batch job usually starts as a prompt file and a model selection. The CLI sends that job to the server, and the server decides how to run it across the available accelerated infrastructure.
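A batch prompt file is a JSONL file: one JSON object per line, with one request per object. The sketch below shows one way to produce and read back such a file with Python's standard `json` module. The record schema (`id` and `prompt` fields) and the helper names `to_jsonl` and `from_jsonl` are hypothetical and chosen for illustration only. Check the Tandemn CLI reference for the fields it actually expects.

```python
import json
from typing import Iterable, List


def to_jsonl(prompts: Iterable[str]) -> str:
    """Serialize prompts as JSONL: one JSON object per line, newline-terminated."""
    lines = []
    for i, prompt in enumerate(prompts):
        # Hypothetical schema: "id" and "prompt" are illustrative field names,
        # not a confirmed Tandemn input format.
        record = {"id": f"req-{i}", "prompt": prompt}
        # json.dumps escapes embedded newlines, so each record stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def from_jsonl(text: str) -> List[dict]:
    """Parse JSONL text back into records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

To write the file to disk, call `Path("prompts.jsonl").write_text(to_jsonl(prompts), encoding="utf-8")` with `pathlib.Path`. Keeping serialization separate from file I/O makes it easy to validate a prompt file before submitting it.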
When batch inference is a good fit
- Offline evaluation jobs
- Dataset labeling or enrichment
- Scheduled summarization, extraction, or classification tasks
- Experiments that can tolerate queueing in exchange for lower cost or better hardware utilization
Why heterogeneous GPUs help
Not every workload needs the newest, largest GPU. Some jobs can run efficiently on smaller or less utilized accelerators. Tandemn is designed to make that resource selection part of the orchestration layer instead of a manual decision for every user.

What users provide
Users typically provide:
- A model identifier
- A JSONL input file
- A service-level objective
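To make the heterogeneous-GPU point concrete, the sketch below combines the three user inputs into a job description and picks the cheapest accelerator that both fits the model in memory and finishes before a deadline. This does not describe Tandemn's actual scheduler. `BatchJob`, `GpuOption`, and `pick_gpu` are hypothetical names. Modeling the service-level objective as a completion deadline in seconds is an assumption. The GPU names, throughputs, and prices are placeholders, not real hardware figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BatchJob:
    # Hypothetical job spec mirroring the user inputs listed above.
    model_id: str        # model identifier
    input_path: str      # path to the JSONL prompt file
    num_prompts: int     # number of records in that file
    deadline_s: float    # SLO modeled as a completion deadline (assumption)


@dataclass(frozen=True)
class GpuOption:
    # Hypothetical accelerator pool entry; values are placeholders, not real specs or prices.
    name: str
    memory_gb: float
    prompts_per_s: float  # assumed throughput for this model on this GPU
    cost_per_hour: float


def pick_gpu(job: BatchJob, model_memory_gb: float,
             options: List[GpuOption]) -> Optional[GpuOption]:
    """Return the cheapest option that fits the model in memory and meets the deadline."""
    best: Optional[GpuOption] = None
    best_cost = float("inf")
    for opt in options:
        if opt.memory_gb < model_memory_gb:
            continue  # weights would not fit on this accelerator
        runtime_s = job.num_prompts / opt.prompts_per_s
        if runtime_s > job.deadline_s:
            continue  # too slow to satisfy the SLO
        cost = opt.cost_per_hour * runtime_s / 3600.0
        if cost < best_cost:
            best, best_cost = opt, cost
    return best


job = BatchJob("org/example-model", "prompts.jsonl", num_prompts=10_000, deadline_s=4 * 3600)
pool = [
    GpuOption("gpu-small", memory_gb=24, prompts_per_s=2.0, cost_per_hour=0.5),
    GpuOption("gpu-large", memory_gb=80, prompts_per_s=8.0, cost_per_hour=3.0),
]
choice = pick_gpu(job, model_memory_gb=16, options=pool)
print(choice.name if choice else "no option fits")  # prints gpu-small
```

With a four-hour deadline, the smaller GPU finishes in about 5,000 seconds at lower total cost, so it wins. Tightening the deadline to one hour rules it out, and `gpu-large` is chosen instead. This greedy, single-device choice is only a simplification. A real orchestrator may also weigh queueing, current utilization, or splitting one job across several devices.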

