POST /v1/chat/completions

Create Chat Completion
curl --request POST \
  --url http://api.tandemn.com/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "casperhansen/deepseek-r1-distill-llama-70b-awq",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_completion_tokens": 100,
  "temperature": 0.7
}'

Example streamed response:

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":"The"}}]}

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":" capital"}}]}

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":" of"}}]}

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":" France"}}]}

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":" is"}}]}

data: {"id":"uuid-here","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.2-1b","choices":[{"index":0,"delta":{"content":" Paris."}}]}

data: [DONE]

Body

application/json
model
string
required

The name of the deployed model to use for the chat completion. Must match a model that has been deployed via the /deploy endpoint.

Example:

"llama-3.2-1b"

stream
boolean
required

Whether to stream the response using Server-Sent Events (SSE). When true, returns partial message deltas as they are generated.

Example:

true

messages
object[]
required

The messages to generate a chat completion for. Supported roles are "system", "user", "assistant", and "tool"; the "developer" role is converted to "system".

Minimum length: 1
Example:
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "Hello, how are you?"
  }
]
max_completion_tokens
integer | null

The maximum number of tokens to generate in the completion. Maps to vLLM's max_tokens parameter.

Required range: x >= 1
Example:

100

temperature
number | null

Sampling temperature to use, between 0 and 2. Higher values make the output more random, while lower values make it more deterministic.

Required range: 0 <= x <= 2
Example:

0.7

top_p
number | null

Nucleus sampling parameter. The model considers only the tokens whose cumulative probability mass falls within top_p; for example, 0.1 means only the tokens comprising the top 10% of probability mass are considered.

Required range: 0 <= x <= 1
Example:

0.95

top_k
integer | null

The number of highest-probability vocabulary tokens to keep for top-k filtering.

Required range: x >= 1
Example:

50

min_p
number | null

Minimum probability for a token to be considered, relative to the most likely token.

Required range: 0 <= x <= 1
Example:

0.05

min_tokens
integer | null

The minimum number of tokens to generate before stopping.

Required range: x >= 0
Example:

10

seed
integer | null

Random seed for reproducible sampling.

Example:

42

frequency_penalty
number | null

Penalizes tokens based on their frequency in the generated text so far. Positive values decrease the likelihood of repetition.

Required range: -2 <= x <= 2
Example:

0.5

repetition_penalty
number | null

Penalizes tokens that have already appeared in the generated text. Values greater than 1.0 discourage repetition, while values below 1.0 encourage it.

Required range: x >= 0
Example:

1.1

presence_penalty
number | null

Penalizes tokens based on whether they appear in the text so far. Positive values encourage the model to talk about new topics.

Required range: -2 <= x <= 2
Example:

0.3

n
enum<integer> | null
default:1

Number of chat completion choices to generate. Currently hardcoded to 1.

Available options:
1
Example:

1

eos_token_id
integer[] | null

List of token IDs that should trigger the end of generation. Maps to vLLM's stop_token_ids parameter.

Example:
[2, 50256]
stop
string[] | null

List of strings that should trigger the end of generation when encountered.

Example:
["\n", "###"]
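
Taken together, a request that sets most of the optional generation controls documented above might look like the sketch below. The values simply reuse the per-field examples and are illustrative rather than recommended defaults; the payload would be sent exactly like the curl request at the top of the page.

# Illustrative request body; every field except model, stream, and messages is optional.
payload = {
    "model": "casperhansen/deepseek-r1-distill-llama-70b-awq",
    "stream": True,
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_completion_tokens": 100,
    "min_tokens": 10,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 50,
    "min_p": 0.05,
    "seed": 42,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.3,
    "repetition_penalty": 1.1,
    "stop": ["\n", "###"],
    "eos_token_id": [2, 50256],
}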

Response

Successful response: returns a stream of chat completion chunks.

The body is a Server-Sent Events (SSE) stream of chat completion chunks. Each chunk is prefixed with "data: " and followed by two newlines, and the stream ends with "data: [DONE]".
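
As a concrete illustration, a minimal Python client for this endpoint might look like the following sketch. It uses the third-party requests library, mirrors the curl example at the top of the page (same URL, model, and body), and adds no authentication header because none appears in that example; adjust for your deployment as needed.

import json
import requests

# Request body mirroring the curl example above.
payload = {
    "model": "casperhansen/deepseek-r1-distill-llama-70b-awq",
    "stream": True,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_completion_tokens": 100,
    "temperature": 0.7,
}

with requests.post(
    "http://api.tandemn.com/v1/chat/completions",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            continue  # skip the blank separator line between events
        if not line.startswith("data: "):
            continue  # ignore anything that is not an SSE data line
        data = line[len("data: "):]
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
print()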

id
string<uuid>

Unique identifier for this chat completion

Example:

"550e8400-e29b-41d4-a716-446655440000"

object
enum<string>

Object type, always "chat.completion.chunk" for streaming

Available options:
chat.completion.chunk
Example:

"chat.completion.chunk"

created
number

Unix timestamp when this chunk was created

Example:

1234567890.123

model
string

The model used for this completion

Example:

"llama-3.2-1b"

choices
object[]

Array of completion choices (currently always 1 choice)
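
To reassemble the complete reply rather than printing deltas as they arrive, the choices[0].delta.content pieces can simply be concatenated across chunks. A sketch, where sse_lines could be the resp.iter_lines(decode_unicode=True) iterator from the client example above (the helper name collect_reply is just for illustration):

import json
from typing import Iterable

def collect_reply(sse_lines: Iterable[str]) -> str:
    # Concatenate choices[0].delta.content across all streamed chunks.
    parts = []
    for line in sse_lines:
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)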