Body
The name of the deployed model to use for the chat completion. Must match a model that has been deployed via the /deploy endpoint.
"llama-3.2-1b"
Whether to stream the response using Server-Sent Events (SSE). When true, returns partial message deltas as they are generated.
true
The messages to generate a chat completion for. Supports roles: "system", "user", "assistant", "tool" (developer role is converted to system).
Minimum length: 1
[
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello, how are you?"
}
]
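For illustration, a minimal streaming request using just these three fields might look like the Python sketch below. The route /v1/chat/completions and the localhost base URL are assumptions; this page does not state the actual path.
# Minimal sketch of a streaming request (assumed route and host).
import requests

payload = {
    "model": "llama-3.2-1b",   # must already be deployed via /deploy
    "stream": True,            # ask for Server-Sent Events
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
}

# stream=True keeps the connection open so SSE chunks can be read as they arrive
response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed URL
    json=payload,
    stream=True,
)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)  # raw "data: {...}" lines; see the Response section for the format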
The maximum number of tokens to generate in the completion. Maps to vLLM's max_tokens parameter.
x >= 1
100
Sampling temperature to use, between 0 and 2. Higher values make the output more random, lower values more deterministic.
0 <= x <= 2
0.7
Nucleus sampling parameter. The model considers only the smallest set of tokens whose cumulative probability mass reaches top_p. E.g., 0.1 means only the tokens comprising the top 10% probability mass are considered.
0 <= x <= 1
0.95
The number of highest probability vocabulary tokens to keep for top-k filtering.
x >= 1
50
Minimum probability for a token to be considered, relative to the most likely token.
0 <= x <= 1
0.05
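Conceptually, top_k, top_p, and min_p each narrow the candidate token set before sampling (temperature rescales the distribution first). The sketch below illustrates that filtering order with made-up probabilities; it is not vLLM's implementation.
# Illustrative sketch of top_k / top_p / min_p filtering (not vLLM's code).
def filter_candidates(probs, top_k=50, top_p=0.95, min_p=0.05):
    # probs: token -> probability, assumed to sum to 1
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # top_k: keep only the k most likely tokens
    ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # min_p: drop tokens whose probability is below min_p * (most likely token's probability)
    threshold = min_p * kept[0][1]
    return [(t, p) for t, p in kept if p >= threshold]

print(filter_candidates({"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}))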
The minimum number of tokens to generate before stopping.
x >= 0
10
Random seed for reproducible sampling.
42
Penalizes tokens based on their frequency in the generated text so far. Positive values decrease the likelihood of repetition.
-2 <= x <= 2
0.5
Penalizes tokens that have already appeared in the generated text. Values > 1.0 discourage repetition; values < 1.0 encourage it.
x >= 0
1.1
Penalizes tokens based on whether they appear in the text so far. Positive values encourage the model to talk about new topics.
-2 <= x <= 2
0.3
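The three penalties differ in how they adjust token scores: frequency_penalty scales with how often a token has appeared, presence_penalty is a flat offset once it has appeared at all, and repetition_penalty is multiplicative. The sketch below follows the common OpenAI/Hugging Face-style formulas and is only an illustration, not this API's implementation.
# Rough sketch of how the three penalties are typically applied to a token's logit.
def penalize(logit, count, frequency_penalty=0.5, presence_penalty=0.3,
             repetition_penalty=1.1):
    # frequency_penalty: subtract in proportion to how often the token appeared
    logit -= frequency_penalty * count
    # presence_penalty: subtract a flat amount if the token appeared at all
    if count > 0:
        logit -= presence_penalty
    # repetition_penalty: divide positive logits (or multiply negative ones)
    # for tokens that already appeared, so values > 1.0 discourage repetition
    if count > 0:
        logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
    return logit

print(penalize(2.0, count=3))   # token seen 3 times: score drops
print(penalize(2.0, count=0))   # unseen token: unchanged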
Number of chat completion choices to generate. Currently hardcoded to 1.
1
1
List of token IDs that should trigger the end of generation. Maps to vLLM's stop_token_ids parameter.
[2, 50256]
List of strings that should trigger the end of generation when encountered.
["\n", "###"]
Response
Successful response - returns a stream of chat completion chunks
Server-Sent Event stream format for chat completion chunks. Each chunk is prefixed with "data: " and followed by two newlines. The stream ends with "data: [DONE]".
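A client therefore strips the "data: " prefix from each event, parses the JSON payload, and stops at the [DONE] sentinel. A minimal parsing sketch, assuming a requests.Response opened with stream=True as in the request sketch above:
# Minimal sketch of consuming the SSE stream described above.
import json

def iter_chunks(response):
    # response: a requests.Response opened with stream=True
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue                       # skip blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":
            break                          # end-of-stream sentinel
        yield json.loads(data)             # one chat.completion.chunk object

# Usage:
# for chunk in iter_chunks(response):
#     print(chunk["id"], chunk["object"])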
Unique identifier for this chat completion
"550e8400-e29b-41d4-a716-446655440000"
Object type, always "chat.completion.chunk" for streaming
chat.completion.chunk
"chat.completion.chunk"
Unix timestamp when this chunk was created
1234567890.123
The model used for this completion
"llama-3.2-1b"
Array of completion choices (currently always 1 choice)