Creates a chat completion for the provided messages using the specified model. This endpoint is compatible with OpenAI’s chat completion API format and supports streaming responses via Server-Sent Events (SSE).
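For orientation, a minimal streaming request might look like the sketch below. The base URL, the /v1/chat/completions route, and the absence of authentication are assumptions, not part of this reference; adjust them to match how your server is deployed. Parameter values mirror the examples in this section.

import requests

# Assumed base URL and route for the OpenAI-compatible endpoint; adjust for your deployment.
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "llama-3.2-1b",   # must match a model deployed via the /deploy endpoint
    "stream": True,            # request Server-Sent Events instead of a single JSON body
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    "max_tokens": 100,
    "temperature": 0.7,
}

# stream=True keeps the connection open so chunks can be read as they arrive
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:                # SSE frames are separated by blank lines
            print(line)         # each non-empty line is "data: {...}" or "data: [DONE]"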
The name of the deployed model to use for the chat completion. Must match a model that has been deployed via the /deploy endpoint.
"llama-3.2-1b"
Whether to stream the response using Server-Sent Events (SSE). When true, returns partial message deltas as they are generated.
true
The messages to generate a chat completion for. Supports roles: "system", "user", "assistant", "tool" (developer role is converted to system).
[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "Hello, how are you?"
  }
]
The maximum number of tokens to generate in the completion. Maps to vLLM's max_tokens parameter.
x >= 1
100
Sampling temperature to use, between 0 and 2. Higher values make the output more random, lower values more deterministic.
0 <= x <= 2
0.7
Nucleus sampling parameter. The model considers tokens with top_p probability mass. E.g., 0.1 means only tokens comprising the top 10% probability mass are considered.
0 <= x <= 1
0.95
The number of highest probability vocabulary tokens to keep for top-k filtering.
x >= 1
50
Minimum probability for a token to be considered, relative to the most likely token.
0 <= x <= 1
0.05
The minimum number of tokens to generate before stopping.
x >= 0
10
Random seed for reproducible sampling.
42
Penalizes tokens based on their frequency in the generated text so far. Positive values decrease likelihood of repetition.
-2 <= x <= 2
0.5
Penalizes tokens that have already appeared in the generated text. Values > 1.0 discourage repetition, < 1.0 encourage it.
x >= 0
1.1
Penalizes tokens based on whether they appear in the text so far. Positive values encourage the model to talk about new topics.
-2 <= x <= 2
0.3
Number of chat completion choices to generate. Currently hardcoded to 1.
1
List of token IDs that should trigger the end of generation. Maps to vLLM's stop_token_ids parameter.
[2, 50256]
List of strings that should trigger the end of generation when encountered.
["\n", "###"]Successful response - returns a stream of chat completion chunks
Server-Sent Event stream format for chat completion chunks. Each chunk is prefixed with "data: " and followed by two newlines. The stream ends with "data: [DONE]".
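Concretely, a stream of two chunks followed by the terminator looks roughly like this on the wire (choice bodies abbreviated; field values mirror the examples below):

data: {"id": "550e8400-e29b-41d4-a716-446655440000", "object": "chat.completion.chunk", "created": 1234567890.123, "model": "llama-3.2-1b", "choices": [...]}

data: {"id": "550e8400-e29b-41d4-a716-446655440000", "object": "chat.completion.chunk", "created": 1234567890.123, "model": "llama-3.2-1b", "choices": [...]}

data: [DONE]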
Unique identifier for this chat completion
"550e8400-e29b-41d4-a716-446655440000"
Object type, always "chat.completion.chunk" for streaming
chat.completion.chunk
"chat.completion.chunk"
Unix timestamp when this chunk was created
1234567890.123
The model used for this completion
"llama-3.2-1b"
Array of completion choices (currently always 1 choice)
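As a sketch of how a client might consume the stream, assuming each choice follows the OpenAI streaming layout with a "delta" object holding the incremental "content" string:

import json

def iter_content(lines):
    """Yield incremental text from an iterator of SSE lines.

    Assumes the OpenAI chat.completion.chunk layout, where each choice
    carries a "delta" object with an optional "content" string.
    """
    for line in lines:
        if not line or not line.startswith("data: "):
            continue                        # skip blank separator lines
        data = line[len("data: "):]
        if data == "[DONE]":                # explicit end-of-stream marker
            return
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Usage with the streaming request shown earlier:
#   text = "".join(iter_content(resp.iter_lines(decode_unicode=True)))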