# Models on kluster.ai

kluster.ai offers a wide variety of open-source models for both real-time and batch inference, with more added regularly. This page covers every model the API supports, along with the API request limits for each.
## Model names

Each model supported by kluster.ai has a unique name that must be used when specifying the model in a request.
| Model | Model API name |
|---|---|
| DeepSeek R1 | `deepseek-ai/DeepSeek-R1` |
| DeepSeek V3 | `deepseek-ai/DeepSeek-V3` |
| DeepSeek V3 0324 | `deepseek-ai/DeepSeek-V3-0324` |
| Gemma 3 27B | `google/gemma-3-27b-it` |
| Llama 3.1 8B | `klusterai/Meta-Llama-3.1-8B-Instruct-Turbo` |
| Llama 3.1 405B | `klusterai/Meta-Llama-3.1-405B-Instruct-Turbo` |
| Llama 3.3 70B | `klusterai/Meta-Llama-3.3-70B-Instruct-Turbo` |
| Llama 4 Maverick 17B 128E | `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` |
| Llama 4 Scout 17B 16E | `meta-llama/Llama-4-Scout-17B-16E-Instruct` |
| Qwen 2.5 7B | `Qwen/Qwen2.5-VL-7B-Instruct` |
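As a sketch of how a model API name is used in practice, the snippet below builds a chat-completion request body. It assumes an OpenAI-style `messages` request shape; the helper name `build_chat_payload` and the `max_tokens` default are illustrative, not part of the kluster.ai API.

```python
import json

def build_chat_payload(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build a chat-completion request body using the exact model API name."""
    return {
        # Must match the "Model API name" column above, character for character.
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_chat_payload("deepseek-ai/DeepSeek-R1", "Summarize batch inference.")
print(json.dumps(payload, indent=2))
```

Requests that use a display name (for example `DeepSeek R1`) instead of the API name will not resolve to a model.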
## Model comparison table

| Model | Main use case | Real-time inference support | Batch inference support | Fine-tuning support | Image analysis | Function calling |
|---|---|---|---|---|---|---|
| DeepSeek R1 | Code generation, complex data analysis | | | | | |
| DeepSeek V3 | Natural language generation, contextually rich writing | | | | | |
| DeepSeek V3 0324 | Natural language generation, contextually rich writing | | | | | |
| Gemma 3 27B | Multilingual applications, image analysis, complex reasoning | | | | | |
| Llama 3.1 8B | Low-latency or simple tasks, cost-efficient inference | | | | | |
| Llama 3.1 405B | Detailed analysis, maximum accuracy | | | | | |
| Llama 3.3 70B | General-purpose AI, balanced cost-performance | | | | | |
| Llama 4 Maverick 17B 128E | Advanced multimodal reasoning; long-context, high-accuracy tasks | | | | | |
| Llama 4 Scout 17B 16E | Efficient multimodal performance; extended context, general tasks | | | | | |
| Qwen 2.5 7B | Document analysis, image-based reasoning, multimodal chat | | | | | |
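For the models listed with image analysis as a use case, a request pairs text with an image in a single user message. The sketch below assumes OpenAI-style multimodal content parts (`text` and `image_url`); the helper name and example URL are illustrative, not part of the kluster.ai API.

```python
def build_image_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build a chat request combining a text prompt with an image reference."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                # A list of content parts mixes modalities in one message.
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

payload = build_image_payload(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    "Describe this diagram.",
    "https://example.com/diagram.png",
)
```

Text-only models reject image content parts, so check the image-analysis column before sending multimodal requests.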
## API request limits

The following limits apply to API requests based on your plan:
| Model | Context size [tokens] | Max output [tokens] | Max batch requests | Concurrent requests | Requests per minute | Hosted fine-tuned models |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 32k | 4k | 1000 | 20 | 30 | 1 |
| DeepSeek V3 | 32k | 4k | 1000 | 20 | 30 | 1 |
| DeepSeek V3 0324 | 32k | 4k | 1000 | 20 | 30 | 1 |
| Gemma 3 27B | 32k | 4k | 1000 | 20 | 30 | 1 |
| Llama 3.1 8B | 32k | 4k | 1000 | 20 | 30 | 1 |
| Llama 3.1 405B | 32k | 4k | 1000 | 20 | 30 | 1 |
| Llama 3.3 70B | 32k | 4k | 1000 | 20 | 30 | 1 |
| Llama 4 Maverick 17B 128E | 32k | 4k | 1000 | 20 | 30 | 1 |
| Llama 4 Scout 17B 16E | 32k | 4k | 1000 | 20 | 30 | 1 |
| Qwen 2.5 7B | 32k | 4k | 1000 | 20 | 30 | 1 |
| Model | Context size [tokens] | Max output [tokens] | Max batch requests | Concurrent requests | Requests per minute | Hosted fine-tuned models |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 164k | 164k | 100k | 100 | 600 | 10 |
| DeepSeek V3 | 164k | 164k | 100k | 100 | 600 | 10 |
| DeepSeek V3 0324 | 164k | 164k | 100k | 100 | 600 | 10 |
| Gemma 3 27B | 64k | 8k | 100k | 100 | 600 | 10 |
| Llama 3.1 8B | 131k | 131k | 100k | 100 | 600 | 10 |
| Llama 3.1 405B | 131k | 131k | 100k | 100 | 600 | 10 |
| Llama 3.3 70B | 131k | 131k | 100k | 100 | 600 | 10 |
| Llama 4 Maverick 17B 128E | 1M | 1M | 100k | 100 | 600 | 10 |
| Llama 4 Scout 17B 16E | 131k | 131k | 100k | 100 | 600 | 10 |
| Qwen 2.5 7B | 32k | 32k | 100k | 100 | 600 | 10 |
| Model | Context size [tokens] | Max output [tokens] | Max batch requests | Concurrent requests | Requests per minute | Hosted fine-tuned models |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 164k | 164k | 500k | 100 | 1200 | 25 |
| DeepSeek V3 | 164k | 164k | 500k | 100 | 1200 | 25 |
| DeepSeek V3 0324 | 164k | 164k | 500k | 100 | 1200 | 25 |
| Gemma 3 27B | 64k | 8k | 500k | 100 | 1200 | 25 |
| Llama 3.1 8B | 131k | 131k | 500k | 100 | 1200 | 25 |
| Llama 3.1 405B | 131k | 131k | 500k | 100 | 1200 | 25 |
| Llama 3.3 70B | 131k | 131k | 500k | 100 | 1200 | 25 |
| Llama 4 Maverick 17B 128E | 1M | 1M | 500k | 100 | 1200 | 25 |
| Llama 4 Scout 17B 16E | 131k | 131k | 500k | 100 | 1200 | 25 |
| Qwen 2.5 7B | 32k | 32k | 500k | 100 | 1200 | 25 |
| Model | Context size [tokens] | Max output [tokens] | Max batch requests | Concurrent requests | Requests per minute | Hosted fine-tuned models |
|---|---|---|---|---|---|---|
| DeepSeek R1 | 164k | 164k | Unlimited | 100 | Unlimited | Unlimited |
| DeepSeek V3 | 164k | 164k | Unlimited | 100 | Unlimited | Unlimited |
| DeepSeek V3 0324 | 164k | 164k | Unlimited | 100 | Unlimited | Unlimited |
| Gemma 3 27B | 64k | 8k | Unlimited | 100 | Unlimited | Unlimited |
| Llama 3.1 8B | 131k | 131k | Unlimited | 100 | Unlimited | Unlimited |
| Llama 3.1 405B | 131k | 131k | Unlimited | 100 | Unlimited | Unlimited |
| Llama 3.3 70B | 131k | 131k | Unlimited | 100 | Unlimited | Unlimited |
| Llama 4 Maverick 17B 128E | 1M | 1M | Unlimited | 100 | Unlimited | Unlimited |
| Llama 4 Scout 17B 16E | 131k | 131k | Unlimited | 100 | Unlimited | Unlimited |
| Qwen 2.5 7B | 32k | 32k | Unlimited | 100 | Unlimited | Unlimited |
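One way to stay under a requests-per-minute quota client-side is a sliding-window limiter that tracks recent request timestamps. The sketch below is a minimal illustration, not part of any kluster.ai SDK; the quota value passed in should match your plan's "Requests per minute" column above.

```python
import time
from collections import deque

class MinuteRateLimiter:
    """Client-side sliding-window limiter for a per-minute request quota."""

    def __init__(self, max_per_minute: int, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock      # injectable clock, useful for testing
        self.sent = deque()     # timestamps of requests within the last 60 s

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60:
            self.sent.popleft()
        if len(self.sent) < self.max_per_minute:
            return 0.0
        # Window is full: wait until the oldest request leaves the window.
        return 60 - (now - self.sent[0])

    def record(self) -> None:
        """Call after each request is actually sent."""
        self.sent.append(self.clock())
```

Before each API call, sleep for `wait_time()` seconds if it is nonzero, then call `record()`; the concurrent-requests limit would need separate handling (for example a semaphore).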