Integrating LiteLLM with the kluster.ai API#
This guide shows you how to integrate LiteLLM—an open-source library providing unified access to 100+ large language models—with the kluster.ai API. You can seamlessly develop and deploy robust, AI-driven applications by combining LiteLLM's load balancing, fallback logic, and spend tracking with kluster.ai's powerful models.
Prerequisites#
Before starting, ensure you have the following:
- A Python virtual environment - This is optional but recommended. Make sure to activate your virtual environment before following along with this tutorial
- LiteLLM installed - to install the library, use the following command:

```bash
pip install litellm
```
- A kluster.ai account - sign up on the kluster.ai platform if you don't have one
- A kluster.ai API key - after signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
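If you'd like to confirm the installation before moving on, a quick check from Python (using only the standard library's `importlib.metadata`, so no extra dependencies) looks like this:

```python
# Optional sanity check: confirm the litellm package is installed and show its version
from importlib.metadata import version

print(version("litellm"))
```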
Integrate with LiteLLM#
In this section, you'll learn how to integrate kluster.ai with LiteLLM. You’ll configure your environment variables, specify a kluster.ai model, and make a simple request using LiteLLM’s OpenAI-like interface.
- Import LiteLLM and its dependencies - Create a new file (e.g., `hello-litellm.py`) and start by importing the necessary Python modules:

```python
import os
from litellm import completion
```
- Set your kluster.ai API key and base URL - Replace `INSERT_KLUSTER_API_KEY` with your actual API key. If you don't have one yet, refer to the Get an API key guide:

```python
# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
```
- Define your conversation (system + user messages) - Set up your initial system prompt and user message. The system message defines your AI assistant's role, while the user message is the actual question or prompt:

```python
# Basic Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of California?"}
]
```
- Select your kluster.ai model - Choose the kluster.ai model that best fits your use case. Prepend the model name with `openai/` so LiteLLM recognizes it as an OpenAI-like model request:

```python
# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
```
- Call the LiteLLM completion function - Finally, invoke the `completion` function to send your request:

```python
response = completion(
    model=model,
    messages=messages,
    max_tokens=1000,
)
print(response)
```
You can view the full `hello-litellm.py` file below:
```python
import os
from litellm import completion

# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

# Basic Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of California?"}
]

# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

response = completion(
    model=model,
    messages=messages,
    max_tokens=1000,
)
print(response)
```
Use the following command to run your script:

```bash
python hello-litellm.py
```
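The script prints the entire response object, which includes metadata such as token usage. If you only want the assistant's text, you can index into the response, which follows the familiar OpenAI chat-completion shape. A minimal sketch:

```python
# Print just the assistant's reply instead of the full response object
print(response.choices[0].message.content)
```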
That's it! You've successfully integrated LiteLLM with the kluster.ai API. Continue on to learn how to experiment with more advanced features of LiteLLM.
Exploring LiteLLM Features#
In the previous section, you learned how to use LiteLLM with the kluster.ai API by configuring the API key and base URL and specifying a kluster.ai model via an OpenAI-like call. This section dives deeper into some of the interesting features LiteLLM offers and how you can use them in conjunction with the kluster.ai API.
To set up the demo file, create a new Python file, then take the following steps:
- Import LiteLLM and its dependencies:

```python
import os
import litellm.exceptions
from litellm import completion
```
- Set your kluster.ai API key and base URL:

```python
# Set environment variables for kluster.ai
os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY"  # Replace with your key
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
```
- Set your desired kluster.ai model. The rest of the demo lives inside a `main()` function:

```python
def main():
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
```
- Define the system prompt and your first user message:

```python
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain the significance of the California Gold Rush."},
    ]
```
Streaming Responses#
You can enable streaming by simply passing `stream=True` to the `completion()` function. This returns a generator instead of a static response, letting you iterate over partial output chunks as they arrive. In the code sample below, each chunk is accessed in a `for chunk in response_stream:` loop, and you can extract just the textual content (e.g., `chunk.choices[0].delta.content`) rather than printing all metadata.
To configure a streaming response, take the following steps:
- Initiate a streaming request to the model by setting `stream=True` in the `completion()` function. This tells LiteLLM to return partial pieces (chunks) of the response as they become available, rather than waiting for the entire response to be ready:

```python
    # --- 1) STREAMING CALL: Only print chunk text --------------------------------
    try:
        response_stream = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.3,
            stream=True,  # streaming enabled
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("\n--------- STREAMING RESPONSE (text only) ---------")
    streamed_text = []
```
- If you simply print all of the streamed data, the output includes a lot of noise, such as token counts and other metadata. For readability, you probably want just the text content of the response. Isolate it from the rest of the streamed response with the following code:

```python
    # Iterate over each chunk from the streaming generator
    for chunk in response_stream:
        if hasattr(chunk, "choices") and chunk.choices:
            # If the content is None, we replace it with "" (empty string)
            partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
            streamed_text.append(partial_text)
            print(partial_text, end="", flush=True)

    print("\n")  # new line after streaming ends
```
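If you end up streaming responses in more than one place, the chunk handling above can be wrapped in a small reusable helper. The sketch below is illustrative only (the `iter_text` name is not part of LiteLLM) and would live at module level, outside `main()`:

```python
def iter_text(stream):
    """Yield only the text content from a LiteLLM streaming response."""
    for chunk in stream:
        if getattr(chunk, "choices", None):
            # delta.content can be None for some chunks, so fall back to ""
            yield getattr(chunk.choices[0].delta, "content", "") or ""


# Usage inside main(): streamed_text = list(iter_text(response_stream))
```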
Multi-Turn Conversation Handling#
LiteLLM can facilitate multi-turn conversations by maintaining message history in a sequential chain, enabling the model to consider the context of previous messages. This section demonstrates multi-turn conversation handling by updating the messages list each time we receive a new response from the assistant. This pattern can be repeated for as many turns as you need, continuously appending messages to maintain the conversational flow.
Let's take a closer look at each step:
- First, we need to combine the streamed chunks of the first message. Since they were streamed, they need to be re-assembled into a single message. After collecting partial responses in `streamed_text`, join them into a single string called `complete_first_answer`:

```python
    # Combine the partial chunks into one string
    complete_first_answer = "".join(streamed_text)
```
- Next, append the assistant's reply to enhance the context of the conversation. Add this `complete_first_answer` back into `messages` under the "assistant" role as follows:

```python
    # Append the entire first answer to the conversation for multi-turn context
    messages.append({"role": "assistant", "content": complete_first_answer})
```
- Then, craft the second message to the assistant. Append a new message object to `messages` with the user's next question as follows:

```python
    # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
    messages.append({
        "role": "user",
        "content": (
            "Thanks for that. Can you propose a short, 3-minute presentation outline "
            "about the Gold Rush, focusing on its broader implications?"
        ),
    })
```
- Now, ask the model for a response to the second question, this time without streaming enabled. Pass the updated `messages` to `completion()` with `stream=False`, prompting LiteLLM to generate a standard (single-shot) response as follows:

```python
    try:
        response_2 = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.6,
            stream=False,  # non-streamed
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return
```
- Finally, parse and print the second answer. Extract `response_2.choices[0].message["content"]`, store it in `second_answer_text`, and print it to the console for your final output:

```python
    print("--------- RESPONSE 2 (non-streamed, text only) ---------")
    second_answer_text = ""
    if response_2.choices and hasattr(response_2.choices[0], "message"):
        second_answer_text = response_2.choices[0].message.get("content", "") or ""
    print(second_answer_text)
```
Putting it All Together#
You can find the full code file below, which compares a streamed response with a regular response and demonstrates multi-turn conversation handling.
litellm-features.py
```python
import os
import litellm.exceptions
from litellm import completion

# Set environment variables for kluster.ai
os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY"  # Replace with your key
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"


def main():
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain the significance of the California Gold Rush."},
    ]

    # --- 1) STREAMING CALL: Only print chunk text --------------------------------
    try:
        response_stream = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.3,
            stream=True,  # streaming enabled
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("\n--------- STREAMING RESPONSE (text only) ---------")
    streamed_text = []

    # Iterate over each chunk from the streaming generator
    for chunk in response_stream:
        if hasattr(chunk, "choices") and chunk.choices:
            # If the content is None, we replace it with "" (empty string)
            partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
            streamed_text.append(partial_text)
            print(partial_text, end="", flush=True)

    print("\n")  # new line after streaming ends

    # Combine the partial chunks into one string
    complete_first_answer = "".join(streamed_text)

    # Append the entire first answer to the conversation for multi-turn context
    messages.append({"role": "assistant", "content": complete_first_answer})

    # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
    messages.append({
        "role": "user",
        "content": (
            "Thanks for that. Can you propose a short, 3-minute presentation outline "
            "about the Gold Rush, focusing on its broader implications?"
        ),
    })

    try:
        response_2 = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.6,
            stream=False,  # non-streamed
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("--------- RESPONSE 2 (non-streamed, text only) ---------")
    second_answer_text = ""
    if response_2.choices and hasattr(response_2.choices[0], "message"):
        second_answer_text = response_2.choices[0].message.get("content", "") or ""
    print(second_answer_text)


if __name__ == "__main__":
    main()
```
When you run the script, the first answer streams to your terminal chunk by chunk, followed by the second, non-streamed answer. Both responses may appear to trail off abruptly, but that's because we limited the output to 300 tokens each. Feel free to tweak the parameters and rerun the script at your leisure!
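If you want to go further, the same kluster.ai setup also plugs into LiteLLM's Router, which provides the load balancing and fallback features mentioned at the start of this guide. The sketch below is illustrative only: the `kluster-llama-3.3-70b` alias is arbitrary, and you should double-check the Router options against the LiteLLM documentation for the version you have installed.

```python
from litellm import Router

# Register the kluster.ai deployment with the Router (the alias name is arbitrary)
router = Router(
    model_list=[
        {
            "model_name": "kluster-llama-3.3-70b",
            "litellm_params": {
                "model": "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo",
                "api_key": "INSERT_KLUSTER_API_KEY",
                "api_base": "https://api.kluster.ai/v1",
            },
        },
    ]
)

# Call the model through the Router using the alias defined above
response = router.completion(
    model="kluster-llama-3.3-70b",
    messages=[{"role": "user", "content": "What is the capital of California?"}],
)
print(response.choices[0].message.content)
```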