Integrate LiteLLM with kluster.ai#
LiteLLM is an open-source Python library that streamlines access to a broad range of Large Language Model (LLM) providers through a standardized interface inspired by the OpenAI format. By providing features like fallback mechanisms, cost tracking, and streaming support, LiteLLM reduces the complexity of working with different models, ensuring a more reliable and cost-effective approach to AI-driven applications.
Integrating LiteLLM with the kluster.ai API enables the use of kluster.ai's powerful models alongside LiteLLM's flexible orchestration. This combination makes it simple to switch between models on the fly, handle token usage limits with context window fallback, and monitor usage costs in real-time—leading to robust, scalable, and adaptable AI workflows.
Prerequisites#
Before starting, ensure you have the following:
- A kluster.ai account - sign up on the kluster.ai platform if you don't have one
- A kluster.ai API key - after signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
- A Python virtual environment - this is optional but recommended. Ensure that you activate the virtual environment before following along with this tutorial
- LiteLLM installed - to install the library, use the following command:
pip install litellm
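If you want to confirm the installation and see which version was installed, you can optionally inspect the package metadata:
pip show litellm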
Configure LiteLLM#
In this section, you'll learn how to integrate kluster.ai with LiteLLM. You'll configure your environment variables, specify a kluster.ai model, and make a simple request using LiteLLM's OpenAI-like interface.
- Import LiteLLM and its dependencies - create a new file (e.g., hello-litellm.py) and start by importing the necessary Python modules:
import os
from litellm import completion
- Set your kluster.ai API key and base URL - replace INSERT_KLUSTER_API_KEY with your actual API key. If you don't have one yet, refer to the Get an API key guide:
# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
- Define your conversation (system + user messages) - set up your initial system prompt and user message. The system message defines your AI assistant's role, while the user message is the actual question or prompt:
# Basic Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of California?"}
]
- Select your kluster.ai model - choose one of kluster.ai's available models that best fits your use case. Prepend the model name with openai/ so LiteLLM recognizes it as an OpenAI-like model request:
# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
- Call the LiteLLM completion function - finally, invoke the completion function to send your request:
response = completion(
    model=model,
    messages=messages,
    max_tokens=1000,
)
print(response)
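The print(response) call above dumps the entire response object, including metadata such as token usage. If you only want the assistant's reply, you can read the message content directly; a minimal sketch, relying on the OpenAI-compatible response shape LiteLLM returns:
# Print only the assistant's text instead of the full response object
print(response.choices[0].message.content)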
View complete script
import os
from litellm import completion
# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
# Basic Chat
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of California?"}
]
# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
response = completion(
model=model,
messages=messages,
max_tokens=1000,
)
print(response)
Use the following command to run your script:
python hello-litellm.py
That's it! You've successfully integrated LiteLLM with the kluster.ai API. Continue to learn how to experiment with more advanced features of LiteLLM.
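As a quick taste of those features, the response object you just printed already reports token usage, and LiteLLM's completion_cost helper can estimate spend when it has pricing data for the model. The sketch below reuses the model and messages variables from the script above; whether pricing data exists for kluster.ai-hosted models is an assumption, so the cost lookup is wrapped in a try/except:
from litellm import completion, completion_cost

response = completion(model=model, messages=messages, max_tokens=1000)

# Token usage is reported on the OpenAI-compatible response object
print(f"Total tokens: {response.usage.total_tokens}")

# completion_cost() relies on LiteLLM's pricing map; custom models may not be listed
try:
    print(f"Estimated cost (USD): {completion_cost(completion_response=response)}")
except Exception as err:
    print(f"No pricing data available for this model: {err}")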
Explore LiteLLM features#
In the previous section, you configured LiteLLM to call the kluster.ai API by setting the API key and base URL and prefixing the model name so it is treated as an OpenAI-like call. The following sections demonstrate LiteLLM's streaming responses and multi-turn conversation handling with the kluster.ai API.
This guide assumes you just finished the configuration exercise above. If you haven't already done so, complete the steps in the Configure LiteLLM section before you continue.
Use streaming responses#
You can enable streaming by simply passing stream=True to the completion() function. Streaming returns a generator instead of a static response, letting you iterate over partial output chunks as they arrive. In the code sample below, each chunk is accessed in a for loop, allowing you to extract the textual content (e.g., chunk.choices[0].delta.content) rather than printing all metadata.
To configure a streaming response, take the following steps:
- Update the messages system prompt and first user message - you can supply a user message or use the sample provided:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the significance of the California Gold Rush."},
]
- Initiate a streaming request to the model - set stream=True in the completion() function to tell LiteLLM to return partial pieces (chunks) of the response as they become available rather than waiting for the entire response to be ready:
# --- 1) STREAMING CALL: Only print chunk text --------------------------------
try:
    response_stream = completion(
        model=model,
        messages=messages,
        max_tokens=300,
        temperature=0.3,
        stream=True,  # streaming enabled
    )
except Exception as err:
    print(f"Error calling model: {err}")
    return

print("\n--------- STREAMING RESPONSE (text only) ---------")
streamed_text = []
- Isolate the returned text content - printing the streamed chunks in full includes a lot of noise, such as token counts and other metadata. You can isolate the text content from the rest of the streamed response with the following code:
# Iterate over each chunk from the streaming generator
for chunk in response_stream:
    if hasattr(chunk, "choices") and chunk.choices:
        # If the content is None, we replace it with "" (empty string)
        partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
        streamed_text.append(partial_text)
        print(partial_text, end="", flush=True)

print("\n")  # new line after streaming ends
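If you stream in more than one place, you may prefer to factor the chunk handling into a reusable generator. The helper below (iter_text) is illustrative rather than part of the original script, and assumes the same chunk shape and the model and messages variables defined above:
from litellm import completion

def iter_text(stream):
    """Yield only the text deltas from a LiteLLM streaming response."""
    for chunk in stream:
        if getattr(chunk, "choices", None):
            yield getattr(chunk.choices[0].delta, "content", None) or ""

# Usage: print the text as it arrives while keeping a copy for later turns
response_stream = completion(model=model, messages=messages, max_tokens=300, stream=True)
streamed_text = []
for piece in iter_text(response_stream):
    streamed_text.append(piece)
    print(piece, end="", flush=True)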
Handle multi-turn conversation#
LiteLLM can facilitate multi-turn conversations by maintaining message history in a sequential chain, enabling the model to consider the context of previous messages. This section demonstrates multi-turn conversation handling by updating the messages list each time we receive a new response from the assistant. This pattern can be repeated for as many turns as you need, continuously appending messages to maintain the conversational flow.
Let's take a closer look at each step:
- Combine the streamed chunks of the first message - since the message is streamed in chunks, you must re-assemble them into a single message. After collecting partial responses in streamed_text, join them into a single string called complete_first_answer:
# Combine the partial chunks into one string
complete_first_answer = "".join(streamed_text)
- Append the assistant's reply - to preserve the context of the conversation, add complete_first_answer back into messages under the "assistant" role as follows:
# Append the entire first answer to the conversation for multi-turn context
messages.append({"role": "assistant", "content": complete_first_answer})
- Craft the second message to the assistant - append a new message object to messages with the user's next question as follows:
# --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
messages.append({
    "role": "user",
    "content": (
        "Thanks for that. Can you propose a short, 3-minute presentation outline "
        "about the Gold Rush, focusing on its broader implications?"
    ),
})
- Ask the model to respond to the second question - this time, don't enable the streaming feature. Pass the updated messages to completion() with stream=False, prompting LiteLLM to generate a standard (single-shot) response as follows:
try:
    response_2 = completion(
        model=model,
        messages=messages,
        max_tokens=300,
        temperature=0.6,
        stream=False,  # non-streamed
    )
except Exception as err:
    print(f"Error calling model: {err}")
    return
- Parse and print the second answer - extract response_2.choices[0].message["content"], store it in second_answer_text, and print it to the console for your final output:
print("--------- RESPONSE 2 (non-streamed, text only) ---------")
second_answer_text = ""
if response_2.choices and hasattr(response_2.choices[0], "message"):
    second_answer_text = response_2.choices[0].message.get("content", "") or ""
print(second_answer_text)
You can view the full script below. It demonstrates a streamed response versus a regular response and how to handle a multi-turn conversation.
View complete script
import os
import litellm.exceptions
from litellm import completion
# Set environment variables for kluster.ai
os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY" # Replace with your key
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
def main():
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain the significance of the California Gold Rush."},
    ]

    # --- 1) STREAMING CALL: Only print chunk text --------------------------------
    try:
        response_stream = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.3,
            stream=True,  # streaming enabled
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("\n--------- STREAMING RESPONSE (text only) ---------")
    streamed_text = []

    # Iterate over each chunk from the streaming generator
    for chunk in response_stream:
        if hasattr(chunk, "choices") and chunk.choices:
            # If the content is None, we replace it with "" (empty string)
            partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
            streamed_text.append(partial_text)
            print(partial_text, end="", flush=True)

    print("\n")  # new line after streaming ends

    # Combine the partial chunks into one string
    complete_first_answer = "".join(streamed_text)

    # Append the entire first answer to the conversation for multi-turn context
    messages.append({"role": "assistant", "content": complete_first_answer})

    # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
    messages.append({
        "role": "user",
        "content": (
            "Thanks for that. Can you propose a short, 3-minute presentation outline "
            "about the Gold Rush, focusing on its broader implications?"
        ),
    })

    try:
        response_2 = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.6,
            stream=False,  # non-streamed
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("--------- RESPONSE 2 (non-streamed, text only) ---------")
    second_answer_text = ""
    if response_2.choices and hasattr(response_2.choices[0], "message"):
        second_answer_text = response_2.choices[0].message.get("content", "") or ""
    print(second_answer_text)


if __name__ == "__main__":
    main()
Put it all together#
Use the following command to run your script:
python hello-litellm.py
When the script runs, you'll see the first answer streamed to the console in real time, followed by the complete text of the second, non-streamed response. Both responses appear to trail off abruptly, but that's because we limited the output to 300 tokens each. Feel free to tweak the parameters and rerun the script at your leisure!