Integrating LiteLLM with the kluster.ai API

This guide shows you how to integrate LiteLLM—an open-source library providing unified access to 100+ large language models—with the kluster.ai API. You can seamlessly develop and deploy robust, AI-driven applications by combining LiteLLM's load balancing, fallback logic, and spend tracking with kluster.ai's powerful models.

Prerequisites

Before starting, ensure you have the following:

  • A Python virtual environment - This is optional but recommended. Activate the virtual environment before following along with this tutorial; a minimal setup example appears after this list
  • LiteLLM installed - To install the library, use the following command:

    pip install litellm
    
  • A kluster.ai account - Sign up on the kluster.ai platform if you don't have one

  • A kluster.ai API key - After signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
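
If you opt for a virtual environment, the setup only takes a couple of commands (the .venv directory name below is just an example):

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate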

Integrate with LiteLLM

In this section, you'll learn how to integrate kluster.ai with LiteLLM. You’ll configure your environment variables, specify a kluster.ai model, and make a simple request using LiteLLM’s OpenAI-like interface.

  1. Import LiteLLM and its dependencies - Create a new file (e.g., hello-litellm.py) and start by importing the necessary Python modules:
    import os
    
    from litellm import completion
    
  2. Set your kluster.ai API key and base URL - Replace INSERT_KLUSTER_API_KEY with your actual API key. If you don't have one yet, refer to the Get an API key guide
    # Set environment vars, shown in script for readability
    os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
    os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
    
  3. Define your conversation (system + user messages) - Set up your initial system prompt and user message. The system message defines your AI assistant’s role, while the user message is the actual question or prompt
    # Basic Chat
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What is the capital of California?"}
    ]
    
  4. Select your kluster.ai model - Choose one of the kluster.ai models that best fits your use case. Prepend the model name with openai/ so LiteLLM recognizes it as an OpenAI-like model request.
    # Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
    
  5. Call the LiteLLM completion function - Finally, invoke the completion function to send your request:
    response = completion(
        model=model,
        messages=messages,
        max_tokens=1000, 
    )
    
    print(response)
    
View full code file
import os

from litellm import completion

# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

# Basic Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the capital of California?"}
]

# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

response = completion(
    model=model,
    messages=messages,
    max_tokens=1000, 
)

print(response)

Use the following command to run your script:

python hello-litellm.py
ModelResponse(id='chatcmpl-9877dfe6-6f1d-483f-a392-d791b89c75d6', created=1739495162, model='klusterai/Meta-Llama-3.3-70B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='The capital of California is Sacramento.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, refusal=None))], usage=Usage(completion_tokens=8, prompt_tokens=48, total_tokens=56, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt_logprobs=None)
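
The full ModelResponse object is fairly verbose. If you only want the assistant's reply, you can print just the message content from the first choice (an optional tweak that mirrors the pattern used later in this guide):

# Optional: print only the assistant's text instead of the full ModelResponse
print(response.choices[0].message.get("content", ""))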

That's it! You've successfully integrated LiteLLM with the kluster.ai API. Continue on to learn how to experiment with more advanced features of LiteLLM.

Exploring LiteLLM Features

In the previous section, you learned how to use LiteLLM with the kluster.ai API by setting the API key and base URL and specifying the model via an OpenAI-like call. This section dives deeper into some of the features offered by LiteLLM and how you can use them in conjunction with the kluster.ai API.

To set up the demo file, create a new Python file (e.g., litellm-features.py), then take the following steps:

  1. Import LiteLLM and its dependencies:
    import os
    
    import litellm.exceptions
    from litellm import completion
    
  2. Set your kluster.ai API key and base URL:
    # Set environment variables for kluster.ai
    os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY"  # Replace with your key
    os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
    
  3. Define a main function and set your desired kluster.ai model:
    def main():
        model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
    
  4. Define the system prompt and your first user message:
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user",   "content": "Explain the significance of the California Gold Rush."},
        ]
    

Streaming Responses

You can enable streaming by passing stream=True to the completion() function. This returns a generator instead of a static response, letting you iterate over partial output chunks as they arrive. In the code sample below, each chunk is accessed in a for chunk in response_stream: loop, and you can extract just the textual content (e.g., chunk.choices[0].delta.content) rather than printing all metadata.

To configure a streaming response, take the following steps:

  1. Initiate a streaming request to the model by setting stream=True in the completion() function. This tells LiteLLM to return partial pieces (chunks) of the response as they become available, rather than waiting for the entire response to be ready.
        # --- 1) STREAMING CALL: Only print chunk text --------------------------------
        try:
            response_stream = completion(
                model=model,
                messages=messages,
                max_tokens=300,
                temperature=0.3,
                stream=True,  # streaming enabled
            )
        except Exception as err:
            print(f"Error calling model: {err}")
            return
    
        print("\n--------- STREAMING RESPONSE (text only) ---------")
        streamed_text = []
    
  2. Printing the raw streamed data would include a lot of noise, such as token counts and other metadata. For readability, you probably want just the text content of the response. Isolate it from the rest of each chunk with the following code:
        # Iterate over each chunk from the streaming generator
        for chunk in response_stream:
            if hasattr(chunk, "choices") and chunk.choices:
                # If the content is None, we replace it with "" (empty string)
                partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
                streamed_text.append(partial_text)
                print(partial_text, end="", flush=True)
    
        print("\n")  # new line after streaming ends
    

Multi-Turn Conversation Handling

LiteLLM can facilitate multi-turn conversations by maintaining message history in a sequential chain, enabling the model to consider the context of previous messages. This section demonstrates multi-turn conversation handling by updating the messages list each time we receive a new response from the assistant. This pattern can be repeated for as many turns as you need, continuously appending messages to maintain the conversational flow.
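
If you need more than two turns, the same append-and-ask pattern can be wrapped in a small helper. The sketch below is illustrative only; the ask_followup name and signature are not part of LiteLLM or the script in this guide:

from litellm import completion

def ask_followup(messages, question, model, **kwargs):
    # Hypothetical helper: add the user's question, call the model, and
    # record the assistant's reply so the context keeps growing turn by turn.
    messages.append({"role": "user", "content": question})
    reply = completion(model=model, messages=messages, **kwargs)
    answer = reply.choices[0].message.get("content", "") or ""
    messages.append({"role": "assistant", "content": answer})
    return answer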

Let's take a closer look at each step:

  1. First, re-assemble the streamed chunks of the first answer into a single message. After collecting the partial responses in streamed_text, join them into a single string called complete_first_answer.
        # Combine the partial chunks into one string
        complete_first_answer = "".join(streamed_text)
    
  2. Next, append the assistant's reply so it becomes part of the conversation context. Add complete_first_answer back into messages under the "assistant" role as follows:
        # Append the entire first answer to the conversation for multi-turn context
        messages.append({"role": "assistant", "content": complete_first_answer})
    
  3. Then, craft the second user message. Append a new message object to messages with the user's next question as follows:
        # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
        messages.append({
            "role": "user",
            "content": (
                "Thanks for that. Can you propose a short, 3-minute presentation outline "
                "about the Gold Rush, focusing on its broader implications?"
            ),
        })
    
  4. Now, ask the model to answer the second question, this time with streaming disabled. Pass the updated messages to completion() with stream=False, prompting LiteLLM to return a standard (single-shot) response as follows:
        try:
            response_2 = completion(
                model=model,
                messages=messages,
                max_tokens=300,
                temperature=0.6,
                stream=False  # non-streamed
            )
        except Exception as err:
            print(f"Error calling model: {err}")
            return
    
  5. Finally, parse and print the second answer. Extract the content from response_2.choices[0].message, store it in second_answer_text, and print it to the console as your final output:
        print("--------- RESPONSE 2 (non-streamed, text only) ---------")
        second_answer_text = ""
        if response_2.choices and hasattr(response_2.choices[0], "message"):
            second_answer_text = response_2.choices[0].message.get("content", "") or ""
    
        print(second_answer_text)
    

Putting it All Together

You can find the full code file below, demonstrating a streamed response, a regular (non-streamed) response, and multi-turn conversation handling in a single script.

litellm-features.py
import os

import litellm.exceptions
from litellm import completion

# Set environment variables for kluster.ai
os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY"  # Replace with your key
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

def main():
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user",   "content": "Explain the significance of the California Gold Rush."},
    ]

    # --- 1) STREAMING CALL: Only print chunk text --------------------------------
    try:
        response_stream = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.3,
            stream=True,  # streaming enabled
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("\n--------- STREAMING RESPONSE (text only) ---------")
    streamed_text = []

    # Iterate over each chunk from the streaming generator
    for chunk in response_stream:
        if hasattr(chunk, "choices") and chunk.choices:
            # If the content is None, we replace it with "" (empty string)
            partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
            streamed_text.append(partial_text)
            print(partial_text, end="", flush=True)

    print("\n")  # new line after streaming ends

    # Combine the partial chunks into one string
    complete_first_answer = "".join(streamed_text)

    # Append the entire first answer to the conversation for multi-turn context
    messages.append({"role": "assistant", "content": complete_first_answer})

    # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
    messages.append({
        "role": "user",
        "content": (
            "Thanks for that. Can you propose a short, 3-minute presentation outline "
            "about the Gold Rush, focusing on its broader implications?"
        ),
    })

    try:
        response_2 = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.6,
            stream=False  # non-streamed
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("--------- RESPONSE 2 (non-streamed, text only) ---------")
    second_answer_text = ""
    if response_2.choices and hasattr(response_2.choices[0], "message"):
        second_answer_text = response_2.choices[0].message.get("content", "") or ""

    print(second_answer_text)

if __name__ == "__main__":
    main()

Upon running it, you'll see output like the following:

python litellm-features.py

--------- STREAMING RESPONSE (text only) ---------
The California Gold Rush, which occurred from 1848 to 1855, was a pivotal event in American history that had significant economic, social, and cultural impacts on the United States and the world. Here are some of the key reasons why the California Gold Rush was important:

1. **Mass Migration and Population Growth**: The Gold Rush triggered a massive influx of people to California, with estimates suggesting that over 300,000 people arrived in the state between 1848 and 1852. This migration helped to populate the western United States and contributed to the country's westward expansion.
2. **Economic Boom**: The Gold Rush created a huge economic boom, with thousands of people striking it rich and investing their newfound wealth in businesses, infrastructure, and other ventures. The gold rush helped to stimulate economic growth, create new industries, and establish California as a major economic hub.
3. **Technological Innovations**: The Gold Rush drove technological innovations, particularly in the areas of mining and transportation. The development of new mining techniques, such as hydraulic mining, and the construction of roads, bridges, and canals, helped to facilitate the extraction and transportation of gold.
4. **Impact on Native American Populations**: The Gold Rush had a devastating impact on Native American populations in California, who were forcibly removed from their lands, killed, or displaced by the influx of miners. The Gold Rush marked the beginning of a long and tragic period of colonization and marginalization for Native American communities in

--------- RESPONSE 2 (non-streamed, text only) ---------
Here's a suggested 3-minute presentation outline on the California Gold Rush, focusing on its broader implications:

**Title:** The California Gold Rush: A Catalyst for Change

**Introduction (30 seconds)**
* Briefly introduce the California Gold Rush and its significance
* Thesis statement: The California Gold Rush was a pivotal event in American history that had far-reaching implications for the country's economy, society, and politics.

**Section 1: Economic Implications (45 seconds)**
* Discuss how the Gold Rush stimulated economic growth and helped establish California as a major economic hub
* Mention the impact on trade, commerce, and industry, including the growth of San Francisco and other cities
* Highlight the role of the Gold Rush in shaping the US economy and contributing to the country's westward expansion

**Section 2: Social and Cultural Implications (45 seconds)**
* Discuss the impact of the Gold Rush on Native American populations, including forced removals, violence, and displacement
* Mention the diversity of people who came to California during the Gold Rush, including immigrants from China, Latin America, and Europe
* Highlight the social and cultural changes that resulted from this diversity, including the growth of cities and the development of new communities

**Section 3: Lasting Legacy (45 seconds)**
* Discuss the lasting legacy of the Gold Rush, including its contribution to the development of the US West Coast and the growth of the US economy
* Mention the ongoing impact of the Gold

Both responses appear to trail off abruptly, but that's because we limited the output to 300 tokens each. Feel free to tweak the parameters and rerun the script at your leisure!
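
For example, raising max_tokens gives the model more room to finish its answer (shown here for the second call; the value is just an illustration):

response_2 = completion(
    model=model,
    messages=messages,
    max_tokens=1000,  # example value; more room than the 300 tokens used above
    temperature=0.6,
)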