Integrate LiteLLM with kluster.ai

LiteLLM is an open-source Python library that streamlines access to a broad range of Large Language Model (LLM) providers through a standardized interface inspired by the OpenAI format. By providing features like fallback mechanisms, cost tracking, and streaming support, LiteLLM reduces the complexity of working with different models, ensuring a more reliable and cost-effective approach to AI-driven applications.

Integrating LiteLLM with the kluster.ai API enables the use of kluster.ai's powerful models alongside LiteLLM's flexible orchestration. This combination makes it simple to switch between models on the fly, handle token usage limits with context window fallback, and monitor usage costs in real time, leading to robust, scalable, and adaptable AI workflows.
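
For instance, the cost-tracking side of that workflow needs only a few extra lines. The following is a minimal sketch (not part of the tutorial script below) that assumes the kluster.ai API key and base URL configured later in this guide. It uses LiteLLM's completion_cost helper; whether pricing data exists for a given kluster.ai model depends on your installed LiteLLM version, so treat the estimate as approximate:

import os

import litellm
from litellm import completion

# Assumes the same environment variables used throughout this guide
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

response = completion(
    model="openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)

# completion_cost looks the model up in LiteLLM's pricing map; models missing
# from the map may return 0 or raise, so this is illustrative only
try:
    cost = litellm.completion_cost(completion_response=response)
    print(f"Estimated cost: ${cost:.6f}")
except Exception as err:
    print(f"Could not estimate cost: {err}")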

Prerequisites

Before starting, ensure you have the following:

  • A kluster.ai account - sign up on the kluster.ai platform if you don't have one
  • A kluster.ai API key - after signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
  • A Python virtual environment - optional but recommended. Activate the virtual environment before following along with this tutorial (an example of creating one is shown after the install command below)
  • LiteLLM installed - to install the library, use the following command:

    pip install litellm
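
If you'd like to work inside a virtual environment, as recommended above, one common way to create and activate it before installing LiteLLM is:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate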
    

Configure LiteLLM

In this section, you'll learn how to integrate kluster.ai with LiteLLM. You'll configure your environment variables, specify a kluster.ai model, and make a simple request using LiteLLM's OpenAI-like interface.

  1. Import LiteLLM and its dependencies - create a new file (e.g., hello-litellm.py) and start by importing the necessary Python modules:
    import os
    
    from litellm import completion
    
  2. Set your kluster.ai API key and base URL - replace INSERT_KLUSTER_API_KEY with your actual API key. If you don't have one yet, refer to the Get an API key guide
    # Set environment vars, shown in script for readability
    os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
    os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"
    
  3. Define your conversation (system + user messages) - set up your initial system prompt and user message. The system message defines your AI assistant's role, while the user message is the actual question or prompt
    # Basic Chat
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What is the capital of California?"}
    ]
    
  4. Select your kluster.ai model - choose one of kluster.ai's available models that best fits your use case. Prepend the model name with openai/ so LiteLLM recognizes it as an OpenAI-like model request
    # Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
    
  5. Call the LiteLLM completion function - finally, invoke the completion function to send your request:
    response = completion(
        model=model,
        messages=messages,
        max_tokens=1000, 
    )
    
    print(response)
    
View complete script
hello-litellm.py
import os

from litellm import completion

# Set environment vars, shown in script for readability
os.environ["OPENAI_API_KEY"] = "INSERT_KLUSTER_API_KEY"
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

# Basic Chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the capital of California?"}
]

# Use an "openai/..." model prefix so LiteLLM treats this as an OpenAI-like call
model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

response = completion(
    model=model,
    messages=messages,
    max_tokens=1000, 
)

print(response)

Use the following command to run your script:

python hello-litellm.py
You should see output that resembles the following:

ModelResponse(id='chatcmpl-9877dfe6-6f1d-483f-a392-d791b89c75d6', created=1739495162, model='klusterai/Meta-Llama-3.3-70B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='The capital of California is Sacramento.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, refusal=None))], usage=Usage(completion_tokens=8, prompt_tokens=48, total_tokens=56, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt_logprobs=None)
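
The printed ModelResponse includes usage metadata alongside the answer. If you only want the generated text, you can index into the first choice instead, for example:

# Print only the assistant's reply rather than the full ModelResponse
print(response.choices[0].message.content)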

That's it! You've successfully integrated LiteLLM with the kluster.ai API. Continue to learn how to experiment with more advanced features of LiteLLM.

Explore LiteLLM features

In the previous section, you configured the API key and base URL and used LiteLLM's OpenAI-style interface to call a kluster.ai model. The following sections demonstrate LiteLLM's streaming responses and multi-turn conversation handling with the kluster.ai API.

The following guide assumes you have just completed the steps in the Configure LiteLLM section. If you haven't, please do so before continuing.

Use streaming responses

You can enable streaming by passing stream=True to the completion() function. Streaming returns a generator instead of a static response, letting you iterate over partial output chunks as they arrive. In the code sample below, each chunk is accessed in a for loop so you can extract the textual content (e.g., chunk.choices[0].delta.content) rather than printing all of the metadata.

To configure a streaming response, take the following steps:

  1. Update the system prompt and first user message in messages - you can supply your own user message or use the sample provided:

        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user",   "content": "Explain the significance of the California Gold Rush."},
        ]
    

  2. Initiate a streaming request to the model - set stream=True in the completion() function to tell LiteLLM to return partial pieces (chunks) of the response as they become available rather than waiting for the entire response to be ready. Note that the return statements in these snippets assume the code lives inside a function, as it does in the complete script at the end of this guide

        # --- 1) STREAMING CALL: Only print chunk text --------------------------------
        try:
            response_stream = completion(
                model=model,
                messages=messages,
                max_tokens=300,
                temperature=0.3,
                stream=True,  # streaming enabled
            )
        except Exception as err:
            print(f"Error calling model: {err}")
            return
    
        print("\n--------- STREAMING RESPONSE (text only) ---------")
        streamed_text = []
    

  3. Isolate the returned text content - printing each streamed chunk in full would include metadata such as token counts. You can extract just the text content from each chunk with the following code:
        # Iterate over each chunk from the streaming generator
        for chunk in response_stream:
            if hasattr(chunk, "choices") and chunk.choices:
                # If the content is None, we replace it with "" (empty string)
                partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
                streamed_text.append(partial_text)
                print(partial_text, end="", flush=True)
    
        print("\n")  # new line after streaming ends
    

Handle multi-turn conversation

LiteLLM can facilitate multi-turn conversations by maintaining message history in a sequential chain, enabling the model to consider the context of previous messages. This section demonstrates multi-turn conversation handling by updating the messages list each time we receive a new response from the assistant. This pattern can be repeated for as many turns as you need, continuously appending messages to maintain the conversational flow.
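
As a general pattern, a small helper loop can keep appending turns for you. The sketch below is hypothetical (the chat_turns name and its arguments are illustrative, not part of the tutorial script) and assumes completion has been imported from litellm as shown earlier:

# Hypothetical helper: run several user turns in one ongoing conversation
def chat_turns(model, messages, user_turns):
    for question in user_turns:
        messages.append({"role": "user", "content": question})
        response = completion(model=model, messages=messages, max_tokens=300)
        answer = response.choices[0].message.content or ""
        # Append the assistant's reply so the next turn sees the full history
        messages.append({"role": "assistant", "content": answer})
        print(answer)
    return messages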

Let's take a closer look at each step:

  1. Combine the streamed chunks of the first answer - since the reply was streamed in chunks, you must reassemble them into a single string. After collecting the partial responses in streamed_text, join them into complete_first_answer:
        # Combine the partial chunks into one string
        complete_first_answer = "".join(streamed_text)
    
  2. Append the assistant's reply - to preserve the conversational context, add complete_first_answer back into messages under the "assistant" role as follows:
        # Append the entire first answer to the conversation for multi-turn context
        messages.append({"role": "assistant", "content": complete_first_answer})
    
  3. Craft the second message to the assistant - append a new message object to messages with the user's next question as follows:
        # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
        messages.append({
            "role": "user",
            "content": (
                "Thanks for that. Can you propose a short, 3-minute presentation outline "
                "about the Gold Rush, focusing on its broader implications?"
            ),
        })
    
  4. Ask the model to respond to the second question - this time, don't enable the streaming feature. Pass the updated messages to completion() with stream=False, prompting LiteLLM to generate a standard (single-shot) response as follows:
        try:
            response_2 = completion(
                model=model,
                messages=messages,
                max_tokens=300,
                temperature=0.6,
                stream=False  # non-streamed
            )
        except Exception as err:
            print(f"Error calling model: {err}")
            return
    
  5. Parse and print the second answer - extract the text content from response_2.choices[0].message, store it in second_answer_text, and print it to the console for your final output:
        print("--------- RESPONSE 2 (non-streamed, text only) ---------")
        second_answer_text = ""
        if response_2.choices and hasattr(response_2.choices[0], "message"):
            second_answer_text = response_2.choices[0].message.get("content", "") or ""
    
        print(second_answer_text)
    

You can view the full script below. It demonstrates a streamed response versus a regular response and how to handle a multi-turn conversation.

View complete script
hello-litellm.py
import os

from litellm import completion

# Set environment variables for kluster.ai
os.environ["OPENAI_API_KEY"] = "INSERT_API_KEY"  # Replace with your key
os.environ["OPENAI_API_BASE"] = "https://api.kluster.ai/v1"

def main():
    model = "openai/klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"

    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user",   "content": "Explain the significance of the California Gold Rush."},
    ]

    # --- 1) STREAMING CALL: Only print chunk text --------------------------------
    try:
        response_stream = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.3,
            stream=True,  # streaming enabled
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("\n--------- STREAMING RESPONSE (text only) ---------")
    streamed_text = []

    # Iterate over each chunk from the streaming generator
    for chunk in response_stream:
        if hasattr(chunk, "choices") and chunk.choices:
            # If the content is None, we replace it with "" (empty string)
            partial_text = getattr(chunk.choices[0].delta, "content", "") or ""
            streamed_text.append(partial_text)
            print(partial_text, end="", flush=True)

    print("\n")  # new line after streaming ends

    # Combine the partial chunks into one string
    complete_first_answer = "".join(streamed_text)

    # Append the entire first answer to the conversation for multi-turn context
    messages.append({"role": "assistant", "content": complete_first_answer})

    # --- 2) SECOND CALL (non-streamed): Print just the text ---------------------
    messages.append({
        "role": "user",
        "content": (
            "Thanks for that. Can you propose a short, 3-minute presentation outline "
            "about the Gold Rush, focusing on its broader implications?"
        ),
    })

    try:
        response_2 = completion(
            model=model,
            messages=messages,
            max_tokens=300,
            temperature=0.6,
            stream=False  # non-streamed
        )
    except Exception as err:
        print(f"Error calling model: {err}")
        return

    print("--------- RESPONSE 2 (non-streamed, text only) ---------")
    second_answer_text = ""
    if response_2.choices and hasattr(response_2.choices[0], "message"):
        second_answer_text = response_2.choices[0].message.get("content", "") or ""

    print(second_answer_text)

if __name__ == "__main__":
    main()

Put it all together

Use the following command to run your script:

python hello-litellm.py

You should see output that resembles the following:

--------- STREAMING RESPONSE (text only) ---------
The California Gold Rush, which occurred from 1848 to 1855, was a pivotal event in American history that had significant economic, social, and cultural impacts on the United States and the world. Here are some of the key reasons why the California Gold Rush was important:

1. **Mass Migration and Population Growth**: The Gold Rush triggered a massive influx of people to California, with estimates suggesting that over 300,000 people arrived in the state between 1848 and 1852. This migration helped to populate the western United States and contributed to the country's westward expansion.
2. **Economic Boom**: The Gold Rush created a huge economic boom, with thousands of people striking it rich and investing their newfound wealth in businesses, infrastructure, and other ventures. The gold rush helped to stimulate economic growth, create new industries, and establish California as a major economic hub.
3. **Technological Innovations**: The Gold Rush drove technological innovations, particularly in the areas of mining and transportation. The development of new mining techniques, such as hydraulic mining, and the construction of roads, bridges, and canals, helped to facilitate the extraction and transportation of gold.
4. **Impact on Native American Populations**: The Gold Rush had a devastating impact on Native American populations in California, who were forcibly removed from their lands, killed, or displaced by the influx of miners. The Gold Rush marked the beginning of a long and tragic period of colonization and marginalization for Native American communities in

--------- RESPONSE 2 (non-streamed, text only) ---------
Here's a suggested 3-minute presentation outline on the California Gold Rush, focusing on its broader implications:

**Title:** The California Gold Rush: A Catalyst for Change

**Introduction (30 seconds)**
* Briefly introduce the California Gold Rush and its significance
* Thesis statement: The California Gold Rush was a pivotal event in American history that had far-reaching implications for the country's economy, society, and politics.

**Section 1: Economic Implications (45 seconds)**
* Discuss how the Gold Rush stimulated economic growth and helped establish California as a major economic hub
* Mention the impact on trade, commerce, and industry, including the growth of San Francisco and other cities
* Highlight the role of the Gold Rush in shaping the US economy and contributing to the country's westward expansion

**Section 2: Social and Cultural Implications (45 seconds)**
* Discuss the impact of the Gold Rush on Native American populations, including forced removals, violence, and displacement
* Mention the diversity of people who came to California during the Gold Rush, including immigrants from China, Latin America, and Europe
* Highlight the social and cultural changes that resulted from this diversity, including the growth of cities and the development of new communities

**Section 3: Lasting Legacy (45 seconds)**
* Discuss the lasting legacy of the Gold Rush, including its contribution to the development of the US West Coast and the growth of the US economy
* Mention the ongoing impact of the Gold

Both responses appear to trail off abruptly, but that's because we limited the output to 300 tokens each. Feel free to tweak the parameters and rerun the script at your leisure!
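
If you'd like a rough estimate of how many tokens a prompt already uses before adjusting max_tokens, LiteLLM also provides a token_counter helper. The snippet below is a minimal sketch; for models LiteLLM doesn't recognize, the count may fall back to a default tokenizer, so treat it as approximate:

import litellm

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the significance of the California Gold Rush."},
]

# Estimate how many tokens the prompt will consume before sending the request
num_tokens = litellm.token_counter(
    model="klusterai/Meta-Llama-3.3-70B-Instruct-Turbo",
    messages=messages,
)
print(f"Approximate prompt tokens: {num_tokens}")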