Text classification with kluster.ai API¶
Text classification assigns a class or label to a given text, and it is a common example of how helpful an AI model can be.
This tutorial runs through a notebook where you'll learn how to use the kluster.ai batch API to classify a dataset based on a predefined set of categories.
The example uses an extract from the IMDB top 1000 movies dataset and categorizes them into "Action," "Adventure," "Comedy," "Crime," "Documentary," "Drama," "Fantasy," "Horror," "Romance," or "Sci-Fi."
You can adapt this example by using your own data and the categories relevant to your use case. With this approach, you can process datasets of any size and obtain categorized results from a state-of-the-art language model.
Prerequisites¶
Before getting started, ensure you have the following:
- A kluster.ai account - sign up on the kluster.ai platform if you don't have one
- A kluster.ai API key - after signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
Setup¶
In this notebook, we'll use Python's getpass module to enter the key securely. When prompted, provide your kluster.ai API key (make sure it contains no spaces).
from getpass import getpass
api_key = getpass("Enter your kluster.ai API key: ")
Enter your kluster.ai API key: ··········
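Since a stray space or newline is easy to paste along with the key, a small check right after the prompt can catch it early. The sketch below is only an illustration: `clean_api_key` is a hypothetical helper, not part of the kluster.ai or OpenAI libraries.

```python
def clean_api_key(raw_key: str) -> str:
    """Strip surrounding whitespace and reject empty keys or keys with inner whitespace."""
    key = raw_key.strip()
    if not key:
        raise ValueError("API key is empty")
    if any(ch.isspace() for ch in key):
        raise ValueError("API key contains whitespace")
    return key
```

In the notebook, this could be applied as `api_key = clean_api_key(api_key)` before the client is created.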
Next, ensure you've installed the OpenAI Python library:
pip install -q openai
With the OpenAI Python library installed, we import the necessary dependencies for the tutorial:
from openai import OpenAI
import pandas as pd
import time
import json
import os
from IPython.display import clear_output, display
Then, initialize the client by pointing it to the kluster.ai endpoint and passing your API key.
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)
Get the data¶
Now that you've initialized an OpenAI-compatible client pointing to kluster.ai, we can talk about the data.
This notebook includes a preloaded sample dataset derived from the Top 1000 IMDb Movies dataset. It contains movie descriptions ready for classification. No additional setup is needed. Simply proceed to the next steps to begin working with this data.
df = pd.DataFrame({
    "text": [
        "Breakfast at Tiffany's: A young New York socialite becomes interested in a young man who has moved into her apartment building, but her past threatens to get in the way.",
        "Giant: Sprawling epic covering the life of a Texas cattle rancher and his family and associates.",
        "From Here to Eternity: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.",
        "Lifeboat: Several survivors of a torpedoed merchant ship in World War II find themselves in the same lifeboat with one of the crew members of the U-boat that sank their ship.",
        "The 39 Steps: A man in London tries to help a counter-espionage Agent. But when the Agent is killed, and the man stands accused, he must go on the run to save himself and stop a spy ring which is trying to steal top secret information."
    ]
})
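To adapt the example to your own data, the only requirement is a list of strings that can go into the "text" column. The following is a minimal sketch that builds such a list from CSV content using only the standard library. The column names `title` and `overview`, the `load_texts` helper, and the inline sample are assumptions for illustration; they are not part of the original notebook.

```python
import csv
import io

# Hypothetical CSV content; in practice, open your own file instead.
SAMPLE_CSV = """title,overview
Giant,Sprawling epic covering the life of a Texas cattle rancher and his family and associates.
Lifeboat,Several survivors of a torpedoed merchant ship in World War II find themselves in the same lifeboat with one of the crew members of the U-boat that sank their ship.
"""


def load_texts(csv_file, title_column="title", description_column="overview"):
    """Return one 'Title: description' string per row, skipping rows without a description."""
    texts = []
    for row in csv.DictReader(csv_file):
        title = (row.get(title_column) or "").strip()
        description = (row.get(description_column) or "").strip()
        if not description:
            continue
        texts.append(f"{title}: {description}" if title else description)
    return texts


texts = load_texts(io.StringIO(SAMPLE_CSV))
print(texts[0])
```

The resulting list can replace the sample data with `df = pd.DataFrame({"text": texts})`, which keeps the shape the rest of the notebook expects.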
Perform batch inference¶
To execute the batch inference job, we'll take the following steps:
- Create the batch job file - we'll generate a JSON lines file with the desired requests to be processed by the model
- Upload the batch job file - once it is ready, we'll upload it to the kluster.ai platform using the API, where it will be processed. We'll receive a unique ID associated with our file
- Start the batch job - after the file is uploaded, we'll initiate the job to process the uploaded data, using the file ID obtained before
- Monitor job progress - (optional) track the status of the batch job to ensure it has been successfully completed
- Retrieve results - once the job has completed execution, we can access and process the resultant data
This notebook is prepared for you to follow along. Run the cells below to watch it all come together.
Create the batch job file¶
This example selects the deepseek-ai/DeepSeek-V3 model. If you'd like to use a different model, change the model field. In this notebook, you can also comment out DeepSeek V3 and uncomment whichever model you want to try.
Please refer to the Supported models section for a list of the models we support.
The following snippets prepare the JSONL file, where each line represents a different request. Note that each batch request can specify its own model. We use a temperature of 0.5, but feel free to change it and compare the outcomes (keep in mind that the prompt asks the model to respond with a single word, the genre).
# Prompt
SYSTEM_PROMPT = '''
Classify the main genre of the given movie description based on the following genres (Respond with only the genre):
“Action”, “Adventure”, “Comedy”, “Crime”, “Documentary”, “Drama”, “Fantasy”, “Horror”, “Romance”, “Sci-Fi”.
'''
# Models
#model="deepseek-ai/DeepSeek-R1"
model="deepseek-ai/DeepSeek-V3"
#model="klusterai/Meta-Llama-3.1-8B-Instruct-Turbo"
#model="klusterai/Meta-Llama-3.1-405B-Instruct-Turbo"
#model="klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
#model="Qwen/Qwen2.5-VL-7B-Instruct"
# Ensure the directory exists
os.makedirs("text_clasification/data", exist_ok=True)
# Create the batch job file with the prompt and content
def create_batch_file(df):
    batch_list = []
    for index, row in df.iterrows():
        content = row['text']
        request = {
            "custom_id": f"movie_classification-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0.5,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": content}
                ],
            }
        }
        batch_list.append(request)
    return batch_list

# Save file
def save_batch_file(batch_list):
    filename = "text_clasification/batch_job_request.jsonl"
    with open(filename, 'w') as file:
        for request in batch_list:
            file.write(json.dumps(request) + '\n')
    return filename
Let's run the functions we've defined before:
batch_list = create_batch_file(df)
filename = save_batch_file(batch_list)
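Because each request can carry its own model and temperature, a single batch file can mix them. The sketch below illustrates this with a hypothetical `build_request` helper and a shortened prompt (`GENRE_PROMPT`); both are assumptions made for the example, while the request layout and model names follow the ones used in this notebook.

```python
import json

# Shortened prompt used only for this illustration.
GENRE_PROMPT = "Classify the main genre of the given movie description. Respond with only the genre."


def build_request(custom_id, text, model, temperature=0.5):
    """Build one batch request with its own model and temperature."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "temperature": temperature,
            "messages": [
                {"role": "system", "content": GENRE_PROMPT},
                {"role": "user", "content": text},
            ],
        },
    }


mixed_requests = [
    build_request(
        "movie_classification-0",
        "Giant: Sprawling epic covering the life of a Texas cattle rancher and his family and associates.",
        "deepseek-ai/DeepSeek-V3",
    ),
    build_request(
        "movie_classification-1",
        "Giant: Sprawling epic covering the life of a Texas cattle rancher and his family and associates.",
        "klusterai/Meta-Llama-3.1-8B-Instruct-Turbo",
        temperature=0.0,  # lower temperature reduces sampling randomness
    ),
]

# One JSON object per line, as the batch file format requires.
jsonl_text = "".join(json.dumps(request) + "\n" for request in mixed_requests)
print(jsonl_text)
```

Sending the same description to two models, as done here, is one way to compare their answers within a single job.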
Next, we can preview what that batch job file looks like:
!head -n 1 text_clasification/batch_job_request.jsonl
{"custom_id": "movie_classification-0", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "deepseek-ai/DeepSeek-V3", "temperature": 0.5, "messages": [{"role": "system", "content": "\n Classify the main genre of the given movie description based on the following genres (Respond with only the genre):\n \u201cAction\u201d, \u201cAdventure\u201d, \u201cComedy\u201d, \u201cCrime\u201d, \u201cDocumentary\u201d, \u201cDrama\u201d, \u201cFantasy\u201d, \u201cHorror\u201d, \u201cRomance\u201d, \u201cSci-Fi\u201d.\n "}, {"role": "user", "content": "Breakfast at Tiffany's: A young New York socialite becomes interested in a young man who has moved into her apartment building, but her past threatens to get in the way."}]}}
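Before uploading, it can help to check the file programmatically rather than by eye. The following sketch verifies that every line is valid JSON, contains the four top-level keys seen in the preview above, and has a unique `custom_id`. The `validate_batch_lines` helper is hypothetical, and these checks reflect the structure used in this notebook rather than an official validation rule set.

```python
import json

# Top-level keys present in each request line of this notebook's batch file.
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_batch_lines(lines):
    """Return a list of human-readable problems found in the JSONL lines."""
    problems = []
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {line_number}: empty line")
            continue
        try:
            request = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {line_number}: expected a JSON object")
            continue
        missing = REQUIRED_KEYS - request.keys()
        if missing:
            problems.append(f"line {line_number}: missing keys {sorted(missing)}")
        custom_id = request.get("custom_id")
        if custom_id is not None:
            if custom_id in seen_ids:
                problems.append(f"line {line_number}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return problems
```

For the file created above, usage could look like `with open(filename) as f: print(validate_batch_lines(f))`, where an empty list means no problems were found.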
Upload batch job file to kluster.ai¶
Now that we’ve prepared our input file, it’s time to upload it to the kluster.ai platform. To do so, you can use the files.create endpoint of the client, where the purpose is set to batch. This will return the file ID, which we need to log for the next steps.
data_dir = 'text_clasification/batch_job_request.jsonl'

# Upload batch job request file
with open(data_dir, 'rb') as file:
    upload_response = client.files.create(
        file=file,
        purpose="batch"
    )

# Print file ID
file_id = upload_response.id
print(f"File uploaded successfully. File ID: {file_id}")
File uploaded successfully. File ID: 67ddb3ef6afe1d706e51b7a6
Start the batch job¶
Once the file has been successfully uploaded, we're ready to start (create) the batch job by providing the file ID we got in the previous step. To do so, we use the batches.create method, for which we need to set the endpoint to /v1/chat/completions. This will return the batch job details, including its ID.
# Create batch job with completions endpoint
batch_job = client.batches.create(
    input_file_id=file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print("\nBatch job created:")
batch_dict = batch_job.model_dump()
print(json.dumps(batch_dict, indent=2))
Batch job created:
{
  "id": "67ddb3f151ba53f099cdb3f9",
  "completion_window": "24h",
  "created_at": 1742582769,
  "endpoint": "/v1/chat/completions",
  "input_file_id": "67ddb3ef6afe1d706e51b7a6",
  "object": "batch",
  "status": "pre_schedule",
  "cancelled_at": null,
  "cancelling_at": null,
  "completed_at": null,
  "error_file_id": null,
  "errors": [],
  "expired_at": null,
  "expires_at": 1742669169,
  "failed_at": null,
  "finalizing_at": null,
  "in_progress_at": null,
  "metadata": {},
  "output_file_id": null,
  "request_counts": {
    "completed": 0,
    "failed": 0,
    "total": 0
  }
}
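The `created_at` and `expires_at` values in this response can be turned into readable dates. The sketch below assumes they are Unix timestamps in seconds, as in the OpenAI Batch object; the values are copied from the output above, and their difference matches the requested 24h completion window.

```python
from datetime import datetime, timezone

# Values copied from the batch job details printed above.
created_at = 1742582769
expires_at = 1742669169

# Assumption: both fields are Unix timestamps expressed in seconds.
created = datetime.fromtimestamp(created_at, tz=timezone.utc)
expires = datetime.fromtimestamp(expires_at, tz=timezone.utc)

window_hours = (expires - created).total_seconds() / 3600

print(f"Created: {created.isoformat()}")
print(f"Expires: {expires.isoformat()}")
print(f"Completion window: {window_hours} hours")
```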
Check job progress¶
Now that your batch job has been created, you can track its progress.
To monitor the job's progress, we can use the batches.retrieve method and pass the batch job ID. The response contains a status field that tells us whether the job has completed, and a request_counts field with the number of completed, failed, and total requests.
The following snippet checks the status every 10 seconds until the entire batch is completed:
all_completed = False

# Loop to check status every 10 seconds
while not all_completed:
    all_completed = True
    output_lines = []

    updated_job = client.batches.retrieve(batch_job.id)

    if updated_job.status != "completed":
        all_completed = False
        completed = updated_job.request_counts.completed
        total = updated_job.request_counts.total
        output_lines.append(f"Job status: {updated_job.status} - Progress: {completed}/{total}")
    else:
        output_lines.append("Job completed!")

    # Clear the output and display updated status
    clear_output(wait=True)
    for line in output_lines:
        display(line)

    if not all_completed:
        time.sleep(10)
'Job completed!'
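The loop above waits only for the completed status, so it would keep polling if a job ended in another final state. A more defensive variant might also stop on failure states and after a timeout. The sketch below is one such variant, not the notebook's method: `wait_for_batch` is a hypothetical helper, and the set of terminal statuses is an assumption borrowed from the OpenAI Batch API, so the names kluster.ai uses should be confirmed before relying on it.

```python
import time

# Assumption: terminal status names follow the OpenAI Batch API.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=10, timeout_seconds=3600, sleep=time.sleep):
    """Poll a batch job until it reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = client.batches.retrieve(batch_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Batch {batch_id} still in status {job.status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

With the client from this notebook, a call could look like `job = wait_for_batch(client, batch_job.id)`, followed by a check that `job.status == "completed"` before fetching results.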
Get the results¶
With the job completed, we'll retrieve the results and review the response generated for each request. To fetch the results from the platform, retrieve the output_file_id from the batch job, then pass that file ID to the files.content endpoint. The snippet below then parses the returned JSON lines. Note that the job status must be completed before you can retrieve the results!
# Parse results as JSON objects
def parse_json_objects(data_string):
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')

    json_strings = data_string.strip().split('\n')
    json_objects = []

    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")

    return json_objects

# Retrieve results with job ID
job = client.batches.retrieve(batch_job.id)
result_file_id = job.output_file_id
result = client.files.content(result_file_id).content

# Parse JSON results
parsed_result = parse_json_objects(result)

# Extract and print only the content of each response
print("\nExtracted Responses:")
for item in parsed_result:
    try:
        content = item["response"]["body"]["choices"][0]["message"]["content"]
        print(content)
    except KeyError as e:
        print(f"Missing key in response: {e}")
Extracted Responses:
Romance
Drama
Drama
Drama
Action
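To use these labels downstream, it helps to tie each one back to its input row and to confirm it is one of the allowed genres. The sketch below assumes each result line carries the `custom_id` of its request, as in the OpenAI batch output format, and recovers the row index from the `movie_classification-{index}` pattern used when the batch file was created. The helper names are hypothetical.

```python
ALLOWED_GENRES = [
    "Action", "Adventure", "Comedy", "Crime", "Documentary",
    "Drama", "Fantasy", "Horror", "Romance", "Sci-Fi",
]


def normalize_genre(raw):
    """Map a model answer to an allowed genre, or None if it does not match."""
    cleaned = raw.strip().strip("\"'“”.").lower()
    for genre in ALLOWED_GENRES:
        if genre.lower() == cleaned:
            return genre
    return None


def index_from_custom_id(custom_id):
    """Recover the row index from an ID such as 'movie_classification-3'."""
    return int(custom_id.rsplit("-", 1)[1])


def labels_by_index(parsed_result):
    """Return {row_index: genre or None} for each parsed result item."""
    labels = {}
    for item in parsed_result:
        # Assumption: every result line echoes the request's custom_id.
        content = item["response"]["body"]["choices"][0]["message"]["content"]
        labels[index_from_custom_id(item["custom_id"])] = normalize_genre(content)
    return labels
```

With the notebook's variables, the labels could be attached to the data with `df["genre"] = df.index.map(labels_by_index(parsed_result))`. Rows whose answer falls outside the allowed list get None and can be reviewed separately.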
Summary¶
This tutorial used the chat completion endpoint to perform a simple text classification task with batch inference. This particular example classified a series of movies based on their descriptions.
To submit a batch job, we:
- Created the JSONL file, where each line of the file represented a separate request
- Submitted the file to the platform
- Started the batch job and monitored its progress
- Fetched the results once the job completed
All of this was done with the OpenAI Python library and API, with no changes needed!
Kluster.ai's batch API empowers you to scale your workflows seamlessly, making it an invaluable tool for processing extensive datasets. As next steps, feel free to create your own dataset, or expand on top of this existing example. Good luck!