Keyword extraction with kluster.ai API¶
Welcome to the keyword extraction notebook powered by the kluster.ai Batch API!
In this notebook, we'll demonstrate how to use the kluster.ai Batch API with the Llama 3.3 70B Large Language Model (LLM) to identify keywords in a given dataset. Using an excerpt from the AG News dataset as an example, we'll extract keywords from each text entry; you can easily adapt this example to your own use case and data format. The approach processes text efficiently at any scale, from small samples to enterprise-scale datasets.
To get started, simply input your API key and execute the preloaded cells to perform the keyword extraction. If you don’t have an API key, you can register for free on our platform.
Let’s dive in!
Setup¶
Input your unique kluster.ai API key. If you haven’t obtained one yet, don’t forget to sign up.
from getpass import getpass
api_key = getpass("Enter your kluster.ai API key: ")
Enter your kluster.ai API key: ········
%pip install -q openai
Note: you may need to restart the kernel to use updated packages.
from openai import OpenAI
import pandas as pd
import time
import json
from IPython.display import clear_output, display
# Set up the client
client = OpenAI(
    base_url="https://api.kluster.ai/v1",
    api_key=api_key,
)
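As a quick sanity check that the client and key are wired up correctly, you can list the models the endpoint exposes. This is an optional check and assumes kluster.ai implements the standard OpenAI models listing route:

# Optional sanity check: list available models
# (assumes the standard /v1/models route is supported).
for model in client.models.list().data:
    print(model.id)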
Get the data¶
This notebook comes with a preloaded sample dataset based on the AG News dataset. It includes excerpts of news headlines and their leads, all set for processing. There’s no extra setup required—just move on to the next steps to start working with the data.
df = pd.DataFrame({
    "text": [
        "Chorus Frog Found Croaking in Virginia - The Southern chorus frog has been found in southeastern Virginia, far outside its previously known range. The animal had never before been reported north of Beaufort County, N.C., about 125 miles to the south.",
        "Expedition to Probe Gulf of Mexico - Scientists will use advanced technology never before deployed beneath the sea as they try to discover new creatures, behaviors and phenomena in a 10-day expedition to the Gulf of Mexico's deepest reaches.",
        "Feds Accused of Exaggerating Fire ImpactP - The Forest Service exaggerated the effect of wildfires on California spotted owls in justifying a planned increase in logging in the Sierra Nevada, according to a longtime agency expert who worked on the plan.",
        "New Method May Predict Quakes Weeks Ahead - Swedish geologists may have found a way to predict earthquakes weeks before they happen by monitoring the amount of metals like zinc and copper in subsoil water near earthquake sites, scientists said Wednesday.",
        "Marine Expedition Finds New Species - Norwegian scientists who explored the deep waters of the Atlantic Ocean said Thursday their findings #151; including what appear to be new species of fish and squid #151; could be used to protect marine ecosystems worldwide."
    ]
})
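If you'd like to run the same pipeline over the full AG News dataset rather than this five-row excerpt, one option is to load it from the Hugging Face Hub. The sketch below is not part of the tutorial's setup: it assumes the datasets package is installed (for example with %pip install -q datasets).

# A minimal sketch, assuming the Hugging Face `datasets` package is available.
from datasets import load_dataset

ag_news = load_dataset("ag_news", split="test")
df = pd.DataFrame({"text": ag_news["text"][:100]})  # slice to keep the batch small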
Batch inference¶
To run the inference job, we’ll follow three simple steps:
- Create the batch input file - we’ll create a file containing the requests to be processed by the model.
- Upload the batch input file to kluster.ai - once the file is ready, we’ll upload it to the kluster.ai platform using the API, where it will be queued for processing.
- Start the job - after the upload, we’ll trigger the job to process the data.
Everything is preconfigured for you—just execute the cells below to see it all in action!
Create the batch input file¶
This example uses the klusterai/Meta-Llama-3.3-70B-Instruct-Turbo model. If you'd prefer to use a different model, you can easily modify the model name in the next cell. For a full list of supported models, please check our documentation.
def create_inference_file(df):
    """Build one chat-completion request per row of the DataFrame."""
    inference_list = []
    for index, row in df.iterrows():
        content = row['text']
        request = {
            "custom_id": f"keyword_extraction-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "klusterai/Meta-Llama-3.3-70B-Instruct-Turbo",
                "temperature": 0.5,
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": 'Extract up to 5 relevant keywords from the given text. Provide only the keywords between double quotes and separated by commas.'},
                    {"role": "user", "content": content}
                ],
            }
        }
        inference_list.append(request)
    return inference_list

def save_inference_file(inference_list):
    """Write the requests to a JSON Lines file, one request per line."""
    filename = "keyword_extraction_inference_request.jsonl"
    with open(filename, 'w') as file:
        for request in inference_list:
            file.write(json.dumps(request) + '\n')
    return filename
inference_list = create_inference_file(df)
filename = save_inference_file(inference_list)
Let’s preview what that request file looks like:
!head -n 1 keyword_extraction_inference_request.jsonl
{"custom_id": "keyword_extraction-0", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "klusterai/Meta-Llama-3.3-70B-Instruct-Turbo", "temperature": 0.5, "response_format": {"type": "json_object"}, "messages": [{"role": "system", "content": "Extract up to 5 relevant keywords from the given text. Provide only the keywords between double quotes and separated by commas."}, {"role": "user", "content": "Chorus Frog Found Croaking in Virginia - The Southern chorus frog has been found in southeastern Virginia, far outside its previously known range. The animal had never before been reported north of Beaufort County, N.C., about 125 miles to the south."}]}}
Upload inference file to kluster.ai¶
Now that we’ve prepared our input file, it’s time to upload it to the kluster.ai platform.
inference_input_file = client.files.create(
    file=open(filename, "rb"),
    purpose="batch"
)
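The upload call returns a file object whose id the batch job will reference; a quick print confirms the upload succeeded (the field name follows the OpenAI Files API, which the code above already relies on):

# Confirm the upload and note the file id used by the batch job below.
print(f"Uploaded file id: {inference_input_file.id}")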
Start the job¶
Once the file has been successfully uploaded, we’re ready to start the inference job.
inference_job = client.batches.create(
    input_file_id=inference_input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
All requests are now being processed!
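The create call returns a Batch object immediately; printing its id and status gives you a reference you can use to retrieve the job later (the fields shown follow the OpenAI Batch object schema, which the polling code below also relies on):

# Keep a record of the job id and its initial status.
print(f"Batch job id: {inference_job.id} - status: {inference_job.status}")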
Check job progress¶
In the following section, we'll monitor the status of the job to see how it's progressing. Let's take a look and keep track of its completion.
def parse_json_objects(data_string):
    """Parse a JSON Lines payload (bytes or str) into a list of dicts."""
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')
    json_strings = data_string.strip().split('\n')
    json_objects = []
    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
    return json_objects
all_completed = False
while not all_completed:
    all_completed = True
    output_lines = []

    updated_job = client.batches.retrieve(inference_job.id)

    if updated_job.status != "completed":
        all_completed = False
        completed = updated_job.request_counts.completed
        total = updated_job.request_counts.total
        output_lines.append(f"Job status: {updated_job.status} - Progress: {completed}/{total}")
    else:
        output_lines.append("Job completed!")

    # Clear the output and display updated status
    clear_output(wait=True)
    for line in output_lines:
        display(line)

    if not all_completed:
        time.sleep(10)
'Job completed!'
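The loop above treats any status other than completed as still in progress. For unattended runs, you may also want to stop on terminal failure states. The sketch below assumes the standard OpenAI Batch API status names (failed, cancelled, expired); check the kluster.ai documentation to confirm which states its API reports.

# A hedged sketch: stop rather than poll forever if the job ends badly.
# The status names below follow the OpenAI Batch API and are assumptions here.
TERMINAL_FAILURES = {"failed", "cancelled", "expired"}

final_status = client.batches.retrieve(inference_job.id).status
if final_status in TERMINAL_FAILURES:
    raise RuntimeError(f"Batch job ended in state: {final_status}")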
Get the results¶
Now that the job is complete, we’ll fetch the results and examine the responses generated for each request.
job = client.batches.retrieve(inference_job.id)
result_file_id = job.output_file_id
result = client.files.content(result_file_id).content
results = parse_json_objects(result)
for res in results:
    task_id = res['custom_id']
    index = task_id.split('-')[-1]
    llm_answer = res['response']['body']['choices'][0]['message']['content']
    text = df.iloc[int(index)]['text']

    print('\n -------------------------- \n')
    print(f"Task ID: {task_id}. \n\nINPUT TEXT: {text}\n\nLLM ANSWER: {llm_answer}")
 -------------------------- 

Task ID: keyword_extraction-0.

INPUT TEXT: Chorus Frog Found Croaking in Virginia - The Southern chorus frog has been found in southeastern Virginia, far outside its previously known range. The animal had never before been reported north of Beaufort County, N.C., about 125 miles to the south.

LLM ANSWER: "Chorus Frog", "Virginia", "Southeastern", "Beaufort County", "North Carolina"

 -------------------------- 

Task ID: keyword_extraction-1.

INPUT TEXT: Expedition to Probe Gulf of Mexico - Scientists will use advanced technology never before deployed beneath the sea as they try to discover new creatures, behaviors and phenomena in a 10-day expedition to the Gulf of Mexico's deepest reaches.

LLM ANSWER: "Gulf of Mexico", "expedition", "scientists", "technology", "ocean"

 -------------------------- 

Task ID: keyword_extraction-2.

INPUT TEXT: Feds Accused of Exaggerating Fire ImpactP - The Forest Service exaggerated the effect of wildfires on California spotted owls in justifying a planned increase in logging in the Sierra Nevada, according to a longtime agency expert who worked on the plan.

LLM ANSWER: "Wildfires", "California", "Spotted Owls", "Logging", "Sierra Nevada"

 -------------------------- 

Task ID: keyword_extraction-3.

INPUT TEXT: New Method May Predict Quakes Weeks Ahead - Swedish geologists may have found a way to predict earthquakes weeks before they happen by monitoring the amount of metals like zinc and copper in subsoil water near earthquake sites, scientists said Wednesday.

LLM ANSWER: "earthquakes", "prediction", "geologists", "metals", "seismology"

 -------------------------- 

Task ID: keyword_extraction-4.

INPUT TEXT: Marine Expedition Finds New Species - Norwegian scientists who explored the deep waters of the Atlantic Ocean said Thursday their findings #151; including what appear to be new species of fish and squid #151; could be used to protect marine ecosystems worldwide.

LLM ANSWER: "Marine", "Expedition", "Species", "Atlantic Ocean", "Ecosystems"
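If you'd rather keep the keywords alongside the source text instead of just printing them, you can fold the answers back into the DataFrame. A minimal sketch; the keywords column name is our own choice:

# Map each custom_id back to its row index and attach the raw LLM answer.
# "keywords" is an arbitrary column name introduced for this sketch.
answers = {}
for res in results:
    row_index = int(res['custom_id'].split('-')[-1])
    answers[row_index] = res['response']['body']['choices'][0]['message']['content']

df['keywords'] = df.index.map(answers)
df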
Conclusion¶
Congratulations! You've successfully completed the keyword extraction task using the kluster.ai Batch API. This walkthrough shows how you can process a dataset in bulk and extract meaningful insights from it. With the Batch API, the same workflow scales smoothly from a handful of rows to enterprise-scale datasets.