Evaluating LLM performance without ground truth using an LLM judge¶
In our previous notebook we explored the idea of selecting the best model to perform a classification task. We did that by calculating the accuracy of each model against a ground truth label. In real-life applications, though, ground truth is not always available, and creating it often depends on human annotation, which is time-consuming and costly.
In this notebook, we will use the Llama-3.1-8B-Instruct-Turbo model to classify the genre of movies from the IMDb Top 1000 dataset based on their descriptions. To evaluate the accuracy of these predictions, we will use the Llama-3.1-405B-Instruct-Turbo model as a judge, tasked with determining whether the base model's answers are correct. Since the dataset includes the true genres as ground truth, we can also assess how well the judge model aligns with the actual answers provided in the dataset.
Tutorial structure¶
- Setup and configuration
- Data acquisition
- Performing batch inference
- LLM as a judge
- Conclusion
1. Setting up your environment¶
API key configuration¶
To get started with this tutorial, you'll need a kluster.ai API key. If you don't have one yet, follow these steps:
- Visit kluster.ai to create an account.
- Generate your API key
Once you have your API key, we'll use it to authenticate our requests to the kluster.ai API.
Important note: keep your API key secure and never share it publicly. In this notebook, we'll use Python's getpass module to safely input the key.¶
from getpass import getpass
api_key = getpass("Enter your kluster.ai API key: ")
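If you prefer not to type the key interactively on every run, a common alternative is to read it from an environment variable first and fall back to the prompt. This is only a sketch; the variable name KLUSTER_API_KEY is an arbitrary choice for illustration, not something the platform requires.
import os
from getpass import getpass
# Hypothetical convention: use the KLUSTER_API_KEY environment variable if set,
# otherwise fall back to an interactive prompt.
api_key = os.environ.get("KLUSTER_API_KEY") or getpass("Enter your kluster.ai API key: ")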
%pip install -q openai
Note: you may need to restart the kernel to use updated packages.
import os
import urllib.request
import pandas as pd
import numpy as np
import random
import requests
from openai import OpenAI
import time
import json
from IPython.display import clear_output, display
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
pd.set_option('display.max_columns', 1000, 'display.width', 1000, 'display.max_rows',1000, 'display.max_colwidth', 500)
# Set up the client
client = OpenAI(
base_url="https://api.kluster.ai/v1",
api_key=api_key,
)
Building our evaluation pipeline¶
Understanding the helper functions¶
In this section, we'll create several utility functions that will help us:
- Prepare our data for batch processing
- Send requests to the kluster.ai API
- Monitor the progress of our evaluation
- Collect and analyze results
These functions will make our evaluation process more efficient and organized. Let's go through each one and understand its purpose.
- create_tasks() - formats our data for the API
- save_tasks() - prepares batch files for processing
- monitor_job_status() - tracks evaluation progress
- get_results() - collects and processes model outputs
Creating and managing batch files¶
What is a batch file?¶
A batch file in our context is a collection of requests that we'll send to our models for evaluation. Think of it as an organized list of tasks we want our models to complete.
Step-by-step process¶
- Creating tasks - we'll convert each movie description into a format LLMs can process
- Organizing data - we'll add necessary metadata and instructions for each task
- Saving files - we'll store these tasks in a structured format (JSONL) for processing
Understanding the code¶
Let's break down the key components of our batch file creation:
- custom_id - helps us track individual requests
- system_prompt - provides instructions to the model
- content - the actual text we want to classify
This structured approach allows us to efficiently process multiple requests in parallel.
def create_tasks(user_contents, system_prompt, task_type, model):
    # Build one chat-completion request per input text
    tasks = []
    for index, user_content in enumerate(user_contents):
        task = {
            "custom_id": f"{task_type}-{index}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "temperature": 0,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_content},
                ],
            }
        }
        tasks.append(task)
    return tasks

def save_tasks(tasks, task_type):
    # Write the tasks to a JSONL file, one JSON object per line
    filename = f"batch_tasks_{task_type}.jsonl"
    with open(filename, 'w') as file:
        for task in tasks:
            file.write(json.dumps(task) + '\n')
    return filename
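As a quick sanity check once the tasks have been saved, you could read back the first line of the file and confirm it contains the expected fields (custom_id, method, url, and body). This is optional and assumes filename holds the path returned by save_tasks():
# Read back the first request to confirm the expected structure
# ('filename' is assumed to be the path returned by save_tasks())
with open(filename) as f:
    first_task = json.loads(f.readline())
print(first_task["custom_id"], list(first_task["body"].keys()))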
Uploading files to kluster.ai¶
The upload process¶
Now that we've prepared our batch files, we'll upload them to the kluster.ai platform for batch inference. This step is crucial for:
- Getting our data to the models
- Setting up the processing queue
- Preparing for inference
What happens next?¶
After upload:
- The platform queues our requests
- Models process them efficiently
- Results are made available for collection
def create_batch_job(file_name):
    print(f"Creating batch job for {file_name}")
    # Upload the JSONL file, then submit it as a batch job
    batch_file = client.files.create(
        file=open(file_name, "rb"),
        purpose="batch"
    )
    batch_job = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h"
    )
    return batch_job
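If you only want a one-off look at a submitted job rather than continuous polling, you can retrieve it directly. This assumes job is the object returned by create_batch_job():
# One-off status check; 'job' is assumed to be the value returned by create_batch_job()
snapshot = client.batches.retrieve(job.id)
print(snapshot.status, snapshot.request_counts)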
The monitor_job_status() function defined below (alongside the parse_json_objects() and get_results() helpers) provides real-time monitoring of batch job progress:
- Continuously checks job status via the kluster.ai API
- Displays current completion count (completed/total requests)
- Updates status every 10 seconds until job is finished
- Automatically clears previous output for clean progress tracking
def parse_json_objects(data_string):
    # Decode a JSONL payload (bytes or str) into a list of dictionaries
    if isinstance(data_string, bytes):
        data_string = data_string.decode('utf-8')
    json_strings = data_string.strip().split('\n')
    json_objects = []
    for json_str in json_strings:
        try:
            json_obj = json.loads(json_str)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
    return json_objects

def monitor_job_status(client, job_id, task_type):
    # Poll the batch job every 10 seconds until it reports "completed"
    all_completed = False
    while not all_completed:
        all_completed = True
        output_lines = []
        updated_job = client.batches.retrieve(job_id)
        if updated_job.status.lower() != "completed":
            all_completed = False
            completed = updated_job.request_counts.completed
            total = updated_job.request_counts.total
            output_lines.append(f"{task_type.capitalize()} job status: {updated_job.status} - Progress: {completed}/{total}")
        else:
            output_lines.append(f"{task_type.capitalize()} job completed!")
        # Clear the output and display updated status
        clear_output(wait=True)
        for line in output_lines:
            display(line)
        if not all_completed:
            time.sleep(10)

def get_results(client, job_id):
    # Download the output file and extract the model's answer from each response
    batch_job = client.batches.retrieve(job_id)
    result_file_id = batch_job.output_file_id
    result = client.files.content(result_file_id).content
    results = parse_json_objects(result)
    answers = []
    for res in results:
        result = res['response']['body']['choices'][0]['message']['content']
        answers.append(result)
    return answers
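One subtlety: get_results() assumes the output file lists responses in the same order as the submitted tasks. If you want to be defensive about ordering, a variant could sort the parsed objects by the numeric suffix of custom_id before extracting the answers. The sketch below (get_results_ordered is a hypothetical name) assumes each output object echoes the custom_id field, as in the OpenAI-style batch format:
def get_results_ordered(client, job_id):
    # Same as get_results(), but sorts responses by the index embedded in custom_id
    batch_job = client.batches.retrieve(job_id)
    raw = client.files.content(batch_job.output_file_id).content
    results = parse_json_objects(raw)
    # custom_id has the form "<task_type>-<index>", e.g. "assistant-42"
    results.sort(key=lambda r: int(r['custom_id'].rsplit('-', 1)[-1]))
    return [r['response']['body']['choices'][0]['message']['content'] for r in results]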
2. Data acquisition¶
Now that we have covered the core helper functions and the batch inference workflow, let's get our data. In this guide, we'll use the IMDb Top 1000 dataset, which contains information about top-rated movies, including their descriptions and genres. Let's download it and see what it looks like.
# IMDB Top 1000 dataset:
url = "https://raw.githubusercontent.com/kluster-ai/klusterai-cookbook/refs/heads/main/data/imdb_top_1000.csv"
urllib.request.urlretrieve(url,filename='imdb_top_1000.csv')
# Load the columns we need and keep the last 300 movies
df = pd.read_csv('imdb_top_1000.csv', usecols=['Series_Title', 'Overview', 'Genre']).tail(300)
df[['Series_Title','Overview']].head(3)
| | Series_Title | Overview |
|---|---|---|
| 700 | Wait Until Dark | A recently blinded woman is terrorized by a trio of thugs while they search for a heroin-stuffed doll they believe is in her apartment. |
| 701 | Guess Who's Coming to Dinner | A couple's attitudes are challenged when their daughter introduces them to her African-American fiancé. |
| 702 | Bonnie and Clyde | Bored waitress Bonnie Parker falls in love with an ex-con named Clyde Barrow and together they start a violent crime spree through the country, stealing cars and robbing banks. |
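Note that the Genre column, which we will only use later as optional ground truth, can contain several comma-separated genres per movie. A quick peek makes that clear (exact values depend on the rows kept above):
# Entries may contain multiple comma-separated genres
df['Genre'].head(3)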
3. Performing batch inference¶
In this section, we will perform batch inference using the previously defined helper functions and the IMDb dataset. The goal is to classify movie genres based on their descriptions using a Large Language Model (LLM).
Here, we define the input prompts for the LLM. This includes a system prompt that explains the task to the LLM and the user content, which is a list of movie descriptions from our dataset.
prompt_dict = {
"ASSISTANT_PROMPT" : '''
You are a helpful assistant that classifies movie genres based on the movie description. Choose one of the following options:
Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.
Provide your response as a single word with the matching genre. Don't include punctuation.
''',
"USER_CONTENTS" : df['Overview'].tolist()
}
The tasks will be created and saved. The batch inference job will then be submitted, and its progress will be monitored. Once the process is complete, the predictions will be integrated into the dataset.
task_list = create_tasks(user_contents=prompt_dict["USER_CONTENTS"],
                         system_prompt=prompt_dict["ASSISTANT_PROMPT"],
                         model="klusterai/Meta-Llama-3.1-8B-Instruct-Turbo",
                         task_type='assistant')
filename = save_tasks(task_list, task_type='assistant')
job = create_batch_job(filename)
monitor_job_status(client=client, job_id=job.id, task_type='assistant')
df['predicted_genre'] = get_results(client=client, job_id=job.id)
'Assistant job completed!'
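As a quick sanity check before moving on, you could eyeball a few predictions next to the titles; the exact outputs will vary from run to run:
# Peek at a few predicted genres alongside the movie titles
df[['Series_Title', 'predicted_genre']].head(3)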
4. LLM as a judge¶
This section evaluates the performance of the initial LLM predictions. We use another LLM as a judge to assess whether the predicted genres align with the movie descriptions.
First we define the input prompts for the LLM judge. These prompts include the movie description, a list of possible genres, and the genre predicted by the first LLM. The judge LLM evaluates the correctness of the predictions based on specific criteria.
prompt_dict = {
"JUDGE_PROMPT" : '''
You will be provided with a movie description, a list of possible genres, and a predicted movie genre made by another LLM. Your task is to evaluate whether the predicted genre is ‘correct’ or ‘incorrect’ based on the following steps and requirements.
Steps to Follow:
1. Carefully read the movie description.
2. Determine your own classification of the genre for the movie. Do not rely on the LLM's answer since it may be incorrect. Do not rely on individual words to identify the genre; read the whole description to identify the genre.
3. Read the LLM answer (enclosed in double quotes) and evaluate if it is the correct answer by following the Evaluation Criteria mentioned below.
4. Provide your evaluation as 'correct' or 'incorrect'.
Evaluation Criteria:
- Ensure the LLM answer (enclosed in double quotes) is one of the provided genres. If it is not listed, the evaluation should be ‘incorrect’.
- If the LLM answer (enclosed in double quotes) does not align with the movie description, the evaluation should be ‘incorrect’.
- The first letter of the LLM answer (enclosed in double quotes) must be capitalized (e.g., Drama). If it has any other capitalization, the evaluation should be ‘incorrect’.
- All other letters in the LLM answer (enclosed in double quotes) must be lowercase. Otherwise, the evaluation should be ‘incorrect’.
- If the LLM answer consists of multiple words, the evaluation should be ‘incorrect’.
- If the LLM answer includes punctuation, spaces, or additional characters, the evaluation should be ‘incorrect’.
Output Rules:
- Provide the evaluation with no additional text, punctuation, or explanation.
- The output should be in lowercase.
Final Answer Format:
evaluation
Example:
correct
''',
"USER_CONTENTS" : [f'''Movie Description: {row['Overview']}.
Available Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
LLM answer: "{row['predicted_genre']}"
''' for _, row in df.iterrows()
]
}
Following the same set of steps as the previous inference, we will create and save the tasks, submit the batch inference job, and monitor its progress. Once the process is complete, the predictions will also be integrated into the dataset.
task_list = create_tasks(user_contents=prompt_dict["USER_CONTENTS"],
                         system_prompt=prompt_dict["JUDGE_PROMPT"],
                         task_type='judge',
                         model="klusterai/Meta-Llama-3.1-405B-Instruct-Turbo")
filename = save_tasks(task_list, task_type='judge')
job = create_batch_job(filename)
monitor_job_status(client=client, job_id=job.id, task_type='judge')
df['judge_evaluation'] = get_results(client=client, job_id=job.id)
'Judge job completed!'
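Because the next step counts exact string matches for 'correct', it can be worth normalizing the judge's outputs first in case a response comes back with stray whitespace or capitalization. This is an optional safeguard, not something the original pipeline requires:
# Optional: normalize judge outputs before counting exact matches
df['judge_evaluation'] = df['judge_evaluation'].str.strip().str.lower()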
Now, we will calculate the classification accuracy based on what the LLM judge considers correct or incorrect. If you are unfamiliar with accuracy as a metric, please refer to our previous notebook.
print('LLM Judge-determined accuracy: ',df['judge_evaluation'].value_counts(normalize=True)['correct'])
LLM Judge-determined accuracy: 0.86
Conclusion¶
According to the LLM Judge, the accuracy of the baseline model was 86%. This demonstrates how, in situations where we lack ground truth, we can leverage a large language model to evaluate the responses of another model. By doing so, we can establish a form of ground truth or an evaluation metric that allows us to assess model performance, refine prompts, or understand how well the model is performing overall.
This approach is particularly valuable when dealing with large datasets containing thousands of entries, where manual evaluation would be impractical. Automating this process not only saves significant time but also reduces costs by eliminating the need for extensive human annotations. Ultimately, it provides a scalable and efficient way to gain meaningful insights into model performance.
(Optional) Validation against ground truth¶
According to the LLM Judge, the accuracy of the baseline model is 86%. But how accurate is this evaluation? In this particular case, the IMDb Top 1000 dataset provides ground truth labels, allowing us to directly calculate the accuracy of the predicted genres. Let's compare and see how close the results are.
print('LLM ground truth accuracy: ',df.apply(lambda row: row['predicted_genre'] in row['Genre'].split(', '), axis=1).mean())
LLM ground truth accuracy: 0.7833333333333333
Although the ground truth accuracy (roughly 78%) is not identical to the 86% estimated by the LLM Judge, in situations where we lack ground truth, using an LLM as an evaluator offers a valuable way to assess how well our baseline model is performing.
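If you want to quantify that agreement further, one option is to compare the judge's verdicts against ground-truth correctness directly, for example with the confusion_matrix and accuracy_score utilities imported earlier. This is only a sketch, and it assumes the judge_evaluation column contains 'correct' or 'incorrect' for every row:
# Ground-truth correctness of each prediction
gt_correct = df.apply(lambda row: row['predicted_genre'] in row['Genre'].split(', '), axis=1)
# Judge's verdict as a boolean
judge_correct = df['judge_evaluation'].str.strip().str.lower() == 'correct'
# How often the judge agrees with the ground truth, plus the full breakdown
print('Judge/ground-truth agreement:', accuracy_score(gt_correct, judge_correct))
print(confusion_matrix(gt_correct, judge_correct))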