Evaluating LLM performance without ground truth using an LLM judge¶
How can we test a model's accuracy when the ground truth is unavailable? One approach is to compare the base model's predictions against those of a larger model, which, comparatively, should do better.
This tutorial uses a base model (klusterai/Meta-Llama-3.1-8B-Instruct-Turbo) to classify a dataset based on a description. Next, we use a larger model (klusterai/Meta-Llama-3.3-70B-Instruct-Turbo) as a judge, tasked with determining whether the base model's predictions are correct. Since the dataset also contains the ground truth, the notebook also assesses how well the judge model performed.
A great breakdown on calculating a model's accuracy can be found in our model comparison notebook.
You'll be using the same dataset as in our text classification notebook, which is an extract from the IMDB top 1000 movies dataset categorized into 21 different genres.
Prerequisites¶
Before getting started, ensure you have the following:
- A kluster.ai account - sign up on the kluster.ai platform if you don't have one
- A kluster.ai API key - after signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide
Setup¶
In this notebook, we'll use Python's getpass module to input the key safely. After execution, please provide your unique kluster.ai API key (ensure no spaces).
from getpass import getpass
api_key = getpass("Enter your kluster.ai API key: ")
Enter your kluster.ai API key: ········
Next, ensure you've installed the OpenAI Python library:
%pip install -q openai
Note: you may need to restart the kernel to use updated packages.
With the OpenAI Python library installed, we import the necessary dependencies for the tutorial:
from openai import OpenAI
import pandas as pd
import time
import json
import os
import urllib.request
Then, initialize the client by pointing it to the kluster.ai endpoint and passing your API key.
# Set up the client
client = OpenAI(
base_url="https://api.kluster.ai/v1",
api_key=api_key,
)
Get the data¶
Now that you've initialized an OpenAI-compatible client pointing to kluster.ai, we can talk about the data.
This notebook uses the Top 1000 IMDb Movies dataset, which contains a description and genres for each movie. In some cases, a movie has more than one genre label. When calculating accuracy, we'll consider a prediction correct if the predicted genre matches at least one of the genres listed in the dataset (the ground truth). This ground truth allows the notebook to calculate accuracy and measure how well a given LLM has performed.
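To illustrate the scoring rule, here is a minimal sketch (with a hypothetical prediction) of how a single-word answer is checked against a multi-genre ground-truth label:

# Minimal sketch of the scoring rule: a prediction counts as correct if it
# matches at least one of the comma-separated ground-truth genres
ground_truth = "Crime, Drama"  # example label from the dataset
prediction = "Drama"           # hypothetical base model output
genres = [genre.strip() for genre in ground_truth.split(",")]
print(prediction in genres)    # True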
# IMDB Top 1000 dataset:
url = "https://raw.githubusercontent.com/kluster-ai/klusterai-cookbook/refs/heads/main/data/imdb_top_1000.csv"
urllib.request.urlretrieve(url,filename='imdb_top_1000.csv')
# Load and process the dataset based on URL content
df = pd.read_csv('imdb_top_1000.csv', usecols=['Series_Title', 'Overview', 'Genre'])
df.head(3)
| | Series_Title | Genre | Overview |
|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... |
Perform batch inference¶
To execute the batch inference job, we'll create the following functions:
- Create the batch job file - we'll generate a JSON Lines file with the requests to be processed by the model. Accordingly, we'll create one file for the base model and one for the judge model. You can also work with a single file by specifying a different model for each request
- Upload the batch job file - once it is ready, we'll upload it to the kluster.ai platform using the API, where it will be processed. We'll receive a unique ID associated with our file
- Start the batch job - after the file is uploaded, we'll initiate the job to process the uploaded data, using the file ID obtained before
- Monitor job progress - (optional) track the status of the batch job to ensure it has been successfully completed
- Retrieve results - once the job has completed execution, we can access and process the resultant data
Next, we will run the functions for the base model and feed the results to the pipeline using the judge model.
This notebook is prepared for you to follow along. Run the cells below to watch it all come together.
Create the batch job file¶
The following snippets prepare the JSONL file, where each line represents a different request. The function is designed to be reused for both the base and judge models.
Note that each separate batch request can have its own model. Also, we are using a temperature of 0.5, but feel free to change it and experiment with different outcomes (keep in mind we only ask the model to respond with a single word, the genre).
# Ensure the directory exists
os.makedirs("llm_as_judge", exist_ok=True)
# Create the batch job file with the prompt and content for the model
def create_batch_file(index, model, system_prompt, content):
request = {
"custom_id": f"{model}-{index}-analysis",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": model,
"temperature": 0.5,
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": content},
],
},
}
return request
# Save file
def save_batch_file(batch_list, model):
filename = f"llm_as_judge/batch_job_{model}_request.jsonl"
with open(filename, "w") as file:
for request in batch_list:
file.write(json.dumps(request) + "\n")
return filename
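For reference, the following sketch (with placeholder prompt text) shows what one request looks like before it is written as a line of the JSONL file:

# Sketch: build and inspect a single request (placeholder prompts for illustration)
sample_request = create_batch_file(
    index=0,
    model="klusterai/Meta-Llama-3.1-8B-Instruct-Turbo",
    system_prompt="You are a helpful assistant.",
    content="Two imprisoned men bond over a number of years...",
)
print(json.dumps(sample_request, indent=2))

Each line in the saved file is one such request, with a unique custom_id, the target model, and the chat messages under body.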
Upload batch job file to kluster.ai¶
Once we've prepared our input files, it's time to upload them to the kluster.ai platform. To do so, use the files.create endpoint of the client with the purpose set to batch. This returns the file ID, which we need to keep for the next steps.
def upload_batch_file(data_dir):
print(f"Creating request for {data_dir}")
with open(data_dir, 'rb') as file:
upload_response = client.files.create(
file=file,
purpose="batch"
)
# Print job ID
file_id = upload_response.id
print(f"File uploaded successfully. File ID: {file_id}")
return upload_response
Start the job¶
Once all the files have been successfully uploaded, we're ready to start (create) the batch jobs by providing the file IDs. To start each job, we use the batches.create method, setting the endpoint to /v1/chat/completions. This returns each batch job's details, including its ID.
# Create batch job with completions endpoint
def create_batch_job(file_id):
batch_job = client.batches.create(
input_file_id=file_id,
endpoint="/v1/chat/completions",
completion_window="24h"
)
print(f"Batch job created with ID {batch_job.id}")
return batch_job
Check job progress¶
Once your batch jobs have been created, you can track their progress.
To monitor a job's progress, we can use the batches.retrieve method and pass the batch job ID. The response contains a status field that indicates whether the job has completed, along with the progress of its requests. We can repeat this process for every batch job ID obtained in the previous step.
The following snippet checks the status of a batch job every 5 seconds until it completes.
def monitor_batch_job(job):
completed = False
# Loop until all jobs are completed
while not completed:
completed = True
updated_job = client.batches.retrieve(job.id)
status = updated_job.status
# If job is completed
if status == "completed":
msg = f"Job ended with status: {status}"
print(f"\r{msg}{' ' * (80 - len(msg))}", end="", flush=True)
break
# If job failed, cancelled or expired
elif status in ["failed", "cancelled", "expired"]:
print(f"\rJob ended with status: {status}")
break
# If job is ongoing
else:
completed = False
current_completed = updated_job.request_counts.completed
total = updated_job.request_counts.total
msg = f"Job status: {status} - Progress: {current_completed}/{total}"
print(f"\r{msg}{' ' * (80 - len(msg))}", end="", flush=True)
# Check every 5 seconds
if not completed:
time.sleep(5)
Get the results¶
When a batch job is completed, we'll retrieve the results and review the responses generated for each request, parsing them as JSON. To fetch the results from the platform, retrieve the output_file_id from the batch job and then call the files.content endpoint with that file ID. We'll repeat this for every batch job ID. Note that the job status must be completed before the results can be retrieved!
#Parse results as a JSON object
def parse_json_objects(data_string):
if isinstance(data_string, bytes):
data_string = data_string.decode('utf-8')
json_strings = data_string.strip().split('\n')
json_objects = []
for json_str in json_strings:
try:
json_obj = json.loads(json_str)
json_objects.append(json_obj)
except json.JSONDecodeError as e:
print(f"Error parsing JSON: {e}")
return json_objects
# Retrieve results with job ID
def retrieve_results(batch_job):
job = client.batches.retrieve(batch_job.id)
result_file_id = job.output_file_id
result = client.files.content(result_file_id).content
# Parse JSON results
parsed_result = parse_json_objects(result)
answers = []
# Extract the content of each response
for item in parsed_result:
try:
content = item["response"]["body"]["choices"][0]["message"]["content"]
answers.append(content)
except KeyError as e:
print(f"Missing key in response: {e}")
return answers
Now that the basic inference pipeline has been established, let's run it first for the base model.
Batch inference for the base model¶
This example uses Llama 3.1 8B as the base model. If you'd like to test different models, feel free to modify the scripts accordingly.
Please refer to the Supported models section for a list of the models we support.
For the base model, the prompt is very similar to that of the text classification notebook: we ask the model to classify each movie's genre based on its description, providing a specific set of options as possible genres.
# System prompt
SYSTEM_PROMPT_BASE = """
You are a helpful assistant who classifies movie genres based on the provided description. Choose one of the following options:
Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western.
Provide your response as a single word with the matching genre. Don't include punctuation.
"""
# Model
model_name = "Llama3.1-8B-Base"
model = "klusterai/Meta-Llama-3.1-8B-Instruct-Turbo"
# Create batch file
batch_list = []
for index, row in df.iterrows():
content = row["Overview"]
batch_list.append(create_batch_file(index, model, SYSTEM_PROMPT_BASE, content))
filename = save_batch_file(batch_list, model_name)
print(f"Batch file created {filename}")
# Upload batch file
batch_file = upload_batch_file(filename)
# Create batch job
batch_job = create_batch_job(batch_file.id)
# Monitor batch job
monitor_batch_job(batch_job)
# Save results
df['Predicted_Genre_Base_Model'] = retrieve_results(batch_job)
Batch file created llm_as_judge/batch_job_Llama3.1-8B-Base_request.jsonl
Creating request for llm_as_judge/batch_job_Llama3.1-8B-Base_request.jsonl
File uploaded successfully. File ID: 681df4e8c92de30bfdc7d1aa
Batch job created with ID 681df4e8c39f51afee4d7adf
Job ended with status: completed
Next, let's print the first three predictions made by the base model.
# Print the first 3 genre predictions
df.head(3)
| | Series_Title | Genre | Overview | Predicted_Genre_Base_Model |
|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... | Drama |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... | Crime |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Superhero |
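Note that the third prediction, Superhero, is not among the genres offered in the system prompt. As a quick sanity check before involving the judge, the sketch below counts how many base model predictions fall outside the allowed list (restated here from the system prompt):

# Sketch: count base model predictions outside the allowed genre list
allowed_genres = {
    "Action", "Adventure", "Animation", "Biography", "Comedy", "Crime", "Drama",
    "Family", "Fantasy", "Film-Noir", "History", "Horror", "Music", "Musical",
    "Mystery", "Romance", "Sci-Fi", "Sport", "Thriller", "War", "Western",
}
out_of_list = ~df["Predicted_Genre_Base_Model"].isin(allowed_genres)
print(f"Predictions outside the allowed genres: {out_of_list.sum()}")

Predictions like these are exactly the kind of response the judge prompt below is instructed to mark as incorrect.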
With the base model inference performed, let's move to the judge model inference.
Batch inference for the judge model¶
This example uses the larger Llama 3.3 70B as the judge model (the artificial ground truth). If you'd like to test different models, feel free to modify the scripts accordingly.
Please refer to the Supported models section for a list of the models we support.
For the judge model, we must be very specific about the task to be executed, providing unambiguous guidelines on what constitutes a correct and an incorrect prediction by the base model. For instance, we must also account for cases in which the base model returns a response that is not formatted correctly. You may need to tune the prompt to ensure the judge model accurately evaluates the base model's responses.
# System prompt
SYSTEM_PROMPT_JUDGE = """
You will receive a movie description, a list of possible genres, and a predicted movie genre made by another LLM (base model).
Your task is to evaluate whether the predicted genre is ‘correct’ or ‘incorrect’ based on the following steps and requirements.
Steps to Follow:
1. Carefully read the movie description.
2. Determine your own classification of the genre for the movie. Do not rely on the base model answer since it may be incorrect.
3. Do not rely on individual words to identify the genre; read the whole description to identify the genre.
4. Read the base model answer (enclosed in double quotes) and evaluate if it is correct by following the Evaluation Criteria below.
5. Provide your evaluation as 'correct' or 'incorrect'.
Evaluation Criteria:
- If the base model answer (enclosed in double quotes) does not align with the movie description, the evaluation should be ‘incorrect’.
- The first letter of the base model answer (enclosed in double quotes) must be capitalized (e.g., Drama). If it has any other capitalization, the evaluation should be ‘incorrect’.
- All other letters in the base model answer (enclosed in double quotes) must be lowercase. Otherwise, the evaluation should be ‘incorrect’.
- If the base model answer consists of multiple words, the evaluation should be ‘incorrect’.
- If the base model answer includes punctuation, spaces, or additional characters, the evaluation should be ‘incorrect’.
- If the base model answer (enclosed in double quotes) is not one of the provided genres, the evaluation should be ‘incorrect’.
- If it is not listed, the evaluation should be ‘incorrect’.
Output Rules:
- Provide your genre prediction and evaluation without additional text, punctuation, or explanation.
- The output must be: the genre prediction and the evaluation. The first letter uppercase and all other letters lowercase.
Final Answer Format:
Prediction,Evaluation
Example:
Drama,Correct
"""
# Model
model_name = "Llama3.3-70B-Judge"
model = "klusterai/Meta-Llama-3.3-70B-Instruct-Turbo"
# Create batch file
batch_list = []
for index, row in df.iterrows():
# Message content for judging
content = f"""
Movie Description: {row['Overview']}.
Available Genres: Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, Romance, Sci-Fi, Sport, Thriller, War, Western
Base model answer: "{row['Predicted_Genre_Base_Model']}"
"""
batch_list.append(create_batch_file(index, model, SYSTEM_PROMPT_JUDGE, content))
filename = save_batch_file(batch_list, model_name)
print(f"Batch file created {filename}")
# Upload batch file
batch_file = upload_batch_file(filename)
# Create batch job
batch_job = create_batch_job(batch_file.id)
# Monitor batch job
monitor_batch_job(batch_job)
# Save results
df['Judge_Prediction_Evaluation'] = retrieve_results(batch_job)
Batch file created llm_as_judge/batch_job_Llama3.3-70B-Judge_request.jsonl
Creating request for llm_as_judge/batch_job_Llama3.3-70B-Judge_request.jsonl
File uploaded successfully. File ID: 681df517030fcc793229acbd
Batch job created with ID 681df51861f50fed2032f44d
Job ended with status: completed
Next, let's print the first 10 predictions with the evaluation from the judge model.
# Print the first 10 judge evaluations
df.head(10)
| | Series_Title | Genre | Overview | Predicted_Genre_Base_Model | Judge_Prediction_Evaluation |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Drama | Two imprisoned men bond over a number of years... | Drama | Drama,Correct |
| 1 | The Godfather | Crime, Drama | An organized crime dynasty's aging patriarch t... | Crime | Drama,Incorrect |
| 2 | The Dark Knight | Action, Crime, Drama | When the menace known as the Joker wreaks havo... | Superhero | Action,Incorrect |
| 3 | The Godfather: Part II | Crime, Drama | The early life and career of Vito Corleone in ... | Crime | Crime,Correct |
| 4 | 12 Angry Men | Crime, Drama | A jury holdout attempts to prevent a miscarria... | Mystery | Drama,Incorrect |
| 5 | The Lord of the Rings: The Return of the King | Action, Adventure, Drama | Gandalf and Aragorn lead the World of Men agai... | Fantasy | Fantasy,Correct |
| 6 | Pulp Fiction | Crime, Drama | The lives of two mob hitmen, a boxer, a gangst... | Crime | Crime,Correct |
| 7 | Schindler's List | Biography, Drama, History | In German-occupied Poland during World War II,... | Drama | History,Incorrect |
| 8 | Inception | Action, Adventure, Sci-Fi | A thief who steals corporate secrets through t... | Thriller | Sci-Fi,Incorrect |
| 9 | Fight Club | Drama | An insomniac office worker and a devil-may-car... | Drama | Thriller,Incorrect |
Analysis of the LLM as a judge¶
In the previous sections, we first defined a batch inference pipeline. Next, we ran that pipeline with the base model, asking it to predict each movie's genre based on a brief overview/description. Lastly, we ran another batch inference with the judge model, asking it to evaluate the base model's results against its own prediction.
To analyze the accuracy of the base model using the judge model as ground truth, we can count the number of correct evaluations.
# Extract the evaluation from judge model
df["Evaluation"] = df["Judge_Prediction_Evaluation"].str.split(",").str[1].str.strip()
print('LLM Judge-determined accuracy: ', (df["Evaluation"].str.lower() == "correct").mean())
LLM Judge-determined accuracy: 0.592
# Clean Genre data
df["Actual_Genres_List"] = df["Genre"].str.split(",").apply(lambda genres: [g.strip() for g in genres])
# Base model compared to ground truth
df["Base_Eval"] = df.apply(lambda row: row["Predicted_Genre_Base_Model"] in row["Actual_Genres_List"], axis=1)
# Extract the predicted genre from judge model
df["Judge_Predicted_Genre"] = df["Judge_Prediction_Evaluation"].str.split(",").str[0].str.strip()
# Judge model compared to ground truth
df["Judge_Eval"] = df.apply(lambda row: row["Judge_Predicted_Genre"] in row["Actual_Genres_List"], axis=1)
print(f"Base model ground truth determined accuracy: {df["Base_Eval"].mean()}")
print(f"Judge model ground truth determined accuracy: {df["Judge_Eval"].mean()}")
Base model ground truth determined accuracy: 0.725
Judge model ground truth determined accuracy: 0.767
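Because both the judge's verdict and the real ground truth are available, we can also check how often they agree about the base model. This is not part of the original pipeline, just a short sketch reusing the Evaluation and Base_Eval columns computed above:

# Agreement between the judge's verdict and the ground-truth verdict:
# "Evaluation" is the judge's correct/incorrect label for the base model,
# "Base_Eval" is whether the base prediction matched the real ground truth
judge_says_correct = df["Evaluation"].str.lower() == "correct"
agreement = (judge_says_correct == df["Base_Eval"]).mean()
print(f"Judge verdict agrees with the ground truth on {agreement:.1%} of movies")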
Summary¶
This tutorial used the chat completion endpoint to perform genre classification on movie descriptions from the IMDb Top 1000 dataset using the kluster.ai batch API. Additionally, we used a larger language model to generate a synthetic ground truth, enabling us to evaluate the base model's performance better.
First, we built a pipeline to submit batch jobs and retrieve results using the base model. We then applied the same pipeline with a larger "judge" model, which evaluated whether the base model's genre predictions were correct, based on strict formatting and semantic criteria.
Finally, we compared three types of accuracy (results may vary depending on each notebook execution):
- Base model accuracy against LLM ground truth: 59.2%
- Base model accuracy against real ground truth: 72.5%
- Judge model accuracy against real ground truth: 76.7%
These results show that while the base model achieved moderate accuracy when compared to human-annotated ground truth, it performed significantly worse when judged by the larger model. This suggests that the judge model applies stricter or more nuanced evaluation criteria. Notably, the judge model achieved the highest accuracy compared to the real ground truth.
As next steps, feel free to create your own dataset or build on this existing example. Good luck!