Retrieval-augmented generation¶
Models sometimes struggle to complete tasks because they lack specialized knowledge, which can lead to issues like hallucinations, where the model generates incorrect or fabricated information. To help mitigate this, Retrieval-Augmented Generation (RAG) lets the model pull real-time data from specified sources.
This tutorial demonstrates how to use the embeddings endpoint in a RAG pipeline with PDF document support using LlamaIndex to handle the data and retrieval.
Models:
- Embeddings: Leveraging the BAAI/bge-m3 model.
- Language model (LLM) for querying: For this example, we use Llama-4-Maverick-17B (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8). You can also choose any other model from the available models list.
Prerequisites¶
Before getting started, ensure you have the following:
- A kluster.ai account: Sign up on the kluster.ai platform if you don't have one.
- A kluster.ai API key: After signing in, go to the API Keys section and create a new key. For detailed instructions, check out the Get an API key guide.
Setup¶
In this notebook, we'll use Python's getpass module to safely input the key. Provide your unique kluster.ai API key (ensure there are no spaces).
from getpass import getpass
api_key = getpass("Enter your kluster.ai API key: ")
# Install the necessary packages, including the PDF reader for LlamaIndex
%pip install llama-index llama-index-llms-openai-like llama-index-embeddings-openai-like llama-index-readers-file requests
import os
import logging
import sys
import requests
import json
from pprint import pprint
import warnings
warnings.filterwarnings("ignore")
from openai import OpenAI
base_url="https://api.kluster.ai/v1"
# Configure kluster.ai client
client = OpenAI(
base_url=base_url,
api_key=api_key
)
Embeddings¶
Embeddings are numerical representations of text that capture their meaning in a format computers can understand.
Think of embeddings as coordinates in a high-dimensional space where similar meanings sit closer together. This lets RAG systems measure how related different pieces of text are, making it easier to find relevant information quickly.
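To make "closer together" concrete, here is a minimal, illustrative sketch that computes cosine similarity (a common closeness measure for embeddings) between made-up three-dimensional vectors. The vectors and values are hypothetical; real BAAI/bge-m3 embeddings have 1024 dimensions, but the math is identical.
# Illustrative only: cosine similarity between toy, made-up "embeddings"
from math import sqrt
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))
vec_tango = [0.9, 0.1, 0.2]  # hypothetical vector for a sentence about tango
vec_dance = [0.8, 0.2, 0.1]  # hypothetical vector for a sentence about dance
vec_math = [0.1, 0.9, 0.7]   # hypothetical vector for a sentence about calculus
print(cosine_similarity(vec_tango, vec_dance))  # related meanings -> higher score (~0.99)
print(cosine_similarity(vec_tango, vec_math))   # unrelated meanings -> lower score (~0.30)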
This section demonstrates how to generate embeddings directly with the BAAI/bge-m3 model and inspect the result.
Generate embeddings¶
Generating embeddings for RAG systems is crucial because embeddings capture the semantic meaning of text, enabling the efficient retrieval of relevant information from a knowledge base.
Let's convert a sample text to embeddings and print the output.
# Generate embedding for the example text about Buenos Aires
sample_text = "The capital of Argentina is Buenos Aires. It is known for its tango music and dance, as well as its vibrant nightlife."
response = client.embeddings.create(
model="BAAI/bge-m3",
input=sample_text,
encoding_format="float"
)
# Print the first ten dimensions of the embedding vector
print(f"Sample text: '{sample_text}'")
print(f"Model used: {response.model}")
print(f"Embedding dimensions: {len(response.data[0].embedding)}")
print("\nFirst ten dimensions of the embedding vector:")
print(response.data[0].embedding[:10])
# Show token usage information
print(f"\nToken usage: {response.usage.prompt_tokens} tokens")
Sample text: 'The capital of Argentina is Buenos Aires. It is known for its tango music and dance, as well as its vibrant nightlife.'
Model used: BAAI/bge-m3
Embedding dimensions: 1024

First ten dimensions of the embedding vector:
[0.054107666015625, 0.01416778564453125, -0.0236663818359375, 0.039306640625, -0.039337158203125, 0.0098724365234375, -0.013946533203125, 0.037872314453125, -0.057281494140625, 0.01058197021484375]

Token usage: 29 tokens
Batch embeddings¶
When building RAG systems, you often need to process multiple documents or text chunks. Instead of making individual API calls for each piece of text, batch embeddings allow you to process multiple texts in a single request, significantly improving efficiency and reducing latency.
The embeddings endpoint accepts an array of strings, processing up to 2048 individual text inputs in one call.
# Example: Processing multiple texts in a single batch
batch_texts = [
"The capital of Argentina is Buenos Aires.",
"Paris is known for the Eiffel Tower and its romantic atmosphere.",
"Tokyo is the most populous metropolitan area in the world.",
"London has a rich history dating back to Roman times.",
"New York City is often called the Big Apple."
]
# Generate embeddings for all texts in one API call
batch_response = client.embeddings.create(
model="BAAI/bge-m3",
input=batch_texts,
encoding_format="float"
)
print(f"Number of texts processed: {len(batch_texts)}")
print(f"Number of embeddings returned: {len(batch_response.data)}")
print(f"Total tokens used: {batch_response.usage.prompt_tokens}")
# Verify the embeddings are returned in the same order
for i, text in enumerate(batch_texts):
print(f"\nText {i+1}: '{text[:50]}...'")
print(f"Embedding dimensions: {len(batch_response.data[i].embedding)}")
print(f"First five values: {batch_response.data[i].embedding[:5]}")
Number of texts processed: 5
Number of embeddings returned: 5
Total tokens used: 64

Text 1: 'The capital of Argentina is Buenos Aires....'
Embedding dimensions: 1024
First five values: [0.037200927734375, 0.022430419921875, -0.032928466796875, 0.0226593017578125, -0.046539306640625]

Text 2: 'Paris is known for the Eiffel Tower and its romant...'
Embedding dimensions: 1024
First five values: [-0.0014429092407226562, 0.0240478515625, -0.0137481689453125, 0.04278564453125, -0.0023708343505859375]

Text 3: 'Tokyo is the most populous metropolitan area in th...'
Embedding dimensions: 1024
First five values: [-0.0001596212387084961, 0.0217742919921875, -0.0133056640625, 0.0304412841796875, 0.00852203369140625]

Text 4: 'London has a rich history dating back to Roman tim...'
Embedding dimensions: 1024
First five values: [-0.0438232421875, 0.03497314453125, -0.021148681640625, 0.0021762847900390625, -0.01172637939453125]

Text 5: 'New York City is often called the Big Apple....'
Embedding dimensions: 1024
First five values: [0.0038509368896484375, 0.03302001953125, -0.06243896484375, 0.0157318115234375, -0.0172882080078125]
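If you have more inputs than the 2048-per-request limit allows, you can split the list yourself and send several batch requests. The helper below is a minimal sketch, not part of the kluster.ai SDK; the function name and the batch-size constant are illustrative.
# Illustrative helper (not an SDK feature): embed an arbitrarily long list of texts
# by splitting it into batches that respect the 2048-input limit per request.
MAX_INPUTS_PER_REQUEST = 2048
def embed_in_batches(texts, model="BAAI/bge-m3"):
    all_embeddings = []
    for start in range(0, len(texts), MAX_INPUTS_PER_REQUEST):
        chunk = texts[start:start + MAX_INPUTS_PER_REQUEST]
        response = client.embeddings.create(
            model=model,
            input=chunk,
            encoding_format="float"
        )
        # Embeddings come back in the same order as the inputs, so extend in order
        all_embeddings.extend(item.embedding for item in response.data)
    return all_embeddings
# Works the same for small lists; here it makes a single request
print(f"Embedded {len(embed_in_batches(batch_texts))} texts")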
Performance comparison: single vs batch¶
To demonstrate the practical benefits of batch processing, let's compare the performance of individual API calls against a single batched embeddings request over the same set of texts. This comparison shows the time savings and efficiency gains you can expect when implementing batch embeddings in production systems.
import time
# Test texts for performance comparison
test_texts = [
"Machine learning is transforming industries.",
"Natural language processing enables computers to understand human language.",
"Deep learning models require significant computational resources.",
"Transfer learning allows models to apply knowledge from one domain to another.",
"Embeddings capture semantic meaning in numerical form."
]
# Method 1: Individual API calls (not recommended for production)
print("Method 1: Individual API calls")
start_time = time.time()
individual_embeddings = []
for text in test_texts:
response = client.embeddings.create(
model="BAAI/bge-m3",
input=text,
encoding_format="float"
)
individual_embeddings.append(response.data[0].embedding)
individual_time = time.time() - start_time
print(f"Time taken: {individual_time:.2f} seconds")
print(f"Number of API calls: {len(test_texts)}")
# Method 2: Batch API call (recommended)
print("\nMethod 2: Batch API call")
start_time = time.time()
batch_response = client.embeddings.create(
model="BAAI/bge-m3",
input=test_texts,
encoding_format="float"
)
batch_time = time.time() - start_time
print(f"Time taken: {batch_time:.2f} seconds")
print(f"Number of API calls: 1")
# Performance improvement
improvement = (individual_time / batch_time)
print(f"\nBatch processing is approximately {improvement:.1f}x faster!")
print(f"Time saved: {individual_time - batch_time:.2f} seconds")
Method 1: Individual API calls
Time taken: 2.81 seconds
Number of API calls: 5

Method 2: Batch API call
Time taken: 0.76 seconds
Number of API calls: 1

Batch processing is approximately 3.7x faster!
Time saved: 2.05 seconds
In practice, when building RAG systems with frameworks like LlamaIndex, the embedding batching is often handled automatically. Let's see how this works when we process PDF documents.
Build a RAG system with PDF documents¶
Now that we understand how embeddings work, let's build a complete RAG system using a real PDF document. This section demonstrates how to:
- Download and load a PDF document.
- Split the document into manageable chunks.
- Convert those chunks into embeddings using kluster.ai.
- Create a searchable knowledge base.
- Query the system to retrieve relevant information.
We'll use a research paper about polar bears as our knowledge source, showing how RAG can help answer specific questions about document content that wouldn't be in the LLM's training data.
Download the document¶
For this exercise, we use a large PDF file, but you can adapt this to your needs.
The files used will serve as the LLM's knowledge base.
Download the PDF and store it in the sample_pdfs directory.
import urllib.request
import os
# Create a directory for PDFs if it doesn't exist
pdf_dir = "sample_pdfs"
os.makedirs(pdf_dir, exist_ok=True)
# Download a sample PDF about Polar Bears (you can replace with your own PDFs)
sample_pdf_url = "https://portals.iucn.org/library/sites/library/files/documents/SSC-OP-007.pdf"
pdf_path = os.path.join(pdf_dir, "polar_bears.pdf")
if not os.path.exists(pdf_path):
print(f"Downloading sample PDF to {pdf_path}...")
urllib.request.urlretrieve(sample_pdf_url, pdf_path)
print("Download complete!")
else:
print(f"Sample PDF already exists at {pdf_path}")
Downloading sample PDF to sample_pdfs/polar_bears.pdf... Download complete!
Load the document¶
Now it's time to load the 115-page PDF document into memory.
This example leverages LlamaIndex's SimpleDirectoryReader as a data connector. Pass in an input directory or a list of files, and it selects the best file reader based on the file extensions.
# Import the necessary document loader from llama_index
from llama_index.core import Document
from llama_index.core import SimpleDirectoryReader
# Load documents from the PDF file
print(f"Loading PDF from {pdf_dir}...")
pdf_reader = SimpleDirectoryReader(input_dir=pdf_dir)
documents = pdf_reader.load_data()
print(f"Loaded {len(documents)} document(s) from PDF file")
Loading PDF from sample_pdfs... Loaded 115 document(s) from PDF file
Set up RAG¶
Sending the entire PDF document to the LLM would fill its context window without producing a useful answer. To address this, the document needs to be divided into smaller pieces called chunks. LlamaIndex provides an efficient way to accomplish this. The setup involves the following components:
- LLM: The model responsible for generating the responses from the knowledge base.
- Embedding model: The model used to convert text into vectors (embeddings), enabling semantic similarity searches.
- Chunking parameters: Define how documents are split into smaller chunks (nodes) for indexing (see the sketch after this list):
  - chunk_size: The size of each chunk, in tokens.
  - chunk_overlap: The number of tokens overlapping between consecutive chunks.
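To preview what these chunking parameters do before applying them globally, the following optional sketch splits the loaded documents with LlamaIndex's SentenceSplitter node parser (the same kind of splitter the index uses internally by default). The exact chunk count depends on the PDF, so treat the printed numbers as indicative.
# Optional preview: split the loaded documents into nodes with explicit chunking values
from llama_index.core.node_parser import SentenceSplitter
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} document page(s) -> {len(nodes)} chunk(s)")
print(f"First chunk preview: {nodes[0].get_content()[:200]}...")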
LlamaIndex provides OpenAI-compatible interfaces that allow you to connect to any API that follows the OpenAI format. Since kluster.ai uses OpenAI-compatible endpoints, we use:
- OpenAILike: A wrapper that adapts kluster.ai's chat completion API to work with LlamaIndex's LLM interface.
- OpenAILikeEmbedding: A wrapper that adapts kluster.ai's embeddings API to work with LlamaIndex's embedding interface.
This approach allows you to use kluster.ai models seamlessly within LlamaIndex without requiring custom integration code.
To set up with kluster.ai, configure OpenAILike for the LLM and OpenAILikeEmbedding for the embedding model.
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import Settings
# Set up the LlamaIndex LLM client with kluster.ai
llm = OpenAILike(
model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
api_base=base_url,
api_key=api_key,
is_chat_model=True
)
# Set up the embedding model
embed_model = OpenAILikeEmbedding(
model_name="BAAI/bge-m3",
api_base=base_url,
api_key=api_key
)
# Set the global settings for LlamaIndex
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512 # Set chunk size for document splitting
Settings.chunk_overlap = 20 # Set chunk overlap for document splitting
print("kluster.ai embedding model configured.")
kluster.ai embedding model configured.
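LlamaIndex sends chunk embeddings to the endpoint in batches when it builds the index. If you want to tune how many chunks go into each request, the embedding wrapper accepts an embed_batch_size argument inherited from LlamaIndex's base embedding class; the value below is only an example.
# Optional: control how many chunks LlamaIndex embeds per API request.
# embed_batch_size is a LlamaIndex embedding setting; 64 is an arbitrary example value.
embed_model = OpenAILikeEmbedding(
    model_name="BAAI/bge-m3",
    api_base=base_url,
    api_key=api_key,
    embed_batch_size=64
)
Settings.embed_model = embed_model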
Create the vector store¶
Before building the index, let's quickly clarify a couple of terms:
- Index: A searchable structure built from your documents that enables fast similarity search
- Vector store: Stores the embeddings (vector representations) of each document chunk, enabling rapid retrieval
Creating a VectorStoreIndex combines these concepts — it enables your RAG pipeline to efficiently search your PDF and retrieve the most relevant chunks, grounding the LLM's responses in actual document content.
VectorStoreIndex.from_documents(documents) takes the PDF and internally breaks these documents into smaller text chunks (nodes) using settings like the chunk_size defined earlier.
Next, it converts each chunk into embeddings using the pre-configured embedding model. LlamaIndex automatically batches these embedding requests for efficiency, similar to what we demonstrated above.
Finally, it stores these chunks and their embeddings in an in-memory vector store, making the index object ready for efficient similarity searches.
Create a VectorStoreIndex from the PDF document.
from llama_index.core import VectorStoreIndex
# Create the index from the PDF document
print("Creating index from PDF document...")
index = VectorStoreIndex.from_documents(
documents
)
# Get the document store from the index
docstore = index.docstore
# Get the number of nodes (chunks) in the document store
num_nodes = len(docstore.docs)
print("Index created successfully!")
print(f"Number of text chunks (nodes) indexed: {num_nodes}")
# You can also print the index ID if you're curious
print(f"Index ID: {index.index_id}")
# And the type of vector store being used (by default it's an in-memory SimpleVectorStore)
print(f"Vector store type: {type(index.vector_store)}")
Creating index from PDF document...
Index created successfully!
Number of text chunks (nodes) indexed: 320
Index ID: 93e6a54a-ca85-44c9-8bac-953c15483f06
Vector store type: <class 'llama_index.core.vector_stores.simple.SimpleVectorStore'>
Create the query engine¶
When a question is asked, it's first converted into a numerical embedding. The engine then searches the VectorStoreIndex to retrieve the most semantically similar text chunks from the document. Finally, these retrieved chunks (as context) and the original question are given to the LLM to generate a grounded answer.
The following steps create a query engine and a helper function for direct LLM responses (which we'll use later for comparison).
# Create a query engine for RAG
query_engine = index.as_query_engine()
# Function to get a direct response from the LLM without using RAG
def get_direct_llm_response(query):
"""Get a response directly from the LLM without using RAG"""
return llm.complete(query).text
print("Query engine created successfully!")
Test the RAG¶
Now that the RAG system is configured, test it with a query about the PDF document. This demonstration shows how a single RAG query is processed:
- Query processing: The question is converted into an embedding vector.
- Chunk retrieval: The system finds the most relevant document chunks using similarity search.
- Response generation: The LLM uses the retrieved chunks as context to answer the question.
The following example includes detailed output showing the retrieved chunks and their similarity scores, helping you understand how the RAG system selects relevant information from the knowledge base.
# Query about content from the PDF
pdf_query = "Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> If you don't know, say 'I don't know'."
print(f"Querying the RAG system with:\n'{pdf_query}'\n")
# --- Step 1: Query the RAG engine ---
# This step internally performs:
# 1. Query Embedding: Your 'pdf_query' is converted to a vector.
# 2. Retrieval: The vector is used to find the most similar document chunks (Nodes) from your VectorStoreIndex.
print("--- Processing RAG query... ---")
rag_response_object = query_engine.query(pdf_query)
# The 'rag_response_object' now contains both the retrieved nodes and the final synthesized answer.
# --- Step 2: Inspect the Retrieved Chunks (Source Nodes) ---
print("\n--- Retrieved Context (Source Nodes used by RAG) ---")
if rag_response_object.source_nodes:
for i, source_node in enumerate(rag_response_object.source_nodes):
print(f"Source Node {i+1} (Similarity Score: {source_node.score:.4f}):")
# .get_content() is a robust way to get the text from the node.
# .strip() removes leading/trailing whitespace for cleaner printing.
print(f"Retrieved Chunk: \"{source_node.node.get_content().strip()[:20]}...\"")
# You can also print other metadata if available, e.g., source_node.node.metadata
# print(f"Metadata: {source_node.node.metadata}")
print("-" * 30)
else:
print("No source nodes were retrieved for this query.")
# --- Step 3: See the Final LLM Response (Synthesized with RAG) ---
# This is the answer generated by the LLM based on your query AND the retrieved chunks.
print("\n--- Final RAG Response (using knowledge base) ---")
print(rag_response_object.response) # .response attribute holds the textual answer
Querying the RAG system with:
'Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> If you don't know, say 'I don't know'.'

--- Processing RAG query... ---

--- Retrieved Context (Source Nodes used by RAG) ---
Source Node 1 (Similarity Score: 0.6246):
Retrieved Chunk: "Cape ChurchillWildli..."
------------------------------
Source Node 2 (Similarity Score: 0.6136):
Retrieved Chunk: "Table1. continued Ca..."
------------------------------

--- Final RAG Response (using knowledge base) ---
The statement is true. The given context information contains the exact quote on page_label: 10, confirming that the NWT indeed suggested caution regarding the proposal to transport polar bear hides to the U.S. on CITES permits and recommended considering whalebone carvings and seal-skin products first.
Compare results¶
To highlight the effectiveness of RAG, the following code compares responses from the RAG system against direct LLM responses without any document context. This comparison demonstrates how RAG provides more accurate, grounded answers for domain-specific questions by leveraging the knowledge base.
# Let's fact check the same query using the direct LLM response without RAG
pdf_query = "Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> **Important: if you don't know the answer just reply 'SORRY, I DON'T KNOW' without any other text.** If you do have data to answer, provide full answer quoting the sources"
print(f"Query: {pdf_query}\n")
print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(pdf_query)
print(direct_response)
print("--- RAG Response (using knowledge base) ---")
rag_response = query_engine.query(pdf_query)
print(f"{rag_response}")
Query: Fact check this: <quote> The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears.</quote> **Important: if you dont the answer just reply 'SORRY, I DONT KNOW' without any other text.** If you do have data to answer, provide full answer quoting the sources

--- Direct LLM Response (without RAG) ---
SORRY, I DONT KNOW

--- RAG Response (using knowledge base) ---
"The NWT suggested caution regarding a proposal that polar bear hides be transportable to the U.S. on CITES permits. It was suggested that whalebone carvings and seal-skin products be considered first and then if there are no political problems, possibly consider polar bears." is TRUE according to page_label: 10.
Continue testing queries against the knowledge base to evaluate how well the RAG system retrieves and grounds answers using the PDF document.
This demonstrates the effectiveness of retrieval-augmented generation compared to direct LLM responses.
# Query about a specific technical detail in the paper
technical_query = "What does the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue say about the CHC levels? IMPORTANT: If you don't know, say 'I don't know'."
print(f"Query: {technical_query}\n")
print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(technical_query)
print(direct_response)
print("--- RAG Response (using knowledge base) ---")
rag_response = query_engine.query(technical_query)
print(f"{rag_response}\n")
Query: What does the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue says about the CHC levels? IMPORTANT: If you don't know, say 'I don't know'.

--- Direct LLM Response (without RAG) ---
I don't know the specific details about what the Toxicology and Monitoring of Pollutant Levels in Polar Bear Tissue says about the CHC levels. If you're looking for accurate information on this topic, I recommend consulting the original research or a reliable scientific summary.

--- RAG Response (using knowledge base) ---
The levels of CHCs were generally inversely correlated to latitude, and reanalysis of polar bear fat samples showed that the level of most CHCs, especially chlordane compounds, had increased from 1969 to 1984 in Hudson Bay and Baffin Bay bears.
# Query about authors and publication details
authors_query = "Who are the authors of the Polar Bear Paper? IMPORTANT: If you don't know, say 'I don't know'."
print(f"Query: {authors_query}\n")
print("--- Direct LLM Response (without RAG) ---")
direct_response = get_direct_llm_response(authors_query)
print(direct_response)
print("--- RAG Response (using knowledge base) ---")
rag_response = query_engine.query(authors_query)
print(f"{rag_response}\n")
Query: Who are the authors of the Polar Bear Paper? IMPORTANT: If you don't know, say 'I don't know'.

--- Direct LLM Response (without RAG) ---
I don't know.

--- RAG Response (using knowledge base) ---
Steven C. Amstrup and Oystein Wiig are the compilers and editors of the Polar Bear publication, as indicated on page 3. However, the authors listed in the references on page 29 include Stirling, Schweinsburg, Kolenosky, Juniper, Robertson, Luttich, Calvelt, Sjare, Taylor, Bunnell, DeMaster, and Smith. Without more information, it's unclear if they are authors of the Polar Bear Paper or just cited references. Therefore, a more accurate answer would be that Steven C. Amstrup and Oystein Wiig are the compilers and editors, while the other names appear as authors of cited references.
Conclusion¶
This notebook demonstrated a RAG system using LlamaIndex and kluster.ai, incorporating a PDF document as a knowledge source. Key takeaways include:
- Embeddings functionality: Generated and visualized embeddings using the BAAI/bge-m3 model.
- Batch processing benefits: Demonstrated how batch embeddings provide significant performance improvements.
- PDF integration: Loaded and processed a research paper for the knowledge base, with LlamaIndex handling embedding batching automatically.
- RAG vs. direct LLM comparison: Compared responses from the RAG system to direct LLM outputs.
Next Steps:
- Try with different PDF documents or document types.
- Experiment with different chunking strategies to optimize retrieval.
- Explore other embedding models available on kluster.ai.