Verify

LLMs can generate non-factual or irrelevant information (hallucinations). For developers, this presents significant challenges:

  • Difficulty in programmatically trusting LLM outputs.
  • Increased complexity in error handling and quality assurance.
  • Potential for cascading failures in chained AI operations.
  • Requirement for manual review cycles, slowing down development and deployment.

Traditional validation methods may rely on complex rule sets or fine-tuning, or may exhibit high false-positive rates, adding to the development burden.

Verify is an intelligent verification service that validates LLM outputs in real time. It's designed to give you the confidence needed to deploy AI at scale in production environments where accuracy matters most.

This page provides an overview of the Verify service.

How Verify works

The Verify service functions as an intelligent agent. It assesses LLM output reliability based on three key inputs provided in the API call:

  1. prompt: The original input or question provided to the LLM. This gives context to the user's intent.
  2. output: The response generated by the LLM that requires validation.
  3. context (Optional): Any source material or documents provided to the LLM (e.g., in RAG scenarios) against which the output's claims should be verified.

Verify analyzes these inputs and can leverage real-time internet access to validate claims against up-to-date public information, extending its capabilities beyond static knowledge bases.
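
To make these inputs concrete, the snippet below sketches a verification request in Python. The endpoint URL, authentication header, field names, and response shape are illustrative assumptions, not the documented API; consult the service's API reference for the real request format.

```python
import requests

# Placeholder endpoint and key; the real URL, headers, and response
# schema come from the Verify API reference, not this sketch.
VERIFY_URL = "https://api.example.com/v1/verify"
API_KEY = "YOUR_API_KEY"

payload = {
    # The original question sent to the LLM (captures the user's intent).
    "prompt": "When did Apollo 11 land on the Moon?",
    # The LLM response that needs validation.
    "output": "Apollo 11 landed on the Moon on July 20, 1969.",
    # Optional source material for RAG-style checks; omit it for
    # context-free verification.
    "context": "Apollo 11 was the first crewed lunar landing, "
               "touching down on July 20, 1969.",
}

response = requests.post(
    VERIFY_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. a verdict plus an explanation; the schema may differ
```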

Performance benchmarks

Verify has been benchmarked against other solutions on the HaluEval and HaluBench datasets (over 25,000 samples).

  • Non-RAG Scenarios (Context-Free):
    • Compared against CleanLab TLM (GPT-4o mini, medium quality, optimized threshold).
    • Results: Verify showed 11% higher overall accuracy, a median F1 score 2.8 percentage points higher (72.3% vs. 69.5%), and higher precision (fewer false positives). Response times were comparable (under 10 seconds).
  • RAG Validation (Context-Provided):
    • Compared against Patronus AI's Lynx (70B) and CleanLab TLM.
    • Results: On RAGTruth (factual consistency), Verify significantly outperformed Lynx 70B and CleanLab TLM. On DROP (numerical/logical reasoning), Verify showed competitive performance against Lynx and outperformed CleanLab TLM.
    • Note: Lynx was trained on the training sets of DROP and RAGTruth, which highlights Verify's ability to generalize to unseen data.

These results indicate Verify's effectiveness in diverse scenarios relevant to production AI systems.

Target applications & use cases

Developers can integrate Verify into applications where LLM output accuracy is paramount:

  • Automated content generation pipelines.
  • Customer-facing chatbots and virtual assistants.
  • Question-answering systems over private or public data (RAG).
  • AI-driven data extraction and summarization tools.
  • Internal workflow automation involving LLM-generated text.
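
As a sketch of that last pattern, an application can gate LLM responses on the verification result before they reach users, falling back to a safe reply when a claim can't be verified. The `is_reliable` helper and the `reliable` response field below are illustrative assumptions built on the placeholder request shown earlier, not the documented API.

```python
import requests

VERIFY_URL = "https://api.example.com/v1/verify"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"


def is_reliable(prompt: str, output: str, context: str | None = None) -> bool:
    """Ask Verify whether an LLM output looks trustworthy.

    The payload fields and the 'reliable' response field are assumed
    for illustration; the real schema may differ.
    """
    payload = {"prompt": prompt, "output": output}
    if context is not None:
        payload["context"] = context
    resp = requests.post(
        VERIFY_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return bool(resp.json().get("reliable", False))


def answer(prompt: str, generate) -> str:
    """Return the LLM's answer only if it passes verification."""
    candidate = generate(prompt)
    if is_reliable(prompt, candidate):
        return candidate
    # Fail closed: a refusal is safer than a confident hallucination.
    return "I couldn't verify that answer; please try rephrasing."
```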