What are Evals?
Evals is an open-source framework from OpenAI designed to systematically evaluate large language models (LLMs) and LLM-powered systems. An eval is a structured test or benchmark that measures a model’s output quality on specific tasks by comparing responses to expected answers or criteria.
OpenAI maintains a registry of ready-made evals across various domains, and developers can also create custom evals using proprietary data to match their application’s needs.
Why Evals matter
Building with LLMs requires ongoing experimentation. Evals provide objective, reproducible metrics, such as accuracy and consistency. This matters because even small modifications can destabilize a system or introduce regressions, so before any change goes into production the whole LLM application typically needs to be re-evaluated end-to-end. Evals make that process scalable and reliable. They are essential for:
- Ensuring application stability as models evolve
- Catching regressions before deployment, often via CI/CD integration (see the sketch at the end of this section)
- Reducing risk and increasing trust in LLM deployments
“Evals are surprisingly often all you need.”
Greg Brockman, OpenAI President
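As a concrete example of a regression gate, the sketch below (a hypothetical script, not part of the Evals framework itself) compares a freshly computed accuracy against a stored baseline and fails the CI run if quality drops. The file names, tolerance, and scoring rule are placeholders for whatever your project actually uses:

```python
# Hypothetical CI regression gate: fail the pipeline if eval accuracy drops
# below a stored baseline. File names, tolerance, and the scoring rule are
# placeholders, not part of any official tooling.
import json
import sys


def compute_accuracy(samples: list[dict]) -> float:
    """Score each sample by exact match against its expected answer."""
    correct = sum(1 for s in samples if s["model_output"].strip() == s["ideal"].strip())
    return correct / len(samples)


def main() -> None:
    with open("eval_results.jsonl", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    with open("baseline.json", encoding="utf-8") as f:
        baseline = json.load(f)["accuracy"]

    accuracy = compute_accuracy(samples)
    print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
    if accuracy < baseline - 0.02:  # small tolerance to absorb sampling noise
        sys.exit("Regression detected: accuracy dropped below baseline.")


if __name__ == "__main__":
    main()
```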
Understanding model behavior
Evals enable systematic comparison of different models or versions by running standardized test cases and producing quantitative results. They can measure:
- Factual accuracy
- Reasoning and chain-of-thought quality
- Instruction following (e.g., valid JSON output)
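A check for the last point can be fully deterministic. The sketch below is illustrative and independent of any particular framework; the sample outputs are made up:

```python
# Minimal instruction-following check: is the model's reply valid JSON?
import json


def follows_json_instruction(model_output: str) -> bool:
    """Return True if the output parses as JSON, as the prompt instructed."""
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False


print(follows_json_instruction('{"answer": 42}'))   # True
print(follows_json_instruction("The answer is 42"))  # False
```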
The OpenAI Evals registry includes pre-built tests for question answering, logic puzzles, code generation, and content compliance.
Types of Evals
Basic (ground-truth) Evals:
Compare model outputs to known correct answers using deterministic checks. Ideal for tasks with clear, verifiable answers (e.g., math, multiple-choice).
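At its core, such a check can be a normalized string comparison. The sketch below is a minimal illustration, not the registry's actual match template, and the sample data is invented:

```python
# Sketch of a deterministic ground-truth eval: exact match after normalization.
def exact_match(model_output: str, ideal: str) -> bool:
    """Case- and whitespace-insensitive comparison against the known answer."""
    return model_output.strip().lower() == ideal.strip().lower()


samples = [
    {"prompt": "What is 2 + 2?", "ideal": "4", "model_output": " 4 "},
    {"prompt": "Capital of France?", "ideal": "Paris", "model_output": "paris"},
]

accuracy = sum(exact_match(s["model_output"], s["ideal"]) for s in samples) / len(samples)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 1.00
```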
Model-graded Evals:
Another AI model judges whether the output meets the desired goal. Typically, a model stronger than the one under test grades subjective qualities such as humor or the quality of a summary. It is still recommended that humans review a sample of the grades to confirm the grader is accurate. Model-graded evals are especially useful for open-ended or qualitative tasks.
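As a rough sketch of the idea (not the framework's built-in grading template), the function below asks a stronger grader model a YES/NO question about a candidate summary. The grader model name and the rubric wording are assumptions; it requires the openai Python package and an API key in the environment:

```python
# Sketch of a model-graded eval: a grader model judges a summary's quality.
# The grader model name and rubric wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()


def grade_summary(source_text: str, summary: str) -> bool:
    """Ask a grader model whether the summary faithfully covers the source."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of grader model
        messages=[
            {
                "role": "user",
                "content": (
                    "You are grading a summary. Answer only YES or NO.\n\n"
                    f"Source:\n{source_text}\n\nSummary:\n{summary}\n\n"
                    "Is the summary accurate and complete enough to be useful?"
                ),
            }
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")
```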
OpenAI provides eval templates for both approaches, making it easy to get started without coding.
Using the Evals registry
The registry offers datasets and evaluation logic for tasks such as question answering, logic puzzles, code generation, and content compliance.
Each eval is defined by a YAML config and (optionally) reference data files. Running an eval is as simple as installing the OpenAI Evals package from the openai/evals repository and launching a command-line run or calling the API.
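For example, assuming you have installed the openai/evals repository, a run can be launched through its oaieval command-line entry point; the completion function and eval name below are placeholders for entries from your registry:

```python
# Sketch: launching a registry eval via the `oaieval` command that ships with
# the openai/evals repository. "gpt-4o-mini" and "test-match" are placeholders
# for a completion function and an eval name; substitute your own.
import subprocess

subprocess.run(["oaieval", "gpt-4o-mini", "test-match"], check=True)
```

Equivalently, the same command can be run directly in a terminal.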
Creating custom Evals
Custom evals let you test your own data and tasks:
- Prepare a dataset: collect sample prompts and expected answers from your application, formatted as JSONL (see the sketch below).
- Configure the Eval: write a YAML file specifying the eval template, dataset path, model(s), and parameters.
No coding is required for most cases, and OpenAI provides guides and examples.
Custom evals can remain private, allowing businesses to test sensitive or domain-specific data securely.
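To make the dataset step concrete, the sketch below writes a couple of samples in the chat-style JSONL layout commonly used by the Evals framework (an "input" message list plus an "ideal" answer). The prompts, answers, and file name are placeholders:

```python
# Sketch: writing a small custom eval dataset as JSONL.
# Each line holds a chat-style "input" and the expected "ideal" answer;
# the samples and file path are placeholders for your own data.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the city name only."},
            {"role": "user", "content": "Where is our headquarters located?"},
        ],
        "ideal": "Berlin",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with the city name only."},
            {"role": "user", "content": "Where is our largest warehouse?"},
        ],
        "ideal": "Rotterdam",
    },
]

with open("my_eval_samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```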
Evals in the LLM product lifecycle
Evals support the full LLM development cycle:
- Model Selection: Objectively compare models before deployment.
- Continuous Quality Assurance: Monitor performance with each update, catching regressions early.
- Model Upgrades: Quantify improvements or detect degradations when models change.
- Fine-tuning Validation: Ensure fine-tuned models outperform base models on relevant tasks.
- Stakeholder Assurance: Provide transparent metrics for compliance and reporting.
Introducing HealthBench: Evaluating AI Systems in Healthcare
On May 12, 2025, OpenAI introduced HealthBench, a new benchmark designed to evaluate AI systems in realistic healthcare scenarios. Developed in collaboration with 262 physicians from 60 countries, HealthBench aims to ensure that AI models are both useful and safe in health settings.
Key Features of HealthBench:
- Realistic Scenarios: HealthBench includes 5,000 multi-turn, multilingual conversations simulating interactions between AI models and users or clinicians. These conversations cover various medical specialties and contexts, reflecting real-world use cases.
- Physician-Created Rubrics: Each conversation is accompanied by a custom rubric created by physicians, containing specific criteria that a model’s response should meet. In total, HealthBench encompasses 48,562 unique rubric criteria.
- Model-Based Grading: Responses are evaluated using a model-based grader (GPT-4.1), which assesses whether each rubric criterion is met. This approach ensures consistent and scalable evaluation across the dataset (see the sketch after this list).
- Performance Benchmarks: OpenAI has shared performance metrics for several of its models on HealthBench, setting new baselines for future improvements. For instance, the o3 model achieved a 60% score, indicating significant progress compared to earlier models.
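To illustrate the grading mechanism only (a simplified sketch, not OpenAI's actual HealthBench grader or rubrics), each criterion can be posed to a grader model as a YES/NO judgment and the earned points summed. The criteria and point values below are invented:

```python
# Simplified sketch of rubric-based, model-graded scoring in the spirit of
# HealthBench. The rubric criteria and point values are illustrative
# placeholders, not the benchmark's real rubrics.
from openai import OpenAI

client = OpenAI()

rubric = [
    {"criterion": "Advises seeking emergency care if symptoms worsen", "points": 5},
    {"criterion": "Avoids a definitive diagnosis without examination", "points": 3},
]


def criterion_met(conversation: str, response: str, criterion: str) -> bool:
    """Ask the grader model whether the response satisfies one rubric criterion."""
    reply = client.chat.completions.create(
        model="gpt-4.1",  # HealthBench reportedly uses a GPT-4.1-based grader
        messages=[
            {
                "role": "user",
                "content": (
                    "Answer only YES or NO.\n\n"
                    f"Conversation:\n{conversation}\n\nModel response:\n{response}\n\n"
                    f"Does the response meet this criterion: {criterion}?"
                ),
            }
        ],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")


def score(conversation: str, response: str) -> float:
    """Fraction of available rubric points earned by the response."""
    earned = sum(r["points"] for r in rubric if criterion_met(conversation, response, r["criterion"]))
    return earned / sum(r["points"] for r in rubric)
```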
Importance of HealthBench:
HealthBench addresses critical gaps in existing healthcare AI evaluations by focusing on:
- Meaningfulness: Scores reflect real-world impact, going beyond exam questions to capture complex, real-life scenarios and workflows.
- Trustworthiness: Evaluations are grounded in physician judgment, providing a rigorous foundation for improving AI systems.
- Progressiveness: Benchmarks are designed to support ongoing progress, ensuring that current models have substantial room for improvement.
By providing a comprehensive and realistic evaluation framework, HealthBench serves as a valuable tool for developers and researchers aiming to enhance the safety and effectiveness of AI systems in healthcare.
Conclusion
Evals are the backbone of robust LLM application development, offering standardized, customizable, and transparent evaluation. By integrating Evals, teams can iterate faster, reduce risk, and deliver more reliable AI-powered products.