What are Evals?
OpenAI Evals is an open-source framework designed to systematically evaluate large language models (LLMs) and LLM-powered systems. An “eval” serves as a structured test or benchmark that measures a model’s output quality on specific tasks by comparing responses to expected answers or expert-defined criteria.
OpenAI maintains a registry of ready-made benchmarks across various domains, and developers can also create custom tests using proprietary data to match their specific application’s needs. As of January 2026, the framework has become a cornerstone for the AI community, boasting 17,600 stars and 2,900 forks on GitHub.
Why Evals matter
Building with LLMs requires ongoing experimentation rather than “vibe-based” testing. Evals provide the objective, reproducible metrics (such as factual accuracy and reasoning quality) needed to build reliable systems.
This is especially critical because even small modifications to a prompt or model version can introduce regressions, so the entire LLM application should be re-evaluated end-to-end before any change reaches production. Evals make this quality-assurance process automated and scalable. They are essential for:
- Ensuring application stability as underlying models evolve
- Catching regressions before deployment through CI/CD integration
- Reducing the risk of hallucinations in user-facing products
“Evals are surprisingly often all you need.”
Greg Brockman, OpenAI President
Understanding model behavior
Evals enable systematic comparison of different models by running standardized test cases and producing quantitative results. They can measure:
- Factual accuracy: Verifying if the AI provides correct information.
- Reasoning quality: Assessing the “chain-of-thought” logic.
- Instruction following: Ensuring the model adheres to specific formats like valid JSON output.
The OpenAI Evals registry includes pre-built tests for complex tasks including logic puzzles, code generation, and content safety compliance.
Types of Evals
- Basic (Ground-Truth) Evals: These compare model outputs to known correct answers using deterministic checks. They are ideal for tasks with clear, verifiable answers such as mathematical problems or multiple-choice exams.
- Model-graded Evals: These leverage another AI model to judge the quality of a response. Typically, a stronger model is used to evaluate subjective qualities like humor, tone, or the quality of a summary. While scalable, it is recommended that human experts periodically audit these graders to ensure accuracy.
The model-graded approach is especially vital for open-ended or qualitative tasks. OpenAI provides eval templates for both approaches, allowing developers to start measuring performance without writing extensive code.
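For illustration, the sketch below contrasts the two styles with a deterministic exact-match check and a simple model-graded judge built on the Chat Completions API. The function names, prompts, and grader model are illustrative choices, not part of the Evals framework itself; the framework's templates encapsulate this kind of logic so you rarely write it by hand.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ground_truth_eval(model_output: str, ideal_answer: str) -> bool:
    """Basic (ground-truth) check: a deterministic match after light normalization."""
    return model_output.strip().lower() == ideal_answer.strip().lower()


def model_graded_eval(question: str, answer: str, grader_model: str = "gpt-4o") -> bool:
    """Model-graded check: ask a (typically stronger) model to judge the answer."""
    judgment = client.chat.completions.create(
        model=grader_model,
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with exactly PASS or FAIL."},
            {"role": "user",
             "content": f"Question: {question}\nAnswer to grade: {answer}\n"
                        "Is the answer factually correct and clearly reasoned?"},
        ],
    )
    return judgment.choices[0].message.content.strip().upper().startswith("PASS")


if __name__ == "__main__":
    print(ground_truth_eval("Paris", " paris "))  # True
    print(model_graded_eval("What is the capital of France?",
                            "Paris is the capital of France."))
```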
Using the Evals registry
The registry offers standardized datasets and evaluation logic spanning a wide variety of domains and industries.
Each eval is defined by a YAML config and (optionally) reference data files. Running an eval is as simple as installing the evals package from the openai/evals repository and launching its command-line tool or using the API.
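In rough outline, a registry entry might look like the following. Treat this as an assumed sketch: the eval name and dataset path are invented, and the exact class paths and fields vary by eval type and framework version, so check the repository's current examples before relying on them.

```yaml
# Illustrative registry entry following the openai/evals conventions
arithmetic-demo:
  id: arithmetic-demo.dev.v0
  metrics: [accuracy]

arithmetic-demo.dev.v0:
  class: evals.elsuite.basic.match:Match          # built-in exact-match eval class
  args:
    samples_jsonl: arithmetic_demo/samples.jsonl  # relative path to the dataset file
```

With a config like this in place, the eval is typically launched from the repository's `oaieval` command-line tool, naming the model (or completion function) to test and the eval to run.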
Creating custom Evals
Custom evals allow organizations to test their own proprietary data:
- Prepare a dataset: Collect sample prompts and expected answers in a JSONL format.
- Configure the Eval: Define a YAML file specifying the dataset path and desired model parameters.
No coding is required for most standard use cases, and OpenAI provides detailed guides for implementation. These custom tests can remain private, allowing businesses to validate sensitive or domain-specific workflows securely.
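As a hedged example, a tiny dataset for a question-answering eval might contain lines like the ones below. The `input`/`ideal` field names follow the convention used by the framework's basic match evals; your own eval class may expect different keys, and the questions here are invented for illustration.

```jsonl
{"input": [{"role": "system", "content": "Answer concisely."}, {"role": "user", "content": "What is 2 + 2?"}], "ideal": "4"}
{"input": [{"role": "user", "content": "Which planet in the Solar System is the largest?"}], "ideal": "Jupiter"}
```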
Evals in the LLM product lifecycle
Evals support the full development cycle of modern AI products:
- Model Selection: Objectively compare performance across different model families.
- Continuous QA: Monitor performance with each prompt update.
- Model Upgrades: Detect any degradations when migrating to newer versions.
- Fine-tuning Validation: Ensure fine-tuned models outperform their base models.
- Stakeholder Assurance: Provide transparent metrics for compliance and reporting.
Introducing HealthBench: Evaluating AI Systems in Healthcare
On May 12, 2025, OpenAI introduced HealthBench, a specialized benchmark designed to evaluate AI performance in realistic healthcare scenarios. Developed in collaboration with 262 physicians from 60 countries, HealthBench ensures that models are useful and safe in clinical settings.
Key Features of HealthBench:
- Realistic Scenarios: It includes 5,000 multi-turn, multilingual conversations covering 26 medical specialties.
- Physician-Created Rubrics: Every interaction is graded against 48,562 unique rubric criteria to ensure clinical depth.
- Model-Based Grading: Responses are evaluated by a model-based grader (GPT-5.2) that assesses whether each rubric criterion is met, keeping evaluation consistent and scalable across the dataset (see the sketch after this list).
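The following Python sketch is a hypothetical illustration of rubric-based model grading in the spirit described above; the rubric items, point values, and grader prompt are invented for the example and do not reproduce HealthBench's actual grader or data.

```python
from openai import OpenAI

client = OpenAI()


def grade_against_rubric(conversation: str, rubric: list[dict],
                         grader_model: str = "gpt-4o") -> float:
    """Ask a grader model whether each rubric criterion is met; return the score fraction."""
    earned, possible = 0, 0
    for criterion in rubric:
        possible += criterion["points"]
        verdict = client.chat.completions.create(
            model=grader_model,
            messages=[
                {"role": "system",
                 "content": "Reply with exactly YES or NO: does the response meet the criterion?"},
                {"role": "user",
                 "content": f"Conversation and model response:\n{conversation}\n\n"
                            f"Criterion: {criterion['text']}"},
            ],
        )
        if verdict.choices[0].message.content.strip().upper().startswith("YES"):
            earned += criterion["points"]
    return earned / possible if possible else 0.0


# Invented rubric items for illustration only
example_rubric = [
    {"text": "Advises urgent evaluation for chest pain with shortness of breath", "points": 5},
    {"text": "Avoids stating a definitive diagnosis without examination", "points": 3},
]
```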
Performance Milestones: Recent results show rapid reasoning gains; the o3 model reached a 60% score, while the flagship GPT-5.2 has demonstrated near-perfect 98.7% accuracy in standardized clinical tests.
Importance of HealthBench:
HealthBench addresses critical gaps in existing healthcare AI evaluations by focusing on:
- Meaningfulness: Scores reflect real-world impact, going beyond exam questions to capture complex, real-life scenarios and workflows.
- Trustworthiness: Evaluations are grounded in physician judgment, providing a rigorous foundation for improving AI systems.
- Progressiveness: Benchmarks are designed to support ongoing progress, ensuring that current models have substantial room for improvement.
By providing a comprehensive and realistic evaluation framework, HealthBench serves as a valuable tool for developers and researchers aiming to enhance the safety and effectiveness of AI systems in healthcare.
Conclusion
Evals are the backbone of robust LLM application development, offering standardized, customizable, and transparent evaluation. By integrating Evals, teams can iterate faster, reduce risk, and deliver more reliable AI-powered products.
Frequently Asked Questions (FAQ)
Can I use OpenAI Evals to test non-OpenAI models like Claude or Gemini?
Yes. While the framework is built by OpenAI, it is designed for evaluating external models as well. You can configure custom model endpoints that are compatible with the Chat Completions API to run benchmarks against a wide variety of third-party or locally hosted models.
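As a rough sketch, a Chat Completions-compatible server is typically addressed with the openai Python client as shown below. The endpoint URL, API key, and model name are placeholders, and wiring such an endpoint into an actual eval run depends on the framework's completion-function configuration.

```python
from openai import OpenAI

# All values below are placeholders: point base_url at whatever third-party or
# locally hosted server exposes a Chat Completions-compatible API.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="placeholder-key",  # many local servers accept any non-empty key
)

response = client.chat.completions.create(
    model="my-local-model",  # placeholder model identifier on that server
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```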
Is my proprietary data safe when creating custom evaluations?
Security is a top priority for enterprise users. Custom evals can remain private, allowing businesses to test sensitive or domain-specific data within their own secure environment. However, if you choose to contribute to the open-source registry on GitHub, that data becomes subject to public licensing.
How much does it cost to run the OpenAI Evals framework?
The framework itself is open-source and free to install. However, running evaluations involves making calls to LLMs, which consumes standard API credits. Costs vary based on the model being used (e.g., GPT-4o vs. GPT-5.2) and the number of test cases in your dataset.
What is the difference between a benchmark and an eval?
In the context of the framework, a benchmark (like MMLU or HumanEval) is a standardized test used to compare models in isolation. An “eval” is a more specific term for the structured tests you implement to measure how well an LLM application performs a specific job for your unique use case.
Does HealthBench use real patient records for testing?
No. To maintain strict privacy, HealthBench uses 5,000 synthesized conversations designed by medical experts to mirror real-world interactions. This ensures the benchmark is clinically grounded without exposing personally identifiable information (PII).