What are Evals?
Evals is an open-source framework from OpenAI designed to systematically evaluate large language models (LLMs) and LLM-powered systems. An eval is a structured test or benchmark that measures a model’s output quality on specific tasks by comparing responses to expected answers or criteria.
OpenAI maintains a registry of ready-made evals across various domains, and developers can also create custom evals using proprietary data to match their application’s needs.
Why Evals matter
Building with LLMs requires ongoing experimentation. Evals provide objective, reproducible metrics, such as accuracy and consistency. This matters because even small modifications can destabilize a system or introduce regressions, so before any change goes into production the whole LLM application typically needs to be re-evaluated end-to-end. Evals make that process scalable and reliable. They are essential for:
- Ensuring application stability as models evolve
- Catching regressions before deployment, often via CI/CD integration (see the sketch at the end of this section)
- Reducing risk and increasing trust in LLM deployments
“Evals are surprisingly often all you need.”
Greg Brockman, OpenAI President
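As a concrete example of a regression gate, the sketch below (a hypothetical script, not part of the Evals framework itself) compares a freshly computed accuracy against a stored baseline and fails the CI run if quality drops. The file names, tolerance, and scoring rule are placeholders for whatever your project actually uses:

```python
# Hypothetical CI regression gate: fail the pipeline if eval accuracy drops
# below a stored baseline. File names, tolerance, and the scoring rule are
# placeholders, not part of any official tooling.
import json
import sys


def compute_accuracy(samples: list[dict]) -> float:
    """Score each sample by exact match against its expected answer."""
    correct = sum(1 for s in samples if s["model_output"].strip() == s["ideal"].strip())
    return correct / len(samples)


def main() -> None:
    with open("eval_results.jsonl", encoding="utf-8") as f:
        samples = [json.loads(line) for line in f]
    with open("baseline.json", encoding="utf-8") as f:
        baseline = json.load(f)["accuracy"]

    accuracy = compute_accuracy(samples)
    print(f"accuracy={accuracy:.3f} baseline={baseline:.3f}")
    if accuracy < baseline - 0.02:  # small tolerance to absorb sampling noise
        sys.exit("Regression detected: accuracy dropped below baseline.")


if __name__ == "__main__":
    main()
```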
Understanding model behavior
Evals enable systematic comparison of different models or versions by running standardized test cases and producing quantitative results. They can measure:
- Factual accuracy
- Reasoning and chain-of-thought quality
- Instruction following (e.g., valid JSON output)
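A check for the last point can be fully deterministic. The sketch below is illustrative and independent of any particular framework; the sample outputs are made up:

```python
# Minimal instruction-following check: is the model's reply valid JSON?
import json


def follows_json_instruction(model_output: str) -> bool:
    """Return True if the output parses as JSON, as the prompt instructed."""
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False


print(follows_json_instruction('{"answer": 42}'))   # True
print(follows_json_instruction("The answer is 42"))  # False
```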
The OpenAI Evals registry includes pre-built tests for question answering, logic puzzles, code generation, and content compliance.
Types of Evals
Basic (ground-truth) Evals:
Compare model outputs to known correct answers using deterministic checks. Ideal for tasks with clear, verifiable answers (e.g., math, multiple-choice).
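At its core, such a check can be a normalized string comparison. The sketch below is a minimal illustration, not the registry's actual match template, and the sample data is invented:

```python
# Sketch of a deterministic ground-truth eval: exact match after normalization.
def exact_match(model_output: str, ideal: str) -> bool:
    """Case- and whitespace-insensitive comparison against the known answer."""
    return model_output.strip().lower() == ideal.strip().lower()


samples = [
    {"prompt": "What is 2 + 2?", "ideal": "4", "model_output": " 4 "},
    {"prompt": "Capital of France?", "ideal": "Paris", "model_output": "paris"},
]

accuracy = sum(exact_match(s["model_output"], s["ideal"]) for s in samples) / len(samples)
print(f"accuracy = {accuracy:.2f}")  # accuracy = 1.00
```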
Model-graded Evals:
Another AI model judges whether the output meets the desired goal. Typically, a model stronger than the one under test grades subjective qualities such as humor or the quality of a summary. It is still recommended that humans review a sample of the grades to confirm the grader is accurate. Model-graded evals are especially useful for open-ended or qualitative tasks.
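As a rough sketch of the idea (not the framework's built-in grading template), the function below asks a stronger grader model a YES/NO question about a candidate summary. The grader model name and the rubric wording are assumptions; it requires the openai Python package and an API key in the environment:

```python
# Sketch of a model-graded eval: a grader model judges a summary's quality.
# The grader model name and rubric wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()


def grade_summary(source_text: str, summary: str) -> bool:
    """Ask a grader model whether the summary faithfully covers the source."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of grader model
        messages=[
            {
                "role": "user",
                "content": (
                    "You are grading a summary. Answer only YES or NO.\n\n"
                    f"Source:\n{source_text}\n\nSummary:\n{summary}\n\n"
                    "Is the summary accurate and complete enough to be useful?"
                ),
            }
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")
```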
OpenAI provides eval templates for both approaches, making it easy to get started without coding.
Using the Evals registry
The registry offers datasets and evaluation logic for tasks such as question answering, logic puzzles, code generation, and content compliance.
Each eval is defined by a YAML config and (optionally) reference data files. Running an eval is as simple as installing the OpenAI Evals package from the openai/evals repository and launching a command-line run or calling the API.
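For example, assuming you have installed the openai/evals repository, a run can be launched through its oaieval command-line entry point; the completion function and eval name below are placeholders for entries from your registry:

```python
# Sketch: launching a registry eval via the `oaieval` command that ships with
# the openai/evals repository. "gpt-4o-mini" and "test-match" are placeholders
# for a completion function and an eval name; substitute your own.
import subprocess

subprocess.run(["oaieval", "gpt-4o-mini", "test-match"], check=True)
```

Equivalently, the same command can be run directly in a terminal.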
Creating custom Evals
Custom evals let you test your own data and tasks:
- Prepare a dataset: collect sample prompts and expected answers from your application, formatted as JSONL (see the sketch below).
- Configure the Eval: write a YAML file specifying the eval template, dataset path, model(s), and parameters.
No coding is required for most cases, and OpenAI provides guides and examples.
Custom evals can remain private, allowing businesses to test sensitive or domain-specific data securely.
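To make the dataset step concrete, the sketch below writes a couple of samples in the chat-style JSONL layout commonly used by the Evals framework (an "input" message list plus an "ideal" answer). The prompts, answers, and file name are placeholders:

```python
# Sketch: writing a small custom eval dataset as JSONL.
# Each line holds a chat-style "input" and the expected "ideal" answer;
# the samples and file path are placeholders for your own data.
import json

samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with the city name only."},
            {"role": "user", "content": "Where is our headquarters located?"},
        ],
        "ideal": "Berlin",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with the city name only."},
            {"role": "user", "content": "Where is our largest warehouse?"},
        ],
        "ideal": "Rotterdam",
    },
]

with open("my_eval_samples.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```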
Evals in the LLM product lifecycle
Evals support the full LLM development cycle:
- Model Selection: Objectively compare models before deployment.
- Continuous Quality Assurance: Monitor performance with each update, catching regressions early.
- Model Upgrades: Quantify improvements or detect degradations when models change.
- Fine-tuning Validation: Ensure fine-tuned models outperform base models on relevant tasks.
- Stakeholder Assurance: Provide transparent metrics for compliance and reporting.
Introducing HealthBench: Evaluating AI Systems in Healthcare
On May 12, 2025, OpenAI introduced HealthBench, a new benchmark designed to evaluate AI systems in realistic healthcare scenarios. Developed in collaboration with 262 physicians from 60 countries, HealthBench aims to ensure that AI models are both useful and safe in health settings.
Key Features of HealthBench:
- Realistic Scenarios: HealthBench includes 5,000 multi-turn, multilingual conversations simulating interactions between AI models and users or clinicians. These conversations cover various medical specialties and contexts, reflecting real-world use cases.
- Physician-Created Rubrics: Each conversation is accompanied by a custom rubric created by physicians, containing specific criteria that a model’s response should meet. In total, HealthBench encompasses 48,562 unique rubric criteria.
- Model-Based Grading: Responses are evaluated using a model-based grader (GPT-4.1), which assesses whether each rubric criterion is met. This approach ensures consistent and scalable evaluation across the dataset (see the sketch after this list).
- Performance Benchmarks: OpenAI has shared performance metrics for several of its models on HealthBench, setting new baselines for future improvements. For instance, the o3 model achieved a 60% score, indicating significant progress compared to earlier models.
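To illustrate the grading mechanism only (a simplified sketch, not OpenAI's actual HealthBench grader or rubrics), each criterion can be posed to a grader model as a YES/NO judgment and the earned points summed. The criteria and point values below are invented:

```python
# Simplified sketch of rubric-based, model-graded scoring in the spirit of
# HealthBench. The rubric criteria and point values are illustrative
# placeholders, not the benchmark's real rubrics.
from openai import OpenAI

client = OpenAI()

rubric = [
    {"criterion": "Advises seeking emergency care if symptoms worsen", "points": 5},
    {"criterion": "Avoids a definitive diagnosis without examination", "points": 3},
]


def criterion_met(conversation: str, response: str, criterion: str) -> bool:
    """Ask the grader model whether the response satisfies one rubric criterion."""
    reply = client.chat.completions.create(
        model="gpt-4.1",  # HealthBench reportedly uses a GPT-4.1-based grader
        messages=[
            {
                "role": "user",
                "content": (
                    "Answer only YES or NO.\n\n"
                    f"Conversation:\n{conversation}\n\nModel response:\n{response}\n\n"
                    f"Does the response meet this criterion: {criterion}?"
                ),
            }
        ],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")


def score(conversation: str, response: str) -> float:
    """Fraction of available rubric points earned by the response."""
    earned = sum(r["points"] for r in rubric if criterion_met(conversation, response, r["criterion"]))
    return earned / sum(r["points"] for r in rubric)
```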
Importance of HealthBench:
HealthBench addresses critical gaps in existing healthcare AI evaluations by focusing on:
- Meaningfulness: Scores reflect real-world impact, going beyond exam questions to capture complex, real-life scenarios and workflows.
- Trustworthiness: Evaluations are grounded in physician judgment, providing a rigorous foundation for improving AI systems.
- Progressiveness: Benchmarks are designed to support ongoing progress, ensuring that current models have substantial room for improvement.
By providing a comprehensive and realistic evaluation framework, HealthBench serves as a valuable tool for developers and researchers aiming to enhance the safety and effectiveness of AI systems in healthcare.
Conclusion
Evals are the backbone of robust LLM application development, offering standardized, customizable, and transparent evaluation. By integrating Evals, teams can iterate faster, reduce risk, and deliver more reliable AI-powered products.