Introduction to the OpenAI Evals API Framework
The OpenAI Evals API, introduced on April 9, 2025, represents a significant advancement in the systematic evaluation of Large Language Models (LLMs). While evaluation capabilities have been available through the OpenAI dashboard for some time, the Evals API now enables developers to programmatically define tests, automate evaluation runs, and iterate rapidly on prompts and model implementations within their own workflows. This powerful interface supports methodical assessment of model outputs, facilitating evidence-based decision making when selecting models or refining prompt engineering strategies.
This tutorial provides a comprehensive technical guide to implementing and leveraging the OpenAI Evals API. We'll explore the underlying architecture, implementation patterns, and advanced techniques for creating robust evaluation pipelines that can objectively measure the performance of your LLM applications.
OpenAI Evals API: How Does It Work?

The OpenAI Evals API follows a hierarchical structure built around two primary abstractions:
1. Eval Configurations - The container for evaluation specifications, which includes:
   - Data source schema definition
   - Testing criteria configuration
   - Metadata for organization and retrieval
2. Eval Runs - Individual evaluation executions, which include:
   - A reference to a parent eval configuration
   - Specific data samples for evaluation
   - Model responses and evaluation results
This separation of concerns enables reusability across multiple testing scenarios while maintaining consistency in evaluation standards.
The Evals API Object Model
The core objects within the Evals API follow this relationship:
- Eval Configuration
  - data_source_config (schema definition)
  - testing_criteria (evaluation methods)
  - metadata (description, tags, etc.)
  - Runs
    - Run 1 (against specific data)
    - Run 2 (alternative implementation)
    - ...
    - Run N (version comparison)
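To make the hierarchy concrete, here is a minimal sketch of navigating it with the Python SDK. It assumes an existing evaluation ID and that your SDK version exposes evals.retrieve and evals.runs.list; exact method signatures may vary slightly between releases:
import openai

EVAL_ID = "eval_abc123"  # hypothetical ID returned when the eval configuration was created

# Fetch the eval configuration (the reusable container)...
evaluation = openai.evals.retrieve(EVAL_ID)
print(evaluation.name)

# ...then list every run executed against it (the individual executions)
for run in openai.evals.runs.list(eval_id=EVAL_ID):
    print(run.id, run.name, run.status)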
Setting Up Your Environment for OpenAI Evals API
When implementing the OpenAI Evals API, your choice of testing and development tools can significantly affect both your productivity and the quality of your results.

Apidog stands out as the premier API platform that outperforms traditional solutions like Postman in several key aspects, making it the ideal companion for working with the technically complex Evals API.
Before implementing evaluations, you'll need to properly configure your development environment:
import openai
import os
import time  # used by the regression-testing and scheduling examples later in this guide
import pydantic
import json
from datetime import datetime  # used by the scheduled monitoring example
from typing import Dict, List, Any, Optional

# Configure API access with appropriate permissions
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY", "your-api-key")

# For production environments, consider using a more secure method
# such as environment variables loaded from a .env file
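If you prefer the .env approach mentioned above, here is a minimal sketch using the third-party python-dotenv package (an assumption; any secrets manager works equally well):
# pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY=... from a local .env file into os.environ
# The openai library then picks up OPENAI_API_KEY from the environment automatically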
The OpenAI Python client library provides the interface for interacting with the Evals API. Ensure you're using the latest version that includes Evals API support:
pip install --upgrade openai  # make sure you're on a recent release that includes Evals API support
Creating Your First Evaluation with the OpenAI Evals API
Let's implement a complete evaluation workflow using the OpenAI Evals API. We'll create an evaluation system for a text summarization task, demonstrating the full process from evaluation design to results analysis.
Defining Data Models for OpenAI Evals API
First, we need to define the structure of our test data using Pydantic models:
class ArticleSummaryData(pydantic.BaseModel):
    """Data structure for article summarization evaluation."""
    article: str
    reference_summary: Optional[str] = None  # Optional reference for comparison

    class Config:
        frozen = True  # Ensures immutability for consistent evaluation
This model defines the schema for our evaluation data, which will be used by the Evals API to validate inputs and provide template variables for our testing criteria.
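To see exactly what schema the Evals API will receive, you can inspect the JSON Schema that Pydantic derives from this model:
# Inspect the derived JSON Schema; the "properties" and "required" entries
# correspond to the fields the Evals API will expect in each data item
schema = ArticleSummaryData.model_json_schema()
print(json.dumps(schema, indent=2))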
Implementing the Target Function for Evals API Testing
Next, we'll implement the function that generates the outputs we want to evaluate:
def generate_article_summary(article_text: str) -> Dict[str, Any]:
    """
    Generate a concise summary of an article using OpenAI's models.

    Args:
        article_text: The article content to summarize

    Returns:
        Completion response object with summary
    """
    summarization_prompt = """
    Summarize the following article in a concise, informative manner.
    Capture the key points while maintaining accuracy and context.
    Keep the summary to 1-2 paragraphs.

    Article:
    {{article}}
    """

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": summarization_prompt.replace("{{article}}", article_text)},
        ],
        temperature=0.3,  # Lower temperature for more consistent summaries
        max_tokens=300
    )

    return response.model_dump()  # Convert to serializable dictionary
Configuring Data Source for the OpenAI Evals API
The Evals API requires a defined data source configuration that specifies the schema of your evaluation data:
data_source_config = {
    "type": "custom",
    "item_schema": ArticleSummaryData.model_json_schema(),
    "include_sample_schema": True,  # Includes model output schema automatically
}

print("Data Source Schema:")
print(json.dumps(data_source_config, indent=2))
This configuration tells the Evals API what fields to expect in your evaluation data and how to process them.
Implementing Testing Criteria with the OpenAI Evals API
Now we'll define how the Evals API should evaluate the model outputs. We'll create a comprehensive evaluation with multiple criteria:
# 1. Accuracy evaluation using model-based judgment
accuracy_grader = {
    "name": "Summary Accuracy Evaluation",
    "type": "label_model",
    "model": "gpt-4o",
    "input": [
        {
            "role": "system",
            "content": """
            You are an expert evaluator assessing the accuracy of article summaries.
            Evaluate if the summary accurately represents the main points of the original article.

            Label the summary as one of:
            - "accurate": Contains all key information, no factual errors
            - "partially_accurate": Contains most key information, minor errors or omissions
            - "inaccurate": Significant errors, missing critical information, or misrepresentation

            Provide a detailed explanation for your assessment.
            """
        },
        {
            "role": "user",
            "content": """
            Original Article:
            {{item.article}}

            Summary to Evaluate:
            {{sample.choices[0].message.content}}

            Assessment:
            """
        }
    ],
    "passing_labels": ["accurate", "partially_accurate"],
    "labels": ["accurate", "partially_accurate", "inaccurate"],
}
# 2. Conciseness evaluation
conciseness_grader = {
    "name": "Summary Conciseness Evaluation",
    "type": "label_model",
    "model": "gpt-4o",
    "input": [
        {
            "role": "system",
            "content": """
            You are an expert evaluator assessing the conciseness of article summaries.
            Evaluate if the summary expresses information efficiently without unnecessary details.

            Label the summary as one of:
            - "concise": Perfect length, no unnecessary information
            - "acceptable": Slightly verbose but generally appropriate
            - "verbose": Excessively long or containing unnecessary details

            Provide a detailed explanation for your assessment.
            """
        },
        {
            "role": "user",
            "content": """
            Summary to Evaluate:
            {{sample.choices[0].message.content}}

            Assessment:
            """
        }
    ],
    "passing_labels": ["concise", "acceptable"],
    "labels": ["concise", "acceptable", "verbose"],
}
# 3. If reference summaries are available, add a reference comparison
reference_comparison_grader = {
    "name": "Reference Comparison Evaluation",
    "type": "label_model",
    "model": "gpt-4o",
    "input": [
        {
            "role": "system",
            "content": """
            Compare the generated summary with the reference summary.
            Evaluate how well the generated summary captures the same key information as the reference.

            Label the comparison as one of:
            - "excellent": Equivalent or better than reference
            - "good": Captures most important information from reference
            - "inadequate": Missing significant information present in reference

            Provide a detailed explanation for your assessment.
            """
        },
        {
            "role": "user",
            "content": """
            Reference Summary:
            {{item.reference_summary}}

            Generated Summary:
            {{sample.choices[0].message.content}}

            Assessment:
            """
        }
    ],
    "passing_labels": ["excellent", "good"],
    "labels": ["excellent", "good", "inadequate"],
    "condition": "item.reference_summary != null"  # Only apply when reference exists
}
Creating the Evaluation Configuration with OpenAI Evals API
With our data schema and testing criteria defined, we can now create the evaluation configuration:
eval_create_result = openai.evals.create(
    name="Article Summarization Quality Evaluation",
    metadata={
        "description": "Comprehensive evaluation of article summarization quality across multiple dimensions",
        "version": "1.0",
        "created_by": "Your Organization",
        "tags": ["summarization", "content-quality", "accuracy"]
    },
    data_source_config=data_source_config,
    testing_criteria=[
        accuracy_grader,
        conciseness_grader,
        reference_comparison_grader
    ],
)

eval_id = eval_create_result.id
print(f"Created evaluation with ID: {eval_id}")
print(f"View in dashboard: {eval_create_result.dashboard_url}")
Executing Evaluation Runs with the OpenAI Evals API
Preparing Evaluation Data
Now we'll prepare test data for our evaluation:
test_articles = [
    {
        "article": """
        The European Space Agency (ESA) announced today the successful deployment of its new Earth observation satellite, Sentinel-6.
        This satellite will monitor sea levels with unprecedented accuracy, providing crucial data on climate change impacts.
        The Sentinel-6 features advanced radar altimetry technology capable of measuring sea-level changes down to millimeter precision.
        Scientists expect this data to significantly improve climate models and coastal planning strategies.
        The satellite, launched from Vandenberg Air Force Base in California, is part of the Copernicus program, a collaboration
        between ESA, NASA, NOAA, and other international partners.
        """,
        "reference_summary": """
        The ESA has successfully deployed the Sentinel-6 Earth observation satellite, designed to monitor sea levels
        with millimeter precision using advanced radar altimetry. This mission, part of the international Copernicus program,
        will provide crucial data for climate change research and coastal planning.
        """
    },
    # Additional test articles would be added here
]

# Process our test data for evaluation
run_data = []
for item in test_articles:
    # Generate summary using our function
    article_data = ArticleSummaryData(**item)
    result = generate_article_summary(article_data.article)

    # Prepare the run data entry
    run_data.append({
        "item": article_data.model_dump(),
        "sample": result
    })
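Before creating a run, it can help to spot-check the prepared payload; each record pairs the input item with the raw chat completion produced by our target function:
# Quick sanity check on the prepared records
print(f"Prepared {len(run_data)} evaluation records")
print(run_data[0]["sample"]["choices"][0]["message"]["content"][:200])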
Creating and Executing an Evaluation Run
With our data prepared, we can create an evaluation run:
eval_run_result = openai.evals.runs.create(
    eval_id=eval_id,
    name="baseline-summarization-run",
    metadata={
        "model": "gpt-4o",
        "temperature": 0.3,
        "max_tokens": 300
    },
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": run_data,
        }
    },
)
print(f"Evaluation run created: {eval_run_result.id}")
print(f"View detailed results: {eval_run_result.report_url}")
Retrieving and Analyzing Evaluation Results from the Evals API
Once an evaluation run completes, you can retrieve detailed results:
def analyze_run_results(run_id: str) -> Dict[str, Any]:
    """
    Retrieve and analyze results from an evaluation run.

    Args:
        run_id: The ID of the evaluation run

    Returns:
        Dictionary containing analyzed results
    """
    # Retrieve the run details
    run_details = openai.evals.runs.retrieve(run_id)

    # Extract the results
    results = {}

    # Calculate overall pass rate
    if run_details.results and "pass_rate" in run_details.results:
        results["overall_pass_rate"] = run_details.results["pass_rate"]

    # Extract criterion-specific metrics
    if run_details.criteria_results:
        results["criteria_performance"] = {}
        for criterion, data in run_details.criteria_results.items():
            results["criteria_performance"][criterion] = {
                "pass_rate": data.get("pass_rate", 0),
                "sample_count": data.get("total_count", 0)
            }

    # Extract failures for further analysis
    if run_details.raw_results:
        results["failure_analysis"] = [
            {
                "item": item.get("item", {}),
                "result": item.get("result", {}),
                "criteria_results": item.get("criteria_results", {})
            }
            for item in run_details.raw_results
            if not item.get("passed", True)
        ]

    return results
# Analyze our run
results_analysis = analyze_run_results(eval_run_result.id)
print(json.dumps(results_analysis, indent=2))
Advanced OpenAI Evals API Techniques
Implementing A/B Testing with the Evals API
The Evals API excels at comparing different implementations. Here's how to set up an A/B test between two model configurations:
def generate_summary_alternative_model(article_text: str) -> Dict[str, Any]:
    """Alternative implementation using a different model configuration."""
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # Using a different model
        messages=[
            {"role": "system", "content": "Summarize this article concisely."},
            {"role": "user", "content": article_text},
        ],
        temperature=0.7,  # Higher temperature for comparison
        max_tokens=250
    )
    return response.model_dump()

# Process our test data with the alternative model
alternative_run_data = []
for item in test_articles:
    article_data = ArticleSummaryData(**item)
    result = generate_summary_alternative_model(article_data.article)
    alternative_run_data.append({
        "item": article_data.model_dump(),
        "sample": result
    })

# Create the alternative evaluation run
alternative_eval_run = openai.evals.runs.create(
    eval_id=eval_id,
    name="alternative-model-run",
    metadata={
        "model": "gpt-4o-mini",
        "temperature": 0.7,
        "max_tokens": 250
    },
    data_source={
        "type": "jsonl",
        "source": {
            "type": "file_content",
            "content": alternative_run_data,
        }
    },
)
# Compare the results programmatically
def compare_evaluation_runs(run_id_1: str, run_id_2: str) -> Dict[str, Any]:
    """
    Compare results from two evaluation runs.

    Args:
        run_id_1: ID of first evaluation run
        run_id_2: ID of second evaluation run

    Returns:
        Dictionary containing comparative analysis
    """
    run_1_results = analyze_run_results(run_id_1)
    run_2_results = analyze_run_results(run_id_2)

    comparison = {
        "overall_comparison": {
            "run_1_pass_rate": run_1_results.get("overall_pass_rate", 0),
            "run_2_pass_rate": run_2_results.get("overall_pass_rate", 0),
            "difference": run_1_results.get("overall_pass_rate", 0) - run_2_results.get("overall_pass_rate", 0)
        },
        "criteria_comparison": {}
    }

    # Compare each criterion
    all_criteria = set(run_1_results.get("criteria_performance", {}).keys()) | set(run_2_results.get("criteria_performance", {}).keys())
    for criterion in all_criteria:
        run_1_criterion = run_1_results.get("criteria_performance", {}).get(criterion, {})
        run_2_criterion = run_2_results.get("criteria_performance", {}).get(criterion, {})
        comparison["criteria_comparison"][criterion] = {
            "run_1_pass_rate": run_1_criterion.get("pass_rate", 0),
            "run_2_pass_rate": run_2_criterion.get("pass_rate", 0),
            "difference": run_1_criterion.get("pass_rate", 0) - run_2_criterion.get("pass_rate", 0)
        }

    return comparison
# Compare our two runs
comparison_results = compare_evaluation_runs(eval_run_result.id, alternative_eval_run.id)
print(json.dumps(comparison_results, indent=2))
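For quick human review, you can flatten the comparison structure returned above into a compact report (formatting follows the .2f convention used elsewhere in this guide):
def print_comparison_summary(comparison: Dict[str, Any]) -> None:
    """Print a compact, human-readable view of compare_evaluation_runs output."""
    overall = comparison["overall_comparison"]
    print(f"Baseline pass rate:    {overall['run_1_pass_rate']:.2f}")
    print(f"Alternative pass rate: {overall['run_2_pass_rate']:.2f}")
    print(f"Difference:            {overall['difference']:+.2f}")
    for criterion, data in comparison["criteria_comparison"].items():
        print(f"  {criterion}: {data['run_1_pass_rate']:.2f} vs {data['run_2_pass_rate']:.2f} ({data['difference']:+.2f})")

print_comparison_summary(comparison_results)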
Detecting Regressions with the OpenAI Evals API
One of the most valuable applications of the Evals API is regression detection when updating prompts:
def create_regression_detection_pipeline(eval_id: str, baseline_run_id: str):
    """
    Create a regression detection pipeline that compares a new prompt
    against a baseline run.

    Args:
        eval_id: The ID of the evaluation configuration
        baseline_run_id: The ID of the baseline run to compare against

    Returns:
        A function that tests a new prompt against the baseline.
    """
    def test_prompt_for_regression(new_prompt: str, threshold: float = 0.95) -> Dict[str, Any]:
        """
        Test if a new prompt causes regression compared to baseline.

        Args:
            new_prompt: The new prompt to test
            threshold: Minimum acceptable performance ratio (new/baseline)

        Returns:
            Dictionary containing regression analysis
        """
        # Define function using new prompt
        def generate_summary_new_prompt(article_text: str) -> Dict[str, Any]:
            response = openai.chat.completions.create(
                model="gpt-4o",  # Same model as baseline
                messages=[
                    {"role": "system", "content": new_prompt},
                    {"role": "user", "content": article_text},
                ],
                temperature=0.3,
                max_tokens=300
            )
            return response.model_dump()

        # Process test data with new prompt
        new_prompt_run_data = []
        for item in test_articles:
            article_data = ArticleSummaryData(**item)
            result = generate_summary_new_prompt(article_data.article)
            new_prompt_run_data.append({
                "item": article_data.model_dump(),
                "sample": result
            })

        # Create evaluation run for new prompt
        new_prompt_run = openai.evals.runs.create(
            eval_id=eval_id,
            name=f"regression-test-{int(time.time())}",
            metadata={
                "prompt": new_prompt,
                "test_type": "regression"
            },
            data_source={
                "type": "jsonl",
                "source": {
                    "type": "file_content",
                    "content": new_prompt_run_data,
                }
            },
        )

        # Wait for completion (in production, you might want to implement async handling)
        # This is a simplified implementation
        time.sleep(10)  # Wait for evaluation to complete

        # Compare against baseline
        comparison = compare_evaluation_runs(baseline_run_id, new_prompt_run.id)

        # Determine if there's a regression
        baseline_pass_rate = comparison["overall_comparison"]["run_1_pass_rate"]
        new_pass_rate = comparison["overall_comparison"]["run_2_pass_rate"]
        regression_detected = (new_pass_rate / baseline_pass_rate if baseline_pass_rate > 0 else 0) < threshold

        return {
            "regression_detected": regression_detected,
            "baseline_pass_rate": baseline_pass_rate,
            "new_pass_rate": new_pass_rate,
            "performance_ratio": new_pass_rate / baseline_pass_rate if baseline_pass_rate > 0 else 0,
            "threshold": threshold,
            "detailed_comparison": comparison,
            "report_url": new_prompt_run.report_url
        }

    return test_prompt_for_regression
# Create a regression detection pipeline
regression_detector = create_regression_detection_pipeline(eval_id, eval_run_result.id)
# Test a potentially problematic prompt
problematic_prompt = """
Summarize this article in excessive detail, making sure to include every minor point.
The summary should be comprehensive and leave nothing out.
"""
regression_analysis = regression_detector(problematic_prompt)
print(json.dumps(regression_analysis, indent=2))
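The fixed time.sleep(10) above is only a placeholder. A more robust approach is to poll the run until it reaches a terminal state; this sketch assumes the run object exposes a status field with values such as "queued", "in_progress", "completed", or "failed", so adjust to the actual API response:
def wait_for_run_completion(run_id: str, poll_interval: float = 5.0, timeout: float = 600.0):
    """Poll an evaluation run until it finishes (a sketch; status values are an assumption)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = openai.evals.runs.retrieve(run_id)
        if run.status not in ("queued", "in_progress"):
            return run
        time.sleep(poll_interval)
    raise TimeoutError(f"Evaluation run {run_id} did not finish within {timeout} seconds")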
Working with Custom Metrics in the OpenAI Evals API
For specialized evaluation needs, you can implement custom metrics:
# Example of a custom numeric score evaluation
numeric_score_grader = {
    "name": "Summary Quality Score",
    "type": "score_model",
    "model": "gpt-4o",
    "input": [
        {
            "role": "system",
            "content": """
            You are an expert evaluator assessing the quality of article summaries.
            Rate the overall quality of the summary on a scale from 1.0 to 10.0, where:
            - 1.0-3.9: Poor quality, significant issues
            - 4.0-6.9: Acceptable quality with room for improvement
            - 7.0-8.9: Good quality, meets expectations
            - 9.0-10.0: Excellent quality, exceeds expectations

            Provide a specific numeric score and detailed justification.
            """
        },
        {
            "role": "user",
            "content": """
            Original Article:
            {{item.article}}

            Summary to Evaluate:
            {{sample.choices[0].message.content}}

            Score (1.0-10.0):
            """
        }
    ],
    "passing_threshold": 7.0,  # Minimum score to pass
    "min_score": 1.0,
    "max_score": 10.0
}
# Add this to your testing criteria when creating an eval
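For example, a new eval configuration could combine the numeric grader with the label graders defined earlier (a sketch reusing objects from previous sections):
scored_eval = openai.evals.create(
    name="Article Summarization Quality Evaluation (scored)",
    data_source_config=data_source_config,
    testing_criteria=[accuracy_grader, conciseness_grader, numeric_score_grader],
)
print(f"Created scored evaluation: {scored_eval.id}")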
Integrating the OpenAI Evals API into Development Workflows
CI/CD Integration with the Evals API
Integrating the Evals API into your CI/CD pipeline ensures consistent quality:
def ci_cd_evaluation_workflow(
    prompt_file_path: str,
    baseline_eval_id: str,
    baseline_run_id: str,
    threshold: float = 0.95
) -> bool:
    """
    CI/CD integration for evaluating model prompts before deployment.

    Args:
        prompt_file_path: Path to the prompt file being updated
        baseline_eval_id: ID of the baseline evaluation configuration
        baseline_run_id: ID of the baseline run to compare against
        threshold: Minimum acceptable performance ratio

    Returns:
        Boolean indicating whether the new prompt passed evaluation
    """
    # Load the new prompt from version control
    with open(prompt_file_path, 'r') as f:
        new_prompt = f.read()

    # Create regression detector using the baseline
    regression_detector = create_regression_detection_pipeline(baseline_eval_id, baseline_run_id)

    # Test the new prompt against the configured threshold
    regression_analysis = regression_detector(new_prompt, threshold=threshold)

    # Determine if the prompt is safe to deploy
    is_approved = not regression_analysis["regression_detected"]

    # Log the evaluation results
    print(f"Evaluation Results for {prompt_file_path}")
    print(f"Baseline Pass Rate: {regression_analysis['baseline_pass_rate']:.2f}")
    print(f"New Prompt Pass Rate: {regression_analysis['new_pass_rate']:.2f}")
    print(f"Performance Ratio: {regression_analysis['performance_ratio']:.2f}")
    print(f"Deployment Decision: {'APPROVED' if is_approved else 'REJECTED'}")
    print(f"Detailed Report: {regression_analysis['report_url']}")

    return is_approved
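A hypothetical CI entry point can then translate the decision into an exit code so the pipeline fails on regressions (the file path and IDs below are placeholders for values supplied by your pipeline):
if __name__ == "__main__":
    import sys
    approved = ci_cd_evaluation_workflow(
        prompt_file_path="prompts/summarization_prompt.txt",  # hypothetical path
        baseline_eval_id=eval_id,
        baseline_run_id=eval_run_result.id,
        threshold=0.95,
    )
    sys.exit(0 if approved else 1)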
Scheduled Monitoring with the OpenAI Evals API
Regular evaluation helps detect model drift or degradation:
def schedule_periodic_evaluation(
    eval_id: str,
    baseline_run_id: str,
    interval_hours: int = 24
) -> None:
    """
    Schedule periodic evaluations to monitor for performance changes.

    Args:
        eval_id: ID of the evaluation configuration
        baseline_run_id: ID of the baseline run to compare against
        interval_hours: Frequency of evaluations in hours
    """
    # In a production system, you would use a task scheduler like Airflow,
    # Celery, or cloud-native solutions. This is a simplified example.
    def perform_periodic_evaluation():
        while True:
            try:
                # Run the current production configuration against the eval
                print(f"Running scheduled evaluation at {datetime.now()}")
                # Implement your evaluation logic here, similar to regression testing

                # Sleep until next scheduled run
                time.sleep(interval_hours * 60 * 60)
            except Exception as e:
                print(f"Error in scheduled evaluation: {e}")
                # Implement error handling and alerting

    # In a real implementation, you would manage this thread properly
    # or use a dedicated scheduling system
    import threading
    evaluation_thread = threading.Thread(target=perform_periodic_evaluation)
    evaluation_thread.daemon = True
    evaluation_thread.start()
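If you prefer not to manage threads yourself, a lighter-weight sketch using the third-party schedule package (an assumption: pip install schedule) achieves the same periodic trigger:
import schedule

def run_scheduled_check() -> None:
    print(f"Running scheduled evaluation at {datetime.now()}")
    # Re-run the production configuration against the eval and compare to the baseline here

schedule.every(24).hours.do(run_scheduled_check)

while True:
    schedule.run_pending()
    time.sleep(60)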
Advanced OpenAI Evals API Usage Patterns
Multi-stage Evaluation Pipelines
For complex applications, implement multi-stage evaluation pipelines:
def create_multi_stage_evaluation_pipeline(
    article_data: List[Dict[str, str]]
) -> Dict[str, Any]:
    """
    Create a multi-stage evaluation pipeline for content generation.

    Args:
        article_data: List of articles for evaluation

    Returns:
        Dictionary containing evaluation results from each stage
    """
    # The create_*/run_* helpers below are stage-specific factories that you would
    # implement following the same create/run patterns shown in earlier sections.

    # Stage 1: Content generation evaluation
    generation_eval_id = create_content_generation_eval()
    generation_run_id = run_content_generation_eval(generation_eval_id, article_data)

    # Stage 2: Factual accuracy evaluation
    accuracy_eval_id = create_factual_accuracy_eval()
    accuracy_run_id = run_factual_accuracy_eval(accuracy_eval_id, article_data)

    # Stage 3: Tone and style evaluation
    tone_eval_id = create_tone_style_eval()
    tone_run_id = run_tone_style_eval(tone_eval_id, article_data)

    # Aggregate results from all stages
    results = {
        "generation": analyze_run_results(generation_run_id),
        "accuracy": analyze_run_results(accuracy_run_id),
        "tone": analyze_run_results(tone_run_id)
    }

    # Calculate composite score (weights: 40% generation, 40% accuracy, 20% tone)
    composite_score = (
        results["generation"].get("overall_pass_rate", 0) * 0.4 +
        results["accuracy"].get("overall_pass_rate", 0) * 0.4 +
        results["tone"].get("overall_pass_rate", 0) * 0.2
    )
    results["composite_score"] = composite_score

    return results
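The stage-specific helpers referenced above are placeholders. As one hedged example of their shape, a content-generation stage factory could simply reuse the patterns from earlier sections (the names and criteria here are hypothetical):
def create_content_generation_eval() -> str:
    """Create the eval configuration for the content-generation stage and return its ID (a sketch)."""
    result = openai.evals.create(
        name="Content Generation Evaluation",
        data_source_config=data_source_config,   # reuse the schema defined earlier
        testing_criteria=[accuracy_grader],      # stage-specific criteria would go here
    )
    return result.id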
Conclusion: Mastering the OpenAI Evals API
The OpenAI Evals API represents a significant advancement in systematic LLM evaluation, providing developers with powerful tools to objectively assess model performance and make data-driven decisions.
As LLMs become increasingly integrated into critical applications, the importance of systematic evaluation grows correspondingly. The OpenAI Evals API provides the infrastructure needed to implement these evaluation practices at scale, ensuring that your AI systems remain robust, reliable, and aligned with your expectations over time.
But why stop here? Integrating Apidog into your OpenAI Evals API workflow provides additional advantages:
- Streamlined Testing: Apidog's request templates and automated testing capabilities reduce the development time for implementing evaluation pipelines
- Enhanced Documentation: Automatic API documentation generation ensures your evaluation criteria and implementations are well-documented
- Team Collaboration: Shared workspaces facilitate consistent evaluation standards across development teams
- CI/CD Integration: Command-line capabilities enable integration with existing CI/CD pipelines for automated testing
- Visual Analysis: Built-in visualization tools help interpret complex evaluation results quickly
