How to Test LLM Applications: The Complete Guide to Promptfoo (2026)

Learn how to test LLM applications with Promptfoo. Complete guide covering automated evals, red team security scanning, and CI/CD integration.

Ashley Innocent

19 March 2026

TL;DR

Promptfoo is an open-source LLM evaluation and red-teaming framework that helps developers test AI applications systematically. It supports 90+ model providers, offers 67+ security attack plugins, and runs 100% locally for privacy. With 1.6 million npm downloads and production use at companies serving 10M+ users, it has become the standard for LLM testing. Get started with npm install -g promptfoo and promptfoo init --example getting-started.

Introduction

You spent weeks building your AI-powered customer support chatbot. It answered questions perfectly during development. Then users started finding ways to make it leak sensitive data, bypass safety guardrails, and give inconsistent responses.

This scenario plays out every day. Teams ship LLM applications based on gut feel and manual testing, only to discover vulnerabilities and quality issues in production, where fixes cost far more than they would have during development.

Promptfoo solves this by bringing systematic, automated testing to LLM applications. It lets you evaluate prompts across multiple models, run security red-team assessments, and catch regressions before they reach users.

I have analyzed the promptfoo codebase (version 0.121.2) and tested its core features to bring you this comprehensive guide. You will learn how to set up evaluations, run security scans, integrate with CI/CD, and avoid common pitfalls.

By the end, you will have a working test suite for your LLM application and know how to ship with confidence.

If you work with API testing or need to validate API behavior alongside your LLM tests, Apidog provides a unified platform for API design, testing, and documentation. You can use both tools together: promptfoo for LLM evaluation and Apidog for API layer validation.

What Is Promptfoo and Why You Need It

Promptfoo is a command-line tool and Node.js library for evaluating and red-teaming LLM applications. Think of it as a testing framework built specifically for the quirks of AI development.

Traditional testing tools fail with LLMs because outputs are non-deterministic. You cannot assert exact string matches when the same prompt produces different responses each time. Promptfoo solves this with assertions built for fuzzy outputs: substring and regex checks, semantic similarity scoring, LLM-graded rubrics, and custom JavaScript or Python functions.

The tool runs locally on your machine. Your prompts and test data never leave your environment unless you opt into cloud features. This privacy-first design makes it suitable for testing with sensitive data.

The Problem Promptfoo Solves

Most teams test LLM applications manually. They send a few prompts, read the outputs, and decide if things look good. This approach has three fatal flaws:

  1. No regression detection - You cannot tell if a model update broke existing functionality
  2. Coverage gaps - Manual testing misses edge cases and adversarial inputs
  3. No metrics - You cannot track improvement or compare models objectively

Promptfoo replaces this with automated evals that run on every change. You define test cases once and execute them against any model. Results include pass/fail rates, cost comparisons, and latency metrics.

Who Uses Promptfoo

The project has 1.6 million npm downloads and powers LLM applications serving over 10 million end users. Companies use it for prompt evaluation, model comparison, security red teaming, and regression testing in CI/CD pipelines.

In March 2026, Promptfoo joined OpenAI. The project remains open source and MIT licensed, with continued development under the new ownership.

Getting Started: Install and Run Your First Eval

You can install promptfoo globally or run it without installation using npx.

Installation

# Global install (recommended)
npm install -g promptfoo

# Or run without installing
npx promptfoo@latest

# macOS users can also use Homebrew
brew install promptfoo

# Python users can also use pip
pip install promptfoo

Set your API keys as environment variables:

export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xxx

Create Your First Eval

Initialize an example project:

promptfoo init --example getting-started
cd getting-started

This creates a promptfooconfig.yaml file with sample prompts, providers, and test cases.

Run the evaluation:

promptfoo eval

View results in the web UI:

promptfoo view

The UI opens at localhost:3000 and shows a side-by-side comparison of outputs from each model, with pass/fail status for each assertion.

Understanding the Config File

The promptfooconfig.yaml file defines your eval suite:

description: "My First Eval Suite"

prompts:
  - prompts/greeting.txt
  - prompts/farewell.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-5

tests:
  - vars:
      input: "Hello"
    assert:
      - type: contains
        value: "Hi"
      - type: latency
        threshold: 3000

You can scale this to hundreds of test cases. Many teams keep eval configs in version control and run them in CI on every pull request.
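Inline test lists get unwieldy at that scale, so promptfoo can also load test cases from an external file. A sketch, assuming a CSV where each column becomes a test variable and the special __expected column carries an assertion per row (check the docs for your promptfoo version):

```yaml
# promptfooconfig.yaml (fragment)
# tests.csv columns map to vars; __expected holds an assertion, e.g.:
#   input,__expected
#   Hello,contains:Hi
#   Goodbye,contains:Bye
tests: file://tests.csv
```

This keeps the dataset diffable in version control alongside the config.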

Core Features: What Promptfoo Can Do

1. Automated Evaluations

Automated evals are the foundation of promptfoo. You define test cases with expected outcomes, and the tool runs them against your chosen models.

Assertion Types

Promptfoo includes 30+ built-in assertion types:

- contains: Output includes a substring
- equals: Exact string match
- regex: Match against a regex pattern
- json-schema: Validate JSON structure
- javascript: Custom JS function returns pass/fail
- python: Custom Python function
- llm-rubric: Use an LLM to grade output
- similar: Semantic similarity score
- latency: Response time under threshold
- cost: Cost per request under threshold

Example with multiple assertions:

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: javascript
        value: output.length < 100
      - type: latency
        threshold: 2000
      - type: cost
        threshold: 0.001

This test checks that the answer mentions Paris, stays under 100 characters, responds in under 2 seconds, and costs less than $0.001.

LLM-Graded Evals

The llm-rubric assertion uses one LLM to grade another’s output. This is powerful for subjective criteria like tone or helpfulness:

assert:
  - type: llm-rubric
    value: "Response should be helpful, harmless, and honest"

The grader LLM reads the output and scores it against your rubric. You can use a cheaper model for grading to reduce costs.
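The grader override can live in defaultTest so every rubric in the suite uses it. A sketch, assuming defaultTest.options.provider is honored as the grading provider in your promptfoo version:

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini  # grade with a cheaper model

tests:
  - vars:
      question: "Summarize our refund policy"
    assert:
      - type: llm-rubric
        value: "Response should be helpful, harmless, and honest"
```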

2. Red Teaming and Security Testing

Promptfoo includes comprehensive security testing through its red team module. It automatically generates adversarial inputs to probe for vulnerabilities.

Supported Attack Vectors

The red team system includes 67+ plugins organized by category:

- Prompt Injection: Direct, indirect, and context injection attacks
- Jailbreaks: DAN, persona switching, role-play bypasses
- Data Exfiltration: SSRF, system prompt extraction, prompt leakage
- Harmful Content: Hate speech, dangerous activities, self-harm requests
- Compliance: PII leakage, HIPAA violations, financial data exposure
- Audio/Visual: Audio injection and image-based attacks

Running a Red Team Scan

Initialize a red team config:

promptfoo redteam init

Run the security scan:

promptfoo redteam run

View the report:

promptfoo redteam report [directory]

The redteam run command performs two steps:

  1. Generates dynamic attack probes tailored to your application
  2. Evaluates probes against your target and scores vulnerabilities

Results include severity ratings (Critical, High, Medium, Low), exploitable test cases, and remediation recommendations.

Example Red Team Output

Vulnerability Summary:
- Critical: 2 (PII leakage, prompt extraction)
- High: 5 (jailbreaks, injection attacks)
- Medium: 12 (bias, inconsistent responses)
- Low: 23 (minor policy violations)

Fix critical issues before deployment. Re-run scans after changes to verify fixes.
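The redteam init step writes a redteam section into your config that scopes what the scan probes. A minimal sketch, where the plugin and strategy names are illustrative and should be checked against the plugin list shipped with your promptfoo version:

```yaml
# promptfooconfig.yaml (redteam section)
redteam:
  purpose: "Customer support chatbot for a SaaS billing product"
  numTests: 5            # probes generated per plugin
  plugins:
    - pii                # personally identifiable information leakage
    - prompt-extraction  # attempts to reveal the system prompt
  strategies:
    - jailbreak
    - prompt-injection
```

Narrowing the plugin list keeps scan time and token spend proportional to the risks that matter for your application.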

3. Code Scanning for Pull Requests

Promptfoo integrates with GitHub Actions to scan pull requests for LLM-related security issues.

# .github/workflows/promptfoo-scan.yml
name: Promptfoo Code Scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptfoo/promptfoo/code-scan-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

The scan surfaces LLM-related security issues in changed code before the pull request merges.

4. Model Comparison

Compare outputs from multiple models side by side to choose the best one for your use case.

# Run eval with multiple providers
promptfoo eval

# View comparison in web UI
promptfoo view

The web UI displays side-by-side outputs from each provider, pass/fail status for every assertion, and per-model cost and latency metrics.

This data-driven approach prevents bias toward familiar models. You might find that a cheaper model outperforms GPT-4 on your specific evals.

Supported Providers: 90+ LLM Integrations

Promptfoo supports over 90 LLM providers out of the box. You can test the same prompt across OpenAI, Anthropic, Google, Amazon, and local models without changing your code.

Major Providers

- OpenAI: GPT-4, GPT-4o, GPT-4o-mini, o1, o3
- Anthropic: Claude 3.5/3.7/4.5/4.6, Thinking models
- Google: Gemini 1.5/2.0, Vertex AI
- Microsoft: Azure OpenAI, Phi
- Amazon: Bedrock (Claude, Llama, Titan)
- Meta: Llama 3, 3.1, 3.2 (via multiple providers)
- Ollama: Local models (Llama, Mistral, Phi)

Custom Providers

You can write custom providers in Python or JavaScript if your model is not supported.

Python example:

# custom_provider.py
# my_async_api is a placeholder for your own model client
class CustomProvider:
    async def call_api(self, prompt: str, options: dict, context: dict) -> dict:
        response = await my_async_api.generate(prompt)
        return {
            "output": response.text,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens
            }
        }

JavaScript example:

// customProvider.js
// myApi is a placeholder for your own model client
export default class CustomProvider {
  async callApi(prompt) {
    return {
      output: await myApi.generate(prompt),
      tokenUsage: { total: 50, prompt: 20, completion: 30 }
    };
  }
}

Register custom providers in your config:

providers:
  - id: file://custom_provider.py
    config:
      api_key: ${MY_API_KEY}

Command-Line Interface: Essential Commands

Promptfoo’s CLI provides all the functionality you need for daily workflows.

Core Commands

# Run evaluations
promptfoo eval -c promptfooconfig.yaml

# Open web UI
promptfoo view

# Share results online
promptfoo share

# Red team testing
promptfoo redteam init
promptfoo redteam run

# Configuration
promptfoo init
promptfoo validate [config]

# Results management
promptfoo list
promptfoo show <id>
promptfoo delete <id>
promptfoo export <id>

# Utilities
promptfoo cache clear
promptfoo retry <id>

Useful Flags

--no-cache              # Disable caching for fresh results
--max-concurrency <n>   # Limit parallel API calls
--output <file>         # Write results to JSON file
--verbose               # Enable debug logging
--env-file <path>       # Load environment variables from file
--filter <pattern>      # Run specific test cases

Example: Run Eval with Custom Settings

promptfoo eval \
  -c promptfooconfig.yaml \
  --no-cache \
  --max-concurrency 3 \
  --output results.json \
  --env-file .env

This runs evals fresh (no cache), limits concurrency to 3 parallel calls, saves results to JSON, and loads API keys from .env.
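The --output file lets downstream scripts consume the run. A sketch in Python that computes a pass rate, assuming a layout with results.results[].success; the schema varies across promptfoo versions, so verify the keys against your own results.json:

```python
def pass_rate(eval_json: dict) -> float:
    """Fraction of test cases whose assertions all passed."""
    # Assumed layout: {"results": {"results": [{"success": bool, ...}, ...]}}
    results = eval_json.get("results", {}).get("results", [])
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("success")) / len(results)

# Demo with synthetic data shaped like the assumed layout
sample = {"results": {"results": [{"success": True}, {"success": True}, {"success": False}]}}
print(f"pass rate: {pass_rate(sample):.1%}")  # pass rate: 66.7%
```

In CI you would load results.json with json.load and fail the job when the rate drops below your threshold.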

CI/CD Integration: Automate LLM Testing

Integrate promptfoo into your CI/CD pipeline to catch regressions before deployment.

GitHub Actions Example

name: LLM Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm install -g promptfoo
      - run: promptfoo eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Quality Gates

Set pass/fail thresholds in your config:

commandLineOptions:
  threshold: 0.8  # Require 80% pass rate

This fails CI if evals do not meet the threshold, preventing regressions from merging.

Caching in CI

Enable caching to speed up CI runs:

- uses: actions/cache@v4
  with:
    path: ~/.cache/promptfoo
    key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

Cached results skip API calls for unchanged tests, reducing CI time and costs.

Web UI: Visualize and Share Results

The built-in web UI (promptfoo view) provides an interactive interface for reviewing evals.

Features

Access and Security

The UI runs on localhost:3000 by default. It includes CSRF protection using Sec-Fetch-Site and Origin headers to block cross-site requests from untrusted origins.

Do not expose the local web server to untrusted networks. For team access, use the promptfoo share command to upload results to the cloud, or self-host with authentication.

Database and Caching

Cache Location

The cache, stored under ~/.cache/promptfoo by default, holds model API responses so repeated runs skip identical calls. Use --no-cache during development when you need fresh results.

Database Location

Promptfoo keeps a local SQLite database (under ~/.promptfoo by default) that stores historical eval runs for comparison and trend analysis. Do not delete this file unless you want to lose historical data.

Security Model: What You Can Trust

Promptfoo operates on a trust-by-configuration model. Understanding this prevents security surprises.

Trusted Inputs (Treated as Code)

These inputs execute as code and should only come from trusted sources: the promptfooconfig.yaml file itself, custom JavaScript and Python providers, file:// assertions, and any hooks or extensions you register.

Untrusted Inputs (Data-Only)

These inputs are treated as data and should never trigger code execution: model outputs, test case variables, and dataset content. Maintain that boundary in your own custom code as well.

Hardening Recommendations

For high-security environments:

  1. Run inside a container or VM with minimal privileges
  2. Use dedicated, least-privileged API keys
  3. Avoid placing secrets in prompts or config files
  4. Restrict network egress for third-party code
  5. Do not expose the local web server to untrusted networks

Performance: Optimize Your Evals

Optimization Tips

  1. Use caching - Default behavior speeds up repeated runs
  2. Tune concurrency - --max-concurrency balances speed vs. rate limits
  3. Filter tests - Use --filter to run specific test cases during development
  4. Sample datasets - Iterate on a small subset of test cases before committing to full runs

Scaling for Large Evals

For large-scale evaluations with thousands of test cases: raise --max-concurrency as far as your rate limits allow, lean on the response cache, split suites across multiple config files, and write results to JSON with --output for external analysis.

Extensibility: Build Custom Features

Custom Assertions

Write custom assertions for domain-specific checks:

// assertions/customCheck.js
export default function customCheck(output, context) {
  const pass = output.includes('expected');
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matched' : 'Missing expected content'
  };
}

Use in your config:

assert:
  - type: file://assertions/customCheck.js
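The same mechanism works in Python: promptfoo's Python assertions expect a get_assert(output, context) function in the referenced file, returning a bool, a score, or a grading dict (check the docs for your version). A sketch, where the disclaimer rule is a hypothetical domain check:

```python
# assertions/contains_disclaimer.py
# Referenced from config as: type: file://assertions/contains_disclaimer.py
def get_assert(output: str, context: dict) -> dict:
    """Pass only when the output carries a required disclaimer (hypothetical rule)."""
    passed = "consult a professional" in output.lower()
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "disclaimer present" if passed else "missing required disclaimer",
    }
```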

MCP Server

Promptfoo includes a Model Context Protocol (MCP) server for integration with AI assistants like Claude Code:

promptfoo mcp

This enables AI agents to run evals, read results, and iterate on prompts and test cases on your behalf.

Real-World Use Cases

Customer Support Chatbot

A SaaS company uses promptfoo to test its support chatbot before each deployment.

Result: 90% reduction in customer-reported issues after implementing automated evals.

Content Generation Pipeline

A marketing team validates AI-generated content for brand voice.

Result: Consistent brand voice across all content with 40% lower API costs.

Healthcare Application

A healthtech startup ensures compliance through strict testing.

Result: Passed SOC 2 audit with promptfoo evals as evidence.

Conclusion

Promptfoo brings systematic testing to LLM applications. It replaces manual, error-prone processes with automated evals that catch regressions, security issues, and quality problems before deployment.

Key takeaways:

  1. Automated evals replace manual spot-checking and catch regressions on every change
  2. Red teaming with 67+ attack plugins surfaces vulnerabilities before attackers do
  3. 90+ provider integrations turn model comparison into a config change
  4. Everything runs locally by default, so prompts and test data stay on your machine

The future of AI development is data-driven. With promptfoo, you have the tools to build, test, and secure LLM applications at scale.


If you also work with APIs, consider using Apidog alongside promptfoo. Apidog handles API design, testing, and documentation, while promptfoo focuses on LLM evaluation. Together they cover the full stack of modern application testing.

FAQ

What is promptfoo used for?

Promptfoo is used for testing and evaluating LLM applications. It runs automated tests against prompts, compares outputs across models, and performs security red-team assessments to find vulnerabilities.

Is promptfoo free?

Yes, promptfoo is open source and MIT licensed. You can use it for free for personal and commercial projects. Cloud features and enterprise support may require paid plans.

How do I install promptfoo?

Run npm install -g promptfoo for global installation. You can also use npx promptfoo@latest without installing, or install via brew install promptfoo on macOS or pip install promptfoo for Python.

What models does promptfoo support?

Promptfoo supports 90+ LLM providers including OpenAI (GPT-4, GPT-4o, o1), Anthropic (Claude 3.5/4/4.5), Google (Gemini), Microsoft (Azure OpenAI), Amazon Bedrock, and local models via Ollama.

How do I run a red team scan?

Run promptfoo redteam init to create a config, then promptfoo redteam run to execute the security scan. View results with promptfoo redteam report.

Can I use promptfoo in CI/CD?

Yes. Install promptfoo in your CI pipeline and run promptfoo eval with your config file. Set quality gates with the threshold option to fail CI if evals do not meet pass rates.

Does promptfoo send my data to external servers?

No. Promptfoo runs 100% locally by default. Your prompts and test data never leave your machine unless you explicitly opt into cloud features. Cache and database files are stored locally.

How do I compare models with promptfoo?

List multiple providers in your config file, then run promptfoo eval. View the comparison in the web UI with promptfoo view, which shows pass/fail rates, costs, and latency for each model.
