How to Test LLM Applications: The Complete Guide to Promptfoo (2026)

Learn how to test LLM applications with Promptfoo. Complete guide covering automated evals, red team security scanning, and CI/CD integration.

Ashley Innocent

19 March 2026

TL;DR

Promptfoo is an open-source LLM evaluation and red-teaming framework that helps developers test AI applications systematically. It supports 90+ model providers, offers 67+ security attack plugins, and runs 100% locally for privacy. With 1.6 million npm downloads and production use at companies serving 10M+ users, it has become the standard for LLM testing. Get started with npm install -g promptfoo and promptfoo init --example getting-started.

Introduction

You spent weeks building your AI-powered customer support chatbot. It answered questions perfectly during development. Then users started finding ways to make it leak sensitive data, bypass safety guardrails, and give inconsistent responses.

This scenario plays out every day. Teams ship LLM applications based on gut feel and manual testing, only to discover vulnerabilities and quality issues in production, where fixes cost far more than they would have during development.

Promptfoo solves this by bringing systematic, automated testing to LLM applications. It lets you evaluate prompts across multiple models, run security red-team assessments, and catch regressions before they reach users.

I have analyzed the promptfoo codebase (version 0.121.2) and tested its core features to bring you this comprehensive guide. You will learn how to set up evaluations, run security scans, integrate with CI/CD, and avoid common pitfalls.

By the end, you will have a working test suite for your LLM application and know how to ship with confidence.

If you work with API testing or need to validate API behavior alongside your LLM tests, Apidog provides a unified platform for API design, testing, and documentation. You can use both tools together: promptfoo for LLM evaluation and Apidog for API layer validation.

What Is Promptfoo and Why You Need It

Promptfoo is a command-line tool and Node.js library for evaluating and red-teaming LLM applications. Think of it as a testing framework built specifically for the quirks of AI development.

Traditional testing tools fail with LLMs because outputs are non-deterministic. You cannot assert exact string matches when the same prompt produces different responses each time. Promptfoo solves this with assertions built for fuzzy outputs: substring and regex checks, semantic similarity scoring, LLM-graded rubrics, and custom JavaScript or Python functions.

The tool runs locally on your machine. Your prompts and test data never leave your environment unless you opt into cloud features. This privacy-first design makes it suitable for testing with sensitive data.

The Problem Promptfoo Solves

Most teams test LLM applications manually. They send a few prompts, read the outputs, and decide if things look good. This approach has three fatal flaws:

  1. No regression detection - You cannot tell if a model update broke existing functionality
  2. Coverage gaps - Manual testing misses edge cases and adversarial inputs
  3. No metrics - You cannot track improvement or compare models objectively

Promptfoo replaces this with automated evals that run on every change. You define test cases once and execute them against any model. Results include pass/fail rates, cost comparisons, and latency metrics.

Who Uses Promptfoo

The project has 1.6 million npm downloads and powers LLM applications serving over 10 million end users. Companies use it for prompt evaluation, model comparison, security red teaming, and regression testing in CI/CD pipelines.

In March 2026, Promptfoo joined OpenAI. The project remains open source and MIT licensed, with continued development under the new ownership.

Getting Started: Install and Run Your First Eval

You can install promptfoo globally or run it without installation using npx.

Installation

# Global install (recommended)
npm install -g promptfoo

# Or run without installing
npx promptfoo@latest

# macOS users can also use Homebrew
brew install promptfoo

# Python users can also use pip
pip install promptfoo

Set your API keys as environment variables:

export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xxx

Create Your First Eval

Initialize an example project:

promptfoo init --example getting-started
cd getting-started

This creates a promptfooconfig.yaml file with sample prompts, providers, and test cases.

Run the evaluation:

promptfoo eval

View results in the web UI:

promptfoo view

The UI opens at localhost:3000 and shows a side-by-side comparison of outputs from each model, with pass/fail status for each assertion.

Understanding the Config File

The promptfooconfig.yaml file defines your eval suite:

description: "My First Eval Suite"

prompts:
  - prompts/greeting.txt
  - prompts/farewell.txt

providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-5

tests:
  - vars:
      input: "Hello"
    assert:
      - type: contains
        value: "Hi"
      - type: latency
        threshold: 3000

You can scale this to hundreds of test cases. Many teams keep eval configs in version control and run them in CI on every pull request.
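Inline test lists get unwieldy at that scale, so promptfoo can also load test cases from an external file. A sketch, assuming a CSV where each column becomes a test variable and the special __expected column carries an assertion per row (check the docs for your promptfoo version):

```yaml
# promptfooconfig.yaml (fragment)
# tests.csv columns map to vars; __expected holds an assertion, e.g.:
#   input,__expected
#   Hello,contains:Hi
#   Goodbye,contains:Bye
tests: file://tests.csv
```

This keeps the dataset diffable in version control alongside the config.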

Core Features: What Promptfoo Can Do

1. Automated Evaluations

Automated evals are the foundation of promptfoo. You define test cases with expected outcomes, and the tool runs them against your chosen models.

Assertion Types

Promptfoo includes 30+ built-in assertion types:

- contains: Output includes a substring
- equals: Exact string match
- regex: Match against a regex pattern
- json-schema: Validate JSON structure
- javascript: Custom JS function returns pass/fail
- python: Custom Python function
- llm-rubric: Use an LLM to grade output
- similar: Semantic similarity score
- latency: Response time under threshold
- cost: Cost per request under threshold

Example with multiple assertions:

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: javascript
        value: output.length < 100
      - type: latency
        threshold: 2000
      - type: cost
        threshold: 0.001

This test checks that the answer mentions Paris, stays under 100 characters, responds in under 2 seconds, and costs less than $0.001.

LLM-Graded Evals

The llm-rubric assertion uses one LLM to grade another’s output. This is powerful for subjective criteria like tone or helpfulness:

assert:
  - type: llm-rubric
    value: "Response should be helpful, harmless, and honest"

The grader LLM reads the output and scores it against your rubric. You can use a cheaper model for grading to reduce costs.
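The grader override can live in defaultTest so every rubric in the suite uses it. A sketch, assuming defaultTest.options.provider is honored as the grading provider in your promptfoo version:

```yaml
defaultTest:
  options:
    provider: openai:gpt-4o-mini  # grade with a cheaper model

tests:
  - vars:
      question: "Summarize our refund policy"
    assert:
      - type: llm-rubric
        value: "Response should be helpful, harmless, and honest"
```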

2. Red Teaming and Security Testing

Promptfoo includes comprehensive security testing through its red team module. It automatically generates adversarial inputs to probe for vulnerabilities.

Supported Attack Vectors

The red team system includes 67+ plugins organized by category:

- Prompt Injection: Direct, indirect, and context injection attacks
- Jailbreaks: DAN, persona switching, role-play bypasses
- Data Exfiltration: SSRF, system prompt extraction, prompt leakage
- Harmful Content: Hate speech, dangerous activities, self-harm requests
- Compliance: PII leakage, HIPAA violations, financial data exposure
- Audio/Visual: Audio injection and image-based attacks

Running a Red Team Scan

Initialize a red team config:

promptfoo redteam init

Run the security scan:

promptfoo redteam run

View the report:

promptfoo redteam report [directory]

The redteam run command performs two steps:

  1. Generates dynamic attack probes tailored to your application
  2. Evaluates probes against your target and scores vulnerabilities

Results include severity ratings (Critical, High, Medium, Low), exploitable test cases, and remediation recommendations.

Example Red Team Output

Vulnerability Summary:
- Critical: 2 (PII leakage, prompt extraction)
- High: 5 (jailbreaks, injection attacks)
- Medium: 12 (bias, inconsistent responses)
- Low: 23 (minor policy violations)

Fix critical issues before deployment. Re-run scans after changes to verify fixes.
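The redteam init step writes a redteam section into your config that scopes what the scan probes. A minimal sketch, where the plugin and strategy names are illustrative and should be checked against the plugin list shipped with your promptfoo version:

```yaml
# promptfooconfig.yaml (redteam section)
redteam:
  purpose: "Customer support chatbot for a SaaS billing product"
  numTests: 5            # probes generated per plugin
  plugins:
    - pii                # personally identifiable information leakage
    - prompt-extraction  # attempts to reveal the system prompt
  strategies:
    - jailbreak
    - prompt-injection
```

Narrowing the plugin list keeps scan time and token spend proportional to the risks that matter for your application.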

3. Code Scanning for Pull Requests

Promptfoo integrates with GitHub Actions to scan pull requests for LLM-related security issues.

# .github/workflows/promptfoo-scan.yml
name: Promptfoo Code Scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: promptfoo/promptfoo/code-scan-action@main
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}

The scan surfaces LLM-related security issues in changed code before the pull request merges.

4. Model Comparison

Compare outputs from multiple models side by side to choose the best one for your use case.

# Run eval with multiple providers
promptfoo eval

# View comparison in web UI
promptfoo view

The web UI displays side-by-side outputs from each provider, pass/fail status for every assertion, and per-model cost and latency metrics.

This data-driven approach prevents bias toward familiar models. You might find that a cheaper model outperforms GPT-4 on your specific evals.

Supported Providers: 90+ LLM Integrations

Promptfoo supports over 90 LLM providers out of the box. You can test the same prompt across OpenAI, Anthropic, Google, Amazon, and local models without changing your code.

Major Providers

- OpenAI: GPT-4, GPT-4o, GPT-4o-mini, o1, o3
- Anthropic: Claude 3.5/3.7/4.5/4.6, Thinking models
- Google: Gemini 1.5/2.0, Vertex AI
- Microsoft: Azure OpenAI, Phi
- Amazon: Bedrock (Claude, Llama, Titan)
- Meta: Llama 3, 3.1, 3.2 (via multiple providers)
- Ollama: Local models (Llama, Mistral, Phi)

Custom Providers

You can write custom providers in Python or JavaScript if your model is not supported.

Python example:

# custom_provider.py
# my_async_api is a placeholder for your own model client
class CustomProvider:
    async def call_api(self, prompt: str, options: dict, context: dict) -> dict:
        response = await my_async_api.generate(prompt)
        return {
            "output": response.text,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens
            }
        }

JavaScript example:

// customProvider.js
// myApi is a placeholder for your own model client
export default class CustomProvider {
  async callApi(prompt) {
    return {
      output: await myApi.generate(prompt),
      tokenUsage: { total: 50, prompt: 20, completion: 30 }
    };
  }
}

Register custom providers in your config:

providers:
  - id: file://custom_provider.py
    config:
      api_key: ${MY_API_KEY}

Command-Line Interface: Essential Commands

Promptfoo’s CLI provides all the functionality you need for daily workflows.

Core Commands

# Run evaluations
promptfoo eval -c promptfooconfig.yaml

# Open web UI
promptfoo view

# Share results online
promptfoo share

# Red team testing
promptfoo redteam init
promptfoo redteam run

# Configuration
promptfoo init
promptfoo validate [config]

# Results management
promptfoo list
promptfoo show <id>
promptfoo delete <id>
promptfoo export <id>

# Utilities
promptfoo cache clear
promptfoo retry <id>

Useful Flags

--no-cache              # Disable caching for fresh results
--max-concurrency <n>   # Limit parallel API calls
--output <file>         # Write results to JSON file
--verbose               # Enable debug logging
--env-file <path>       # Load environment variables from file
--filter <pattern>      # Run specific test cases

Example: Run Eval with Custom Settings

promptfoo eval \
  -c promptfooconfig.yaml \
  --no-cache \
  --max-concurrency 3 \
  --output results.json \
  --env-file .env

This runs evals fresh (no cache), limits concurrency to 3 parallel calls, saves results to JSON, and loads API keys from .env.
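The --output file lets downstream scripts consume the run. A sketch in Python that computes a pass rate, assuming a layout with results.results[].success; the schema varies across promptfoo versions, so verify the keys against your own results.json:

```python
def pass_rate(eval_json: dict) -> float:
    """Fraction of test cases whose assertions all passed."""
    # Assumed layout: {"results": {"results": [{"success": bool, ...}, ...]}}
    results = eval_json.get("results", {}).get("results", [])
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("success")) / len(results)

# Demo with synthetic data shaped like the assumed layout
sample = {"results": {"results": [{"success": True}, {"success": True}, {"success": False}]}}
print(f"pass rate: {pass_rate(sample):.1%}")  # pass rate: 66.7%
```

In CI you would load results.json with json.load and fail the job when the rate drops below your threshold.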

CI/CD Integration: Automate LLM Testing

Integrate promptfoo into your CI/CD pipeline to catch regressions before deployment.

GitHub Actions Example

name: LLM Tests
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npm install -g promptfoo
      - run: promptfoo eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Quality Gates

Set pass/fail thresholds in your config:

commandLineOptions:
  threshold: 0.8  # Require 80% pass rate

This fails CI if evals do not meet the threshold, preventing regressions from merging.

Caching in CI

Enable caching to speed up CI runs:

- uses: actions/cache@v4
  with:
    path: ~/.cache/promptfoo
    key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

Cached results skip API calls for unchanged tests, reducing CI time and costs.

Web UI: Visualize and Share Results

The built-in web UI (promptfoo view) provides an interactive interface for reviewing evals.

Features

Access and Security

The UI runs on localhost:3000 by default. It includes CSRF protection using Sec-Fetch-Site and Origin headers to block cross-site requests from untrusted origins.

Do not expose the local web server to untrusted networks. For team access, use the promptfoo share command to upload results to the cloud, or self-host with authentication.

Database and Caching

Cache Location

The cache, stored under ~/.cache/promptfoo by default, holds model API responses so repeated runs skip identical calls. Use --no-cache during development when you need fresh results.

Database Location

Promptfoo keeps a local SQLite database (under ~/.promptfoo by default) that stores historical eval runs for comparison and trend analysis. Do not delete this file unless you want to lose historical data.

Security Model: What You Can Trust

Promptfoo operates on a trust-by-configuration model. Understanding this prevents security surprises.

Trusted Inputs (Treated as Code)

These inputs execute as code and should only come from trusted sources: the promptfooconfig.yaml file itself, custom JavaScript and Python providers, file:// assertions, and any hooks or extensions you register.

Untrusted Inputs (Data-Only)

These inputs are treated as data and should never trigger code execution: model outputs, test case variables, and dataset content. Maintain that boundary in your own custom code as well.

Hardening Recommendations

For high-security environments:

  1. Run inside a container or VM with minimal privileges
  2. Use dedicated, least-privileged API keys
  3. Avoid placing secrets in prompts or config files
  4. Restrict network egress for third-party code
  5. Do not expose the local web server to untrusted networks

Performance: Optimize Your Evals

Optimization Tips

  1. Use caching - Default behavior speeds up repeated runs
  2. Tune concurrency - --max-concurrency balances speed vs. rate limits
  3. Filter tests - Use --filter to run specific test cases during development
  4. Sample datasets - Iterate on a small subset of test cases before committing to full runs

Scaling for Large Evals

For large-scale evaluations with thousands of test cases: raise --max-concurrency as far as your rate limits allow, lean on the response cache, split suites across multiple config files, and write results to JSON with --output for external analysis.

Extensibility: Build Custom Features

Custom Assertions

Write custom assertions for domain-specific checks:

// assertions/customCheck.js
export default function customCheck(output, context) {
  const pass = output.includes('expected');
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'Output matched' : 'Missing expected content'
  };
}

Use in your config:

assert:
  - type: file://assertions/customCheck.js
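The same mechanism works in Python: promptfoo's Python assertions expect a get_assert(output, context) function in the referenced file, returning a bool, a score, or a grading dict (check the docs for your version). A sketch, where the disclaimer rule is a hypothetical domain check:

```python
# assertions/contains_disclaimer.py
# Referenced from config as: type: file://assertions/contains_disclaimer.py
def get_assert(output: str, context: dict) -> dict:
    """Pass only when the output carries a required disclaimer (hypothetical rule)."""
    passed = "consult a professional" in output.lower()
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": "disclaimer present" if passed else "missing required disclaimer",
    }
```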

MCP Server

Promptfoo includes a Model Context Protocol (MCP) server for integration with AI assistants like Claude Code:

promptfoo mcp

This enables AI agents to run evals, read results, and iterate on prompts and test cases on your behalf.

Real-World Use Cases

Customer Support Chatbot

A SaaS company uses promptfoo to test its support chatbot before each deployment.

Result: 90% reduction in customer-reported issues after implementing automated evals.

Content Generation Pipeline

A marketing team validates AI-generated content for brand voice.

Result: Consistent brand voice across all content with 40% lower API costs.

Healthcare Application

A healthtech startup ensures compliance through strict testing.

Result: Passed SOC 2 audit with promptfoo evals as evidence.

Conclusion

Promptfoo brings systematic testing to LLM applications. It replaces manual, error-prone processes with automated evals that catch regressions, security issues, and quality problems before deployment.

Key takeaways:

  1. Automated evals replace manual spot-checking and catch regressions on every change
  2. Red teaming with 67+ attack plugins surfaces vulnerabilities before attackers do
  3. 90+ provider integrations turn model comparison into a config change
  4. Everything runs locally by default, so prompts and test data stay on your machine

The future of AI development is data-driven. With promptfoo, you have the tools to build, test, and secure LLM applications at scale.


If you also work with APIs, consider using Apidog alongside promptfoo. Apidog handles API design, testing, and documentation, while promptfoo focuses on LLM evaluation. Together they cover the full stack of modern application testing.

FAQ

What is promptfoo used for?

Promptfoo is used for testing and evaluating LLM applications. It runs automated tests against prompts, compares outputs across models, and performs security red-team assessments to find vulnerabilities.

Is promptfoo free?

Yes, promptfoo is open source and MIT licensed. You can use it for free for personal and commercial projects. Cloud features and enterprise support may require paid plans.

How do I install promptfoo?

Run npm install -g promptfoo for global installation. You can also use npx promptfoo@latest without installing, or install via brew install promptfoo on macOS or pip install promptfoo for Python.

What models does promptfoo support?

Promptfoo supports 90+ LLM providers including OpenAI (GPT-4, GPT-4o, o1), Anthropic (Claude 3.5/4/4.5), Google (Gemini), Microsoft (Azure OpenAI), Amazon Bedrock, and local models via Ollama.

How do I run a red team scan?

Run promptfoo redteam init to create a config, then promptfoo redteam run to execute the security scan. View results with promptfoo redteam report.

Can I use promptfoo in CI/CD?

Yes. Install promptfoo in your CI pipeline and run promptfoo eval with your config file. Set quality gates with the threshold option to fail CI if evals do not meet pass rates.

Does promptfoo send my data to external servers?

No. Promptfoo runs 100% locally by default. Your prompts and test data never leave your machine unless you explicitly opt into cloud features. Cache and database files are stored locally.

How do I compare models with promptfoo?

List multiple providers in your config file, then run promptfoo eval. View the comparison in the web UI with promptfoo view, which shows pass/fail rates, costs, and latency for each model.
