How to Make AI Agents Prove Their Work with Screenshots (No More Hallucinations)

Build an evidence-based QA workflow that requires screenshot proof before AI agents can approve work. Uses Playwright for automated captures, cross-references claims with actual code, and provides PASS/FAIL certification.

Ashley Innocent

19 March 2026

TL;DR

Stop AI hallucinations with 4 steps: (1) Install Playwright and configure breakpoints (desktop, tablet, mobile), (2) Create a screenshot test suite that captures the full page, responsive layouts, and interactions, (3) Run ./qa-playwright-capture.sh to collect evidence, (4) Activate the Reality Checker agent to cross-reference claims with grep results and screenshots. Agents output PASS or NEEDS WORK with specific blocking issues. No more fantasy approvals.

Introduction

Stop accepting “looks great” from AI agents. Build an evidence-based QA workflow with Playwright screenshots that requires visual proof before any approval.

You ask an AI agent to review your landing page. It responds:

The design looks premium and polished. The glassmorphism effects are well-implemented. The page is fully responsive. Ready for production!

You open the page. The “glassmorphism” is a solid gray background. The “fully responsive” layout breaks on mobile. Nothing is premium or polished.

AI agents hallucinate. They tell you what you want to hear. They avoid conflict. They approve everything.

The Reality Checker agent from The Agency collection takes a different approach:

Status: NEEDS WORK

Evidence:
- grep for "glassmorphism" returned NO GLASSMORPHISM FOUND
- responsive-mobile.png shows broken layout at 375px width
- performance-metrics.json shows 3 console errors, 2.1s load time

Blocking issues: 4

No feelings. No opinions. Just evidence.

In this tutorial, you’ll build an evidence-based QA workflow that complements your API testing pipeline. Whether you’re validating frontend layouts or verifying API responses in Apidog, the principle is the same: require proof before approval. You’ll set up Playwright for automated screenshot captures, create mandatory reality check commands, cross-reference agent claims with actual code, and require PASS/FAIL certification before shipping.

Why Evidence Matters

AI agents are people-pleasers. They want to help. They want you to like them. So they say what sounds good, whether or not it is true.

Evidence-based QA changes this. Instead of opinions, you get screenshots, grep results, and metrics.

No more “trust me.” Just proof.

Step 1: Set Up Playwright

Install Playwright:

npm install -D @playwright/test
npx playwright install chromium

Create qa-playwright.config.ts:

import { defineConfig } from '@playwright/test';

export default defineConfig({
  testMatch: '**/qa-screenshots.spec.ts',
  timeout: 30000,
  use: {
    baseURL: process.env.BASE_URL || 'http://localhost:8000',
    screenshot: 'on',
    trace: 'on-first-retry',
    headless: true,
  },
  projects: [
    {
      name: 'desktop',
      use: { viewport: { width: 1920, height: 1080 } },
    },
    {
      name: 'tablet',
      use: { viewport: { width: 768, height: 1024 } },
    },
    {
      name: 'mobile',
      use: { viewport: { width: 375, height: 667 } },
    },
  ],
  reporter: [['json', { outputFile: 'public/qa-screenshots/test-results.json' }]],
  outputDir: 'public/qa-screenshots',
});
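
The three viewports in this config appear again in the responsive test in Step 2, so the two lists can drift apart. One way to keep them in sync is a single shared list. The following is a minimal sketch; `Breakpoint`, `BREAKPOINTS`, and `toProjects` are hypothetical names, not part of the original setup.

```typescript
// Hypothetical shared module, e.g. qa-breakpoints.ts (the file name is an assumption).
export interface Breakpoint {
  name: string;
  width: number;
  height: number;
}

// The three viewports used throughout this tutorial.
export const BREAKPOINTS: Breakpoint[] = [
  { name: 'desktop', width: 1920, height: 1080 },
  { name: 'tablet', width: 768, height: 1024 },
  { name: 'mobile', width: 375, height: 667 },
];

// Map each breakpoint to the { name, use: { viewport } } shape that the
// `projects` array in the config uses.
export function toProjects(breakpoints: Breakpoint[]) {
  return breakpoints.map(({ name, width, height }) => ({
    name,
    use: { viewport: { width, height } },
  }));
}
```

With this in place, the config could use `projects: toProjects(BREAKPOINTS)` and the responsive test could loop over the same `BREAKPOINTS` array.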

Step 2: Create Screenshot Test Suite

Create qa-screenshots.spec.ts:

import { test, expect } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';

// Ensure output directory exists
const outputDir = 'public/qa-screenshots';
if (!fs.existsSync(outputDir)) {
  fs.mkdirSync(outputDir, { recursive: true });
}

test.describe('Reality Check Screenshots', () => {
  test('capture full page at all breakpoints', async ({ page }, testInfo) => {
    const errors: string[] = [];
    const consoleLogs: string[] = [];

    // Capture console errors
    page.on('console', msg => {
      if (msg.type() === 'error') {
        consoleLogs.push(`[ERROR] ${msg.text()}`);
      }
    });

    // Capture network failures
    page.on('requestfailed', request => {
      errors.push(`[NETWORK] ${request.url()} failed`);
    });

    // Navigate to page
    await page.goto('/');
    await page.waitForLoadState('networkidle');

    // Capture performance metrics inside the browser.
    // Playwright has no page.metrics(); performance.memory is non-standard
    // and Chromium-only, so fall back to null when it is missing.
    const perf = await page.evaluate(() => {
      const timing = window.performance.timing;
      const memory = (window.performance as any).memory;
      return {
        jsHeapSize: memory ? memory.usedJSHeapSize : null,
        loadTime: timing.loadEventEnd - timing.navigationStart,
        domContentLoaded: timing.domContentLoadedEventEnd - timing.navigationStart,
      };
    });

    // Save screenshot, named after the Playwright project (desktop, tablet, mobile)
    // so the three projects do not overwrite each other's file
    const projectName = testInfo.project.name || 'chromium';
    await page.screenshot({
      path: path.join(outputDir, `full-page-${projectName}.png`),
      fullPage: true,
    });

    // Save metrics
    fs.writeFileSync(
      path.join(outputDir, 'performance-metrics.json'),
      JSON.stringify({ performance: perf, consoleErrors: consoleLogs, networkErrors: errors }, null, 2)
    );
  });

  test('capture responsive layouts', async ({ page }) => {
    const breakpoints = [
      { name: 'desktop', width: 1920, height: 1080 },
      { name: 'tablet', width: 768, height: 1024 },
      { name: 'mobile', width: 375, height: 667 },
    ];

    for (const breakpoint of breakpoints) {
      await page.setViewportSize({ width: breakpoint.width, height: breakpoint.height });
      await page.goto('/');
      await page.waitForLoadState('networkidle');
      await page.screenshot({
        path: path.join(outputDir, `responsive-${breakpoint.name}.png`),
        fullPage: true,
      });
    }
  });

  test('capture navigation interactions', async ({ page }) => {
    await page.goto('/');

    // Find and click navigation items.
    // Use a locator and re-resolve it on every pass: element handles
    // obtained before a navigation can go stale after click + goBack.
    const navLinks = page.locator('nav a, header a, .nav a');
    const count = Math.min(await navLinks.count(), 5);
    for (let i = 0; i < count; i++) {
      await page.screenshot({ path: path.join(outputDir, `nav-${i}-before.png`) });
      await navLinks.nth(i).click();
      await page.waitForLoadState('networkidle');
      await page.screenshot({ path: path.join(outputDir, `nav-${i}-after.png`) });
      await page.goBack();
      await page.waitForLoadState('networkidle');
    }
  });

  test('capture form interactions', async ({ page }) => {
    await page.goto('/');

    // Find forms
    const forms = await page.$$('form');
    for (let i = 0; i < forms.length; i++) {
      const form = forms[i];
      await form.screenshot({ path: path.join(outputDir, `form-${i}-initial.png`) });

      // Fill inputs
      const inputs = await form.$$('input[type="text"], input[type="email"], input[type="password"]');
      for (const input of inputs) {
        await input.fill('test@example.com');
      }

      await form.screenshot({ path: path.join(outputDir, `form-${i}-filled.png`) });
    }
  });

  test('capture accordion/dropdown interactions', async ({ page }) => {
    await page.goto('/');

    // Find accordions
    const accordions = await page.$$('[data-accordion], details, .accordion');
    for (let i = 0; i < accordions.length; i++) {
      await accordions[i].screenshot({ path: path.join(outputDir, `accordion-${i}-closed.png`) });
      await accordions[i].click();
      await page.waitForTimeout(300);
      await accordions[i].screenshot({ path: path.join(outputDir, `accordion-${i}-open.png`) });
    }
  });
});

Step 3: Create Reality Check Script

Create qa-playwright-capture.sh:

#!/usr/bin/env bash
#
# qa-playwright-capture.sh — Run Playwright screenshot captures for reality checking
#
# Usage: ./qa-playwright-capture.sh [BASE_URL] [OUTPUT_DIR]
#

set -euo pipefail

BASE_URL="${1:-http://localhost:8000}"
OUTPUT_DIR="${2:-public/qa-screenshots}"

echo "Starting Reality Check screenshot capture..."
echo "  Base URL: $BASE_URL"
echo "  Output: $OUTPUT_DIR"

# Ensure output directory exists
mkdir -p "$OUTPUT_DIR"

# Run Playwright tests
export BASE_URL
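# testMatch in the config already limits the run to qa-screenshots.spec.ts.
# No --grep filter: none of the test titles carry an @screenshot tag.
npx playwright test --config=qa-playwright.config.ts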

# Generate summary
echo ""
echo "Generating summary..."

# Count screenshots
SCREENSHOT_COUNT=$(find "$OUTPUT_DIR" -name "*.png" | wc -l)
echo "  Screenshots captured: $SCREENSHOT_COUNT"

# Check for console errors
if [ -f "$OUTPUT_DIR/performance-metrics.json" ]; then
  # Entries look like "[ERROR] message", so match the prefix only.
  # grep -c prints 0 and exits non-zero on no match; || true keeps set -e from aborting.
  ERROR_COUNT=$(grep -c '\[ERROR\]' "$OUTPUT_DIR/performance-metrics.json" || true)
  echo "  Console errors: $ERROR_COUNT"
fi

# Check load time
if [ -f "$OUTPUT_DIR/performance-metrics.json" ]; then
  LOAD_TIME=$(grep -o '"loadTime": [0-9.]*' "$OUTPUT_DIR/performance-metrics.json" | head -1 | awk '{print $2}' || true)
  echo "  Load time: ${LOAD_TIME:-N/A}ms"
fi

echo ""
echo "Reality Check complete. Review screenshots in: $OUTPUT_DIR"
echo ""
echo "Next step: Run Reality Checker agent to validate evidence"

Make it executable:

chmod +x qa-playwright-capture.sh
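
The script reads numbers out of performance-metrics.json with grep, which depends on how the JSON happens to be indented. Parsing the file is less fragile. This is a minimal sketch that assumes the file has the shape written by the test suite in Step 2; `summarizeMetrics` and the two interfaces are hypothetical names, and the sample values are made up.

```typescript
// Shape of performance-metrics.json as written by the test suite.
interface MetricsFile {
  performance: { loadTime: number; domContentLoaded: number; jsHeapSize?: number | null };
  consoleErrors: string[];
  networkErrors: string[];
}

interface MetricsSummary {
  consoleErrorCount: number;
  networkErrorCount: number;
  loadTimeMs: number;
}

// Parse the JSON text and count entries directly, so the result does not
// depend on indentation or line wrapping.
export function summarizeMetrics(jsonText: string): MetricsSummary {
  const data = JSON.parse(jsonText) as MetricsFile;
  return {
    consoleErrorCount: data.consoleErrors.length,
    networkErrorCount: data.networkErrors.length,
    loadTimeMs: data.performance.loadTime,
  };
}

// Example input matching the shape above (values are illustrative only).
const sampleMetrics = JSON.stringify({
  performance: { loadTime: 2300, domContentLoaded: 900, jsHeapSize: 1048576 },
  consoleErrors: ['[ERROR] Uncaught TypeError', '[ERROR] 404 for /app.js'],
  networkErrors: [],
});
console.log(summarizeMetrics(sampleMetrics));
```

In a real run the text would come from reading public/qa-screenshots/performance-metrics.json instead of the inline sample.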

Step 4: Run Reality Check Commands

Before any AI agent can approve work, run these commands:

# 1. Verify what was actually built
ls -la resources/views/ || ls -la *.html
ls -la src/components/ || ls -la components/

# 2. Cross-check claimed features
grep -r "glassmorphism\|backdrop-filter\|blur" . --include="*.css" --include="*.html" || echo "NO GLASSMORPHISM FOUND"
grep -r "responsive\|media-query\|@media" . --include="*.css" || echo "NO RESPONSIVE CSS FOUND"
grep -r "jwt\|authentication\|auth" . --include="*.ts" --include="*.js" || echo "NO AUTH FOUND"

# 3. Run screenshot capture
./qa-playwright-capture.sh http://localhost:8000 public/qa-screenshots

# 4. Review evidence
ls -la public/qa-screenshots/
# Expected files:
# - responsive-desktop.png
# - responsive-tablet.png
# - responsive-mobile.png
# - nav-*-before.png, nav-*-after.png
# - form-*-initial.png, form-*-filled.png

# 5. Check metrics
cat public/qa-screenshots/test-results.json
cat public/qa-screenshots/performance-metrics.json
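
The grep commands in part 2 can also be expressed as data, which makes the claim list easy to extend per project. This is a rough sketch rather than part of the Reality Checker itself; `Claim`, `ClaimResult`, and `checkClaims` are hypothetical names, and file loading is left out so the logic stays visible.

```typescript
interface Claim {
  label: string;   // what the agent asserted
  pattern: RegExp; // what must appear in the source for the claim to hold (no `g` flag)
}

interface ClaimResult {
  label: string;
  supported: boolean;
  matchedFiles: string[];
}

// `files` maps a path to its text content; how the files are read is up to the caller.
export function checkClaims(files: Record<string, string>, claims: Claim[]): ClaimResult[] {
  return claims.map(({ label, pattern }) => {
    const matchedFiles = Object.keys(files).filter(path => pattern.test(files[path]));
    return { label, supported: matchedFiles.length > 0, matchedFiles };
  });
}

// Illustrative input: a stylesheet with a media query but no blur effect.
const claimResults = checkClaims(
  { 'styles.css': '.card { background: gray; }\n@media (max-width: 600px) { .grid { display: block; } }' },
  [
    { label: 'glassmorphism', pattern: /glassmorphism|backdrop-filter|blur/i },
    { label: 'responsive', pattern: /@media/i },
  ],
);
console.log(claimResults);
```

Here the glassmorphism claim comes back unsupported and the responsive claim is supported by styles.css, which mirrors what the grep fallbacks report.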

Step 5: Activate Reality Checker Agent

Open a Claude Code session:

Activate Reality Checker mode.

Run your mandatory reality check process:

1. Verify files exist: ls -la src/components/
2. Cross-reference claimed features: grep for "premium", "glassmorphism"
3. Review screenshot evidence: public/qa-screenshots/
4. Check test-results.json for test outcomes and performance-metrics.json for console errors and load time

Project URL: http://localhost:8000

Output: PASS or NEEDS WORK with specific blocking issues.

Expected output:

## Reality Check Results

### File Verification: PASS
- Component files present: 12 files found
- Expected structure matches

### Feature Verification: NEEDS WORK
- Claimed: "Premium glassmorphism design"
- grep result: NO GLASSMORPHISM FOUND
- Status: CLAIM NOT SUPPORTED

### Screenshot Evidence: NEEDS WORK
- Desktop (1920x1080): Layout correct
- Tablet (768x1024): Navigation overlap detected
- Mobile (375x667): Product grid broken (2 columns instead of 1)

### Performance Metrics: NEEDS WORK
- Load time: 2.3s (target: <1s)
- Console errors: 3 (target: 0)
- Network failures: 1 (target: 0)

## Final Status: NEEDS WORK

### Blocking Issues:
1. Glassmorphism claimed but not implemented
2. Mobile layout broken at 375px
3. Load time exceeds 1s target
4. 3 console errors to fix

### Non-Blocking:
- Tablet navigation overlap
- Add loading states

Do not approve until blocking issues are resolved.
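
The final status in a report like this follows a simple rule: any blocking issue means NEEDS WORK. The agent's actual decision logic is not specified here, so treat the following as one possible sketch of that rule; `SectionResult`, `Report`, and `certify` are hypothetical names.

```typescript
type Status = 'PASS' | 'NEEDS WORK';

interface SectionResult {
  section: string;
  blockingIssues: string[];
  nonBlockingIssues: string[];
}

interface Report {
  status: Status;
  blockingIssues: string[];
  nonBlockingIssues: string[];
}

// PASS only when no section reports a blocking issue.
// Non-blocking issues are carried through but never change the status.
export function certify(sections: SectionResult[]): Report {
  const blockingIssues = sections.reduce<string[]>(
    (all, s) => all.concat(s.blockingIssues), []);
  const nonBlockingIssues = sections.reduce<string[]>(
    (all, s) => all.concat(s.nonBlockingIssues), []);
  return {
    status: blockingIssues.length === 0 ? 'PASS' : 'NEEDS WORK',
    blockingIssues,
    nonBlockingIssues,
  };
}
```

Keeping the rule in code rather than in the prompt means the PASS/NEEDS WORK decision cannot be softened by agreeable wording.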

Step 6: Cross-Reference Claims with Evidence

Create a claims checklist:

## Claims vs. Evidence Checklist

| Claim | Evidence Command | Result |
|-------|------------------|--------|
| "Premium glassmorphism" | grep "backdrop-filter" | NOT FOUND |
| "Fully responsive" | responsive-mobile.png | FAIL (broken grid) |
| "No console errors" | performance-metrics.json | 3 errors found |
| "Fast load time" | performance-metrics.json | 2.3s (target: <1s) |
| "JWT authentication" | grep "jsonwebtoken" | FOUND |
| "Rate limiting" | grep "rateLimit" | NOT FOUND |

Update this checklist for every project. Require evidence for every claim.
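
If the checklist is rebuilt for every project, generating it from structured rows avoids hand-editing the table. This is a small sketch under that assumption; `ChecklistRow` and `renderChecklist` are hypothetical names.

```typescript
interface ChecklistRow {
  claim: string;    // the claim as the agent phrased it
  evidence: string; // command or file used as evidence
  result: string;   // e.g. FOUND, NOT FOUND, FAIL (...)
}

// Render rows in the same pipe-table layout as the checklist above.
export function renderChecklist(rows: ChecklistRow[]): string {
  const header = [
    '| Claim | Evidence Command | Result |',
    '|-------|------------------|--------|',
  ];
  const body = rows.map(r => `| "${r.claim}" | ${r.evidence} | ${r.result} |`);
  return header.concat(body).join('\n');
}

console.log(renderChecklist([
  { claim: 'Rate limiting', evidence: 'grep "rateLimit"', result: 'NOT FOUND' },
]));
```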


Complete Reality Check Workflow

┌─────────────────────────────────────────────────────────────────┐
│  1. Developer/AI completes work                                 │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  2. Run reality check commands                                  │
│     - ls to verify files                                        │
│     - grep to verify features                                   │
│     - Playwright for screenshots                                │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  3. Activate Reality Checker agent                              │
│     - Review file verification                                  │
│     - Cross-reference claims                                    │
│     - Analyze screenshots                                       │
│     - Check metrics                                             │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  4. Output: PASS or NEEDS WORK                                  │
│     - PASS: Ship with confidence                                │
│     - NEEDS WORK: Fix blocking issues, re-run                   │
└─────────────────────────────────────────────────────────────────┘

Integration with CI/CD

Add reality checks to your CI pipeline:

# .github/workflows/qa-reality-check.yml
name: Reality Check

on: [pull_request]

jobs:
  reality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      - name: Install dependencies
        run: npm ci

      - name: Install Playwright
        run: npx playwright install chromium

      - name: Start server
        run: npm start &
        env:
          PORT: 8000

      - name: Wait for server
        run: sleep 5

      - name: Run reality check screenshots
        run: ./qa-playwright-capture.sh http://localhost:8000 public/qa-screenshots

      - name: Upload screenshots
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: reality-check-screenshots
          path: public/qa-screenshots/

      - name: Check for console errors
        run: |
          ERRORS=$(grep -c '\[ERROR\]' public/qa-screenshots/performance-metrics.json || true)
          if [ "$ERRORS" -gt "0" ]; then
            echo "Console errors found: $ERRORS"
            exit 1
          fi

      - name: Check load time
        run: |
          LOAD_TIME=$(cat public/qa-screenshots/performance-metrics.json | grep -o '"loadTime": [0-9.]*' | head -1 | awk '{print $2}')
          if (( $(echo "$LOAD_TIME > 1000" | bc -l) )); then
            echo "Load time too slow: ${LOAD_TIME}ms (target: <1000ms)"
            exit 1
          fi
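
The fixed `sleep 5` in the workflow can be too short on a slow runner and wasteful on a fast one. Polling until the server answers is a common alternative. Below is a minimal sketch of that idea; `waitUntilReady` is a hypothetical helper, and the readiness probe is injected so the example has no network dependency.

```typescript
// `probe` resolves to true once the server is ready. In CI it could wrap an
// HTTP request to the base URL; here it is a parameter (an assumption of this sketch).
export async function waitUntilReady(
  probe: () => Promise<boolean>,
  maxAttempts = 30,
  delayMs = 1000,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Treat a rejected probe (e.g. connection refused) the same as "not ready".
    const ready = await probe().catch(() => false);
    if (ready) return attempt; // number of attempts it took
    if (attempt < maxAttempts) {
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Server not ready after ${maxAttempts} attempts`);
}
```

On Node 18 or later, where `fetch` is available globally, a probe could be as simple as `() => fetch('http://localhost:8000').then(r => r.ok)`.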

What You Built

| Component | Purpose |
|-----------|---------|
| Playwright config | Automated screenshot captures at 3 breakpoints |
| Test suite | Full page, responsive, interactions |
| Reality check script | One-command evidence collection |
| Claims checklist | Cross-reference AI claims with grep results |
| CI/CD integration | Automated reality checks on PR |

Next Steps

Extend the workflow:

Build a claims database:

Share with your team:

Troubleshooting Common Issues

Playwright tests timeout:

Screenshots not capturing:

Console errors not being captured:

Mobile screenshots show desktop layout:

CI/CD pipeline fails on Ubuntu:

Advanced Reality Check Patterns

Pattern 1: Visual Regression Testing

Compare screenshots against baselines to catch unintended changes:

import { test, expect } from '@playwright/test';

test('visual regression check', async ({ page }) => {
  await page.goto('/');
  await expect(page).toHaveScreenshot('homepage-base.png', {
    maxDiffPixels: 100, // Allow minor differences
    fullPage: true,
  });
});
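
To make `maxDiffPixels` concrete, here is a deliberately simplified sketch of counting differing pixels between two raw RGBA buffers. It is not Playwright's comparison algorithm: `toHaveScreenshot` also accepts a `threshold` option for per-pixel color tolerance, which this sketch ignores. `countDiffPixels` and `withinBudget` are hypothetical names.

```typescript
// Count pixels whose RGBA bytes differ between two same-sized images.
// Each pixel occupies 4 consecutive bytes (R, G, B, A).
export function countDiffPixels(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error('Buffers must be the same size and hold whole RGBA pixels');
  }
  let diff = 0;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) {
      diff++;
    }
  }
  return diff;
}

// A screenshot "passes" when the number of differing pixels is within the budget.
export function withinBudget(a: Uint8Array, b: Uint8Array, maxDiffPixels: number): boolean {
  return countDiffPixels(a, b) <= maxDiffPixels;
}
```

A small budget such as 100 pixels absorbs anti-aliasing noise, while a broken layout typically changes far more pixels than that.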

Pattern 2: Accessibility Audit

Integrate axe-core for accessibility evidence:

import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
import * as fs from 'fs';

test('accessibility audit', async ({ page }) => {
  await page.goto('/');
  const accessibilityScanResults = await new AxeBuilder({ page }).analyze();

  // Save results
  fs.writeFileSync(
    'public/qa-screenshots/accessibility-results.json',
    JSON.stringify(accessibilityScanResults, null, 2)
  );

  // Fail if critical violations
  const criticalViolations = accessibilityScanResults.violations.filter(
    v => v.impact === 'critical' || v.impact === 'serious'
  );
  expect(criticalViolations).toHaveLength(0);
});

Pattern 3: Performance Budget Enforcement

Fail builds that exceed performance thresholds:

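import { test, expect } from '@playwright/test';

test('performance budget', async ({ page }) => {
  await page.goto('/');
  // Wait until the load event has finished so loadEventEnd is populated
  await page.waitForLoadState('networkidle');

  // Playwright has no page.metrics(); read timing and heap size in the browser.
  // performance.memory is non-standard and Chromium-only, so default to 0.
  const { loadTime, jsHeapUsed } = await page.evaluate(() => ({
    loadTime: performance.timing.loadEventEnd - performance.timing.navigationStart,
    jsHeapUsed: (performance as any).memory?.usedJSHeapSize ?? 0,
  }));

  // Budget thresholds
  expect(loadTime).toBeLessThan(2000); // 2s max
  expect(jsHeapUsed).toBeLessThan(5 * 1024 * 1024); // 5MB max
});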
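
A test like this stops at the first failed `expect`, so a second exceeded budget stays hidden until the first is fixed. Collecting every violation first gives the agent, and you, the full list in one run. This is a minimal sketch of that approach; `Budget` and `overBudget` are hypothetical names, and the numbers are illustrative.

```typescript
interface Budget {
  metric: string;
  actual: number;
  max: number;
  unit: string;
}

// Return one message per exceeded budget; an empty array means all budgets hold.
export function overBudget(budgets: Budget[]): string[] {
  return budgets
    .filter(b => b.actual > b.max)
    .map(b => `${b.metric}: ${b.actual}${b.unit} exceeds ${b.max}${b.unit}`);
}

// Illustrative values: load time is over budget, heap usage is not.
const violations = overBudget([
  { metric: 'Load time', actual: 2300, max: 2000, unit: 'ms' },
  { metric: 'JS heap', actual: 3, max: 5, unit: 'MB' },
]);
console.log(violations);
```

Inside a test, asserting that the returned array is empty would fail the build while still printing every violated budget.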

Your AI agents can no longer get away with “looks great.” They must prove their work with screenshots, metrics, and grep results.

No more hallucinations. No more fantasy approvals. Just evidence.

That’s what evidence-based QA looks like: run the commands, check the screenshots, require proof.

Your turn: add reality checks to your workflow. Ship with confidence.

FAQ

Why do AI agents hallucinate when reviewing code?
AI agents are trained to be helpful and agreeable. They respond with what sounds good rather than what’s verified. Without evidence requirements, they say “looks great” to avoid conflict.

How do I set up Playwright for screenshot testing?
Install with npm install -D @playwright/test, run npx playwright install chromium, create a config file with viewport breakpoints, and write test suites that capture screenshots at each breakpoint.

What reality check commands should I run before approval?
Run ls to verify files exist, grep to verify claimed features exist in code, and the Playwright tests for screenshots. Then check test-results.json for test outcomes and performance-metrics.json for console errors and load time.

What is the Reality Checker agent?
Reality Checker is a specialized AI agent from The Agency that validates work using evidence. It runs verification commands, reviews screenshots, cross-references claims, and outputs PASS or NEEDS WORK with specific blocking issues.

How do I integrate reality checks into CI/CD?
Add a GitHub Actions workflow that installs Playwright, starts your server, runs screenshot captures, uploads artifacts, and fails the build if console errors exceed 0 or load time exceeds your threshold.

What if screenshots show issues but the agent says PASS?
The agent is misconfigured. Reality Checker must review evidence before outputting status. Update its instructions to require: (1) grep results proving features, (2) screenshot review, (3) metrics within thresholds.

How do I get my team to adopt evidence-based QA?
Document the reality check process, add CI/CD gates that require passing tests, make screenshot review mandatory for PR approval, and track which agents produce the most accurate assessments.
