TL;DR
Stop AI hallucinations with 4 steps: (1) Install Playwright and configure breakpoints (desktop, tablet, mobile), (2) Create screenshot test suite that captures full page, responsive layouts, and interactions, (3) Run ./qa-playwright-capture.sh to collect evidence, (4) Activate Reality Checker agent to cross-reference claims with grep results and screenshots. Agents output PASS or NEEDS WORK with specific blocking issues—no more fantasy approvals.
Introduction
Stop accepting “looks great” from AI agents. Build an evidence-based QA workflow with Playwright screenshots that requires visual proof before any approval.
You ask an AI agent to review your landing page. It responds:
The design looks premium and polished. The glassmorphism effects are well-implemented. The page is fully responsive. Ready for production!
You open the page. The “glassmorphism” is a solid gray background. The “fully responsive” layout breaks on mobile. Nothing is premium or polished.
AI agents hallucinate. They tell you what you want to hear. They avoid conflict. They approve everything.
The Reality Checker agent from The Agency collection takes a different approach:
Status: NEEDS WORK
Evidence:
- grep for "glassmorphism" returned NO PREMIUM FEATURES FOUND
- responsive-mobile.png shows broken layout at 375px width
- performance-metrics.json shows 3 console errors, 2.1s load time
Blocking issues: 4
No feelings. No opinions. Just evidence.
In this tutorial, you’ll build an evidence-based QA workflow that complements your API testing pipeline. Whether you’re validating frontend layouts or verifying API responses in Apidog, the principle is the same: require proof before approval. You’ll set up Playwright for automated screenshot captures, create mandatory reality check commands, cross-reference agent claims with actual code, and require PASS/FAIL certification before shipping.
Why Evidence Matters
AI agents are people-pleasers. They want to help. They want you to like them. So they say what sounds good:
- “The code looks solid!” (never tested)
- “Performance should be great!” (never measured)
- “Fully responsive!” (never checked on mobile)
Evidence-based QA changes this. Instead of opinions, you get:
- Screenshots at desktop, tablet, mobile breakpoints
- Performance metrics from actual page loads
- Grep results proving features exist (or don’t)
- Console error counts from headless browser tests
No more “trust me.” Just proof.
Step 1: Set Up Playwright
Install Playwright:
npm install -D @playwright/test
npx playwright install chromium
Create qa-playwright.config.ts:
import { defineConfig } from '@playwright/test';
export default defineConfig({
testMatch: '**/qa-screenshots.spec.ts',
timeout: 30000,
use: {
baseURL: process.env.BASE_URL || 'http://localhost:8000',
screenshot: 'on',
trace: 'on-first-retry',
headless: true,
},
projects: [
{
name: 'desktop',
use: { viewport: { width: 1920, height: 1080 } },
},
{
name: 'tablet',
use: { viewport: { width: 768, height: 1024 } },
},
{
name: 'mobile',
use: { viewport: { width: 375, height: 667 } },
},
],
reporter: [['json', { outputFile: 'public/qa-screenshots/test-results.json' }]],
outputDir: 'public/qa-screenshots',
});
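The same three viewport sizes are needed again in the responsive test in Step 2. As a minimal sketch, assuming a hypothetical shared module (the names `BREAKPOINTS` and `toProjects` are illustrative and not part of Playwright), the sizes could be defined once and mapped to the `projects` shape used above:

```typescript
// Hypothetical shared breakpoint table; the names here are illustrative.
interface Breakpoint {
  name: string;
  width: number;
  height: number;
}

const BREAKPOINTS: Breakpoint[] = [
  { name: 'desktop', width: 1920, height: 1080 },
  { name: 'tablet', width: 768, height: 1024 },
  { name: 'mobile', width: 375, height: 667 },
];

// Map each breakpoint to the { name, use: { viewport } } shape
// that the `projects` array in the config expects.
function toProjects(breakpoints: Breakpoint[]) {
  return breakpoints.map(({ name, width, height }) => ({
    name,
    use: { viewport: { width, height } },
  }));
}

const projects = toProjects(BREAKPOINTS);
console.log(
  projects
    .map(p => `${p.name}: ${p.use.viewport.width}x${p.use.viewport.height}`)
    .join(', ')
);
```

The config could then use `projects: toProjects(BREAKPOINTS)`, and the responsive test could loop over the same table.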
Step 2: Create Screenshot Test Suite
Create qa-screenshots.spec.ts:
import { test, expect } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';
// Ensure output directory exists
const outputDir = 'public/qa-screenshots';
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
test.describe('Reality Check Screenshots', () => {
test('capture full page at all breakpoints', async ({ page }, testInfo) => {
const errors: string[] = [];
const consoleLogs: string[] = [];
// Capture console errors
page.on('console', msg => {
if (msg.type() === 'error') {
consoleLogs.push(`[ERROR] ${msg.text()}`);
}
});
// Capture network failures
page.on('requestfailed', request => {
errors.push(`[NETWORK] ${request.url()} failed`);
});
// Navigate to page
await page.goto('/');
await page.waitForLoadState('networkidle');
// Read timing and heap usage inside the browser.
// performance.memory is a non-standard, Chromium-only API, so fall back to null elsewhere.
const metrics = await page.evaluate(() => {
const nav = performance.getEntriesByType('navigation')[0] as PerformanceNavigationTiming;
const memory = (performance as any).memory;
return {
jsHeapSize: memory ? memory.usedJSHeapSize : null,
loadTime: Math.round(nav.loadEventEnd - nav.startTime),
domContentLoaded: Math.round(nav.domContentLoadedEventEnd - nav.startTime),
};
});
// Name the file after the Playwright project (desktop, tablet, mobile)
// so the breakpoints do not overwrite each other
const projectName = testInfo.project.name || 'chromium';
await page.screenshot({
path: path.join(outputDir, `full-page-${projectName}.png`),
fullPage: true,
});
// Save metrics (every project writes this file, so the last one to finish wins)
fs.writeFileSync(
path.join(outputDir, 'performance-metrics.json'),
JSON.stringify({ performance: metrics, consoleErrors: consoleLogs, networkErrors: errors }, null, 2)
);
});
test('capture responsive layouts', async ({ page }) => {
const breakpoints = [
{ name: 'desktop', width: 1920, height: 1080 },
{ name: 'tablet', width: 768, height: 1024 },
{ name: 'mobile', width: 375, height: 667 },
];
for (const breakpoint of breakpoints) {
await page.setViewportSize({ width: breakpoint.width, height: breakpoint.height });
await page.goto('/');
await page.waitForLoadState('networkidle');
await page.screenshot({
path: path.join(outputDir, `responsive-${breakpoint.name}.png`),
fullPage: true,
});
}
});
test('capture navigation interactions', async ({ page }) => {
await page.goto('/');
// Use a locator instead of element handles: handles go stale after each navigation,
// while a locator is re-resolved every time it is used
const navLinks = page.locator('nav a, header a, .nav a');
const count = Math.min(await navLinks.count(), 5);
for (let i = 0; i < count; i++) {
await page.screenshot({ path: path.join(outputDir, `nav-${i}-before.png`) });
await navLinks.nth(i).click();
await page.waitForLoadState('networkidle');
await page.screenshot({ path: path.join(outputDir, `nav-${i}-after.png`) });
await page.goBack();
await page.waitForLoadState('networkidle');
}
});
test('capture form interactions', async ({ page }) => {
await page.goto('/');
// Find forms
const forms = await page.$$('form');
for (let i = 0; i < forms.length; i++) {
const form = forms[i];
await form.screenshot({ path: path.join(outputDir, `form-${i}-initial.png`) });
// Fill inputs
const inputs = await form.$$('input[type="text"], input[type="email"], input[type="password"]');
for (const input of inputs) {
await input.fill('test@example.com');
}
await form.screenshot({ path: path.join(outputDir, `form-${i}-filled.png`) });
}
});
test('capture accordion/dropdown interactions', async ({ page }) => {
await page.goto('/');
// Find accordions
const accordions = await page.$$('[data-accordion], details, .accordion');
for (let i = 0; i < accordions.length; i++) {
await accordions[i].screenshot({ path: path.join(outputDir, `accordion-${i}-closed.png`) });
await accordions[i].click();
await page.waitForTimeout(300);
await accordions[i].screenshot({ path: path.join(outputDir, `accordion-${i}-open.png`) });
}
});
});
Step 3: Create Reality Check Script
Create qa-playwright-capture.sh:
#!/usr/bin/env bash
#
# qa-playwright-capture.sh — Run Playwright screenshot captures for reality checking
#
# Usage: ./qa-playwright-capture.sh [BASE_URL] [OUTPUT_DIR]
#
set -euo pipefail
BASE_URL="${1:-http://localhost:8000}"
OUTPUT_DIR="${2:-public/qa-screenshots}"
echo "Starting Reality Check screenshot capture..."
echo " Base URL: $BASE_URL"
echo " Output: $OUTPUT_DIR"
# Ensure output directory exists
mkdir -p "$OUTPUT_DIR"
# Run Playwright tests
export BASE_URL
npx playwright test --config=qa-playwright.config.ts
# Generate summary
echo ""
echo "Generating summary..."
# Count screenshots
SCREENSHOT_COUNT=$(find "$OUTPUT_DIR" -name "*.png" | wc -l)
echo " Screenshots captured: $SCREENSHOT_COUNT"
# Check for console errors
# grep -c prints 0 itself when nothing matches, so only its non-zero exit status needs suppressing
if [ -f "$OUTPUT_DIR/performance-metrics.json" ]; then
ERROR_COUNT=$(grep -c '\[ERROR\]' "$OUTPUT_DIR/performance-metrics.json" || true)
echo " Console errors: $ERROR_COUNT"
fi
# Check load time
if [ -f "$OUTPUT_DIR/performance-metrics.json" ]; then
LOAD_TIME=$(grep -o '"loadTime": [0-9.]*' "$OUTPUT_DIR/performance-metrics.json" | head -1 | awk '{print $2}' || true)
echo " Load time: ${LOAD_TIME:-N/A}ms"
fi
echo ""
echo "Reality Check complete. Review screenshots in: $OUTPUT_DIR"
echo ""
echo "Next step: Run Reality Checker agent to validate evidence"
Make it executable:
chmod +x qa-playwright-capture.sh
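The grep and awk pipelines in the script depend on how the JSON file happens to be formatted. As a hedged alternative, the same summary could be derived by parsing the file. The sketch below assumes the `{ performance, consoleErrors, networkErrors }` shape that the test suite writes to performance-metrics.json. `summarizeMetrics` is a hypothetical helper, and it takes the JSON text as a string so it runs without touching the file system:

```typescript
// Shape written by the full-page test; jsHeapSize may be missing outside Chromium.
interface MetricsFile {
  performance: { loadTime: number; domContentLoaded: number; jsHeapSize?: number | null };
  consoleErrors: string[];
  networkErrors: string[];
}

interface MetricsSummary {
  consoleErrorCount: number;
  networkErrorCount: number;
  loadTimeMs: number;
}

// Hypothetical helper: parse the JSON text and count the evidence.
function summarizeMetrics(jsonText: string): MetricsSummary {
  const data = JSON.parse(jsonText) as MetricsFile;
  return {
    consoleErrorCount: data.consoleErrors.length,
    networkErrorCount: data.networkErrors.length,
    loadTimeMs: data.performance.loadTime,
  };
}

// Sample data for illustration only (not real measurements).
const sampleJson = JSON.stringify({
  performance: { loadTime: 2300, domContentLoaded: 900, jsHeapSize: 1048576 },
  consoleErrors: ['[ERROR] Uncaught TypeError', '[ERROR] Failed to load resource'],
  networkErrors: ['[NETWORK] http://localhost:8000/logo.png failed'],
});

const summary = summarizeMetrics(sampleJson);
console.log(
  `Console errors: ${summary.consoleErrorCount}, ` +
  `network errors: ${summary.networkErrorCount}, ` +
  `load time: ${summary.loadTimeMs}ms`
);
```

In a real run, the string would come from reading public/qa-screenshots/performance-metrics.json.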
Step 4: Run Reality Check Commands
Before any AI agent can approve work, run these commands:
# 1. Verify what was actually built
ls -la resources/views/ || ls -la *.html
ls -la src/components/ || ls -la components/
# 2. Cross-check claimed features
grep -r "glassmorphism\|backdrop-filter\|blur" . --include="*.css" --include="*.html" || echo "NO GLASSMORPHISM FOUND"
grep -r "responsive\|media-query\|@media" . --include="*.css" || echo "NO RESPONSIVE CSS FOUND"
grep -r "jwt\|authentication\|auth" . --include="*.ts" --include="*.js" || echo "NO AUTH FOUND"
# 3. Run screenshot capture
./qa-playwright-capture.sh http://localhost:8000 public/qa-screenshots
# 4. Review evidence
ls -la public/qa-screenshots/
# Expected files:
# - responsive-desktop.png
# - responsive-tablet.png
# - responsive-mobile.png
# - nav-*-before.png, nav-*-after.png
# - form-*-initial.png, form-*-filled.png
# 5. Check metrics
cat public/qa-screenshots/test-results.json
cat public/qa-screenshots/performance-metrics.json
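The grep checks in command 2 come down to one question per claim: does any source file contain a matching pattern? A minimal sketch of that logic, assuming the file contents are already loaded into memory (`verifyClaims`, the patterns, and the sample files are hypothetical and simpler than the grep expressions above):

```typescript
interface FeatureCheck {
  claim: string;
  pattern: RegExp; // avoid the `g` flag: it makes RegExp.test() stateful
}

interface FeatureResult {
  claim: string;
  found: boolean;
  matchedFiles: string[];
}

// For each claim, list the files whose content matches the pattern.
function verifyClaims(files: Record<string, string>, checks: FeatureCheck[]): FeatureResult[] {
  return checks.map(({ claim, pattern }) => {
    const matchedFiles = Object.keys(files).filter(name => pattern.test(files[name]));
    return { claim, found: matchedFiles.length > 0, matchedFiles };
  });
}

// Sample project contents for illustration only.
const files: Record<string, string> = {
  'styles.css': '.card { background: gray; }\n@media (max-width: 768px) { .grid { display: block; } }',
  'index.html': '<div class="card">Hello</div>',
};

const checks: FeatureCheck[] = [
  { claim: 'glassmorphism', pattern: /glassmorphism|backdrop-filter|blur/i },
  { claim: 'responsive CSS', pattern: /@media/ },
];

const results = verifyClaims(files, checks);
for (const r of results) {
  console.log(`${r.claim}: ${r.found ? 'FOUND in ' + r.matchedFiles.join(', ') : 'NOT FOUND'}`);
}
```

A match only proves that the text is present, not that the feature works, which is why the screenshots remain part of the evidence.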
Step 5: Activate Reality Checker Agent
Open a Claude Code session:
Activate Reality Checker mode.
Run your mandatory reality check process:
1. Verify files exist: ls -la src/components/
2. Cross-reference claimed features: grep for "premium", "glassmorphism"
3. Review screenshot evidence: public/qa-screenshots/
4. Check test-results.json and performance-metrics.json for metrics
Project URL: http://localhost:8000
Output: PASS or NEEDS WORK with specific blocking issues.
Expected output:
## Reality Check Results
### File Verification: PASS
- Component files present: 12 files found
- Expected structure matches
### Feature Verification: NEEDS WORK
- Claimed: "Premium glassmorphism design"
- grep result: NO GLASSMORPHISM FOUND
- Status: CLAIM NOT SUPPORTED
### Screenshot Evidence: NEEDS WORK
- Desktop (1920x1080): Layout correct
- Tablet (768x1024): Navigation overlap detected
- Mobile (375x667): Product grid broken (2 columns instead of 1)
### Performance Metrics: NEEDS WORK
- Load time: 2.3s (target: <1s)
- Console errors: 3 (target: 0)
- Network failures: 1 (target: 0)
## Final Status: NEEDS WORK
### Blocking Issues:
1. Glassmorphism claimed but not implemented
2. Mobile layout broken at 375px
3. Load time exceeds 1s target
4. 3 console errors to fix
### Non-Blocking:
- Tablet navigation overlap
- Add loading states
Do not approve until blocking issues are resolved.
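The PASS or NEEDS WORK decision in a report like this can be expressed as a plain rule: any finding that misses its target becomes a blocking issue, and the status is PASS only when none remain. The sketch below is one possible encoding, not the agent's actual logic. The `Evidence` shape and `certify` function are hypothetical, the targets are copied from the sample report, and which findings count as blocking is a policy choice:

```typescript
interface Evidence {
  unsupportedClaims: string[]; // claims whose grep check found nothing
  brokenLayouts: string[];     // breakpoints flagged during screenshot review
  loadTimeMs: number;
  consoleErrors: number;
  networkFailures: number;
}

interface Verdict {
  status: 'PASS' | 'NEEDS WORK';
  blockingIssues: string[];
}

// Targets taken from the sample report; adjust per project.
const TARGETS = { loadTimeMs: 1000, consoleErrors: 0, networkFailures: 0 };

function certify(evidence: Evidence): Verdict {
  const blockingIssues: string[] = [];
  for (const claim of evidence.unsupportedClaims) {
    blockingIssues.push(`${claim} claimed but not implemented`);
  }
  for (const layout of evidence.brokenLayouts) {
    blockingIssues.push(`Layout broken at ${layout}`);
  }
  if (evidence.loadTimeMs >= TARGETS.loadTimeMs) {
    blockingIssues.push(`Load time ${evidence.loadTimeMs}ms misses the <${TARGETS.loadTimeMs}ms target`);
  }
  if (evidence.consoleErrors > TARGETS.consoleErrors) {
    blockingIssues.push(`${evidence.consoleErrors} console errors to fix`);
  }
  if (evidence.networkFailures > TARGETS.networkFailures) {
    blockingIssues.push(`${evidence.networkFailures} network failures to fix`);
  }
  return { status: blockingIssues.length === 0 ? 'PASS' : 'NEEDS WORK', blockingIssues };
}

// Illustrative evidence, loosely based on the sample report.
const verdict = certify({
  unsupportedClaims: ['Glassmorphism'],
  brokenLayouts: ['mobile (375px)'],
  loadTimeMs: 2300,
  consoleErrors: 3,
  networkFailures: 0,
});
console.log(verdict.status);
verdict.blockingIssues.forEach((issue, i) => console.log(`${i + 1}. ${issue}`));
```

Keeping the rule in code means the verdict does not depend on how persuasive the agent's prose is.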
Step 6: Cross-Reference Claims with Evidence
Create a claims checklist:
## Claims vs. Evidence Checklist
| Claim | Evidence Command | Result |
|-------|------------------|--------|
| "Premium glassmorphism" | grep "backdrop-filter" | NOT FOUND |
| "Fully responsive" | responsive-mobile.png | FAIL (broken grid) |
| "No console errors" | performance-metrics.json | 3 errors found |
| "Fast load time" | performance-metrics.json | 2.3s (target: <1s) |
| "JWT authentication" | grep "jsonwebtoken" | FOUND |
| "Rate limiting" | grep "rateLimit" | NOT FOUND |
Update this checklist for every project. Require evidence for every claim.
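To keep the checklist current without hand-editing, it could be generated from structured data. A small sketch, assuming a hypothetical `ClaimRow` record per claim (the `supported` flag is an addition for illustration, not part of the table above):

```typescript
interface ClaimRow {
  claim: string;
  evidence: string;   // command or file used as evidence
  result: string;     // what the evidence showed
  supported: boolean; // whether the evidence backs the claim
}

// Render rows in the same three-column table layout as the checklist.
function renderChecklist(rows: ClaimRow[]): string {
  const header = [
    '| Claim | Evidence Command | Result |',
    '|-------|------------------|--------|',
  ];
  const body = rows.map(r => `| "${r.claim}" | ${r.evidence} | ${r.result} |`);
  return header.concat(body).join('\n');
}

// List the claims that still lack supporting evidence.
function unsupportedClaims(rows: ClaimRow[]): string[] {
  return rows.filter(r => !r.supported).map(r => r.claim);
}

const rows: ClaimRow[] = [
  { claim: 'Premium glassmorphism', evidence: 'grep "backdrop-filter"', result: 'NOT FOUND', supported: false },
  { claim: 'JWT authentication', evidence: 'grep "jsonwebtoken"', result: 'FOUND', supported: true },
];

console.log(renderChecklist(rows));
console.log('Unsupported:', unsupportedClaims(rows).join(', '));
```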
Complete Reality Check Workflow
┌─────────────────────────────────────────────────────────────────┐
│ 1. Developer/AI completes work │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 2. Run reality check commands │
│ - ls to verify files │
│ - grep to verify features │
│ - Playwright for screenshots │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 3. Activate Reality Checker agent │
│ - Review file verification │
│ - Cross-reference claims │
│ - Analyze screenshots │
│ - Check metrics │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ 4. Output: PASS or NEEDS WORK │
│ - PASS: Ship with confidence │
│ - NEEDS WORK: Fix blocking issues, re-run │
└─────────────────────────────────────────────────────────────────┘
Integration with CI/CD
Add reality checks to your CI pipeline:
# .github/workflows/qa-reality-check.yml
name: Reality Check
on: [pull_request]
jobs:
  reality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright
        run: npx playwright install --with-deps chromium
      - name: Start server
        run: npm start &
        env:
          PORT: 8000
      - name: Wait for server
        run: sleep 5
      - name: Run reality check screenshots
        run: ./qa-playwright-capture.sh http://localhost:8000 public/qa-screenshots
      - name: Upload screenshots
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: reality-check-screenshots
          path: public/qa-screenshots/
      - name: Check for console errors
        run: |
          # grep -c prints 0 itself when nothing matches; only suppress its exit status
          ERRORS=$(grep -c '\[ERROR\]' public/qa-screenshots/performance-metrics.json || true)
          if [ "$ERRORS" -gt 0 ]; then
            echo "Console errors found: $ERRORS"
            exit 1
          fi
      - name: Check load time
        run: |
          LOAD_TIME=$(grep -o '"loadTime": [0-9.]*' public/qa-screenshots/performance-metrics.json | head -1 | awk '{print $2}')
          if [ -z "$LOAD_TIME" ]; then
            echo "Load time missing from performance-metrics.json"
            exit 1
          fi
          if (( $(echo "$LOAD_TIME > 1000" | bc -l) )); then
            echo "Load time too slow: ${LOAD_TIME}ms (target: <1000ms)"
            exit 1
          fi
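A fixed `sleep 5` can be too short on a slow runner and wasteful on a fast one. One common alternative is to poll until the server responds. The sketch below shows the retry loop only. `waitUntilReady` is a hypothetical helper, and the probe is passed in as a function so the example runs with a stub instead of a real server:

```typescript
// Hypothetical helper: call `probe` until it reports ready or the retries run out.
async function waitUntilReady(
  probe: () => Promise<boolean>,
  options: { retries: number; delayMs: number },
): Promise<boolean> {
  for (let attempt = 1; attempt <= options.retries; attempt++) {
    if (await probe()) return true;
    if (attempt < options.retries) {
      // Pause before the next attempt.
      await new Promise<void>(resolve => setTimeout(resolve, options.delayMs));
    }
  }
  return false;
}

// Stub probe for illustration: reports ready on the third attempt.
let attempts = 0;
const stubProbe = async () => ++attempts >= 3;

waitUntilReady(stubProbe, { retries: 5, delayMs: 10 }).then(ready => {
  console.log(ready ? `Ready after ${attempts} attempts` : 'Server never became ready');
});
```

In CI, the probe could be something like `() => fetch('http://localhost:8000').then(r => r.ok, () => false)`, assuming a Node version with a global `fetch`.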
What You Built
| Component | Purpose |
|---|---|
| Playwright config | Automated screenshot captures at 3 breakpoints |
| Test suite | Full page, responsive, interactions |
| Reality check script | One-command evidence collection |
| Claims checklist | Cross-reference AI claims with grep results |
| CI/CD integration | Automated reality checks on PR |
Next Steps
Extend the workflow:
- Add Lighthouse integration for performance scores
- Add accessibility audits (axe-core)
- Add visual regression testing (pixel comparison)
Build a claims database:
- Log every AI claim with evidence status
- Track which agents hallucinate most
- Create accuracy scores over time
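As a rough sketch of what an accuracy score could look like, the example below computes the share of supported claims per agent from a claim log. The `ClaimLogEntry` shape, the agent names, and the data are all hypothetical:

```typescript
interface ClaimLogEntry {
  agent: string;      // hypothetical agent identifier
  claim: string;
  supported: boolean; // true when the evidence backed the claim
}

// Accuracy = supported claims / total claims, per agent.
function accuracyByAgent(log: ClaimLogEntry[]): Record<string, number> {
  const tally: Record<string, { supported: number; total: number }> = {};
  for (const entry of log) {
    const t = tally[entry.agent] || { supported: 0, total: 0 };
    t.total += 1;
    if (entry.supported) t.supported += 1;
    tally[entry.agent] = t;
  }
  const scores: Record<string, number> = {};
  for (const agent of Object.keys(tally)) {
    scores[agent] = tally[agent].supported / tally[agent].total;
  }
  return scores;
}

// Illustrative log entries, not real results.
const claimLog: ClaimLogEntry[] = [
  { agent: 'agent-a', claim: 'Premium glassmorphism', supported: false },
  { agent: 'agent-a', claim: 'Fully responsive', supported: false },
  { agent: 'agent-a', claim: 'No console errors', supported: false },
  { agent: 'agent-a', claim: 'JWT authentication', supported: true },
  { agent: 'agent-b', claim: 'Fully responsive', supported: true },
  { agent: 'agent-b', claim: 'Fast load time', supported: true },
];

const scores = accuracyByAgent(claimLog);
for (const agent of Object.keys(scores)) {
  console.log(`${agent}: ${(scores[agent] * 100).toFixed(0)}% of claims supported`);
}
```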
Share with your team:
- Document the reality check process
- Require evidence before any PR approval
- Make “show me the screenshots” a team habit
Troubleshooting Common Issues
Playwright tests timeout:
- Increase timeout in config: timeout: 60000
- Check if the server is running: curl http://localhost:8000
- Add longer wait for network idle: await page.waitForLoadState('networkidle', { timeout: 30000 })
- Run in headed mode to debug: npx playwright test --headed
Screenshots not capturing:
- Verify output directory exists and is writable: mkdir -p public/qa-screenshots
- Check file permissions: chmod 755 public/qa-screenshots
- Ensure Chromium is installed: npx playwright install chromium
- Run with debug output: DEBUG=pw:api npx playwright test
Console errors not being captured:
- Add the listener before navigation: page.on('console', ...) before page.goto()
- Check error type filtering: msg.type() === 'error'
- Log all console messages to debug: page.on('console', msg => console.log(msg.text()))
- Verify the page actually loads content (no blank screenshots)
Mobile screenshots show desktop layout:
- Ensure viewport is set before navigation
- Add a mobile user agent in the project config: use: { userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X)' }
- Use device emulation: use: { ...devices['iPhone 12'] } (import devices from '@playwright/test'; this descriptor defaults to WebKit, so also run npx playwright install webkit, or pick a Chromium-based descriptor such as devices['Pixel 5'])
- Check for responsive meta tag in HTML: <meta name="viewport" content="width=device-width">
CI/CD pipeline fails on Ubuntu:
- Install system dependencies: sudo apt-get install -y libnss3 libnspr4 libatk1.0-0
- Use Playwright’s official image: mcr.microsoft.com/playwright:v1.40.0-jammy
- Run npx playwright install-deps before installing browsers
- Add the --no-sandbox flag for containerized environments: use: { launchOptions: { args: ['--no-sandbox'] } }
Advanced Reality Check Patterns
Pattern 1: Visual Regression Testing
Compare screenshots against baselines to catch unintended changes:
import { test, expect } from '@playwright/test';
test('visual regression check', async ({ page }) => {
await page.goto('/');
await expect(page).toHaveScreenshot('homepage-base.png', {
maxDiffPixels: 100, // Allow minor differences
fullPage: true,
});
});
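The `maxDiffPixels` option is a budget on how many pixels may differ from the baseline. As a simplified illustration of that idea only (this is not Playwright's comparison algorithm, and `countDiffPixels` is a hypothetical function), counting differing pixels between two same-sized RGBA buffers can be sketched as:

```typescript
// Count pixels where any RGBA channel differs by more than `tolerance`.
function countDiffPixels(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error('Buffers must be the same size and hold whole RGBA pixels');
  }
  let diff = 0;
  for (let i = 0; i < a.length; i += 4) {
    let differs = false;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        differs = true;
        break;
      }
    }
    if (differs) diff++;
  }
  return diff;
}

// Two 2x2 images (4 RGBA pixels each); only the last pixel differs.
const baseline = new Uint8Array([
  255, 255, 255, 255,   255, 255, 255, 255,
  255, 255, 255, 255,   0, 0, 0, 255,
]);
const current = new Uint8Array([
  255, 255, 255, 255,   255, 255, 255, 255,
  255, 255, 255, 255,   10, 0, 0, 255,
]);

const maxDiffPixels = 0; // illustrative budget
const diffCount = countDiffPixels(baseline, current);
console.log(diffCount <= maxDiffPixels ? 'within budget' : `over budget: ${diffCount} pixel(s) differ`);
```

A nonzero budget such as the 100 pixels used above tolerates small rendering noise at the cost of possibly missing small regressions.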
Pattern 2: Accessibility Audit
Integrate axe-core for accessibility evidence:
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';
import * as fs from 'fs';
test('accessibility audit', async ({ page }) => {
await page.goto('/');
const accessibilityScanResults = await new AxeBuilder({ page }).analyze();
// Save results
fs.writeFileSync(
'public/qa-screenshots/accessibility-results.json',
JSON.stringify(accessibilityScanResults, null, 2)
);
// Fail if critical violations
const criticalViolations = accessibilityScanResults.violations.filter(
v => v.impact === 'critical' || v.impact === 'serious'
);
expect(criticalViolations).toHaveLength(0);
});
Pattern 3: Performance Budget Enforcement
Fail builds that exceed performance thresholds:
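import { test, expect } from '@playwright/test';
test('performance budget', async ({ page }) => {
await page.goto('/');
// Wait until the page has settled so loadEventEnd is populated
await page.waitForLoadState('networkidle');
// Read timing and heap usage inside the browser.
// performance.memory is a non-standard, Chromium-only API; treat it as 0 elsewhere.
const { loadTime, jsHeapUsed } = await page.evaluate(() => {
const nav = performance.getEntriesByType('navigation')[0] as PerformanceNavigationTiming;
const memory = (performance as any).memory;
return {
loadTime: nav.loadEventEnd - nav.startTime,
jsHeapUsed: memory ? memory.usedJSHeapSize : 0,
};
});
// Budget thresholds
expect(loadTime).toBeLessThan(2000); // 2s max
expect(jsHeapUsed).toBeLessThan(5 * 1024 * 1024); // 5MB max
});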
Your AI agents can no longer get away with “looks great.” They must prove their work with screenshots, metrics, and grep results.
No more hallucinations. No more fantasy approvals. Just evidence.
That’s what evidence-based QA looks like: run the commands, check the screenshots, require proof.
Your turn: add reality checks to your workflow. Ship with confidence.
FAQ
Why do AI agents hallucinate when reviewing code?
AI agents are trained to be helpful and agreeable. They respond with what sounds good rather than what’s verified. Without evidence requirements, they say “looks great” to avoid conflict.
How do I set up Playwright for screenshot testing?
Install with npm install -D @playwright/test, run npx playwright install chromium, create a config file with viewport breakpoints, and write test suites that capture screenshots at each breakpoint.
What reality check commands should I run before approval?
Run ls to verify files exist, grep to verify claimed features exist in code, and the Playwright tests for screenshots. Then check test-results.json for test outcomes and performance-metrics.json for console errors and load time.
What is the Reality Checker agent?
Reality Checker is a specialized AI agent from The Agency that validates work using evidence. It runs verification commands, reviews screenshots, cross-references claims, and outputs PASS or NEEDS WORK with specific blocking issues.
How do I integrate reality checks into CI/CD?
Add a GitHub Actions workflow that installs Playwright, starts your server, runs screenshot captures, uploads artifacts, and fails the build if console errors exceed 0 or load time exceeds your threshold.
What if screenshots show issues but the agent says PASS?
The agent is misconfigured. Reality Checker must review evidence before outputting status. Update its instructions to require: (1) grep results proving features, (2) screenshot review, (3) metrics within thresholds.
How do I get my team to adopt evidence-based QA?
Document the reality check process, add CI/CD gates that require passing tests, make screenshot review mandatory for PR approval, and track which agents produce the most accurate assessments.



