
UX Visual Development & Testing Quality Plan

Problem Statement

The April 7 V2 design system implementation session revealed systematic quality gaps in our AI-assisted UI development workflow. Over 17 iteration rounds in a single session:

  1. Agent self-approved broken output. Claude Code took screenshots, analyzed them with vision, and declared "icons are crisp, properly centered" when Material Symbols font wasn't loading (icons showed as text) and when a fingerprint icon was actually a person silhouette. LLM vision is a semantic tool, not a pixel-accurate one.

  2. External dependency failure. Material Symbols font loaded from Google CDN broke silently in Docker — no error, just text fallback. The agent didn't catch it because the CI/headless environment also silently degrades.

  3. Excessive back-and-forth. The session required 17 rounds of "change → deploy → screenshot → CEO feedback → fix → repeat" to converge on the correct design. Each round took 5-15 minutes. The overview page alone needed 8 iterations.

  4. Design-to-code gap. Despite having Stitch mockups and a DESIGN.md, the first implementation diverged significantly from the design intent — wrong page architecture, wrong content model, missing visual elements.

  5. No automated guard rails. The visual QA script (visual-qa.ts) checks for render errors and blank pages, but doesn't validate visual correctness against a reference. The reg-cli pixel diff runs in CI but only compares before/after — not against the design mockup.

Current Tooling

| Tool | Purpose | Gap |
| --- | --- | --- |
| Google Stitch | AI design mockup generation | Good for ideation, but mockup-to-code fidelity is low. Stitch HTML is not production code. |
| Stitch MCP | Pull designs into Claude Code context | Works for design data, but cannot validate implementation matches design. |
| DESIGN.md | Design system spec for agents | Agents read it but don't mechanically validate against it. |
| visual-qa.ts | Render error detection | Catches blank pages, JS errors, broken nav. Does NOT catch wrong icons, wrong colors, wrong layout. |
| visual-screenshot.ts | Full-page screenshot capture | Captures pages but no comparison against reference design. |
| visual-diff-report.ts | Before/after pixel diff (reg-cli) | Catches regressions but not design compliance. Deployed to Cloudflare Pages for human review. |
| agent-browser | Exploratory testing | Good for ad-hoc bug hunts. Not reliable for visual validation (headless font issues). |
| Playwright | Browser automation | Already installed. Visual regression (toHaveScreenshot()) NOT yet configured. |

Session Analysis: What Went Wrong

Round-by-Round Breakdown

| Round | What happened | Root cause |
| --- | --- | --- |
| 1-3 | Design tokens + layout shell | Correct: foundational work |
| 4-5 | Overview redesign | Used generic stats instead of cluster data. Agent didn't understand the Stitch mockup's information architecture and only copied surface styling. |
| 6-7 | Remediation Brief page | Close but wrong terminology, wrong button state, missing fields. Agent read the CEO spec but implemented imprecisely. |
| 8 | Adversarial review catches 3 blocking issues | Good: external review found what self-review missed. |
| 9-10 | Cross-review finds nav labels, branding, terminology gaps | Pattern: surface-level changes were applied but deeper meaning was missed. |
| 11 | CEO round 2 feedback: "page architecture is still the old model" | Agent applied new visual shell to old content model. Fundamental misunderstanding of intent. |
| 12-13 | Cluster-driven overview + remediation hierarchy fixes | Correct direction but icon implementation broke. |
| 14 | Material Symbols font fails in Docker | Agent chose Google CDN without considering deployment environment. |
| 15 | Agent says "icons are crisp" on broken screenshot | LLM vision cannot reliably validate pixel-level rendering. |
| 16 | Fingerprint icon is actually a person silhouette | Agent hand-drew SVG path instead of copying from a verified source. |
| 17 | Agent approves wrong icon AGAIN in screenshot review | Same fundamental problem: agent cannot self-approve visual output. |

Key Failure Patterns

  1. Surface-level interpretation. Agent copies visual styling but misses information architecture intent. The Stitch mockup showed a cluster-driven hero; the agent built a generic-stats hero with the same colors.

  2. Self-approval bias. Agent consistently approved its own visual output. When shown a screenshot, it described what it expected to see rather than what was actually rendered.

  3. External dependency blindness. Agent added Google CDN fonts without considering that Docker containers may not have internet access. No automated check caught this.

  4. SVG path fabrication. Instead of copying a verified icon from lucide.dev or another source, the agent drew SVG paths freehand — producing wrong shapes.

  5. Iteration cost accumulation. Each feedback round required: read feedback → understand intent → modify code → typecheck → Docker rebuild (~30s) → seed data → set tenant → screenshot → verify. At 5-15 minutes per round, 17 rounds add up to 85-255 minutes, or roughly 1.5-4 hours of iteration.

Proposed Solution: 5 Prevention Layers

Layer 1: Design Intent Capture (Before Code)

Problem: Agent misinterprets design intent.
Solution: Structured design brief that forces understanding before implementation.

Before any UI implementation, create a DESIGN-INTENT.md per feature that answers:

  • What is the information architecture? (not visual style — the content model)
  • What data drives each section? (API hook → UI section mapping)
  • What is the reading order? (what does the user see first, second, third)
  • What are the interaction targets? (what is clickable, where does it go)
  • What is the acceptance test? (specific assertions, not "looks like the mockup")

This document should be reviewed by the product owner BEFORE implementation starts — not after.
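
One way to make the brief mechanically checkable before it reaches the product owner is to verify that DESIGN-INTENT.md actually answers all five questions. The sketch below is a minimal illustration, not an existing script: the heading titles and the function name missingIntentSections are assumptions, since the plan only lists the questions and not a file layout.

```typescript
// Hypothetical section headings for DESIGN-INTENT.md. The plan lists five
// questions but no exact titles, so these names are an assumption.
const REQUIRED_INTENT_SECTIONS: string[] = [
  "Information Architecture",
  "Data Sources",
  "Reading Order",
  "Interaction Targets",
  "Acceptance Test",
];

// Returns the required sections that are absent or have an empty body.
function missingIntentSections(markdown: string): string[] {
  // Null-prototype object used as a plain heading -> body-length map.
  const bodyLength: { [heading: string]: number } = Object.create(null);
  const lines = markdown.split(/\r?\n/);
  let current: string | null = null;

  for (let i = 0; i < lines.length; i++) {
    const heading = /^#{1,6}\s+(.*\S)\s*$/.exec(lines[i]);
    if (heading !== null) {
      // A new heading starts a new section.
      current = heading[1].toLowerCase();
      if (bodyLength[current] === undefined) bodyLength[current] = 0;
    } else if (current !== null) {
      // Count non-whitespace body text under the current heading.
      bodyLength[current] += lines[i].trim().length;
    }
  }

  return REQUIRED_INTENT_SECTIONS.filter((section) => {
    const length = bodyLength[section.toLowerCase()];
    return length === undefined || length === 0;
  });
}
```

A check like this only proves the brief is complete, not that it is correct; the product-owner review remains the real gate.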

Layer 2: Design Token + Constraint Linting (Build Time)

Problem: Design system rules violated, external deps introduced.
Solution: Automated lint checks.

  • Grep-based pre-commit check blocking external CDN font/icon URLs
  • ESLint rule or custom lint for hardcoded Tailwind colors (should use tokens)
  • Design constraint validation: flag border- classes that violate the No-Line rule
  • Font dependency audit: flag any @import url() that points at a remote host instead of a bundled file
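
The CDN and font checks above could share one detection routine. The following is a rough sketch under stated assumptions: the blocklist of hosts and the names findExternalFontUrls and CdnViolation are illustrative, and the plan's own pre-commit script may well be implemented differently (for example as a grep in shell).

```typescript
// Hosts treated as external font/icon CDNs. This list is an assumption for
// illustration; extend it to match the project's actual policy.
const BLOCKED_FONT_HOSTS: string[] = [
  "fonts.googleapis.com",
  "fonts.gstatic.com",
  "use.typekit.net",
  "cdnjs.cloudflare.com",
];

interface CdnViolation {
  line: number; // 1-based line number in the scanned source
  url: string; // the offending URL as written in the file
}

// Scans source text (CSS, HTML, TSX) for http(s) or protocol-relative URLs
// whose host is on the blocklist.
function findExternalFontUrls(source: string): CdnViolation[] {
  const violations: CdnViolation[] = [];
  const lines = source.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    // A fresh regex per line keeps lastIndex state local to that line.
    const urlPattern = /(?:https?:)?\/\/([a-z0-9.-]+)[^\s"')]*/gi;
    let match: RegExpExecArray | null;
    while ((match = urlPattern.exec(lines[i])) !== null) {
      const host = match[1].toLowerCase();
      if (BLOCKED_FONT_HOSTS.indexOf(host) !== -1) {
        violations.push({ line: i + 1, url: match[0] });
      }
    }
  }
  return violations;
}
```

A pre-commit hook would run this over staged files and exit non-zero when the returned list is non-empty. A blocklist is the conservative choice here; an allowlist of permitted hosts would be stricter but needs more upkeep.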

Layer 3: Component-Level Visual Regression (Test Time)

Problem: Wrong icon shape, broken rendering, layout drift.
Solution: Playwright toHaveScreenshot() with golden files.

  • Per-icon golden file tests (1% threshold)
  • Per-section golden files for key page areas (hero, visualization, cards)
  • Full-page golden files for all routes
  • Baselines generated in CI (Linux), never locally (macOS font differences)
  • Update baselines via dedicated PR (--update-snapshots)
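
One reading of the 1% threshold is "at most 1% of pixels may differ from the golden file", which is what Playwright's maxDiffPixelRatio option for toHaveScreenshot() expresses. The sketch below illustrates only that arithmetic on raw RGBA buffers; it is not Playwright's comparison algorithm, and the function names diffPixelRatio and withinThreshold are made up for this example.

```typescript
// Fraction of pixels that differ between two same-sized RGBA buffers.
// A pixel counts as different when any channel differs by more than
// `channelTolerance` (0 means exact match required).
function diffPixelRatio(
  baseline: Uint8Array,
  actual: Uint8Array,
  channelTolerance: number = 0,
): number {
  if (baseline.length !== actual.length || baseline.length % 4 !== 0) {
    throw new Error("Buffers must be RGBA data with identical dimensions");
  }
  const pixelCount = baseline.length / 4;
  if (pixelCount === 0) return 0;

  let differing = 0;
  for (let p = 0; p < pixelCount; p++) {
    const offset = p * 4;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(baseline[offset + c] - actual[offset + c]) > channelTolerance) {
        differing++;
        break; // count each pixel at most once
      }
    }
  }
  return differing / pixelCount;
}

// True when the screenshot stays within the allowed ratio (default 1%).
function withinThreshold(
  baseline: Uint8Array,
  actual: Uint8Array,
  maxRatio: number = 0.01,
): boolean {
  return diffPixelRatio(baseline, actual) <= maxRatio;
}
```

For small icons a ratio threshold can hide real defects: on a 24×24 icon, 1% is fewer than six pixels, so a per-icon test may need a tighter limit than a full-page one.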

Layer 4: Agent Behavior Rules (Development Time)

Problem: Agent self-approves visual output.
Solution: Explicit rules in .claude/rules/.

  • NEVER self-approve visual changes. Run automated checks, attach screenshots for human review.
  • NEVER hand-draw SVG paths. Copy from lucide.dev, tabler-icons.io, or heroicons.
  • NEVER use external CDN fonts. Use lucide-react or inline SVGs.
  • ALWAYS run visual-qa.ts before committing UI changes.
  • ALWAYS take a screenshot and describe what you ACTUALLY see (not what you expect).
  • ALWAYS compare against the Stitch mockup when one exists.

Layer 5: Human-in-the-Loop Review (Pre-Merge)

Problem: Automated checks have limits; some visual quality requires human eyes.
Solution: Structured visual review in PR workflow.

  • CI generates visual diff report → Cloudflare Pages
  • PR template includes "Visual Review" checklist
  • Stitch mockup screenshot attached to PR for side-by-side comparison
  • CEO/designer approval required for pages that changed visually

Design Tool Strategy

Current: Google Stitch

Strengths: AI-generated mockups, fast ideation, DESIGN.md export, MCP integration.
Weaknesses: Mockup-to-code fidelity is low. Stitch HTML is not production code. The agent interprets mockups loosely, leading to divergence.

Potential Addition: Figma

When to add Figma:

  • When we have a dedicated designer (not AI-generated mockups)
  • When we need pixel-precise handoff (Figma Dev Mode)
  • When we need component-level design specs (not just page screenshots)

Figma + Stitch workflow:

  • Stitch for rapid ideation and first-pass mockups
  • Figma for refinement, precise specs, and developer handoff
  • Figma golden files as Playwright baselines

Figma is not needed yet if we implement Layers 1-5 properly. The bigger problem is process, not tools.

Design-to-Code Validation: Reference Image Pipeline

Regardless of design tool, the key missing piece is:

Design mockup (Stitch/Figma) → Reference screenshot
↕ comparison
Implementation screenshot → Pixel diff report

This pipeline should run automatically in CI. The reference screenshots are the "golden files" that Playwright compares against.

Implementation Roadmap

Phase 1: Immediate (This Week)

  • Add agent behavior rules to .claude/rules/visual-review-tooling.md
  • Add lint-icon-deps.sh pre-commit check
  • Add checkFontLoading() and checkIconIntegrity() to visual-qa.ts
  • Create GitHub issue for regression test suite
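
For the checkFontLoading() item above, the core decision can be kept as a pure function so it is testable without a browser. This is a sketch of one possible shape, not the planned implementation: FontFaceRecord and findMissingFonts are hypothetical names, and the record fields mirror the family and status properties of the browser's FontFace objects.

```typescript
// Mirrors the `family` and `status` fields of browser FontFace objects.
interface FontFaceRecord {
  family: string;
  status: "unloaded" | "loading" | "loaded" | "error";
}

// Returns the required font families that have no face in the "loaded"
// state. A family with no record at all is also reported, which covers the
// case where the stylesheet declaring it never arrived.
function findMissingFonts(
  required: string[],
  faces: FontFaceRecord[],
): string[] {
  // Some browsers report family names wrapped in quotes; strip them.
  const normalize = (family: string): string =>
    family.trim().replace(/^["']|["']$/g, "").toLowerCase();

  const loaded: { [family: string]: boolean } = Object.create(null);
  faces.forEach((face) => {
    if (face.status === "loaded") loaded[normalize(face.family)] = true;
  });

  return required.filter((family) => loaded[normalize(family)] !== true);
}
```

In visual-qa.ts the records would presumably be collected inside the page, after awaiting document.fonts.ready, by mapping each entry of document.fonts to its family and status. With "Material Symbols Outlined" in the required list, the silent Docker failure described earlier would surface as a non-empty result instead of a text fallback.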

Phase 2: Short-Term (Next Sprint)

  • Set up Playwright Test runner with playwright.config.ts
  • Create icons.spec.ts with golden-file icon tests
  • Create pages.spec.ts with per-route full-page screenshots
  • Generate initial baselines from known-good build
  • Add visual regression job to CI

Phase 3: Medium-Term (Next Month)

  • Add component-level golden files for key UI sections
  • Evaluate Figma for design refinement (if designer joins)
  • Build Stitch-to-Playwright golden file pipeline
  • Add design token validation (automated color/spacing checks)
  • Implement structured design intent documents per feature

Phase 4: Long-Term (Quarterly)

  • Cross-browser visual regression (Firefox, Safari)
  • Accessibility regression testing
  • Performance budget monitoring (CSS bundle size, load time)
  • Design system component library with Storybook (if warranted)

Success Metrics

  1. Zero "agent said correct but wasn't" incidents — enforced by automated checks + human review.
  2. First-implementation accuracy > 80% — measured by rounds of feedback needed (target: ≤ 3 rounds, down from 17).
  3. Visual regression catch rate > 95%, measured as the share of visual bugs caught by CI out of all visual bugs found (by CI and by humans).
  4. Design-to-code cycle time < 2 hours per page — from mockup approval to merged PR.

Industry Research: AI-Assisted UI Development Landscape (2026)

Design Tool Chain: Stitch → Figma → Claude Code

The recommended 2026 workflow chains tools by strength:

| Tool | Role | When |
| --- | --- | --- |
| Google Stitch | Explore: generate mockups from text/voice | Starting from zero, rapid ideation |
| Figma + Dev Mode | Refine: pixel-precise specs, design system | Designer refinement, developer handoff |
| v0 (Vercel) | Component generation: React + shadcn/ui | Individual component implementation |
| Claude Code + MCP | Full implementation: reads design context | Entire page/feature builds |

Key insight: Stitch and Figma are not competitors — they chain. Stitch generates first-pass designs in minutes (free), Figma allows designer refinement ($15/editor/mo Dev Mode), then Figma MCP feeds precise specs to Claude Code.

Figma MCP (official): claude plugin install figma@claude-plugins-official — extracts exact font sizes, spacing, colors, component structure from Figma files. Simpler setup than Stitch MCP. Supports bidirectional flow (Claude Code can push UIs back to Figma as editable layers).

Emerging Tools for Agent Visual Validation

| Tool | What | Maturity |
| --- | --- | --- |
| ProofShot | Records browser sessions as proof artifacts for AI agents. Video + screenshots + errors bundled for human review. | New (March 2026), open source |
| claude-code-frontend-dev | 8-agent visual testing plugin. Auto-launches dev server → captures screenshot → Claude vision analyzes → iterates up to 5x. | Open source, experimental |
| Playwright MCP (@playwright/mcp) | Microsoft's MCP server. AI agent controls browser via accessibility tree. Most mature MCP for testing. | Widely adopted since mid-2025 |
| Momentic | AI-native E2E testing with self-healing locators. 99% false positive reduction. YC-backed. | SaaS, team pricing |

The Measurement-Based Approach (Most Promising)

This pattern removes subjective visual judgment from the loop entirely:

  1. Extract spec from Figma via MCP (exact pixel values)
  2. Render implementation in headless Playwright
  3. Measure with page.evaluate() + getComputedStyle() (actual rendered values)
  4. Feed numerical deltas to Claude: "heading font-size is 28px, spec says 24px"
  5. Apply deterministic corrections (not "make it look better")

This is documented at vadim.blog/pixel-perfect-playwright-figma-mcp and is the most reliable way to get first-attempt accuracy.
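
Steps 3 and 4 reduce to comparing two maps of style values and turning each mismatch into a sentence. The sketch below assumes both the spec and the measurements have already been collected into the same shape; the names StyleSpec, compareStyles, and describeDelta are illustrative and do not come from the referenced write-up.

```typescript
// Map of selector -> CSS property -> value,
// e.g. { h1: { "font-size": "24px" } }.
type StyleSpec = { [selector: string]: { [property: string]: string } };

interface StyleDelta {
  selector: string;
  property: string;
  expected: string;
  actual: string; // "(missing)" when the element or property was not measured
}

// Compares every property named in the spec against the measured values.
// Properties that were measured but are not in the spec are ignored.
function compareStyles(spec: StyleSpec, measured: StyleSpec): StyleDelta[] {
  const deltas: StyleDelta[] = [];
  Object.keys(spec).forEach((selector) => {
    const measuredProps = measured[selector];
    Object.keys(spec[selector]).forEach((property) => {
      const expected = spec[selector][property];
      const actual =
        measuredProps !== undefined && measuredProps[property] !== undefined
          ? measuredProps[property]
          : "(missing)";
      if (actual !== expected) {
        deltas.push({ selector, property, expected, actual });
      }
    });
  });
  return deltas;
}

// Formats one delta as the kind of numerical feedback described in step 4.
function describeDelta(delta: StyleDelta): string {
  return `${delta.selector} ${delta.property} is ${delta.actual}, spec says ${delta.expected}`;
}
```

The measured map would come from page.evaluate() calling getComputedStyle() on each selector. Because getComputedStyle() returns resolved values (colors as rgb(...), lengths in px), the spec side needs to be normalized to the same form before comparison, otherwise equal styles will show up as deltas.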

The DESIGN.md Standard

DESIGN.md has become a cross-tool standard for AI agent design context. awesome-design-md (4,385+ stars) provides 58 ready-made examples from real products. Our DESIGN.md follows this pattern already but should be enriched with the 9-section standard format (Visual Theme, Color Palette, Typography, Component Stylings, Layout, Depth/Elevation, Do's/Don'ts, Responsive, Agent Prompt Guide).

Design Token Validation

No mature generic tool exists for automated design token compliance. The best options are org-specific:

  • IBM Carbon: @carbon-design-system/stylelint-plugin-carbon-tokens
  • Atlassian: @atlaskit/eslint-plugin-design-system (has ensure-design-token-usage rule)
  • Salesforce: @salesforce-ux/eslint-plugin-slds

For SecurityV0, a custom ESLint rule would be needed — flagging hardcoded Tailwind colors that should use our token system.
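
The detection logic for such a rule can be prototyped as a plain function before wiring it into ESLint. This is a sketch under assumptions: it matches Tailwind's default palette names and arbitrary hex values, the list of utility prefixes is not exhaustive, and findHardcodedColors is a hypothetical name. A real rule would inspect className values in the AST rather than scan raw text.

```typescript
// Tailwind's default palette names and a non-exhaustive set of color utilities.
const PALETTE =
  "slate|gray|zinc|neutral|stone|red|orange|amber|yellow|lime|green|emerald|" +
  "teal|cyan|sky|blue|indigo|violet|purple|fuchsia|pink|rose";
const UTILITIES =
  "bg|text|border|ring|fill|stroke|from|via|to|divide|outline|decoration|accent|caret|shadow";

// Returns class names that use a raw palette color (bg-red-500) or an
// arbitrary hex value (text-[#1a2b3c]) instead of a design token.
function findHardcodedColors(source: string): string[] {
  // Built per call so the global regex's lastIndex never leaks between calls.
  const pattern = new RegExp(
    "(?:^|[\\s\"'`:])" + // start, whitespace, quote, or variant separator
      "((?:" + UTILITIES + ")-" +
      "(?:(?:" + PALETTE + ")-(?:50|[1-9]00|950)" + // palette shade
      "|\\[#[0-9a-fA-F]{3,8}\\]))" + // or arbitrary hex value
      "(?![\\w-])", // must end the class name (opacity suffix allowed)
    "g",
  );
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}
```

Token-based classes such as bg-surface (a made-up token name) pass untouched because they do not match a palette name plus shade. If the project defines its tokens under names that collide with the default palette, the palette list would need adjusting.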

Skills That Improve AI UI Output Quality

| Skill | Stars/Installs | Key Value |
| --- | --- | --- |
| frontend-design (Anthropic official) | 277K+ installs | Anti-"AI slop" guidelines, distinctive aesthetics |
| Impeccable (Paul Bakaus, ex-Google) | 10,198 stars | 20 design commands (/polish, /audit, /typeset), curated anti-patterns |
| web-interface (Vercel) |  | 100+ rules: a11y, performance, UX |

How Teams Get "Right the First Time"

Teams achieving near-first-attempt accuracy do ALL of:

  1. Provide complete DESIGN.md with constraints
  2. Use Figma MCP for exact specifications (not vague descriptions)
  3. Install design-quality skills (Impeccable or frontend-design)
  4. Run measurement-based validation before human review
  5. Get stakeholder approval at design stage, not implementation stage — this is the single highest-impact change

Design Tool Strategy: Recommendation

Now: Stitch + DESIGN.md + Agent Rules

Continue using Stitch for ideation. Strengthen the DESIGN.md. Add agent behavior rules and automated checks. This addresses the immediate quality gap without new tool investment.

Next Sprint: Add Figma MCP

Install Figma MCP alongside Stitch MCP. Use Stitch for first-pass mockups, export to Figma for CEO/designer review, then use Figma MCP to feed precise specs to Claude Code. This closes the "agent misinterprets design intent" gap by giving it exact pixel values instead of screenshots.

Next Month: Measurement-Based Validation Loop

Build the Playwright + getComputedStyle() measurement pipeline. Extract specs from Figma MCP, compare numerically against rendered output, apply deterministic corrections. Of the approaches surveyed here, this is the most reliable route to first-attempt accuracy with AI agents.

Quarterly: Evaluate Enterprise Tools

  • Applitools Eyes Figma plugin for design-to-code baselines (enterprise pricing)
  • Chromatic if we add Storybook (component-level visual testing)
  • ProofShot for proof bundling and PR integration

Open Questions

  • Should we invest in Figma now ($15/editor/mo) or wait until we have a human designer?
  • Is the measurement-based approach (Figma MCP → Playwright → numerical deltas) worth the setup effort?
  • Should visual regression be a merge gate (blocking) or advisory (non-blocking)?
  • Should we adopt Impeccable skill alongside frontend-design?
  • How do we handle design changes mid-implementation (Stitch mockup updated while coding)?

Sources