UX Visual Development & Testing Quality Plan

Problem Statement

The April 7 V2 design system implementation session revealed systematic quality gaps in our AI-assisted UI development workflow. Across the 17 iteration rounds of that single session:

Agent self-approved broken output. Claude Code took screenshots, analyzed them with vision, and declared "icons are crisp, properly centered" when the Material Symbols font wasn't loading (icons showed as text) and when a fingerprint icon was actually a person silhouette. LLM vision is a semantic tool, not a pixel-accurate one.
External dependency failure. The Material Symbols font, loaded from the Google CDN, broke silently in Docker: no error, just a text fallback. The agent didn't catch it because the CI/headless environment degrades just as silently.
Excessive back-and-forth. The session required 17 rounds of "change → deploy → screenshot → CEO feedback → fix → repeat" to converge on the correct design. Each round took 5-15 minutes. The overview page alone needed 8 iterations.
Design-to-code gap. Despite having Stitch mockups and a DESIGN.md, the first implementation diverged significantly from the design intent — wrong page architecture, wrong content model, missing visual elements.
No automated guard rails. The visual QA script (visual-qa.ts) checks for render errors and blank pages, but doesn't validate visual correctness against a reference. The reg-cli pixel diff runs in CI but only compares before/after — not against the design mockup.

Current Tooling

Tool	Purpose	Gap
Google Stitch	AI design mockup generation	Good for ideation, but mockup-to-code fidelity is low. Stitch HTML is not production code.
Stitch MCP	Pull designs into Claude Code context	Works for design data, but cannot validate implementation matches design.
DESIGN.md	Design system spec for agents	Agents read it but don't mechanically validate against it.
visual-qa.ts	Render error detection	Catches blank pages, JS errors, broken nav. Does NOT catch wrong icons, wrong colors, wrong layout.
visual-screenshot.ts	Full-page screenshot capture	Captures pages but no comparison against reference design.
visual-diff-report.ts	Before/after pixel diff (reg-cli)	Catches regressions but not design compliance. Deployed to Cloudflare Pages for human review.
agent-browser	Exploratory testing	Good for ad-hoc bug hunts. Not reliable for visual validation (headless font issues).
Playwright	Browser automation	Already installed. Visual regression (`toHaveScreenshot()`) NOT yet configured.

Session Analysis: What Went Wrong

Round-by-Round Breakdown

Round	What happened	Root cause
1-3	Design tokens + layout shell	Correct — foundational work
4-5	Overview redesign	Used generic stats instead of cluster data. Agent didn't understand the Stitch mockup's information architecture — only copied surface styling.
6-7	Remediation Brief page	Close but wrong terminology, wrong button state, missing fields. Agent read the CEO spec but implemented imprecisely.
8	Adversarial review catches 3 blocking issues	Good — external review found what self-review missed.
9-10	Cross-review finds nav labels, branding, terminology gaps	Pattern: surface-level changes were applied but deeper meaning was missed.
11	CEO round 2 feedback: "page architecture is still the old model"	Agent applied new visual shell to old content model. Fundamental misunderstanding of intent.
12-13	Cluster-driven overview + remediation hierarchy fixes	Correct direction but icon implementation broke.
14	Material Symbols font fails in Docker	Agent chose Google CDN without considering deployment environment.
15	Agent says "icons are crisp" on broken screenshot	LLM vision cannot reliably validate pixel-level rendering.
16	Fingerprint icon is actually a person silhouette	Agent hand-drew SVG path instead of copying from a verified source.
17	Agent approves wrong icon AGAIN in screenshot review	Same fundamental problem — agent cannot self-approve visual output.

Key Failure Patterns

Surface-level interpretation. Agent copies visual styling but misses information architecture intent. The Stitch mockup showed a cluster-driven hero; the agent built a generic-stats hero with the same colors.
Self-approval bias. Agent consistently approved its own visual output. When shown a screenshot, it described what it expected to see rather than what was actually rendered.
External dependency blindness. Agent added Google CDN fonts without considering that Docker containers may not have internet access. No automated check caught this.
SVG path fabrication. Instead of copying a verified icon from lucide.dev or another source, the agent drew SVG paths freehand — producing wrong shapes.
Iteration cost accumulation. Each feedback round required: read feedback → understand intent → modify code → typecheck → Docker rebuild (~30s) → seed data → set tenant → screenshot → verify. At 5-15 minutes per round, 17 rounds add up to roughly 85-255 minutes (about 1.4 to 4.25 hours) of iteration.

Proposed Solution: 5 Prevention Layers

Layer 1: Design Intent Capture (Before Code)

Problem: Agent misinterprets design intent.
Solution: Structured design brief that forces understanding before implementation.

Before any UI implementation, create a DESIGN-INTENT.md per feature that answers:

What is the information architecture? (not visual style — the content model)
What data drives each section? (API hook → UI section mapping)
What is the reading order? (what does the user see first, second, third)
What are the interaction targets? (what is clickable, where does it go)
What is the acceptance test? (specific assertions, not "looks like the mockup")

This document should be reviewed by the product owner BEFORE implementation starts — not after.
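One way to make the five questions mechanically checkable is to mirror them in a typed record and refuse to start implementation while any answer is missing. The sketch below is illustrative only: the `DesignIntent` shape, the field names, and `findIntentGaps` are hypothetical and not part of any existing tooling.

```typescript
// Hypothetical shape for one feature's DESIGN-INTENT.md, mirroring the five questions.
interface DesignIntent {
  informationArchitecture: string;            // the content model, not the visual style
  dataSources: Record<string, string>;        // UI section -> API hook that drives it
  readingOrder: string[];                     // what the user sees first, second, third
  interactionTargets: Record<string, string>; // clickable element -> destination
  acceptanceTests: string[];                  // specific, checkable assertions
}

// Returns the names of unanswered questions; an empty list means the brief is ready for review.
function findIntentGaps(intent: DesignIntent): string[] {
  const gaps: string[] = [];
  if (intent.informationArchitecture.trim() === "") gaps.push("informationArchitecture");
  if (Object.keys(intent.dataSources).length === 0) gaps.push("dataSources");
  if (intent.readingOrder.length === 0) gaps.push("readingOrder");
  if (Object.keys(intent.interactionTargets).length === 0) gaps.push("interactionTargets");
  // Reject the vague acceptance test that the brief is meant to prevent.
  const vague = /looks like the mockup/i;
  if (intent.acceptanceTests.length === 0 || intent.acceptanceTests.some((t) => vague.test(t))) {
    gaps.push("acceptanceTests");
  }
  return gaps;
}
```

A brief whose only acceptance test is "looks like the mockup" would be reported as having an `acceptanceTests` gap, which gives the product owner a concrete item to push back on before any code is written.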

Layer 2: Design Token + Constraint Linting (Build Time)

Problem: Design system rules violated, external deps introduced.
Solution: Automated lint checks.

Grep-based pre-commit check blocking external CDN font/icon URLs
ESLint rule or custom lint for hardcoded Tailwind colors (should use tokens)
Design constraint validation: check for border- classes that violate No-Line rule
Font dependency audit: flag any @import url() that isn't bundled
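As a rough illustration of the first and last checks in this list, the function below scans source text for references to external font hosts and for remote `@import url()` statements. It is a minimal sketch, not the planned pre-commit script: the host list, the `LintFinding` type, and the function name are assumptions made for the example.

```typescript
interface LintFinding {
  line: number; // 1-based line number in the scanned source
  text: string; // the offending line, trimmed
}

// Hosts treated as external font/icon CDNs. This list is an assumption; extend it as needed.
const EXTERNAL_FONT_HOSTS = ["fonts.googleapis.com", "fonts.gstatic.com"];

function findExternalFontRefs(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const hitsKnownHost = EXTERNAL_FONT_HOSTS.some((host) => text.includes(host));
    // An @import url(...) pointing at http(s) or a protocol-relative URL is not bundled.
    const remoteImport = /@import\s+url\(\s*["']?(https?:)?\/\//i.test(text);
    if (hitsKnownHost || remoteImport) {
      findings.push({ line: index + 1, text: text.trim() });
    }
  });
  return findings;
}
```

A pre-commit wrapper could read each staged CSS, HTML, and TSX file, call `findExternalFontRefs`, and exit non-zero when any finding is returned. A relative import such as `@import url("./tokens.css")` passes, because it resolves inside the bundle.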

Layer 3: Component-Level Visual Regression (Test Time)

Problem: Wrong icon shape, broken rendering, layout drift.
Solution: Playwright toHaveScreenshot() with golden files.

Per-icon golden file tests (1% threshold)
Per-section golden files for key page areas (hero, visualization, cards)
Full-page golden files for all routes
Baselines generated in CI (Linux), never locally (macOS font differences)
Update baselines via dedicated PR (--update-snapshots)
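A Playwright configuration along the following lines could encode the threshold and the CI-only baseline policy. Treat it as a sketch under stated assumptions: the test directory, the base URL, and reading the "1% threshold" as `maxDiffPixelRatio: 0.01` are choices made for the example, not settings taken from the existing repository.

```typescript
// playwright.config.ts (sketch)
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  // Hypothetical location for the visual specs.
  testDir: "./tests/visual",
  // One baseline set for everyone: omitting {platform} assumes baselines only ever come from Linux CI.
  snapshotPathTemplate: "{testDir}/__screenshots__/{testFilePath}/{arg}{ext}",
  expect: {
    toHaveScreenshot: {
      maxDiffPixelRatio: 0.01, // one reading of the "1% threshold"
      animations: "disabled",  // avoid diffs caused by mid-animation frames
    },
  },
  use: {
    // Assumed dev URL; override with BASE_URL in CI.
    baseURL: process.env.BASE_URL ?? "http://localhost:3000",
  },
  projects: [{ name: "chromium", use: { ...devices["Desktop Chrome"] } }],
});
```

With a shared snapshot path, a developer running the suite on macOS compares against the Linux baselines and is likely to see font-rendering diffs, so local runs are best treated as advisory and baselines refreshed only through the dedicated `--update-snapshots` PR.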

Layer 4: Agent Behavior Rules (Development Time)

Problem: Agent self-approves visual output.
Solution: Explicit rules in .claude/rules/.

NEVER self-approve visual changes. Run automated checks, attach screenshots for human review.
NEVER hand-draw SVG paths. Copy from lucide.dev, tabler-icons.io, or heroicons.
NEVER load fonts or icon fonts from an external CDN. For icons, use lucide-react or inline SVGs; bundle any text fonts locally.
ALWAYS run visual-qa.ts before committing UI changes.
ALWAYS take a screenshot and describe what you ACTUALLY see (not what you expect).
ALWAYS compare against the Stitch mockup when one exists.

Layer 5: Human-in-the-Loop Review (Pre-Merge)

Problem: Automated checks have limits; some visual quality requires human eyes.
Solution: Structured visual review in PR workflow.

CI generates visual diff report → Cloudflare Pages
PR template includes "Visual Review" checklist
Stitch mockup screenshot attached to PR for side-by-side comparison
CEO/designer approval required for pages that changed visually

Design Tool Strategy

Current: Google Stitch

Strengths: AI-generated mockups, fast ideation, DESIGN.md export, MCP integration.
Weaknesses: Mockup-to-code fidelity is low. Stitch HTML is not production code. The agent interprets mockups loosely, leading to divergence.

Potential Addition: Figma

When to add Figma:

When we have a dedicated designer (not AI-generated mockups)
When we need pixel-precise handoff (Figma Dev Mode)
When we need component-level design specs (not just page screenshots)

Figma + Stitch workflow:

Stitch for rapid ideation and first-pass mockups
Figma for refinement, precise specs, and developer handoff
Figma golden files as Playwright baselines

Not needed yet if we implement Layers 1-5 properly. The bigger problem is process, not tools.

Design-to-Code Validation: Reference Image Pipeline

Regardless of design tool, the key missing piece is:

Design mockup (Stitch/Figma) → Reference screenshot
                                      ↕ comparison
Implementation screenshot     → Pixel diff report

This pipeline should run automatically in CI. The reference screenshots are the "golden files" that Playwright compares against.
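The comparison step at the centre of this pipeline reduces to counting pixels that differ between two same-sized images. The function below is a simplified sketch of that idea, assuming both screenshots have already been decoded to raw RGBA bytes; it is not how reg-cli or Playwright compute their diffs, and the `RgbaImage` type and default tolerance are hypothetical.

```typescript
interface RgbaImage {
  width: number;
  height: number;
  data: Uint8Array; // 4 bytes per pixel: R, G, B, A
}

// Fraction of pixels whose largest per-channel difference exceeds `tolerance` (0-255).
function diffRatio(reference: RgbaImage, actual: RgbaImage, tolerance = 16): number {
  if (reference.width !== actual.width || reference.height !== actual.height) {
    throw new Error("Images must share dimensions; resize or crop before comparing");
  }
  const pixels = reference.width * reference.height;
  let changed = 0;
  for (let p = 0; p < pixels; p++) {
    const offset = p * 4;
    let maxDelta = 0;
    for (let c = 0; c < 4; c++) {
      maxDelta = Math.max(maxDelta, Math.abs(reference.data[offset + c] - actual.data[offset + c]));
    }
    if (maxDelta > tolerance) changed++;
  }
  return pixels === 0 ? 0 : changed / pixels;
}
```

A mockup export and an implementation screenshot rarely share exact dimensions, so a real pipeline would need a normalization step (fixed viewport, cropping to a section) before a ratio like this becomes meaningful.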

Implementation Roadmap

Phase 1: Immediate (This Week)

Add agent behavior rules to .claude/rules/visual-review-tooling.md
Add lint-icon-deps.sh pre-commit check
Add checkFontLoading() and checkIconIntegrity() to visual-qa.ts
Create GitHub issue for regression test suite
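The two visual-qa.ts checks named in Phase 1 might look roughly like this. Everything here is a sketch: `PageLike` is a hypothetical minimal interface (Playwright's `page.evaluate` accepts a string expression, so a real `Page` is expected to fit it), and the width-versus-height heuristic for spotting ligature text is an assumption rather than an established rule.

```typescript
// Minimal slice of a browser page; a real Playwright Page is assumed to satisfy it.
interface PageLike {
  evaluate<T>(expression: string): Promise<T>;
}

interface IconBox {
  name: string;
  width: number;
  height: number;
}

// Runs in the browser: waits for font loading to settle, then lists the loaded families.
const LOADED_FAMILIES_EXPR = `document.fonts.ready.then(() =>
  Array.from(document.fonts)
    .filter((face) => face.status === "loaded")
    .map((face) => face.family.replace(/["']/g, "")))`;

// Returns the required font families that never reached the "loaded" state.
async function checkFontLoading(page: PageLike, required: string[]): Promise<string[]> {
  const loaded = await page.evaluate<string[]>(LOADED_FAMILIES_EXPR);
  return required.filter((family) => !loaded.includes(family));
}

// Returns icon elements that look like fallback text rather than a glyph.
async function checkIconIntegrity(page: PageLike, selector: string): Promise<IconBox[]> {
  const expr = `Array.from(document.querySelectorAll(${JSON.stringify(selector)})).map((el) => {
    const box = el.getBoundingClientRect();
    return { name: (el.textContent || "").trim(), width: box.width, height: box.height };
  })`;
  const boxes = await page.evaluate<IconBox[]>(expr);
  // A glyph is roughly square; an icon name rendered as plain text is much wider than tall.
  return boxes.filter((box) => box.height > 0 && box.width > box.height * 1.5);
}
```

These checks target the silent font fallback only. They cannot tell a fingerprint from a person silhouette, which is why the golden-file icon tests are still needed.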

Phase 2: Short-Term (Next Sprint)

Set up Playwright Test runner with playwright.config.ts
Create icons.spec.ts with golden-file icon tests
Create pages.spec.ts with per-route full-page screenshots
Generate initial baselines from known-good build
Add visual regression job to CI

Phase 3: Medium-Term (Next Month)

Add component-level golden files for key UI sections
Evaluate Figma for design refinement (if designer joins)
Build Stitch-to-Playwright golden file pipeline
Add design token validation (automated color/spacing checks)
Implement structured design intent documents per feature

Phase 4: Long-Term (Quarterly)

Cross-browser visual regression (Firefox, Safari)
Accessibility regression testing
Performance budget monitoring (CSS bundle size, load time)
Design system component library with Storybook (if warranted)

Success Metrics

Zero "agent said correct but wasn't" incidents — enforced by automated checks + human review.
First-implementation accuracy > 80% — measured by rounds of feedback needed (target: ≤ 3 rounds, down from 17).
Visual regression catch rate > 95%, measured as visual bugs caught by CI divided by all visual bugs found (CI plus human review).
Design-to-code cycle time < 2 hours per page — from mockup approval to merged PR.

Industry Research: AI-Assisted UI Development Landscape (2026)

Design Tool Chain: Stitch → Figma → Claude Code

The recommended 2026 workflow chains tools by strength:

Tool	Role	When
Google Stitch	Explore — generate mockups from text/voice	Starting from zero, rapid ideation
Figma + Dev Mode	Refine — pixel-precise specs, design system	Designer refinement, developer handoff
v0 (Vercel)	Component generation — React + shadcn/ui	Individual component implementation
Claude Code + MCP	Full implementation — reads design context	Entire page/feature builds

Key insight: Stitch and Figma are not competitors — they chain. Stitch generates first-pass designs in minutes (free), Figma allows designer refinement ($15/editor/mo Dev Mode), then Figma MCP feeds precise specs to Claude Code.

Figma MCP (official): claude plugin install figma@claude-plugins-official — extracts exact font sizes, spacing, colors, component structure from Figma files. Simpler setup than Stitch MCP. Supports bidirectional flow (Claude Code can push UIs back to Figma as editable layers).

Emerging Tools for Agent Visual Validation

Tool	What	Maturity
ProofShot	Records browser sessions as proof artifacts for AI agents. Video + screenshots + errors bundled for human review.	New (March 2026), open source
claude-code-frontend-dev	8-agent visual testing plugin. Auto-launches dev server → captures screenshot → Claude vision analyzes → iterates up to 5x.	Open source, experimental
Playwright MCP (`@playwright/mcp`)	Microsoft's MCP server. AI agent controls browser via accessibility tree. Most mature MCP for testing.	Widely adopted since mid-2025
Momentic	AI-native E2E testing with self-healing locators. 99% false positive reduction. YC-backed.	SaaS, team pricing

The Measurement-Based Approach (Most Promising)

The pattern that eliminates subjective visual judgment entirely:

Extract spec from Figma via MCP (exact pixel values)
Render implementation in headless Playwright
Measure with page.evaluate() + getComputedStyle() (actual rendered values)
Feed numerical deltas to Claude: "heading font-size is 28px, spec says 24px"
Apply deterministic corrections (not "make it look better")

This is documented at vadim.blog/pixel-perfect-playwright-figma-mcp and is the most reliable way to get first-attempt accuracy.
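The comparison and reporting steps of this pattern can be sketched as a pure function over two maps of style values. The types and function names below are hypothetical; in practice `measured` would be filled from `page.evaluate()` and `getComputedStyle()`, and the spec values would need normalizing first, since computed styles report resolved values (colors, for example, come back in `rgb()` form).

```typescript
// Element label -> CSS property -> value, e.g. { heading: { "font-size": "24px" } }.
type StyleSpec = Record<string, Record<string, string>>;

interface StyleDelta {
  element: string;
  property: string;
  expected: string;
  actual: string;
}

// Lists every property whose measured value differs from the spec.
function computeDeltas(spec: StyleSpec, measured: StyleSpec): StyleDelta[] {
  const deltas: StyleDelta[] = [];
  for (const [element, properties] of Object.entries(spec)) {
    for (const [property, expected] of Object.entries(properties)) {
      const actual = measured[element]?.[property] ?? "(not measured)";
      if (actual !== expected) deltas.push({ element, property, expected, actual });
    }
  }
  return deltas;
}

// Formats a delta as the kind of numerical statement that is fed back to the agent.
function describeDelta(delta: StyleDelta): string {
  return `${delta.element} ${delta.property} is ${delta.actual}, spec says ${delta.expected}`;
}
```

Because the output is a list of exact mismatches, the agent receives a bounded set of corrections to apply instead of an open-ended request to improve the look.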

The DESIGN.md Standard

DESIGN.md has become a cross-tool standard for AI agent design context. awesome-design-md (4,385+ stars) provides 58 ready-made examples from real products. Our DESIGN.md follows this pattern already but should be enriched with the 9-section standard format (Visual Theme, Color Palette, Typography, Component Stylings, Layout, Depth/Elevation, Do's/Don'ts, Responsive, Agent Prompt Guide).

Design Token Validation

No mature generic tool exists for automated design token compliance. The best options are org-specific:

IBM Carbon: @carbon-design-system/stylelint-plugin-carbon-tokens
Atlassian: @atlaskit/eslint-plugin-design-system (has ensure-design-token-usage rule)
Salesforce: @salesforce-ux/eslint-plugin-slds

For SecurityV0, a custom ESLint rule would be needed — flagging hardcoded Tailwind colors that should use our token system.
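The core of such a rule is a predicate that separates default-palette and arbitrary hex color classes from token-based ones. The sketch below shows only that predicate, not an ESLint rule. The prefix list is partial, and the token class names used in the usage note (`text-primary`, `border-surface-muted`) are hypothetical stand-ins for our token system.

```typescript
// Default Tailwind palette names; project token names are deliberately absent from this list.
const PALETTE = [
  "slate", "gray", "zinc", "neutral", "stone", "red", "orange", "amber", "yellow", "lime", "green",
  "emerald", "teal", "cyan", "sky", "blue", "indigo", "violet", "purple", "fuchsia", "pink", "rose",
];
// Utility prefixes that take a color. Partial list, assumed for the example.
const PREFIXES = ["bg", "text", "border", "ring", "fill", "stroke", "from", "via", "to", "divide", "outline", "shadow"];

const PALETTE_CLASS = new RegExp(`^(?:${PREFIXES.join("|")})-(?:${PALETTE.join("|")})-(?:50|[1-9]00|950)$`);
const ARBITRARY_HEX = new RegExp(`^(?:${PREFIXES.join("|")})-\\[#[0-9a-fA-F]{3,8}\\]$`);

// Returns the classes in a className string that hardcode a color instead of using a token.
function findHardcodedColors(className: string): string[] {
  return className
    .split(/\s+/)
    .filter(Boolean)
    .filter((cls) => {
      // Strip variant prefixes such as "hover:" or "md:" and any opacity suffix such as "/50".
      const withoutVariants = cls.split(":").pop() ?? "";
      const base = withoutVariants.split("/")[0] ?? "";
      return PALETTE_CLASS.test(base) || ARBITRARY_HEX.test(base);
    });
}
```

For a class string like `flex bg-red-500 hover:text-[#ff0000] text-primary md:bg-blue-600/50`, the function reports the three hardcoded classes and leaves `text-primary` alone. Wrapping it in an ESLint rule would mean applying it to string literals in `className` attributes.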

Skills That Improve AI UI Output Quality

Skill	Stars/Installs	Key Value
frontend-design (Anthropic official)	277K+ installs	Anti-"AI slop" guidelines, distinctive aesthetics
Impeccable (Paul Bakaus, ex-Google)	10,198 stars	20 design commands (`/polish`, `/audit`, `/typeset`), curated anti-patterns
web-interface (Vercel)	—	100+ rules: a11y, performance, UX

How Teams Get "Right the First Time"

Teams achieving near-first-attempt accuracy do ALL of:

Provide complete DESIGN.md with constraints
Use Figma MCP for exact specifications (not vague descriptions)
Install design-quality skills (Impeccable or frontend-design)
Run measurement-based validation before human review
Get stakeholder approval at design stage, not implementation stage — this is the single highest-impact change

Design Tool Strategy: Recommendation

Now: Stitch + DESIGN.md + Agent Rules

Continue using Stitch for ideation. Strengthen the DESIGN.md. Add agent behavior rules and automated checks. This addresses the immediate quality gap without new tool investment.

Next Sprint: Add Figma MCP

Install Figma MCP alongside Stitch MCP. Use Stitch for first-pass mockups, export to Figma for CEO/designer review, then use Figma MCP to feed precise specs to Claude Code. This closes the "agent misinterprets design intent" gap by giving it exact pixel values instead of screenshots.

Next Month: Measurement-Based Validation Loop

Build the Playwright + getComputedStyle() measurement pipeline. Extract specs from Figma MCP, compare numerically against rendered output, apply deterministic corrections. Of the approaches surveyed here, this is the most reliable path to first-attempt accuracy with AI agents.

Quarterly: Evaluate Enterprise Tools

Applitools Eyes Figma plugin for design-to-code baselines (enterprise pricing)
Chromatic if we add Storybook (component-level visual testing)
ProofShot for proof bundling and PR integration

Open Questions

Should we invest in Figma now ($15/editor/mo) or wait until we have a human designer?
Is the measurement-based approach (Figma MCP → Playwright → numerical deltas) worth the setup effort?
Should visual regression be a merge gate (blocking) or advisory (non-blocking)?
Should we adopt Impeccable skill alongside frontend-design?
How do we handle design changes mid-implementation (Stitch mockup updated while coding)?
