UX Visual Development & Testing Quality Plan
Problem Statement
The April 7 V2 design system implementation session revealed systematic quality gaps in our AI-assisted UI development workflow. Over ~15 iteration rounds across a single session:
- Agent self-approved broken output. Claude Code took screenshots, analyzed them with vision, and declared "icons are crisp, properly centered" when Material Symbols font wasn't loading (icons showed as text) and when a fingerprint icon was actually a person silhouette. LLM vision is a semantic tool, not a pixel-accurate one.
- External dependency failure. Material Symbols font loaded from Google CDN broke silently in Docker: no error, just text fallback. The agent didn't catch it because the CI/headless environment also silently degrades.
- Excessive back-and-forth. The session required ~15 rounds of "change → deploy → screenshot → CEO feedback → fix → repeat" to converge on the correct design. Each round took 5-15 minutes. The overview page alone needed 8 iterations.
- Design-to-code gap. Despite having Stitch mockups and a DESIGN.md, the first implementation diverged significantly from the design intent: wrong page architecture, wrong content model, missing visual elements.
- No automated guard rails. The visual QA script (`visual-qa.ts`) checks for render errors and blank pages, but doesn't validate visual correctness against a reference. The reg-cli pixel diff runs in CI but only compares before/after, not against the design mockup.
Current Tooling
| Tool | Purpose | Gap |
|---|---|---|
| Google Stitch | AI design mockup generation | Good for ideation, but mockup-to-code fidelity is low. Stitch HTML is not production code. |
| Stitch MCP | Pull designs into Claude Code context | Works for design data, but cannot validate implementation matches design. |
| DESIGN.md | Design system spec for agents | Agents read it but don't mechanically validate against it. |
| visual-qa.ts | Render error detection | Catches blank pages, JS errors, broken nav. Does NOT catch wrong icons, wrong colors, wrong layout. |
| visual-screenshot.ts | Full-page screenshot capture | Captures pages but no comparison against reference design. |
| visual-diff-report.ts | Before/after pixel diff (reg-cli) | Catches regressions but not design compliance. Deployed to Cloudflare Pages for human review. |
| agent-browser | Exploratory testing | Good for ad-hoc bug hunts. Not reliable for visual validation (headless font issues). |
| Playwright | Browser automation | Already installed. Visual regression (toHaveScreenshot()) NOT yet configured. |
Session Analysis: What Went Wrong
Round-by-Round Breakdown
| Round | What happened | Root cause |
|---|---|---|
| 1-3 | Design tokens + layout shell | Correct — foundational work |
| 4-5 | Overview redesign | Used generic stats instead of cluster data. Agent didn't understand the Stitch mockup's information architecture — only copied surface styling. |
| 6-7 | Remediation Brief page | Close but wrong terminology, wrong button state, missing fields. Agent read the CEO spec but implemented imprecisely. |
| 8 | Adversarial review catches 3 blocking issues | Good — external review found what self-review missed. |
| 9-10 | Cross-review finds nav labels, branding, terminology gaps | Pattern: surface-level changes were applied but deeper meaning was missed. |
| 11 | CEO round 2 feedback: "page architecture is still the old model" | Agent applied new visual shell to old content model. Fundamental misunderstanding of intent. |
| 12-13 | Cluster-driven overview + remediation hierarchy fixes | Correct direction but icon implementation broke. |
| 14 | Material Symbols font fails in Docker | Agent chose Google CDN without considering deployment environment. |
| 15 | Agent says "icons are crisp" on broken screenshot | LLM vision cannot reliably validate pixel-level rendering. |
| 16 | Fingerprint icon is actually a person silhouette | Agent hand-drew SVG path instead of copying from a verified source. |
| 17 | Agent approves wrong icon AGAIN in screenshot review | Same fundamental problem — agent cannot self-approve visual output. |
Key Failure Patterns
- Surface-level interpretation. Agent copies visual styling but misses information architecture intent. The Stitch mockup showed a cluster-driven hero; the agent built a generic-stats hero with the same colors.
- Self-approval bias. Agent consistently approved its own visual output. When shown a screenshot, it described what it expected to see rather than what was actually rendered.
- External dependency blindness. Agent added Google CDN fonts without considering that Docker containers may not have internet access. No automated check caught this.
- SVG path fabrication. Instead of copying a verified icon from lucide.dev or another source, the agent drew SVG paths freehand, producing wrong shapes.
- Iteration cost accumulation. Each feedback round required: read feedback → understand intent → modify code → typecheck → Docker rebuild (~30s) → seed data → set tenant → screenshot → verify. At 5-15 min per round × 17 rounds, that is 85-255 minutes, or roughly 1.4-4.3 hours of iteration.
Proposed Solution: 5 Prevention Layers
Layer 1: Design Intent Capture (Before Code)
Problem: Agent misinterprets design intent.
Solution: Structured design brief that forces understanding before implementation.
Before any UI implementation, create a DESIGN-INTENT.md per feature that answers:
- What is the information architecture? (not visual style — the content model)
- What data drives each section? (API hook → UI section mapping)
- What is the reading order? (what does the user see first, second, third)
- What are the interaction targets? (what is clickable, where does it go)
- What is the acceptance test? (specific assertions, not "looks like the mockup")
This document should be reviewed by the product owner BEFORE implementation starts — not after.
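One way to make the brief checkable rather than free-form is to mirror the five questions in a typed structure and reject briefs that leave any of them empty. The sketch below is only an illustration: the `DesignIntent` field names, the `findGaps` helper, and the `useClusters` hook name are assumptions, not an existing schema in this repo.

```typescript
// Hypothetical shape for a per-feature design intent brief.
// Field names are illustrative; they map one-to-one to the five questions.
interface DesignIntent {
  feature: string;
  informationArchitecture: string; // the content model, not the visual style
  dataSources: { section: string; hook: string }[]; // API hook -> UI section
  readingOrder: string[]; // what the user sees first, second, third
  interactionTargets: { element: string; destination: string }[];
  acceptanceTests: string[]; // specific assertions
}

// Returns the names of the questions the brief leaves unanswered.
function findGaps(intent: DesignIntent): string[] {
  const gaps: string[] = [];
  if (intent.informationArchitecture.trim() === "") gaps.push("informationArchitecture");
  if (intent.dataSources.length === 0) gaps.push("dataSources");
  if (intent.readingOrder.length === 0) gaps.push("readingOrder");
  if (intent.interactionTargets.length === 0) gaps.push("interactionTargets");
  if (intent.acceptanceTests.length === 0) gaps.push("acceptanceTests");
  return gaps;
}

// Example brief with two questions still open ("useClusters" is a made-up hook name).
const overviewIntent: DesignIntent = {
  feature: "overview",
  informationArchitecture: "Cluster-driven hero, then per-cluster cards",
  dataSources: [{ section: "hero", hook: "useClusters" }],
  readingOrder: ["hero", "cluster cards"],
  interactionTargets: [],
  acceptanceTests: [],
};

console.log(findGaps(overviewIntent));
// → [ 'interactionTargets', 'acceptanceTests' ]
```

A non-empty result could block the start of implementation until the product owner has filled in the missing answers.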
Layer 2: Design Token + Constraint Linting (Build Time)
Problem: Design system rules violated, external deps introduced.
Solution: Automated lint checks.
- Grep-based pre-commit check blocking external CDN font/icon URLs
- ESLint rule or custom lint for hardcoded Tailwind colors (should use tokens)
- Design constraint validation: check for `border-` classes that violate the No-Line rule
- Font dependency audit: flag any `@import url()` that isn't bundled
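The first and last checks in this list can be approximated with a small text scanner. The following is a minimal sketch, assuming the check runs over file contents already read into a string; the `BLOCKED_HOSTS` list, the rule names, and the `lintExternalAssets` function are hypothetical and would need tuning for this codebase.

```typescript
interface LintFinding {
  line: number;
  rule: string;
  text: string;
}

// Hosts treated as external font/icon CDNs. This list is an assumption; extend as needed.
const BLOCKED_HOSTS = ["fonts.googleapis.com", "fonts.gstatic.com"];

// Flags lines that reference a blocked CDN host, or that @import a remote stylesheet.
function lintExternalAssets(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((text, index) => {
    const line = index + 1;
    if (BLOCKED_HOSTS.some((host) => text.includes(host))) {
      findings.push({ line, rule: "external-cdn", text: text.trim() });
    } else if (/@import\s+url\(\s*["']?(https?:)?\/\//.test(text)) {
      // Remote URL (http://, https://, or protocol-relative //) instead of a bundled file.
      findings.push({ line, rule: "unbundled-import", text: text.trim() });
    }
  });
  return findings;
}

const sampleCss = [
  '@import url("https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined");',
  '@import url("./tokens.css");',
  '@import url("https://example.com/theme.css");',
].join("\n");

for (const finding of lintExternalAssets(sampleCss)) {
  console.log(`${finding.line}: ${finding.rule}`);
}
// prints "1: external-cdn" then "3: unbundled-import"
```

A pre-commit hook could run this over staged files and exit non-zero when the findings list is not empty.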
Layer 3: Component-Level Visual Regression (Test Time)
Problem: Wrong icon shape, broken rendering, layout drift.
Solution: Playwright toHaveScreenshot() with golden files.
- Per-icon golden file tests (1% threshold)
- Per-section golden files for key page areas (hero, visualization, cards)
- Full-page golden files for all routes
- Baselines generated in CI (Linux), never locally (macOS font differences)
- Update baselines via a dedicated PR (`--update-snapshots`)
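A configuration along the following lines would encode these rules. It is a sketch under stated assumptions: the directory layout, viewport, and the reading of "1% threshold" as `maxDiffPixelRatio: 0.01` are choices made for illustration, not settings taken from this repo. The object is written without imports; wrapping it in `defineConfig()` from `@playwright/test` would add type checking.

```typescript
// playwright.config.ts (sketch). Paths, viewport and thresholds are assumptions.
const config = {
  testDir: "./tests/visual",
  // Single baseline set with no platform suffix, since baselines come from Linux CI only.
  snapshotPathTemplate: "{testDir}/__screenshots__/{testFilePath}/{arg}{ext}",
  // Never write baselines implicitly; the dedicated PR passes --update-snapshots instead.
  updateSnapshots: "none",
  expect: {
    toHaveScreenshot: {
      // One interpretation of the "1% threshold": at most 1% of pixels may differ.
      maxDiffPixelRatio: 0.01,
      // Freeze CSS animations and transitions so captures are stable.
      animations: "disabled",
    },
  },
  use: {
    viewport: { width: 1280, height: 800 },
    deviceScaleFactor: 1,
  },
};

export default config;
```

With `updateSnapshots` set to `"none"`, a missing baseline fails the test instead of silently creating one, which keeps locally generated macOS images out of the golden set.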
Layer 4: Agent Behavior Rules (Development Time)
Problem: Agent self-approves visual output.
Solution: Explicit rules in .claude/rules/.
- NEVER self-approve visual changes. Run automated checks, attach screenshots for human review.
- NEVER hand-draw SVG paths. Copy from lucide.dev, tabler-icons.io, or heroicons.
- NEVER use external CDN fonts. Use lucide-react or inline SVGs.
- ALWAYS run `visual-qa.ts` before committing UI changes.
- ALWAYS take a screenshot and describe what you ACTUALLY see (not what you expect).
- ALWAYS compare against the Stitch mockup when one exists.
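The "never hand-draw SVG paths" rule can be backed by a mechanical check that compares rendered path data against a vetted copy. The sketch below assumes a small registry of fingerprints maintained by hand; the `VETTED_ICONS` registry, the `example-square` entry, and its path data are placeholders for illustration, not real icon data.

```typescript
import { createHash } from "node:crypto";

// Collapse whitespace so formatting differences do not change the fingerprint.
function fingerprint(pathData: string): string {
  const normalized = pathData.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// Placeholder entry for illustration only. Real entries would be computed from
// path data copied verbatim from the icon library's published SVG.
const VETTED_ICONS: Record<string, string> = {
  "example-square": fingerprint("M2 2 H22 V22 H2 Z"),
};

// True only when the icon is registered and its path data matches the vetted copy.
function checkIconIntegrity(name: string, renderedPathData: string): boolean {
  const expected = VETTED_ICONS[name];
  return expected !== undefined && expected === fingerprint(renderedPathData);
}

console.log(checkIconIntegrity("example-square", "M2 2  H22 V22\nH2 Z")); // → true
console.log(checkIconIntegrity("example-square", "M2 2 H22 V20 H2 Z")); // → false
```

An exact-match check like this is deliberately strict: any edited or freehand path fails, which is the behavior the rule asks for.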
Layer 5: Human-in-the-Loop Review (Pre-Merge)
Problem: Automated checks have limits: some visual quality requires human eyes.
Solution: Structured visual review in PR workflow.
- CI generates visual diff report → Cloudflare Pages
- PR template includes "Visual Review" checklist
- Stitch mockup screenshot attached to PR for side-by-side comparison
- CEO/designer approval required for pages that changed visually
Design Tool Strategy
Current: Google Stitch
Strengths: AI-generated mockups, fast ideation, DESIGN.md export, MCP integration.
Weaknesses: Mockup-to-code fidelity is low. Stitch HTML is not production code. The agent interprets mockups loosely, leading to divergence.
Potential Addition: Figma
When to add Figma:
- When we have a dedicated designer (not AI-generated mockups)
- When we need pixel-precise handoff (Figma Dev Mode)
- When we need component-level design specs (not just page screenshots)
Figma + Stitch workflow:
- Stitch for rapid ideation and first-pass mockups
- Figma for refinement, precise specs, and developer handoff
- Figma golden files as Playwright baselines
Not needed yet if we implement Layers 1-5 properly. The bigger problem is process, not tools.
Design-to-Code Validation: Reference Image Pipeline
Regardless of design tool, the key missing piece is:
Design mockup (Stitch/Figma) → Reference screenshot
↕ comparison
Implementation screenshot → Pixel diff report
This pipeline should run automatically in CI. The reference screenshots are the "golden files" that Playwright compares against.
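To make the comparison step concrete, the sketch below computes the share of differing pixels between two screenshots. It assumes both images have already been decoded into same-size RGBA byte buffers (PNG decoding is not shown), and `diffRatio` is a hypothetical helper; in practice reg-cli or Playwright's built-in comparison would do this work.

```typescript
// Fraction of pixels where any RGBA channel differs by more than `tolerance` (0-255).
function diffRatio(a: Uint8Array, b: Uint8Array, tolerance = 0): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("Buffers must be RGBA and the same size");
  }
  const pixels = a.length / 4;
  let changed = 0;
  for (let i = 0; i < a.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        changed++;
        break; // count each pixel once
      }
    }
  }
  return pixels === 0 ? 0 : changed / pixels;
}

// Four white pixels as the reference; the implementation has one black pixel.
const reference = new Uint8Array(16).fill(255);
const implementation = Uint8Array.from(reference);
implementation.set([0, 0, 0, 255], 4);

console.log(diffRatio(reference, implementation)); // → 0.25
```

A CI job could fail when the ratio for a route exceeds an agreed budget and publish the diff image for review otherwise.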
Implementation Roadmap
Phase 1: Immediate (This Week)
- Add agent behavior rules to `.claude/rules/visual-review-tooling.md`
- Add `lint-icon-deps.sh` pre-commit check
- Add `checkFontLoading()` and `checkIconIntegrity()` to `visual-qa.ts`
- Create GitHub issue for regression test suite
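As a starting point for `checkFontLoading()`, the sketch below asks the browser which font faces of a family actually finished loading, instead of trusting a screenshot. The signature, the `PageLike` interface, and the result shape are assumptions. `PageLike` is a minimal stand-in that a Playwright `Page` is expected to satisfy, since `page.evaluate` accepts a string expression.

```typescript
// Minimal structural type so the sketch does not depend on Playwright's typings.
interface PageLike {
  evaluate(expression: string): Promise<unknown>;
}

interface FontCheckResult {
  family: string;
  loaded: boolean;
  statuses: string[];
}

async function checkFontLoading(page: PageLike, family: string): Promise<FontCheckResult> {
  const wanted = JSON.stringify(family);
  // Runs in the browser: wait for pending loads, then read each matching FontFace status.
  const expression = `document.fonts.ready.then(() =>
    Array.from(document.fonts)
      .filter((face) => face.family.replace(/["']/g, "") === ${wanted})
      .map((face) => face.status))`;
  const raw = await page.evaluate(expression);
  const statuses = Array.isArray(raw) ? raw.map(String) : [];
  // No registered face at all (for example, the CDN stylesheet never arrived) counts as a failure.
  const loaded = statuses.includes("loaded") && !statuses.includes("error");
  return { family, loaded, statuses };
}

// Usage inside visual-qa.ts (assumes an open Playwright page):
// const result = await checkFontLoading(page, "Material Symbols Outlined");
// if (!result.loaded) throw new Error(`Font not loaded: ${result.family}`);
```

Reading `FontFace.status` avoids relying on `document.fonts.check()` alone, which can report success for a family that was never declared.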
Phase 2: Short-Term (Next Sprint)
- Set up Playwright Test runner with `playwright.config.ts`
- Create `icons.spec.ts` with golden-file icon tests
- Create `pages.spec.ts` with per-route full-page screenshots
- Generate initial baselines from known-good build
- Add visual regression job to CI
Phase 3: Medium-Term (Next Month)
- Add component-level golden files for key UI sections
- Evaluate Figma for design refinement (if designer joins)
- Build Stitch-to-Playwright golden file pipeline
- Add design token validation (automated color/spacing checks)
- Implement structured design intent documents per feature
Phase 4: Long-Term (Quarterly)
- Cross-browser visual regression (Firefox, Safari)
- Accessibility regression testing
- Performance budget monitoring (CSS bundle size, load time)
- Design system component library with Storybook (if warranted)
Success Metrics
- Zero "agent said correct but wasn't" incidents — enforced by automated checks + human review.
- First-implementation accuracy > 80% — measured by rounds of feedback needed (target: ≤ 3 rounds, down from 17).
- Visual regression catch rate > 95%: measured as the share of visual bugs caught by CI out of all visual bugs found (by CI or by humans).
- Design-to-code cycle time < 2 hours per page — from mockup approval to merged PR.
Industry Research: AI-Assisted UI Development Landscape (2026)
Design Tool Chain: Stitch → Figma → Claude Code
The recommended 2026 workflow chains tools by strength:
| Tool | Role | When |
|---|---|---|
| Google Stitch | Explore — generate mockups from text/voice | Starting from zero, rapid ideation |
| Figma + Dev Mode | Refine — pixel-precise specs, design system | Designer refinement, developer handoff |
| v0 (Vercel) | Component generation — React + shadcn/ui | Individual component implementation |
| Claude Code + MCP | Full implementation — reads design context | Entire page/feature builds |
Key insight: Stitch and Figma are not competitors — they chain. Stitch generates first-pass designs in minutes (free), Figma allows designer refinement ($15/editor/mo Dev Mode), then Figma MCP feeds precise specs to Claude Code.
Figma MCP (official): claude plugin install figma@claude-plugins-official — extracts exact font sizes, spacing, colors, component structure from Figma files. Simpler setup than Stitch MCP. Supports bidirectional flow (Claude Code can push UIs back to Figma as editable layers).
Emerging Tools for Agent Visual Validation
| Tool | What | Maturity |
|---|---|---|
| ProofShot | Records browser sessions as proof artifacts for AI agents. Video + screenshots + errors bundled for human review. | New (March 2026), open source |
| claude-code-frontend-dev | 8-agent visual testing plugin. Auto-launches dev server → captures screenshot → Claude vision analyzes → iterates up to 5x. | Open source, experimental |
| Playwright MCP (@playwright/mcp) | Microsoft's MCP server. AI agent controls browser via accessibility tree. Most mature MCP for testing. | Widely adopted since mid-2025 |
| Momentic | AI-native E2E testing with self-healing locators. 99% false positive reduction. YC-backed. | SaaS, team pricing |
The Measurement-Based Approach (Most Promising)
The pattern that eliminates subjective visual judgment entirely:
- Extract spec from Figma via MCP (exact pixel values)
- Render implementation in headless Playwright
- Measure with `page.evaluate()` + `getComputedStyle()` (actual rendered values)
- Feed numerical deltas to Claude: "heading font-size is 28px, spec says 24px"
- Apply deterministic corrections (not "make it look better")
This is documented at vadim.blog/pixel-perfect-playwright-figma-mcp and is the most reliable way to get first-attempt accuracy.
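The delta step of this loop is straightforward to sketch once both sides are plain key-value maps. The example below assumes the spec values (from Figma) and the measured values (from `getComputedStyle()`) have already been collected; `compareStyles` and its types are hypothetical names, and the extraction steps are not shown.

```typescript
type StyleMap = Record<string, string>;

interface StyleDelta {
  property: string;
  expected: string;
  actual: string;
  message: string;
}

// Compares every property named in the spec with the measured value.
function compareStyles(element: string, spec: StyleMap, measured: StyleMap): StyleDelta[] {
  const deltas: StyleDelta[] = [];
  for (const [property, expected] of Object.entries(spec)) {
    const actual = measured[property] ?? "(not measured)";
    if (actual !== expected) {
      deltas.push({
        property,
        expected,
        actual,
        message: `${element} ${property} is ${actual}, spec says ${expected}`,
      });
    }
  }
  return deltas;
}

const headingSpec: StyleMap = { "font-size": "24px", "font-weight": "700" };
const headingMeasured: StyleMap = { "font-size": "28px", "font-weight": "700" };

console.log(compareStyles("heading", headingSpec, headingMeasured).map((d) => d.message));
// → [ 'heading font-size is 28px, spec says 24px' ]
```

Because `getComputedStyle()` returns resolved values (colors come back as `rgb(...)`, for instance), spec values would need to be normalized to the same form before a string comparison like this is meaningful.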
The DESIGN.md Standard
DESIGN.md has become a cross-tool standard for AI agent design context. awesome-design-md (4,385+ stars) provides 58 ready-made examples from real products. Our DESIGN.md follows this pattern already but should be enriched with the 9-section standard format (Visual Theme, Color Palette, Typography, Component Stylings, Layout, Depth/Elevation, Do's/Don'ts, Responsive, Agent Prompt Guide).
Design Token Validation
No mature generic tool exists for automated design token compliance. The best options are org-specific:
- IBM Carbon: `@carbon-design-system/stylelint-plugin-carbon-tokens`
- Atlassian: `@atlaskit/eslint-plugin-design-system` (has `ensure-design-token-usage` rule)
- Salesforce: `@salesforce-ux/eslint-plugin-slds`
For SecurityV0, a custom ESLint rule would be needed — flagging hardcoded Tailwind colors that should use our token system.
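The core of such a rule is a matcher for hardcoded color utilities. The sketch below is a standalone function rather than an ESLint rule; wiring it into a rule that walks `className` attributes is not shown. The utility prefixes covered and the `bg-surface-primary` token class in the example are assumptions.

```typescript
// Tailwind's default palette names; utilities built on these bypass the token system.
const PALETTE = [
  "slate", "gray", "zinc", "neutral", "stone", "red", "orange", "amber", "yellow",
  "lime", "green", "emerald", "teal", "cyan", "sky", "blue", "indigo", "violet",
  "purple", "fuchsia", "pink", "rose",
];
const UTILITIES = ["bg", "text", "border", "ring", "fill", "stroke"];

// e.g. bg-red-500 or border-slate-200/50 (with an opacity modifier)
const PALETTE_CLASS = new RegExp(
  `^(?:${UTILITIES.join("|")})-(?:${PALETTE.join("|")})-(?:50|[1-9]00|950)(?:/\\d+)?$`,
);
// e.g. text-[#1a2b3c]
const ARBITRARY_HEX = new RegExp(`^(?:${UTILITIES.join("|")})-\\[#[0-9a-fA-F]{3,8}\\]$`);

// Returns the class tokens that hardcode a color instead of using a design token.
function findHardcodedColors(classNames: string): string[] {
  return classNames.split(/\s+/).filter((token) => {
    const utility = token.slice(token.lastIndexOf(":") + 1); // drop variants such as hover:
    return PALETTE_CLASS.test(utility) || ARBITRARY_HEX.test(utility);
  });
}

console.log(
  findHardcodedColors("flex hover:bg-red-500 text-[#1a2b3c] bg-surface-primary border-slate-200/50"),
);
// → [ 'hover:bg-red-500', 'text-[#1a2b3c]', 'border-slate-200/50' ]
```

Token-based classes such as the hypothetical `bg-surface-primary` pass untouched, so the check only fires on values that skip the token system.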
Skills That Improve AI UI Output Quality
| Skill | Stars/Installs | Key Value |
|---|---|---|
| frontend-design (Anthropic official) | 277K+ installs | Anti-"AI slop" guidelines, distinctive aesthetics |
| Impeccable (Paul Bakaus, ex-Google) | 10,198 stars | 20 design commands (/polish, /audit, /typeset), curated anti-patterns |
| web-interface (Vercel) | — | 100+ rules: a11y, performance, UX |
How Teams Get "Right the First Time"
Teams achieving near-first-attempt accuracy do ALL of:
- Provide complete DESIGN.md with constraints
- Use Figma MCP for exact specifications (not vague descriptions)
- Install design-quality skills (Impeccable or frontend-design)
- Run measurement-based validation before human review
- Get stakeholder approval at design stage, not implementation stage — this is the single highest-impact change
Design Tool Strategy: Recommendation
Now: Stitch + DESIGN.md + Agent Rules
Continue using Stitch for ideation. Strengthen the DESIGN.md. Add agent behavior rules and automated checks. This addresses the immediate quality gap without new tool investment.
Next Sprint: Add Figma MCP
Install Figma MCP alongside Stitch MCP. Use Stitch for first-pass mockups, export to Figma for CEO/designer review, then use Figma MCP to feed precise specs to Claude Code. This closes the "agent misinterprets design intent" gap by giving it exact pixel values instead of screenshots.
Next Month: Measurement-Based Validation Loop
Build the Playwright + getComputedStyle() measurement pipeline. Extract specs from Figma MCP, compare numerically against rendered output, apply deterministic corrections. Of the approaches surveyed above, this is the most reliable route to first-attempt accuracy with AI agents.
Quarterly: Evaluate Enterprise Tools
- Applitools Eyes Figma plugin for design-to-code baselines (enterprise pricing)
- Chromatic if we add Storybook (component-level visual testing)
- ProofShot for proof bundling and PR integration
Open Questions
- Should we invest in Figma now ($15/editor/mo) or wait until we have a human designer?
- Is the measurement-based approach (Figma MCP → Playwright → numerical deltas) worth the setup effort?
- Should visual regression be a merge gate (blocking) or advisory (non-blocking)?
- Should we adopt Impeccable skill alongside frontend-design?
- How do we handle design changes mid-implementation (Stitch mockup updated while coding)?