Visual UX Development & Testing: Multi-Source Research Synthesis
Context
Problem: When using Google Stitch + Stitch MCP + Claude Code for UI development, Opus doesn't check screenshots after fixing — the LLM is visually blind. This leads to syntactically correct but visually broken output, design drift, and 15+ feedback iterations (documented in PR #154's analysis of 17 iteration rounds).
Research sources synthesized:
- Codex — closed-loop VLM verification, measurement-based validation
- Gemini Deep Research — comprehensive tool landscape, failure modes, tiered architecture
- Claude Opus (own web research) — newer tools: ProofShot, UI Visual Validator Agent, autoVerify, Meticulous.ai
- PR #149 — UX strategy research (Wiz positioning + AI dev process analysis)
- PR #154 — 5-layer prevention strategy with 4-phase roadmap
- PR #156 — Agentic UI development standards (4 core rules)
What We Already Have
This research builds on existing visual testing infrastructure in SV0:
| Tool | What it does | Status | Ref |
|---|---|---|---|
| `visual-qa.ts` | Headless Playwright: 11 pages, console errors, layout checks, screenshots | Active | sv0-platform |
| `ux-audit.ts` | Deep UX audit: flows, interactions, accessibility, performance | Active | sv0-platform |
| `/visual-review` skill | Claude reads screenshots + product docs, evaluates UI vs business vision | Adopted | Research |
| Visual diff pipeline | reg-cli pixel diff, interactive HTML at pr-N.sv0-visual-reviews.pages.dev | Implemented | Plan |
| PR preview envs | pr-N.dev.securityv0.com for live review | Active | CI |
What's missing (the gap this research addresses):
- Agent doesn't look at its own output: no closed-loop verification during development
- No deterministic pass/fail gate: `/visual-review` is advisory, not blocking
- No design-to-code comparison: agent has no design spec to compare against
- No component isolation: only page-level testing exists
Comparative Analysis: All Research Sources & PRs
Head-to-Head: Research Reports
| Dimension | Codex Research | Gemini Deep Research | Claude Research |
|---|---|---|---|
| Core thesis | Closed-loop VLM + measurement-based validation | 3-layer defense (pixel diff + DOM/structure + accessibility) | Agent already has tools — the gap is strict enforcement |
| Recommended foundation | Playwright + Figma MCP for spec extraction | Playwright toHaveScreenshot + tiered platform choice | Playwright + autoVerify (already built-in) + strict sub-agent |
| Design-to-code approach | Move from Stitch to Figma for implementation context; Stitch stays for ideation | Treat Stitch as design input, not visual oracle; Figma MCP for specs | Stitch MCP screen comparison + Figma MCP; both coexist |
| Agent verification | Agent receives computed style deltas ("expected 16px, actual 24px") | Agent must produce artifacts + meet thresholds, not claim "looks fine" | UI Visual Validator sub-agent (13-point checklist, defaults to failure) |
| Pass/fail mechanism | Numerical: getComputedStyle() delta comparison | Hybrid: pixel diff (pixelmatch) + perceptual (SSIM/LPIPS) + ARIA snapshots | Deterministic pixel diff + LLM as complement only |
| CI integration | Playwright scripts run post-fix | 3-tier platform choice (OSS → SaaS → Enterprise) | ProofShot on PRs + Playwright in CI |
| Component testing | Not addressed | Storybook stories as "specs for free"; Chromatic/Loki | Chromatic with TurboSnap (free tier) |
| Unique tools found | Midscene.js, UI-Tars, Set-of-Marks overlays | Skia Gold, LLMShot, BackstopJS Docker rendering, reg-suit | ProofShot, UI Visual Validator Agent, Meticulous.ai, Stagehand, Glance MCP |
| Biggest blind spot | No CI pipeline design; no component-level strategy | No awareness of Claude Code autoVerify or agent sub-agents | Less depth on perceptual metrics and cross-browser scaling |
| Depth | Focused & actionable (4 recommendations) | Broadest (7 sections, comparison tables, mermaid diagrams) | Most tool-aware (found 20+ specific tools with URLs) |
| Recency | References 2024-2025 tools | References through early 2026 | Most current (April 2026 tools, Cursor 3, VS Code 1.112) |
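The "expected 16px, actual 24px" feedback in the agent-verification row can be sketched as a small comparison function. This is a minimal sketch, not code from any of the research reports: `StyleMap`, `diffStyles`, and `describeDeltas` are hypothetical names, and the measured values are assumed to have been collected already (for example with `getComputedStyle()` inside a Playwright `page.evaluate` call).

```typescript
// Hypothetical shapes: design spec and measured values, both keyed by CSS property.
type StyleMap = Record<string, string>;

interface StyleDelta {
  property: string;
  expected: string;
  actual: string;
}

// Compare measured computed styles against the spec and list every mismatch.
function diffStyles(expected: StyleMap, actual: StyleMap): StyleDelta[] {
  const deltas: StyleDelta[] = [];
  for (const [property, want] of Object.entries(expected)) {
    const got = actual[property] ?? "(missing)";
    if (got.trim() !== want.trim()) {
      deltas.push({ property, expected: want, actual: got });
    }
  }
  return deltas;
}

// Render deltas in the "expected X, actual Y" form an agent can act on.
function describeDeltas(selector: string, deltas: StyleDelta[]): string[] {
  return deltas.map(
    (d) => `${selector} ${d.property}: expected ${d.expected}, actual ${d.actual}`
  );
}

const spec: StyleMap = { "padding-top": "16px", "font-size": "14px" };
const measured: StyleMap = { "padding-top": "24px", "font-size": "14px" };
console.log(describeDeltas(".card", diffStyles(spec, measured)));
```

An empty result means every specified property matched. Properties present only in the measured map are ignored, on the assumption that the spec lists just the values that matter.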
Head-to-Head: Existing PRs vs New Research
| Aspect | PR #154 (Quality Plan) | PR #156 (Standards) | New Research Adds |
|---|---|---|---|
| Problem diagnosis | Detailed: 17 iteration rounds analyzed, 5 root causes identified | Concise: 4 rules | All three reports confirm the same root causes. New: LLMs are "overly lenient validators" (not just blind — actively permissive) |
| Agent rules | 5 behavior rules (no SVG hand-drawing, no CDN fonts, no self-approval) | 4 core rules (no blind fixes, required validation, strict tokens, CI regression) | New: UI Visual Validator — a ready-made sub-agent that enforces 13 rules including responsive breakpoints, dark/light mode, WCAG contrast, touch targets. Replaces hand-written rules |
| Measurement | Playwright getComputedStyle() + Figma MCP specs | Playwright scripts for metric extraction | New: ProofShot captures video + screenshots + console errors automatically. Meticulous.ai records real sessions → auto-generates tests |
| Design pipeline | Stitch (ideation) → Figma (refinement) → Claude Code (implementation) | Not addressed | All sources agree on this split. Gemini adds: "Stitch = what to test, not the visual oracle" |
| CI gate | Phase 2 (next sprint): Playwright test runner + golden files | Playwright Action blocks PRs with visual drift | New: ProofShot's `proofshot pr` command posts artifacts directly to GitHub PRs. Chromatic TurboSnap snapshots only changed components (85% faster) |
| Timeline | 4 phases (this week → quarterly) | Immediate (standards doc) | Option B below aligns with PR #154's phases but adds specific tool choices |
Where Sources Disagree
| Topic | View A | View B | Resolution |
|---|---|---|---|
| Stitch's role | Codex: "Move from Stitch to Figma for implementation" | Claude: "Keep Stitch MCP screen comparison alongside Figma" | Keep both: Stitch for ideation + stakeholder review, Figma MCP for pixel-precise specs. Stitch MCP screen comparison is still useful for quick checks |
| LLM vision for pass/fail | Codex: Use VLM to compare screenshot vs mockup | Gemini: "VLM hallucination risk — never let agent claim success without deterministic thresholds" | Gemini is right: LLM vision for triage/summary only, deterministic metrics for pass/fail |
| Perceptual metrics | Gemini: SSIM + LPIPS for ranking diffs | Codex/Claude: Pixel diff sufficient | Start with pixel diff (Playwright pixelmatch). Add perceptual metrics only if false positive rate is too high |
| Component vs page testing | Codex: Page-level only | Gemini/Claude: Both component (Storybook) + page (E2E) | Both: Component catches isolated regressions, page catches integration issues |
| Existing tools vs new tools | PR #154: Custom visual-qa.ts script | Claude research: UI Visual Validator (ready-made, 13-point) | Evaluate UI Visual Validator first — if it covers our needs, skip building custom. Fall back to custom only for SV0-specific rules |
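The resolution "deterministic metrics for pass/fail" comes down to a number compared against a threshold. The sketch below is a deliberately naive illustration of that idea, not pixelmatch's algorithm (pixelmatch additionally uses a perceptual color distance and anti-aliasing detection). `diffRatio` and `passesGate` are hypothetical names, and the buffers are assumed to be already-decoded RGBA data of equal size.

```typescript
// Fraction of pixels that differ between two same-size RGBA buffers.
// `tolerance` is the per-channel difference (0-255) still treated as equal.
function diffRatio(before: Uint8Array, after: Uint8Array, tolerance = 0): number {
  if (before.length !== after.length || before.length % 4 !== 0) {
    throw new Error("buffers must be RGBA data of identical size");
  }
  const totalPixels = before.length / 4;
  if (totalPixels === 0) return 0;

  let changed = 0;
  for (let i = 0; i < before.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(before[i + c] - after[i + c]) > tolerance) {
        changed++;
        break; // count each pixel at most once
      }
    }
  }
  return changed / totalPixels;
}

// Deterministic gate: the verdict is a comparison, not an LLM opinion.
function passesGate(ratio: number, threshold: number): boolean {
  return ratio <= threshold;
}

// 2x2 image where one of four pixels changed its red channel.
const baseline = new Uint8Array([0, 0, 0, 255, 10, 10, 10, 255, 0, 0, 0, 255, 0, 0, 0, 255]);
const current = new Uint8Array([0, 0, 0, 255, 200, 10, 10, 255, 0, 0, 0, 255, 0, 0, 0, 255]);
const ratio = diffRatio(baseline, current);
console.log(ratio, passesGate(ratio, 0.01)); // 0.25 false
```

An LLM can still summarize what changed, but under this split the merge decision depends only on `passesGate`.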
Tool Landscape (Consolidated)
Tool verification results (2026-04-07):
- ProofShot: Verified real. `AmElmo/proofshot`, 767 stars, actively maintained. Description accurate.
- UI Visual Validator Agent: Verified real. `cryptonerdcn/UI-Visual-Validator-Agent`, 38 stars, last updated 2026-03-24. 13-point checklist claim not independently confirmed.
- Glance MCP: Unverified. Only a 0-star repo with no description exists (`sandraschi/glance-mcp`). Not an established tool. Treat as unreliable.
- LLMShot: Unverified. Only a 0-star abandoned Shell repo exists (`markabrahams/llmshot`). No evidence it is a visual testing tool.
- autoVerify / `.claude/launch.json`: Hallucinated. Neither exists in Claude Code. The equivalent is the `/visual-review` skill + `settings.json`.
| Category | OSS / Free | SaaS | Enterprise |
|---|---|---|---|
| Visual regression | Playwright toHaveScreenshot, BackstopJS, Lost Pixel, reg-suit | Chromatic (free tier), Percy (5K free/mo), Argos | Applitools Eyes, Skia Gold |
| Browser automation | Playwright MCP, Stagehand, Browser Use, Glance | — | — |
| Screenshot CI | ProofShot, Playwright artifacts | Meticulous.ai | Sauce Visual |
| Design-to-code | Figma MCP | Applitools Figma Plugin | Applitools Centra |
| Agent visual QA | UI Visual Validator Agent, Claude Code Frontend Dev | — | — |
| Component testing | Storybook + Loki | Chromatic, Happo | Applitools Storybook |
ProofShot vs SV0 Internal Visual Tools
ProofShot (github.com/AmElmo/proofshot, 767 stars, v1.3.5, MIT) was the most promising external tool found across all three research sources. Deep-dive comparison against our existing pipeline:
Feature Comparison
| Capability | SV0 Internal Tools | ProofShot | Winner |
|---|---|---|---|
| Interactive comparison UI | reg-cli: slider, overlay, blend, toggle, side-by-side (5 modes) | No interactive viewer — CLI text + static diff PNGs | SV0 |
| Prod vs dev comparison | visual-review.yml environments mode — compares two live URLs | No support — sessions tied to single URL origin; manual workaround only | SV0 |
| CI/CD integration | GitHub Actions → Cloudflare Pages, auto-deployed per PR | No CI integration, no GitHub Action | SV0 |
| Multi-page capture | 11+ routes, detail pages at multiple scroll positions, manifest.json inventory | Agent-driven navigation only, no declarative route list | SV0 |
| Sprint evidence mapping | Maps action plan items to before/after screenshots with verdict tracking | Not applicable | SV0 |
| Vite/ESM support | Works (Playwright) | Broken — CDP navigation fails with <script type="module"> (Issue #25, showstopper) | SV0 |
| Video recording | No | Yes (.webm screencast of full session) | ProofShot |
| AI agent skill installation | Custom /visual-review skill (SV0-specific) | Auto-installs skills for Claude Code, Cursor, Codex, Gemini CLI, Windsurf | ProofShot |
| Console/server error capture | visual-qa.ts checks console errors + failed network requests | Captures console output + server stderr, regex scanning for 10+ languages | Tie |
| PR comment with visual proof | Interactive HTML deployed to pr-N.sv0-reviews.pages.dev (linked from PR) | Markdown comment with inline screenshots + video embed | Tie (different strengths) |
| Pixel diff engine | reg-cli (mature, configurable threshold) | Delegates to agent-browser (opaque, basic pixel %) | SV0 |
| Maturity | Months of production use | 5 weeks old (created 2026-02-27) | SV0 |
ProofShot Limitations (Verified)
- Critical bug: Vite/React apps render blank pages via CDP (Issue #25) — blocks use with SV0 platform
- No cross-environment comparison: Can't compare prod vs dev out of the box
- No interactive diff viewer: No sliders, overlays, or side-by-side — a major regression from reg-cli
- No baseline management: Manual directory specification vs reg-cli's automated workflow
- No DOM snapshots or network capture: Only screenshots + video + console
- Thin diff logic: All comparison delegated to agent-browser; ProofShot has zero image processing code
What ProofShot Does Better
- Universal agent integration: One `proofshot install` command teaches any AI agent (not just Claude Code) the verification workflow. Our `/visual-review` skill is SV0-specific.
- Video proof: Session recordings (.webm) capture the full interaction flow, which is valuable for debugging, not just pass/fail.
- Self-contained HTML viewer: `viewer.html` bundles video + timeline + logs + screenshots in one offline file (but is local-only, not uploaded to PRs).
Verdict
Do not adopt ProofShot as a replacement — SV0's tools are more capable in every dimension that matters for our workflow (environment comparison, interactive diffs, CI integration, Vite support). However, two ProofShot ideas are worth borrowing:
- Video recording: add Playwright screencast capture to `visual-screenshot.ts` for debugging context
- Universal agent skill: generalize our `/visual-review` skill pattern so other agents (not just Claude Code) can trigger visual verification
Proposal: Dual-Output Visual Tool (Human + Agent)
The Problem
Today our visual tools produce human-optimized output only: interactive HTML reports with sliders and overlays at pr-N.sv0-reviews.pages.dev. This works well for human reviewers but creates a gap:
- Humans get rich interactive comparison → manually decide "looks good" or "broken"
- Agents get nothing — they can't parse interactive HTML, can't use sliders, can't interpret overlay diffs
The current workflow requires a human in the loop at every visual checkpoint. When an agent claims "development is completed" and a PR is opened, a human must:
- Wait for deploy to dev
- Open the visual review URL
- Manually compare prod vs dev using sliders
- Identify issues and feed them back to the agent
This is the bottleneck. The agent should be able to self-assess before the human ever looks.
The Proposal: One Run, Two Outputs
Extend the existing visual-diff-report.ts pipeline to produce two outputs from a single comparison run:
Human output (existing, enhanced):
- Interactive HTML with slider/overlay/blend/toggle/side-by-side (already implemented via reg-cli)
- Deployed to Cloudflare Pages per PR (already implemented)
- Video recording of the capture session (new, borrowed from ProofShot concept)
Agent output (new):
- `visual-report.md`: structured markdown summary that an LLM can parse:
  - Per-page status: `PASS`/`FAIL`/`CHANGED`/`NEW`/`REMOVED`
  - Pixel diff percentage per page (from reg-cli `reg.json`)
  - List of pages exceeding threshold, ranked by severity
  - Console errors and failed network requests per page
  - Before/after screenshot paths (agent can read these via vision)
  - Computed style deltas for key elements (if measurement pipeline is enabled)
- `visual-report.json`: machine-readable structured data:
{
"summary": { "pass": 8, "fail": 2, "changed": 1, "new": 0, "removed": 0 },
"threshold": 0.01,
"pages": [
{
"route": "/dashboard",
"status": "FAIL",
"diffPercent": 4.2,
"diffPixels": 12847,
"screenshot": { "before": "before/dashboard.png", "after": "after/dashboard.png", "diff": "diff/dashboard.png" },
"consoleErrors": [],
"verdict": "Layout shift in header navigation — 4.2% pixel diff exceeds 1% threshold"
}
]
}
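The per-page statuses and the `summary` block in the JSON above could be derived along the following lines. This is a sketch under stated assumptions: `PageInput`, `classify`, and `buildReport` are hypothetical names, and the status rules (FAIL above the threshold, CHANGED for a non-zero diff within it) are one plausible reading of the example, not a documented contract. Note the unit mismatch visible in the example: `threshold` is a ratio (0.01 = 1%) while `diffPercent` is a percentage.

```typescript
type PageStatus = "PASS" | "FAIL" | "CHANGED" | "NEW" | "REMOVED";

// Hypothetical per-page input derived from a diff run.
interface PageInput {
  route: string;
  hasBefore: boolean; // screenshot exists in the baseline
  hasAfter: boolean; // screenshot exists in the current capture
  diffPercent: number; // 0-100, as in the JSON example
}

interface Summary {
  pass: number;
  fail: number;
  changed: number;
  new: number;
  removed: number;
}

// `threshold` is a ratio, so the percentage is divided by 100 before comparing.
function classify(page: PageInput, threshold: number): PageStatus {
  if (!page.hasBefore) return "NEW";
  if (!page.hasAfter) return "REMOVED";
  if (page.diffPercent / 100 > threshold) return "FAIL";
  return page.diffPercent > 0 ? "CHANGED" : "PASS";
}

function buildReport(pages: PageInput[], threshold: number) {
  const summary: Summary = { pass: 0, fail: 0, changed: 0, new: 0, removed: 0 };
  const results = pages.map((page) => {
    const status = classify(page, threshold);
    summary[status.toLowerCase() as keyof Summary] += 1;
    return { route: page.route, status, diffPercent: page.diffPercent };
  });
  return { summary, threshold, pages: results };
}

const report = buildReport(
  [
    { route: "/dashboard", hasBefore: true, hasAfter: true, diffPercent: 4.2 },
    { route: "/settings", hasBefore: true, hasAfter: true, diffPercent: 0.5 },
    { route: "/alerts", hasBefore: false, hasAfter: true, diffPercent: 0 },
  ],
  0.01
);
console.log(JSON.stringify(report, null, 2));
```

Keeping classification in one pure function means the markdown and JSON outputs cannot disagree about a page's status.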
How Agents Would Use This
During development (agent self-check):
- Agent makes UI changes
- Agent runs `npx tsx scripts/visual-screenshot.ts` → captures current state
- Agent runs `npx tsx scripts/visual-diff-report.ts --baseline main --format agent` → gets `visual-report.md`
- Agent reads the markdown: sees 2 pages FAIL, reads the diff percentages and verdicts
- Agent fixes the failing pages and re-runs — no human needed for the iteration loop
During PR review (human + agent):
- CI runs the full pipeline → deploys interactive HTML for humans + generates `visual-report.md` for agents
- Human reviews the interactive sliders for subjective quality
- Agent (or CI bot) enforces the deterministic thresholds — blocks merge if any page exceeds limit
- Both outputs come from the same single run — no duplicate work
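The "blocks merge if any page exceeds limit" step could look roughly like the gate below. It is a sketch only: `GateConfig`, `perRoute`, and `evaluateGate` are hypothetical names, and the per-route override is an assumption about how page-specific thresholds might be configured. Thresholds are ratios (0.01 = 1%) and `diffPercent` is a percentage, as in the JSON example.

```typescript
interface GatePage {
  route: string;
  diffPercent: number; // 0-100
}

interface GateConfig {
  defaultThreshold: number; // ratio, e.g. 0.01 = 1%
  perRoute?: Record<string, number>; // hypothetical per-page overrides
}

interface GateResult {
  ok: boolean;
  violations: string[]; // worst offender first
}

// Pure decision function: same input always yields the same verdict.
function evaluateGate(pages: GatePage[], config: GateConfig): GateResult {
  const failing = pages
    .filter((p) => {
      const limit = config.perRoute?.[p.route] ?? config.defaultThreshold;
      return p.diffPercent / 100 > limit;
    })
    .sort((a, b) => b.diffPercent - a.diffPercent);

  return {
    ok: failing.length === 0,
    violations: failing.map((p) => `${p.route}: ${p.diffPercent}% diff`),
  };
}

const result = evaluateGate(
  [
    { route: "/dashboard", diffPercent: 4.2 },
    { route: "/reports", diffPercent: 1.5 },
    { route: "/login", diffPercent: 0.2 },
  ],
  { defaultThreshold: 0.01, perRoute: { "/reports": 0.02 } }
);
console.log(result.ok, result.violations);
```

A CI wrapper would translate `ok === false` into a non-zero exit code. The same function can back the agent's local self-check, so both paths apply identical limits.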
What Changes in Existing Tools
| Script | Change | Effort |
|---|---|---|
| `visual-diff-report.ts` | Add --format agent flag; emit visual-report.md + visual-report.json alongside index.html | Small: reg-cli already produces reg.json with all the data; this is a formatter |
| `visual-qa.ts` | Already produces markdown report; add structured JSON output with per-page pass/fail | Small |
| `visual-review.yml` | Upload visual-report.md as PR comment (in addition to deploying HTML) | Small |
| CLAUDE.md / agent rules | Add: "After UI changes, run visual-diff and read visual-report.md before claiming done" | Config only |
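Since the agent output is described as a formatter, the markdown side might be as small as the sketch below. The layout of `visual-report.md` is not specified in this document, so the table format, `ReportPage`, `escapeCell`, and `renderMarkdown` are all assumptions made for illustration.

```typescript
// Hypothetical subset of a page entry from visual-report.json.
interface ReportPage {
  route: string;
  status: string;
  diffPercent: number; // 0-100
  verdict: string;
}

// Keep free-text verdicts from breaking the markdown table.
function escapeCell(text: string): string {
  return text.replace(/\|/g, "\\|").replace(/\s*\n\s*/g, " ");
}

// Render pages worst-first so an agent reads the most severe failure at the top.
function renderMarkdown(threshold: number, pages: ReportPage[]): string {
  const lines = [
    "# Visual report",
    "",
    `Threshold: ${(threshold * 100).toFixed(2)}%`,
    "",
    "| Route | Status | Diff % | Verdict |",
    "|---|---|---|---|",
  ];
  const ranked = [...pages].sort((a, b) => b.diffPercent - a.diffPercent);
  for (const p of ranked) {
    lines.push(
      `| ${p.route} | ${p.status} | ${p.diffPercent.toFixed(2)} | ${escapeCell(p.verdict)} |`
    );
  }
  return lines.join("\n");
}

console.log(
  renderMarkdown(0.01, [
    { route: "/login", status: "PASS", diffPercent: 0, verdict: "No change" },
    { route: "/dashboard", status: "FAIL", diffPercent: 4.2, verdict: "Layout shift in header navigation" },
  ])
);
```

The same string can be written to disk for the agent and posted as the PR comment, which keeps the two views identical.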
Comparison with ProofShot's Approach
| Aspect | ProofShot | SV0 Dual-Output (proposed) |
|---|---|---|
| Agent gets structured data | SUMMARY.md (basic: error count + screenshot list) | visual-report.md + .json (per-page status, diff %, thresholds, verdicts) |
| Human gets interactive UI | No (local viewer.html only, not deployed) | Yes (reg-cli HTML at Cloudflare Pages) |
| Environment comparison | No | Yes (prod vs dev, PR vs main) |
| Single run produces both | No (separate proofshot pr step) | Yes (one visual-diff-report.ts run) |
| Pass/fail thresholds | No (just reports percentages) | Yes (configurable per-page thresholds, CI gate) |
| Agent self-check loop | Agent reads SUMMARY.md manually | Agent reads visual-report.md, sees PASS/FAIL, iterates |
Why This Is Better Than Adopting an External Tool
- Builds on what works: reg-cli + Playwright + Cloudflare Pages pipeline is already battle-tested
- Zero new dependencies: The agent output is a formatter on top of existing `reg.json` data
- SV0-specific: Includes routes, entity pages, scroll positions, sprint evidence; no external tool knows our page inventory
- Dual audience by design: Not an afterthought — human and agent outputs are first-class from the same run
Process Options
Option A: Minimal — Playwright + autoVerify
Note: `autoVerify` and `.claude/launch.json` failed tool verification (see the 2026-04-07 results above) and should not be assumed to exist. The equivalent capability in SV0 today is the `/visual-review` skill + existing Playwright scripts. This option's incremental value over existing infrastructure is limited to adding `toHaveScreenshot()` assertions with baselines.
Timeline: 1 week | Cost: $0 | New dependencies: 0
| What | How |
|---|---|
| Enable autoVerify | .claude/launch.json config |
| Playwright snapshots | toHaveScreenshot() on 10-15 critical screens |
| Baselines in repo | maxDiffPixels threshold, manual updates |
| Agent runs tests | npx playwright test after every UI change |
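The snapshot rows in the table above map onto a handful of Playwright settings. The fragment below is a sketch of what a `playwright.config.ts` might contain. Every value is a placeholder rather than SV0's actual configuration, `visualConfig` is a hypothetical name, and the object is written without imports so it can be read in isolation.

```typescript
// Sketch of playwright.config.ts contents. In a real config this object would be
// the file's default export (typically wrapped in defineConfig from @playwright/test).
const visualConfig = {
  testDir: "./tests/visual", // hypothetical location of the screenshot specs
  expect: {
    toHaveScreenshot: {
      maxDiffPixels: 100, // placeholder tolerance; tune per project
      animations: "disabled", // freeze CSS animations and transitions before capture
    },
  },
  use: {
    viewport: { width: 1280, height: 720 }, // fixed size keeps baselines comparable
  },
  // Single browser, matching the "single Chromium" limitation of this option.
  projects: [{ name: "chromium", use: { browserName: "chromium" } }],
};

console.log(JSON.stringify(visualConfig.expect));
```

Each critical screen would then need one spec that calls `await expect(page).toHaveScreenshot("dashboard.png")`, and baselines are refreshed manually with `npx playwright test --update-snapshots`.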
Pros:
- Zero new dependencies or services
- Fast to implement, fully deterministic
- Already addresses the core "blind agent" problem
- Codex's measurement approach (`getComputedStyle`) fits here too
Cons:
- No cross-browser coverage (single Chromium)
- No design-to-code comparison (agent still guesses vs design intent)
- No component isolation (only catches page-level issues)
- Manual baseline management scales poorly
- No PR-level visual artifacts for human review
- Doesn't address PR #154's finding that 15+ iterations were needed; it only catches obvious breaks
Best for: Quick win while evaluating larger options. Solves the "agent never looks" problem immediately.
Option B: Layered Pipeline (Recommended)
Timeline: 2-3 sprints | Cost: $0-50/mo (Chromatic free tier) | New dependencies: ProofShot, UI Visual Validator, Chromatic
Synthesizes the best ideas from all sources:
| Layer | Source | What | When |
|---|---|---|---|
| 1. Strict agent rules | PR #156 + Claude research | UI Visual Validator sub-agent (13-point checklist, defaults to failure) + /visual-review skill (autoVerify failed tool verification) + CLAUDE.md rules: "No 'fixed' claim without screenshot evidence" | Immediate |
| 2. Measurement pipeline | Codex research + PR #154 | Playwright toHaveScreenshot() + getComputedStyle() extraction vs design specs. Mask volatile elements, disable animations, freeze time | Sprint 1 |
| 3. CI visual gate | Gemini research + Claude research + ProofShot analysis | Dual-output visual-diff-report (interactive HTML for humans + visual-report.md/.json for agents). CI posts agent-readable summary to PR. Rule: "No merge without visual artifacts + diffs within threshold" | Sprint 2 |
| 4. Component regression | Gemini research + Claude research | Chromatic for Storybook (free tier). TurboSnap = only snapshot changed components. Every Stitch screen → 1 Storybook story + 1 E2E screenshot | Sprint 3 |
Prerequisite: SV0 does not currently use Storybook. Adding Storybook (component isolation, stories for each screen) is ~1 sprint of setup before Chromatic integration can begin.
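The Layer 4 rule "every Stitch screen → 1 Storybook story + 1 E2E screenshot" can be checked mechanically once the three inventories exist. The sketch below assumes each inventory is available as a list of screen identifiers. How those lists are produced (Stitch export, story file scan, screenshot manifest) is left open, and `findCoverageGaps` is a hypothetical helper.

```typescript
type Artifact = "story" | "e2e";

interface CoverageGap {
  screen: string;
  missing: Artifact[];
}

// Report every design screen that lacks a story, an E2E screenshot, or both.
function findCoverageGaps(
  screens: string[],
  stories: Set<string>,
  e2eShots: Set<string>
): CoverageGap[] {
  const gaps: CoverageGap[] = [];
  for (const screen of screens) {
    const missing: Artifact[] = [];
    if (!stories.has(screen)) missing.push("story");
    if (!e2eShots.has(screen)) missing.push("e2e");
    if (missing.length > 0) gaps.push({ screen, missing });
  }
  return gaps;
}

const gaps = findCoverageGaps(
  ["dashboard", "alerts", "settings"],
  new Set(["dashboard", "alerts"]),
  new Set(["dashboard"])
);
console.log(JSON.stringify(gaps));
```

An empty result means full coverage. A non-empty one could fail CI or simply be listed in the PR comment, depending on how strict the rule should be.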
Pros:
- Addresses all root causes from PR #154's analysis (blind fixes, design drift, no guardrails)
- Layered defense: catches issues at agent time, CI time, AND review time
- UI Visual Validator is battle-tested (13-point checklist > hand-written rules from PR #156)
- ProofShot gives human reviewers video proof, not just screenshots
- Incremental: each layer works independently, can stop at any layer
- Mostly OSS (Chromatic free tier for component-level)
- Aligns with PR #154's 4-phase roadmap but adds specific tool choices
Cons:
- More setup than Option A (but each layer is independent)
- Chromatic adds a SaaS dependency (can substitute with Lost Pixel if needed)
- No design-to-code comparison yet (added in Option C)
- Storybook required for Layer 4 (skip if not using it)
Best for: Teams that want comprehensive visual QA without enterprise pricing. Solves both "agent is blind" and "human reviewer lacks evidence."
Option C: Full Stack — Design-to-Production
Timeline: Quarter | Cost: $200-2000/mo (Applitools, Meticulous) | New dependencies: Everything in B + Figma MCP, Applitools, Meticulous.ai, Stagehand
Everything in Option B, plus:
| Layer | Source | What |
|---|---|---|
| 5. Design source of truth | Codex + Gemini | Figma MCP for pixel-precise spec extraction. Stitch stays for ideation. Applitools Figma Plugin: compare production screenshots against Figma designs. Coverage map: every Stitch screen must have tests |
| 6. AI triage | Gemini + Claude research | Chromatic/Percy Visual Review Agent (reduces review burden ~40%). LLM summarizes diffs, pass/fail stays deterministic. Accessibility gates: toMatchAriaSnapshot() + axe-core |
| 7. Zero-maintenance testing | Claude research | Meticulous.ai: records user sessions → auto-generates visual tests → posts diffs on PRs. Stagehand: self-healing browser automation for dynamic content |
Pros:
- End-to-end from design intent to production verification
- Closes the Stitch → Figma → Code → Screenshot → Diff → Design comparison loop
- AI triage reduces human review burden by ~40%
- Meticulous.ai means no manual test writing for new pages
- Self-healing tests (Stagehand) reduce maintenance
- Accessibility is a first-class gate, not an afterthought
Cons:
- Significant cost (Applitools ~$500+/mo, Meticulous pricing varies)
- Complexity: 7 layers = more things that can break
- Figma MCP has limitations (business logic lost in roundtrip, ~6 free uses/mo)
- Quarter timeline before full value
- May be over-engineered for current team size
Best for: Teams scaling to multiple designers + developers, or where design fidelity is a competitive differentiator.
Decision Matrix
| Factor | Option A | Option B | Option C |
|---|---|---|---|
| Solves "agent is blind" | Yes | Yes | Yes |
| Solves "15+ iteration rounds" | Partially | Mostly | Yes |
| Design-to-code verification | No | No | Yes |
| Component isolation | No | Yes (Chromatic) | Yes |
| Human reviewer gets evidence | No (just pass/fail) | Yes (ProofShot video + screenshots) | Yes (+ AI triage) |
| Cross-browser | No | No (add Percy in C) | Yes |
| Accessibility gates | No | No | Yes |
| Time to first value | 1 week | 1 week (Layer 1) | 1 week (Layer 1) |
| Time to full value | 1 week | 2-3 sprints | Quarter |
| Monthly cost | $0 | $0-50 | $200-2000 |
| Maintenance burden | Low | Medium | High (but self-healing) |
How This Relates to Existing PRs
| PR | Status | Relationship to This Plan |
|---|---|---|
| #149 | Open | Provides the "why": AI agents produce mediocre UI due to completeness bias, no visual feedback, flat spatial reasoning. This plan addresses the "how to fix" |
| #154 | Open | Most aligned with Option B. PR #154's 5-layer prevention strategy maps to our 4 layers. Our plan adds specific tool choices (ProofShot, UI Visual Validator, Chromatic) that PR #154 left as "TBD" |
| #156 | Open | The 4 core rules become Layer 1 agent behavior rules. UI Visual Validator sub-agent supersedes hand-written rules with a battle-tested 13-point checklist |
| Adopted research | Adopted | Claude Code UI Testing produced /visual-review skill. This plan extends it with deterministic pass/fail gates and CI enforcement |
| Visual diff pipeline | Implemented | reg-cli + Cloudflare Pages provides before/after HTML diffs on PRs. This plan adds threshold-based blocking and agent-time verification |
Recommendation: Merge #154 and #156 as foundation docs, then implement chosen option using them as the specification. Update the docs as tool choices are finalized.
Verification Plan
After implementing the chosen option:
- Run a UI change through the full pipeline end-to-end
- Intentionally introduce a visual regression (wrong color, broken spacing) — verify pipeline catches it
- Verify agent refuses to claim "fixed" without screenshot evidence
- Check CI blocks PR merge when visual diff exceeds threshold
- Measure: feedback iterations needed vs the 15+ documented in PR #154's analysis
Next Action
Status: research-complete
Decision needed from: Product Owner
Options:
- Adopt Option B (Layered Pipeline, recommended): create GitHub issue for 4-layer implementation across 2-3 sprints, building on existing visual-qa.ts + reg-cli pipeline. Includes dual-output proposal (human HTML + agent markdown/JSON from single run).
- Adopt Option A (Minimal): add `toHaveScreenshot()` assertions to existing Playwright scripts only
- Adopt Option C (Full Stack): create GitHub issue for quarter-long implementation with enterprise tooling
- Adopt Dual-Output only: implement `visual-report.md` + `.json` output from existing visual-diff-report.ts without other layers (fastest path to agent self-check)
- Defer: revisit after current sprint priorities are delivered
GitHub Issue: not yet created