Visual UX Development & Testing: Multi-Source Research Synthesis

Context

Problem: When using Google Stitch + Stitch MCP + Claude Code for UI development, Opus does not check screenshots after applying a fix, so the LLM is effectively working blind. This leads to syntactically correct but visually broken output, design drift, and 15+ feedback iterations (PR #154 analyzes 17 such iteration rounds).

Research sources synthesized:

Codex — closed-loop VLM verification, measurement-based validation
Gemini Deep Research — comprehensive tool landscape, failure modes, tiered architecture
Claude Opus (own web research) — newer tools: ProofShot, UI Visual Validator Agent, autoVerify, Meticulous.ai
PR #149 — UX strategy research (Wiz positioning + AI dev process analysis)
PR #154 — 5-layer prevention strategy with 4-phase roadmap
PR #156 — Agentic UI development standards (4 core rules)

What We Already Have

This research builds on existing visual testing infrastructure in SV0:

Tool	What it does	Status	Ref
`visual-qa.ts`	Headless Playwright: 11 pages, console errors, layout checks, screenshots	Active	sv0-platform
`ux-audit.ts`	Deep UX audit: flows, interactions, accessibility, performance	Active	sv0-platform
`/visual-review` skill	Claude reads screenshots + product docs, evaluates UI vs business vision	Adopted	Research
Visual diff pipeline	reg-cli pixel diff, interactive HTML at `pr-N.sv0-visual-reviews.pages.dev`	Implemented	Plan
PR preview envs	`pr-N.dev.securityv0.com` for live review	Active	CI

What's missing (the gap this research addresses):

Agent doesn't look at its own output — no closed-loop verification during development
No deterministic pass/fail gate — /visual-review is advisory, not blocking
No design-to-code comparison — agent has no design spec to compare against
No component isolation — only page-level testing exists

Comparative Analysis: All Research Sources & PRs

Head-to-Head: Research Reports

Dimension	Codex Research	Gemini Deep Research	Claude Research
Core thesis	Closed-loop VLM + measurement-based validation	3-layer defense (pixel diff + DOM/structure + accessibility)	Agent already has tools — the gap is strict enforcement
Recommended foundation	Playwright + Figma MCP for spec extraction	Playwright `toHaveScreenshot` + tiered platform choice	Playwright + autoVerify (already built-in) + strict sub-agent
Design-to-code approach	Move from Stitch to Figma for implementation context; Stitch stays for ideation	Treat Stitch as design input, not visual oracle; Figma MCP for specs	Stitch MCP screen comparison + Figma MCP; both coexist
Agent verification	Agent receives computed style deltas ("expected 16px, actual 24px")	Agent must produce artifacts + meet thresholds, not claim "looks fine"	UI Visual Validator sub-agent (13-point checklist, defaults to failure)
Pass/fail mechanism	Numerical: `getComputedStyle()` delta comparison	Hybrid: pixel diff (pixelmatch) + perceptual (SSIM/LPIPS) + ARIA snapshots	Deterministic pixel diff + LLM as complement only
CI integration	Playwright scripts run post-fix	3-tier platform choice (OSS → SaaS → Enterprise)	ProofShot on PRs + Playwright in CI
Component testing	Not addressed	Storybook stories as "specs for free"; Chromatic/Loki	Chromatic with TurboSnap (free tier)
Unique tools found	Midscene.js, UI-Tars, Set-of-Marks overlays	Skia Gold, LLMShot, BackstopJS Docker rendering, reg-suit	ProofShot, UI Visual Validator Agent, Meticulous.ai, Stagehand, Glance MCP
Biggest blind spot	No CI pipeline design; no component-level strategy	No awareness of Claude Code autoVerify or sub-agents	Less depth on perceptual metrics and cross-browser scaling
Depth	Focused & actionable (4 recommendations)	Broadest (7 sections, comparison tables, mermaid diagrams)	Most tool-aware (found 20+ specific tools with URLs)
Recency	References 2024-2025 tools	References through early 2026	Most current (April 2026 tools, Cursor 3, VS Code 1.112)
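
To make the Codex row's "computed style deltas" concrete, here is a minimal sketch of the comparison step only. The `StyleMap` and `StyleDelta` shapes and both function names are hypothetical and do not come from any of the reports. In a real run the actual values would presumably be collected in the browser with `getComputedStyle()`, and the expected values would come from a design spec.

```typescript
// Hypothetical shapes for illustration; not taken from the research reports.
type StyleMap = Record<string, string>;

interface StyleDelta {
  property: string;
  expected: string;
  actual: string;
}

// Compare every property named in the design spec against the computed value.
// Properties missing from the computed styles are reported as "(missing)".
function diffStyles(expected: StyleMap, actual: StyleMap): StyleDelta[] {
  const deltas: StyleDelta[] = [];
  for (const property of Object.keys(expected)) {
    const want = expected[property];
    const got = actual[property] ?? "(missing)";
    if (want !== got) {
      deltas.push({ property, expected: want, actual: got });
    }
  }
  return deltas;
}

// Render one delta in the 'expected 16px, actual 24px' form quoted in the table.
function formatDelta(delta: StyleDelta): string {
  return `${delta.property}: expected ${delta.expected}, actual ${delta.actual}`;
}
```

The comparison is an exact string match, so values need to be normalized to the same units before they are compared.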

Head-to-Head: Existing PRs vs New Research

Aspect	PR #154 (Quality Plan)	PR #156 (Standards)	New Research Adds
Problem diagnosis	Detailed: 17 iteration rounds analyzed, 5 root causes identified	Concise: 4 rules	All three reports confirm the same root causes. New: LLMs are "overly lenient validators" (not just blind — actively permissive)
Agent rules	5 behavior rules (no SVG hand-drawing, no CDN fonts, no self-approval)	4 core rules (no blind fixes, required validation, strict tokens, CI regression)	New: UI Visual Validator — a ready-made sub-agent that enforces 13 rules including responsive breakpoints, dark/light mode, WCAG contrast, touch targets. Replaces hand-written rules
Measurement	Playwright `getComputedStyle()` + Figma MCP specs	Playwright scripts for metric extraction	New: ProofShot captures video + screenshots + console errors automatically. Meticulous.ai records real sessions → auto-generates tests
Design pipeline	Stitch (ideation) → Figma (refinement) → Claude Code (implementation)	Not addressed	All sources agree on this split. Gemini adds: "Stitch = what to test, not the visual oracle"
CI gate	Phase 2 (next sprint): Playwright test runner + golden files	Playwright Action blocks PRs with visual drift	New: ProofShot `proofshot pr` posts artifacts directly to GitHub PRs. Chromatic TurboSnap snapshots only changed components (85% faster)
Timeline	4 phases (this week → quarterly)	Immediate (standards doc)	Option B below aligns with PR #154's phases but adds specific tool choices

Where Sources Disagree

Topic	View A	View B	Resolution
Stitch's role	Codex: "Move from Stitch to Figma for implementation"	Claude: "Keep Stitch MCP screen comparison alongside Figma"	Keep both: Stitch for ideation + stakeholder review, Figma MCP for pixel-precise specs. Stitch MCP screen comparison is still useful for quick checks
LLM vision for pass/fail	Codex: Use VLM to compare screenshot vs mockup	Gemini: "VLM hallucination risk — never let agent claim success without deterministic thresholds"	Gemini is right: LLM vision for triage/summary only, deterministic metrics for pass/fail
Perceptual metrics	Gemini: SSIM + LPIPS for ranking diffs	Codex/Claude: Pixel diff sufficient	Start with pixel diff (Playwright pixelmatch). Add perceptual metrics only if false positive rate is too high
Component vs page testing	Codex: Page-level only	Gemini/Claude: Both component (Storybook) + page (E2E)	Both: Component catches isolated regressions, page catches integration issues
Existing tools vs new tools	PR #154: Custom `visual-qa.ts` script	Claude research: UI Visual Validator (ready-made, 13-point)	Evaluate UI Visual Validator first — if it covers our needs, skip building custom. Fall back to custom only for SV0-specific rules
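
For readers unfamiliar with what a "pixel diff" number means, the sketch below computes the fraction of changed pixels between two same-sized RGBA buffers. It is a deliberately naive per-channel comparison written for illustration, not the pixelmatch algorithm and not reg-cli's implementation. The function name and the `tolerance` parameter are assumptions.

```typescript
// Naive pixel diff over two RGBA buffers (4 bytes per pixel).
// This is an illustrative stand-in, not how pixelmatch or reg-cli work internally.
function diffFraction(
  before: Uint8Array,
  after: Uint8Array,
  tolerance = 0, // hypothetical knob: max per-channel difference still treated as equal
): number {
  if (before.length !== after.length || before.length % 4 !== 0) {
    throw new Error("buffers must be same-sized RGBA data");
  }
  const totalPixels = before.length / 4;
  if (totalPixels === 0) return 0;

  let changed = 0;
  for (let i = 0; i < before.length; i += 4) {
    // A pixel counts as changed if any of its four channels exceeds the tolerance.
    let differs = false;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(before[i + c] - after[i + c]) > tolerance) differs = true;
    }
    if (differs) changed++;
  }
  return changed / totalPixels;
}
```

A result of 0.042 corresponds to a 4.2% pixel diff. Raising `tolerance` is one crude way to absorb rendering noise before reaching for perceptual metrics.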

Tool Landscape (Consolidated)

Tool verification results (2026-04-07):

ProofShot — Verified real. AmElmo/proofshot, 767 stars, actively maintained. Description accurate.

UI Visual Validator Agent — Verified real. cryptonerdcn/UI-Visual-Validator-Agent, 38 stars, last updated 2026-03-24. 13-point checklist claim not independently confirmed.

Glance MCP — Unverified. Only a 0-star repo with no description exists (sandraschi/glance-mcp). Not an established tool. Treat as unreliable.

LLMShot — Unverified. Only a 0-star abandoned Shell repo exists (markabrahams/llmshot). No evidence it is a visual testing tool.

autoVerify / .claude/launch.json — Hallucinated. Neither exists in Claude Code. The equivalent is the /visual-review skill + settings.json.

Category	OSS / Free	SaaS	Enterprise
Visual regression	Playwright `toHaveScreenshot`, BackstopJS, Lost Pixel, reg-suit	Chromatic (free tier), Percy (5K free/mo), Argos	Applitools Eyes, Skia Gold
Browser automation	Playwright MCP, Stagehand, Browser Use, Glance	—	—
Screenshot CI	ProofShot, Playwright artifacts	Meticulous.ai	Sauce Visual
Design-to-code	Figma MCP	Applitools Figma Plugin	Applitools Centra
Agent visual QA	UI Visual Validator Agent, Claude Code Frontend Dev	—	—
Component testing	Storybook + Loki	Chromatic, Happo	Applitools Storybook

ProofShot vs SV0 Internal Visual Tools

ProofShot (github.com/AmElmo/proofshot, 767 stars, v1.3.5, MIT) was the most promising external tool found across all three research sources. Deep-dive comparison against our existing pipeline:

Feature Comparison

Capability	SV0 Internal Tools	ProofShot	Winner
Interactive comparison UI	reg-cli: slider, overlay, blend, toggle, side-by-side (5 modes)	No interactive viewer — CLI text + static diff PNGs	SV0
Prod vs dev comparison	`visual-review.yml` environments mode — compares two live URLs	No support — sessions tied to single URL origin; manual workaround only	SV0
CI/CD integration	GitHub Actions → Cloudflare Pages, auto-deployed per PR	No CI integration, no GitHub Action	SV0
Multi-page capture	11+ routes, detail pages at multiple scroll positions, manifest.json inventory	Agent-driven navigation only, no declarative route list	SV0
Sprint evidence mapping	Maps action plan items to before/after screenshots with verdict tracking	Not applicable	SV0
Vite/ESM support	Works (Playwright)	Broken — CDP navigation fails with `<script type="module">` (Issue #25, showstopper)	SV0
Video recording	No	Yes (.webm screencast of full session)	ProofShot
AI agent skill installation	Custom `/visual-review` skill (SV0-specific)	Auto-installs skills for Claude Code, Cursor, Codex, Gemini CLI, Windsurf	ProofShot
Console/server error capture	`visual-qa.ts` checks console errors + failed network requests	Captures console output + server stderr, regex scanning for 10+ languages	Tie
PR comment with visual proof	Interactive HTML deployed to `pr-N.sv0-reviews.pages.dev` (linked from PR)	Markdown comment with inline screenshots + video embed	Tie (different strengths)
Pixel diff engine	reg-cli (mature, configurable threshold)	Delegates to agent-browser (opaque, basic pixel %)	SV0
Maturity	Months of production use	5 weeks old (created 2026-02-27)	SV0

ProofShot Limitations (Verified)

Critical bug: Vite/React apps render blank pages via CDP (Issue #25) — blocks use with SV0 platform
No cross-environment comparison: Can't compare prod vs dev out of the box
No interactive diff viewer: No sliders, overlays, or side-by-side — a major regression from reg-cli
No baseline management: Manual directory specification vs reg-cli's automated workflow
No DOM snapshots or network capture: Only screenshots + video + console
Thin diff logic: All comparison delegated to agent-browser; ProofShot has zero image processing code

What ProofShot Does Better

Universal agent integration: One proofshot install command teaches any AI agent (not just Claude Code) the verification workflow. Our /visual-review skill is SV0-specific.
Video proof: Session recordings (.webm) capture the full interaction flow — valuable for debugging, not just pass/fail.
Self-contained HTML viewer: viewer.html bundles video + timeline + logs + screenshots in one offline file (but is local-only, not uploaded to PRs).

Verdict

Do not adopt ProofShot as a replacement — SV0's tools are more capable in every dimension that matters for our workflow (environment comparison, interactive diffs, CI integration, Vite support). However, two ProofShot ideas are worth borrowing:

Video recording — add Playwright screencast capture to visual-screenshot.ts for debugging context
Universal agent skill — generalize our /visual-review skill pattern so other agents (not just Claude Code) can trigger visual verification

Proposal: Dual-Output Visual Tool (Human + Agent)

The Problem

Today our visual tools produce human-optimized output only: interactive HTML reports with sliders and overlays at pr-N.sv0-reviews.pages.dev. This works well for human reviewers but creates a gap:

Humans get rich interactive comparison → manually decide "looks good" or "broken"
Agents get nothing — they can't parse interactive HTML, can't use sliders, can't interpret overlay diffs

The current workflow requires a human in the loop at every visual checkpoint. When an agent claims "development is completed" and a PR is opened, a human must:

Wait for deploy to dev
Open the visual review URL
Manually compare prod vs dev using sliders
Identify issues and feed them back to the agent

This is the bottleneck. The agent should be able to self-assess before the human ever looks.

The Proposal: One Run, Two Outputs

Extend the existing visual-diff-report.ts pipeline to produce two outputs from a single comparison run:

Human output (existing, enhanced):

Interactive HTML with slider/overlay/blend/toggle/side-by-side (already implemented via reg-cli)
Deployed to Cloudflare Pages per PR (already implemented)
Video recording of the capture session (new, borrowed from ProofShot concept)

Agent output (new):

visual-report.md — structured markdown summary that an LLM can parse:
- Per-page status: PASS / FAIL / CHANGED / NEW / REMOVED
- Pixel diff percentage per page (from reg-cli reg.json)
- List of pages exceeding threshold, ranked by severity
- Console errors and failed network requests per page
- Before/after screenshot paths (agent can read these via vision)
- Computed style deltas for key elements (if measurement pipeline is enabled)
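
A minimal sketch of how the markdown summary described above could be rendered. The `PageResult` shape, the `renderReport` name, and the section headings are assumptions made for illustration; the existing scripts may structure this differently. Computed style deltas are left out to keep the sketch short.

```typescript
// Hypothetical input shape; field names loosely follow the proposal's bullet list.
type Status = "PASS" | "FAIL" | "CHANGED" | "NEW" | "REMOVED";

interface PageResult {
  route: string;
  status: Status;
  diffPercent: number; // percentage, e.g. 4.2 means 4.2%
  consoleErrors: string[];
  before?: string; // screenshot paths the agent can open with vision
  after?: string;
}

function renderReport(pages: PageResult[], thresholdPercent: number): string {
  // filter() copies the array, so sorting here does not reorder the caller's input.
  const failing = pages
    .filter((p) => p.status === "FAIL")
    .sort((a, b) => b.diffPercent - a.diffPercent);

  const lines: string[] = ["# Visual report", "", `Threshold: ${thresholdPercent}%`, ""];

  lines.push("## Per-page status", "");
  for (const p of pages) {
    lines.push(`- ${p.route}: ${p.status} (${p.diffPercent.toFixed(2)}%)`);
  }

  lines.push("", "## Pages exceeding threshold (worst first)", "");
  if (failing.length === 0) lines.push("None");
  for (const p of failing) {
    const paths = `before: ${p.before ?? "n/a"}, after: ${p.after ?? "n/a"}`;
    lines.push(`- ${p.route}: ${p.diffPercent.toFixed(2)}% (${paths})`);
  }

  lines.push("", "## Console errors", "");
  const withErrors = pages.filter((p) => p.consoleErrors.length > 0);
  if (withErrors.length === 0) lines.push("None");
  for (const p of withErrors) {
    for (const err of p.consoleErrors) lines.push(`- ${p.route}: ${err}`);
  }

  return lines.join("\n");
}
```

Short, predictable line formats are used so that an agent can pick out status and severity without parsing prose.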

visual-report.json — machine-readable structured data:

{
  "summary": { "pass": 8, "fail": 2, "changed": 1, "new": 0, "removed": 0 },
  "threshold": 0.01,
  "pages": [
    {
      "route": "/dashboard",
      "status": "FAIL",
      "diffPercent": 4.2,
      "diffPixels": 12847,
      "screenshot": { "before": "before/dashboard.png", "after": "after/dashboard.png", "diff": "diff/dashboard.png" },
      "consoleErrors": [],
      "verdict": "Layout shift in header navigation — 4.2% pixel diff exceeds 1% threshold"
    }
  ]
}
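
The sample above can be produced by a small classifier, sketched here under stated assumptions. The `Measurement` input shape is hypothetical and is not reg-cli's `reg.json` format. The sketch reads `threshold` as a fraction (0.01 = 1%) and `diffPercent` as a percentage, which is what the sample's verdict string implies. The meaning of CHANGED (a non-zero diff within the threshold) is an interpretation, since the proposal does not define it.

```typescript
type Status = "PASS" | "FAIL" | "CHANGED" | "NEW" | "REMOVED";

// Hypothetical per-page input; NOT the reg-cli reg.json format.
interface Measurement {
  route: string;
  diffPercent: number; // percentage, e.g. 4.2
  hasBefore: boolean; // a baseline screenshot exists
  hasAfter: boolean; // a current screenshot exists
}

interface Summary { pass: number; fail: number; changed: number; new: number; removed: number; }

interface Report {
  summary: Summary;
  threshold: number; // fraction, e.g. 0.01 for 1% (assumption, see lead-in)
  pages: { route: string; status: Status; diffPercent: number }[];
}

function classify(m: Measurement, threshold: number): Status {
  if (!m.hasBefore) return "NEW";
  if (!m.hasAfter) return "REMOVED";
  // Convert the percentage to a fraction before comparing against the threshold.
  if (m.diffPercent / 100 > threshold) return "FAIL";
  return m.diffPercent > 0 ? "CHANGED" : "PASS";
}

function buildReport(measurements: Measurement[], threshold: number): Report {
  const summary: Summary = { pass: 0, fail: 0, changed: 0, new: 0, removed: 0 };
  const pages = measurements.map((m) => {
    const status = classify(m, threshold);
    summary[status.toLowerCase() as keyof Summary] += 1;
    return { route: m.route, status, diffPercent: m.diffPercent };
  });
  return { summary, threshold, pages };
}
```

Deriving the status from numbers alone keeps the pass/fail decision deterministic; any LLM-written verdict text would be attached afterwards as commentary.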

How Agents Would Use This

During development (agent self-check):

Agent makes UI changes
Agent runs npx tsx scripts/visual-screenshot.ts → captures current state
Agent runs npx tsx scripts/visual-diff-report.ts --baseline main --format agent → gets visual-report.md
Agent reads the markdown: sees 2 pages FAIL, reads the diff percentages and verdicts
Agent fixes the failing pages and re-runs — no human needed for the iteration loop
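
The self-check steps above amount to a bounded fix-and-verify loop. The sketch below shows only the control flow: `runPipeline` and `applyFixes` are hypothetical callbacks standing in for the capture/diff scripts and the agent's edits, and `maxRounds` is an added safeguard that the proposal does not mention. It is synchronous for brevity; a real version would await child processes.

```typescript
interface PageStatus {
  route: string;
  status: string; // "PASS", "FAIL", ...
  diffPercent: number;
}

interface LoopResult {
  done: boolean; // true when a verification run reported no FAIL pages
  rounds: number; // number of verification runs performed
  remaining: string[]; // routes still failing when the loop gave up
}

function selfCheck(
  runPipeline: () => PageStatus[], // hypothetical: capture + diff, returns parsed report pages
  applyFixes: (failing: PageStatus[]) => void, // hypothetical: the agent's fix step
  maxRounds = 5, // assumption: cap iterations so the agent escalates instead of looping forever
): LoopResult {
  let failing: PageStatus[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    failing = runPipeline().filter((p) => p.status === "FAIL");
    if (failing.length === 0) {
      return { done: true, rounds: round, remaining: [] };
    }
    // Only fix when another verification run will follow, so no fix goes unchecked.
    if (round < maxRounds) applyFixes(failing);
  }
  return { done: false, rounds: maxRounds, remaining: failing.map((p) => p.route) };
}
```

The agent would claim completion only when `done` is true, and hand the `remaining` routes to a human otherwise.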

During PR review (human + agent):

CI runs the full pipeline → deploys interactive HTML for humans + generates visual-report.md for agents
Human reviews the interactive sliders for subjective quality
Agent (or CI bot) enforces the deterministic thresholds — blocks merge if any page exceeds limit
Both outputs come from the same single run — no duplicate work
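
A possible shape for the deterministic gate that blocks a merge, sketched with per-route overrides. `GateConfig`, `evaluateGate`, and the idea of mapping the result to a process exit code are assumptions; the proposal only says the threshold is enforced and configurable per page.

```typescript
interface PageDiff {
  route: string;
  diffPercent: number; // percentage, e.g. 4.2
}

// Hypothetical configuration: one default limit plus optional per-route limits.
interface GateConfig {
  defaultMaxPercent: number;
  perRoute?: Record<string, number>;
}

interface GateResult {
  exitCode: 0 | 1; // non-zero would fail the CI job
  violations: string[];
}

function evaluateGate(pages: PageDiff[], config: GateConfig): GateResult {
  const violations: string[] = [];
  for (const page of pages) {
    // A route-specific limit wins over the default when one is configured.
    const limit = config.perRoute?.[page.route] ?? config.defaultMaxPercent;
    if (page.diffPercent > limit) {
      violations.push(`${page.route}: ${page.diffPercent}% exceeds ${limit}%`);
    }
  }
  return { exitCode: violations.length > 0 ? 1 : 0, violations };
}
```

A CI step could print the violations and exit with `exitCode`, leaving the interactive HTML untouched for human review.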

What Changes in Existing Tools

Script	Change	Effort
`visual-diff-report.ts`	Add `--format agent` flag; emit `visual-report.md` + `visual-report.json` alongside `index.html`	Small — reg-cli already produces `reg.json` with all the data; this is a formatter
`visual-qa.ts`	Already produces markdown report; add structured JSON output with per-page pass/fail	Small
`visual-review.yml`	Upload `visual-report.md` as PR comment (in addition to deploying HTML)	Small
CLAUDE.md / agent rules	Add: "After UI changes, run visual-diff and read `visual-report.md` before claiming done"	Config only
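
The first row's `--format agent` flag could be handled along these lines. This is a sketch: the default values (`main`, `html`), the `parseOptions` and `outputsFor` names, and the hand-rolled parsing are assumptions, and the real script may already use an argument-parsing library.

```typescript
type OutputFormat = "html" | "agent";

interface Options {
  baseline: string;
  format: OutputFormat;
}

// Parse "--baseline <ref>" and "--format <html|agent>" from an argv-style array.
function parseOptions(argv: string[]): Options {
  const options: Options = { baseline: "main", format: "html" }; // assumed defaults
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (flag === "--baseline" && value !== undefined) {
      options.baseline = value;
      i++; // skip the consumed value
    } else if (flag === "--format" && value !== undefined) {
      if (value !== "html" && value !== "agent") {
        throw new Error(`unknown format: ${value}`);
      }
      options.format = value;
      i++;
    }
  }
  return options;
}

// Agent format adds the two report files alongside the existing HTML output.
function outputsFor(format: OutputFormat): string[] {
  return format === "agent"
    ? ["index.html", "visual-report.md", "visual-report.json"]
    : ["index.html"];
}
```

Keeping `index.html` in both lists reflects the "one run, two outputs" goal: the agent files are additive, not a separate mode.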

Comparison with ProofShot's Approach

Aspect	ProofShot	SV0 Dual-Output (proposed)
Agent gets structured data	`SUMMARY.md` (basic: error count + screenshot list)	`visual-report.md` + `.json` (per-page status, diff %, thresholds, verdicts)
Human gets interactive UI	No (local viewer.html only, not deployed)	Yes (reg-cli HTML at Cloudflare Pages)
Environment comparison	No	Yes (prod vs dev, PR vs main)
Single run produces both	No (separate `proofshot pr` step)	Yes (one `visual-diff-report.ts` run)
Pass/fail thresholds	No (just reports percentages)	Yes (configurable per-page thresholds, CI gate)
Agent self-check loop	Agent reads SUMMARY.md manually	Agent reads visual-report.md, sees PASS/FAIL, iterates

Why This Is Better Than Adopting an External Tool

Builds on what works: reg-cli + Playwright + Cloudflare Pages pipeline is already battle-tested
Zero new dependencies: The agent output is a formatter on top of existing reg.json data
SV0-specific: Includes routes, entity pages, scroll positions, sprint evidence — no external tool knows our page inventory
Dual audience by design: Not an afterthought — human and agent outputs are first-class from the same run

Process Options

Option A: Minimal — Playwright + autoVerify

Note: the tool verification above found that neither autoVerify nor .claude/launch.json exists in Claude Code. The equivalent capability in SV0 today is the /visual-review skill plus the existing Playwright scripts, so this option's incremental value over existing infrastructure is limited to adding toHaveScreenshot() assertions with baselines.

Timeline: 1 week | Cost: $0 | New dependencies: 0

What	How
Enable autoVerify	`.claude/launch.json` config
Playwright snapshots	`toHaveScreenshot()` on 10-15 critical screens
Baselines in repo	`maxDiffPixels` threshold, manual updates
Agent runs tests	`npx playwright test` after every UI change
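
As a configuration fragment, the snapshot rows above might translate to something like the following `playwright.config.ts`. The directory, viewport, and `maxDiffPixels` value are illustrative assumptions, not SV0's actual settings, and would need tuning against real baselines.

```typescript
// playwright.config.ts (sketch; values below are placeholders)
import { defineConfig } from "@playwright/test";

export default defineConfig({
  testDir: "./tests/visual", // hypothetical location for the snapshot specs
  expect: {
    toHaveScreenshot: {
      maxDiffPixels: 100, // illustrative threshold; tune per project
      animations: "disabled", // stop CSS animations before capturing
    },
  },
  use: {
    viewport: { width: 1280, height: 720 }, // fixed size keeps baselines comparable
  },
});
```

Setting the threshold once in the config keeps individual `toHaveScreenshot()` calls free of per-test numbers, which makes later threshold changes a one-line edit.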

Pros:

Zero new dependencies or services
Fast to implement, fully deterministic
Already addresses the core "blind agent" problem
Codex's measurement approach (getComputedStyle) fits here too

Cons:

No cross-browser coverage (single Chromium)
No design-to-code comparison (agent still guesses vs design intent)
No component isolation (only catches page-level issues)
Manual baseline management scales poorly
No PR-level visual artifacts for human review
Does not address PR #154's finding that 15+ iterations were needed; it only catches obvious breaks

Best for: Quick win while evaluating larger options. Solves the "agent never looks" problem immediately.

Option B: Layered Pipeline (Recommended)

Timeline: 2-3 sprints | Cost: $0-50/mo (Chromatic free tier) | New dependencies: ProofShot, UI Visual Validator, Chromatic

Synthesizes the best ideas from all sources:

Layer	Source	What	When
1. Strict agent rules	PR #156 + Claude research	UI Visual Validator sub-agent (13-point checklist, defaults to failure) + autoVerify + CLAUDE.md rules: "No 'fixed' claim without screenshot evidence"	Immediate
2. Measurement pipeline	Codex research + PR #154	Playwright `toHaveScreenshot()` + `getComputedStyle()` extraction vs design specs. Mask volatile elements, disable animations, freeze time	Sprint 1
3. CI visual gate	Gemini research + Claude research + ProofShot analysis	Dual-output visual-diff-report (interactive HTML for humans + `visual-report.md`/`.json` for agents). CI posts agent-readable summary to PR. Rule: "No merge without visual artifacts + diffs within threshold"	Sprint 2
4. Component regression	Gemini research + Claude research	Chromatic for Storybook (free tier). TurboSnap = only snapshot changed components. Every Stitch screen → 1 Storybook story + 1 E2E screenshot	Sprint 3

Prerequisite: SV0 does not currently use Storybook. Adding Storybook (component isolation, stories for each screen) is ~1 sprint of setup before Chromatic integration can begin.

Pros:

Addresses all root causes from PR #154's analysis (blind fixes, design drift, no guardrails)
Layered defense: catches issues at agent time, CI time, AND review time
UI Visual Validator offers a ready-made checklist in place of the hand-written rules from PR #156 (the 13-point claim was not independently confirmed in the tool verification)
Session video recording (an idea borrowed from ProofShot) gives human reviewers video proof, not just screenshots
Incremental: each layer works independently, can stop at any layer
Mostly OSS (Chromatic free tier for component-level)
Aligns with PR #154's 4-phase roadmap but adds specific tool choices

Cons:

More setup than Option A (but each layer is independent)
Chromatic adds a SaaS dependency (can substitute with Lost Pixel if needed)
No design-to-code comparison yet (added in Option C)
Storybook required for Layer 4 (skip if not using it)

Best for: Teams that want comprehensive visual QA without enterprise pricing. Solves both "agent is blind" and "human reviewer lacks evidence."

Option C: Full Stack — Design-to-Production

Timeline: Quarter | Cost: $200-2000/mo (Applitools, Meticulous) | New dependencies: Everything in B + Figma MCP, Applitools, Meticulous.ai, Stagehand

Everything in Option B, plus:

Layer	Source	What
5. Design source of truth	Codex + Gemini	Figma MCP for pixel-precise spec extraction. Stitch stays for ideation. Applitools Figma Plugin: compare production screenshots against Figma designs. Coverage map: every Stitch screen must have tests
6. AI triage	Gemini + Claude research	Chromatic/Percy Visual Review Agent (reduces review burden ~40%). LLM summarizes diffs, pass/fail stays deterministic. Accessibility gates: `toMatchAriaSnapshot()` + axe-core
7. Zero-maintenance testing	Claude research	Meticulous.ai: records user sessions → auto-generates visual tests → posts diffs on PRs. Stagehand: self-healing browser automation for dynamic content

Pros:

End-to-end from design intent to production verification
Closes the Stitch → Figma → Code → Screenshot → Diff → Design comparison loop
AI triage reduces human review burden by ~40%
Meticulous.ai means no manual test writing for new pages
Self-healing tests (Stagehand) reduce maintenance
Accessibility is a first-class gate, not an afterthought

Cons:

Significant cost (Applitools ~$500+/mo, Meticulous pricing varies)
Complexity: 7 layers = more things that can break
Figma MCP has limitations (business logic lost in roundtrip, ~6 free uses/mo)
Quarter timeline before full value
May be over-engineered for current team size

Best for: Teams scaling to multiple designers + developers, or where design fidelity is a competitive differentiator.

Decision Matrix

Factor	Option A	Option B	Option C
Solves "agent is blind"	Yes	Yes	Yes
Solves "15+ iteration rounds"	Partially	Mostly	Yes
Design-to-code verification	No	No	Yes
Component isolation	No	Yes (Chromatic)	Yes
Human reviewer gets evidence	No (just pass/fail)	Yes (ProofShot video + screenshots)	Yes (+ AI triage)
Cross-browser	No	No (add Percy in C)	Yes
Accessibility gates	No	No	Yes
Time to first value	1 week	1 week (Layer 1)	1 week (Layer 1)
Time to full value	1 week	2-3 sprints	Quarter
Monthly cost	$0	$0-50	$200-2000
Maintenance burden	Low	Medium	High (but self-healing)

How This Relates to Existing PRs

PR	Status	Relationship to This Plan
#149	Open	Provides the "why": AI agents produce mediocre UI due to completeness bias, no visual feedback, flat spatial reasoning. This plan addresses the "how to fix"
#154	Open	Most aligned with Option B. PR #154's 5-layer prevention strategy maps to our 4 layers. Our plan adds specific tool choices (ProofShot, UI Visual Validator, Chromatic) that PR #154 left as "TBD"
#156	Open	The 4 core rules become Layer 1 agent behavior rules. The UI Visual Validator sub-agent could supersede the hand-written rules with a ready-made checklist (reported as 13 points, not independently confirmed)
Adopted research	Adopted	The Claude Code UI Testing research produced the `/visual-review` skill. This plan extends it with deterministic pass/fail gates and CI enforcement
Visual diff pipeline	Implemented	reg-cli + Cloudflare Pages provides before/after HTML diffs on PRs. This plan adds threshold-based blocking and agent-time verification

Recommendation: Merge #154 and #156 as foundation docs, then implement the chosen option using them as the specification. Update the docs as tool choices are finalized.

Verification Plan

After implementing the chosen option:

Run a UI change through the full pipeline end-to-end
Intentionally introduce a visual regression (wrong color, broken spacing) — verify pipeline catches it
Verify agent refuses to claim "fixed" without screenshot evidence
Check CI blocks PR merge when visual diff exceeds threshold
Measure: feedback iterations needed vs the 15+ documented in PR #154's analysis

Next Action

Status: research-complete

Decision needed from: Product Owner

Options:

Adopt Option B (Layered Pipeline, recommended) — create GitHub issue for 4-layer implementation across 2-3 sprints, building on existing visual-qa.ts + reg-cli pipeline. Includes dual-output proposal (human HTML + agent markdown/JSON from single run).
Adopt Option A (Minimal) — add toHaveScreenshot() assertions to existing Playwright scripts only
Adopt Option C (Full Stack) — create GitHub issue for quarter-long implementation with enterprise tooling
Adopt Dual-Output only — implement visual-report.md + .json output from existing visual-diff-report.ts without other layers (fastest path to agent self-check)
Defer — revisit after current sprint priorities are delivered

GitHub Issue: not yet created
