Review Process Hardening Plan
The Problem
The March 2026 multi-perspective review was run with a fundamental blind spot: no reviewer agent saw the platform. All 7 agents — including the UX critic, enterprise executive, and CEO reviewer — evaluated the product by reading source code, API responses, and design specs. The actual rendered UI was never part of the review input.
This produced three categories of findings that were wrong or incomplete:
- The UX critic rated navigation as "weak" but could only evaluate the component tree, not the rendered navigation flow. Its jargon count of 23 was derived from code strings, not from what a user actually reads on screen.
- The enterprise executive scored sellability at 1.8/5 but never saw what a partner would actually show a CIO. The cluster verdict sentences that "pass the 5-second comprehension test" were evaluated as text, not as rendered cards with visual hierarchy.
- Product QA found "hash IDs in breadcrumbs", correctly, from code. But it could not assess whether the breadcrumbs are even visible enough to matter, whether the hash truncation is readable, or whether the layout draws attention to the wrong element.
The review produced valid code-level and API-level findings. But for UX, readability, and partner deliverable quality, the accuracy ceiling is structurally limited without visual input.
What We're Building
Two new components that integrate with existing infrastructure:
Component 1: Visual Capture Skill (sv0-skills #7)
A Claude Code skill that produces a complete screenshot snapshot of the platform before any review cycle starts.
Builds on existing infrastructure in sv0-platform:
- scripts/visual-screenshot.ts: Playwright-based page capture (already handles login, tenant selection, page navigation, scroll offsets)
- review-ui skill: already captures screenshots via npm run qa:visual and evaluates against product vision
- visual-review skill: already does before/after screenshot comparison for PRs
- Demo tenant demo-w1 with seed data (29 active authority paths, 6 clusters, 51 findings)
What's new:
The existing tools capture screenshots for specific purposes (PR diff, vision alignment). The new skill produces a comprehensive, named snapshot that any reviewer agent can reference — covering all pages, interesting entities, edge states, and navigation patterns.
Component 2: Acceptance Review Researcher (sv0-intelligence #5)
A structured research workflow in sv0-intelligence that orchestrates the 7-agent review, collects results, and tracks MPAS-7 scores over time.
Builds on existing infrastructure in sv0-intelligence:
- weekly_incident researcher pattern: gather → score → generate → publish pipeline
- shared/claude_client.py: dual-mode Claude wrapper (SDK or CLI)
- shared/signal_store.py: SQLite-backed persistence
- scheduler.py: cron-based recurring execution
What's new:
A second researcher (acceptance_review) that follows the same pattern but with different stages: prepare (visual capture) → review (7 agents) → synthesize (MPAS-7 delta report).
Implementation Phases
Phase 1: Visual Capture Skill
Where: sv0-skills/platform-visual-capture/SKILL.md
Depends on: Existing scripts/visual-screenshot.ts in sv0-platform
Effort: 2-3 sessions
GitHub: sv0-skills #7
1.1 Define the Capture Manifest
A YAML file listing every page and entity to capture:
# capture-manifest.yaml
pages:
  - name: overview
    path: /
    description: "First thing any user sees: the 15-second CISO test"
  - name: cluster-detail
    path: /clusters/{cluster_id}
    description: "Grouped finding summary: CISO executive readability"
    instances:
      - label: "unowned-sensitive-access"
        resolve: "first cluster with 'unowned' in label"
  - name: authority-paths-list
    path: /authority-paths
    description: "Analyst investigation starting point"
  - name: path-detail-typical
    path: /authority-paths/{path_id}
    description: "Representative path with full evidence pack"
    instances:
      - label: "foundry-agent"
        resolve: "first path with execution_30d > 0"
  - name: path-detail-edge
    path: /authority-paths/{path_id}
    description: "Path with orphaned ownership + scope drift + LLM egress"
    instances:
      - label: "orphaned-llm-egress"
        resolve: "first path matching all three conditions"
  - name: findings-list
    path: /findings
    description: "Flat finding table: analyst-only, flagged as needing grouping"
  - name: exposures
    path: /exposures
    description: "Exposure aggregation view"
captures_per_page:
  - viewport: [1440, 900]
  - full_page: true
  - scroll_positions: [0, 50%, 100%]
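A minimal sketch of how the manifest's path templates could be expanded into concrete capture targets once entity IDs are known. The `CaptureTarget` type, the `expand_targets` helper, and the shape of `resolved_ids` are hypothetical names introduced for illustration; the `PAGES` literal stands in for what a YAML parser would return for two of the entries above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureTarget:
    name: str   # manifest page name, e.g. "cluster-detail"
    label: str  # instance label, or "" for pages without instances
    path: str   # concrete URL path to visit


def expand_targets(pages: list[dict], resolved_ids: dict[str, dict[str, str]]) -> list[CaptureTarget]:
    """Turn manifest page entries into concrete capture targets.

    `resolved_ids` maps an instance label to the placeholder values for
    its path template, e.g. {"unowned-sensitive-access": {"cluster_id": "c-1"}}.
    """
    targets = []
    for page in pages:
        instances = page.get("instances")
        if not instances:
            # Static page: the path has no placeholders to fill in.
            targets.append(CaptureTarget(page["name"], "", page["path"]))
            continue
        for inst in instances:
            values = resolved_ids[inst["label"]]
            targets.append(
                CaptureTarget(page["name"], inst["label"], page["path"].format(**values))
            )
    return targets


# Two entries mirroring the manifest (assumed result of parsing the YAML).
PAGES = [
    {"name": "overview", "path": "/"},
    {
        "name": "cluster-detail",
        "path": "/clusters/{cluster_id}",
        "instances": [
            {"label": "unowned-sensitive-access", "resolve": "first cluster with 'unowned' in label"}
        ],
    },
]

if __name__ == "__main__":
    # "c-1" is a made-up ID used only for this demonstration.
    for target in expand_targets(PAGES, {"unowned-sensitive-access": {"cluster_id": "c-1"}}):
        print(target)
```

Keeping resolution (finding the ID) separate from expansion (filling the template) means the manifest can be validated without a running platform.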
1.2 Extend the Screenshot Script
Two existing scripts handle the two capture modes:
- scripts/visual-screenshot.ts: captures standard pages (Overview, Findings, etc.) by path list
- scripts/visual-screenshot-detail.ts: captures entity-level detail pages (authority paths, clusters) at multiple scroll positions via QA_DETAIL_PATH and QA_SCROLL_OFFSETS
The capture skill orchestrates both scripts in sequence: first the base script for all standard pages, then the detail script for each entity in the manifest. Extend with:
- A capture manifest (YAML) that defines which pages and entities to capture
- Dynamic entity resolution (e.g., "first cluster with 'unowned' in label") by querying the API before capture
- A named snapshot directory: snapshots/YYYY-MM-DD-<label>/
- A manifest.json output with captured paths, filenames, and metadata
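Dynamic entity resolution could look roughly like the sketch below: each `resolve` rule from the manifest becomes a predicate applied to records fetched from the API before capture. The record fields (`id`, `label`, `execution_30d`) and the `RESOLVERS` table are assumptions; the real API response shape is not specified in this plan.

```python
from typing import Callable, Iterable, Optional


def first_matching(records: Iterable[dict], predicate: Callable[[dict], bool]) -> Optional[str]:
    """Return the id of the first record that satisfies `predicate`, or None."""
    for record in records:
        if predicate(record):
            return record["id"]
    return None


# Manifest resolve rules written as predicates. Field names are assumed.
RESOLVERS: dict[str, Callable[[dict], bool]] = {
    # "first cluster with 'unowned' in label"
    "unowned-sensitive-access": lambda c: "unowned" in c.get("label", "").lower(),
    # "first path with execution_30d > 0"
    "foundry-agent": lambda p: p.get("execution_30d", 0) > 0,
}

if __name__ == "__main__":
    # Made-up records standing in for an API response.
    clusters = [
        {"id": "c-9", "label": "Stale admin grants"},
        {"id": "c-1", "label": "Unowned sensitive access"},
    ]
    print(first_matching(clusters, RESOLVERS["unowned-sensitive-access"]))
```

Returning `None` instead of raising lets the caller decide whether a missing entity should skip one capture or fail the whole snapshot.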
Both scripts already handle:
- Playwright browser launch with auth bypass
- Tenant selection (QA_TENANT_ID=demo-w1)
- Page navigation with wait-for-network-idle
- Scroll offset captures (detail script: QA_SCROLL_OFFSETS=0,600,1200)
- Output directory configuration
Key reuse points from existing code:
- QA_BASE_URL, QA_TENANT_ID, QA_OUTPUT_DIR env vars
- QA_PAGES for page selection (base script)
- QA_DETAIL_PATH, QA_DETAIL_PREFIX, QA_SCROLL_OFFSETS for entity-level captures (detail script)
- The reg-cli diff infrastructure (for later delta comparison)
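As a sketch of how an orchestrator might reuse these environment variables, the helper below builds the environment for one detail-script run. The variable names come from the list above; the helper name `detail_capture_env`, and the reading of QA_DETAIL_PREFIX as an output filename prefix, are assumptions. How the script itself is launched is left out because the plan does not specify the command.

```python
import os
from typing import Mapping, Optional, Sequence


def detail_capture_env(
    base_url: str,
    tenant_id: str,
    output_dir: str,
    detail_path: str,
    prefix: str,
    scroll_offsets: Sequence[int],
    base_env: Optional[Mapping[str, str]] = None,
) -> dict[str, str]:
    """Build the environment for one run of the detail screenshot script."""
    # Start from the current process environment unless a base is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "QA_BASE_URL": base_url,
            "QA_TENANT_ID": tenant_id,
            "QA_OUTPUT_DIR": output_dir,
            "QA_DETAIL_PATH": detail_path,
            "QA_DETAIL_PREFIX": prefix,  # assumed: prefix for output filenames
            "QA_SCROLL_OFFSETS": ",".join(str(offset) for offset in scroll_offsets),
        }
    )
    return env


if __name__ == "__main__":
    env = detail_capture_env(
        "http://localhost:8080", "demo-w1", "out", "/clusters/c-1",
        "cluster-detail-unowned-sensitive-access", [0, 600, 1200], base_env={},
    )
    print(env["QA_SCROLL_OFFSETS"])
```

The resulting dict can be passed as `env=` to `subprocess.run` with whatever command the repository uses to start the script.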
1.3 Package as Skill
Create sv0-skills/platform-visual-capture/SKILL.md:
---
name: platform-visual-capture
description: "Capture a complete visual snapshot of the SecurityV0 platform for review cycles"
allowed-tools: Bash(*), Read, Glob, Grep
argument-hint: "[label] [--env local|dev|staging] [--pages page1,page2]"
---
The skill:
- Checks platform is running (health endpoint)
- Confirms demo tenant data is present
- Runs the extended screenshot script against the capture manifest
- Writes the snapshot to the designated location
- Outputs the snapshot path and manifest for downstream use
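The first two steps are fail-fast preconditions. One way to express that ordering is sketched below, with each check injected as a callable so that no endpoint or query is hard-coded; `run_preflight` and `PreflightError` are hypothetical names, and the actual health and tenant-data checks are not defined in this plan.

```python
from typing import Callable


class PreflightError(RuntimeError):
    """Raised when a precondition for capture is not met."""


def run_preflight(checks: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run named checks in order and stop at the first failure.

    Returns the names of the checks that passed.
    """
    passed = []
    for name, check in checks:
        if not check():
            raise PreflightError(f"preflight failed: {name}")
        passed.append(name)
    return passed


if __name__ == "__main__":
    # Stand-in checks; real ones would call the health endpoint and the API.
    print(run_preflight([("platform health", lambda: True), ("demo tenant data", lambda: True)]))
```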
1.4 Snapshot Storage
Snapshots are stored in sv0-intelligence (not sv0-platform) since they're research artifacts:
sv0-intelligence/
└── store/
    └── snapshots/
        ├── 2026-03-19-demo-w1/
        │   ├── manifest.json
        │   ├── overview.png
        │   ├── cluster-detail-unowned-sensitive-access.png
        │   ├── path-detail-foundry-agent.png
        │   └── ...
        └── 2026-04-02-demo-w1/
            └── ...
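A small sketch of the naming convention and manifest output, assuming a store root that mirrors the tree above. The helper names and the manifest.json keys (`label`, `captured_at`, `environment`, `pages`) are illustrative; the plan only says the manifest holds captured paths, filenames, and metadata.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def snapshot_dir(store_root: Path, label: str, day: date) -> Path:
    """Return store/snapshots/YYYY-MM-DD-<label> under the given store root."""
    return store_root / "snapshots" / f"{day.isoformat()}-{label}"


def write_manifest(directory: Path, pages: list[dict], environment: str, captured_at: str) -> Path:
    """Create the snapshot directory if needed and write manifest.json into it."""
    directory.mkdir(parents=True, exist_ok=True)
    manifest = {
        "label": directory.name,
        "captured_at": captured_at,
        "environment": environment,
        "pages": pages,  # assumed shape: [{"name": ..., "path": ..., "file": ...}]
    }
    out = directory / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = snapshot_dir(Path(tmp) / "store", "demo-w1", date(2026, 3, 19))
        pages = [{"name": "overview", "path": "/", "file": "overview.png"}]
        print(write_manifest(target, pages, "localhost:8080, tenant demo-w1", "2026-03-19T14:30:00Z"))
```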
Phase 2: Agent Updates for Visual Input
Where: sv0-platform/.claude/agents/*.md
Depends on: Phase 1 (snapshot exists)
Effort: 1-2 sessions
GitHub: sv0-platform #100 (or follow-up)
2.1 Update Agent Tool Access
Currently, ux-critic has no Bash/curl access — it can only read files. For visual review, it needs access to screenshot files.
Update agent definitions to include Read access to the snapshot directory:
# ux-critic.md frontmatter
tools: Read, Grep, Glob
The Read tool already supports image files — Claude's multimodal capabilities let agents see PNG screenshots directly.
2.2 Add Visual Review Instructions
Each agent that evaluates UX-facing output gets a new section:
UX Critic — primary visual consumer:
- Evaluate rendered visual hierarchy (not just component tree)
- Count visible jargon terms from screenshots (not from code strings)
- Assess navigation flow from sidebar screenshots
- Grade information architecture from actual layout, not inferred structure
Enterprise Executive — partner handout test:
- Evaluate whether rendered output could be presented to a CIO
- Assess visual polish: alignment, spacing, typography, professional appearance
- Check that cluster verdict sentences render with appropriate visual prominence
CEO Reviewer — sellability:
- "Would I show this screenshot to a partner?" test
- Visual "wow factor" assessment
- Check that the demo path (Overview → Cluster → Path → Detail) looks compelling in screenshots
Product QA — spec match:
- Compare rendered output to UX spec mockups
- Verify that fix items (breadcrumbs, stat cards, governance labels) render correctly
- Check empty/edge states that aren't testable from code alone
CISO, SecOps, Security Auditor — minimal visual changes:
- These agents primarily evaluate content, not presentation
- Add: "review screenshot of [relevant page] to confirm data matches API response"
- Useful for catching rendering-vs-API discrepancies
2.3 Structured Visual Input Format
Each agent receives screenshots as part of its review input:
## Visual Snapshot
Snapshot: `2026-03-19-demo-w1`
Captured: 2026-03-19T14:30:00Z
Environment: localhost:8080, tenant demo-w1
### Pages
- Overview: `snapshots/2026-03-19-demo-w1/overview.png`
- Cluster Detail (Unowned Sensitive Access): `snapshots/2026-03-19-demo-w1/cluster-detail-unowned-sensitive-access.png`
- Path Detail (Foundry Agent): `snapshots/2026-03-19-demo-w1/path-detail-foundry-agent.png`
...
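The block above could be generated from snapshot metadata rather than written by hand. The sketch below shows one possible renderer; `render_visual_input` and its parameters are hypothetical, and the output layout simply copies the example format.

```python
def render_visual_input(
    label: str,
    captured_at: str,
    environment: str,
    pages: list[tuple[str, str]],
) -> str:
    """Render the Visual Snapshot section for an agent's review prompt.

    `pages` is a list of (display title, filename) pairs.
    """
    lines = [
        "## Visual Snapshot",
        f"Snapshot: `{label}`",
        f"Captured: {captured_at}",
        f"Environment: {environment}",
        "### Pages",
    ]
    for title, filename in pages:
        lines.append(f"- {title}: `snapshots/{label}/{filename}`")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_visual_input(
            "2026-03-19-demo-w1",
            "2026-03-19T14:30:00Z",
            "localhost:8080, tenant demo-w1",
            [("Overview", "overview.png")],
        )
    )
```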
Phase 3: Acceptance Review Researcher
Where: sv0-intelligence/researchers/acceptance_review/
Depends on: Phase 1 (visual capture), Phase 2 (agent updates)
Effort: 4-6 sessions
GitHub: sv0-intelligence #5
3.1 Researcher Structure
Following the weekly_incident pattern:
researchers/acceptance_review/
├── main.py              # Entry: prepare → review → synthesize
├── prepare.py           # Stage 1: invoke visual capture skill, verify snapshot
├── review.py            # Stage 2: load agent definitions, run reviews, collect outputs
├── synthesize.py        # Stage 3: extract MPAS-7 scores, compute deltas, generate report
├── models.py            # ReviewRun, AgentResult, MPAS7Score
└── prompts/
    ├── extract_scores.txt   # Extract structured scores from agent output
    └── synthesize.txt       # Generate consolidated review brief
Agent definitions read from: sv0-platform/.claude/agents/*.md (not copied — read at runtime)
3.2 Stage 1: Prepare
def prepare(label: str, env: str = "local") -> Snapshot:
    """Invoke visual capture skill, return snapshot metadata."""
    # 1. Check platform is running
    # 2. Invoke platform-visual-capture skill (or shell out to screenshot script)
    # 3. Verify snapshot directory and manifest.json exist
    # 4. Return Snapshot(path, label, timestamp, page_count)
Platform startup options:
- Local (default): Assume platform is already running on localhost:8080. The skill checks health, fails fast if down.
- Dev server: Capture from dev-sv0.fofanov.ai. No local startup needed.
- Docker start: docker compose up -d in sv0-platform, wait for health. Useful for CI.
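Steps 3 and 4 of the prepare stage could be sketched as follows. The `Snapshot` fields follow the comment in the stub (path, label, timestamp, page_count); the manifest.json keys (`captured_at`, `pages`, `file`) and the `load_snapshot` name are assumptions, since the manifest schema is not fixed in this plan.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Snapshot:
    path: Path
    label: str
    timestamp: str
    page_count: int


def load_snapshot(directory: Path) -> Snapshot:
    """Verify a snapshot directory and return its metadata."""
    if not directory.is_dir():
        raise FileNotFoundError(f"snapshot directory missing: {directory}")
    manifest_path = directory / "manifest.json"
    if not manifest_path.is_file():
        raise FileNotFoundError(f"manifest.json missing in {directory}")
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    pages = manifest.get("pages", [])
    # Every file the manifest lists must actually exist on disk.
    missing = [p["file"] for p in pages if not (directory / p["file"]).is_file()]
    if missing:
        raise FileNotFoundError(f"screenshots listed but not found: {missing}")
    return Snapshot(directory, directory.name, manifest.get("captured_at", ""), len(pages))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snap_dir = Path(tmp) / "2026-03-19-demo-w1"
        snap_dir.mkdir()
        (snap_dir / "overview.png").write_bytes(b"")  # placeholder file
        (snap_dir / "manifest.json").write_text(
            json.dumps({"captured_at": "2026-03-19T14:30:00Z", "pages": [{"file": "overview.png"}]}),
            encoding="utf-8",
        )
        print(load_snapshot(snap_dir))
```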
3.3 Stage 2: Review
def review(snapshot: Snapshot, agents: list[str] = ALL_AGENTS) -> list[AgentResult]:
    """Run reviewer agents against current platform state + snapshot."""
    results = []
    for agent_name in agents:
        # 1. Load the agent definition from sv0-platform/.claude/agents/{agent_name}.md
        #    Strip YAML frontmatter, use markdown body as the system prompt
        # 2. Construct the review prompt with:
        #    - Snapshot paths (for visual agents)
        #    - API endpoint (for data agents)
        #    - Previous cycle scores (for context)
        # 3. Call claude_client.complete() with the agent body as system prompt:
        #    complete(prompt=review_prompt, system=agent_system_prompt)
        # 4. Parse structured output into AgentResult
        results.append(result)
    return results
Agent invocation — system prompt embedding, not CLI flags:
There is no --agent flag in the Claude CLI. The researcher loads each agent's .md definition file from sv0-platform/.claude/agents/, strips the YAML frontmatter, and passes the markdown body as the system parameter to claude_client.complete(). This is the same pattern weekly_incident uses for its scoring and generation prompts — prompts are loaded from files and passed to the Claude client.
In CLI mode, claude_client.py prepends the system prompt using <system> XML tags in the prompt body. In SDK mode, it passes it as the system parameter to client.messages.create(). Both modes are transparent to the researcher code.
Since sv0-intelligence already has sv0-platform in its additionalDirectories (.claude/settings.json), agent definition files are directly readable.
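The frontmatter stripping and CLI-mode embedding described above might look like the sketch below. Both helpers are hypothetical; in particular, the exact layout that claude_client.py uses around the `<system>` tags is not shown in this plan, so `wrap_for_cli` is only one plausible form.

```python
def strip_frontmatter(text: str) -> str:
    """Return the markdown body, dropping a leading '---' ... '---' block if present."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter at all
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "\n".join(lines[index + 1:]).lstrip("\n")
    return text  # unterminated frontmatter: leave the file untouched


def wrap_for_cli(system: str, prompt: str) -> str:
    """Embed the system prompt in the prompt body (assumed CLI-mode layout)."""
    return f"<system>\n{system}\n</system>\n\n{prompt}"


if __name__ == "__main__":
    # Made-up agent file contents for demonstration.
    definition = "---\nname: ux-critic\ntools: Read, Grep, Glob\n---\n\nYou are the UX critic.\n"
    body = strip_frontmatter(definition)
    print(wrap_for_cli(body, "Review the snapshot 2026-03-19-demo-w1."))
```

Reading the definition at runtime and stripping it in memory keeps sv0-platform as the single source of truth for agent prompts.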
Not all agents run every cycle. The researcher accepts an --agents flag:
- --agents all: full 7-agent sweep (before partner demo, sprint completion)
- --agents secops,product-qa: targeted review after specific changes
- --agents ux-critic,enterprise-executive,ceo-reviewer: visual-focused review
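A possible argparse sketch for the --agents flag. The `ALL_AGENTS` list is an assumption taken from the per-agent output filenames later in this plan; note that the flag examples use the short form `secops`, so the sketch deliberately does not validate names against that list.

```python
import argparse

# Assumed from the agent-*.md output filenames; the canonical list may differ.
ALL_AGENTS = [
    "ciso-executive",
    "secops-analyst",
    "product-qa",
    "ux-critic",
    "security-auditor",
    "enterprise-executive",
    "ceo-reviewer",
]


def parse_agents(value: str) -> list[str]:
    """Parse 'all' or a comma-separated list of agent names."""
    if value.strip().lower() == "all":
        return list(ALL_AGENTS)
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise argparse.ArgumentTypeError("--agents needs 'all' or a comma-separated list")
    return names


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="researchers.acceptance_review.main")
    parser.add_argument("--agents", type=parse_agents, default=list(ALL_AGENTS))
    parser.add_argument("--label", default="demo-w1")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--agents", "secops,product-qa", "--label", "demo-w1"])
    print(args.agents, args.label)
```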
3.4 Stage 3: Synthesize
def synthesize(results: list[AgentResult], previous: ReviewRun | None) -> ReviewBrief:
    """Extract MPAS-7 scores, compute deltas, generate consolidated brief."""
    # 1. Extract per-role scores from each agent's structured output
    # 2. If previous cycle exists, compute deltas
    # 3. Flag critical/blocking findings
    # 4. Flag pending CEO decisions
    # 5. Generate consolidated markdown brief
Output: A review run directory:
sv0-intelligence/output/acceptance_review/
└── 2026-03-19/
    ├── run.json                       # Run metadata, MPAS-7 scores, deltas
    ├── brief.md                       # Consolidated brief (CEO-ready)
    ├── agent-ciso-executive.md        # Individual agent output
    ├── agent-secops-analyst.md
    ├── agent-product-qa.md
    ├── agent-ux-critic.md
    ├── agent-security-auditor.md
    ├── agent-enterprise-executive.md
    ├── agent-ceo-reviewer.md
    └── snapshot/                      # Symlink to store/snapshots/2026-03-19-demo-w1/
3.5 MPAS-7 Score Tracking
Scores are stored in SQLite (extending the existing signal_store pattern):
CREATE TABLE review_runs (
    id TEXT PRIMARY KEY,
    run_date TEXT NOT NULL,
    snapshot_label TEXT NOT NULL,
    agents_run TEXT NOT NULL,      -- JSON array of agent names
    score_ciso REAL,
    score_secops REAL,
    score_product_qa TEXT,         -- "X partial, Y missing" format
    score_ux TEXT,                 -- "grade / N jargon terms" format
    score_auditor INTEGER,         -- critical issue count
    score_enterprise REAL,         -- 1-5 scale
    score_ceo TEXT,                -- "X/Y accepted" format
    brief_path TEXT
);
Delta computation: compare current run against the most recent full run (all 7 agents).
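The delta rule can be sketched against the schema above using Python's built-in sqlite3 module. The helper names are hypothetical, a "full run" is taken to mean seven entries in agents_run, and only the numeric columns are compared; the text-format scores (product QA, UX, CEO) would need their own parsing, which is not covered here.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE review_runs (
    id TEXT PRIMARY KEY, run_date TEXT NOT NULL, snapshot_label TEXT NOT NULL,
    agents_run TEXT NOT NULL, score_ciso REAL, score_secops REAL,
    score_product_qa TEXT, score_ux TEXT, score_auditor INTEGER,
    score_enterprise REAL, score_ceo TEXT, brief_path TEXT
);
"""
NUMERIC_COLUMNS = ("score_ciso", "score_secops", "score_auditor", "score_enterprise")
FULL_RUN_SIZE = 7  # a full run includes all 7 agents


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # allow access by column name
    conn.executescript(SCHEMA)
    return conn


def add_run(conn, run_id, run_date, label, agents, **scores):
    """Insert one run. Score keyword names must be trusted column names."""
    columns = ["id", "run_date", "snapshot_label", "agents_run", *scores]
    values = [run_id, run_date, label, json.dumps(agents), *scores.values()]
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(
        f"INSERT INTO review_runs ({', '.join(columns)}) VALUES ({placeholders})", values
    )


def latest_full_run(conn, before_date):
    """Return the most recent run before `before_date` that covered all agents."""
    rows = conn.execute(
        "SELECT * FROM review_runs WHERE run_date < ? ORDER BY run_date DESC",
        (before_date,),
    )
    for row in rows:
        if len(json.loads(row["agents_run"])) == FULL_RUN_SIZE:
            return row
    return None


def numeric_deltas(current, baseline):
    """Compute current minus baseline for numeric scores present in both runs."""
    deltas = {}
    for column in NUMERIC_COLUMNS:
        if current[column] is not None and baseline[column] is not None:
            deltas[column] = round(current[column] - baseline[column], 2)
    return deltas
```

Skipping partial runs when picking the baseline keeps a targeted two-agent review from resetting the comparison point for the other five roles.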
3.6 Triggers
The researcher runs:
- Before partner demo: Full 7-agent sweep (manual trigger or scheduled)
- After sprint completion: Full sweep against latest code
- After targeted changes: Relevant agents only (e.g., fixed remediation → secops + product-qa)
- On demand: python -m researchers.acceptance_review.main --agents all --label demo-w1
No automatic scheduling initially — triggered manually or via GitHub Actions workflow_dispatch.
Phase 4: Integration and First Run
Depends on: Phases 1-3
Effort: 1-2 sessions
4.1 First Validated Run
Execute the full pipeline against the current platform state:
- Run visual capture skill → produce 2026-04-XX-demo-w1 snapshot
- Run all 7 agents with visual input
- Synthesize MPAS-7 scores
- Compare against March 15 baseline (manually entered as the seed run)
- Review delta: did visual input change any agent's findings?
Expected outcome: The UX critic, enterprise executive, and CEO reviewer should produce qualitatively different findings when they can see the platform. Product QA may catch rendering issues invisible from code.
4.2 Baseline Calibration
If the first run's scores diverge significantly from March 15 (which had no visual input), investigate:
- Are the new findings genuine (visual problems the code review missed)?
- Or are the agents over-reacting to visual noise (screenshot artifacts, rendering differences)?
Calibrate agent prompts if needed. The goal is that visual input adds signal, not noise.
4.3 Documentation
After the first validated run:
- Update the March review topic index with the hardened process
- Document the visual capture workflow in sv0-skills/README
- Add the acceptance reviewer to sv0-intelligence/README
- Update sv0-platform/.claude/agents/README with visual input instructions
Dependency Chain
Phase 1 (Visual Capture Skill)
↓
Phase 2 (Agent Updates) ← can start in parallel with late Phase 1
↓
Phase 3 (Acceptance Researcher) ← needs Phase 1 output format finalized
↓
Phase 4 (Integration + First Run)
Critical path: Phase 1 is the blocker. Without screenshots, everything else is partial.
What This Does NOT Change
- Agent definitions stay in sv0-platform (they need codebase access)
- The consolidated action plan (what to build) is unchanged
- Existing sprint plan priorities are unchanged
- This is process infrastructure, not product changes
- No new dependencies on external services (Playwright is already installed)
Effort Summary
| Phase | Effort | Blocks |
|---|---|---|
| Phase 1: Visual Capture Skill | 2-3 sessions | Everything |
| Phase 2: Agent Updates | 1-2 sessions | Phase 3 |
| Phase 3: Acceptance Researcher | 4-6 sessions | Phase 4 |
| Phase 4: Integration + First Run | 1-2 sessions | — |
| Total | 8-13 sessions | — |
GitHub Issues
- sv0-skills #7 — Platform UI Visual Capture skill (Phase 1)
- sv0-intelligence #5 — Multi-Perspective Platform Acceptance Review (Phases 3-4)
- sv0-platform #100 — Agent definitions (Phase 2 updates)