
Review Process Hardening Plan

The Problem

The March 2026 multi-perspective review was run with a fundamental blind spot: no reviewer agent saw the platform. All 7 agents — including the UX critic, enterprise executive, and CEO reviewer — evaluated the product by reading source code, API responses, and design specs. The actual rendered UI was never part of the review input.

This produced three categories of findings that were wrong or incomplete:

  1. UX critic rated navigation as "weak" — but could only evaluate the component tree, not the rendered navigation flow. A jargon count of 23 was derived from code strings, not from what a user actually reads on screen.

  2. Enterprise executive scored sellability at 1.8/5 — but never saw what a partner would actually show a CIO. The cluster verdict sentences that "pass the 5-second comprehension test" were evaluated as text, not as rendered cards with visual hierarchy.

  3. Product QA found "hash IDs in breadcrumbs", correctly, from code. But it could not assess whether the breadcrumbs are even visible enough to matter, whether the hash truncation is readable, or whether the layout draws attention to the wrong element.

The review produced valid code-level and API-level findings. But for UX, readability, and partner deliverable quality, accuracy is structurally capped without visual input.


What We're Building

Two new components that integrate with existing infrastructure:

Component 1: Visual Capture Skill (sv0-skills #7)

A Claude Code skill that produces a complete screenshot snapshot of the platform before any review cycle starts.

Builds on existing infrastructure in sv0-platform:

  • scripts/visual-screenshot.ts — Playwright-based page capture (already handles login, tenant selection, page navigation, scroll offsets)
  • review-ui skill — already captures screenshots via npm run qa:visual and evaluates against product vision
  • visual-review skill — already does before/after screenshot comparison for PRs
  • Demo tenant demo-w1 with seed data (29 active authority paths, 6 clusters, 51 findings)

What's new:

The existing tools capture screenshots for specific purposes (PR diff, vision alignment). The new skill produces a comprehensive, named snapshot that any reviewer agent can reference — covering all pages, interesting entities, edge states, and navigation patterns.

Component 2: Acceptance Review Researcher (sv0-intelligence #5)

A structured research workflow in sv0-intelligence that orchestrates the 7-agent review, collects results, and tracks MPAS-7 scores over time.

Builds on existing infrastructure in sv0-intelligence:

  • weekly_incident researcher pattern — gather → score → generate → publish pipeline
  • shared/claude_client.py — dual-mode Claude wrapper (SDK or CLI)
  • shared/signal_store.py — SQLite-backed persistence
  • scheduler.py — cron-based recurring execution

What's new:

A second researcher (acceptance_review) that follows the same pattern but with different stages: prepare (visual capture) → review (7 agents) → synthesize (MPAS-7 delta report).


Implementation Phases

Phase 1: Visual Capture Skill

Where: sv0-skills/platform-visual-capture/SKILL.md
Depends on: Existing scripts/visual-screenshot.ts in sv0-platform
Effort: 2-3 sessions
GitHub: sv0-skills #7

1.1 Define the Capture Manifest

A YAML file listing every page and entity to capture:

# capture-manifest.yaml
pages:
  - name: overview
    path: /
    description: "First thing any user sees: the 15-second CISO test"

  - name: cluster-detail
    path: /clusters/{cluster_id}
    description: "Grouped finding summary: CISO executive readability"
    instances:
      - label: "unowned-sensitive-access"
        resolve: "first cluster with 'unowned' in label"

  - name: authority-paths-list
    path: /authority-paths
    description: "Analyst investigation starting point"

  - name: path-detail-typical
    path: /authority-paths/{path_id}
    description: "Representative path with full evidence pack"
    instances:
      - label: "foundry-agent"
        resolve: "first path with execution_30d > 0"

  - name: path-detail-edge
    path: /authority-paths/{path_id}
    description: "Path with orphaned ownership + scope drift + LLM egress"
    instances:
      - label: "orphaned-llm-egress"
        resolve: "first path matching all three conditions"

  - name: findings-list
    path: /findings
    description: "Flat finding table: analyst-only, flagged as needing grouping"

  - name: exposures
    path: /exposures
    description: "Exposure aggregation view"

captures_per_page:
  - viewport: [1440, 900]
  - full_page: true
  - scroll_positions: [0, 50%, 100%]
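One way to read the manifest is as a list of pages that expands into one capture target per page or per instance. The sketch below is illustrative only: it works on a hand-written dict standing in for the parsed YAML, and the `CaptureTarget` type, the `expand_manifest` helper, and the `<name>-<label>.png` filename scheme are assumptions rather than anything defined in the existing scripts.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hand-written stand-in for the parsed capture-manifest.yaml (two entries only).
MANIFEST = {
    "pages": [
        {"name": "overview", "path": "/"},
        {
            "name": "cluster-detail",
            "path": "/clusters/{cluster_id}",
            "instances": [
                {
                    "label": "unowned-sensitive-access",
                    "resolve": "first cluster with 'unowned' in label",
                }
            ],
        },
    ],
}


@dataclass(frozen=True)
class CaptureTarget:
    filename: str        # assumed naming scheme: <name>[-<label>].png
    path_template: str   # may still contain a {placeholder}
    resolve: str | None  # free-text rule, resolved against the API later


def expand_manifest(manifest: dict) -> list[CaptureTarget]:
    """Flatten pages and their instances into one list of capture targets."""
    targets = []
    for page in manifest["pages"]:
        instances = page.get("instances")
        if not instances:
            # Static page: one target, nothing to resolve.
            targets.append(CaptureTarget(f"{page['name']}.png", page["path"], None))
            continue
        for inst in instances:
            filename = f"{page['name']}-{inst['label']}.png"
            targets.append(CaptureTarget(filename, page["path"], inst["resolve"]))
    return targets


targets = expand_manifest(MANIFEST)
for target in targets:
    print(target.filename, target.path_template)
```

Keeping the expansion separate from the browser work means the list of expected files can be checked before Playwright starts.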

1.2 Extend the Screenshot Script

Two existing scripts handle the two capture modes:

  • scripts/visual-screenshot.ts — captures standard pages (Overview, Findings, etc.) by path list
  • scripts/visual-screenshot-detail.ts — captures entity-level detail pages (authority paths, clusters) at multiple scroll positions via QA_DETAIL_PATH and QA_SCROLL_OFFSETS

The capture skill orchestrates both scripts in sequence: first the base script for all standard pages, then the detail script for each entity in the manifest. Extend with:

  • A capture manifest (YAML) that defines which pages and entities to capture
  • Dynamic entity resolution (e.g., "first cluster with 'unowned' in label") by querying the API before capture
  • A named snapshot directory: snapshots/YYYY-MM-DD-<label>/
  • A manifest.json output with captured paths, filenames, and metadata
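Dynamic entity resolution could be sketched as a lookup from the manifest's rule text to a predicate over API records. Everything here is an assumption made for illustration: the record fields (`id`, `label`, `execution_30d`), the `RESOLVERS` table, and the `resolve_entity` helper. The real API response shape is not described in this plan.

```python
from typing import Callable, Iterable

Record = dict

# Hypothetical mapping from a manifest "resolve" rule to a predicate.
RESOLVERS: dict[str, Callable[[Record], bool]] = {
    "first cluster with 'unowned' in label":
        lambda r: "unowned" in r.get("label", "").lower(),
    "first path with execution_30d > 0":
        lambda r: r.get("execution_30d", 0) > 0,
}


def resolve_entity(rule: str, records: Iterable[Record]) -> str:
    """Return the id of the first record that satisfies the named rule."""
    try:
        predicate = RESOLVERS[rule]
    except KeyError:
        raise ValueError(f"no resolver registered for rule: {rule!r}") from None
    for record in records:
        if predicate(record):
            return record["id"]
    raise LookupError(f"no record matches rule: {rule!r}")


# Made-up records standing in for an API response.
clusters = [
    {"id": "c-1", "label": "Stale service accounts"},
    {"id": "c-2", "label": "Unowned sensitive access"},
]
cluster_id = resolve_entity("first cluster with 'unowned' in label", clusters)
url_path = "/clusters/{cluster_id}".format(cluster_id=cluster_id)
print(url_path)  # /clusters/c-2
```

Failing loudly when no record matches keeps a seed-data change from silently producing a snapshot with a missing page.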

Both scripts already handle:

  • Playwright browser launch with auth bypass
  • Tenant selection (QA_TENANT_ID=demo-w1)
  • Page navigation with wait-for-network-idle
  • Scroll offset captures (detail script: QA_SCROLL_OFFSETS=0,600,1200)
  • Output directory configuration

Key reuse points from existing code:

  • QA_BASE_URL, QA_TENANT_ID, QA_OUTPUT_DIR env vars
  • QA_PAGES for page selection (base script)
  • QA_DETAIL_PATH, QA_DETAIL_PREFIX, QA_SCROLL_OFFSETS for entity-level captures (detail script)
  • The reg-cli diff infrastructure (for later delta comparison)
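If the skill shells out to the detail script, the environment for one entity capture might be assembled as below. The variable names come from the list above; the meaning given to QA_DETAIL_PREFIX (a filename prefix), the `npx tsx` runner, and the example path id are assumptions.

```python
import os


def detail_capture_env(base_url, tenant_id, output_dir, detail_path,
                       prefix, scroll_offsets, base_env=None):
    """Build the environment for one run of the detail screenshot script."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "QA_BASE_URL": base_url,
        "QA_TENANT_ID": tenant_id,
        "QA_OUTPUT_DIR": output_dir,
        "QA_DETAIL_PATH": detail_path,
        "QA_DETAIL_PREFIX": prefix,  # assumed to be used as a filename prefix
        "QA_SCROLL_OFFSETS": ",".join(str(offset) for offset in scroll_offsets),
    })
    return env


# Assumed way to run the TypeScript script; the real project may use an npm script.
DETAIL_COMMAND = ["npx", "tsx", "scripts/visual-screenshot-detail.ts"]

env = detail_capture_env(
    base_url="http://localhost:8080",
    tenant_id="demo-w1",
    output_dir="snapshots/2026-03-19-demo-w1",
    detail_path="/authority-paths/ap-123",  # hypothetical resolved id
    prefix="path-detail-foundry-agent",
    scroll_offsets=[0, 600, 1200],
    base_env={},
)
print(env["QA_SCROLL_OFFSETS"])  # 0,600,1200

# The actual invocation would then look roughly like:
# subprocess.run(DETAIL_COMMAND, env=env, cwd="sv0-platform", check=True)
```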

1.3 Package as Skill

Create sv0-skills/platform-visual-capture/SKILL.md:

---
name: platform-visual-capture
description: "Capture a complete visual snapshot of the SecurityV0 platform for review cycles"
allowed-tools: Bash(*), Read, Glob, Grep
argument-hint: "[label] [--env local|dev|staging] [--pages page1,page2]"
---

The skill:

  1. Checks platform is running (health endpoint)
  2. Confirms demo tenant data is present
  3. Runs the extended screenshot script against the capture manifest
  4. Writes the snapshot to the designated location
  5. Outputs the snapshot path and manifest for downstream use

1.4 Snapshot Storage

Snapshots are stored in sv0-intelligence (not sv0-platform) since they're research artifacts:

sv0-intelligence/
└── store/
    └── snapshots/
        ├── 2026-03-19-demo-w1/
        │   ├── manifest.json
        │   ├── overview.png
        │   ├── cluster-detail-unowned-sensitive-access.png
        │   ├── path-detail-foundry-agent.png
        │   └── ...
        └── 2026-04-02-demo-w1/
            └── ...
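The directory naming and the manifest.json output could look roughly like this. The YYYY-MM-DD-<label> layout follows the tree above; the manifest field names (`label`, `captured_at`, `environment`, `files`) are assumptions, since the plan only says the file holds captured paths, filenames, and metadata.

```python
import json
import tempfile
from datetime import date, datetime, timezone
from pathlib import Path


def snapshot_dir(root: Path, label: str, day: date) -> Path:
    """Return <root>/YYYY-MM-DD-<label>."""
    return root / f"{day.isoformat()}-{label}"


def write_manifest(directory: Path, label: str, files: list[str],
                   environment: str) -> Path:
    """Write manifest.json describing the snapshot (field names are assumed)."""
    directory.mkdir(parents=True, exist_ok=True)
    manifest = {
        "label": label,
        "captured_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "environment": environment,
        "files": sorted(files),
    }
    out = directory / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return out


# Demo in a temporary directory so nothing is written into a real repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "store" / "snapshots"
    target = snapshot_dir(root, "demo-w1", date(2026, 3, 19))
    manifest_path = write_manifest(
        target, "demo-w1", ["overview.png"], "localhost:8080"
    )
    loaded = json.loads(manifest_path.read_text(encoding="utf-8"))
    print(target.name)  # 2026-03-19-demo-w1
```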

Phase 2: Agent Updates for Visual Input

Where: sv0-platform/.claude/agents/*.md
Depends on: Phase 1 (snapshot exists)
Effort: 1-2 sessions
GitHub: sv0-platform #100 (or follow-up)

2.1 Update Agent Tool Access

Currently, ux-critic has no Bash/curl access — it can only read files. For visual review, it needs access to screenshot files.

Update agent definitions to include Read access to the snapshot directory:

# ux-critic.md frontmatter
tools: Read, Grep, Glob

The Read tool already supports image files — Claude's multimodal capabilities let agents see PNG screenshots directly.

2.2 Add Visual Review Instructions

Each agent that evaluates UX-facing output gets a new section:

UX Critic — primary visual consumer:

  • Evaluate rendered visual hierarchy (not just component tree)
  • Count visible jargon terms from screenshots (not from code strings)
  • Assess navigation flow from sidebar screenshots
  • Grade information architecture from actual layout, not inferred structure

Enterprise Executive — partner handout test:

  • Evaluate whether rendered output could be presented to a CIO
  • Assess visual polish: alignment, spacing, typography, professional appearance
  • Check that cluster verdict sentences render with appropriate visual prominence

CEO Reviewer — sellability:

  • "Would I show this screenshot to a partner?" test
  • Visual "wow factor" assessment
  • Check that the demo path (Overview → Cluster → Path → Detail) looks compelling in screenshots

Product QA — spec match:

  • Compare rendered output to UX spec mockups
  • Verify that fix items (breadcrumbs, stat cards, governance labels) render correctly
  • Check empty/edge states that aren't testable from code alone

CISO, SecOps, Security Auditor — minimal visual changes:

  • These agents primarily evaluate content, not presentation
  • Add: "review screenshot of [relevant page] to confirm data matches API response"
  • Useful for catching rendering-vs-API discrepancies

2.3 Structured Visual Input Format

Each agent receives screenshots as part of its review input:

## Visual Snapshot

Snapshot: `2026-03-19-demo-w1`
Captured: 2026-03-19T14:30:00Z
Environment: localhost:8080, tenant demo-w1

### Pages
- Overview: `snapshots/2026-03-19-demo-w1/overview.png`
- Cluster Detail (Unowned Sensitive Access): `snapshots/2026-03-19-demo-w1/cluster-detail-unowned-sensitive-access.png`
- Path Detail (Foundry Agent): `snapshots/2026-03-19-demo-w1/path-detail-foundry-agent.png`
...
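A block in this format could be generated from the snapshot metadata instead of written by hand. The sketch below assumes a hypothetical `render_visual_snapshot` helper that takes (title, filename) pairs. It mirrors the layout shown above and is not part of any existing module.

```python
def render_visual_snapshot(label: str, captured: str, environment: str,
                           pages: list[tuple[str, str]]) -> str:
    """Render the 'Visual Snapshot' section of an agent's review input."""
    lines = [
        "## Visual Snapshot",
        "",
        f"Snapshot: `{label}`",
        f"Captured: {captured}",
        f"Environment: {environment}",
        "",
        "### Pages",
    ]
    # One bullet per captured page, pointing at the file inside the snapshot.
    lines += [
        f"- {title}: `snapshots/{label}/{filename}`" for title, filename in pages
    ]
    return "\n".join(lines) + "\n"


section = render_visual_snapshot(
    label="2026-03-19-demo-w1",
    captured="2026-03-19T14:30:00Z",
    environment="localhost:8080, tenant demo-w1",
    pages=[
        ("Overview", "overview.png"),
        ("Path Detail (Foundry Agent)", "path-detail-foundry-agent.png"),
    ],
)
print(section)
```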

Phase 3: Acceptance Review Researcher

Where: sv0-intelligence/researchers/acceptance_review/
Depends on: Phase 1 (visual capture), Phase 2 (agent updates)
Effort: 4-6 sessions
GitHub: sv0-intelligence #5

3.1 Researcher Structure

Following the weekly_incident pattern:

researchers/acceptance_review/
├── main.py          # Entry: prepare → review → synthesize
├── prepare.py       # Stage 1: invoke visual capture skill, verify snapshot
├── review.py        # Stage 2: load agent definitions, run reviews, collect outputs
├── synthesize.py    # Stage 3: extract MPAS-7 scores, compute deltas, generate report
├── models.py        # ReviewRun, AgentResult, MPAS7Score
└── prompts/
    ├── extract_scores.txt   # Extract structured scores from agent output
    └── synthesize.txt       # Generate consolidated review brief

Agent definitions read from: sv0-platform/.claude/agents/*.md (not copied — read at runtime)

3.2 Stage 1: Prepare

def prepare(label: str, env: str = "local") -> Snapshot:
    """Invoke visual capture skill, return snapshot metadata."""
    # 1. Check platform is running
    # 2. Invoke platform-visual-capture skill (or shell out to screenshot script)
    # 3. Verify snapshot directory and manifest.json exist
    # 4. Return Snapshot(path, label, timestamp, page_count)

Platform startup options:

  • Local (default): Assume platform is already running on localhost:8080. The skill checks health, fails fast if down.
  • Dev server: Capture from dev-sv0.fofanov.ai. No local startup needed.
  • Docker start: docker compose up -d in sv0-platform, wait for health. Useful for CI.
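The "check health, fail fast" behaviour for the local and dev options might be sketched as follows. The `/health` path, the `https` scheme for the dev host, and the helper names are assumptions. The status fetcher is injectable so the logic can be exercised without a running platform.

```python
import urllib.request
from typing import Callable

BASE_URLS = {
    "local": "http://localhost:8080",
    "dev": "https://dev-sv0.fofanov.ai",  # scheme is an assumption
}


def resolve_base_url(env: str) -> str:
    """Map an environment name to its base URL."""
    try:
        return BASE_URLS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch a URL and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_health(base_url: str,
                 fetch_status: Callable[[str], int] = http_status) -> None:
    """Raise RuntimeError unless the (assumed) /health endpoint answers 200."""
    url = f"{base_url}/health"  # endpoint path is an assumption
    try:
        status = fetch_status(url)
    except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
        raise RuntimeError(f"platform not reachable at {url}: {exc}") from exc
    if status != 200:
        raise RuntimeError(f"platform unhealthy at {url}: HTTP {status}")


# Demo with a stubbed fetcher; a real run would use the default http_status.
base_url = resolve_base_url("local")
check_health(base_url, fetch_status=lambda url: 200)
print(f"{base_url} is healthy (stubbed)")
```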

3.3 Stage 2: Review

def review(snapshot: Snapshot, agents: list[str] = ALL_AGENTS) -> list[AgentResult]:
    """Run reviewer agents against current platform state + snapshot."""
    results = []
    for agent_name in agents:
        # 1. Load the agent definition from sv0-platform/.claude/agents/{agent_name}.md
        #    Strip YAML frontmatter, use markdown body as the system prompt
        # 2. Construct the review prompt with:
        #    - Snapshot paths (for visual agents)
        #    - API endpoint (for data agents)
        #    - Previous cycle scores (for context)
        # 3. Call claude_client.complete() with the agent body as system prompt:
        #    complete(prompt=review_prompt, system=agent_system_prompt)
        # 4. Parse structured output into AgentResult
        results.append(result)
    return results

Agent invocation — system prompt embedding, not CLI flags:

There is no --agent flag in the Claude CLI. The researcher loads each agent's .md definition file from sv0-platform/.claude/agents/, strips the YAML frontmatter, and passes the markdown body as the system parameter to claude_client.complete(). This is the same pattern weekly_incident uses for its scoring and generation prompts — prompts are loaded from files and passed to the Claude client.

In CLI mode, claude_client.py prepends the system prompt using <system> XML tags in the prompt body. In SDK mode, it passes it as the system parameter to client.messages.create(). Both modes are transparent to the researcher code.

Since sv0-intelligence already has sv0-platform in its additionalDirectories (.claude/settings.json), agent definition files are directly readable.
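The frontmatter stripping and the CLI-mode embedding described above could be sketched like this. Both helpers are hypothetical, the exact tag layout claude_client.py uses in CLI mode is not specified here, and the agent text is a made-up placeholder.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is empty if the file has none."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return frontmatter, body
    return "", text  # unterminated block: treat the file as having no frontmatter


def embed_system_prompt(system: str, prompt: str) -> str:
    """Assumed CLI-mode layout: system prompt wrapped in <system> tags."""
    return f"<system>\n{system}\n</system>\n\n{prompt}"


# Placeholder agent definition, not the real ux-critic.md.
AGENT_MD = """---
name: ux-critic
tools: Read, Grep, Glob
---

You are a hypothetical UX critic.
"""

frontmatter, body = split_frontmatter(AGENT_MD)
cli_prompt = embed_system_prompt(body, "Review the snapshot.")
print(body)  # You are a hypothetical UX critic.
```

In SDK mode the same `body` string would go into the `system` parameter instead of being embedded in the prompt.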

Not all agents run every cycle. The researcher accepts an --agents flag:

  • --agents all — full 7-agent sweep (before partner demo, sprint completion)
  • --agents secops,product-qa — targeted review after specific changes
  • --agents ux-critic,enterprise-executive,ceo-reviewer — visual-focused review
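Parsing the --agents flag with argparse might look like the sketch below. The identifiers in `ALL_AGENTS` are assumptions pieced together from the examples above (the output filenames elsewhere in this plan use longer names such as secops-analyst), so the canonical list should come from the agent definition files.

```python
import argparse

# Assumed agent identifiers; the real list should come from .claude/agents/*.md.
ALL_AGENTS = [
    "ciso", "secops", "product-qa", "ux-critic",
    "security-auditor", "enterprise-executive", "ceo-reviewer",
]


def parse_agents(value: str) -> list[str]:
    """Turn 'all' or a comma-separated list into validated agent names."""
    if value.strip() == "all":
        return list(ALL_AGENTS)
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise argparse.ArgumentTypeError("no agents given")
    unknown = [name for name in names if name not in ALL_AGENTS]
    if unknown:
        raise argparse.ArgumentTypeError(f"unknown agent(s): {', '.join(unknown)}")
    return names


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="acceptance_review")
    parser.add_argument("--agents", type=parse_agents, default=list(ALL_AGENTS))
    parser.add_argument("--label", default="demo-w1")
    return parser


args = build_parser().parse_args(["--agents", "secops,product-qa"])
print(args.agents)  # ['secops', 'product-qa']
```

Validating names at parse time means a typo stops the run before any model call is made.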

3.4 Stage 3: Synthesize

def synthesize(results: list[AgentResult], previous: ReviewRun | None) -> ReviewBrief:
    """Extract MPAS-7 scores, compute deltas, generate consolidated brief."""
    # 1. Extract per-role scores from each agent's structured output
    # 2. If previous cycle exists, compute deltas
    # 3. Flag critical/blocking findings
    # 4. Flag pending CEO decisions
    # 5. Generate consolidated markdown brief

Output: A review run directory:

sv0-intelligence/output/acceptance_review/
└── 2026-03-19/
    ├── run.json                       # Run metadata, MPAS-7 scores, deltas
    ├── brief.md                       # Consolidated brief (CEO-ready)
    ├── agent-ciso-executive.md        # Individual agent output
    ├── agent-secops-analyst.md
    ├── agent-product-qa.md
    ├── agent-ux-critic.md
    ├── agent-security-auditor.md
    ├── agent-enterprise-executive.md
    ├── agent-ceo-reviewer.md
    └── snapshot/                      # Symlink to store/snapshots/2026-03-19-demo-w1/

3.5 MPAS-7 Score Tracking

Scores are stored in SQLite (extending the existing signal_store pattern):

CREATE TABLE review_runs (
    id TEXT PRIMARY KEY,
    run_date TEXT NOT NULL,
    snapshot_label TEXT NOT NULL,
    agents_run TEXT NOT NULL,   -- JSON array of agent names
    score_ciso REAL,
    score_secops REAL,
    score_product_qa TEXT,      -- "X partial, Y missing" format
    score_ux TEXT,              -- "grade / N jargon terms" format
    score_auditor INTEGER,      -- critical issue count
    score_enterprise REAL,      -- 1-5 scale
    score_ceo TEXT,             -- "X/Y accepted" format
    brief_path TEXT
);

Delta computation: compare current run against the most recent full run (all 7 agents).
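A minimal sketch of that comparison, using an in-memory SQLite database and only the numeric columns. The table here is a reduced version of the schema above, the helper names are hypothetical, and the sample values are illustrative (only the 1.8 enterprise score and the March 15 date come from this plan). Text-format scores such as score_ux would need their own parsing before a delta makes sense.

```python
import json
import sqlite3

NUMERIC_SCORES = ("score_ciso", "score_secops", "score_auditor", "score_enterprise")
FULL_RUN_SIZE = 7  # a "full run" means all 7 agents ran


def latest_full_run(conn, before_date):
    """Return the most recent run before before_date that covered all agents."""
    rows = conn.execute(
        "SELECT * FROM review_runs WHERE run_date < ? ORDER BY run_date DESC",
        (before_date,),
    )
    for row in rows:
        if len(json.loads(row["agents_run"])) == FULL_RUN_SIZE:
            return row
    return None


def compute_deltas(current, previous):
    """Per-column difference; None when either side has no score."""
    deltas = {}
    for column in NUMERIC_SCORES:
        now = current.get(column)
        then = previous[column] if previous is not None else None
        deltas[column] = None if now is None or then is None else round(now - then, 2)
    return deltas


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["column"] access
conn.execute("""CREATE TABLE review_runs (
    id TEXT PRIMARY KEY, run_date TEXT NOT NULL, agents_run TEXT NOT NULL,
    score_ciso REAL, score_secops REAL, score_auditor INTEGER, score_enterprise REAL)""")
seven = json.dumps([f"agent-{i}" for i in range(7)])  # placeholder agent names
conn.executemany("INSERT INTO review_runs VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("seed", "2026-03-15", seven, None, None, None, 1.8),
    ("targeted", "2026-03-20", json.dumps(["secops", "product-qa"]), None, 3.0, None, None),
])

previous = latest_full_run(conn, before_date="2026-04-02")
deltas = compute_deltas({"score_enterprise": 2.5}, previous)  # 2.5 is made up
print(previous["id"], deltas["score_enterprise"])  # seed 0.7
```

The targeted run on 2026-03-20 is skipped because it covered only two agents, so the delta is taken against the seed run.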

3.6 Triggers

The researcher runs:

  • Before partner demo: Full 7-agent sweep (manual trigger or scheduled)
  • After sprint completion: Full sweep against latest code
  • After targeted changes: Relevant agents only (e.g., fixed remediation → secops + product-qa)
  • On demand: python -m researchers.acceptance_review.main --agents all --label demo-w1

No automatic scheduling initially — triggered manually or via GitHub Actions workflow_dispatch.


Phase 4: Integration and First Run

Depends on: Phases 1-3
Effort: 1-2 sessions

4.1 First Validated Run

Execute the full pipeline against the current platform state:

  1. Run visual capture skill → produce 2026-04-XX-demo-w1 snapshot
  2. Run all 7 agents with visual input
  3. Synthesize MPAS-7 scores
  4. Compare against March 15 baseline (manually entered as the seed run)
  5. Review delta: did visual input change any agent's findings?

Expected outcome: The UX critic, enterprise executive, and CEO reviewer should produce qualitatively different findings when they can see the platform. Product QA may catch rendering issues invisible from code.

4.2 Baseline Calibration

If the first run's scores diverge significantly from March 15 (which had no visual input), investigate:

  • Are the new findings genuine (visual problems the code review missed)?
  • Or are the agents over-reacting to visual noise (screenshot artifacts, rendering differences)?

Calibrate agent prompts if needed. The goal is that visual input adds signal, not noise.

4.3 Documentation

After the first validated run:

  • Update the March review topic index with the hardened process
  • Document the visual capture workflow in sv0-skills/ README
  • Add the acceptance reviewer to sv0-intelligence/ README
  • Update sv0-platform/.claude/agents/ README with visual input instructions

Dependency Chain

Phase 1 (Visual Capture Skill)
    ↓
Phase 2 (Agent Updates) ← can start in parallel with late Phase 1
    ↓
Phase 3 (Acceptance Researcher) ← needs Phase 1 output format finalized
    ↓
Phase 4 (Integration + First Run)

Critical path: Phase 1 is the blocker. Without screenshots, everything else is partial.


What This Does NOT Change

  • Agent definitions stay in sv0-platform (they need codebase access)
  • The consolidated action plan (what to build) is unchanged
  • Existing sprint plan priorities are unchanged
  • This is process infrastructure, not product changes
  • No new dependencies on external services (Playwright is already installed)

Effort Summary

| Phase | Effort | Blocks |
| --- | --- | --- |
| Phase 1: Visual Capture Skill | 2-3 sessions | Everything |
| Phase 2: Agent Updates | 1-2 sessions | Phase 3 |
| Phase 3: Acceptance Researcher | 4-6 sessions | Phase 4 |
| Phase 4: Integration + First Run | 1-2 sessions | |
| Total | 8-13 sessions | |

GitHub Issues

  • sv0-skills #7 — Platform UI Visual Capture skill (Phase 1)
  • sv0-intelligence #5 — Multi-Perspective Platform Acceptance Review (Phases 3-4)
  • sv0-platform #100 — Agent definitions (Phase 2 updates)