Research Brief: Acceptance-Driven Implementation Validation

What This Research Is

A simulation-based validation loop that tests proposed SecurityV0 platform changes before engineering invests in building them. Instead of surveying academic literature, this research:

  1. Takes a concrete proposed change (e.g., "replace generic remediation with named objects")
  2. Generates realistic before/after platform output for that change
  3. Runs the 7 reviewer agents against both versions
  4. Measures the acceptance score delta per role
  5. Identifies which changes deliver the most acceptance improvement per engineering effort

The output is a ranked backlog of validated changes with measured expected impact — not a literature review.


Why This Approach

The previous AutoResearchClaw run (March 18-19) applied its default academic pipeline to a product strategy problem. It produced a literature synthesis of clinical guideline papers and a set of hypothesis-testing frameworks, even though the research brief had explicitly requested "an actionable product roadmap, not a scientific paper."

What went wrong:

  • The research proposed dual-axis scoring, but Sergey had already decided to remove all visible scores
  • The research proposed a two-product split, which contradicts the one-platform, audience-split model
  • Literature clusters (cardiovascular guidelines, deep learning, mental health papers) were not actionable for a security platform
  • The hypothesis-testing framework was built to validate decisions the CEO had already made
  • No concrete feature specs, screen layouts, or data model changes were produced

What should change:

  • Replace literature synthesis with simulated implementation testing
  • Treat Sergey's 28 feedback items and the cross-review corrections as ground truth, not as hypotheses to validate
  • Measure acceptance improvement from specific proposed changes, not from abstract framework comparisons
  • Output a prioritized implementation backlog ranked by measured acceptance delta

Research Objective

Produce a validated change backlog for SecurityV0 where each item has:

  • A concrete before/after specification (what the platform output looks like today vs. after the change)
  • Measured acceptance delta across the 7 reviewer roles
  • Which reviewer roles benefit most
  • Engineering effort estimate (from the sprint plan)
  • An acceptance-per-effort ratio that tells us what to build first

Method: Simulate → Review → Measure → Rank

Step 1: Define the Change Set

Each change is a discrete, implementable platform modification. Source: the consolidated action plan + Sergey's feedback items.

Proposed change set (18 changes across 5 categories):

Category A: Remediation Quality

| ID | Change | Simulates |
|---|---|---|
| A1 | Generic → named object remediation | applies_to: "LLM endpoint" → applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)" |
| A2 | Choke point deduplication | Same remediation in 3 clusters → single entry: "Applies across 3 clusters. One fix reduces 3 exposures." |
| A3 | Business impact caveat on remediation | Add: "Restricting this role may affect [service]. Verify with application owner before applying." |

Category B: CISO Clarity

| ID | Change | Simulates |
|---|---|---|
| B1 | Execution confidence labels (plain English) | "Grade A" → "Execution Confirmed" / "Previously Active" / "Standing Authority Only" |
| B2 | Global top-3 risk ranking on Overview | Add: "Top risks across all clusters: 1) ..., 2) ..., 3) ..." |
| B3 | Business metric stat cards | "Active Autonomous: 5 Identities" → "Sensitive Domains Reached: 6" |
| B4 | OWASP/NIST compliance tags on findings | Add: "OWASP ASI-03 · NIST AC-2 · SOX: Segregation of duties" |

Category C: Analyst Workflow

| ID | Change | Simulates |
|---|---|---|
| C1 | "What changed since yesterday" filter | New section: "3 new findings since your last visit (March 18)" with highlighted items |
| C2 | Full identity role scope in path view | Path row shows "5 roles across 5 paths" instead of just the 1 role used in this path |
| C3 | Named owners in ownership section | "Service principal owner departed" → "Maria Lopez (departed March 1, 2026)" |
| C4 | Human-readable breadcrumbs | /authority-paths/0a3a4bb8... → Overview / Unowned Sensitive Access / svc-foundry-agent701 |

Category D: Data Completeness

| ID | Change | Simulates |
|---|---|---|
| D1 | Populated target_resource in evidence | target_resource: "" → target_resource: "GP_Clinical_Notes (SharePoint)" |
| D2 | Fixed added_roles in evidence packs | "Review 0 role assignments" → "Review 2 role assignments: ap_write, ar_write" |
| D3 | Correct path/finding counts | 32 vs 30 discrepancy resolved. bySeverity shows total-scoped counts. |

Category E: Report & Partner Deliverable

| ID | Change | Simulates |
|---|---|---|
| E1 | Assessment report (6-page executive format) | Full report output: cover, exec summary in business language, findings, remediation roadmap, compliance mapping |
| E2 | Scan digest (1-page summary) | Post-scan summary: top 3 risks, governance checklist, trend |
| E3 | Cluster verdicts extended below cluster level | Finding-level descriptions use the same business narrative pattern as cluster verdicts |
| E4 | Governance checklist with correct labels | Distinct labels: "Scope drift (3 paths)" instead of 3x "Orphaned identities" |

Step 2: Generate Before/After Artifacts

For each change, generate two versions of the relevant platform output:

Before: Use the actual current platform output as baseline. Source from:

  • Live API responses from app.securityv0.com (tenant demo-w1)
  • The data snapshots captured in the March 15 multi-perspective review
  • The specific examples cited in Sergey's feedback

After: Apply the proposed change to the baseline output. The change must be:

  • Realistic (uses data that exists or would exist after a known data fix)
  • Specific (exact text, not placeholders like "[entity name here]")
  • Minimal (only change what the proposal specifies — don't improve other aspects)

Format: Each artifact is a structured document that includes:

  • The relevant screen/section output (e.g., "Authority Path Detail — Remediation Section")
  • Enough surrounding context for a reviewer to evaluate (not just the changed line)
  • Metadata: which change IDs are applied, what data was modified
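
The artifact format above could be captured in a small record type. The following is a minimal sketch only: the class name, field names, and the placeholder heuristic are illustrative assumptions, not part of the brief or of any existing SecurityV0 code.

```python
import re
from dataclasses import dataclass


# Hypothetical structure: names are illustrative, not taken from the platform.
@dataclass(frozen=True)
class Artifact:
    section: str                 # screen/section the output belongs to
    change_ids: tuple            # change IDs applied to the "after" text
    before: str                  # baseline output with surrounding context
    after: str                   # baseline with only the proposed change applied
    modified_fields: tuple = ()  # metadata: which data was modified

    def problems(self) -> list:
        """Return a list of rule violations (empty when the pair looks usable)."""
        found = []
        if not self.change_ids:
            found.append("no change IDs recorded")
        if self.before == self.after:
            found.append("after text is identical to before text")
        # Heuristic for the "Specific" rule: bracketed text such as
        # "[entity name here]" is treated as an unfilled placeholder.
        if re.search(r"\[[^\]\n]+\]", self.after):
            found.append("after text still contains a bracketed placeholder")
        return found


a1 = Artifact(
    section="Authority Path Detail: Remediation Section",
    change_ids=("A1",),
    before='applies_to: "LLM endpoint"',
    after='applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)"',
    modified_fields=("applies_to",),
)
print(a1.problems())
```

A check like this would also flag the "[service]" token in the A3 caveat text, which is consistent with the rule that the after artifact must name a real service rather than a placeholder.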

Step 3: Run 7-Persona Reviews

Submit each before/after pair to the 7 reviewer agents. Each agent uses its existing definition from sv0-platform/.claude/agents/:

| Agent | Definition | Scoring Rubric |
|---|---|---|
| CISO Executive | ciso-reviewer.md | Can I get the "so what?" in 15 seconds? (1-10) |
| SecOps Analyst | secops-analyst.md | Can I act on this finding without asking someone? (1-10) |
| Product QA | product-qa.md | Does this match the spec? (pass/partial/fail per item) |
| UX Critic | ux-critic.md | Is this clear to a first-time user? Jargon count. IA grade. (A-F) |
| Security Auditor | security-auditor.md | Is the data internally consistent? (issue count) |
| Enterprise Executive | enterprise-executive.md | Can a partner present this without rewriting? (1-5) |
| CEO (Sergey) | ceo-reviewer.md | Is this sellable, honest, and aligned with product vision? (accept/reject per item) |

Scoring protocol:

  • Each reviewer scores both the before and after version independently
  • Scores use the same rubric as the March 15 review for comparability
  • Each reviewer must provide specific textual feedback explaining score changes
  • Aggregate into the MPAS-7 format (see consolidated action plan)
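
The seven rubrics use different scales (1-10, 1-5, letter grades, issue counts, per-item verdicts), so they need a common scale before deltas can be compared across roles. The MPAS-7 format in the consolidated action plan presumably defines that aggregation and is not reproduced here. The sketch below is a purely hypothetical stand-in: the role keys, the 0-to-1 target scale, the letter-grade points, and the issue-count cap are all assumptions chosen for illustration.

```python
# Hypothetical mappings onto a 0..1 scale where higher is better.
GRADE_POINTS = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}
VERDICT_POINTS = {"pass": 1.0, "partial": 0.5, "fail": 0.0,
                  "accept": 1.0, "reject": 0.0}


def from_range(score, low, high):
    """Linearly rescale a bounded numeric score to 0..1."""
    return (score - low) / (high - low)


def from_verdicts(verdicts):
    """Average per-item verdicts (pass/partial/fail or accept/reject)."""
    return sum(VERDICT_POINTS[v] for v in verdicts) / len(verdicts)


def from_issue_count(count, cap=10):
    """Fewer issues is better; `cap` (assumed) is the count that scores 0."""
    return max(0.0, 1.0 - count / cap)


# Role keys are illustrative labels for the seven reviewer agents.
NORMALIZERS = {
    "ciso": lambda raw: from_range(raw, 1, 10),
    "secops": lambda raw: from_range(raw, 1, 10),
    "product_qa": from_verdicts,
    "ux": lambda raw: GRADE_POINTS[raw],
    "auditor": from_issue_count,
    "enterprise_exec": lambda raw: from_range(raw, 1, 5),
    "ceo": from_verdicts,
}


def normalize(role, raw):
    """Convert one reviewer's raw rubric output to the common scale."""
    return NORMALIZERS[role](raw)
```

If MPAS-7 already prescribes a mapping, that mapping should replace the one sketched here.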

Step 4: Measure Deltas

For each change, compute:

| Metric | Definition |
|---|---|
| Per-role delta | After score minus Before score for each of the 7 roles |
| Aggregate acceptance lift | Mean delta across all 7 roles, weighted equally |
| Role coverage | How many of the 7 roles show improvement (target: change helps ≥3 roles) |
| Regression check | Any role where the After score is LOWER than Before (flag as risk) |
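
The four metrics can be computed in a few lines once every role's score is numeric and higher means better. This is a minimal sketch under that assumption; the function name, role keys, and the example scores are illustrative and are not measured results.

```python
from statistics import mean


def measure(before, after):
    """Compute the Step 4 metrics from two {role: score} mappings."""
    deltas = {role: after[role] - before[role] for role in before}
    return {
        "per_role_delta": deltas,
        # Unweighted mean across roles, as the definition specifies.
        "aggregate_lift": mean(deltas.values()),
        # Number of roles whose score improved.
        "role_coverage": sum(d > 0 for d in deltas.values()),
        # Roles whose After score is lower than Before.
        "regressions": sorted(r for r, d in deltas.items() if d < 0),
    }


# Illustrative numbers only.
before = {"ciso": 5, "secops": 4, "product_qa": 6, "ux": 5,
          "auditor": 6, "enterprise_exec": 5, "ceo": 6}
after = {"ciso": 7, "secops": 6, "product_qa": 6, "ux": 4,
         "auditor": 6, "enterprise_exec": 6, "ceo": 7}

result = measure(before, after)
print(result["role_coverage"], result["regressions"])
```

In this made-up example four roles improve, which meets the ≥3 coverage target, while the UX score drops and would be flagged as a risk.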

Step 5: Rank and Bundle

Combine the measured deltas with engineering effort estimates from the sprint plan:

Acceptance-per-effort = Aggregate acceptance lift / Effort (sessions)

Group changes into implementable bundles:

  • Bundle 1 (Demo Blockers): Changes that must ship together for the demo to work
  • Bundle 2 (Highest ROI): Top 5 changes by acceptance-per-effort ratio
  • Bundle 3 (Report MVP): Minimum set for a partner-presentable report
  • Bundle 4 (Remaining): Everything else, ordered by ratio
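
The ratio and the "top 5 by ratio" selection for Bundle 2 can be sketched as follows. The function name and the lift and effort values are illustrative assumptions; real inputs would come from the measured deltas and the sprint plan.

```python
def rank_by_ratio(changes):
    """Sort (change_id, aggregate_lift, effort_sessions) by lift per session."""
    ranked = []
    for change_id, lift, effort in changes:
        if effort <= 0:
            raise ValueError(f"{change_id}: effort must be positive")
        ranked.append((change_id, lift / effort))
    # Highest acceptance-per-effort first.
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Illustrative inputs only.
ranked = rank_by_ratio([
    ("A1", 1.5, 1),
    ("B2", 2.0, 4),
    ("D1", 1.0, 0.5),
])
highest_roi = ranked[:5]  # candidate members of Bundle 2
print([change_id for change_id, _ in highest_roi])
```

Bundles 1 and 3 are defined by what must ship together rather than by ratio, so they would be assigned by hand before this ranking is applied to the remainder.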

What to Avoid

Explicit anti-patterns from the previous research run:

| Anti-Pattern | Why It Failed | Rule |
|---|---|---|
| Literature synthesis from unrelated domains | Cardiovascular guidelines and deep learning papers are not actionable for a security platform UX problem | Do NOT cite academic papers unless they directly describe a pattern already used in security tooling (e.g., Wiz, CrowdStrike, Datadog) |
| Hypothesis testing for decided questions | Sergey already decided: no scores, no effort estimates, plain English, cut > add. Testing these as hypotheses wastes cycles. | Treat Sergey's 28 feedback items as requirements, not hypotheses |
| Proposing features that contradict CEO decisions | Dual-axis scoring, two-product split, and visible evidence grades all contradicted explicit decisions | Before proposing ANY feature, check against the 8 guiding principles in the consolidated action plan |
| Abstract framework recommendations | "Implement an Essential Actions view" without specifying what it looks like, what data it uses, or where it appears | Every recommendation must include: exact text/layout, data source, file location |
| Scoring with synthetic personas | LLM-simulated "CISO persona" may not match real CISO behavior | Use the actual reviewer agent definitions from sv0-platform/.claude/agents/. These have been calibrated against Sergey's feedback. Validate rubric against the 28 feedback items as ground truth. |
| Over-engineering experiment design | 50 synthetic scenarios × 5 seeds × 4 regime cells × Wilcoxon tests is overkill for a product question | Use the actual platform output, not synthetic scenarios. Run each reviewer once per before/after pair. Variance comes from the 7 different perspectives, not from repeated sampling. |

What to Try

Compound Change Testing

Individual changes may have interaction effects. Test key bundles as compounds:

| Compound | Changes | Hypothesis |
|---|---|---|
| "Fix the engine" | D1 + D2 + D3 | Data completeness alone lifts SecOps and Auditor scores |
| "CISO in 15 seconds" | B1 + B2 + B3 + B4 | Combination of confidence labels + global ranking + business metrics + compliance tags lifts CISO score past 85% |
| "Analyst day-1" | A1 + C1 + C2 + C3 + C4 | Named remediation + what-changed + role scope + real names + breadcrumbs makes SecOps score ≥80% |
| "Partner-ready report" | E1 + E3 + B4 | Assessment report + extended verdicts + compliance tags lifts Enterprise Executive past 3.5/5 |
| "Full package" | All 18 changes | Upper bound: what's the maximum acceptance we can achieve? |

Run individual changes first, then compounds, to isolate which changes drive the most improvement.
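
One simple way to quantify an interaction effect is to compare a compound's measured lift with the sum of its members' individual lifts. The brief does not define the term, so the additive baseline, the tolerance, and the labels below are assumptions; the numbers are illustrative.

```python
def interaction_effect(compound_lift, individual_lifts):
    """Compound lift minus the sum of the members' individual lifts."""
    return compound_lift - sum(individual_lifts)


def classify(effect, tolerance=0.1):
    """Label the effect; `tolerance` is an assumed noise threshold."""
    if effect > tolerance:
        return "synergy"   # members reinforce each other
    if effect < -tolerance:
        return "overlap"   # members partly fix the same problem
    return "additive"


# Illustrative values for "Fix the engine" (D1 + D2 + D3).
effect = interaction_effect(2.5, [1.0, 0.5, 0.5])
print(effect, classify(effect))
```

An "overlap" result would suggest that some members of the bundle are redundant for acceptance purposes, which matters when ranking by effort.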

Cross-Role Regression Detection

Some changes that help one role may hurt another. Examples to watch:

  • Adding compliance tags (B4) helps CISO/Executive but might add visual noise for SecOps → check UX score
  • Business metric stat cards (B3) help CISO but remove analyst-relevant data → check SecOps score
  • Extended verdicts below cluster level (E3) help Executive but might feel dumbed-down to SecOps → check both

Flag any change where a role's score drops by ≥1 point. These need design mitigation (e.g., progressive disclosure, role-specific views).

Sergey Feedback Ground Truth Validation

The 28 feedback items have known accept/reject status. Use them as validation:

  1. For each accepted item, the corresponding change should improve ≥1 role score
  2. For each "open question" item, the research should produce data that helps resolve the question
  3. For each deferred item, the research should NOT propose to address it (scope control)

If the simulated reviews disagree with Sergey's actual decisions, investigate why — the simulation may have a calibration problem.
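
Rules 1 and 3 above lend themselves to an automated check. The sketch below assumes each feedback item carries a status string and, optionally, a mapped change ID; the item IDs ("F01" and so on), status labels, and data are hypothetical.

```python
def calibration_gaps(status_by_item, change_by_item, deltas_by_change):
    """Find feedback items where the simulation disagrees with known decisions."""
    accepted_without_lift = []
    deferred_but_addressed = []
    for item, status in status_by_item.items():
        change = change_by_item.get(item)
        if status == "accepted":
            # Rule 1: the mapped change should improve at least one role.
            deltas = deltas_by_change.get(change, {})
            if not any(d > 0 for d in deltas.values()):
                accepted_without_lift.append(item)
        elif status == "deferred" and change is not None:
            # Rule 3: deferred items should not be in the change set.
            deferred_but_addressed.append(item)
    return {
        "accepted_without_lift": accepted_without_lift,
        "deferred_but_addressed": deferred_but_addressed,
    }


# Hypothetical data for illustration.
status = {"F01": "accepted", "F02": "accepted", "F03": "deferred", "F04": "open"}
mapping = {"F01": "A1", "F02": "B3", "F03": "E2"}
deltas = {
    "A1": {"ciso": 1, "secops": 2},
    "B3": {"ciso": 0, "secops": -1},
    "E2": {"ceo": 1},
}
gaps = calibration_gaps(status, mapping, deltas)
print(gaps)
```

Items in either list are the ones to investigate first when looking for a calibration problem.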


Validation: How to Know the Research Worked

Success Criteria

| Criterion | Threshold |
|---|---|
| Coverage | All 18 proposed changes have measured before/after deltas |
| Discrimination | Changes show measurable variance (not all +1 or all +0). At least 3 changes show ≥2 point lift for their target role. |
| Alignment | The ranking by acceptance-per-effort does NOT contradict Sergey's explicit priorities. If it does, explain why. |
| Regression detection | Any cross-role regressions are identified and have a proposed mitigation |
| Actionability | The final ranked backlog can be handed to engineering without further interpretation |
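
The Coverage and Discrimination thresholds are mechanical enough to check in code. This is a sketch under assumptions: "measurable variance" is read as "more than one distinct delta value", each change has one designated target role, and all names and numbers are illustrative.

```python
def check_success(deltas_by_change, target_role_by_change, expected_ids):
    """Evaluate the Coverage and Discrimination criteria."""
    missing = sorted(set(expected_ids) - set(deltas_by_change))
    # All delta values observed across every change and role.
    distinct = {d for deltas in deltas_by_change.values() for d in deltas.values()}
    # Changes with a lift of at least 2 points for their target role.
    strong = sorted(
        change_id
        for change_id, deltas in deltas_by_change.items()
        if deltas.get(target_role_by_change.get(change_id), 0) >= 2
    )
    return {
        "coverage_ok": not missing,
        "missing": missing,
        "discrimination_ok": len(distinct) > 1 and len(strong) >= 3,
        "strong_changes": strong,
    }


# Illustrative data: E1 has not been measured yet.
report = check_success(
    deltas_by_change={
        "A1": {"secops": 2, "ciso": 1},
        "B1": {"ciso": 3},
        "D1": {"auditor": 2, "secops": 1},
    },
    target_role_by_change={"A1": "secops", "B1": "ciso",
                           "D1": "auditor", "E1": "enterprise_exec"},
    expected_ids=["A1", "B1", "D1", "E1"],
)
print(report)
```

The Alignment, Regression detection, and Actionability criteria need human judgment and are not covered by this check.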

Failure Modes to Watch

| Failure Mode | Detection | Response |
|---|---|---|
| All scores identical before/after | Reviewers are not sensitive to changes | Make before/after differences more concrete. Check that the "after" artifact actually reflects the change. |
| Reviewers contradict each other chaotically | No coherent signal across roles | Check reviewer agent calibration against the March 15 baseline. Re-run with more surrounding context. |
| Rankings contradict Sergey's priorities | Simulation doesn't match real-world priorities | Use the 28 feedback items as calibration anchors. Weight CEO role higher in the ranking formula. |
| Research drifts into literature | AutoResearchClaw defaults to academic mode | The topic override (below) explicitly forbids literature synthesis. If it still happens, skip to experiment stages. |

AutoResearchClaw Configuration

Topic Override

Validate 18 concrete SecurityV0 platform changes by generating before/after
platform output and measuring acceptance score deltas across 7 reviewer roles.

DO NOT synthesize academic literature. DO NOT generate hypotheses about
already-decided questions. DO NOT propose features that contradict the
guiding principles listed in the research brief.

Input: 18 proposed changes with before/after specifications.
Output: Ranked backlog with measured per-role acceptance deltas.

Experiment Mode

simulated — no code execution. Generate before/after text artifacts and score them.

Pipeline Configuration

  • Skip stages: Literature search (stages 2-3), gap analysis (stage 4), literature synthesis (stage 7). These produced the academic drift in the previous run.
  • Start from: Problem decomposition (stage 5) — use the 18 changes as the decomposed problem set
  • Emphasis stages: Experiment execution (stages 10-16) — this is where the before/after testing happens
  • Gate stages:
    • Stage 5: Human approval of the change set (verify all 18 changes are correctly specified)
    • Stage 9: Human approval of experiment design (verify before/after artifacts are realistic)
    • Stage 17: Human approval of results (verify rankings make sense before final synthesis)

Input Files

| File | Purpose |
|---|---|
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-21-consolidated-action-plan.md | Guiding principles, MPAS-7 targets, what was already decided |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-15-multi-perspective-platform-review.md | Baseline review scores, current platform state |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sergey-feedback-tracker.md | 28 feedback items as ground truth |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sprint-cross-review-report.md | Corrections to apply (contradictions, gaps) |
| sv0-platform/.claude/agents/*.md | 7 reviewer agent definitions |
| Live API responses from app.securityv0.com (tenant demo-w1) | Current platform output for "before" artifacts |

Expected Deliverables

  1. Change Scorecard (1 page per change): before/after text, per-role delta, aggregate lift, regression flags
  2. Compound Bundle Scorecards (5 bundles): interaction effects, total lift, regression flags
  3. Ranked Backlog (1 page): all 18 changes sorted by acceptance-per-effort, grouped into implementation bundles
  4. Calibration Report (1 page): how well the simulated reviews align with Sergey's 28 feedback decisions
  5. Decision Support for open questions: data to help resolve the 5 items Sergey marked as "OPEN QUESTION"

Next Action

Status: idea

Decision needed from: Ivan (CTO)

Options:

  1. Run as specified — configure AutoResearchClaw with the topic override and pipeline config above, using the 18 changes as input
  2. Reduce scope — test only the Phase 0 + Phase 1 changes (10 items) as a faster validation cycle
  3. Manual validation instead — skip AutoResearchClaw, implement Phase 0, then re-run the 7 agents manually against the live platform
  4. Hybrid — implement Phase 0 first (known blockers, no research needed), then run this research for Phase 1+ changes where the priority is less obvious

Recommended: Option 4 (Hybrid). Phase 0 items are clearly demo-blocking — build them, don't simulate them. Use this research to validate Phase 1+ priorities and resolve the 5 open questions with data instead of guessing.

GitHub Issue: not yet created