Research Brief: Acceptance-Driven Implementation Validation

What This Research Is

A simulation-based validation loop that tests proposed SecurityV0 platform changes before engineering invests in building them. Instead of surveying academic literature, this research:

Takes a concrete proposed change (e.g., "replace generic remediation with named objects")
Generates realistic before/after platform output for that change
Runs the 7 reviewer agents against both versions
Measures the acceptance score delta per role
Identifies which changes deliver the most acceptance improvement per engineering effort

The output is a ranked backlog of validated changes with measured expected impact — not a literature review.

Why This Approach

The previous AutoResearchClaw run (March 18-19) applied its default academic pipeline to a product strategy problem. It produced a literature synthesis built on clinical guideline papers, along with hypothesis testing frameworks, even though the research brief had explicitly requested "an actionable product roadmap, not a scientific paper."

What went wrong:

The research proposed dual-axis scoring, but Sergey had already decided to remove all visible scores
The research proposed a two-product split, which contradicts the one-platform, audience-split model
Literature clusters (cardiovascular guidelines, deep learning, mental health papers) were not actionable for a security platform
A hypothesis testing framework was built to validate decisions the CEO had already made
No concrete feature specs, screen layouts, or data model changes were produced

What should change:

Replace literature synthesis with simulated implementation testing
Treat Sergey's 28 feedback items and the cross-review corrections as ground truth, not as hypotheses to validate
Measure acceptance improvement from specific proposed changes, not from abstract framework comparisons
Output a prioritized implementation backlog ranked by measured acceptance delta

Research Objective

Produce a validated change backlog for SecurityV0 where each item has:

A concrete before/after specification (what the platform output looks like today vs. after the change)
Measured acceptance delta across the 7 reviewer roles
Which reviewer roles benefit most
Engineering effort estimate (from the sprint plan)
An acceptance-per-effort ratio that tells us what to build first

Method: Simulate → Review → Measure → Rank

Step 1: Define the Change Set

Each change is a discrete, implementable platform modification. Source: the consolidated action plan + Sergey's feedback items.

Proposed change set (18 changes across 5 categories):

Category A: Remediation Quality

ID	Change	Simulates
A1	Generic → named object remediation	`applies_to: "LLM endpoint"` → `applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)"`
A2	Choke point deduplication	Same remediation in 3 clusters → single entry: "Applies across 3 clusters. One fix reduces 3 exposures."
A3	Business impact caveat on remediation	Add: "Restricting this role may affect [service]. Verify with application owner before applying."

Category B: CISO Clarity

ID	Change	Simulates
B1	Execution confidence labels (plain English)	"Grade A" → "Execution Confirmed" / "Previously Active" / "Standing Authority Only"
B2	Global top-3 risk ranking on Overview	Add: "Top risks across all clusters: 1) ..., 2) ..., 3) ..."
B3	Business metric stat cards	"Active Autonomous: 5 Identities" → "Sensitive Domains Reached: 6"
B4	OWASP/NIST compliance tags on findings	Add: "OWASP ASI-03 · NIST AC-2 · SOX: Segregation of duties"

Category C: Analyst Workflow

ID	Change	Simulates
C1	"What changed since yesterday" filter	New section: "3 new findings since your last visit (March 18)" with highlighted items
C2	Full identity role scope in path view	Path row shows "5 roles across 5 paths" instead of just the 1 role used in this path
C3	Named owners in ownership section	"Service principal owner departed" → "Maria Lopez (departed March 1, 2026)"
C4	Human-readable breadcrumbs	`/authority-paths/0a3a4bb8...` → `Overview / Unowned Sensitive Access / svc-foundry-agent701`

Category D: Data Completeness

ID	Change	Simulates
D1	Populated `target_resource` in evidence	`target_resource: ""` → `target_resource: "GP_Clinical_Notes (SharePoint)"`
D2	Fixed `added_roles` in evidence packs	"Review 0 role assignments" → "Review 2 role assignments: ap_write, ar_write"
D3	Correct path/finding counts	32 vs 30 discrepancy resolved. `bySeverity` shows total-scoped counts.

Category E: Report & Partner Deliverable

ID	Change	Simulates
E1	Assessment report (6-page executive format)	Full report output: cover, exec summary in business language, findings, remediation roadmap, compliance mapping
E2	Scan digest (1-page summary)	Post-scan summary: top 3 risks, governance checklist, trend
E3	Cluster verdicts extended below cluster level	Finding-level descriptions use the same business narrative pattern as cluster verdicts
E4	Governance checklist with correct labels	Distinct labels: "Scope drift (3 paths)" instead of 3x "Orphaned identities"

Step 2: Generate Before/After Artifacts

For each change, generate two versions of the relevant platform output:

Before: Use the actual current platform output as baseline. Source from:

Live API responses from `app.securityv0.com` (tenant `demo-w1`)
The data snapshots captured in the March 15 multi-perspective review
The specific examples cited in Sergey's feedback

After: Apply the proposed change to the baseline output. The change must be:

Realistic (uses data that exists or would exist after a known data fix)
Specific (exact text, not placeholders like "[entity name here]")
Minimal (only change what the proposal specifies — don't improve other aspects)

Format: Each artifact is a structured document that includes:

The relevant screen/section output (e.g., "Authority Path Detail — Remediation Section")
Enough surrounding context for a reviewer to evaluate (not just the changed line)
Metadata: which change IDs are applied, what data was modified
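
The artifact format above can be sketched as a small record plus a sanity check. This is a minimal illustration, not an existing SecurityV0 schema: the `Artifact` class, its field names, and the `validate` helper are all hypothetical. The check only covers what can be tested mechanically (change IDs present, the after version actually differs, no bracketed placeholders); whether a change is realistic or minimal still needs a human look at the gate stage.

```python
import re
from dataclasses import dataclass

# Bracketed stand-ins such as "[entity name here]" violate the "Specific" rule.
PLACEHOLDER = re.compile(r"\[[^\]\n]+\]")


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record for one before/after pair (not a real schema)."""
    section: str               # screen/section the output belongs to
    before: str                # baseline platform output, with surrounding context
    after: str                 # same output with only the proposed change applied
    change_ids: tuple          # e.g. ("A1",)
    modified_data: tuple = ()  # which fields were modified


def validate(artifact: Artifact) -> list:
    """Return a list of problems; an empty list means the artifact looks usable."""
    problems = []
    if not artifact.change_ids:
        problems.append("no change IDs recorded")
    if artifact.before == artifact.after:
        problems.append("after is identical to before")
    if PLACEHOLDER.search(artifact.after):
        problems.append("after contains a bracketed placeholder")
    return problems


# Example built from change A1 in the change set.
a1 = Artifact(
    section="Authority Path Detail, Remediation Section",
    before='applies_to: "LLM endpoint"',
    after='applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)"',
    change_ids=("A1",),
    modified_data=("applies_to",),
)
# A draft that still carries a placeholder and should be rejected.
draft = Artifact("Ownership", "owner: unknown", "owner: [entity name here]", ("C3",))

print(validate(a1))
print(validate(draft))
```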

Step 3: Run 7-Persona Reviews

Submit each before/after pair to the 7 reviewer agents. Each agent uses its existing definition from `sv0-platform/.claude/agents/`:

Agent	Definition	Scoring Rubric
CISO Executive	`ciso-reviewer.md`	Can I get the "so what?" in 15 seconds? (1-10)
SecOps Analyst	`secops-analyst.md`	Can I act on this finding without asking someone? (1-10)
Product QA	`product-qa.md`	Does this match the spec? (pass/partial/fail per item)
UX Critic	`ux-critic.md`	Is this clear to a first-time user? Jargon count. IA grade. (A-F)
Security Auditor	`security-auditor.md`	Is the data internally consistent? (issue count)
Enterprise Executive	`enterprise-executive.md`	Can a partner present this without rewriting? (1-5)
CEO (Sergey)	`ceo-reviewer.md`	Is this sellable, honest, and aligned with product vision? (accept/reject per item)

Scoring protocol:

Each reviewer scores both the before and after version independently
Scores use the same rubric as the March 15 review for comparability
Each reviewer must provide specific textual feedback explaining score changes
Aggregate into the MPAS-7 format (see consolidated action plan)
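
One way to enforce the feedback rule in the scoring protocol is to reject any review that changes a score without explaining why. The sketch below assumes scores have already been recorded as numbers; the `Review` class and its fields are hypothetical, and the categorical rubrics (pass/partial/fail, A-F, accept/reject) would need a numeric mapping that this brief does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical record of one reviewer's scores for one change."""
    role: str            # agent name, e.g. "secops-analyst"
    change_id: str       # e.g. "A1"
    before_score: float  # score for the before artifact
    after_score: float   # score for the after artifact, judged independently
    feedback: str = ""   # textual explanation of any score change

    def __post_init__(self):
        # A changed score with no explanation breaks the scoring protocol.
        changed = self.before_score != self.after_score
        if changed and not self.feedback.strip():
            raise ValueError(
                f"{self.role} changed its score for {self.change_id} without feedback"
            )


# Illustrative values only; these are not measured scores.
ok = Review("secops-analyst", "A1", 4, 6, "Named endpoint tells me what to fix.")

try:
    Review("ciso-reviewer", "A1", 5, 7)  # score changed, feedback missing
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```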

Step 4: Measure Deltas

For each change, compute:

Metric	Definition
Per-role delta	After score minus Before score for each of the 7 roles, signed so that a positive delta is always an improvement (the Security Auditor's issue count improves when it goes down)
Aggregate acceptance lift	Mean delta across all 7 roles, weighted equally
Role coverage	How many of the 7 roles show improvement (target: change helps ≥3 roles)
Regression check	Any role where the After score is worse than Before (flag as risk)
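
The four metrics in the table can be computed in a few lines. The sketch below rests on two assumptions that the brief does not spell out: scores arrive as numbers on a comparable scale (for example after MPAS-7 aggregation), and the Security Auditor's issue count is the only rubric where lower is better. Role keys and all score values are hypothetical.

```python
from statistics import mean

# Assumption: only the auditor's issue count improves when it goes down.
LOWER_IS_BETTER = {"auditor"}


def per_role_delta(before, after):
    """After minus Before per role, signed so that positive means improvement."""
    deltas = {}
    for role in before:
        raw = after[role] - before[role]
        deltas[role] = -raw if role in LOWER_IS_BETTER else raw
    return deltas


def summarize(before, after):
    """Compute the four Step 4 metrics for one change."""
    deltas = per_role_delta(before, after)
    return {
        "per_role_delta": deltas,
        "aggregate_lift": mean(list(deltas.values())),  # equal weights
        "role_coverage": sum(1 for d in deltas.values() if d > 0),
        "regressions": sorted(r for r, d in deltas.items() if d < 0),
    }


# Illustrative scores only; these are not measured results.
before = {"ciso": 5, "secops": 4, "product_qa": 6, "ux": 5,
          "auditor": 3, "enterprise_exec": 5, "ceo": 5}
after = {"ciso": 7, "secops": 4, "product_qa": 6, "ux": 4,
         "auditor": 1, "enterprise_exec": 6, "ceo": 5}

summary = summarize(before, after)
print(summary["role_coverage"], summary["regressions"])
```

In this made-up example three roles improve (meeting the ≥3 target), the aggregate lift is 4/7, and the UX drop is surfaced as a regression rather than being averaged away.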

Step 5: Rank and Bundle

Combine the measured deltas with engineering effort estimates from the sprint plan:

Acceptance-per-effort = Aggregate acceptance lift / Effort (sessions)
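
As a sketch of how the ratio turns into a ranking: the function below divides each change's aggregate lift by its effort in sessions and sorts descending. The dictionary keys are hypothetical, and the lift and effort numbers are placeholders for illustration, not values from the sprint plan.

```python
def rank_by_acceptance_per_effort(changes):
    """Sort changes by aggregate lift divided by effort (sessions), best first."""
    ranked = []
    for change in changes:
        if change["effort_sessions"] <= 0:
            raise ValueError(f"{change['id']}: effort must be positive")
        ratio = change["lift"] / change["effort_sessions"]
        ranked.append({**change, "ratio": ratio})
    return sorted(ranked, key=lambda c: c["ratio"], reverse=True)


# Placeholder numbers, chosen only to show the ordering logic.
changes = [
    {"id": "A1", "lift": 1.2, "effort_sessions": 2},  # ratio 0.6
    {"id": "D1", "lift": 0.9, "effort_sessions": 1},  # ratio 0.9
    {"id": "E1", "lift": 2.0, "effort_sessions": 5},  # ratio 0.4
]

ranked = rank_by_acceptance_per_effort(changes)
print([c["id"] for c in ranked])
```

Note that E1 has the largest raw lift in this example but ranks last once effort is taken into account, which is the behavior the ratio is meant to produce.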

Group changes into implementable bundles:

Bundle 1 (Demo Blockers): Changes that must ship together for the demo to work
Bundle 2 (Highest ROI): Top 5 changes by acceptance-per-effort ratio
Bundle 3 (Report MVP): Minimum set for a partner-presentable report
Bundle 4 (Remaining): Everything else, ordered by ratio

What to Avoid

Explicit anti-patterns from the previous research run:

Anti-Pattern	Why It Failed	Rule
Literature synthesis from unrelated domains	Cardiovascular guidelines and deep learning papers are not actionable for a security platform UX problem	Do NOT cite academic papers unless they directly describe a pattern already used in security tooling (e.g., Wiz, CrowdStrike, Datadog)
Hypothesis testing for decided questions	Sergey already decided: no scores, no effort estimates, plain English, cut > add. Testing these as hypotheses wastes cycles.	Treat Sergey's 28 feedback items as requirements, not hypotheses
Proposing features that contradict CEO decisions	Dual-axis scoring, two-product split, visible evidence grades — all contradicted explicit decisions	Before proposing ANY feature, check against the 8 guiding principles in the consolidated action plan
Abstract framework recommendations	"Implement an Essential Actions view" without specifying what it looks like, what data it uses, or where it appears	Every recommendation must include: exact text/layout, data source, file location
Scoring with synthetic personas	LLM-simulated "CISO persona" may not match real CISO behavior	Use the actual reviewer agent definitions from `sv0-platform/.claude/agents/`. These have been calibrated against Sergey's feedback. Validate rubric against the 28 feedback items as ground truth.
Over-engineering experiment design	50 synthetic scenarios × 5 seeds × 4 regime cells × Wilcoxon tests is overkill for a product question	Use the actual platform output, not synthetic scenarios. Run each reviewer once per before/after pair. Variance comes from the 7 different perspectives, not from repeated sampling.

What to Try

Compound Change Testing

Individual changes may have interaction effects. Test key bundles as compounds:

Compound	Changes	Hypothesis
"Fix the engine"	D1 + D2 + D3	Data completeness alone lifts SecOps and Auditor scores
"CISO in 15 seconds"	B1 + B2 + B3 + B4	Combination of confidence labels + global ranking + business metrics + compliance tags lifts CISO score past 85%
"Analyst day-1"	A1 + C1 + C2 + C3 + C4	Named remediation + what-changed + role scope + real names + breadcrumbs makes SecOps score ≥80%
"Partner-ready report"	E1 + E3 + B4	Assessment report + extended verdicts + compliance tags lifts Enterprise Executive past 3.5/5
"Full package"	All 18 changes	Upper bound — what's the maximum acceptance we can achieve?

Run individual changes first, then compounds, to isolate which changes drive the most improvement.
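
One simple way to read interaction effects, assuming an additive baseline, is to compare a compound's measured lift against the sum of its members' individual lifts. The additive baseline, the tolerance, and the labels below are assumptions for illustration; the brief does not prescribe a specific interaction measure, and the numbers are made up.

```python
def interaction_effect(compound_lift, individual_lifts, tolerance=0.1):
    """Compare a compound's lift with the sum of its parts.

    Returns (gap, label). A positive gap means the bundle lifts acceptance
    more than its members do separately; a negative gap means they overlap.
    """
    expected = sum(individual_lifts)
    gap = compound_lift - expected
    if gap > tolerance:
        label = "synergy"
    elif gap < -tolerance:
        label = "overlap"
    else:
        label = "additive"
    return gap, label


# "Fix the engine" (D1 + D2 + D3) with illustrative lifts.
gap, label = interaction_effect(1.5, [0.5, 0.4, 0.3])
print(round(gap, 2), label)
```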

Cross-Role Regression Detection

Some changes that help one role may hurt another. Examples to watch:

Adding compliance tags (B4) helps CISO/Executive but might add visual noise for SecOps → check UX score
Business metric stat cards (B3) help CISO but remove analyst-relevant data → check SecOps score
Extended verdicts below cluster level (E3) help Executive but might feel dumbed-down to SecOps → check both

Flag any change where a role's score drops by ≥1 point. These need design mitigation (e.g., progressive disclosure, role-specific views).

Sergey Feedback Ground Truth Validation

The 28 feedback items have known accept/reject status. Use them as validation:

For each accepted item, the corresponding change should improve ≥1 role score
For each "open question" item, the research should produce data that helps resolve the question
For each deferred item, the research should NOT propose to address it (scope control)

If the simulated reviews disagree with Sergey's actual decisions, investigate why — the simulation may have a calibration problem.
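
The first and third checks above can be automated once each feedback item is mapped to a change ID. The sketch below is hypothetical throughout: the item numbers, status strings, and field names are stand-ins for whatever the feedback tracker actually uses, and deltas are assumed to be signed so that positive means improvement. Open-question items are left out because they call for judgment rather than a pass/fail test.

```python
def calibration_report(items, deltas_by_change):
    """Compare simulated deltas with the known status of feedback items."""
    unsupported = []       # accepted items whose change improved no role
    scope_violations = []  # deferred items the research addressed anyway
    accepted_total = 0
    for item in items:
        change_id = item.get("change_id")
        if item["status"] == "accepted":
            accepted_total += 1
            deltas = deltas_by_change.get(change_id, {})
            if not any(d > 0 for d in deltas.values()):
                unsupported.append(item["item"])
        elif item["status"] == "deferred" and change_id is not None:
            scope_violations.append(item["item"])
    agreed = accepted_total - len(unsupported)
    return {
        "accepted_agreement": agreed / accepted_total if accepted_total else None,
        "unsupported_accepted_items": unsupported,
        "deferred_items_addressed": scope_violations,
    }


# Illustrative data only.
items = [
    {"item": 1, "status": "accepted", "change_id": "A1"},
    {"item": 2, "status": "accepted", "change_id": "B3"},
    {"item": 3, "status": "deferred", "change_id": None},
    {"item": 4, "status": "deferred", "change_id": "C1"},
]
deltas_by_change = {
    "A1": {"secops": 2, "ciso": 0},
    "B3": {"ciso": 0, "secops": -1},
}

report = calibration_report(items, deltas_by_change)
print(report)
```

A low agreement rate here points at the reviewers' calibration, not at Sergey's decisions, which matches the ground-truth stance taken elsewhere in this brief.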

Validation: How to Know the Research Worked

Success Criteria

Criterion	Threshold
Coverage	All 18 proposed changes have measured before/after deltas
Discrimination	Changes show measurable variance (not all +1 or all +0). At least 3 changes show ≥2 point lift for their target role.
Alignment	The ranking by acceptance-per-effort does NOT contradict Sergey's explicit priorities. If it does, explain why.
Regression detection	Any cross-role regressions are identified and have a proposed mitigation
Actionability	The final ranked backlog can be handed to engineering without further interpretation
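
The Coverage and Discrimination rows are mechanical enough to check in code; the other criteria need human review. A minimal sketch, assuming each change has a single target-role lift recorded under its ID (the function name, its parameters, and the sample values are hypothetical, and a real run would pass all 18 IDs):

```python
def check_success(target_role_lift, expected_ids, strong_lift=2, min_strong=3):
    """Check the Coverage and Discrimination criteria.

    target_role_lift maps change ID -> delta for that change's target role.
    """
    missing = sorted(set(expected_ids) - set(target_role_lift))
    lifts = list(target_role_lift.values())
    varied = len(set(lifts)) > 1  # not every change scored the same delta
    strong = sorted(cid for cid, v in target_role_lift.items() if v >= strong_lift)
    return {
        "coverage_ok": not missing,
        "missing": missing,
        "discrimination_ok": varied and len(strong) >= min_strong,
        "strong_changes": strong,
    }


# Four changes shown for brevity; values are illustrative.
result = check_success(
    {"A1": 2, "B1": 3, "C1": 0, "D1": 2},
    expected_ids=["A1", "B1", "C1", "D1"],
)
print(result)
```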

Failure Modes to Watch

Failure Mode	Detection	Response
All scores identical before/after	Reviewers are not sensitive to changes	Make before/after differences more concrete. Check that the "after" artifact actually reflects the change.
Reviewers contradict each other chaotically	No coherent signal across roles	Check reviewer agent calibration against the March 15 baseline. Re-run with more surrounding context.
Rankings contradict Sergey's priorities	Simulation doesn't match real-world priorities	Use the 28 feedback items as calibration anchors. Weight CEO role higher in the ranking formula.
Research drifts into literature	AutoResearchClaw defaults to academic mode	The topic override (below) explicitly forbids literature synthesis. If it still happens, skip to experiment stages.

AutoResearchClaw Configuration

Topic Override

Validate 18 concrete SecurityV0 platform changes by generating before/after
platform output and measuring acceptance score deltas across 7 reviewer roles.

DO NOT synthesize academic literature. DO NOT generate hypotheses about
already-decided questions. DO NOT propose features that contradict the
guiding principles listed in the research brief.

Input: 18 proposed changes with before/after specifications.
Output: Ranked backlog with measured per-role acceptance deltas.

Experiment Mode

simulated — no code execution. Generate before/after text artifacts and score them.

Pipeline Configuration

Skip stages: Literature search (stages 2-3), gap analysis (stage 4), literature synthesis (stage 7). These produced the academic drift in the previous run.
Start from: Problem decomposition (stage 5) — use the 18 changes as the decomposed problem set
Emphasis stages: Experiment execution (stages 10-16) — this is where the before/after testing happens
Gate stages:
- Stage 5: Human approval of the change set (verify all 18 changes are correctly specified)
- Stage 9: Human approval of experiment design (verify before/after artifacts are realistic)
- Stage 17: Human approval of results (verify rankings make sense before final synthesis)

Input Files

File	Purpose
`sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-21-consolidated-action-plan.md`	Guiding principles, MPAS-7 targets, what was already decided
`sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-15-multi-perspective-platform-review.md`	Baseline review scores, current platform state
`sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sergey-feedback-tracker.md`	28 feedback items as ground truth
`sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sprint-cross-review-report.md`	Corrections to apply (contradictions, gaps)
`sv0-platform/.claude/agents/*.md`	7 reviewer agent definitions
Live API responses from `app.securityv0.com` (tenant `demo-w1`)	Current platform output for "before" artifacts

Expected Deliverables

Change Scorecard (1 page per change): before/after text, per-role delta, aggregate lift, regression flags
Compound Bundle Scorecards (one per compound under Compound Change Testing, 5 total): interaction effects, total lift, regression flags
Ranked Backlog (1 page): all 18 changes sorted by acceptance-per-effort, grouped into implementation bundles
Calibration Report (1 page): how well the simulated reviews align with Sergey's 28 feedback decisions
Decision Support for open questions: data to help resolve the 5 items Sergey marked as "OPEN QUESTION"

Next Action

Status: idea

Decision needed from: Ivan (CTO)

Options:

1. Run as specified: configure AutoResearchClaw with the topic override and pipeline config above, using the 18 changes as input
2. Reduce scope: test only the Phase 0 + Phase 1 changes (10 items) as a faster validation cycle
3. Manual validation instead: skip AutoResearchClaw, implement Phase 0, then re-run the 7 agents manually against the live platform
4. Hybrid: implement Phase 0 first (known blockers, no research needed), then run this research for Phase 1+ changes where the priority is less obvious

Recommended: Option 4 (Hybrid). Phase 0 items are clearly demo-blocking — build them, don't simulate them. Use this research to validate Phase 1+ priorities and resolve the 5 open questions with data instead of guessing.

GitHub Issue: not yet created
