Research Brief: Acceptance-Driven Implementation Validation
What This Research Is
A simulation-based validation loop that tests proposed SecurityV0 platform changes before engineering invests in building them. Instead of surveying academic literature, this research:
- Takes a concrete proposed change (e.g., "replace generic remediation with named objects")
- Generates realistic before/after platform output for that change
- Runs the 7 reviewer agents against both versions
- Measures the acceptance score delta per role
- Identifies which changes deliver the most acceptance improvement per engineering effort
The output is a ranked backlog of validated changes with measured expected impact — not a literature review.
Why This Approach
The previous AutoResearchClaw run (March 18-19) applied its default academic pipeline to a product strategy problem. It produced literature synthesis from clinical guideline papers and hypothesis testing frameworks, but the research brief had explicitly requested "an actionable product roadmap, not a scientific paper."
What went wrong:
- The research proposed dual-axis scoring — Sergey already decided to remove all visible scores
- The research proposed a two-product split — contradicts the one-platform, audience-split model
- Literature clusters (cardiovascular guidelines, deep learning, mental health papers) were not actionable for a security platform
- Hypothesis testing framework was built to validate decisions the CEO already made
- No concrete feature specs, screen layouts, or data model changes were produced
What should change:
- Replace literature synthesis with simulated implementation testing
- Treat Sergey's 28 feedback items and the cross-review corrections as ground truth, not as hypotheses to validate
- Measure acceptance improvement from specific proposed changes, not from abstract framework comparisons
- Output a prioritized implementation backlog ranked by measured acceptance delta
Research Objective
Produce a validated change backlog for SecurityV0 where each item has:
- A concrete before/after specification (what the platform output looks like today vs. after the change)
- Measured acceptance delta across the 7 reviewer roles
- Which reviewer roles benefit most
- Engineering effort estimate (from the sprint plan)
- An acceptance-per-effort ratio that tells us what to build first
Method: Simulate → Review → Measure → Rank
Step 1: Define the Change Set
Each change is a discrete, implementable platform modification. Source: the consolidated action plan + Sergey's feedback items.
Proposed change set (18 changes across 5 categories):
Category A: Remediation Quality
| ID | Change | Simulates |
|---|---|---|
| A1 | Generic → named object remediation | applies_to: "LLM endpoint" → applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)" |
| A2 | Choke point deduplication | Same remediation in 3 clusters → single entry: "Applies across 3 clusters. One fix reduces 3 exposures." |
| A3 | Business impact caveat on remediation | Add: "Restricting this role may affect [service]. Verify with application owner before applying." |
Category B: CISO Clarity
| ID | Change | Simulates |
|---|---|---|
| B1 | Execution confidence labels (plain English) | "Grade A" → "Execution Confirmed" / "Previously Active" / "Standing Authority Only" |
| B2 | Global top-3 risk ranking on Overview | Add: "Top risks across all clusters: 1) ..., 2) ..., 3) ..." |
| B3 | Business metric stat cards | "Active Autonomous: 5 Identities" → "Sensitive Domains Reached: 6" |
| B4 | OWASP/NIST compliance tags on findings | Add: "OWASP ASI-03 · NIST AC-2 · SOX: Segregation of duties" |
Category C: Analyst Workflow
| ID | Change | Simulates |
|---|---|---|
| C1 | "What changed since yesterday" filter | New section: "3 new findings since your last visit (March 18)" with highlighted items |
| C2 | Full identity role scope in path view | Path row shows "5 roles across 5 paths" instead of just the 1 role used in this path |
| C3 | Named owners in ownership section | "Service principal owner departed" → "Maria Lopez (departed March 1, 2026)" |
| C4 | Human-readable breadcrumbs | /authority-paths/0a3a4bb8... → Overview / Unowned Sensitive Access / svc-foundry-agent701 |
Category D: Data Completeness
| ID | Change | Simulates |
|---|---|---|
| D1 | Populated target_resource in evidence | target_resource: "" → target_resource: "GP_Clinical_Notes (SharePoint)" |
| D2 | Fixed added_roles in evidence packs | "Review 0 role assignments" → "Review 2 role assignments: ap_write, ar_write" |
| D3 | Correct path/finding counts | 32 vs 30 discrepancy resolved. bySeverity shows total-scoped counts. |
Category E: Report & Partner Deliverable
| ID | Change | Simulates |
|---|---|---|
| E1 | Assessment report (6-page executive format) | Full report output: cover, exec summary in business language, findings, remediation roadmap, compliance mapping |
| E2 | Scan digest (1-page summary) | Post-scan summary: top 3 risks, governance checklist, trend |
| E3 | Cluster verdicts extended below cluster level | Finding-level descriptions use the same business narrative pattern as cluster verdicts |
| E4 | Governance checklist with correct labels | Distinct labels: "Scope drift (3 paths)" instead of 3x "Orphaned identities" |
Step 2: Generate Before/After Artifacts
For each change, generate two versions of the relevant platform output:
Before: Use the actual current platform output as baseline. Source from:
- Live API responses from app.securityv0.com (tenant demo-w1)
- The data snapshots captured in the March 15 multi-perspective review
- The specific examples cited in Sergey's feedback
After: Apply the proposed change to the baseline output. The change must be:
- Realistic (uses data that exists or would exist after a known data fix)
- Specific (exact text, not placeholders like "[entity name here]")
- Minimal (only change what the proposal specifies — don't improve other aspects)
Format: Each artifact is a structured document that includes:
- The relevant screen/section output (e.g., "Authority Path Detail — Remediation Section")
- Enough surrounding context for a reviewer to evaluate (not just the changed line)
- Metadata: which change IDs are applied, what data was modified
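One way to hold a before/after pair and check it against the three rules above (realistic, specific, minimal) is sketched here. The `Artifact` class, its field names, and `artifact_problems` are hypothetical and not part of the platform or the reviewer tooling; the example text is taken from change A1.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical container for one before/after pair (field names are illustrative)."""
    screen: str                      # which screen or section the output comes from
    before_text: str                 # baseline output, including surrounding context
    after_text: str                  # same output with only the proposed change applied
    change_ids: list[str] = field(default_factory=list)
    modified_fields: list[str] = field(default_factory=list)


# Matches bracketed placeholders such as "[entity name here]".
PLACEHOLDER = re.compile(r"\[[^\]]*\]")


def artifact_problems(a: Artifact) -> list[str]:
    """Return a list of reasons the artifact is not ready for review (empty if none)."""
    problems = []
    if not a.change_ids:
        problems.append("no change IDs recorded")
    if a.before_text == a.after_text:
        problems.append("after text is identical to before text")
    if PLACEHOLDER.search(a.after_text):
        problems.append("after text contains a bracketed placeholder")
    return problems


a1 = Artifact(
    screen="Remediation Section",
    before_text='applies_to: "LLM endpoint"',
    after_text='applies_to: "Endpoint: Azure OpenAI (svc-foundry-agent701)"',
    change_ids=["A1"],
    modified_fields=["applies_to"],
)
print(artifact_problems(a1))  # prints []
```

A check like this catches only mechanical problems. Whether the "after" text is realistic still needs the human gate at stage 9.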
Step 3: Run 7-Persona Reviews
Submit each before/after pair to the 7 reviewer agents. Each agent uses its existing definition from sv0-platform/.claude/agents/:
| Agent | Definition | Scoring Rubric |
|---|---|---|
| CISO Executive | ciso-reviewer.md | Can I get the "so what?" in 15 seconds? (1-10) |
| SecOps Analyst | secops-analyst.md | Can I act on this finding without asking someone? (1-10) |
| Product QA | product-qa.md | Does this match the spec? (pass/partial/fail per item) |
| UX Critic | ux-critic.md | Is this clear to a first-time user? Jargon count. IA grade. (A-F) |
| Security Auditor | security-auditor.md | Is the data internally consistent? (issue count) |
| Enterprise Executive | enterprise-executive.md | Can a partner present this without rewriting? (1-5) |
| CEO (Sergey) | ceo-reviewer.md | Is this sellable, honest, and aligned with product vision? (accept/reject per item) |
Scoring protocol:
- Each reviewer scores both the before and after version independently
- Scores use the same rubric as the March 15 review for comparability
- Each reviewer must provide specific textual feedback explaining score changes
- Aggregate into the MPAS-7 format (see consolidated action plan)
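The seven rubrics use different scales (1-10, 1-5, A-F, pass/partial/fail, accept/reject, issue count), so they need a common scale before they can be aggregated. The MPAS-7 format is defined in the consolidated action plan and is not reproduced here; the sketch below is not MPAS-7. It shows one possible conversion to a 0-1 score where higher is better. Every mapping in it (letter-grade values, the issue cap, the role keys) is an assumption made for illustration, and it uses only the UX Critic's letter grade, not the jargon count.

```python
from statistics import mean

# Assumed conversions, for illustration only. None of these mappings come from
# the review documents; the MPAS-7 format may define them differently.
LETTER = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}
VERDICT = {"pass": 1.0, "partial": 0.5, "fail": 0.0, "accept": 1.0, "reject": 0.0}
ISSUE_CAP = 10  # assumed issue count at which the auditor score reaches 0


def normalize(role: str, raw) -> float:
    """Map one reviewer's raw output to a 0-1 score where higher is better."""
    if role in ("ciso", "secops"):        # 1-10 rubric
        return (raw - 1) / 9
    if role == "enterprise_exec":         # 1-5 rubric
        return (raw - 1) / 4
    if role == "ux":                      # A-F letter grade
        return LETTER[raw]
    if role == "auditor":                 # issue count, lower is better
        return 1 - min(raw, ISSUE_CAP) / ISSUE_CAP
    if role in ("product_qa", "ceo"):     # list of per-item verdicts
        return mean(VERDICT[v] for v in raw)
    raise ValueError(f"unknown role: {role}")
```

If the consolidated action plan already specifies how MPAS-7 combines the rubrics, that definition should replace these mappings so results stay comparable with the March 15 review.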
Step 4: Measure Deltas
For each change, compute:
| Metric | Definition |
|---|---|
| Per-role delta | After score minus Before score for each of the 7 roles |
| Aggregate acceptance lift | Mean delta across all 7 roles, weighted equally |
| Role coverage | How many of the 7 roles show improvement (target: change helps ≥3 roles) |
| Regression check | Any role where the After score is LOWER than Before (flag as risk) |
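The four metrics in the table can be computed directly from two score dictionaries. This is a minimal sketch that assumes each role's score is already numeric and on a scale where higher is better; the role keys and the example scores are made up for illustration and are not measured results.

```python
ROLES = ["ciso", "secops", "product_qa", "ux", "auditor", "enterprise_exec", "ceo"]


def measure(before: dict[str, float], after: dict[str, float]) -> dict:
    """Compute per-role delta, aggregate lift, role coverage, and regressions."""
    deltas = {role: after[role] - before[role] for role in ROLES}
    return {
        "per_role_delta": deltas,
        # Unweighted mean across all 7 roles.
        "aggregate_lift": sum(deltas.values()) / len(ROLES),
        # Number of roles whose score improved.
        "role_coverage": sum(1 for d in deltas.values() if d > 0),
        # Roles whose After score is lower than Before.
        "regressions": [role for role, d in deltas.items() if d < 0],
    }


# Illustrative numbers only.
before = {"ciso": 4, "secops": 5, "product_qa": 6, "ux": 5,
          "auditor": 7, "enterprise_exec": 4, "ceo": 5}
after = {"ciso": 7, "secops": 6, "product_qa": 6, "ux": 4,
         "auditor": 7, "enterprise_exec": 6, "ceo": 6}

result = measure(before, after)
print(result["role_coverage"])  # prints 4
print(result["regressions"])    # prints ['ux']
```

In this made-up example the change meets the coverage target (4 roles improve) but is flagged because the UX score drops.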
Step 5: Rank and Bundle
Combine the measured deltas with engineering effort estimates from the sprint plan:
Acceptance-per-effort = Aggregate acceptance lift / Effort (sessions)
Group changes into implementable bundles:
- Bundle 1 (Demo Blockers): Changes that must ship together for the demo to work
- Bundle 2 (Highest ROI): Top 5 changes by acceptance-per-effort ratio
- Bundle 3 (Report MVP): Minimum set for a partner-presentable report
- Bundle 4 (Remaining): Everything else, ordered by ratio
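The ratio and the resulting order can be sketched as follows. The lift and effort values are placeholders, not measurements or sprint-plan estimates, and the function name is hypothetical.

```python
def rank_by_ratio(changes: list[tuple[str, float, float]]) -> list[tuple[str, float]]:
    """Sort changes by acceptance-per-effort, highest first.

    Each input tuple is (change_id, aggregate_lift, effort_in_sessions).
    """
    scored = []
    for change_id, lift, effort in changes:
        if effort <= 0:
            raise ValueError(f"{change_id}: effort must be positive")
        scored.append((change_id, lift / effort))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers for illustration.
ranked = rank_by_ratio([("A1", 1.5, 1), ("B2", 2.0, 4), ("D1", 0.5, 0.5)])
print(ranked)  # prints [('A1', 1.5), ('D1', 1.0), ('B2', 0.5)]

# Bundle 2 would be the first five entries of the full ranking.
bundle_2 = [change_id for change_id, _ in ranked[:5]]
```

A change with the largest raw lift (B2 in this example) can still rank last once effort is taken into account, which is the point of dividing by sessions.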
What to Avoid
Explicit anti-patterns from the previous research run:
| Anti-Pattern | Why It Failed | Rule |
|---|---|---|
| Literature synthesis from unrelated domains | Cardiovascular guidelines and deep learning papers are not actionable for a security platform UX problem | Do NOT cite academic papers unless they directly describe a pattern already used in security tooling (e.g., Wiz, CrowdStrike, Datadog) |
| Hypothesis testing for decided questions | Sergey already decided: no scores, no effort estimates, plain English, cut > add. Testing these as hypotheses wastes cycles. | Treat Sergey's 28 feedback items as requirements, not hypotheses |
| Proposing features that contradict CEO decisions | Dual-axis scoring, two-product split, visible evidence grades — all contradicted explicit decisions | Before proposing ANY feature, check against the 8 guiding principles in the consolidated action plan |
| Abstract framework recommendations | "Implement an Essential Actions view" without specifying what it looks like, what data it uses, or where it appears | Every recommendation must include: exact text/layout, data source, file location |
| Scoring with synthetic personas | LLM-simulated "CISO persona" may not match real CISO behavior | Use the actual reviewer agent definitions from sv0-platform/.claude/agents/. These have been calibrated against Sergey's feedback. Validate rubric against the 28 feedback items as ground truth. |
| Over-engineering experiment design | 50 synthetic scenarios × 5 seeds × 4 regime cells × Wilcoxon tests is overkill for a product question | Use the actual platform output, not synthetic scenarios. Run each reviewer once per before/after pair. Variance comes from the 7 different perspectives, not from repeated sampling. |
What to Try
Compound Change Testing
Individual changes may have interaction effects. Test key bundles as compounds:
| Compound | Changes | Hypothesis |
|---|---|---|
| "Fix the engine" | D1 + D2 + D3 | Data completeness alone lifts SecOps and Auditor scores |
| "CISO in 15 seconds" | B1 + B2 + B3 + B4 | Combination of confidence labels + global ranking + business metrics + compliance tags lifts CISO score past 85% |
| "Analyst day-1" | A1 + C1 + C2 + C3 + C4 | Named remediation + what-changed + role scope + real names + breadcrumbs makes SecOps score ≥80% |
| "Partner-ready report" | E1 + E3 + B4 | Assessment report + extended verdicts + compliance tags lifts Enterprise Executive past 3.5/5 |
| "Full package" | All 18 changes | Upper bound — what's the maximum acceptance we can achieve? |
Run individual changes first, then compounds, to isolate which changes drive the most improvement.
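One simple way to express an interaction effect is the compound's lift minus the sum of its members' individual lifts. This additive baseline is an assumption of the sketch, not something the brief prescribes, and the numbers are placeholders.

```python
def interaction_effect(compound_lift: float, individual_lifts: dict[str, float]) -> float:
    """Compound lift minus the sum of individual lifts.

    Positive: the changes reinforce each other.
    Negative: the changes overlap, so the bundle is worth less than its parts.
    """
    return compound_lift - sum(individual_lifts.values())


# Placeholder numbers for the "Fix the engine" compound (D1 + D2 + D3).
effect = interaction_effect(1.5, {"D1": 0.5, "D2": 0.25, "D3": 0.25})
print(effect)  # prints 0.5
```

Because each reviewer runs once per pair, small interaction values may be noise rather than signal and should be read alongside the reviewers' textual feedback.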
Cross-Role Regression Detection
Some changes that help one role may hurt another. Examples to watch:
- Adding compliance tags (B4) helps CISO/Executive but might add visual noise for SecOps → check UX score
- Business metric stat cards (B3) help CISO but remove analyst-relevant data → check SecOps score
- Extended verdicts below cluster level (E3) help Executive but might feel dumbed-down to SecOps → check both
Flag any change where a role's score drops by ≥1 point. These need design mitigation (e.g., progressive disclosure, role-specific views).
Sergey Feedback Ground Truth Validation
The 28 feedback items have known accept/reject status. Use them as validation:
- For each accepted item, the corresponding change should improve ≥1 role score
- For each "open question" item, the research should produce data that helps resolve the question
- For each deferred item, the research should NOT propose to address it (scope control)
If the simulated reviews disagree with Sergey's actual decisions, investigate why — the simulation may have a calibration problem.
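The first and third rules above lend themselves to an automated check. The sketch below assumes each feedback item has been mapped by hand to a status and, where a change was tested, to that change's per-role deltas. The item IDs, status strings, and function name are hypothetical.

```python
def calibration_mismatches(items: list[dict]) -> list[tuple[str, str]]:
    """List feedback items where the simulation disagrees with the known decision.

    Each item is a dict with keys:
      "id":          feedback item identifier
      "status":      "accepted", "open", or "deferred"
      "role_deltas": per-role deltas of the corresponding change, or None if
                     no change was tested for this item
    """
    mismatches = []
    for item in items:
        deltas = item["role_deltas"]
        if item["status"] == "accepted":
            # An accepted item should improve at least one role score.
            if deltas is None or not any(d > 0 for d in deltas.values()):
                mismatches.append((item["id"], "accepted but no role improved"))
        elif item["status"] == "deferred" and deltas is not None:
            # Deferred items are out of scope and should not have been tested.
            mismatches.append((item["id"], "deferred but a change was tested"))
    return mismatches


# Hypothetical items for illustration.
items = [
    {"id": "F01", "status": "accepted", "role_deltas": {"ciso": 2, "secops": 0}},
    {"id": "F02", "status": "accepted", "role_deltas": {"ciso": 0, "secops": -1}},
    {"id": "F03", "status": "deferred", "role_deltas": {"ux": 1}},
    {"id": "F04", "status": "open", "role_deltas": {"ciso": 1}},
]
print(calibration_mismatches(items))
```

Open-question items are deliberately not checked here, since their purpose is to produce data rather than to confirm a known decision.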
Validation: How to Know the Research Worked
Success Criteria
| Criterion | Threshold |
|---|---|
| Coverage | All 18 proposed changes have measured before/after deltas |
| Discrimination | Changes show measurable variance (not all +1 or all +0). At least 3 changes show ≥2 point lift for their target role. |
| Alignment | The ranking by acceptance-per-effort does NOT contradict Sergey's explicit priorities. If it does, explain why. |
| Regression detection | Any cross-role regressions are identified and have a proposed mitigation |
| Actionability | The final ranked backlog can be handed to engineering without further interpretation |
Failure Modes to Watch
| Failure Mode | What It Indicates | Response |
|---|---|---|
| All scores identical before/after | Reviewers are not sensitive to changes | Make before/after differences more concrete. Check that the "after" artifact actually reflects the change. |
| Reviewers contradict each other chaotically | No coherent signal across roles | Check reviewer agent calibration against the March 15 baseline. Re-run with more surrounding context. |
| Rankings contradict Sergey's priorities | Simulation doesn't match real-world priorities | Use the 28 feedback items as calibration anchors. Weight CEO role higher in the ranking formula. |
| Research drifts into literature | AutoResearchClaw defaults to academic mode | The topic override (below) explicitly forbids literature synthesis. If it still happens, skip to experiment stages. |
AutoResearchClaw Configuration
Topic Override
Validate 18 concrete SecurityV0 platform changes by generating before/after
platform output and measuring acceptance score deltas across 7 reviewer roles.
DO NOT synthesize academic literature. DO NOT generate hypotheses about
already-decided questions. DO NOT propose features that contradict the
guiding principles listed in the research brief.
Input: 18 proposed changes with before/after specifications.
Output: Ranked backlog with measured per-role acceptance deltas.
Experiment Mode
simulated — no code execution. Generate before/after text artifacts and score them.
Pipeline Configuration
- Skip stages: Literature search (stages 2-3), gap analysis (stage 4), literature synthesis (stage 7). These produced the academic drift in the previous run.
- Start from: Problem decomposition (stage 5) — use the 18 changes as the decomposed problem set
- Emphasis stages: Experiment execution (stages 10-16) — this is where the before/after testing happens
- Gate stages:
  - Stage 5: Human approval of the change set (verify all 18 changes are correctly specified)
  - Stage 9: Human approval of experiment design (verify before/after artifacts are realistic)
  - Stage 17: Human approval of results (verify rankings make sense before final synthesis)
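The stage selections above can be restated as data and checked for internal consistency before a run. AutoResearchClaw's actual configuration format is not shown in this brief, so the dictionary keys below are hypothetical and this is not the tool's schema. Only the stage numbers and the experiment mode come from the text.

```python
# Plain-Python restatement of the pipeline settings; key names are hypothetical.
PIPELINE = {
    "experiment_mode": "simulated",
    "skip_stages": [2, 3, 4, 7],
    "start_stage": 5,
    "emphasis_stages": list(range(10, 17)),  # stages 10 through 16
    "gate_stages": {
        5: "change set approved",
        9: "experiment design approved",
        17: "results approved",
    },
}


def config_problems(cfg: dict) -> list[str]:
    """Return contradictions in the stage selections (empty list if none)."""
    problems = []
    skipped = set(cfg["skip_stages"])
    if cfg["start_stage"] in skipped:
        problems.append("start stage is also skipped")
    for stage in cfg["gate_stages"]:
        if stage in skipped:
            problems.append(f"gate stage {stage} is skipped")
        if stage < cfg["start_stage"]:
            problems.append(f"gate stage {stage} precedes the start stage")
    for stage in cfg["emphasis_stages"]:
        if stage in skipped:
            problems.append(f"emphasis stage {stage} is skipped")
    return problems


print(config_problems(PIPELINE))  # prints []
```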
Input Files
| File | Purpose |
|---|---|
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-21-consolidated-action-plan.md | Guiding principles, MPAS-7 targets, what was already decided |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-15-multi-perspective-platform-review.md | Baseline review scores, current platform state |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sergey-feedback-tracker.md | 28 feedback items as ground truth |
| sv0-documentation/docs/product/reviews/march-2026-platform-review/2026-03-16-sprint-cross-review-report.md | Corrections to apply (contradictions, gaps) |
| sv0-platform/.claude/agents/*.md | 7 reviewer agent definitions |
| Live API responses from app.securityv0.com (tenant demo-w1) | Current platform output for "before" artifacts |
Expected Deliverables
- Change Scorecard (1 page per change): before/after text, per-role delta, aggregate lift, regression flags
- Compound Bundle Scorecards (5 bundles): interaction effects, total lift, regression flags
- Ranked Backlog (1 page): all 18 changes sorted by acceptance-per-effort, grouped into implementation bundles
- Calibration Report (1 page): how well the simulated reviews align with Sergey's 28 feedback decisions
- Decision Support for open questions: data to help resolve the 5 items Sergey marked as "OPEN QUESTION"
Next Action
Status: idea
Decision needed from: Ivan (CTO)
Options:
1. Run as specified: configure AutoResearchClaw with the topic override and pipeline config above, using the 18 changes as input
2. Reduce scope: test only the Phase 0 + Phase 1 changes (10 items) as a faster validation cycle
3. Manual validation instead: skip AutoResearchClaw, implement Phase 0, then re-run the 7 agents manually against the live platform
4. Hybrid: implement Phase 0 first (known blockers, no research needed), then run this research for Phase 1+ changes where the priority is less obvious
Recommended: Option 4 (Hybrid). Phase 0 items are clearly demo-blocking — build them, don't simulate them. Use this research to validate Phase 1+ priorities and resolve the 5 open questions with data instead of guessing.
GitHub Issue: not yet created