Research Summary: What We Learned and What We're Using

Source: AutoResearchClaw automated research pipeline, March 18-19, 2026. Full academic paper at research/2026-03-19-platform-evolution-multi-stakeholder-acceptance.md. Raw artifacts at ~/dev/AutoResearchClaw/artifacts/sv0-platform/.

Method: The research used 7 simulated reviewer personas (the same ones from the March 15 platform review) to evaluate 50 synthetic NHI security scenarios under different product configurations. It tested what happens when you fix data quality, split the product into two outputs, or change how reports are formatted. All numbers are analytical projections, not empirical measurements.


The Five Findings

1. Fix the data first, UI second

Filling empty fields and fixing count mismatches is projected to improve technical reviewer acceptance by +11 to +14 points with zero UI changes. Fixing the count discrepancies between summary and detail views is, on its own, projected to get the SecOps and Security Auditor personas past their acceptance threshold.
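A minimal sketch of the kind of check that would surface these mismatches, assuming a hypothetical payload in which a summary object carries per-severity counts and a detail list carries one row per finding. The function name and the field names (`severity`, `id`) are illustrative assumptions, not the platform's actual schema.

```python
from collections import Counter


def count_mismatches(summary_counts, findings):
    """Return {severity: (summary_count, detail_count)} for every severity
    where the summary view and the detail view disagree."""
    detail_counts = Counter(f["severity"] for f in findings)
    severities = set(summary_counts) | set(detail_counts)
    return {
        sev: (summary_counts.get(sev, 0), detail_counts.get(sev, 0))
        for sev in sorted(severities)
        if summary_counts.get(sev, 0) != detail_counts.get(sev, 0)
    }


# Hypothetical data: the summary claims 3 critical findings, the detail has 2.
summary = {"critical": 3, "high": 1}
findings = [
    {"id": "f1", "severity": "critical"},
    {"id": "f2", "severity": "critical"},
    {"id": "f3", "severity": "high"},
]
mismatches = count_mismatches(summary, findings)
print(mismatches)
```

Run against every summary/detail pair before a demo, an empty result means the two views agree; any non-empty result names the severity bucket to investigate.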

What we're doing: This maps directly to Phase 0 and Phase 3 of the consolidated action plan: remediation naming objects (Phase 0.1), fixing path counts (Phase 3.2), fixing meta.bySeverity scoping (Phase 3.4), and fixing the role_history completeness mismatch (Phase 3.5). Note: added_roles (3.1) and target_resource (3.3) were confirmed as not bugs by the Mar 21 code audit; connectors populate both fields correctly.

Why it matters: It validates the phase ordering in our plan — demo blockers and data quality before CISO clarity and polish.


2. The platform and the report are different products

The research found that trying to serve CISOs and analysts from the same screen creates a structural conflict. What builds analyst trust (full evidence, drill-down, technical detail) actively erodes executive confidence.

The projected improvement: a two-product split (analyst workbench + executive report generator) scores +13 points higher in combined acceptance than any single-artifact approach, and the projected partner rewrite rate drops from 60-65% to 15-20%.

What we're doing: This aligns with what's already planned. The platform UI is the analyst tool. The report generator (Phase 4) is the executive/partner deliverable. Sergey's direction: "Partners sell the report, not the tool." The architecture is "one engine, many channels" — not two separate products, but two separate outputs from the same evidence API.

Key design constraint from the research: The report generator should query the evidence API for synthesized verdicts, not raw findings. If the report path can access raw evidence, technical detail gradually creeps in and degrades readability. This is an architectural constraint for Phase 4.2 (Report Service).


3. Opinionated reports outsell rich ones ("legibility inversion")

The most counterintuitive finding. Non-technical buyers (CISOs, partners, executives) rate opinionated single-verdict reports approximately 2 points higher on a 10-point purchase intent scale than analytically rich dual-axis formats.

Even adding a collapsed technical appendix hurts — the mere presence of a "Technical Appendix" section signals complexity and shifts the buyer's frame from "expert recommendation I can act on" to "complex analysis I need to evaluate."

What we're doing: This validates Sergey's decisions: "Business conclusions, not technical facts." "Remove scores entirely." "Plain English." The cluster verdict sentences that "pass the 5-second comprehension test" are exactly the right pattern.

What we're changing: The assessment report template (Phase 4.3) originally included "Appendix: methodology, evidence integrity, data sources." The research says this hurts purchase intent even when collapsed. The report should lead with what to do and why, not how we figured it out. Methodology belongs in the evidence pack for auditors, not in the executive report.


4. Disagreement between tools IS the signal

Instead of averaging risk scores from different sources into one number (which hides the interesting stuff), the research uses Jensen-Shannon divergence to measure how much sources disagree. When one tool says "low risk" and another says "critical," that disagreement itself points to sophisticated attacks exploiting gaps between monitoring tools.

Divergence scoring is projected to reach 0.82 AUROC for novel threat detection, versus 0.74 for averaged scores.
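The disagreement measure can be sketched as follows. This is an illustration under stated assumptions, not the research pipeline's implementation: it assumes each source's assessment of an identity is expressed as a probability distribution over four severity levels, and it compares only two sources at a time. The severity labels and the example distributions are hypothetical.

```python
import math

SEVERITIES = ("low", "medium", "high", "critical")  # assumed ordering


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; 0 * log(0) is taken as 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 means the sources agree exactly,
    1 means their distributions do not overlap at all."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical assessments of the same identity by three sources.
tool_a = [0.7, 0.3, 0.0, 0.0]   # mostly "low"
tool_b = [0.0, 0.0, 0.2, 0.8]   # mostly "critical"
tool_c = [0.6, 0.4, 0.0, 0.0]   # close to tool_a

disagree = js_divergence(tool_a, tool_b)
agree = js_divergence(tool_a, tool_c)
print(f"a vs b: {disagree:.3f}, a vs c: {agree:.3f}")
```

Averaging tool_a and tool_b would produce an unremarkable mid-range score; the divergence instead flags that pair as maximally inconsistent, which is the signal the finding describes.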

What we're doing: Filed for future implementation. We currently have 2 connectors (Entra ID, ServiceNow). This becomes actionable when we have 4+ sources producing independent severity assessments. The concept maps to Sergey's insight: "One fix reduces multiple exposures — that's the real value." Divergence scoring finds the choke points where monitoring tools disagree.

Why we're waiting: On incomplete data (our current state with empty fields), divergence scoring performs worse than a single source (projected AUROC drops to 0.62). The data quality fixes in Phase 0/3 are a hard prerequisite.


5. Implementation order matters — effects are multiplicative

The research found that the improvements interact rather than simply add up:

  • Engine completeness alone: +9 points
  • Architecture split alone: +7 points
  • Both together: +19 points (not +16 — the interaction adds +3)

Doing them out of order wastes effort. Presentation on bad data doesn't stick, and divergence scoring on incomplete data is projected to perform worse than a single source (finding 4). The recommended sequence matches our phase ordering: data quality → report generator → divergence scoring.

What we're doing: Our consolidated action plan already follows this order: Phase 0 (demo blockers / data quality) → Phase 1-3 (clarity + data) → Phase 4 (reports) → future (divergence scoring). The research validates that this sequencing isn't just prioritization — it's structurally required.


What We Rejected From the Research

| Proposal | Why We Rejected It |
| --- | --- |
| Dual-axis scoring (urgency × confidence) for the workbench | Sergey: "Remove scores entirely." The workbench shows sorted lists, not score grids. |
| Visible A/B/C evidence grades | Sergey: "Avoid ABC grading. Use plain English." We use "Execution Confirmed" / "Previously Active" / "Standing Authority Only." |
| Two fully separate products | Sergey's model is one platform with audience-appropriate views + a report generator. Close to the research proposal but not a hard architectural split into two independent applications. |
| Security Posture Index (SPI) roll-up metric | Not in current scope. A composable roll-up score could be useful in the executive report but needs careful design to avoid the "score" problem Sergey flagged. Filed for consideration during Phase 4 report template design. |
| Effort/cost estimates in reports | Sergey: "Too risky." The research doesn't address this directly, but the enterprise executive agent has been updated to evaluate compensating elements (responsible role, compliance mapping, choke points) instead. |

What's New That We're Adding

Three items from the final research paper that weren't in the consolidated action plan:

1. Remove methodology appendix from assessment report

The legibility inversion finding says even a collapsed appendix hurts purchase intent. The assessment report template (Phase 4.3) should NOT include "Appendix: methodology, evidence integrity, data sources." Methodology and evidence integrity belong in the separate evidence export (for auditors), not in the board-facing assessment report.

2. Report API constraint: no raw evidence access

The report generator (Phase 4.2) should query the evidence API for synthesized verdicts and cluster summaries, not raw findings. This is an architectural constraint that prevents technical detail from gradually creeping into executive output. The API should expose two endpoint families: full-fidelity (for the platform UI) and pre-synthesized (for the report generator).
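One way to picture the two endpoint families is a sketch in which the report generator is handed an object that has no way to return raw findings. Everything here is an assumption for illustration: the class names (`EvidenceStore`, `SynthesizedView`), the fields, and the sample verdict text are hypothetical and do not describe the actual evidence API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawFinding:
    finding_id: str
    cluster_id: str
    evidence: dict  # full technical detail, intended for the platform UI only


@dataclass(frozen=True)
class ClusterVerdict:
    cluster_id: str
    verdict: str  # plain-English conclusion
    finding_count: int


class EvidenceStore:
    """Full-fidelity family: what the platform UI would query."""

    def __init__(self, findings, verdict_text):
        self._findings = list(findings)
        self._verdict_text = dict(verdict_text)  # cluster_id -> sentence

    def raw_findings(self, cluster_id):
        return [f for f in self._findings if f.cluster_id == cluster_id]

    def cluster_verdicts(self):
        return [
            ClusterVerdict(cid, text, len(self.raw_findings(cid)))
            for cid, text in sorted(self._verdict_text.items())
        ]


class SynthesizedView:
    """Pre-synthesized family: the only handle the report generator receives.
    It deliberately exposes no method that returns RawFinding objects."""

    def __init__(self, store):
        self._store = store

    def cluster_verdicts(self):
        return self._store.cluster_verdicts()


def render_report(view):
    """Build report lines from verdicts only."""
    return "\n".join(f"- {v.verdict}" for v in view.cluster_verdicts())


# Hypothetical sample data.
store = EvidenceStore(
    findings=[
        RawFinding("f1", "c1", {"role": "Owner"}),
        RawFinding("f2", "c1", {"role": "Contributor"}),
    ],
    verdict_text={"c1": "Two service accounts hold standing admin access."},
)
report = render_report(SynthesizedView(store))
print(report)
```

Python does not enforce the boundary (the private `_store` attribute is still reachable), so in a real service the separation would more plausibly live at the network or authorization layer. The sketch only shows the shape of the constraint: the report path is written against an interface where raw evidence does not exist.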

3. Feedback channels for future calibration

Four feedback channels worth instrumenting over time (not this sprint, but worth planning for):

  • Finding accuracy ratings from analysts (flag false positives/negatives)
  • Remediation outcome tracking (did the action resolve the finding in the next scan?)
  • Report consumption analytics (which sections are read and shared?)
  • Divergence-outcome correlation (when we have enough data for JSD scoring)

These build the ground-truth dataset needed to validate the research projections with real production data.
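As an example of what instrumenting one of these channels might involve, here is a minimal sketch of remediation outcome tracking. It assumes findings keep stable identifiers from one scan to the next, which is an assumption about the data model rather than something the research states; the function names and status labels are hypothetical.

```python
def remediation_outcomes(remediated_ids, next_scan_ids):
    """Label each remediated finding by whether it reappears in the next scan."""
    next_scan = set(next_scan_ids)
    return {
        fid: ("still_present" if fid in next_scan else "resolved")
        for fid in remediated_ids
    }


def resolution_rate(outcomes):
    """Fraction of remediated findings that did not reappear (None if empty)."""
    if not outcomes:
        return None
    resolved = sum(1 for status in outcomes.values() if status == "resolved")
    return resolved / len(outcomes)


# Hypothetical example: three findings were actioned; one shows up again.
outcomes = remediation_outcomes(["f1", "f2", "f3"], ["f2", "f9"])
rate = resolution_rate(outcomes)
print(outcomes, rate)
```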


How the Research Was Done

The research pipeline (AutoResearchClaw v0.3.1) ran 23 stages over ~12 hours:

  1. Stages 1-3: Literature search across clinical guideline frameworks, reporting standards, and machine learning ensemble methods
  2. Stages 4-7: Gap analysis and literature synthesis — identified 3 structural gaps (no multi-persona acceptance framework, no evidence-to-narrative method, no cross-source disagreement treatment)
  3. Stage 8: Multi-agent hypothesis debate (pragmatist, innovator, contrarian perspectives) — produced 4 hypotheses
  4. Stages 9-16: Experiment design and execution — 50 synthetic NHI scenarios evaluated by 7 simulated personas across 6 conditions
  5. Stages 17-23: Result synthesis, peer review simulation (3 reviewers), paper revision

Limitations acknowledged in the paper:

  • All scores are simulated projections, not measurements from real humans
  • 50 scenarios is too small for statistical significance on AUROC claims
  • Only validated on NHI scenarios (our domain), not broader security categories
  • Two-product maintenance cost not evaluated
  • Legibility inversion may attenuate as executives become more sophisticated
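To make the sample-size limitation concrete, the sketch below applies the Hanley-McNeil approximation for the standard error of an AUROC to the two projected values (0.82 and 0.74). The 25/25 split between threat and benign scenarios is an assumption made purely for illustration; the paper's actual class balance is not stated here. Treating the two estimates as independent is also a simplification, since both methods would be scored on the same scenarios.

```python
import math


def auroc_standard_error(auc, n_pos, n_neg):
    """Hanley-McNeil approximation of the standard error of an AUROC estimate."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# ASSUMPTION: 50 scenarios split evenly into 25 positives and 25 negatives.
N_POS, N_NEG = 25, 25

se_divergence = auroc_standard_error(0.82, N_POS, N_NEG)
se_averaged = auroc_standard_error(0.74, N_POS, N_NEG)

# Simplification: ignores the correlation between the two estimates.
se_difference = math.sqrt(se_divergence**2 + se_averaged**2)
gap = 0.82 - 0.74

print(f"SE(0.82)={se_divergence:.3f}  SE(0.74)={se_averaged:.3f}")
print(f"gap={gap:.2f}  SE(gap)={se_difference:.3f}")
```

Under these assumptions the 0.08 gap is smaller than the standard error of the difference, which is consistent with the paper's own caution about reading significance into the AUROC claims.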

The research is directional evidence, not proof. We're using it to validate sequencing decisions and surface design constraints, not as a specification.