SecurityV0 — Proposed Changes

Companion to the 2026-04-12 architecture audit.

This document translates audit findings into concrete proposals. Each change is described with its current state, what changes, and why the outcome is better. Changes are grouped by category and ordered by dependency — later changes depend on earlier ones being done first.

Part A — Immediate Changes

Sections 1–7 are changes that should be made now, regardless of future architectural direction. They fix bugs, close security gaps, replace a broken worker queue, and lay the foundation for any scale work.

1. Fix the Product Before Fixing the Architecture

The AWS connector is SecurityV0's newest and most important connector. All three of its core functions are currently broken. These must be fixed before any architecture work — rearchitecting a broken product solves the wrong problem.

1.1 CloudTrail Extractor — Implement It

Current state: sv0-connectors/integrations/aws/src/sv0_aws/cli/main.py:146 initializes cloudtrail_evidence = [] and never populates it. The extractor does not exist. Every AWS execution evidence record is empty. The dormant_authority finding rule fires on 100% of AWS Lambda functions because there is no execution evidence to contradict it. Every AWS finding in the platform today is suspect.

Proposed change: Implement the CloudTrail extractor. Query CloudTrail Lake or the S3 event archive for AssumeRole, Invoke, and service API events scoped to the connector's account and time window. Map events to entity IDs and populate execution_evidence records with timestamps, event types, and source IPs.

Why better:

The dormant_authority rule produces correct findings instead of false positives on all Lambda functions
proven_authority and last_seen evidence becomes real
The platform's core value proposition — detecting when NHI permissions are unused — works on AWS for the first time

Depends on: Nothing — this is the starting point

1.2 Assumed-Role ARN Parser — 5-Line Fix

Current state: sv0-connectors/integrations/aws/src/sv0_aws/core/transformer.py:1768 returns None for any ARN in the format arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION. This format covers assumed-role events from Lambda, ECS tasks, Step Functions, and Bedrock agents — the primary NHI workload types in AWS. 80–90% of real AWS workload events are silently dropped.

Proposed change: Add an elif ":assumed-role/" in arn: branch before the return None fallback. Parse the role name and account from the assumed-role ARN and return the corresponding entity ID.

elif ":assumed-role/" in arn:
    parts = arn.split(":")
    account_id = parts[4]
    role_parts = parts[5].split("/")
    # role_parts = ["assumed-role", "ROLE_NAME", "SESSION_NAME"]
    role_name = role_parts[1]
    # Must match the format used by _get_role_node_id_from_arn() for existing
    # role entities; a mismatched format produces IDs that never join to anything.
    return f"aws_iam_role:{account_id}:{role_name}"

Why better:

80–90% of AWS NHI workload events are correctly associated with their identity entity
Authority chain traversal works for assumed-role identities (the dominant pattern in AWS)
CloudTrail evidence (once implemented) maps to the correct entity IDs

Depends on: Nothing — independent fix

1.3 `privilege_justification_gap` Rule — Two Failure Modes to Fix

Current state: This rule produces zero findings on all AWS data due to two independent bugs:

Bug A — Resource ID mismatch (privilege-justification-gap.ts:48-50): path.resource_id is a MongoDB ObjectID hex string (e.g., "507f1f77bcf86cd799439011"), not an ARN. The evidenceMatchesResource() comparison always fails for AWS sources. The resource_key field was introduced to fix this, but CloudTrail evidence records don't have resource_key populated (blocked by the missing CloudTrail extractor, section 1.1).

Bug B — normalized_action never set by AWS connector (transformer.py:1619–1628, path-materializer.ts:147): The path materializer reads perm.properties.normalized_action to populate the actions array on execution paths (path-materializer.ts:147: const action = (perm.properties.normalized_action as string) ?? "unknown"). There is no fallback to properties.action. The Entra-ServiceNow and Azure-Foundry connectors both set this field via the shared normalize_arm_action() helper, producing "read", "write", "admin", "execute". The AWS connector sets only properties.action (raw IAM string: iam:PassRole, s3:GetObject, etc.) and never sets normalized_action. Every AWS execution path therefore has actions: ["unknown"]. No ingestion middleware bridges the gap.

Consequence: privilege_justification_gap's action_mismatch detection branch (hasWriteActions() at line 46, which tests for "write", "admin", "delete", "update", "create", "execute") never triggers on AWS data. The no_activity branch can still fire, but the more precise "granted write, only observed reads" signal is completely dead for all AWS entities.

Scope of impact: scope_drift is NOT affected — it checks role additions against sensitive domains via ep.sensitivity, not via actions. The broken rules are: privilege_justification_gap (action_mismatch sub-type) and any future action-based rules including escalation_capable.

Proposed change:

Fix A: Use resource_key for matching once CloudTrail evidence populates it (already implemented in evidenceMatchesResource() — the fix is in the extractor, not the rule).

Fix B: Add normalized_action emission to the AWS connector's permission node builder:

# sv0-connectors/integrations/aws/src/sv0_aws/core/transformer.py
# In _transform_policy_document(), alongside existing properties.action:

ESCALATION_PREFIXES = {"iam", "sts", "organizations", "iam-access-analyzer"}
WRITE_VERBS = {"put", "create", "delete", "update", "write", "modify", "attach",
               "detach", "tag", "untag", "revoke", "reset", "set", "associate", "remove"}

ESCALATION_ACTIONS = frozenset({
    # IAM privilege escalation
    "iam:passrole", "iam:createrole", "iam:attachrolepolicy", "iam:putrolepolicy",
    "iam:attachuserpolicy", "iam:putuserpolicy", "iam:createpolicy",
    "iam:createpolicyversion", "iam:setdefaultpolicyversion",
    # STS assumption
    "sts:assumerole", "sts:assumerolewithwebidentity", "sts:assumerolewithsaml",
    # AWS Organizations
    "organizations:createaccount", "organizations:delegateadminaccount",
    # GCP
    "iam.serviceaccounts.actas",
    # Azure (normalized lowercase)
    "microsoft.authorization/roleassignments/write",
    "microsoft.authorization/roledefinitions/write",
})

_WRITE_PREFIXES = (
    "put", "create", "delete", "update", "write",
    "attach", "detach", "remove", "add", "set",
    "import", "upload", "publish",
    # Extended mutation verbs (A-M1)
    "batchwrite", "copy", "send", "reject", "enable", "disable",
    "restore", "merge", "push", "abort", "register", "deregister",
    "associate", "disassociate", "grant", "revoke",
    "terminate", "stop", "cancel", "resize", "modify",
    "replace", "replicate", "tag", "untag",
)

_READ_PREFIXES = (
    "get", "list", "describe", "read", "view", "head",
    "check", "test", "preview", "batch_get", "batchget",
    # Extended read verbs (A-M1)
    "receive", "search", "query", "scan", "lookup", "select",
)

# Execute classification — non-mutating invocation and cryptographic operations
# that are neither pure reads nor state mutations. "execute" was moved here
# from _WRITE_PREFIXES (A-M1): the platform's WRITE_LEVEL_ACTIONS set in
# privilege-justification-gap.ts still includes "execute", so the existing
# write-detection behaviour is preserved for actions previously classified
# as "write" via the "execute" prefix.
_EXECUTE_PREFIXES = (
    "invoke", "run", "start",
    "decrypt", "encrypt", "sign", "verify", "generate",
    "execute", "batchexecutestatement",
)

def _normalize_action(raw_action: str) -> str:
    """Deterministic normalizer — no heuristics, no ML.

    Classification precedence:
      1. Exact-match against ESCALATION_ACTIONS → "escalation"
      2. Verb prefix match against _WRITE_PREFIXES → "write"
      3. Verb prefix match against _READ_PREFIXES → "read"
      4. Verb prefix match against _EXECUTE_PREFIXES → "execute"
      5. Fallback → "unknown" (never "admin")
    """
    lowered = raw_action.lower()
    # Use exact match (case-insensitive) against known escalation actions
    # to avoid substring false positives (e.g. iam:GetRole contains "role").
    if lowered in ESCALATION_ACTIONS:
        return "escalation"
    _, _, verb = lowered.partition(":")
    if any(verb.startswith(w) for w in _WRITE_PREFIXES):
        return "write"
    if any(verb.startswith(w) for w in _READ_PREFIXES):
        return "read"
    if any(verb.startswith(w) for w in _EXECUTE_PREFIXES):
        return "execute"
    return "unknown"  # not "admin" — unknown is the lowest-severity default

A-M1 — Extended prefix lists and new "execute" classification: The original prefix lists left common AWS verbs like lambda:InvokeFunction, dynamodb:BatchWriteItem, s3:CopyObject, sqs:SendMessage, kms:Decrypt, and stepfunctions:StartExecution classified as "unknown". A new "execute" classification covers non-mutating invocation and cryptographic operations (invoke, run, start, decrypt, encrypt, sign, verify, generate, execute, batchexecutestatement). The execute prefix was moved out of _WRITE_PREFIXES so mutations and invocations are no longer conflated. This is safe for the platform because WRITE_LEVEL_ACTIONS in privilege-justification-gap.ts already includes "execute", preserving existing write-detection semantics. Implementers must lift the prefix lists and the _normalize_action body verbatim — do not improvise. The connector also ships a regression fixture (tests/unit/fixtures/aws_action_corpus.txt) with ~50 real AWS actions, covering IAM/S3/Lambda/DynamoDB/KMS/SQS/SNS/Step Functions/EC2/RDS/Secrets Manager, parameterised into test_normalize_action_corpus as the drift guard.

Set "normalized_action": _normalize_action(permission["action"]) alongside "action" in the properties dict.

Why better:

privilege_justification_gap write-detection fires on AWS data for the first time
Raw IAM action is still preserved in properties.action for future inspection
Unblocks the escalation_capable rule (section 1.4)
No changes to the platform materializer or any evaluator rule

Depends on: Nothing — purely additive to the connector

1.4 `escalation_capable` — New Evaluator Rule

Current state: No finding rule detects entities that hold roles permitting privilege escalation or impersonation. An NHI with iam:PassRole, iam:CreateRole, iam:AttachRolePolicy, Microsoft.Authorization/roleAssignments/write, or iam.serviceAccounts.actAs can acquire arbitrary permissions. These are the primary lateral movement primitives in cloud environments. SecurityV0 is currently blind to them entirely.

This is a categorical gap: none of the 15 existing rules inspect what a role permits at the permission level. Rules operate on destination sensitivity, execution evidence, and role membership — not on whether the role's actions themselves are escalation-enabling.

Proposed change: New rule src/evaluator/rules/escalation-capable.ts:

const ESCALATION_NORMALIZED_ACTION = "escalation";

export const escalationCapableRule: FindingRule = {
  name: "escalation_capable",

  async evaluate(entity: EntityDoc, ctx: EvaluationContext): Promise<RuleFindingCandidate | null> {
    if (entity.entity_type !== "identity" && entity.entity_type !== "workload") return null;

    const escalationPaths = (entity.execution_paths ?? []).filter(
      p => p.actions.includes(ESCALATION_NORMALIZED_ACTION)
    );
    if (escalationPaths.length === 0) return null;

    // getEvidenceWithRunsAs() traverses RUNS_AS edges so workloads that
    // assume a role are correctly detected as having exercised escalation
    // authority. getExecutionEvidence() misses these because the evidence
    // document is attached to the role, not the workload identity.
    const evidence = await ctx.getEvidenceWithRunsAs(entity._id, 1);
    const exercised = evidence.length > 0;

    const severity: FindingSeverity = exercised ? "critical" : "high";

    // evidenceClaim must be constructed here and spread into the return value.
    // The EvaluationContext.buildEvidenceClaim() method seals the claim with
    // a timestamp and the tenantId so it can be integrity-checked by the
    // evidence assembler. Omitting it means the finding has no traceable
    // evidence link and will fail the evidence pack integrity check.
    const evidenceClaim = ctx.buildEvidenceClaim({
      escalation_path_count: escalationPaths.length,
      exercised,
      via_roles: [...new Set(escalationPaths.flatMap(p => p.via_roles))],
    });

    return {
      findingId: stableFindingId(ctx.tenantId, "escalation_capable", entity._id),
      findingType: "escalation_capable",
      severity,
      status: "active",
      entityId: entity._id,
      affectedResources: escalationPaths.map(p => p.resource_id),
      explanation: `Entity holds ${escalationPaths.length} escalation-capable permission(s) ` +
        `(IAM/role-manipulation actions). ${exercised ? "Execution evidence present — actively used." : "No execution evidence — standing risk."}`,
      evidenceRefs: {
        escalation_path_count: escalationPaths.length,
        exercised,
        via_roles: [...new Set(escalationPaths.flatMap(p => p.via_roles))],
      },
      ...evidenceClaim,
    };
  }
};

Add "escalation_capable" to FINDING_TYPES in domain/findings/types.ts and register the rule in the evaluator orchestrator.

Why better:

Detects the most dangerous NHI pattern in cloud environments: automation that can grant itself or others arbitrary permissions
Severity auto-escalates to critical when CloudTrail confirms the capability was exercised
Deterministic, zero-heuristic — same logic as all other rules
Directly answers the CISO question: "which of our automations can become anything?"
Several vendor platforms already detect this capability; SecurityV0 currently does not, and this rule closes that gap.

Depends on: 1.3 Bug B fix (normalized_action in AWS connector). Without it, the rule never fires on AWS data.

1.5 ServiceNow Pagination — Fix 429 Break

Current state: sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/adapters/servicenow_client.py:421:

if response.status_code != 200:
    break

Any non-200 response, including 429 (rate limited) and 503 (service unavailable), silently stops pagination and returns a partial result. Baselines built from truncated data produce false findings. The connector reports success.

Proposed change:

if response.status_code == 429:
    # Cap at 5 minutes: a runaway Retry-After (misconfigured SNow, adversarial header)
    # could otherwise hold the worker indefinitely and block other tenants.
    try:
        retry_after = int(response.headers.get("Retry-After", 60))
    except ValueError:
        # Retry-After may also be an HTTP-date; fall back to the default delay.
        retry_after = 60
    time.sleep(max(0, min(retry_after, 300)))
    continue  # retry this page, don't advance cursor
elif response.status_code != 200:
    raise ConnectorError(f"Unexpected status {response.status_code}")

Note: urllib3 retries apply at the adapter level and cover transient TCP/TLS failures before the pagination loop ever sees a status code. This fix addresses the separate issue where the pagination cursor breaks silently after urllib3 retries are exhausted and the 429 reaches application code.
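The Retry-After header can carry either a number of seconds or an HTTP-date, so the parsing step is worth isolating. The helper below is a sketch under stated assumptions: parse_retry_after and its two constants are hypothetical names, the 300-second cap and 60-second default are taken from the proposal, and nothing here is existing connector code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

MAX_RETRY_AFTER_SECONDS = 300      # the 5-minute cap from the proposal
DEFAULT_RETRY_AFTER_SECONDS = 60   # used when the header is absent or unparseable


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> int:
    """Return a bounded sleep time in seconds for a Retry-After header value."""
    if value is None:
        return DEFAULT_RETRY_AFTER_SECONDS
    value = value.strip()
    try:
        seconds = int(value)  # delta-seconds form, e.g. "120"
    except ValueError:
        try:
            target = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return DEFAULT_RETRY_AFTER_SECONDS
        if target.tzinfo is None:
            target = target.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = int((target - now).total_seconds())
    # Clamp to [0, cap] so negative or runaway values cannot stall the worker.
    return max(0, min(seconds, MAX_RETRY_AFTER_SECONDS))
```

The pagination loop would then call time.sleep(parse_retry_after(response.headers.get("Retry-After"))). A bounded retry counter per page is a natural companion, since a persistent 429 would otherwise loop forever.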

Why better:

ServiceNow baseline is complete, not silently truncated
Rate limiting is handled correctly instead of terminating the sync
Errors surface as actual failures instead of silent partial results
Cap prevents a malformed or adversarial Retry-After from stalling the worker indefinitely

Depends on: Nothing — independent fix

2. Security — Fix the Auth Gaps

These are not architectural problems. They are implementation gaps that must be closed before any customer is onboarded.

2.1 `REQUIRE_AUTH` Default — Invert It

Current state: sv0-platform/docker-compose.deploy.yml:46:

REQUIRE_AUTH: "${REQUIRE_AUTH:-false}"

If the production .env file omits this variable, the entire API is unauthenticated. Any request with an X-Tenant-Id header gets full * scopes to any tenant's data. One forgotten environment variable is a cross-tenant IDOR.

Proposed change:

REQUIRE_AUTH: "${REQUIRE_AUTH:-true}"

Default is secure. Development environments explicitly opt out.

Why better:

A misconfigured production deploy fails closed instead of open
Follows the security principle: safe default, explicit opt-out for development

Depends on: Nothing

2.2 `DevAuthProvider` — Add Production Gate

Current state: src/api/auth/provider-factory.ts accepts AUTH_PROVIDER=dev without checking environment. If deployed to production with this setting, any caller who reaches /auth/callback?code=dev-bypass receives a super-admin session. No code-level guard prevents it.

Proposed change:

if (provider === "dev" && process.env.NODE_ENV === "production") {
  throw new Error("DevAuthProvider cannot be used in production. Set AUTH_PROVIDER=workos.");
}

Process crash on startup — loud, not silent.

Why better:

A misconfigured production deploy fails at startup instead of silently allowing auth bypass
The error message is explicit and actionable

Depends on: Nothing

2.3 Mount the New Auth Middleware

Current state: src/api/auth/auth-middleware.ts implements the full session → tenant → membership validation pipeline. src/api/app.ts lines 26–29 explicitly note it is not yet mounted. The old authMiddleware is live in production — it has no membership model, no role validation, and the REQUIRE_AUTH=false bypass.

Proposed change: Complete WorkOS integration and mount createSessionMiddleware → createTenantMiddleware → createMembershipMiddleware in app.ts. Remove the old authMiddleware. The new pipeline is already written — it needs WorkOS credentials and end-to-end testing.

Why better:

Tenant membership is validated on every request (users can only access tenants they belong to)
Role-based access is enforced at the middleware layer
The WorkOS-backed verifySession() runs instead of the stub returning null
Super-admin access is tracked through the proper org-membership check

Depends on: 2.1, 2.2

2.4 Super-Admin Check — Replace Email Domain with Allowlist

Current state: sv0-platform/src/api/routes/auth.ts:76:

const isSuperAdmin = result.email.endsWith("@securityv0.com");

This grants super-admin to every @securityv0.com account — all employees, all contractors, anyone with that domain. A domain takeover or email provider breach grants platform-wide super-admin. A terminated employee retains super-admin for 7 days (iron-session TTL).

Proposed change: Replace with explicit RBAC membership check: read super-admin status from the WorkOS organization's role claim, not the email domain. Maintain a named allowlist of specific user IDs in configuration, not a domain pattern.

Why better:

Super-admin access is revocable immediately (deprovisioning from WorkOS org revokes it)
Adding a @securityv0.com email to any OAuth provider no longer grants platform access
Least privilege: only explicitly designated users have super-admin, not everyone at the company

Depends on: 2.3

2.5 Add BFS Document Limit

Current state: sv0-platform/src/storage/mongo/adapters/subgraph-adapter.ts:158 — reverse-lookup query has no .limit(). A sufficiently dense or deep graph triggers a full collection scan, loading 50K+ documents into the Node.js heap. In a 512MB container, this is a denial-of-service vector against all tenants via the dashboard API.

Proposed change:

.find({ tenant_id: tenantId, "relationships.target_id": { $in: frontier } })
.limit(MAX_BFS_DOCUMENTS)  // e.g., 5000
.toArray()

Return a GraphTruncated warning in the response when the limit is hit. The frontend already shows graph size warnings.
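The budget-and-flag idea can be shown independently of MongoDB. The sketch below is written in Python for brevity (the platform adapter itself is TypeScript) and works on an in-memory reverse-edge map. The function name, its parameters, and the choice of a total document budget across the whole traversal (rather than a per-query limit) are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def bounded_reverse_bfs(
    reverse_edges: Dict[str, List[str]],
    start: str,
    max_documents: int,
    max_depth: int = 10,
) -> Tuple[Set[str], bool]:
    """Breadth-first reverse lookup with a hard cap on visited documents.

    reverse_edges maps a target ID to the IDs of documents that point at it.
    Returns (visited_ids, truncated). truncated is True only when the
    document budget stopped the traversal early.
    """
    visited: Set[str] = {start}
    frontier: List[str] = [start]
    for _ in range(max_depth):
        next_frontier: List[str] = []
        for target in frontier:
            for source in reverse_edges.get(target, []):
                if source in visited:
                    continue
                if len(visited) >= max_documents:
                    return visited, True  # budget exhausted: report truncation
                visited.add(source)
                next_frontier.append(source)
        if not next_frontier:
            break
        frontier = next_frontier
    return visited, False
```

The boolean is what a caller would translate into the GraphTruncated warning, so the memory bound and the user-visible signal come from the same check.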

Why better:

One large tenant's graph query cannot exhaust the API process's memory
All tenants sharing the API process are protected from one tenant's dashboard use
Truncation is visible to the user rather than causing a silent OOM crash

Depends on: Nothing — independent fix

3. Replace the Worker Queue

This is the single most impactful infrastructure change. Everything else — event-driven sync, per-tenant job priority, cell architecture — becomes easier once this is done.

3.1 Replace `WorkerJob[]` Array — Queue Implementation Comparison

Current state: sv0-platform/src/workers/runtime.ts:26:

private readonly queue: WorkerJob[] = [];

This is an in-process, in-memory, unbounded JavaScript array. It processes one job at a time, serially. Jobs are lost on process restart. There is no retry, no dead-letter queue, no visibility into queue depth, no way to prioritize jobs. The worker and the API share the same Node.js process — a stuck sync job degrades API response times.

Three viable replacements — comparison:

Dimension	BullMQ + Redis	pg_boss (PostgreSQL)	graphile-worker (PostgreSQL)
New infrastructure	Redis required	PostgreSQL (already in stack if using TimescaleDB)	PostgreSQL
Jobs survive restart	Yes — Redis AOF	Yes — ACID transactions	Yes — ACID transactions
Concurrency	4+ parallel workers	Configurable workers	Configurable workers
DLQ / retry	Built-in	Built-in (failed jobs table)	Built-in
Priority queues	Yes — BullMQ priority field	Yes — `priority` column	Yes — `priority` column
Job visibility	Grafana + Bull Board	SQL query on jobs table	SQL query on jobs table
Cell compatibility	Each cell gets own Redis	Each cell gets own PostgreSQL (already needed)	Each cell gets own PostgreSQL
SKIP LOCKED	No (Redis-based)	Yes — standard PostgreSQL pattern	Yes
Cron/scheduled jobs	Yes	Yes	Yes
TypeScript support	First-class	First-class	First-class

Recommendation:

If TimescaleDB is adopted (section 10), PostgreSQL is already in the stack. pg_boss or graphile-worker then removes the need for Redis entirely — one fewer infrastructure component. BullMQ + Redis makes sense if Redis is already committed for other reasons (session revocation, rate limiting).

Correctness note on graphile-worker priority: graphile-worker orders its SKIP LOCKED dequeue by priority and then by scheduled run time, and numerically smaller priority values run first. The semantics are correct, but there is no per-queue default, so priority must be set per job. pg_boss has a native priority column with ordered dequeue. The table marks them identically; in practice pg_boss is the cleaner choice if priority queues matter.

With pg_boss v10 (PostgreSQL-backed queue):

// pg-boss v10 API. Note: v10 requires queues to be created explicitly, removed
// the teamSize option from work(), and passes work() handlers an array of jobs.
const boss = new PgBoss('postgresql://localhost/sv0');
await boss.start();

// Create queues with retry config
await boss.createQueue('sync_ingestion', { retryLimit: 3, retryDelay: 30 });
await boss.createQueue('evaluate_findings', { retryLimit: 3 });
await boss.createQueue('build_evidence_pack', { retryLimit: 5 });

// v10 worker API: register work() once per concurrent worker (4 here)
const handleSync = createSyncIngestionHandler(storage);
for (let i = 0; i < 4; i++) {
  await boss.work('sync_ingestion', async ([job]) => handleSync(job));
}

The worker handlers (createSyncIngestionHandler, createEvaluateFindingsHandler, createBuildEvidencePackHandler) don't change.

With BullMQ + Redis (if Redis already exists):

// BullMQ queue — same handler contract
const syncQueue = new Queue('sync_ingestion', { connection: redisConnection });
const worker = new Worker('sync_ingestion', createSyncIngestionHandler(storage), {
  connection: redisConnection, concurrency: 4,
});

Why better than the array (both options):

Dimension	Current (WorkerJob[] array)	pg_boss or BullMQ
Jobs survive restart	None — all lost	All — persisted
Concurrency	1 job at a time	4+ parallel workers
Saturation point (daily)	35 tenants	~140 tenants (4× workers)
Stuck job behavior	Blocks all tenants forever	Timeout → DLQ → alert
Job priority	None — pure FIFO	Per-tenant priority lanes
Visibility	None	Queue depth, job latency, error rate

Depends on: Redis (if BullMQ) OR PostgreSQL (if pg_boss/graphile-worker — already needed for TimescaleDB)

4. Event-Driven Sync

The current sync model re-processes the entire tenant graph on every run. This is the root cause of the O(I×R×P×Res) write amplification, the worker queue saturation, and the hours-long detection latency. Event-driven sync fixes all three.

4.1 Connector Delta Mode — Send Changes, Not Full Snapshots (Near-Term)

Current state: Every connector run sends the complete tenant graph — all entities, all relationships — regardless of what changed. The platform re-ingests, diffs, and re-materializes everything. For a 5,000-entity tenant this is ~640,000 MongoDB reads per sync.

Proposed change: Connectors cache the previous sync's entity state (hashed by source ID). On the next run, they diff current state against cache and send only changed entities — new, modified, or deleted. The platform's affectedEntityIds parameter in the path materializer already supports incremental updates. Connectors just need to use it.

This requires no platform changes. Only connector-side delta logic.
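The connector-side diff can be sketched as follows. This is an illustration under assumptions, not the connectors' existing code: entity_hash and compute_delta are hypothetical names, entities are assumed to be JSON-serializable dicts keyed by source ID, and the cache is assumed to store one digest per source ID.

```python
import hashlib
import json
from typing import Any, Dict, List


def entity_hash(entity: Dict[str, Any]) -> str:
    """Stable content hash; key order must not affect the digest."""
    canonical = json.dumps(entity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compute_delta(
    previous_hashes: Dict[str, str],
    current: Dict[str, Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Compare cached digests (source ID -> hash) with the current snapshot."""
    current_hashes = {sid: entity_hash(entity) for sid, entity in current.items()}
    shared = current_hashes.keys() & previous_hashes.keys()
    return {
        "created": sorted(current_hashes.keys() - previous_hashes.keys()),
        "modified": sorted(
            sid for sid in shared if current_hashes[sid] != previous_hashes[sid]
        ),
        "deleted": sorted(previous_hashes.keys() - current_hashes.keys()),
    }
```

Only the IDs in created, modified, and deleted need to be submitted and passed on as the affected entities. The cache should be overwritten with the new digests only after the platform acknowledges the submission, otherwise a failed upload would hide the change from the next run.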

Why better:

Metric	Full snapshot sync	Delta sync
Entities sent per sync (typical change)	5,000	20–50
MongoDB reads (path materialization)	~640,000	~1,300
Worker time per sync	85–240 seconds	5–15 seconds
Worker saturation point (daily)	35 tenants (+ BullMQ: 140)	~2,000+ tenants
Detection latency	Next sync window	Same sync window (same latency, just cheaper)

Depends on: Nothing for correctness — delta mode produces correct results with the current serial queue. 3.1 (BullMQ) is a throughput optimization: parallel workers let fast delta jobs run concurrently rather than waiting behind slow full-snapshot jobs.

4.2 True Event-Driven Sync — Real-Time Change Detection (Medium-Term)

Current state: The platform detects permission changes on the next scheduled scan. If a service account gains admin access at 9am and the daily sync runs at midnight, the platform doesn't know for 15 hours.

Proposed change: Subscribe connectors to cloud provider change event streams:

AWS: CloudTrail → SQS queue → connector event processor watches for CreatePolicy, AttachRolePolicy, PutRolePolicy, AssumeRole events
Entra ID: Microsoft Graph change notifications (webhooks) for service principal and app role assignment changes
ServiceNow: Change event webhooks on cmdb_ci_service_account and sys_user_has_role tables

Each event triggers a targeted sync of only the affected entities, not the full graph.

Architecture:

Cloud Event Stream (CloudTrail SQS / Graph Webhook / ServiceNow webhook)
         ↓
  Connector (long-running service, not CLI)
         ↓
  POST /api/v1/ingest/normalized-graph (delta only, affected entities)
         ↓
  Platform: path materialization for affected entities only (~1,300 reads)
         ↓
  Evaluator: re-run rules for affected entities only
         ↓
  Finding surfaced: within 15–60 seconds of the cloud event

A daily reconciliation scan (full graph) runs alongside to catch any events the stream missed.

Why better:

Metric	Cron-based batch	Event-driven
Detection latency	Hours (next scheduled scan)	15–60 seconds
Platform load	Proportional to graph size × tenants	Proportional to actual IAM change rate
Worker queue depth	Spikes at sync windows	Flat — constant small jobs
Scale ceiling	~140 tenants (BullMQ + 4 workers)	Effectively unlimited at normal change rates
Security value	Historical snapshot	Real-time authority drift detection

Detection latency drops from hours (next scheduled scan) to 15–60 seconds — the difference between catching a permission escalation before damage and finding it in the audit log after.

Depends on: 1.1 (CloudTrail for AWS), 3.1 (BullMQ for parallel small jobs), 4.1 (delta logic)

Operational change for customers: Connectors become long-running services instead of cron jobs. Customers deploy them as a container or systemd service rather than adding to crontab. The --submit push model is unchanged; only the connector's runtime model changes.

5. Tenant Isolation — Without Cell Architecture

These three changes address the real current isolation risks without building a control plane.

5.1 Per-Tenant MongoDB Collections via StorageAdapter

Current state: All 23 MongoDB collections are shared across tenants, separated only by tenant_id field predicates. One missing tenant_id filter in any of 50+ query paths is a platform-wide cross-tenant data leak. MongoDB has no backstop — the application is the sole enforcement point.

Proposed change: Add a tenantId → collectionName resolver inside the StorageAdapter implementation. Reads and writes route to tenant-specific collections (e.g., entities_acme, entities_globex) within the same MongoDB instance.

// Inside MongoStorageAdapter — only this file changes
private collectionName(tenantId: string, base: string): string {
  return `${base}_${tenantId.replace(/-/g, "_")}`;
}

// Before:
db.collection("entities").find({ tenant_id: tenantId, ... })

// After:
db.collection(this.collectionName(tenantId, "entities")).find({ ... })
// tenant_id filter no longer needed — the collection IS the isolation

Application code, connectors, API routes, and evaluators: zero changes. The StorageAdapter interface is the only surface area.

Why better:

A missing tenant_id filter can no longer leak cross-tenant data: every query is routed to the collection resolved for its own tenant, so another tenant's documents are never in scope
Each collection has independent indexes — a large tenant's query doesn't compete with a small tenant's index scans
Collection-level backup and restore per tenant (useful for enterprise data export requirements)
Each collection maps directly to what a cell's database would contain — this is the migration step before physical cell extraction

Depends on: Nothing — StorageAdapter abstraction already provides the boundary

5.2 Per-Tenant Rate Limiting at the API Layer

Current state: No per-tenant rate limiting exists at the API layer. One tenant can flood the ingest endpoint with large payloads, saturating the worker queue and degrading all other tenants' sync windows.

Proposed change: Token bucket rate limiter keyed by tenantId in Express middleware. Configurable per-tier limits (evaluation tenants: 10 syncs/hour; production tenants: 100 syncs/hour; enterprise tenants: unlimited with quota monitoring).
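The token bucket logic itself is small. The sketch below is written in Python for brevity, whereas the real middleware would live in the Express/TypeScript API. The class name, the injected clock, and the in-memory state dict are all assumptions made for illustration, and a multi-process deployment would need shared state instead.

```python
import time
from typing import Callable, Dict, Tuple


class TenantTokenBucket:
    """One token bucket per tenant; each allowed request spends one token."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        # tenant_id -> (tokens remaining, time of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(tenant_id, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens >= 1.0:
            self._state[tenant_id] = (tokens - 1.0, now)
            return True
        self._state[tenant_id] = (tokens, now)
        return False
```

Under these assumptions the evaluation tier's 10 syncs/hour would be TenantTokenBucket(capacity=10, refill_per_second=10 / 3600). A rejected request maps naturally to an HTTP 429 with a Retry-After header.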

Why better:

One tenant's aggressive connector usage cannot degrade another tenant's pipeline
Limits are visible and configurable per customer tier
Provides an operational signal ("Tenant X is hitting rate limits") that prompts a review of that tenant's tier

Depends on: Nothing — middleware change only

5.3 Per-Tenant Job Priorities in BullMQ

Current state: All tenant jobs share one FIFO queue. An evaluation tenant's sync job blocks an enterprise customer's sync job that arrived 1 millisecond later.

Proposed change: Assign BullMQ job priorities based on tenant tier. Enterprise: priority 1. Production: priority 5. Evaluation: priority 10. BullMQ treats lower numbers as higher priority, so enterprise jobs are dequeued first.

Why better:

Enterprise customers get faster evidence pack generation regardless of queue depth
Evaluation tenants don't consume resources at the expense of paying customers
Priority configuration enables tier-based service level differentiation

Depends on: 3.1 (BullMQ)

6. Infrastructure — Docker Compose to k3s

Current state: Docker Compose runs in production on two Hetzner CPX21 VMs. Docker Compose is a development and single-host orchestration tool. It cannot:

Scale horizontally across hosts
Do rolling deployments without downtime
Route traffic away from failed containers automatically
Enforce resource limits in a meaningful way
Automate cell provisioning

Proposed change: Migrate to k3s (lightweight Kubernetes) on the existing Hetzner hardware. Same VMs, same container images, same Dockerfiles — only the orchestration layer changes.

# Install k3s on existing Hetzner CPX21
curl -sfL https://get.k3s.io | sh -

# Deploy via Helm charts (one chart per service)
helm install sv0-platform ./charts/sv0-platform
helm install sv0-workers ./charts/sv0-workers

Why better:

Capability	Docker Compose	k3s
Rolling deployments	Downtime on every deploy	Zero-downtime rolling update
Health-based routing	Manual	Automatic liveness/readiness probes
Resource quotas	Soft limits only	Hard enforced limits + eviction
Horizontal scaling	Not possible	Add nodes, scale replicas
Cell provisioning	Impossible	`helm install cell-eu-02 ./charts/sv0-cell`
Secret management	`.env` files on disk	Kubernetes Secrets + external-secrets operator

Cell architecture becomes a Helm chart: Once on k3s, provisioning a new cell is:

helm install cell-eu-02 ./charts/sv0-cell \
  --set region=eu \
  --set mongodb.tier=M20 \
  --set workers.replicas=4

New cell live in 15 minutes, zero downtime for other cells. This is impossible on Docker Compose.

Depends on: Nothing — infrastructure change, no code changes

7. Cell Architecture — Incremental, Triggered

Cell architecture is the right long-term direction. It is not the right immediate investment. The incremental path avoids a big-bang rewrite.

The triggers — all must be true before starting:

100+ tenants with active sync workloads
Demonstrated customer requirement for physical data isolation (not just field-level tenant_id)
Measured noisy-neighbor degradation (actual P95 latency correlation between tenants, not theoretical)
All items in sections 1–6 above are complete
Operational capacity sufficient to manage multiple independent MongoDB instances, Redis instances, and Helm deployments

When triggered — what the migration looks like:

The StorageAdapter per-tenant collections (5.1) map directly to a cell's database. BullMQ (3.1) is already cell-native — each cell gets its own Redis instance. k3s (6) makes cell provisioning a Helm command. The only genuinely new component is the cell router — a thin stateless proxy (~200 lines) that maps tenant_id → cell from a registry table.

Already done:        Per-tenant collections (maps to per-cell DB)
Already done:        BullMQ (each cell gets independent Redis queue)
Already done:        k3s (cell = helm install)

New:                 Cell router service
                     Reads tenant-to-cell map from Postgres table
                     Proxies connector pushes and dashboard requests to correct cell
                     Connectors don't change — they push to api.securityv0.com as before

New:                 First enterprise single-tenant cell
                     One Helm install, one Atlas M30, one tenant
                     This IS cell architecture — cell roster of 1
                     No generalized control plane yet

Later:               General cell provisioning when customer count justifies it

Application code, connectors, API routes: zero changes at any step. The migration is additive.
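The cell router's core lookup could be as small as the sketch below. It assumes the tenant-to-cell registry has already been loaded into memory; `CellRegistry`, `CellRecord`, the fallback to a default shared cell, and the `.internal` URLs are all hypothetical and not taken from the codebase.

```typescript
interface CellRecord {
  cellId: string;
  baseUrl: string;
}

class CellRegistry {
  constructor(
    private readonly tenantToCell: Map<string, string>,
    private readonly cells: Map<string, CellRecord>,
    private readonly defaultCellId: string, // assumption: unmapped tenants stay on a shared cell
  ) {}

  /** Maps tenant_id to its cell, falling back to the default cell. */
  resolve(tenantId: string): CellRecord {
    const cellId = this.tenantToCell.get(tenantId) ?? this.defaultCellId;
    const cell = this.cells.get(cellId);
    if (!cell) throw new Error(`Unknown cell "${cellId}" for tenant "${tenantId}"`);
    return cell;
  }

  /** Upstream URL the proxy would forward an incoming request path to. */
  upstreamUrl(tenantId: string, path: string): string {
    const base = this.resolve(tenantId).baseUrl.replace(/\/+$/, "");
    return `${base}${path.startsWith("/") ? path : `/${path}`}`;
  }
}

const registry = new CellRegistry(
  new Map([["tenant-acme", "cell-eu-02"]]),
  new Map([
    ["shared", { cellId: "shared", baseUrl: "http://shared.internal/" }],
    ["cell-eu-02", { cellId: "cell-eu-02", baseUrl: "http://cell-eu-02.internal" }],
  ]),
  "shared",
);
```

Because the lookup is a pure function of the registry contents, the proxy itself stays stateless and can be replicated freely.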

Part B — Conditional Architecture

Sections 8–13 are architectural options, not a roadmap. Each is independently adoptable. Adopt when the specific trigger condition is met — not before. The appropriate trigger for each section is stated in the section header.

8. Event Sourcing — Formalize the Events Collection

8.1 Two Field Additions That Open the Entire ES Migration Path

Current state: The events collection already exists. diff-engine.ts already produces deterministic EventDoc[] with 16 typed event kinds (role_assigned, permission_granted, credential_rotated, etc.) via buildEventSourceRecordId(). insertEvents() is already in the StorageAdapter interface. However, two problems prevent the events collection from being a real event store:

schema.ts:5 sets TWO_YEARS_SECONDS = 63_072_000 as a TTL on the events collection — events older than 2 years are deleted. Event sourcing requires an immutable, permanent log.
Events have no sequence_number per tenant. Without ordering guarantees, you cannot replay events to reconstruct state or detect gaps.

Proposed change:

// src/domain/events/types.ts — add two fields to EventDoc
interface EventDoc {
  // existing fields unchanged...
  event_id: string;          // was already computable via buildEventSourceRecordId() — now primary key
  sequence_number: number;   // monotonically increasing per (tenant_id, entity_id)
  // rest unchanged
}

Remove the TTL index from the events collection in schema.ts:303. Make event_id the unique index key.

Generating sequence_number safely: MongoDB has no native sequence primitive. Do not derive sequence_number from a BullMQ job counter or from Date.now() — parallel workers produce duplicates or collisions. Use an atomic $inc on a dedicated counters collection:

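// In StorageAdapter: pass the same ClientSession that writes the EventDoc
import type { ClientSession, Db } from "mongodb";

async function nextSequenceNumber(
  db: Db, tenantId: string, entityId: string, session?: ClientSession
): Promise<number> {
  const counter = await db.collection<{ _id: string; seq: number }>("counters").findOneAndUpdate(
    { _id: `${tenantId}:${entityId}` },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: "after", session }
  );
  // Node driver v6+ returns the document itself; earlier versions wrap it in { value }.
  if (!counter) throw new Error("counter upsert returned no document");
  return counter.seq;
}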

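This yields unique, strictly increasing sequence numbers per counter even with 4 parallel ingestion workers, because $inc is atomic on a single document. The sequence is gap-free only when the counter increment and the event write commit in the same transaction; otherwise a failed event write leaves a gap. Use a session/transaction whenever replay correctness depends on gap detection.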

Why better:

Zero structural change — all existing queries use queryEvents() filtered by entityId, eventType, since/until — none break
Unlocks time-travel queries: "what authority paths existed on March 15?" (replay events up to that timestamp)
Fixes the evidence pack integrity gap: bind EvidencePackDoc.integrity_hash to the event sequence range at assembly time, not just a point-in-time content hash
Every subsequent pattern (Kuzu projection, Kafka publishing, federated edge event feed) becomes a consumer of this event stream
entity_versions collection already provides entity-level snapshots; the event store provides the change log that makes those snapshots trustworthy
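The gap detection that sequence numbers enable can be sketched in a few lines. `SequencedEvent` and `findSequenceGaps` are hypothetical names; the input is assumed to be the events of a single counter scope.

```typescript
interface SequencedEvent {
  event_id: string;
  sequence_number: number;
}

/** Returns the sequence numbers missing between the lowest and highest seen. */
function findSequenceGaps(events: SequencedEvent[]): number[] {
  const sorted = events.map((e) => e.sequence_number).sort((a, b) => a - b);
  const gaps: number[] = [];
  for (let i = 1; i < sorted.length; i++) {
    // Every integer strictly between two neighbours is a missing event.
    for (let missing = sorted[i - 1] + 1; missing < sorted[i]; missing++) {
      gaps.push(missing);
    }
  }
  return gaps;
}
```

A replay job could run this check before reconstructing state and refuse to proceed when the result is non-empty.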

Depends on: Nothing — purely additive

8.2 Evidence Pack Integrity — Bind to Event Range

Current state: src/evidence/integrity.ts computes SHA256 over pack content at assembly time. The pack references findings via finding_id, but findings are mutable — a future evaluator run updates them in place. The pack is sealed, but what it references is not. An auditor cannot independently verify the finding state that triggered the pack.

Proposed change: Augment computeIntegrityHash() to include source_sync_id and the highest sequence_number from the events log at assembly time. Store as source_event_range: { from_sequence: number, to_sequence: number } on EvidencePackDoc.
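One way to bind the hash to the event range is sketched below. It is not the existing computeIntegrityHash(): `computeBoundIntegrityHash` and `IntegrityInput` are hypothetical, and `packContent` is assumed to be a canonical serialization of the pack body.

```typescript
import { createHash } from "node:crypto";

interface SourceEventRange {
  from_sequence: number;
  to_sequence: number;
}

interface IntegrityInput {
  packContent: string; // assumed: canonical serialization of the pack body
  source_sync_id: string;
  source_event_range: SourceEventRange;
}

/** SHA-256 over the pack content plus the sync ID and event sequence range. */
function computeBoundIntegrityHash(input: IntegrityInput): string {
  const { from_sequence, to_sequence } = input.source_event_range;
  if (from_sequence > to_sequence) {
    throw new Error("from_sequence must not exceed to_sequence");
  }
  // JSON array encoding keeps field boundaries unambiguous in the hashed bytes.
  const material = JSON.stringify([
    input.packContent,
    input.source_sync_id,
    from_sequence,
    to_sequence,
  ]);
  return createHash("sha256").update(material, "utf8").digest("hex");
}
```

Changing either bound of the range changes the hash, so a pack cannot be re-pointed at a different slice of the event log without detection.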

Why better:

An evidence pack is now cryptographically bound to an immutable event range
Auditors can replay events from_sequence..to_sequence and verify the pack content is correct
Supports tamper-evident audit trail requirements (for example SOC 2 CC7.2 and ISO 27001 A.12.4)

Depends on: 8.1

9. Kuzu — Native Graph Read Model

9.1 Replace MongoDB BFS with In-Process Graph Database

Current state: Path materialization in path-materializer.ts issues deeply nested getEntitiesByIds() calls:

// path-materializer.ts — the nested round-trip loop
const roles = await storageAdapter.getEntitiesByIds(tenantId, roleIds);       // N round-trips
for (const role of roles) {
  const permissions = await storageAdapter.getEntitiesByIds(tenantId, permIds); // N×R round-trips
  for (const perm of permissions) {
    const resources = await storageAdapter.getEntitiesByIds(tenantId, resourceIds); // N×R×P round-trips
  }
}

For 500 identities with 8 roles each and 10 permissions per role, that is 500 + (500 × 8) + (500 × 8 × 10) = 44,500, or roughly 45,000 MongoDB round-trips per sync. The BFS reverse-lookup in subgraph-adapter.ts:63-66 has no .limit(), so it can load thousands of documents per hop.

Proposed change: Add Kuzu (embedded in-process graph DB, MIT license, no new infrastructure) as a read model alongside MongoDB. The StorageAdapter interface is unchanged — only the routing inside a new HybridStorageAdapter changes:

// src/storage/kuzu/hybrid-adapter.ts
export class HybridStorageAdapter implements StorageAdapter {
  constructor(
    private readonly mongo: MongoStorageAdapter,
    private readonly kuzu: KuzuGraphAdapter
  ) {}

  // Graph traversal → Kuzu (single Cypher query replaces nested loop)
  getSubgraph(tenantId: string, query: SubgraphQuery) {
    return this.kuzu.getSubgraph(tenantId, query);
  }

  // Entity writes → dual write (MongoDB is source of truth)
  async upsertEntity(entity: EntityDoc) {
    const result = await this.mongo.upsertEntity(entity);
    await this.kuzu.syncEntity(entity);  // project structural fields only
    return result;
  }

  // Everything else → MongoDB unchanged
  getEntity(...args: Parameters<StorageAdapter["getEntity"]>) { return this.mongo.getEntity(...args); }
  // ...all other 58 methods delegate to mongo
}

The path materialization Cypher query that replaces 45,000 MongoDB round-trips:

// One query per entity replaces that entity's entire nested loop
MATCH (e:Entity {entity_id: $entityId, tenant_id: $tenantId})
      -[:HAS_ROLE]->(role:Entity)
      -[:GRANTS]->(perm:Entity)
      -[:APPLIES_TO]->(resource:Entity)
RETURN e.entity_id, role.entity_id, perm.entity_id,
       perm.properties_action AS action, resource.entity_id,
       resource.sensitivity

UNION ALL

MATCH (e:Entity {entity_id: $entityId, tenant_id: $tenantId})
      -[:RUNS_AS]->(identity:Entity)
      -[:HAS_ROLE]->(role:Entity)-[:GRANTS]->(perm:Entity)
      -[:APPLIES_TO]->(resource:Entity)
RETURN e.entity_id, role.entity_id, perm.entity_id,
       perm.properties_action AS action, resource.entity_id,
       resource.sensitivity

New queries that become possible with Kuzu (impossible in MongoDB today):

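// Shortest path between any workload and a sensitive resource
// (Kuzu expresses this with the SHORTEST keyword on a bounded recursive relationship, not Neo4j's shortestPath())
MATCH p = (w:Entity {entity_id: $workloadId})-[* SHORTEST 1..30]->(r:Entity {entity_id: $resourceId})
RETURN p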

// Blast radius: how many workloads lose access if this role is deleted?
MATCH (role:Entity {entity_id: $roleId})<-[:HAS_ROLE]-(w:Entity)
RETURN count(w) AS affected_workloads

// MAX_AUTH_CHAIN_DEPTH=1 can be removed — Kuzu handles arbitrary depth
MATCH (e:Entity)-[:AUTHENTICATES_TO*1..4]->(crossSystem:Entity)
      -[:HAS_ROLE]->(r)-[:GRANTS]->(p)-[:APPLIES_TO]->(res)
WHERE e.tenant_id = $tenantId
RETURN e.entity_id, res.entity_id, res.sensitivity

Why better:

Metric	Current (MongoDB BFS)	With Kuzu
Path materialization round-trips	~45,000 per sync (500 identities)	1 Cypher query per identity
Subgraph BFS per hop	Full collection scan, no limit	Adjacency-list traversal, O(degree)
accessible_by write-amplification	8,000 resource upserts per sync	Eliminated — answered on demand
MAX_AUTH_CHAIN_DEPTH constraint	=1 (deeper is too expensive)	Arbitrary depth in Cypher
Cross-system lateral movement paths	Not queryable	Single variable-length path query

Cons:

Dual-write consistency risk: MongoDB write succeeds, Kuzu sync crashes → Kuzu is stale. Mitigation: startup rebuild from MongoDB (seconds for 5K entities, in-process)
Kuzu is v0.x (pre-1.0 as of 2026) — newer, but production-grade query engine
One Kuzu database directory per tenant — LRU cache of open connections needed for multi-tenant
~50–100 MB additional RAM per tenant's loaded graph
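The per-tenant LRU cache mentioned in the cons could look like the sketch below. It is generic on purpose: the handle type and the `open`/`onEvict` callbacks stand in for whatever the Kuzu client exposes, and no real Kuzu API is called here. `LruCache` is a hypothetical name.

```typescript
class LruCache<V> {
  // Map preserves insertion order, so the first key is the least recently used.
  private readonly entries = new Map<string, V>();

  constructor(
    private readonly capacity: number,
    private readonly onEvict: (key: string, value: V) => void,
  ) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
  }

  /** Returns the cached handle for key, opening it on a miss. */
  get(key: string, open: () => V): V {
    if (this.entries.has(key)) {
      const existing = this.entries.get(key) as V;
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, existing);
      return existing;
    }
    const value = open();
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldestKey = this.entries.keys().next().value;
      if (oldestKey !== undefined) {
        const oldest = this.entries.get(oldestKey) as V;
        this.entries.delete(oldestKey);
        this.onEvict(oldestKey, oldest); // e.g. close the tenant's database handle
      }
    }
    return value;
  }
}
```

With one database directory per tenant, the cache key would be the tenant ID and `onEvict` would close the evicted tenant's handle, bounding both open file handles and resident memory.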

Migration path:

Phase 1: Replace getSubgraph() only — the BFS UI queries. Zero correctness risk, immediate user-visible improvement.
Phase 2: Replace path-materializer nested loop. Requires test coverage verifying execution_paths[] matches MongoDB BFS output.
Phase 3: Per-tenant Kuzu database management, startup rebuild, LRU cache.

Depends on: Nothing. Independent of other changes; the StorageAdapter abstraction makes this additive.

Trigger: When BFS dashboard queries become noticeably slow (>2s), or when path materialization takes >3 minutes per tenant sync.

9.2 Graph Engine Options — Kuzu vs Apache AGE

Two candidate engines for replacing the MongoDB BFS, compared against the current MongoDB-only baseline:

	Kuzu	Apache AGE	MongoDB only (current baseline)
Architecture	Embedded in-process (like SQLite)	PostgreSQL extension	Current — no graph engine
Query language	Cypher (openCypher)	Cypher (openCypher) within SQL	Application-level BFS loops
Infra footprint	Kuzu file per tenant (100–200MB)	PostgreSQL (already in stack with TimescaleDB)	—
ACID guarantees	Kuzu-internal only — separate from MongoDB	PostgreSQL ACID — same transaction as other writes	—
Dual-write risk	Yes — Kuzu can lag MongoDB	No — single PostgreSQL write	—
Variable-length path queries	Yes — arbitrary depth	Yes — arbitrary depth	Blocked (MAX_AUTH_CHAIN_DEPTH=1)
Pre-1.0 stability risk	Yes — Kuzu v0.x	No — PostgreSQL 16 + AGE 1.x	—
Rejection reason (ADR-003)	Not evaluated	Exponential blowup on variable-length paths; no AWS managed service	—

On ADR-003's AGE rejection: ADR-003 rejected AGE on two grounds: (1) exponential blowup on unbounded variable-length path queries, and (2) no AWS managed service for AGE. The first concern is valid but mitigated by bounded queries — SecurityV0's path materialization uses bounded traversals constrained by the IAM hierarchy (MATCH p = (identity)-[*..4]->(resource) does not exhibit exponential blowup). The second concern — no AWS managed service — is not addressed by bounded queries and would require acceptance of self-hosting AGE as a PostgreSQL extension. Any re-evaluation of AGE must explicitly address both rejection criteria, not just the first.

Recommendation:

If PostgreSQL is already in the stack (TimescaleDB), Apache AGE is the better choice:

No additional infrastructure component (AGE is a PostgreSQL extension)
No dual-write consistency gap — graph writes happen in the same transaction as entity writes
No v0.x stability risk
Bounded Cypher queries cover SecurityV0's traversal patterns without hitting AGE's exponential blowup case

If staying MongoDB-only, Kuzu is the correct embedded graph engine — the dual-write risk is mitigated by the startup rebuild from MongoDB.

-- Apache AGE setup (PostgreSQL extension)
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

-- Create graph per tenant
SELECT create_graph('tenant_acme');

-- Import entities as vertices
SELECT * FROM cypher('tenant_acme', $$
  CREATE (:Identity { entity_id: '...', name: 'lambda-executor', type: 'service_account' })
$$) AS (result agtype);

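-- Authority path query: bounded depth avoids AGE's exponential blowup.
-- AGE accepts Cypher parameters only through a prepared statement.
-- Check that your AGE version accepts the A|B relationship-type alternation.
PREPARE authority_paths(agtype) AS
SELECT * FROM cypher('tenant_acme', $$
  MATCH path = (i:Identity)-[:HAS_ROLE|GRANTS_PERMISSION*1..4]->(r:Resource)
  WHERE i.entity_id = $entityId
  RETURN nodes(path), relationships(path)
$$, $1) AS (nodes agtype, rels agtype);

EXECUTE authority_paths('{ "entityId": "..." }');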

10. Execution Evidence Time-Series Store

10.1 Quick Win — Batch the Evidence Lookups First

Current state: The dormant_authority rule calls getExecutionEvidence(entityId, 1) once per entity in the evaluation loop. At 2,000 entities with CloudTrail active, this is 2,000 sequential MongoDB queries per evaluation run — >30 seconds per tenant.

The fix that gives 90% of the benefit, without new infrastructure:

// CURRENT: 2,000 sequential queries
for (const entity of entities) {
  const evidence = await ctx.getExecutionEvidence(entity._id, 1);
  // ...
}

// FIXED: 1 query for the entire tenant
const lastSeen = await ctx.getLastEvidenceTimestamps(entities.map(e => e._id));
// Add to StorageAdapter:
// getLastEvidenceTimestamps(entityIds: string[]): Promise<Map<string, Date>>
// → db.collection("execution_evidence").aggregate([
//     { $match: { tenant_id, entity_id: { $in: entityIds } } },
//     { $group: { _id: "$entity_id", last_seen: { $max: "$occurred_at" } } }
//   ])

Add getLastEvidenceTimestamps() to StorageAdapter, implement in MongoStorageAdapter. The evaluator pre-loads the map once, then resolves dormancy in O(1) per entity.
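Once the map is loaded, the per-entity dormancy check is a pure in-memory lookup. The sketch below assumes a 90-day dormancy window, the figure used elsewhere in this document; `findDormantEntities` is a hypothetical helper, not the existing rule implementation.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

/**
 * Returns the entity IDs with no evidence inside the dormancy window.
 * lastSeen is the Map produced by one getLastEvidenceTimestamps() call.
 */
function findDormantEntities(
  entityIds: string[],
  lastSeen: Map<string, Date>,
  now: Date,
  windowDays: number = 90, // assumption: same window as the dormant_authority rule
): string[] {
  const cutoffMs = now.getTime() - windowDays * DAY_MS;
  return entityIds.filter((id) => {
    const seenAt = lastSeen.get(id);
    // No evidence at all, or evidence older than the cutoff, counts as dormant.
    return seenAt === undefined || seenAt.getTime() < cutoffMs;
  });
}
```

The loop makes no database calls, so evaluation cost grows with the entity count rather than with the number of round-trips.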

Depends on: Nothing; purely additive to StorageAdapter.

Trigger: Do this before adding CloudTrail. It must exist before evidence volume grows.

10.2 Time-Series Store — TimescaleDB (When Triggered)

If evidence volume grows beyond what the batching fix handles (trigger: evaluation taking >30s per tenant after the batch fix, or evidence rows exceeding 1M per tenant), TimescaleDB is the right choice.

Why TimescaleDB over ClickHouse:

	TimescaleDB	ClickHouse
Transactions	ACID — writes never silently fail	No transactions — dual-write gap
Query language	Standard PostgreSQL SQL	ClickHouse-specific dialect and functions
Consistency model	Synchronous, consistent	Eventual (`FINAL` keyword required to deduplicate)
Point-lookup speed	Fast (B-tree index)	Slow — columnar format is wrong shape for single-row lookups
Aggregation at 100M rows	Fast (continuous aggregates)	Fast (columnar compression)
Aggregation at 1B+ rows	Slower	ClickHouse wins here
Ops complexity	PostgreSQL extension — same tooling	Separate system, separate SQL, separate ops
Hosting	Supabase, Timescale Cloud, or any PostgreSQL host	ClickHouse Cloud, Tinybird, or self-hosted

SecurityV0's workload is in the tens-of-millions-of-rows range, not billions, so ClickHouse's columnar advantage does not apply at this scale. TimescaleDB's ACID writes mean an evidence write either commits or fails visibly, which removes the silent-loss half of the dual-write gap.

Proposed TimescaleDB schema:

CREATE TABLE execution_evidence (
    tenant_id        TEXT NOT NULL,
    occurred_at      TIMESTAMPTZ NOT NULL,
    entity_id        TEXT NOT NULL,           -- 24-char hex MongoDB ID
    source_system    TEXT NOT NULL,
    source_record_id TEXT NOT NULL,           -- dedup key
    evidence_type    TEXT NOT NULL,
    action           TEXT NOT NULL,
    resource_key     TEXT,
    outcome          TEXT,
    execution_count  INT DEFAULT 1,
    confidence       TEXT,
    payload_hash     TEXT,
    sync_id          TEXT,
    fetched_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (tenant_id, entity_id, occurred_at, source_system, source_record_id)
);

SELECT create_hypertable('execution_evidence', 'occurred_at',
  chunk_time_interval => INTERVAL '1 month');
CREATE INDEX ON execution_evidence (tenant_id, entity_id, occurred_at DESC);

-- Continuous aggregate: pre-computed, refreshed incrementally.
-- TimescaleDB requires a time_bucket() on the hypertable's time column, so the
-- view stores one last_seen_at per entity per day.
CREATE MATERIALIZED VIEW evidence_last_seen
WITH (timescaledb.continuous) AS
  SELECT tenant_id, entity_id,
         time_bucket(INTERVAL '1 day', occurred_at) AS bucket,
         MAX(occurred_at) AS last_seen_at
  FROM execution_evidence
  GROUP BY tenant_id, entity_id, bucket;

-- Refresh schedule (the offsets and interval are illustrative)
SELECT add_continuous_aggregate_policy('evidence_last_seen',
  start_offset => INTERVAL '3 days',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

-- Retention on the TimescaleDB projection (not the source-of-truth event log).
-- The MongoDB events collection is immutable and has no TTL (see §8.1).
-- This table is a materialized projection of those events for query performance;
-- dropping rows here does not create a reconstruction gap as long as the source
-- events are retained. If you later adopt full event sourcing with replay,
-- align this retention window with your compliance floor.
SELECT add_retention_policy('execution_evidence', INTERVAL '2 years');

Before/after for dormant_authority rule (with TimescaleDB):

// Evaluator queries the continuous aggregate: milliseconds regardless of raw row count
const { rows } = await pgClient.query(`
  SELECT entity_id, MAX(last_seen_at) AS last_seen_at
  FROM evidence_last_seen
  WHERE tenant_id = $1
    AND entity_id = ANY($2)
    AND last_seen_at >= NOW() - INTERVAL '90 days'
  GROUP BY entity_id
`, [tenantId, entityIds]);
// Pre-load into Map<entityId, Date>, then O(1) per entity in evaluation loop

Per-tenant and cross-tenant analytics (impractical in MongoDB at this volume):

-- Evidence coverage by type per tenant
SELECT evidence_type, time_bucket('1 day', occurred_at) AS day, COUNT(*) AS events
FROM execution_evidence
WHERE tenant_id = $1
GROUP BY evidence_type, day ORDER BY day DESC;

-- Which entities exercised write actions in the last 30 days?
SELECT entity_id, SUM(execution_count) AS total_writes
FROM execution_evidence
WHERE action IN ('write', 'escalation', 'admin')
  AND occurred_at >= NOW() - INTERVAL '30 days'
GROUP BY entity_id ORDER BY total_writes DESC;

Integration architecture:

MongoDB: source of truth for entity documents, findings, evidence packs (unchanged)
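TimescaleDB: analytics read model for evidence (synchronous dual-write: the ingestion request fails if either write fails. MongoDB and PostgreSQL do not share a transaction, so a MongoDB write that succeeded before a failed PostgreSQL write must be retried or reconciled from MongoDB)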
StorageAdapter: sumExecutionEvidenceCount() and getLastEvidenceTimestamps() route to TimescaleDB. getExecutionEvidence(limit=1) stays in MongoDB (point-lookup).

Hosted options:

Option	Cost	Ops burden
Supabase (PostgreSQL + TimescaleDB included)	~$25–100/month	Near-zero
Timescale Cloud (managed)	~$50–200/month	Near-zero
Self-hosted on existing PostgreSQL	€0 additional	Low — same tooling as existing DB ops

Depends on: CloudTrail extractor (1.1), since no evidence volume exists until CloudTrail is implemented. Do 10.1 first.

Trigger: dormant_authority evaluation still >30s per tenant after the 10.1 batch fix, or evidence rows exceed 1M per tenant.

11. Durable Event Bus — When the Internal Queue Isn't Enough

11.1 Event Bus Options — Kafka, NATS JetStream, Redpanda, or pg_notify

Current state: The in-process WorkerJob[] array (runtime.ts:26) is the only queue. workers/index.ts registers 10ms sleep stub handlers, while the real handler implementations in src/workers/handlers/ are wired through src/index.ts (the API server process), not the worker process. The worker process is therefore effectively dead, and all processing happens inline in the API server. This is a critical production gap.

The queue replacement in section 3 (pg_boss or BullMQ) fixes internal job processing. An event bus becomes relevant only when you need fan-out to independent consumers or external system integration that a job queue can't provide.

What an event bus provides that a job queue cannot:

Job queue: produce → ONE consumer → job deleted
Event bus: produce → [consumer A reads] → [consumer B reads] → [consumer C reads]
           messages retained for N days; any new consumer can replay from beginning

Independent consumer groups with replay. A SIEM forwarder can subscribe to finding events without platform code changes: it and the evidence builder read the same topic independently, each at its own offset.
Fan-out without producer coupling. One IAM change event updates the graph engine, writes to TimescaleDB, and forwards to a SIEM — without the producer knowing about any of them.
Streaming ingestion from connectors. When CloudTrail is streaming (not batch), connectors emit thousands of small events per minute. An event bus is the natural receiver; an HTTP endpoint is not.
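The difference from a job queue comes down to retained messages plus one offset per consumer group. The in-memory model below illustrates only that semantic; `EventLog` is a hypothetical class and does not represent the API of Kafka, Redpanda, or NATS.

```typescript
class EventLog<T> {
  private readonly messages: T[] = []; // retained; reading never deletes
  private readonly offsets = new Map<string, number>(); // one offset per consumer group

  publish(message: T): void {
    this.messages.push(message);
  }

  /** Returns messages this group has not read yet and advances only its own offset. */
  poll(group: string): T[] {
    const from = this.offsets.get(group) ?? 0;
    this.offsets.set(group, this.messages.length);
    return this.messages.slice(from);
  }

  /** Rewinds one group to the beginning without affecting any other group. */
  replayFromStart(group: string): void {
    this.offsets.set(group, 0);
  }
}
```

A new group added months later starts at offset 0 and sees the full retained history, which is what a job queue that deletes consumed jobs cannot offer.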

Options compared:

	Kafka / Redpanda	NATS JetStream	PostgreSQL LISTEN/NOTIFY
Retention	Days to years (log-based)	Configurable (file-backed)	Not retained — fire and forget
Replay	Full replay from offset 0	Replay from sequence number	No replay
Fan-out	Consumer groups — independent offsets	Push-based subscribers	All listeners receive simultaneously
Ordering	Per-partition ordering	Per-subject ordering	No ordering guarantees
Throughput	Millions of msgs/sec	Millions/sec (memory); ~500K/sec file-backed	Thousands of msgs/sec
Infra footprint	3-node cluster minimum	Single binary (Go, ~20MB)	0 — already in PostgreSQL
Ops complexity	High — topic management, consumer lag, schema registry	Low — single binary, no ZooKeeper/KRaft	Near-zero
Ecosystem	Kafka Connect, ksqlDB, broad tooling	Growing — NATS CLI, leaf nodes	SQL tooling only
At-scale winner	Yes — billions of messages/day	Yes — up to hundreds of millions/day	No — inappropriate at scale

Recommendation for SecurityV0:

Phase 1 (internal processing): pg_boss or BullMQ (section 3). No event bus needed yet.
Phase 2 (SIEM/SOAR integration or external fan-out): NATS JetStream. Single binary, Go, 20MB RAM per node. Subject-based routing maps directly to the topic design below. No JVM, no ZooKeeper, no cluster management at startup scale.
Phase 3 (streaming CloudTrail at scale or 500+ tenants): Redpanda. Kafka-compatible API — every Kafka consumer already works. No JVM. 3 Hetzner CCX22 nodes (4vCPU/8GB each, ~€57/month) — Redpanda's minimum recommended spec per node is 4vCPU/8GB; CPX11 (2GB) is undersized and will OOM under load. Migrate from NATS JetStream if fan-out consumers grow beyond what NATS handles.
Skip vanilla Kafka unless forced by ecosystem tooling requirements (Kafka Connect, ksqlDB). The ops overhead at startup scale is not justified.

Subject/topic design (same across all options):

sv0.iam.graph.submitted      — connector scan complete (NormalizedGraph payload)
sv0.iam.entity.changed       — per-entity diff event (from diff-engine.ts EventDoc)
sv0.path.materialized        — execution paths recomputed for entity
sv0.finding.evaluated        — finding created or updated
sv0.evidence.built           — evidence pack sealed

All partitioned by tenant_id — per-tenant ordering, no cross-tenant interference.

NATS JetStream consumer example:

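import { connect } from 'nats';

const nc = await connect({ servers: 'nats://localhost:4222' });
const js = nc.jetstream();

// Durable consumer: survives restarts, picks up from last ack.
// Assumes the 'sv0-events' stream and 'cg-evaluator' consumer already exist.
const consumer = await js.consumers.get('sv0-events', 'cg-evaluator');

for await (const msg of await consumer.consume()) {
  try {
    // cg-evaluator consumes sv0.path.materialized (see consumer groups below)
    const event = JSON.parse(msg.string());
    await evaluatorService.evaluateTenant(event.tenant_id); // payload field name assumed
    msg.ack();
  } catch (err) {
    msg.nak(); // request redelivery instead of crashing the consume loop
  }
}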

Consumer groups (same pattern regardless of event bus):

cg-path-materializer  entity.changed  → materializeExecutionPaths
cg-evaluator          path.materialized → EvaluatorService.evaluateTenant
cg-evidence-builder   finding.evaluated → buildEvidencePack
cg-timescaledb-writer entity.changed + finding.evaluated → TimescaleDB dual-write
cg-siem-forwarder     finding.evaluated → SIEM/SOAR push (zero platform changes)

Managed options for Hetzner startup:

Option	Cost	Ops burden	Notes
Redpanda (self-hosted)	~€57/month (3 Hetzner CCX22 nodes)	Low (single binary, no JVM)	Kafka-compatible, best for Hetzner
WarpStream	~$0–50/month (S3-backed)	Near-zero — stateless brokers	~250–500ms p50 latency on standard S3; S3 Express One Zone reduces this but at higher storage cost — unsuitable for sub-second delivery requirements
Confluent Cloud	~$720/month minimum	Near-zero	Overpriced at startup scale
Amazon MSK	~$360–550/month	Low	Natural fit only if moving to AWS

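Recommended: Redpanda self-hosted for Hetzner once Phase 3 is triggered. Same Kafka API, no JVM, 3-node cluster on Hetzner CCX22 instances (4vCPU/8GB each). CPX11 (2GB) instances are undersized for Redpanda, as noted in the Phase 3 recommendation above.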

Pros:

Event replay for correctness — re-evaluate 90 days of scans after a rule bug fix
Fan-out enables SIEM/SOAR integration with zero platform code changes
Streaming CloudTrail events (when implemented) are natively Kafka-shaped
Per-partition tenant ordering prevents evaluate_findings race condition that currently works by accident

Cons:

Significant operational overhead vs. BullMQ: topic management, consumer lag monitoring, schema registry, offset management
The evaluate_findings → sync_ingestion ordering dependency (currently serial by accident) must be made explicit: evaluator consumer checks sync.status === "completed" before processing, retry with backoff if not ready
Connector must be updated to publish to Kafka topic instead of HTTP POST (coordinated release)
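The explicit ordering check with backoff could be sketched as below. Only the "completed" status comes from this document; the other `SyncStatus` values, the `waitForSyncCompleted` name, and the exponential delay schedule are assumptions.

```typescript
type SyncStatus = "pending" | "running" | "completed" | "failed";

interface BackoffOptions {
  maxAttempts: number;
  baseDelayMs: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait in real time
}

/** Resolves true once the sync is completed, false if it failed or attempts ran out. */
async function waitForSyncCompleted(
  getStatus: () => Promise<SyncStatus>,
  options: BackoffOptions,
): Promise<boolean> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === "completed") return true;
    if (status === "failed") return false;
    // Exponential backoff: base, 2x base, 4x base, ... (skipped after the final attempt)
    if (attempt < options.maxAttempts - 1) {
      await sleep(options.baseDelayMs * 2 ** attempt);
    }
  }
  return false;
}
```

When this returns false, the evaluator consumer would leave the message unacknowledged so the bus redelivers it later.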

Depends on: BullMQ (section 3) must be working first; k3s (section 6) for deployment.

Trigger: (any one is sufficient)

Fan-out to 2+ independent external consumers (SIEM, ML pipeline)
Replay requirement discovered after an evaluation rule bug
CloudTrail streaming connector is implemented (batch HTTP → streaming events)
Tenant count exceeds 100 with concurrent scan bursts overwhelming BullMQ priority queues

12. Federated Edge Processing

12.1 Run Evaluation Rules Inside the Customer's Environment

Current state: Every connector pushes a complete NormalizedGraph to the central platform. For a medium AWS account (500 nodes, 1,500 edges), this is 1–4 MB of JSON containing trust policy documents, inline policy JSON, access key metadata, Bedrock agent instructions, and secret ARNs. All of this lands in SecurityV0's shared MongoDB instance in Germany.

What moves to the connector:

9 of 14 evaluation rules can run entirely on the local NormalizedGraph with no external data:

Rule	Portable?	Notes
`dormant_authority`	Yes	Local CloudTrail evidence + in-memory graph
`unproven_execution`	Yes	Pure local data
`reachable_sensitive_domain`	Yes	Pure local data
`llm_egress`	Yes	Single property check
`external_egress`	Yes	Single property check
`unknown_identity_binding`	Yes	All targets are in local graph
`unresolved_cross_system_auth`	Yes	Pure property check
`orphaned_ownership`	Yes	All owner entities in local graph
`ownership_unknown`	Yes	Pure local data
`scope_drift`	No*	Requires entity version history across scans
`reachability_drift`	No*	Requires entity version history across scans
`ownership_drift`	No*	Requires entity version history across scans
`ownership_ambiguous`	Partial	Version history = first-scan only without persistence
`privilege_justification_gap`	Conditional	Needs CloudTrail evidence

*Drift rules can be made portable if the connector saves a "last-scan snapshot" to local disk between runs (e.g., ~/.sv0-aws/last-snapshot.json).

What the connector sends in federated mode (instead of full NormalizedGraph):

{
  "syncId": "...",
  "connectorVersion": "1.4.0",
  "ruleEngineVersion": "2.1.0",
  "entitySummary": { "totalEntities": 847, "byType": { "identity": 312, ... } },
  "findings": [
    {
      "findingId": "eval:abc123",
      "findingType": "dormant_authority",
      "severity": "high",
      "entityDisplayName": "my-lambda-execution-role",
      "entityType": "identity",
      "explanation": "Identity has 3 execution paths but no evidence in 90 days.",
      "evidenceClaim": { "claim_type": "execution_absent", "evidence_strength": "deterministic" }
    }
  ],
  "postureSummary": { "activePaths": 1240, "dormantPaths": 89 },
  "graphIntegrityHash": "SHA256 of all entity hashes"
}

Raw IAM data that never leaves the customer's environment: trust policy documents, inline policy JSON, access key IDs, secret ARNs, Bedrock agent instructions, resource policy documents, role ARNs.

What breaks in the dashboard (federated mode):

Feature	Status
Findings list and detail	Fully functional
Posture metrics dashboard	Fully functional
Sync history	Fully functional
Entity list (properties tab)	Degraded — display names only, no IAM properties
Graph explorer	Completely broken — no entity graph stored
Temporal compare	Completely broken — no entity version history
Execution chains	Completely broken
Evidence pack detail	Severely degraded — finding text only

Hybrid tier model:

Standard tier:  POST /api/v1/ingest/normalized-graph (current, unchanged)
                Full entity storage, all dashboard features, all 14 rules server-side

Enterprise tier: POST /api/v1/ingest/findings (new endpoint)
                 Findings-only ingestion, graph explorer disabled, data never leaves customer env

Tenant ingestion_mode: "standard" | "federated" field controls which path is active. Dashboard shows "Federated Mode" badge and disables graph-dependent features.
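The mode switch and the dashboard gating can be expressed as one lookup table. The sketch below encodes the feature table above; the `DashboardFeature` keys and helper names are hypothetical, and "disabled" stands for the rows marked completely broken.

```typescript
type IngestionMode = "standard" | "federated";
type FeatureStatus = "full" | "degraded" | "disabled";
type DashboardFeature =
  | "findings"
  | "postureMetrics"
  | "syncHistory"
  | "entityList"
  | "graphExplorer"
  | "temporalCompare"
  | "executionChains"
  | "evidencePackDetail";

// Status of each dashboard feature when the tenant is in federated mode.
const FEDERATED_STATUS: Record<DashboardFeature, FeatureStatus> = {
  findings: "full",
  postureMetrics: "full",
  syncHistory: "full",
  entityList: "degraded",
  graphExplorer: "disabled",
  temporalCompare: "disabled",
  executionChains: "disabled",
  evidencePackDetail: "degraded",
};

function featureStatus(mode: IngestionMode, feature: DashboardFeature): FeatureStatus {
  return mode === "standard" ? "full" : FEDERATED_STATUS[feature];
}

/** Ingestion endpoint that is active for the tenant's mode. */
function ingestEndpoint(mode: IngestionMode): string {
  return mode === "federated"
    ? "/api/v1/ingest/findings"
    : "/api/v1/ingest/normalized-graph";
}
```

Keeping the table exhaustive over `DashboardFeature` means a newly added feature fails to compile until its federated behavior is decided.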

Finding ID stability requirement: The connector must compute finding IDs using the exact same formula as src/ingestion/graph-transformer.ts:buildStableEntityId() — SHA256 of tenantId:sourceSystem:sourceId. The Python connector must replicate this hash exactly.
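A minimal sketch of the hash as described, assuming UTF-8 input, a colon separator, and a lowercase hex digest. The real buildStableEntityId() may differ in encoding, prefixing, or truncation, so `stableId` here is a hypothetical stand-in to be checked against the source.

```typescript
import { createHash } from "node:crypto";

/** SHA-256 of "tenantId:sourceSystem:sourceId" as lowercase hex (assumed format). */
function stableId(tenantId: string, sourceSystem: string, sourceId: string): string {
  return createHash("sha256")
    .update(`${tenantId}:${sourceSystem}:${sourceId}`, "utf8")
    .digest("hex");
}
```

On the Python side, hashlib.sha256 over the same UTF-8 bytes followed by hexdigest() produces the same string for the same input. A shared set of test vectors checked by both codebases in CI would catch any drift between the two implementations.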

Pros:

Raw IAM topology never leaves customer's cloud environment
GDPR data residency story changes from "we protect your data" to "we never receive your data"
Platform breach blast radius changes from "complete cloud attack surface for every customer" to "finding metadata and resource display names"
FedRAMP path: federal agency IAM data never touches non-FedRAMP infrastructure

Cons:

Graph explorer, temporal compare, execution chains disabled — these are the platform's most visually distinctive features
Rule versioning becomes a coordination problem: ruleEngineVersion must be enforced and customers must upgrade connectors to get new rules
Drift rules require connector-side state persistence (local file or S3) — adds operational complexity
Platform → connector API: how does the platform send "baseline" back to the connector for drift rules? (unsolved)
3 connectors need porting (AWS + Entra-ServiceNow + Azure Foundry) — effort multiplies

Minimal federated mode (pragmatic first step): Ship 9 portable rules without drift detection. Label drift rules as "requires Standard mode." Deliver the data residency story immediately at half the engineering effort. Build full drift support when a specific customer requires it.

Depends on: 8.1 (event store formalization for graphIntegrityHash binding), 1.1 (CloudTrail for privilege_justification_gap)

Trigger: Customer with demonstrated IAM data residency requirement

13. Architecture Combination Tiers

The five patterns above are not independent — some strongly reinforce each other and some conflict when adopted simultaneously. This section defines the recommended adoption sequence based on the synergy analysis.

13.1 Synergy Matrix

	Event Sourcing	Kuzu	Federated Edge	Kafka	TimescaleDB
Event Sourcing	—	STRONG	STRONG	STRONG	STRONG
Kuzu	STRONG	—	Neutral	Neutral	Neutral
Federated Edge	STRONG	Neutral	—	STRONG	Neutral
Kafka	STRONG	Neutral	STRONG	—	STRONG
TimescaleDB	STRONG	Neutral	Neutral	STRONG	—

Note on Kuzu + TimescaleDB: Unlike ClickHouse (which had a weak conflict due to dual-write consistency), TimescaleDB writes are ACID transactions. The multi-store consistency risk is reduced but the single event stream (Event Sourcing) remains the cleanest reconciliation model.

13.2 Tier 1 — Production-Ready (up to ~50 tenants)

Patterns: Event Sourcing (formalized) + TimescaleDB

This is the highest-leverage combination given the actual code state. The events collection already exists. TimescaleDB fits directly behind StorageAdapter.sumExecutionEvidenceCount() and getLastEvidenceTimestamps(). Zero connector changes. Zero API changes. Zero dashboard changes.

Connector (Python, unchanged)
  → POST /api/v1/ingest/normalized-graph (unchanged)
  → Diff Engine (unchanged) → MongoDB (entities, entity_versions, events [no TTL, sequence_number])
                           → TimescaleDB (execution_evidence, ACID dual-write)
Worker: BullMQ replacing WorkerJob[] array
Dashboard: unchanged

Solves:

Evidence pack integrity (event range binding)
Real sumExecutionEvidenceCount once CloudTrail is live
Serial queue saturation (BullMQ with 4 parallel workers)
Preservation of full event history for future ES migration
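
As an illustration of the first item, the sketch below binds an evidence pack to a contiguous range of events by hashing them in sequence_number order. The event shape and the canonical JSON serialization are assumptions; the actual binding should follow whatever format section 8.2 settles on.

```python
import hashlib
import json


def event_range_hash(events: list, first_seq: int, last_seq: int) -> str:
    """Hash the events whose sequence_number falls in [first_seq, last_seq].

    Raises ValueError if the range has gaps or duplicates, because a missing
    event would make the binding meaningless.
    """
    selected = sorted(
        (e for e in events if first_seq <= e["sequence_number"] <= last_seq),
        key=lambda e: e["sequence_number"],
    )
    expected = list(range(first_seq, last_seq + 1))
    if [e["sequence_number"] for e in selected] != expected:
        raise ValueError("event range is not contiguous")

    digest = hashlib.sha256()
    for event in selected:
        # Canonical JSON (sorted keys, no whitespace) is an assumption here.
        canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
        digest.update(canonical.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Storing first_seq, last_seq, and the digest alongside the evidence pack lets a later audit recompute the hash from the event log and detect any edited or deleted event in that range. This only works if events are never expired, which is why the TTL removal comes first.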

Cost: TimescaleDB on the Supabase free tier or self-hosted on existing PostgreSQL, plus Redis (already needed for BullMQ)

13.3 Tier 2 — Growth (100+ tenants)

Patterns added: Kuzu as graph read model + formalized CQRS boundary

Tier 1's event store enables clean Kuzu projection: graph change events → Kuzu write. The StorageAdapter routes getSubgraph() and path materialization to Kuzu.

Connector (Python)
  → Kafka topic: sv0.iam.entity.changed (replaces direct HTTP push for streaming tenants)
  → MongoDB (source of truth, unchanged)
  → Kuzu (graph projection, rebuilt from entity.changed events)
  → TimescaleDB (evidence, consuming from Kafka topic directly)

Worker: BullMQ for document operations + Kafka consumer groups for graph/evidence projections
Dashboard: graph queries → Kuzu (faster BFS, arbitrary depth, blast radius queries)

Solves:

O(I×R×P×Res) path materialization eliminated
MAX_AUTH_CHAIN_DEPTH=1 constraint removed
100+ tenants with parallel sync workers (Kafka consumer groups, one partition per tenant)
SIEM/SOAR integration via new Kafka consumer group (zero platform code changes)
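
To show what removing the depth constraint means in practice, here is a plain-Python breadth-first traversal with an optional depth limit. It is only a semantic illustration on an in-memory adjacency map with made-up node names; a graph engine would express the same traversal as a single query rather than application code.

```python
from collections import deque
from typing import Optional


def reachable(graph: dict, start: str, max_depth: Optional[int] = None) -> dict:
    """Return every node reachable from start, mapped to its hop count.

    max_depth=None means unbounded traversal; max_depth=1 reproduces a
    single-hop limit.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_depth is not None and depths[node] >= max_depth:
            continue  # do not expand past the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths


if __name__ == "__main__":
    # Hypothetical authority chain: function -> role -> role -> bucket.
    chain = {"lambda-fn": ["role-a"], "role-a": ["role-b"], "role-b": ["bucket"]}
    print(reachable(chain, "lambda-fn", max_depth=1))
    print(reachable(chain, "lambda-fn"))
```

With a depth limit of 1 the bucket at the end of the chain is invisible; an unbounded traversal reaches it three hops out, which is the kind of blast radius question the Tier 2 dashboard is meant to answer.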

13.4 Tier 3 — Enterprise-Scale (triggered by 500+ tenants or enterprise isolation requirement)

All five patterns active. Federated Edge is the differentiating feature.

CUSTOMER ENVIRONMENT:
  Federated Edge Agent → scan IAM → local path materialization → local rule evaluation
                       → publishes FindingsPayload to sv0.findings.{tenant} Kafka topic
                       → CloudTrail stream → aggregated evidence events (no raw calls)

SECURITYV0 INFRASTRUCTURE:
  Kafka → findings-ingestion consumer → MongoDB (findings, entity summaries, posture)
       → TimescaleDB (evidence events from customer edge)
       → Kuzu (graph projections from delta events)
  Dashboard: findings + posture (full) | graph explorer (disabled in federated mode)
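
The "aggregated evidence events (no raw calls)" step in the customer environment could look roughly like the sketch below: raw events are collapsed into hourly counts per entity and event type, and everything else (source IPs, request parameters) is dropped before anything is published. The field names and the hourly bucket size are assumptions, not the connector's actual schema.

```python
from collections import Counter
from datetime import datetime, timezone


def aggregate_evidence(raw_events: list) -> list:
    """Collapse raw events into hourly counts per (entity, event type)."""
    counts = Counter()
    for event in raw_events:
        # Timestamps are assumed to be ISO 8601 with an explicit UTC offset.
        ts = datetime.fromisoformat(event["timestamp"]).astimezone(timezone.utc)
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        counts[(event["entity_id"], event["event_type"], bucket)] += 1
    return [
        {"entity_id": entity, "event_type": kind, "hour": bucket.isoformat(), "count": n}
        for (entity, kind, bucket), n in sorted(counts.items())
    ]
```

Only the aggregate rows would leave the customer environment, so the platform can still answer "was this authority exercised, and when" without receiving individual API calls.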

Enterprise value props:

Raw IAM data never leaves customer's cloud
Customer-key signing of evidence events (cryptographic non-repudiation)
FedRAMP Moderate eligible (IAM data never on non-FedRAMP infrastructure)
GDPR data residency: EU edge agent + EU Kafka + EU TimescaleDB partition = complete residency

13.5 Anti-Patterns — What NOT to Combine

1. Kuzu + TimescaleDB without Event Sourcing as coordinator. Three write surfaces (MongoDB, Kuzu, TimescaleDB) without a shared event stream create inconsistency risk. TimescaleDB's ACID writes reduce (but don't eliminate) the risk: Kuzu can still lag if its sync job fails. Fix: formalize the event store first (8.1), then project into both from the same event stream.

2. Federated Edge + Kafka before canonical resource identity is fixed. The resource_key field is not yet a stable first-class identifier. Federated edge agents publishing evidence events will produce records that cannot be matched to authority paths because the join key (resource_key) differs between the connector and the platform. Fix: complete the canonical resource identity refactor before shipping federated edge.

3. Full Kafka as the internal worker queue. BullMQ achieves the same throughput improvement with 10% of the operational complexity. Kafka's value is as an event bus for the connector → platform boundary and for fan-out to external consumers, not as an internal job runner. The technology stack analysis doc in sv0-documentation explicitly flags this as an anti-pattern.

4. Federated Edge as the first architectural change. It requires a stable delta event format (needs event sourcing), a durable outbound channel (needs Kafka), a canonical resource_key (needs refactor), and customer-side deployment tooling. Building the enterprise feature on a foundation with unresolved critical bugs doubles the surface area for failures.
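
The fix for anti-pattern 1 (project into both stores from the same event stream) can be sketched with two read models that each track the last sequence_number they applied. The Projection class and the event shape are hypothetical stand-ins for the Kuzu and TimescaleDB writers.

```python
from dataclasses import dataclass, field


@dataclass
class Projection:
    """A read model that applies events in sequence_number order."""
    name: str
    last_applied: int = 0
    state: dict = field(default_factory=dict)

    def catch_up(self, event_log: list) -> int:
        """Apply every event after the checkpoint; return how many were applied."""
        applied = 0
        for event in sorted(event_log, key=lambda e: e["sequence_number"]):
            if event["sequence_number"] <= self.last_applied:
                continue  # already applied, so replaying the log is safe
            self.state[event["entity_id"]] = event["payload"]
            self.last_applied = event["sequence_number"]
            applied += 1
        return applied
```

If one projection falls behind because its sync job failed, its checkpoint shows exactly how far behind it is, and replaying the log from that point brings it back in line with the other. Without the shared log there is nothing to replay from.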

13.6 Migration Compatibility Summary

Pattern	Can be adopted incrementally?	Seam	Connector changes
Event Sourcing (formalize events)	Yes — 2 field additions	`schema.ts`, `EventDoc`	None
TimescaleDB for evidence	Yes — ACID dual-write, feature flag	`StorageAdapter.sumExecutionEvidenceCount()`	None
BullMQ (section 3)	Yes — drop-in swap	`WorkerRuntime` class	None
Kuzu (shadow then cutover)	Yes — shadow read model	`StorageAdapter.getSubgraph()`	None
Kafka for connector intake	Coordinated connector release	`/ingest/normalized-graph` endpoint	All connectors
Federated edge processing	Product launch, per-tenant migration	Entire connector deployment model	All connectors
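
The "shadow then cutover" row can be illustrated with a small wrapper: the existing store keeps serving every request while the new store is queried in parallel and differences are recorded. The function and parameter names are hypothetical; in the platform this logic would sit behind StorageAdapter.getSubgraph().

```python
from typing import Callable


def shadow_read(
    primary: Callable[[str], set],
    shadow: Callable[[str], set],
    entity_id: str,
    mismatches: list,
) -> set:
    """Serve the primary result; query the shadow store and record any difference."""
    expected = primary(entity_id)
    try:
        candidate = shadow(entity_id)
        if candidate != expected:
            # Record only the nodes on which the two stores disagree.
            mismatches.append((entity_id, expected ^ candidate))
    except Exception as exc:  # a shadow failure must never affect the caller
        mismatches.append((entity_id, repr(exc)))
    return expected
```

Cutover becomes a low-risk switch once the mismatch list stays empty over a representative period.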

Summary Table

Change	Category	Why Better	Depends On
CloudTrail extractor	Product	AWS evidence works	—
ARN parser fix	Product	80–90% of events correctly mapped	—
`normalized_action` fix (AWS connector)	Product	Unblocks write-detection and escalation rule on AWS	—
`privilege_justification_gap` fix	Product	Rule produces findings on AWS	ARN parser + normalized_action
`escalation_capable` rule	Product	Detects NHIs with IAM escalation authority	normalized_action fix
ServiceNow 429 fix	Product	Baselines complete, not truncated	—
REQUIRE_AUTH default	Security	Secure by default	—
DevAuthProvider gate	Security	Production crash instead of bypass	—
Mount new auth middleware	Security	WorkOS membership validation live	REQUIRE_AUTH fix
Super-admin allowlist	Security	Revocable, least-privilege	New middleware
BFS document limit	Security	No tenant can OOM the API	—
BullMQ migration	Infrastructure	Persistent, parallel, recoverable queue	Redis
Per-tenant collections	Isolation	Missing filter can't leak cross-tenant	—
Per-tenant rate limiting	Isolation	One tenant can't starve others	—
Per-tenant BullMQ priority	Isolation	Enterprise jobs aren't blocked by eval	BullMQ
Docker Compose → k3s	Infrastructure	Zero-downtime deploys, cell-ready	—
Connector delta mode	Scale	~99% write amplification reduction	BullMQ
Event-driven sync	Scale	15-60 second detection latency	CloudTrail, BullMQ, delta mode
Cell architecture	Scale	Per-cell blast radius, FedRAMP, GDPR	All above (when triggered)
Formalize event store (remove TTL + add sequence_number)	Architecture	Immutable log, time-travel queries, ES migration path	—
Evidence pack integrity — bind to event range	Architecture	Cryptographically tamper-evident audit trail (SOC 2, ISO 27001)	Event store (8.1)
Kuzu or Apache AGE graph read model	Architecture	Single Cypher query replaces 45K+ MongoDB round-trips	Event store recommended
TimescaleDB evidence time-series	Architecture	`dormant_authority` eval: N queries → 1; ACID writes	CloudTrail extractor (1.1)
Event bus (NATS JetStream, Redpanda, or Kafka)	Architecture	Fan-out, replay, streaming CloudTrail ingestion	Event store, BullMQ
Federated edge processing	Architecture	9/14 rules evaluate at connector; platform load drops	Event store feed to connectors
Architecture combination tiers	Architecture	Synergy map — avoid anti-pattern deployments	Choose tier before starting

Delivery Sequence

Dependencies determine order. This is the recommended sequence for Part A changes. Part B changes are adopted individually when their trigger conditions are met.
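
Ordering by dependency can be checked mechanically. The sketch below encodes a subset of the "Depends On" column from the summary table and derives one valid order with the standard-library graphlib module (Python 3.9+). The subset is illustrative, and the phases below are one of several orders that satisfy these constraints.

```python
from graphlib import TopologicalSorter

# Each change maps to the changes it depends on (subset of the summary table).
DEPENDS_ON = {
    "CloudTrail extractor": set(),
    "REQUIRE_AUTH default": set(),
    "BullMQ migration": set(),
    "Mount new auth middleware": {"REQUIRE_AUTH default"},
    "Super-admin allowlist": {"Mount new auth middleware"},
    "Per-tenant BullMQ priority": {"BullMQ migration"},
    "Connector delta mode": {"BullMQ migration"},
    "Event-driven sync": {"CloudTrail extractor", "BullMQ migration", "Connector delta mode"},
}


def delivery_order(depends_on: dict) -> list:
    """Return one valid order in which every change follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, change in enumerate(delivery_order(DEPENDS_ON), start=1):
        print(step, change)
```

TopologicalSorter raises CycleError if two changes depend on each other, so the same table can double as a consistency check whenever a row is added.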

Phase 1 — Fix the product (prerequisite for everything else):

CloudTrail extractor (1.1)
ARN parser fix (1.2)
privilege_justification_gap resource ID and normalized_action fixes (1.3)
ServiceNow 429 fix (1.5)
REQUIRE_AUTH default invert (2.1)
DevAuthProvider production gate (2.2)

Phase 2 — Worker queue and isolation:

BullMQ migration (3.1) — replaces in-process array, adds Redis, enables parallel workers and per-tenant priorities
Per-tenant MongoDB collections (5.1), rate limiting (5.2), priority lanes (5.3), BFS document limit (2.5)

Phase 3 — Auth hardening:

Mount new auth middleware — WorkOS end-to-end (2.3)
Super-admin allowlist (2.4)
Session revocation via Redis store

Phase 4 — Infrastructure:

Docker Compose → k3s (6) — rolling deploys, cell-ready

Phase 5 — Connector efficiency (when Phases 1–4 are complete):

Connector delta mode (4.1) — connectors send only changed entities
Event-driven sync per connector (4.2) — AWS: CloudTrail → SQS; Entra/ServiceNow: webhooks

When triggered — cell architecture (7):

Cell router service (~200 lines)
First single-tenant enterprise cell
General cell provisioning when scale demands it

After Phases 1–4, SecurityV0 has:

A working AWS connector with real execution evidence
Auth that validates WorkOS sessions and org membership
A persistent, parallel, recoverable job queue
Per-tenant collection isolation (no cross-tenant data risk from missing filters)
Zero-downtime deployments
A clear path to cells when scale demands it

After Phase 5, SecurityV0 additionally has:

Real-time permission change detection (15–60 second latency vs. hours)
Write amplification reduced by ~99%
Scale ceiling moved from 140 tenants (BullMQ batch) to effectively unlimited at normal IAM change rates
