
ETL Pipeline Strengthening Plan — Execution Evidence & Determinism

Date: 2026-02-20
Authors: Three-agent architectural review (Gemini3 perspective, Codex perspective, fresh Staff+/CISO review)
Status: APPROVED FOR IMPLEMENTATION


Executive Summary

Three independent reviews of the connector ETL pipeline reached the same verdict: the pipeline is not audit-grade deterministic end-to-end. Chain-of-custody breaks at multiple hops. The platform's stated design constraints (Deterministic, Explainable, Evidence-grade, Temporal) are not currently met.

The gaps fall into three buckets:

  1. Platform-internal (entirely our control): Identity auth is broken at ingest; ingestion pipeline has no transactional safety; evidence is overwritten not appended.
  2. Connector-level (azure-foundry): Execution evidence uses threads as a proxy for runs — fundamentally incorrect; no scan manifest; thread-agent attribution is wrong.
  3. Cross-system (SN→Azure→Foundry): The SN-to-Azure hop lacks runtime proof; two connectors have no explicit correlation; time-window handling is nondeterministic.

This document defines the complete set of fixes, ranked by leverage, organized into a phased implementation roadmap.


Part 1: Evidence Gap Registry (Consolidated — All Three Reviews)

P0 — Critical (chain-of-custody broken; platform claims are false today)

| ID | Gap | System | Where it breaks | How to verify gap exists | Minimal fix | Best fix |
|---|---|---|---|---|---|---|
| G01 | JWT decoded without signature verification | Platform auth.ts:113 | Any party can craft a JWT with arbitrary claims; submitter impersonation possible | POST with a forged JWT; observe it is accepted | Verify JWT via JWKS endpoint | Per-connector app registrations with JWKS verification |
| G02 | API-key identity collapsed to api-key-client | Platform auth.ts:55 | All API-key connectors appear as the same submitter; audit trail broken for connector attribution | Submit from two connectors using different API keys; observe both show as api-key-client in sync records | Map each API key to a unique principal ID in key registry | Per-connector service accounts with unique IDs + key rotation |
| G03 | Thread-count used as execution evidence (azure-foundry) | Connector foundry_client.py:572 | A thread is a conversation container, not an execution record; thread stuffing can fake activity | Create empty threads with no runs; observe run_count_30d increases | Fetch /threads/{id}/runs per thread | Use Foundry run-level API; filter by assistant_id |
| G04 | Thread-agent attribution counts all project threads | Connector foundry_client.py:562 | All threads in a project are credited to every agent regardless of which agent ran them | Create 2 agents, run only one; observe both show same run_count_30d | Filter threads by assistant_id | Fetch run records with assistant_id; aggregate at run level |
| G05 | No scan manifest / no graph integrity hash | Connector (missing) | No way to verify scan was complete, replay it, or detect tampering | There is no manifest file or hash; absence is the proof | Add ScanManifest dataclass with start/end time and API call count | SHA256 of canonical graph JSON + signed manifest with connector version |
| G06 | No end-to-end correlation ID (run_id not propagated) | Platform request-id.ts:9 + Connectors | Cannot trace a connector run to a platform sync to an evidence pack; audit trail has no thread | Attempt to correlate a specific connector submission to its resulting findings; no join key exists | Persist run_id/request_id on ConnectorSyncDoc | OpenTelemetry traceparent propagated from connector CLI through platform workers |
| G07 | No deterministic ServiceNow BR/SI runtime proof | Connector/ServiceNow transformer.py:1193 | Execution chains are structural inferences ("script can call endpoint"), not runtime proof ("script did call endpoint") | Review graph output; observe that BR nodes have no observed_at from outbound logs | Enable glide.rest.outbound_log_level=elevated; ingest sys_outbound_http_log | State-delta inferencing as compensating control; outbound log ingestion pipeline |

P1 — High (evidence is wrong or destructible; no idempotency)

| ID | Gap | System | Where it breaks | Minimal fix | Best fix |
|---|---|---|---|---|---|
| G08 | Execution evidence upsert overwrites prior proof | Platform execution-evidence-adapter.ts:54, schema.ts:313 | Historical provenance destroyed on each sync; evidence pack sections may reference replaced rows | Change upsert to insert-if-not-exists on (entity_id, observed_at, source_system) | Append-only execution_evidence_events collection; separate summary rollup |
| G09 | In-memory queue + non-transactional writes | Platform runtime.ts:26, sync-ingestion.ts:75 | Process crash mid-ingestion leaves partial writes committed with no rollback or resume | Move queue to MongoDB-backed collection; add stage checkpoint marker | Saga/outbox pattern with per-stage idempotency tokens |
| G10 | In-memory sync deduplication lost on restart | Platform ingest-service.ts:13 | Same graph can be ingested twice if process restarts; duplicate entities/evidence created | DB-backed dedupe keyed by (tenant_id, sync_id) | Idempotency key stored in syncs collection with status; reject duplicate sync_id |
| G11 | Evidence rows with empty entity_id | Platform graph-transformer.ts:168 | Evidence pack references evidence with no entity (referential integrity violation) | Reject evidence rows with null entity_id at ingest boundary | Quarantine to DLQ; alert on quarantine rate |
| G12 | Evidence rows with synthetic timestamps (fallback to now) | Platform graph-transformer.ts:207 | Evidence timestamps are fabricated; temporal ordering is meaningless | Reject evidence rows with missing observed_at from source | Mandatory source_hash + origin timestamp from source API response |
| G13 | Run outcome hardcoded as "success" | Connector transformer.py (azure-foundry) | Evidence packs always show success; failed/cancelled runs appear as successful execution | Read actual run status from Foundry run-level API | Include run status, failure reason, and duration in execution_evidence node |
| G14 | Principal provenance not persisted on sync records | Platform ingest-service.ts:45, syncs/types.ts:41 | Who submitted the graph is not recorded on the sync document | Add submitter_id and submitter_type fields to ConnectorSyncDoc | Add submitter_id, submitter_type, request_id, client_ip to sync record |
| G15 | ServiceNow-to-Foundry trigger correlation missing | Cross-system (both connectors) | When SN workflow triggers a Foundry agent, no edge links them; separate connectors have no join | Document gap and surface in evidence_completeness | Parse SN outbound REST logs for Foundry endpoint URLs; emit TRIGGERS_ON edges |
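
The G08 minimal fix (insert-if-not-exists keyed on entity_id, observed_at, source_system) and the separate summary rollup can be sketched as below. This is a minimal illustration, not the platform's adapter: `EvidenceLog` is a hypothetical in-memory stand-in for the execution_evidence_events collection, and the payload shape is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceLog:
    """Hypothetical in-memory stand-in for an append-only evidence collection."""

    _events: dict = field(default_factory=dict)

    def append(self, entity_id: str, observed_at: str, source_system: str, payload: dict) -> bool:
        """Insert-if-not-exists: an existing event is never overwritten."""
        key = (entity_id, observed_at, source_system)
        if key in self._events:
            return False  # prior proof stays untouched
        self._events[key] = dict(payload)
        return True

    def history(self, entity_id: str) -> list:
        """Events for one entity, oldest first.

        Assumes observed_at values are ISO 8601 UTC strings in one format,
        so lexical order matches chronological order.
        """
        rows = [(key[1], event) for key, event in self._events.items() if key[0] == entity_id]
        return sorted(rows, key=lambda row: row[0])

    def summary(self, entity_id: str) -> dict:
        """Rollup derived from the events; it can always be rebuilt from them."""
        rows = self.history(entity_id)
        return {
            "event_count": len(rows),
            "last_observed_at": rows[-1][0] if rows else None,
        }
```

Because the rollup is derived, a later sync can change the summary without destroying the rows an existing evidence pack refers to.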

P2 — Medium-High (correctness gaps; authority under-reported)

| ID | Gap | System | Impact | Fix |
|---|---|---|---|---|
| G16 | Schema drift: agent_run_summary not in platform enum | Platform evidence/types.ts vs Connector transformer.py:420 | Connector-emitted type silently accepted or dropped | Extend enum; add CI validation that connector types are registered in platform |
| G17 | Time-window nondeterminism (SN date-only cutoff) | Connector servicenow_client.py:1552 | Replay produces different results; watermark is coarse | Explicit UTC window start/end recorded per scan; store in manifest |
| G18 | Time-window nondeterminism (Foundry proxy timestamps) | Connector foundry_client.py:572 | Sliding 30-day window changes daily; replay inconsistent | Record window boundaries in scan manifest; use immutable watermark ledger |
| G19 | Group-inherited RBAC not captured | Connector (azure-foundry) | Roles granted via Entra group membership are invisible; authority is under-reported | Query Entra group memberships for managed identity; resolve group role assignments |
| G20 | Action normalization too coarse (5 values) | Connector transformer.py:539-553 | privilege_justification_gap evaluator cannot detect fine-grained mismatches | Preserve Azure-specific permission scope alongside normalized action |
| G21 | Node ID truncation collision risk (80-char limit) | Connector edge_resolver.py:186, transformer.py:530,536 | Two resources with same name in different RGs collide after truncation | Replace truncation with SHA256-based IDs (already used for workspaces) |
| G22 | Non-deterministic syncId (uuid4()) | Connector transformer.py:485 | Two scans of same tenant produce different syncIds; diff/replay not possible | Derive syncId from SHA256(tenant_id + connector_version + scan_start_time) |
| G23 | Ownership at SP level, not agent level | Connector (azure-foundry) | OWNED_BY targets managed identity SP owners, not the Entra user who deployed the agent | Capture project Contributor/Owner role holders as agent-level owners |
| G24 | No evidenceConfidence field in NormalizedGraph | Schema (missing) | All evidence displayed with equal visual weight; structural and deterministic evidence indistinguishable | Add evidenceConfidence (one of STRUCTURAL, TEMPORAL_INFERRED, DETERMINISTIC) to graph schema |
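
To illustrate the G21 fix, the sketch below contrasts truncation with a SHA256-based ID. Both function names and the `kind:digest` ID format are assumptions for illustration; the connector's actual ID scheme for workspaces is not shown in this document.

```python
import hashlib


def truncated_id(resource_path: str, limit: int = 80) -> str:
    """Truncation as described in G21 (illustrative reconstruction)."""
    return resource_path[:limit]


def hashed_id(kind: str, resource_path: str) -> str:
    """Fixed-length ID derived from the full path, so no suffix is ever lost."""
    digest = hashlib.sha256(resource_path.encode("utf-8")).hexdigest()
    return f"{kind}:{digest}"


# Two paths that differ only after the 80th character.
path_a = "x" * 80 + "/rg-a/agent"
path_b = "x" * 80 + "/rg-b/agent"

print(truncated_id(path_a) == truncated_id(path_b))  # True: the IDs collide
print(hashed_id("agent", path_a) == hashed_id("agent", path_b))  # False
```

The hashed ID is stable across scans for the same path, which the truncated form also was, but it stays unique regardless of path length.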

P3 — Known SaaS constraints (surface, not necessarily fix)

| ID | Gap | Notes |
|---|---|---|
| G25 | No outbound trace ID across SaaS boundary (SN→Azure) | Cannot force SN scripts to send traceparent. Surface as STRUCTURAL confidence. |
| G26 | Unreliable HTTP body logging from SN (PII concerns) | Customer controlled. Propose advisor engine recommendations. |
| G27 | Temporal correlation weakness (5-min sliding window SN↔Entra) | Display as dashed lines in graph UI; label as TEMPORAL_INFERRED. |
| G28 | Multi-tenant pipeline isolation not reviewed | Neither prior review examined this; needs a dedicated security review. |

Part 2: The "If You Only Fix 3 Things" — Consensus

All three reviews converge on the same top 3:

Fix 1: Run-level execution evidence in azure-foundry (G03 + G04 + G13)

Why it's #1: The current thread-counting approach produces numbers that are provably wrong for multi-agent projects and cannot prove a specific agent actually executed. This is the most egregious correctness violation because producing execution evidence is the connector's core purpose.

Exact change:

# Replace: list threads and count
# With: list threads, then for each thread fetch /threads/{id}/runs filtered by assistant_id
# Aggregate: run_count, last_run_at, tool_calls used, outcomes

Outcome: Dormant authority detection becomes correct; tool-invocation evidence becomes available; thread stuffing threat is eliminated; outcome field is accurate.
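
The aggregation step can be sketched as a pure function over run records that have already been fetched. The record shape here (`assistant_id`, `status`, `created_at`, `tool_calls`) is an assumption for illustration and is not taken from the Foundry API; fetching and pagination are out of scope.

```python
from collections import Counter


def aggregate_runs(runs: list) -> dict:
    """Aggregate run records per agent.

    Each record is assumed to look like:
    {"assistant_id": str, "status": str, "created_at": ISO 8601 UTC str,
     "tool_calls": [tool names]}
    """
    per_agent = {}
    for run in runs:
        agg = per_agent.setdefault(
            run["assistant_id"],
            {"run_count": 0, "last_run_at": None, "outcomes": Counter(), "tools": set()},
        )
        agg["run_count"] += 1
        agg["outcomes"][run["status"]] += 1  # real status, not a hardcoded "success"
        agg["tools"].update(run.get("tool_calls", []))
        # ISO 8601 UTC strings in one format compare chronologically.
        if agg["last_run_at"] is None or run["created_at"] > agg["last_run_at"]:
            agg["last_run_at"] = run["created_at"]
    return {
        agent: {
            "run_count": agg["run_count"],
            "last_run_at": agg["last_run_at"],
            "outcomes": dict(agg["outcomes"]),
            "tools": sorted(agg["tools"]),
        }
        for agent, agg in per_agent.items()
    }
```

An empty thread contributes no run records, so it cannot raise any agent's count, and an agent with no runs does not appear in the result at all.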

Fix 2: Signed scan manifest + deterministic syncId (G05 + G06 + G22)

Why it's #2: Without a manifest, the connector's output is not verifiable, not auditable, and cannot be replayed. This is a prerequisite for evidence-grade claims.

Exact change:

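# Add to cli/main.py:
# Needs: import importlib.metadata; from datetime import datetime, timezone
# sha256() returns a hex digest; canonical_json() emits sorted keys with no whitespace
manifest = ScanManifest(
    scan_id=sha256(f"{tenant_id}:{connector_version}:{started_at.isoformat()}"),
    connector_version=importlib.metadata.version("azure-foundry"),
    config_snapshot={"subscriptions": [...], "tenant_id": ...},  # no secrets
    started_at=started_at.isoformat(),
    # Timezone-aware UTC; datetime.utcnow() is naive and deprecated since Python 3.12
    completed_at=datetime.now(timezone.utc).isoformat(),
    api_calls=[{"endpoint": ..., "status": ..., "item_count": ...}],
    graph_hash=sha256(canonical_json(graph)),
)
# Embed in NormalizedGraph.metadata.scanManifest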

Outcome: Every sync is reproducible; auditors can verify graph integrity; platform can detect connector upgrades that change graph shape.
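
One way to implement the two helpers the manifest depends on is sketched below, using only the standard library. The node and edge field names (`id`, `source`, `target`, `type`) are assumptions about the graph shape, and sorting nodes and edges before hashing is an added design choice so that API pagination order does not change the hash.

```python
import hashlib
import json


def canonical_json(obj) -> str:
    """Sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def graph_hash(graph: dict) -> str:
    """Hash of the graph that is independent of key order and list order."""
    normalized = {
        **graph,
        "nodes": sorted(graph.get("nodes", []), key=lambda n: n["id"]),
        "edges": sorted(
            graph.get("edges", []),
            key=lambda e: (e["source"], e["target"], e["type"]),
        ),
    }
    return sha256_hex(canonical_json(normalized))


def scan_id(tenant_id: str, connector_id: str, connector_version: str, started_at: str) -> str:
    """Content-derived scan ID; started_at is an ISO 8601 UTC string."""
    return sha256_hex(":".join([tenant_id, connector_id, connector_version, started_at]))
```

`json.dumps` with sorted keys is adequate when values are strings, integers, and booleans. If the graph carries floats or must be hashed identically by the TypeScript platform, a stricter scheme such as RFC 8785 (JSON Canonicalization Scheme) is worth considering.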

Fix 3: Platform identity hardening at ingest (G01 + G02 + G14)

Why it's #3: JWT without signature verification is a security vulnerability entirely within our control. API-key identity collapse means we have no per-connector audit trail. These two issues mean the platform cannot reliably answer "who submitted this data."

Exact changes:

  • auth.ts:113: Verify JWT signature via JWKS endpoint before trusting claims.
  • auth.ts:55: Map each API key to a unique principal; store key→principal mapping in DB.
  • ingest-service.ts:45: Persist submitter_id, submitter_type, request_id on every ConnectorSyncDoc.

Outcome: Submitter impersonation eliminated; per-connector attribution restored; sync records form a complete audit trail of who submitted what.
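
The per-key principal mapping (G02) and the submitter fields on the sync record (G14) can be sketched as follows. This is an illustration in Python rather than the platform's TypeScript; `KeyRegistry` and `sync_provenance` are hypothetical names, the registry is in-memory where the plan calls for a DB, and JWKS signature verification (G01) is not covered because it needs a JWT library.

```python
import hashlib
import hmac


class KeyRegistry:
    """Hypothetical key registry: stores key digests, never the raw keys."""

    def __init__(self):
        self._principal_by_digest = {}

    @staticmethod
    def _digest(api_key: str) -> str:
        return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

    def register(self, api_key: str, principal_id: str) -> None:
        self._principal_by_digest[self._digest(api_key)] = principal_id

    def resolve(self, api_key: str):
        """Return the principal for this key, or None if the key is unknown."""
        presented = self._digest(api_key)
        for stored, principal in self._principal_by_digest.items():
            if hmac.compare_digest(stored, presented):
                return principal
        return None


def sync_provenance(registry: KeyRegistry, api_key: str, request_id: str) -> dict:
    """Fields to persist on the sync record for an API-key submission."""
    principal = registry.resolve(api_key)
    if principal is None:
        raise PermissionError("unknown API key")
    return {"submitter_id": principal, "submitter_type": "api_key", "request_id": request_id}
```

With this shape, two connectors using different keys can no longer both appear as api-key-client, which is the check named in the Phase 0 definition of done.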


Part 3: Phased Implementation Roadmap

Phase 0: Immediate (Days 1–3) — Unblock Basic Determinism

Goal: Stop the bleeding. Eliminate the most critical correctness and security gaps that are entirely within our control and fast to fix.

| Task | Owner | File(s) | Effort |
|---|---|---|---|
| JWT JWKS verification | Platform | auth.ts:113 | 1d |
| Per-key principal mapping | Platform | auth.ts:55 | 0.5d |
| Persist submitter fields on ConnectorSyncDoc | Platform | ingest-service.ts, syncs/types.ts | 0.5d |
| Fix thread-agent attribution (filter by assistant_id) | Connector | foundry_client.py:562 | 0.5d |
| Replace hardcoded "success" with actual run status | Connector | foundry_client.py, transformer.py | 0.5d |
| Enable SN outbound log elevation doc + advisor message | Connector | servicenow_client.py | 0.5d |
| Deterministic syncId (SHA256 of config+time) | Connector | transformer.py:485 | 0.5d |

Definition of done: Forged JWTs are rejected. Two different API keys resolve to different principal IDs. Thread attribution is correct for multi-agent projects.


Phase 1: Evidence Integrity (Weeks 1–2) — Make Output Verifiable

Goal: Every connector run produces a verifiable, reproducible artifact. The platform records who submitted what with what run identity.

| Task | Owner | File(s) | Effort |
|---|---|---|---|
| Scan manifest + SHA256 graph hash | Connector | cli/main.py, transformer.py | 2d |
| Source hashes on nodes (source_hash: SHA256(raw_api_response)) | Connector | All node emitters in transformer.py | 1d |
| evidenceConfidence field in NormalizedGraph schema | Schema | sv0-platform/src/ingestion/types.ts | 0.5d |
| Run-level execution evidence (replace thread-counting) | Connector | foundry_client.py:522-605 | 3d |
| Reject evidence rows with null entity_id | Platform | graph-transformer.ts:168 | 0.5d |
| Reject evidence rows with missing observed_at | Platform | graph-transformer.ts:207 | 0.5d |
| DB-backed sync deduplication keyed by (tenant_id, sync_id) | Platform | ingest-service.ts:13 | 1d |

Definition of done: Two runs of the same scan (same inputs and the same recorded scan start time) produce the same syncId and graph hash. Every evidence node has entity_id, observed_at, and source_hash. Null or fabricated evidence is rejected or quarantined, not silently accepted.

Invariants to assert:

  • nodesCreated + nodesUpdated + nodesUnchanged = total_discovered_nodes
  • graph_hash(run_2) == graph_hash(run_1) if no source data changed
  • evidence rows with entity_id IS NULL = 0
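
These invariants can be checked mechanically after each ingest. The sketch below assumes the ingest stage exposes its counters as a dict using the names from the list above; the function name and the evidence row shape are hypothetical.

```python
def check_ingest_invariants(
    stats: dict,
    evidence_rows: list,
    hash_run_1: str,
    hash_run_2: str,
    source_changed: bool = False,
) -> list:
    """Return a list of violated invariants; an empty list means all hold."""
    violations = []

    # Every discovered node must be accounted for exactly once.
    accounted = stats["nodesCreated"] + stats["nodesUpdated"] + stats["nodesUnchanged"]
    if accounted != stats["total_discovered_nodes"]:
        violations.append(
            f"node accounting: {accounted} != {stats['total_discovered_nodes']}"
        )

    # Unchanged source data must reproduce the same graph hash.
    if not source_changed and hash_run_1 != hash_run_2:
        violations.append("graph_hash differs between runs with unchanged source data")

    # No evidence row may lack an entity reference (None or empty string).
    null_rows = sum(1 for row in evidence_rows if not row.get("entity_id"))
    if null_rows:
        violations.append(f"{null_rows} evidence row(s) with null entity_id")

    return violations
```

Returning the violations instead of raising on the first one lets a CI job or post-ingest hook report every broken invariant in a single pass.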

Phase 2: Pipeline Idempotency (Weeks 2–4) — Survive Failures

Goal: The platform survives crashes, restarts, and duplicate submissions without corrupting state.

| Task | Owner | File(s) | Effort |
|---|---|---|---|
| Replace in-memory job queue with MongoDB-backed queue | Platform | runtime.ts:26 | 2d |
| Stage checkpoint model (mark stage as complete atomically) | Platform | sync-ingestion.ts | 2d |
| Append-only execution evidence events | Platform | execution-evidence-adapter.ts:54, schema.ts:313 | 2d |
| Separate summary rollup from raw evidence events | Platform | New collection execution_evidence_summaries | 1d |
| Tool-call invocation evidence (per tool_call in each run) | Connector | foundry_client.py | 2d |
| Time-window boundaries recorded in manifest | Connector | servicenow_client.py:1552, foundry_client.py:572 | 1d |

Definition of done: Platform can be killed and restarted mid-ingestion; on restart it resumes from the last completed stage. Re-submitting the same graph twice produces no duplicate entities or evidence.
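
The two behaviours in this definition of done, duplicate rejection and resume from the last completed stage, can be sketched with any store that offers a unique key and atomic updates. The example uses sqlite3 from the standard library purely as a stand-in for the MongoDB-backed collection; the stage names and function names are hypothetical.

```python
import sqlite3

STAGES = ("validate", "transform", "persist", "evaluate")  # hypothetical stage names


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS syncs ("
        " tenant_id TEXT NOT NULL,"
        " sync_id TEXT NOT NULL,"
        " last_stage INTEGER NOT NULL DEFAULT -1,"
        " PRIMARY KEY (tenant_id, sync_id))"
    )
    return db


def claim_sync(db: sqlite3.Connection, tenant_id: str, sync_id: str) -> bool:
    """True for a first submission; False if (tenant_id, sync_id) already exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO syncs (tenant_id, sync_id) VALUES (?, ?)",
                (tenant_id, sync_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def run_stages(db: sqlite3.Connection, tenant_id: str, sync_id: str, handlers: dict) -> list:
    """Run stages after the last checkpoint; return the names executed this call."""
    (last,) = db.execute(
        "SELECT last_stage FROM syncs WHERE tenant_id = ? AND sync_id = ?",
        (tenant_id, sync_id),
    ).fetchone()
    executed = []
    for index, name in enumerate(STAGES):
        if index <= last:
            continue  # completed before the restart
        handlers[name]()  # each handler must itself be idempotent
        with db:
            db.execute(
                "UPDATE syncs SET last_stage = ? WHERE tenant_id = ? AND sync_id = ?",
                (index, tenant_id, sync_id),
            )
        executed.append(name)
    return executed
```

If a handler raises, the checkpoint for that stage is never written, so the next call to `run_stages` starts again at the failed stage. In a real implementation the stage's own writes and its checkpoint should commit in the same transaction; otherwise a crash between the two would replay a stage whose writes already landed.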


Phase 3: Authority Completeness (Weeks 4–6) — Close the Graph

Goal: The authority graph is complete. No missing roles from group membership. No over-reported blast radius from connection over-attribution.

| Task | Owner | File(s) | Effort |
|---|---|---|---|
| Group-inherited RBAC expansion | Connector | azure_client.py | 2d |
| Fine-grained action normalization (preserve Azure permission scope) | Connector | transformer.py:539-553 | 1d |
| Node ID collision fix (SHA256-based IDs) | Connector | edge_resolver.py:186, transformer.py:530,536 | 1d |
| Agent-level ownership (Contributor/Owner on project resource) | Connector | transformer.py | 1d |
| Schema drift CI check: connector types registered in platform | Platform/CI | evidence/types.ts + connector transformer.py | 0.5d |
| Extend evidenceConfidence in NormalizedGraph schema | Schema | sv0-platform/src/ingestion/types.ts | 0.5d |
| UI: dashed lines for TEMPORAL_INFERRED edges | Platform UI | Graph Explorer | 1d |

Definition of done: Group-inherited roles appear in authority paths. No schema drift is silently ignored. Graph Explorer visually distinguishes structural vs. inferred evidence.


Phase 4: Cross-Connector Correlation (Weeks 6–8) — End-to-End Chain

Goal: ServiceNow and Azure Foundry connector outputs can be deterministically joined. A single connector run traces from trigger to evidence pack.

| Task | Owner | File(s) | Effort |
|---|---|---|---|
| OpenTelemetry traceparent propagation (connector → platform workers) | Platform + Connector | request-id.ts, all connectors, runtime.ts | 3d |
| Cross-connector dependency declaration in NormalizedGraph | Schema | sv0-platform/src/ingestion/types.ts | 1d |
| Platform enforces sync ordering from dependsOn | Platform | ingest-service.ts | 1d |
| SN outbound REST log ingestion for Foundry endpoint correlation | Connector | servicenow_client.py, transformer.py | 3d |
| TRIGGERS_ON edge type: SN flow → Foundry agent | Schema | relationship-types.ts | 0.5d |
| connectorVersion field in NormalizedGraph | Schema | sv0-platform/src/ingestion/types.ts | 0.5d |

Definition of done: Searching for a run_id traces from connector CLI invocation through platform sync to evidence pack. SN→Foundry trigger relationships appear as TRIGGERS_ON edges in the graph.


Phase 5: Hardening (Weeks 8–10) — Evidence-Grade

Goal: The pipeline is hardened against adversarial and accidental evidence fraud. An auditor can verify the complete chain independently.

| Task | Owner | Effort |
|---|---|---|
| Signed run manifests (Ed25519 private key per connector instance) | Connector | 2d |
| Immutable append-only evidence store (MongoDB TTL + no-update policy) | Platform | 2d |
| WORM retention policy on evidence collections (Azure Immutable Blob or equivalent) | Infra | 2d |
| Independent verifier service: re-derives evidence completeness from raw logs | Platform | 3d |
| Incremental sync with monotonic watermark ledger | Connector | 2d |
| Multi-tenant pipeline isolation audit | Platform | 2d |
| Rate limiting on ingest endpoint | Platform | 1d |
| Credential isolation: per-scope ClientSecretCredential | Connector | 1d |

Part 4: ServiceNow Outbound Evidence Plan

What evidence is missing today

ServiceNow Business Rules and Script Includes can call Azure APIs. The platform infers these calls from static script analysis (script contains .setValue() referencing the integration) but cannot prove the code path executed at runtime.

Missing:

  1. Outbound HTTP request log — which HTTP body was sent, to which endpoint, at what timestamp, from which SN user context
  2. Response status — did the call succeed or fail?
  3. Triggering context — what business rule / flow action initiated the outbound call
  4. User context — which SN user action triggered the workflow that led to the outbound call

Minimum configuration changes (SN admin actions)

System Property: glide.rest.outbound_log_level = elevated
Table: sys_outbound_http_log (enable; set retention = 90 days)
Integration logging: REST Message → per-message → Log = All
Outbound HTTP log viewer: System Web Services → Outbound → REST Message Log

This gives us: timestamp, URL, method, headers (no body by default due to PII), response code, response time, integration user.

For body logging (where PII policy permits):

System Property: glide.outbound_http.log.body = true

Fallback plan if SN logging cannot be raised

If the customer cannot or will not enable outbound logging:

  1. Server-side receipt logging in Azure — The Azure API gateway (APIM or custom middleware) logs every inbound request with: timestamp, source IP, Authorization header identity, request path, request size. This proves the Azure side received the call even when SN side has no log.

  2. Signed request envelopes — The SN integration script can be modified to include a nonce + HMAC signature in each outbound request. The Azure receiver verifies the signature and logs the receipt. This cryptographically links the SN script execution to the Azure receipt.

  3. State-delta inferencing — Take high-fidelity snapshots of the target state (Entra SP, Azure resources) before and after each expected execution window. If the state changed in a manner consistent with what the SN script should do, that is evidence the script ran. Label this as TEMPORAL_INFERRED confidence.

  4. Surface in evidence_completeness — When SN outbound logging is not available, the evidence pack's evidence_completeness section must explicitly state: servicenow_outbound_logs: unavailable_not_enabled with an advisor recommendation linking to the configuration doc.
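
The signed request envelope in item 2 can be sketched as an HMAC-SHA256 over the nonce, timestamp, and canonical body, with the receiver rejecting reused nonces. The envelope field names and the message layout are assumptions for illustration. The signing side would actually run in a ServiceNow server-side script; Python is used here only to show the construction and the receiver's check.

```python
import hashlib
import hmac
import json


def sign_envelope(secret: bytes, body: dict, nonce: str, timestamp: str) -> dict:
    """Build an envelope whose signature covers nonce, timestamp, and body."""
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    message = f"{nonce}.{timestamp}.{payload}".encode("utf-8")
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"nonce": nonce, "timestamp": timestamp, "body": body, "signature": signature}


def verify_envelope(secret: bytes, envelope: dict, seen_nonces: set) -> bool:
    """Receiver-side check: valid signature and a nonce not seen before."""
    if envelope["nonce"] in seen_nonces:
        return False  # replayed request
    expected = sign_envelope(
        secret, envelope["body"], envelope["nonce"], envelope["timestamp"]
    )["signature"]
    if not hmac.compare_digest(expected, envelope["signature"]):
        return False  # body, nonce, or timestamp was altered, or wrong key
    seen_nonces.add(envelope["nonce"])
    return True
```

A receiver would also want to reject timestamps outside an accepted skew and persist seen nonces durably; both are omitted to keep the sketch short. The logged receipt (nonce, timestamp, verification result) is the artifact that links the SN script execution to the Azure side.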


Part 5: Deterministic Verification Specification

Run Manifest Schema

Every connector run must produce a manifest stored alongside the NormalizedGraph payload:

interface ScanManifest {
  scan_id: string;              // SHA256(tenant_id + connector_id + connector_version + started_at)
  connector_id: string;         // e.g., "azure-foundry"
  connector_version: string;    // semver from pyproject.toml
  tenant_id: string;
  config_snapshot: {            // no secrets
    subscriptions?: string[];
    scope?: string;
    [key: string]: unknown;
  };
  started_at: string;           // ISO 8601 UTC
  completed_at: string;         // ISO 8601 UTC
  time_window?: {               // for connectors with incremental windows
    from: string;               // ISO 8601 UTC
    to: string;                 // ISO 8601 UTC
  };
  api_calls: Array<{
    endpoint: string;
    status: number;
    item_count: number;
    etag?: string;
  }>;
  errors: Array<{
    endpoint: string;
    error_type: string;
    message: string;
  }>;
  graph_stats: {
    node_count: number;
    edge_count: number;
    node_types: Record<string, number>;
    edge_types: Record<string, number>;
  };
  graph_hash: string;           // SHA256 of canonical JSON (sorted keys, no whitespace)
  manifest_signature?: string;  // Ed25519 signature of scan_id + graph_hash (Phase 5)
}

Per-Stage Invariants

| Stage | Invariant | Assert at |
|---|---|---|
| Connector output | node_count + edge_count > 0 | Pre-submission |
| Connector output | graph_hash == SHA256(canonical_json(graph)) | Pre-submission |
| Connector output | Every node has source_hash | Pre-submission |
| Platform ingest | total_nodes_processed == manifest.graph_stats.node_count | Post-ingest |
| Platform ingest | evidence_rows_with_null_entity_id == 0 | Post-ingest |
| Platform evaluate | findings_evaluated_count >= 0 | Post-evaluate |
| Evidence pack | pack.graph_hash == manifest.graph_hash | Post-build |

Auditor Replay Protocol

Given a scan_id, an auditor can:

  1. Retrieve the ConnectorSyncDoc from MongoDB using sync_id (derived from scan_id)
  2. Retrieve the ScanManifest from the sync doc's metadata.scanManifest
  3. Verify SHA256(canonical_json(stored_graph)) == manifest.graph_hash
  4. Retrieve all entities tagged with source_sync_id == scan_id and verify counts match manifest.graph_stats
  5. Retrieve all evidence packs for findings triggered by this sync; verify pack.previous_pack_id chains correctly
  6. (Phase 5) Verify manifest.manifest_signature with the connector's public key
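
Steps 3 to 5 of the protocol can be sketched as a single check over data the auditor has already retrieved. The node `type` field and the pack `pack_id` field are assumptions for illustration, the stored graph's node list stands in for the entity query in step 4, and the hash here is taken over the stored graph exactly as stored. Signature verification (step 6) is omitted because it needs an Ed25519 library.

```python
import hashlib
import json
from collections import Counter


def canonical_hash(graph: dict) -> str:
    """SHA256 of canonical JSON (sorted keys, no whitespace)."""
    text = json.dumps(graph, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def verify_replay(stored_graph: dict, manifest: dict, packs: list) -> list:
    """Return a list of failed checks; packs must be ordered oldest first."""
    failures = []

    # Step 3: the stored graph must hash to the manifest's graph_hash.
    if canonical_hash(stored_graph) != manifest["graph_hash"]:
        failures.append("graph_hash mismatch")

    # Step 4: counts must match manifest.graph_stats.
    stats = manifest["graph_stats"]
    nodes = stored_graph.get("nodes", [])
    if len(nodes) != stats["node_count"]:
        failures.append("node_count mismatch")
    if dict(Counter(node["type"] for node in nodes)) != stats["node_types"]:
        failures.append("node_types mismatch")

    # Step 5: each pack must point at its predecessor.
    previous = None
    for pack in packs:
        if pack.get("previous_pack_id") != previous:
            failures.append(f"pack chain broken at {pack['pack_id']}")
        previous = pack["pack_id"]

    return failures
```

An empty result means the stored graph, the manifest, and the pack chain agree with each other. It does not by itself prove the manifest is authentic, which is what the Phase 5 signature adds.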

Part 6: Threat Model Defenses

| Threat | Attack | Defense |
|---|---|---|
| Thread stuffing | Attacker creates empty threads to make dormant agent appear active | Require run-level evidence (threads → runs); runs must have status = "completed" |
| Phantom identity | Two projects share MI; blast radius of project A bleeds into project B | Emit WARNING annotation when multiple projects share a managed identity |
| Stale connection | Connection removed at app level but still in project config | Cross-reference connection usage with run-level tool_calls |
| Execution evidence fraud | Outputs written without upstream execution (no log chain) | Signed run manifests; server-side receipt logging in Azure as independent verification |
| Log disablement | SN outbound logs disabled post-run to cover tracks | Azure API gateway receipt logs are independent; state-delta inferencing as backup |
| Correlation ID spoofing | Attacker submits forged run_id to link to legitimate run | run_id must be derived from content (config + timestamp hash), not caller-supplied; JWKS JWT verification |
| Authority path misattribution | All connectors share one SPN → all submissions look like same submitter | Per-connector service accounts (G02); JWKS-verified per-connector JWTs (G01) |
| Replay attack | Old graph re-submitted to overwrite current state | Idempotency: DB-backed sync_id dedupe; reject duplicate sync_id within rolling 24h window |

Part 7: Gaps Neither Prior Review Addressed

These need separate work items:

  1. Multi-tenant pipeline isolation: Neither prior review examined whether tenant data can leak between connector runs in the worker pipeline. Requires a dedicated security review.
  2. Rate limiting on ingest endpoint: No protection against a connector flooding the API. Add per-tenant rate limiting.
  3. Data retention and evidence lifecycle: No policy for how long evidence is retained. GDPR/compliance implications unclear.
  4. Connector credential management: How connector credentials (client secrets, API keys) are stored, rotated, and revoked after compromise is unreviewed. Needs a secrets management runbook.
  5. UI evidence confidence rendering: ✅ Done 2026-02-20. Evidence table in Exposure Detail now shows color-coded confidence badges (DETERMINISTIC = green, TEMPORAL_INFERRED = yellow, STRUCTURAL = gray) with proof_notes tooltip. Graph edge visual differentiation (solid/dashed/dotted) is still pending.
  6. Rollback/recovery scenarios: No defined recovery path when a sync partially completes. Needs a runbook.

Summary: Scorecard Against Design Constraints

| Constraint | Current State | After Phase 0-1 | After Phase 2-3 | After Phase 4-5 |
|---|---|---|---|---|
| Deterministic | ❌ Thread counts change daily; syncId is clock-derived (not replay-safe) | ⚠️ syncId now opaque UUID (honest run ID); thread-agent outcome fixed | ✅ Scan manifests + graph hashes + time-window boundaries | ✅ Full cross-connector determinism |
| Read-only | ✅ Met today | | | |
| Explainable | ⚠️ Evidence chains have gaps; SN→Azure not proven | ⚠️ Platform-side chains complete | ⚠️ Authority paths complete | ✅ Cross-connector chains with TRIGGERS_ON |
| Temporal | ⚠️ Evidence overwrites history | | ✅ Append-only evidence events | |
| Evidence-grade | ❌ JWT not verified; identities collapsed; threads not runs | ⚠️ Identity hardened; thread-agent fixed | ✅ Run-level evidence; graph hashes | ✅ Signed manifests; WORM retention; independent verifier |