
API Data Quality Analysis: Automation Classification & Execution Evidence

Author: DEVELOPER (automation-analysis team)
Date: 2026-02-12
Context: Analysis of /api/v1/entities response structure for automation classification gaps
Dataset: 92 identity entities from http://localhost:3000/api/v1/entities


Executive Summary

Primary Finding: The API response structure conflates "no data collected" with "confirmed zero", leaves execution_mode unclassified for 29% of internal_inventory automations (22 of 77), and lacks confidence indicators. This makes downstream analysis ambiguous and prevents reliable automation risk assessment.

Critical Questions:

  1. Does execution_count_30d: 0 mean "we checked and found zero" or "we didn't check"?
  2. Is a 29% execution_mode: "unknown" rate acceptable for security-relevant automation classification?
  3. Should the platform validate/override connector-provided classifications?

Recommendation: Add data quality metadata to entity schema, expose classification confidence levels in API responses, and provide reclassification endpoints.


1. Hypothesis

The Null Ambiguity Problem:

The current entity schema uses default values (0, null, "unknown") without distinguishing data unavailability from confirmed absence. This creates three failure modes:

1.1 Semantic Overload

  • execution_count_30d: 0 could mean:
    • A) We queried sys_flow_context and found exactly 0 matching records
    • B) We skipped execution data collection (connector config/permissions)
    • C) This automation type has no deterministic execution log table (business_rule, system_execution)

1.2 Classification Gap Propagation

  • Connector returns execution_mode: "unknown" for 29% of flows
  • Platform ingests this verbatim with no validation or fallback
  • Every downstream consumer (UI, evaluator, reporting) inherits the gap

1.3 Temporal Data Quality Decay

  • No last_data_collection_timestamp field
  • No way to know if execution_count_30d: 0 reflects data from today or last month
  • Stale data presented as current

Core Issue: The schema assumes the connector is always authoritative and complete. In practice, connectors have permissions gaps, API limits, and implementation bugs.


2. API Data Quality Audit

2.1 Field-by-Field Reliability Assessment

Based on analysis of 92 identity entities (77 internal_inventory, 9 dormant_authority, 5 unknown, 1 active_external):

| Field | Populated | Null/Default | Trustworthy? | Notes |
|---|---|---|---|---|
| display_name | 100% | 0% | ✅ Yes | Always present from source system |
| status | 100% | 0% | ✅ Yes | "active" or "disabled" from source |
| identitySubtype | 100% | 0% | ✅ Yes | Deterministic from sys_class_name |
| automation_type | 100% | 0% | ✅ Yes | Derived from subtype (flow, business_rule, etc.) |
| sys_created_by | 100% | 0% | ✅ Yes | ServiceNow audit field |
| sys_updated_by | 100% | 0% | ✅ Yes | ServiceNow audit field |
| triggerTypes | ~95%* | ~5% | ⚠️ Partial | Flow trigger types extracted from sys_hub_trigger_instance |
| endpoint_url | ~15% | 85% null | ⚠️ Partial | Only populated when REST step detected in flow actions |
| last_observed_execution_timestamp | 0%** | 100% null | ❌ No | ALL internal_inventory have null (0 exec → no timestamp) |
| execution_count_30d | 100% | 0% (all zeros) | AMBIGUOUS | See §2.2 |
| execution_evidence_refs | 0% | 100% empty | AMBIGUOUS | Empty means "no executions" OR "didn't check" |
| identity_binding_status | 100% | 0% | ✅ Yes | "bound" or "unlinked" from RUNS_AS edge resolution |
| egress_host | ~15% | 85% null | ✅ Yes | Null correctly means "no external egress detected" |
| egress_base_url | ~15% | 85% null | ✅ Yes | Null correctly means "no external egress detected" |
| egress_category | 100% | 0% | ✅ Yes | "none", "internal", "external", "cloud", "llm", "unknown" |
| referenced_tables | ~90% | ~10% empty | ✅ Yes | Extracted from flow actions/triggers |
| data_domains | 100% | 0% | ⚠️ Partial | Falls back to "unknown" if table→domain mapping missing |
| ownership_status | 100% | 0% | ✅ Yes | "valid" or "orphaned" from OWNED_BY edge validation |
| risk_group | 100% | 0% | ✅ Yes | Deterministic from egress_category + data_domains |
| risk_group_label | 100% | 0% | ✅ Yes | Display label for risk_group |
| risk_group_priority | 100% | 0% | ✅ Yes | P1-P4 from risk_group |
| execution_mode | 71% | 29% "unknown" | GAP | See §4 |
| security_relevance | 100% | 0% | ⚠️ DERIVED | Computed from other fields; trustworthy IFF inputs are |

* Some flows have empty triggerTypes array → classified as unknown
** Within internal_inventory subset

2.2 The execution_count_30d: 0 Problem

Observation: ALL 77 internal_inventory entities have execution_count_30d: 0.

Three Possible Interpretations:

Interpretation A: Confirmed Zero (Optimistic)

The connector successfully queried sys_flow_context for each flow and confirmed 0 matching records in the last 30 days.

Evidence supporting:

  • Connector code has explicit discover_flow_executions() method
  • Uses two-pass approach: _get_table_count() for count, then _get_table() for evidence
  • Returns {} (empty dict) only if count query returns 0

Evidence against:

  • Zero flows with execution_count > 0 in the internal_inventory set
  • Statistically unlikely that EXACTLY ZERO of 77 flows executed in 30 days
  • Some flows are system-default ITSM workflows (Change, Incident) — these should execute
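To put a rough number on "statistically unlikely": under the simplifying assumption that each flow independently executes at least once in 30 days with the same probability p, the chance that none of n flows executes is (1 - p)^n. The function name and the sample values of p below are illustrative, not measured rates, and the sketch ignores any selection effect in how the 77-entity subset was formed.

```typescript
// Probability that none of `n` flows executes, assuming each flow
// independently executes at least once with probability `p`.
// The independence assumption and the sample values of `p` are
// illustrative only; neither is derived from the dataset.
function probAllZero(n: number, p: number): number {
  if (p < 0 || p > 1) throw new RangeError("p must be in [0, 1]");
  return Math.pow(1 - p, n);
}

const flowCount = 77;
const scenarios = [0.01, 0.05, 0.1].map((p) => ({
  p,
  probabilityAllZero: probAllZero(flowCount, p),
}));
```

At p = 0.05 the all-zero outcome has a probability of about 2%, while at p = 0.01 it is close to a coin flip, so the strength of this argument depends heavily on how active these flows are expected to be.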

Interpretation B: Data Collection Skipped (Pessimistic)

The connector didn't collect execution data for these flows (permissions/config/bug).

Evidence supporting:

  • No data_collection_timestamp field to prove recency
  • execution_evidence_refs: [] for all — no proof that query was attempted
  • Connector code has execution_data: dict[str, dict] | None = None — optional parameter

Evidence against:

  • Connector doesn't have a "skip execution collection" flag
  • Code shows execution data is collected before transform step
  • No error logs in connector output about missing permissions

Interpretation C: Heterogeneous (Most Likely)

  • Flows/Jobs: execution data WAS collected, 0 is accurate
  • Business Rules/System Execution: execution data CANNOT be collected (no deterministic log table)
    elif subtype in ("business_rule", "system_execution"):
        # No deterministic SN-side execution log for BRs/SIs
        props["last_observed_execution_timestamp"] = None
        props["execution_count_30d"] = 0
        props["execution_evidence_refs"] = []

Conclusion: The API response conflates "no execution records found" (flows/jobs) with "no execution records exist in ServiceNow" (business_rules/system_execution). There's no field indicating data availability.


3. The "0 vs null" Problem: Proposed Schema

3.1 Current Schema (Ambiguous)

interface EntityProperties {
  execution_count_30d: number; // 0 means ???
  execution_evidence_refs: string[]; // [] means ???
  last_observed_execution_timestamp: string | null; // null means ???
}

3.2 Proposed Schema (Explicit Data Quality)

interface EntityProperties {
  // Execution data
  execution_count_30d: number;
  execution_evidence_refs: string[];
  last_observed_execution_timestamp: string | null;

  // NEW: Data quality metadata
  execution_data_availability: ExecutionDataAvailability;
  execution_data_collected_at?: string; // ISO timestamp
  execution_data_source?: string; // "sys_flow_context" | "sys_trigger" | "unavailable"
  execution_data_notes?: string; // "Permissions denied" | "No execution log for business_rule"
}

type ExecutionDataAvailability =
  | "available" // Data was collected, count is accurate
  | "partial" // Data was collected but incomplete (API limit, timeout)
  | "unavailable_no_log" // Source system has no execution log for this automation type
  | "unavailable_no_access" // Connector lacks permissions to query execution logs
  | "not_collected"; // Execution data collection was skipped (connector config)
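One way the availability value could be assigned at collection time is sketched below. The CollectionOutcome type, the function name, and the idea that the connector reports a truncation flag are assumptions made for illustration; only the five availability values and the two log-less subtypes come from this document.

```typescript
type ExecutionDataAvailability =
  | "available"
  | "partial"
  | "unavailable_no_log"
  | "unavailable_no_access"
  | "not_collected";

// Hypothetical description of what happened when collection was attempted.
type CollectionOutcome =
  | { kind: "ok"; count: number; truncated: boolean }
  | { kind: "permission_denied" }
  | { kind: "skipped" };

interface ExecutionFields {
  execution_count_30d: number;
  execution_data_availability: ExecutionDataAvailability;
  execution_data_collected_at?: string;
}

// Subtypes with no deterministic execution log, per the connector excerpt in §2.2.
const SUBTYPES_WITHOUT_LOG: ReadonlySet<string> = new Set([
  "business_rule",
  "system_execution",
]);

function buildExecutionFields(
  subtype: string,
  outcome: CollectionOutcome,
  now: Date,
): ExecutionFields {
  // The count stays 0 here for schema compatibility; availability carries the meaning.
  if (SUBTYPES_WITHOUT_LOG.has(subtype)) {
    return { execution_count_30d: 0, execution_data_availability: "unavailable_no_log" };
  }
  if (outcome.kind === "ok") {
    return {
      execution_count_30d: outcome.count,
      execution_data_availability: outcome.truncated ? "partial" : "available",
      execution_data_collected_at: now.toISOString(),
    };
  }
  return {
    execution_count_30d: 0,
    execution_data_availability:
      outcome.kind === "permission_denied" ? "unavailable_no_access" : "not_collected",
  };
}
```

With this shape, a zero count is only ever paired with "available" when a query actually ran to completion and returned zero.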

3.3 Interpretation Rules

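function interpretExecutionCount(entity: EntityDoc): string {
  // Both fields live under entity.properties (typed as Record<string, unknown>)
  const { execution_count_30d, execution_data_availability: availability } =
    entity.properties as EntityProperties;

  if (availability === "available" && execution_count_30d === 0) {
    return "Confirmed zero executions in last 30 days";
  }
  if (availability === "unavailable_no_log") {
    return "Execution count unavailable (no execution log for this automation type)";
  }
  if (availability === "unavailable_no_access") {
    return "Execution count unavailable (connector lacks access to execution logs)";
  }
  if (availability === "not_collected") {
    return "Execution count not collected";
  }
  if (availability === "partial") {
    return `At least ${execution_count_30d} executions in last 30 days (partial data)`;
  }
  if (execution_count_30d > 0) {
    return `${execution_count_30d} executions in last 30 days`;
  }
  return "Unknown execution status";
}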

3.4 UI Impact

Current UI (Ambiguous):

Executions (30d): 0

Proposed UI (Explicit):

Executions (30d): 0 ✓ (verified 2026-02-12)
Executions (30d): 0 ⚠ (no execution log available)
Executions (30d): — (data not collected)
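The three renderings above could be produced by a small formatter along these lines. The ExecutionView shape, the function name, and the wording for the "partial" and "unavailable_no_access" states are assumptions; the other three strings follow the mock-up.

```typescript
type Availability =
  | "available"
  | "partial"
  | "unavailable_no_log"
  | "unavailable_no_access"
  | "not_collected";

// Hypothetical view model for the executions cell.
interface ExecutionView {
  count: number;
  availability: Availability;
  collectedAt?: string; // ISO timestamp
}

const DASH = "\u2014"; // placeholder glyph used in the mock-up when no number is shown

function formatExecutionCell(v: ExecutionView): string {
  const label = "Executions (30d): ";
  // Show only the date part of the ISO timestamp.
  const day = v.collectedAt ? v.collectedAt.slice(0, 10) : "date unknown";
  if (v.availability === "available") {
    return `${label}${v.count} ✓ (verified ${day})`;
  }
  if (v.availability === "partial") {
    return `${label}${v.count} ⚠ (partial data, may undercount)`;
  }
  if (v.availability === "unavailable_no_log") {
    return `${label}0 ⚠ (no execution log available)`;
  }
  if (v.availability === "unavailable_no_access") {
    return `${label}${DASH} (no access to execution logs)`;
  }
  return `${label}${DASH} (data not collected)`;
}
```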

4. execution_mode Gap Analysis

4.1 Current State

  • 77 internal_inventory entities
  • 45 autonomous (58%)
  • 10 operator_assisted (13%)
  • 22 unknown (29%) ← PROBLEM

4.2 Root Cause: Trigger Type Gaps

The connector classifies execution_mode based on triggerTypes:

# Connector: transformer.py lines 1159-1209
_AUTONOMOUS_TRIGGERS = {
    "record", "schedule", "event", "data_change",
    "record_create", "record_update", "record_create_or_update",
    "daily", "weekly", "run_once", "repeat",
}
_OPERATOR_ASSISTED_TRIGGERS = {"service_catalog", "email", "inbound_action"}
_HUMAN_TRIGGERED_TRIGGERS = {"ui_action", "manual"}

# Classification logic (excerpt from the classifier's function body)
if not trigger_types:
    return "unknown"  # ← GAP SOURCE 1

for tt in trigger_types:
    if tt in _AUTONOMOUS_TRIGGERS:
        return "autonomous"
    # ... check operator_assisted, human_triggered ...

return "unknown"  # ← GAP SOURCE 2: unrecognized trigger type

Gap Sources:

  1. Empty triggerTypes array — Flow has no triggers configured
  2. Unrecognized trigger types — ServiceNow emits trigger types not in allowlist

Example Entity with Gap:

{
  "display_name": "Knowledge - Approval Publish",
  "identitySubtype": "flow_designer_flow",
  "triggerTypes": ["knowledge management"], // Not in any allowlist
  "execution_mode": "unknown" // ← Classification gap
}
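Because both gap sources collapse into the same "unknown" value, it may help to have the classifier report which one applied. The sketch below is a TypeScript restatement of the connector logic with that extra information attached; the ModeResult shape and the function name are assumptions, and the order in which the elided operator_assisted and human_triggered checks run is assumed to follow the order of the set definitions.

```typescript
type ExecutionMode = "autonomous" | "operator_assisted" | "human_triggered" | "unknown";
type GapReason = "empty_trigger_types" | "unrecognized_trigger_type";

// Hypothetical result shape: the mode plus the reason for any gap.
interface ModeResult {
  mode: ExecutionMode;
  gapReason?: GapReason;
  unrecognized: string[]; // trigger types seen that matched no allowlist
}

const AUTONOMOUS = new Set([
  "record", "schedule", "event", "data_change",
  "record_create", "record_update", "record_create_or_update",
  "daily", "weekly", "run_once", "repeat",
]);
const OPERATOR_ASSISTED = new Set(["service_catalog", "email", "inbound_action"]);
const HUMAN_TRIGGERED = new Set(["ui_action", "manual"]);

function classifyExecutionMode(triggerTypes: string[]): ModeResult {
  if (triggerTypes.length === 0) {
    return { mode: "unknown", gapReason: "empty_trigger_types", unrecognized: [] };
  }
  const unrecognized: string[] = [];
  // The first trigger type that matches any allowlist decides the mode.
  for (const tt of triggerTypes) {
    if (AUTONOMOUS.has(tt)) return { mode: "autonomous", unrecognized };
    if (OPERATOR_ASSISTED.has(tt)) return { mode: "operator_assisted", unrecognized };
    if (HUMAN_TRIGGERED.has(tt)) return { mode: "human_triggered", unrecognized };
    unrecognized.push(tt);
  }
  return { mode: "unknown", gapReason: "unrecognized_trigger_type", unrecognized };
}

const example = classifyExecutionMode(["knowledge management"]);
```

For the entity above, `example.gapReason` is "unrecognized_trigger_type" and `example.unrecognized` lists the offending value, which is the input needed to decide whether the allowlist should grow.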

4.3 Is 29% Acceptable?

Arguments FOR (acceptable gap):

  • Internal inventory flows are low-priority (not displayed by default)
  • 71% classification success for security-relevant automations may be enough
  • Trigger type allowlists can be expanded iteratively

Arguments AGAINST (blocking issue):

  • execution_mode is used in findings generation and risk scoring
  • "Unknown" execution mode prevents accurate dormant authority detection
  • A 29% gap means roughly 2 in 7 flows (22 of 77) can't be properly risk-assessed
  • Gap rate may be higher for security-relevant automations (not yet tested)

Recommendation:

  • ✅ Acceptable for internal_inventory (hidden by default)
  • ❌ Blocking for dormant_authority and active_external
  • ACTION: Collect trigger type gap stats for security-relevant subset
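The gap-stats action item amounts to grouping entities by security_relevance and computing the share with execution_mode "unknown" in each group. A minimal sketch, assuming the entity properties have already been fetched into memory (the AutomationRow and GapStat shapes and the function name are hypothetical):

```typescript
// Hypothetical flattened view of the two properties needed for the stat.
interface AutomationRow {
  security_relevance: string;
  execution_mode: string;
}

interface GapStat {
  total: number;
  unknown: number;
  rate: number; // unknown / total
}

function gapRateByRelevance(rows: AutomationRow[]): Record<string, GapStat> {
  const stats: Record<string, GapStat> = {};
  for (const row of rows) {
    const s = stats[row.security_relevance] ?? { total: 0, unknown: 0, rate: 0 };
    s.total += 1;
    if (row.execution_mode === "unknown") s.unknown += 1;
    s.rate = s.unknown / s.total;
    stats[row.security_relevance] = s;
  }
  return stats;
}
```

Run against the full 92-entity dataset, this would show whether the dormant_authority and active_external groups have a higher or lower gap rate than the 22 of 77 measured for internal_inventory.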

4.4 Should Platform Override Connector Classification?

Current: Connector is authoritative. Platform ingests execution_mode verbatim.

Option A: Platform Fallback (Conservative)

// During ingestion, apply fallbacks only when the connector could not classify
if (properties.execution_mode === "unknown") {
  if (properties.identitySubtype === "business_rule") {
    properties.execution_mode = "autonomous"; // Business rules always run autonomously
  }
  if (properties.identitySubtype === "system_execution") {
    properties.execution_mode = "autonomous"; // Script includes are autonomous
  }
  // For flows, keep "unknown" → user must manually classify
}

Option B: Platform Re-Classification (Aggressive)

// Platform computes execution_mode from trigger types + subtype + egress signals
// Ignore connector value entirely
// Pro: Single source of truth, consistent across connectors
// Con: Duplicates connector logic, divergence risk

Option C: Platform Validation + Override Flag (Hybrid)

interface EntityProperties {
  execution_mode: ExecutionMode;
  execution_mode_source: "connector" | "platform_override" | "user_override";
  execution_mode_confidence: "high" | "low" | "unknown";
}

// Platform validates connector value, flags low-confidence classifications
if (properties.execution_mode === "unknown" && properties.identitySubtype === "business_rule") {
  properties.execution_mode = "autonomous";
  properties.execution_mode_source = "platform_override";
  properties.execution_mode_confidence = "high";
}

Recommendation: Option C (validation + metadata). Preserves connector authority while providing quality guardrails.


5. API Improvement Proposals

5.1 Automation Summary Endpoint

Current Gap: No way to get aggregate automation stats without fetching all entities.

Proposed Endpoint: GET /api/v1/automations/summary

interface AutomationSummaryResponse {
total_count: number;
by_subtype: Record<string, number>; // "flow_designer_flow": 83
by_execution_mode: Record<string, number>; // "autonomous": 45, "unknown": 22
by_security_relevance: Record<string, number>; // "internal_inventory": 77
by_egress_category: Record<string, number>; // "none": 77, "external": 5
with_execution_evidence: number; // count where execution_count_30d > 0
with_identity_binding: number; // count where identity_binding_status == "bound"
classification_gaps: {
execution_mode_unknown: number; // 22
trigger_types_empty: number; // count where triggerTypes == []
execution_data_unavailable: number; // count where execution_data_availability != "available"
};
}

Use Cases:

  • Dashboard: show automation inventory at a glance
  • Data quality monitoring: track classification gap rate over time
  • Connector validation: confirm execution data collection success

Effort: 4-6 hours (new route + aggregation pipeline)
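As a reference for the intended semantics of each counter, the summary can be expressed as an in-memory reduction over entity properties. This is a sketch, not the proposed aggregation pipeline; the AutomationProps shape and the helper names are assumptions, and entities with no execution_data_availability field are counted as unavailable.

```typescript
// Hypothetical subset of entity properties used by the summary.
interface AutomationProps {
  identitySubtype: string;
  execution_mode: string;
  security_relevance: string;
  egress_category: string;
  execution_count_30d: number;
  identity_binding_status: string;
  triggerTypes?: string[];
  execution_data_availability?: string;
}

interface AutomationSummary {
  total_count: number;
  by_subtype: Record<string, number>;
  by_execution_mode: Record<string, number>;
  by_security_relevance: Record<string, number>;
  by_egress_category: Record<string, number>;
  with_execution_evidence: number;
  with_identity_binding: number;
  classification_gaps: {
    execution_mode_unknown: number;
    trigger_types_empty: number;
    execution_data_unavailable: number;
  };
}

// Count items per distinct value returned by `pick`.
function countBy<T>(items: T[], pick: (item: T) => string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const key = pick(item);
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

function summarize(items: AutomationProps[]): AutomationSummary {
  const count = (pred: (p: AutomationProps) => boolean): number => items.filter(pred).length;
  return {
    total_count: items.length,
    by_subtype: countBy(items, (p) => p.identitySubtype),
    by_execution_mode: countBy(items, (p) => p.execution_mode),
    by_security_relevance: countBy(items, (p) => p.security_relevance),
    by_egress_category: countBy(items, (p) => p.egress_category),
    with_execution_evidence: count((p) => p.execution_count_30d > 0),
    with_identity_binding: count((p) => p.identity_binding_status === "bound"),
    classification_gaps: {
      execution_mode_unknown: count((p) => p.execution_mode === "unknown"),
      trigger_types_empty: count((p) => p.triggerTypes?.length === 0),
      // A missing availability field is treated as "not available".
      execution_data_unavailable: count((p) => p.execution_data_availability !== "available"),
    },
  };
}
```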


5.2 Classification Override/Reclassification Endpoint

Current Gap: No way to manually override execution_mode or security_relevance when connector gets it wrong.

Proposed Endpoint: PATCH /api/v1/entities/:id/classification

interface ClassificationOverrideRequest {
execution_mode?: ExecutionMode;
security_relevance?: SecurityRelevance;
override_reason?: string; // Required when overriding connector value
}

interface ClassificationOverrideResponse {
entity_id: string;
previous_classification: {
execution_mode: string;
execution_mode_source: string;
};
updated_classification: {
execution_mode: string;
execution_mode_source: "user_override";
override_reason: string;
overridden_at: string;
overridden_by: string; // user ID from auth context
};
}

Workflow:

  1. User views entity detail page
  2. Sees execution_mode: "unknown" with confidence indicator
  3. Clicks "Manually Classify"
  4. Selects "Autonomous" from dropdown, provides reason: "Business rule always runs on record insert"
  5. API updates entity properties + creates audit event
  6. Evaluator re-runs on next sync to pick up classification change

Persistence:

interface EntityProperties {
execution_mode: ExecutionMode;
execution_mode_source: "connector" | "platform_override" | "user_override";
execution_mode_override_reason?: string;
execution_mode_overridden_at?: string;
execution_mode_overridden_by?: string;
}

Sync Behavior:

  • Next connector sync should NOT overwrite user override
  • Add protected_fields: string[] to entity metadata
  • During ingestion, skip updates to protected fields unless the connector's value has changed since the override was recorded

Effort: 8-12 hours (endpoint + UI + sync protection logic)
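The sync-protection rule can be sketched as a merge step that lets connector values through for every field except those with a recorded override. How to treat a protected field whose connector value has since changed is a design decision this document leaves open; the sketch below assumes the override is kept and the field is reported for re-review. The MergeResult shape and the function name are hypothetical; the original_value and override_value fields mirror the FieldOverride proposal in §7.2.

```typescript
type Props = Record<string, unknown>;

// Subset of the FieldOverride record proposed in §7.2.
interface FieldOverride {
  original_value: unknown; // connector value at the time of the override
  override_value: unknown;
}

interface MergeResult {
  merged: Props;
  // Protected fields whose connector value differs from the one that was overridden.
  needsReview: string[];
}

function mergeConnectorProps(
  existing: Props,
  incoming: Props,
  overrides: Record<string, FieldOverride>,
): MergeResult {
  // Connector values win for every unprotected field.
  const merged: Props = { ...existing, ...incoming };
  const needsReview: string[] = [];
  for (const field of Object.keys(overrides)) {
    const ov = overrides[field];
    if (!ov) continue;
    merged[field] = ov.override_value; // the user override survives the sync
    // Strict equality is enough for scalar classification values.
    if (field in incoming && incoming[field] !== ov.original_value) {
      needsReview.push(field);
    }
  }
  return { merged, needsReview };
}
```

Keeping the override while surfacing the change avoids silently discarding a manual classification, at the cost of requiring someone to act on the review list.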


5.3 Data Quality Indicators per Entity

Current Gap: No visibility into which entity fields are trustworthy vs. defaulted/stale.

Proposed Addition: data_quality metadata in entity response

interface EntityDoc {
// ... existing fields ...
data_quality: DataQualityReport;
}

interface DataQualityReport {
overall_score: number; // 0-100, weighted sum of field confidence
field_confidence: Record<string, FieldConfidence>;
warnings: string[]; // ["execution_mode classification unknown", "no execution data collected"]
last_validated_at?: string;
}

interface FieldConfidence {
level: "high" | "medium" | "low" | "unavailable";
source: "connector" | "platform_derived" | "user_override" | "default";
collected_at?: string;
notes?: string;
}

// Example
{
"data_quality": {
"overall_score": 72,
"field_confidence": {
"execution_mode": {
"level": "low",
"source": "connector",
"notes": "Unrecognized trigger type 'knowledge management'"
},
"execution_count_30d": {
"level": "high",
"source": "connector",
"collected_at": "2026-02-12T14:23:00Z"
},
"egress_category": {
"level": "high",
"source": "platform_derived",
"notes": "Derived from endpoint_url analysis"
}
},
"warnings": [
"execution_mode classification unknown - manual review recommended",
"No execution evidence in last 30 days"
]
}
}

UI Impact:

  • Entity detail page shows data quality score badge (🟢 High / 🟡 Medium / 🔴 Low)
  • Field-level tooltips explain confidence level
  • Warnings surface in "Data Quality" tab

Effort: 12-16 hours (schema extension + computation logic + UI)


5.4 Filter: classification_status=incomplete

Current Gap: No way to find entities that need manual review/classification.

Proposed Query Parameter: GET /api/v1/entities?classification_status=incomplete

Logic:

function isClassificationIncomplete(entity: EntityDoc): boolean {
  return (
    entity.properties.execution_mode === "unknown" ||
    entity.properties.security_relevance === "unknown" ||
    entity.properties.execution_data_availability === "not_collected" ||
    (entity.properties.triggerTypes?.length === 0 && entity.properties.identitySubtype === "flow_designer_flow")
  );
}

Use Cases:

  • Connector validation: "Show me all automations with classification gaps"
  • User task list: "Review these 22 flows with unknown execution mode"
  • Data quality dashboard: "Incomplete classifications: 22 of 92 (24%)"

Implementation:

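// In MongoStorageAdapter.queryEntities()
if (query.classificationStatus === "incomplete") {
  filter.$or = [
    { "properties.execution_mode": "unknown" },
    { "properties.security_relevance": "unknown" },
    { "properties.execution_data_availability": "not_collected" },
    // Mirrors the empty-triggerTypes clause in isClassificationIncomplete()
    {
      "properties.identitySubtype": "flow_designer_flow",
      "properties.triggerTypes": { $size: 0 },
    },
  ];
}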

Effort: 2-3 hours (query parameter + filter logic)


6. Collaboration with INTEGRATOR

6.1 Question: Is It Suspicious That All 77 Entities Have execution_count_30d = 0?

Data Point: 77 internal_inventory flows, 100% have execution_count_30d: 0.

Possible Explanations:

A. Accurate Reflection of Reality

  • These are template flows, system-default workflows, or disabled automations
  • They genuinely have not executed in the last 30 days
  • The connector correctly queried sys_flow_context and found 0 matching records

B. Data Collection Issue

  • Connector has a bug in execution data collection
  • Permissions issue: can't read sys_flow_context table
  • API limit: only fetched execution data for first N flows, rest defaulted to 0

C. Classification Filter Bias

The security_relevance classification logic is:

if has_external_egress and exec_count > 0:
    props["security_relevance"] = "active_external"
elif has_external_egress or binding == "bound":
    props["security_relevance"] = "dormant_authority"
elif exec_count > 0:
    props["security_relevance"] = "dormant_authority"
else:
    props["security_relevance"] = "internal_inventory"

By definition, anything in internal_inventory MUST have exec_count == 0: any entity with exec_count > 0 is classified as active_external (with external egress) or dormant_authority (without).

So the question becomes: Are there ANY flows in the full 92-entity dataset with execution_count_30d > 0?
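The selection effect can be checked mechanically by porting the classification branches and enumerating the input combinations. The function below is a direct TypeScript restatement of the connector logic shown above; the function name, parameter names, and sample execution counts are chosen for illustration.

```typescript
type Relevance = "active_external" | "dormant_authority" | "internal_inventory";

// Same branch order as the connector's security_relevance logic.
function classifyRelevance(
  hasExternalEgress: boolean,
  bound: boolean,
  execCount: number,
): Relevance {
  if (hasExternalEgress && execCount > 0) return "active_external";
  if (hasExternalEgress || bound) return "dormant_authority";
  if (execCount > 0) return "dormant_authority";
  return "internal_inventory";
}

interface InputCase {
  egress: boolean;
  bound: boolean;
  execCount: number;
}

// Collect every input combination that lands in internal_inventory.
const internalCases: InputCase[] = [];
for (const egress of [false, true]) {
  for (const bound of [false, true]) {
    for (const execCount of [0, 1, 5]) {
      if (classifyRelevance(egress, bound, execCount) === "internal_inventory") {
        internalCases.push({ egress, bound, execCount });
      }
    }
  }
}
```

Only the combination with no external egress, no binding, and a zero count ends up in `internalCases`, so a 100% zero rate inside internal_inventory says nothing about whether collection works.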

INTEGRATOR Action Items:

  1. Run connector with debug logging: confirm execution data collection was attempted for all flows
  2. Check sys_flow_context table permissions: can the OAuth integration read it?
  3. Manually query sys_flow_context in ServiceNow for 2-3 sample flows: confirm 0 records exist
  4. Check for flows in dormant_authority or active_external with execution_count_30d > 0 → proves collection works

Expected Outcome:

  • If collection works: some flows should have exec_count > 0
  • If all 92 entities have exec_count=0: connector bug or permissions issue
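For the manual spot check in action item 3, a count can be requested through ServiceNow's Aggregate API (`/api/now/stats/{table}` with `sysparm_count=true`). The sketch below only builds the request URL and parses the response body. The instance URL is a placeholder, the function names are hypothetical, and the `flow` and `sys_created_on` field names in the encoded query are assumptions based on the "flow reference" join described in this document; verify them against the instance's table schema before relying on the result.

```typescript
// ServiceNow sys_ids are 32-character hexadecimal strings.
const SYS_ID_PATTERN = /^[0-9a-f]{32}$/;

function buildFlowExecutionCountUrl(
  instanceUrl: string,
  flowSysId: string,
  days: number = 30,
): string {
  if (!SYS_ID_PATTERN.test(flowSysId)) throw new Error("flowSysId must be a 32-char hex sys_id");
  if (!Number.isInteger(days) || days <= 0) throw new Error("days must be a positive integer");
  // Assumed field names: `flow` (reference to the flow) and `sys_created_on`.
  const query = `flow=${flowSysId}^sys_created_on>=javascript:gs.daysAgoStart(${days})`;
  const base = instanceUrl.replace(/\/+$/, "");
  return (
    `${base}/api/now/stats/sys_flow_context` +
    `?sysparm_count=true&sysparm_query=${encodeURIComponent(query)}`
  );
}

// Returns the count, or null when the body does not contain a usable count.
// Returning null (not 0) keeps "could not determine" distinct from "zero".
function parseStatsCount(body: unknown): number | null {
  const count = (body as { result?: { stats?: { count?: unknown } } } | null)?.result?.stats
    ?.count;
  if (typeof count === "number") {
    return Number.isInteger(count) && count >= 0 ? count : null;
  }
  if (typeof count === "string" && /^\d+$/.test(count)) {
    return Number(count);
  }
  return null;
}
```

A 403 or an empty result from this request would point to Explanation B (permissions), while a non-zero count for a flow the connector reported as 0 would point to a collection bug.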

6.2 Question: Can Execution Data Collection Be Improved?

Current Limitations:

| Automation Type | Execution Log Table | Deterministic Join? | Supported? |
|---|---|---|---|
| Flow Designer Flow | sys_flow_context | Yes (flow reference) | ✅ Yes |
| Scheduled Job | sys_trigger | Yes (document reference) | ✅ Yes |
| Business Rule | ❌ None | N/A | ❌ No |
| System Execution (Script Include) | ❌ None | N/A | ❌ No |

Potential ServiceNow APIs to Explore:

  1. syslog Table (sys_log)

    • Generic execution log for scripts, business rules, scheduled jobs
    • Contains: timestamp, source (script name), message, level
    • Join: fuzzy match on source field (not deterministic)
    • Risk: high false positive rate, noise from unrelated logs
  2. System Execution Tracker (sys_execution_tracker)

    • Tracks long-running jobs and async operations
    • May contain business rule executions if they take >N seconds
    • Join: source_table + source field
    • Risk: only captures slow executions, not representative
  3. Table History (sys_audit)

    • Tracks record changes (insert, update, delete)
    • Business rules execute on these events
    • Indirect signal: if sys_audit shows record changes on tables that have business_rule triggers, infer execution
    • Risk: correlation, not causation
  4. Flow Designer Execution Context (sys_hub_action_instance)

    • Granular action-level execution log (individual steps within a flow)
    • Join: flow reference
    • Benefit: proves flow executed AND which steps ran (egress actions)
    • Connector currently uses sys_flow_context (flow-level) — sys_hub_action_instance is more detailed

INTEGRATOR Recommendations:

  1. Priority 1: Validate sys_flow_context collection is working (see §6.1)
  2. Priority 2: Explore sys_hub_action_instance for action-level execution evidence
  3. Priority 3: Research sys_log for business_rule execution inference (high effort, low confidence)

If execution data is truly unavailable for business_rules:

  • Set execution_data_availability: "unavailable_no_log" explicitly
  • Update UI to show "Execution count unavailable for this automation type"
  • Don't default to 0 — use null or a sentinel value (-1)

7. Schema Enhancement Proposals

7.1 Connector-Side Schema (NormalizedNode)

File: /Users/lucky/dev/securityv0/sv0-platform/src/ingestion/types.ts

Current:

export interface NormalizedNode {
nodeId: string;
nodeType: NormalizedNodeType;
sourceSystem: string;
sourceId: string;
displayName: string;
status: NodeStatus;
createdAt?: string;
lastModifiedAt?: string;
properties: Record<string, unknown>; // ← Unstructured
}

Proposed Addition (Automation Properties):

// New type for automation-specific properties
export interface AutomationProperties {
// Existing fields
identitySubtype: IdentitySubtype;
automation_type: string;
triggerTypes?: string[];
endpoint_url?: string | null;

// Execution evidence
execution_count_30d: number;
execution_evidence_refs: string[];
last_observed_execution_timestamp?: string | null;

// NEW: Data quality metadata
execution_data_availability: ExecutionDataAvailability;
execution_data_collected_at?: string;
execution_data_source?: string;
execution_data_notes?: string;

// Classification
execution_mode: ExecutionMode;
execution_mode_confidence: "high" | "low" | "unknown";
security_relevance: SecurityRelevance;

// Egress
egress_category: EgressCategory;
egress_host?: string | null;
egress_base_url?: string | null;

// Identity binding
identity_binding_status: "bound" | "unlinked";

// Risk assessment
risk_group: string;
risk_group_label: string;
risk_group_priority: string;
ownership_status: "valid" | "orphaned";

// Referenced data
referenced_tables?: string[];
data_domains?: string[];
}

export type ExecutionMode = "autonomous" | "operator_assisted" | "human_triggered" | "unknown";
export type SecurityRelevance = "active_external" | "dormant_authority" | "internal_inventory" | "unknown";
export type EgressCategory = "none" | "internal" | "external" | "cloud" | "llm" | "unknown";
export type ExecutionDataAvailability =
| "available"
| "partial"
| "unavailable_no_log"
| "unavailable_no_access"
| "not_collected";
export type IdentitySubtype =
| "flow_designer_flow"
| "business_rule"
| "scheduled_job"
| "system_execution"
| "oauth_app"
| "service_principal";

Migration Strategy:

  1. Add types to ingestion/types.ts
  2. Connector already emits these properties (they're in properties: Record<string, unknown>)
  3. Platform ingestion validates against type (runtime check, not compile-time)
  4. UI can now type-safely access entity.properties.execution_mode as ExecutionMode

Effort: 2-3 hours (type definitions + validation)
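Step 3 of the migration (runtime validation) could look roughly like the following. The literal value lists are taken from the type definitions above; the helper names and the choice to return a list of problems instead of throwing are assumptions.

```typescript
const EXECUTION_MODES = ["autonomous", "operator_assisted", "human_triggered", "unknown"] as const;
const SECURITY_RELEVANCE = ["active_external", "dormant_authority", "internal_inventory", "unknown"] as const;
const EGRESS_CATEGORIES = ["none", "internal", "external", "cloud", "llm", "unknown"] as const;

type ExecutionModeValue = (typeof EXECUTION_MODES)[number];

// Type guard: narrows `value` to one of the allowed string literals.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

// Returns human-readable problems; an empty array means the checked fields are valid.
function validateAutomationProps(props: Record<string, unknown>): string[] {
  const problems: string[] = [];
  if (!isOneOf(EXECUTION_MODES, props["execution_mode"])) {
    problems.push("execution_mode is missing or not a known value");
  }
  if (!isOneOf(SECURITY_RELEVANCE, props["security_relevance"])) {
    problems.push("security_relevance is missing or not a known value");
  }
  if (!isOneOf(EGRESS_CATEGORIES, props["egress_category"])) {
    problems.push("egress_category is missing or not a known value");
  }
  const count = props["execution_count_30d"];
  if (typeof count !== "number" || !Number.isInteger(count) || count < 0) {
    problems.push("execution_count_30d must be a non-negative integer");
  }
  return problems;
}

// Typed accessor for consumers: falls back to "unknown" on invalid input.
function readExecutionMode(props: Record<string, unknown>): ExecutionModeValue {
  const value = props["execution_mode"];
  return isOneOf(EXECUTION_MODES, value) ? value : "unknown";
}
```

Deriving the union type from the `as const` array keeps the runtime check and the compile-time type from drifting apart.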


7.2 Platform-Side Schema (EntityDoc)

File: /Users/lucky/dev/securityv0/sv0-platform/src/domain/entities/types.ts

Current:

export interface EntityDoc {
_id: string;
tenant_id: string;
entity_type: EntityType;
source_system: string;
source_id: string;
properties: Record<string, unknown>; // ← Unstructured
relationships: EntityRelationship[];
execution_paths?: ExecutionPath[];
accessible_by?: AccessibleByEntry[];
sync_version: number;
last_synced_at: Date;
created_at: Date;
updated_at: Date;
}

Proposed Addition:

export interface EntityDoc {
// ... existing fields ...

// NEW: Data quality metadata
data_quality?: DataQualityReport;

// NEW: User overrides
user_overrides?: UserOverrideMetadata;
}

export interface DataQualityReport {
overall_score: number; // 0-100
field_confidence: Record<string, FieldConfidence>;
warnings: string[];
last_validated_at?: Date;
}

export interface FieldConfidence {
level: "high" | "medium" | "low" | "unavailable";
source: "connector" | "platform_derived" | "user_override" | "default";
collected_at?: Date;
notes?: string;
}

export interface UserOverrideMetadata {
protected_fields: string[]; // Fields that won't be overwritten by connector sync
overrides: Record<string, FieldOverride>;
}

export interface FieldOverride {
field_name: string;
original_value: unknown;
override_value: unknown;
override_reason: string;
overridden_at: Date;
overridden_by: string; // user ID
}

Computation Logic (during ingestion):

// In ingestion/normalizer.ts
function computeDataQuality(entity: EntityDoc): DataQualityReport {
  const confidence: Record<string, FieldConfidence> = {};
  const warnings: string[] = [];

  if (entity.properties.execution_mode === "unknown") {
    confidence.execution_mode = {
      level: "low",
      source: "connector",
      notes: "Trigger type not recognized by connector"
    };
    warnings.push("execution_mode classification unknown - manual review recommended");
  } else {
    confidence.execution_mode = {
      level: "high",
      source: "connector",
      collected_at: new Date(entity.last_synced_at)
    };
  }

  if (entity.properties.execution_count_30d === 0 && !entity.properties.execution_data_availability) {
    confidence.execution_count_30d = {
      level: "medium",
      source: "connector",
      notes: "Zero count, but availability status unknown"
    };
    warnings.push("Execution count is 0 - unclear if data was collected");
  }

  // ... more field checks ...

  const overall_score = computeOverallScore(confidence);

  return {
    overall_score,
    field_confidence: confidence,
    warnings,
    last_validated_at: new Date()
  };
}
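The computeOverallScore helper is referenced above but not defined in this document. One possible shape, consistent with the "0-100, weighted sum of field confidence" description, is a weighted average of per-level points. The point values, the default weight of 1, and the score of 0 for an empty report are all assumptions, not agreed numbers.

```typescript
type ConfidenceLevel = "high" | "medium" | "low" | "unavailable";

interface FieldConfidenceEntry {
  level: ConfidenceLevel;
  source: string;
  notes?: string;
}

// Assumed point values per confidence level (not specified in this document).
const LEVEL_POINTS: Record<ConfidenceLevel, number> = {
  high: 100,
  medium: 60,
  low: 25,
  unavailable: 0,
};

// Weighted average of per-field points, rounded to an integer in [0, 100].
// `weights` is optional; fields without an explicit weight count as 1.
function computeOverallScore(
  confidence: Record<string, FieldConfidenceEntry>,
  weights: Record<string, number> = {},
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const field of Object.keys(confidence)) {
    const entry = confidence[field];
    if (!entry) continue;
    const weight = weights[field] ?? 1;
    weighted += weight * LEVEL_POINTS[entry.level];
    totalWeight += weight;
  }
  // An empty report scores 0 so that "nothing assessed" never looks healthy.
  return totalWeight === 0 ? 0 : Math.round(weighted / totalWeight);
}
```

Passing a larger weight for security-critical fields such as execution_mode lets a single low-confidence classification pull the badge down further than a gap in a display-only field.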

Effort: 8-12 hours (schema + computation + storage)


7.3 MongoDB Index Additions

File: /Users/lucky/dev/securityv0/sv0-platform/src/storage/mongo/collections.ts

Proposed Indexes:

// For classification_status filter
await entities.createIndex({
tenant_id: 1,
"properties.execution_mode": 1,
"properties.security_relevance": 1
});

// For data quality queries
await entities.createIndex({
tenant_id: 1,
"data_quality.overall_score": 1
});

// For user override tracking
await entities.createIndex({
tenant_id: 1,
"user_overrides.protected_fields": 1
});

Effort: 1 hour (index creation + migration script)


8. Pre-Ingest Filter Analysis

8.1 Current State

Connector Code: transformer.py lines 103-111

# Optionally filter internal_inventory automations (connector-side pre-filter).
# Default OFF to preserve Phase 1 inventory completeness gate.
if filter_internal_inventory:
    filtered_count = self._filter_internal_inventory()
    if filtered_count > 0:
        logging.getLogger(__name__).info(
            "Filtered %d internal_inventory automation(s) from NormalizedGraph output",
            filtered_count,
        )

Filter Logic: Lines 1211-1260

def _filter_internal_inventory(self) -> int:
    """Remove internal_inventory automation nodes and their orphaned edges/owner nodes.

    Filtering criteria: security_relevance == "internal_inventory" means:
    - egress_category in (none, internal, unknown)
    - identity_binding_status == "unlinked"
    - execution_count_30d == 0
    """
    # Find automation node IDs to remove
    remove_node_ids: set[str] = set()
    for node in self._nodes:
        if node.get("nodeType") == "autonomous_identity":
            rel = node.get("properties", {}).get("security_relevance")
            if rel == "internal_inventory":
                remove_node_ids.add(node["nodeId"])

    # Remove nodes
    self._nodes = [n for n in self._nodes if n["nodeId"] not in remove_node_ids]

    # Remove orphaned edges
    self._edges = [
        e for e in self._edges
        if e["sourceNodeId"] not in remove_node_ids
        and e["targetNodeId"] not in remove_node_ids
    ]

    # Remove orphaned owner nodes (OWNED_BY targets with no other edges)
    # ... (omitted for brevity)

    return len(remove_node_ids)

8.2 Question: Should Filtering Happen Pre-Ingest or Post-Ingest?

Option A: Pre-Ingest (Current Implementation, Disabled by Default)

Pros:

  • Reduces entity count before platform ingestion (lower storage, faster queries)
  • Simplifies platform by not storing irrelevant data
  • Graph layout is immediately clean (no 77 internal_inventory nodes)

Cons:

  • Inventory incompleteness — can't retroactively include entities if criteria change
  • Audit gap — no record that these automations exist in the source system
  • Temporal loss — can't track when internal_inventory automations become security-relevant
  • Discovery validation impossible — can't prove connector scanned all flows if some are filtered out

Option B: Post-Ingest (Platform Filters)

Pros:

  • Complete inventory — every discovered automation is stored
  • Temporal tracking — can see when execution_count changes from 0 → N (dormant → active)
  • Audit trail — proves connector scanned all entities, none were lost
  • Flexible filtering — UI can show/hide internal_inventory on demand
  • Reclassification — if connector gets security_relevance wrong, platform can override

Cons:

  • Higher entity count (92 instead of ~15)
  • Requires UI/API default filters to hide noise
  • Graph layout requires filtering logic

Recommendation: Option B (Post-Ingest with Default Filters)

Rationale:

  1. Discovery is broad, analysis is narrow — connector should discover everything, platform should filter for security relevance
  2. Temporal use case — a flow with 0 executions today may have 10 executions tomorrow. If it's filtered pre-ingest, we lose that transition.
  3. Audit/compliance — "How many automations exist in ServiceNow?" should be 92, not 15
  4. Data quality validation — can compare connector output to manual ServiceNow queries only if all entities are ingested

Implementation:

  • Keep filter_internal_inventory: bool = False (default OFF in connector)
  • Add default filter to platform API: GET /api/v1/entities?entity_type=identity&security_relevance!=internal_inventory
  • Add default filter to UI Automations page
  • Graph browse mode defaults to same filter (see automation-filtering-graph-strategy.md §S1)
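On the platform side, the default filter could be applied when the storage query is built, so that internal_inventory entities stay in the database but are excluded unless the caller asks for them. The EntityQuery shape, the includeInternalInventory flag, and the function name are assumptions; the field paths follow the EntityDoc schema in §7.2.

```typescript
// Hypothetical parsed query parameters for GET /api/v1/entities.
interface EntityQuery {
  entityType?: string;
  securityRelevance?: string; // explicit value requested by the caller
  includeInternalInventory?: boolean; // opt-in to the full inventory
}

type MongoFilter = Record<string, unknown>;

function buildEntityFilter(tenantId: string, query: EntityQuery): MongoFilter {
  const filter: MongoFilter = { tenant_id: tenantId };
  if (query.entityType) {
    filter["entity_type"] = query.entityType;
  }
  if (query.securityRelevance) {
    // An explicit request always wins, including for internal_inventory.
    filter["properties.security_relevance"] = query.securityRelevance;
  } else if (!query.includeInternalInventory) {
    // Default view: hide internal_inventory. Note that $ne also matches
    // documents where the field is absent, so unclassified entities stay visible.
    filter["properties.security_relevance"] = { $ne: "internal_inventory" };
  }
  return filter;
}

const defaultView = buildEntityFilter("tenant-1", { entityType: "identity" });
const fullInventory = buildEntityFilter("tenant-1", {
  entityType: "identity",
  includeInternalInventory: true,
});
```

Because the exclusion lives in the query and not in the connector, the audit question "how many automations exist?" can still be answered from the same collection by passing the opt-in flag.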

Migration Path:

  • Currently deployed: internal_inventory entities ARE ingested (filter is off)
  • No migration needed
  • Just add default filters to API/UI

8.3 When Should Pre-Ingest Filtering Be Used?

Valid Use Cases:

  1. Connector has a bug that discovers non-existent/duplicate entities → filter in connector until bug is fixed
  2. Source system permissions limit — can't query execution data for some entities → filter them to avoid misleading 0 counts
  3. Scale issues — 10,000+ automations discovered, platform can't handle load → filter to top N by relevance

Invalid Use Cases:

  1. Hiding false positives — this should be done via UI filters, not pre-ingest
  2. Improving graph layout — this is a UI problem, not a data problem
  3. "Cleaning up" the inventory — defeats the purpose of deterministic discovery

Recommendation for sv0-connectors:

  • Remove filter_internal_inventory parameter entirely (simplifies connector interface)
  • Always emit all discovered entities
  • Let platform handle relevance filtering

9. Challenge Questions for Other Roles

9.1 For Product Owner

Q1: Should we expose data quality confidence levels in the UI?

Context: 29% of flows have execution_mode: "unknown". Currently, the UI shows this as a plain value. Should we add visual indicators (🟢 High / 🟡 Low / 🔴 Unknown) to signal data quality?

Impact:

  • Users can prioritize manual review of low-confidence entities
  • Reduces false confidence in incomplete data
  • Adds visual noise to UI

Recommendation: Yes, but make it subtle (icon + tooltip, not full badge).


Q2: Is "Show all automations" a toggle or a separate page?

Context: 77 of 92 automations are internal_inventory (hidden by default). Should users see them via:

  • A. Toggle switch "Show internal inventory" on main Automations page
  • B. Separate "Automation Inventory (All)" page
  • C. Filter dropdown: "Security-relevant only" / "All automations"

Recommendation: Option C (filter dropdown) — most flexible, matches existing filter patterns.


Q3: Should incomplete classification block findings generation?

Context: If execution_mode: "unknown", should the evaluator still generate dormant_authority findings, or skip the entity?

Options:

  • A. Block: conservative, avoids false positives, but reduces finding coverage
  • B. Allow: aggressive, treats "unknown" as "autonomous" (assume worst case)
  • C. Flag: generate finding but mark it as "low_confidence"

Recommendation: Option C (flag with low_confidence).
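
A minimal sketch of the "flag" behaviour, assuming a hypothetical evaluator function and a hypothetical `confidence` field on the finding (the real evaluator and finding schema are not shown in this document):

```typescript
type ExecMode = "autonomous" | "operator_assisted" | "human_triggered" | "unknown";

interface EvaluatedAutomation {
  id: string;
  execution_mode: ExecMode;
  security_relevance: string;
}

interface DormantAuthorityFinding {
  entity_id: string;
  finding_type: "dormant_authority";
  confidence: "normal" | "low_confidence"; // assumed field name
}

// Flag option: the entity is never skipped, but the finding is marked
// low_confidence when execution_mode is unknown.
function evaluateDormantAuthority(a: EvaluatedAutomation): DormantAuthorityFinding | null {
  if (a.security_relevance !== "dormant_authority") {
    return null;
  }
  return {
    entity_id: a.id,
    finding_type: "dormant_authority",
    confidence: a.execution_mode === "unknown" ? "low_confidence" : "normal",
  };
}
```

This keeps finding coverage intact while letting reviewers sort or filter out the low-confidence subset.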


9.2 For CISO

Q1: Is "incomplete classification" itself a finding we should surface?

Example Finding:

Finding Type: INCOMPLETE_AUTOMATION_CLASSIFICATION
Severity: Low
Title: 22 automations with unknown execution mode
Description: 22 of 92 discovered automations have execution_mode="unknown" due to
unrecognized trigger types. Manual classification recommended to ensure
complete risk assessment.
Evidence:
- Automation IDs: [list of 22 entity IDs]
- Trigger types causing gaps: ["knowledge management", "email handler", ...]
- Recommendation: Expand connector trigger type allowlist or manually classify

Pro: Surfaces data quality gaps as actionable items

Con: Not a security risk per se, more of an operational issue

Recommendation: Yes, but as severity "Informational" (not Low/Medium/High).


Q2: What is the acceptable classification gap rate?

Current: 29% of flows have execution_mode: "unknown"

Question: What threshold should trigger an alert?

  • 0% (perfect classification required)?
  • <10% (acceptable noise)?
  • <30% (current state is acceptable)?

Recommendation: <10% for security-relevant automations, <50% for internal_inventory.
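
The recommended thresholds can be checked mechanically. The sketch below is an illustration only: the function names are hypothetical, and it assumes "security-relevant" means every entity whose security_relevance is not internal_inventory.

```typescript
interface ClassifiedAutomation {
  execution_mode: string;
  security_relevance: string;
}

// Thresholds from the recommendation above, expressed as fractions.
const GAP_THRESHOLDS = { securityRelevant: 0.1, internalInventory: 0.5 };

// Share of entities whose execution_mode is "unknown" (0 for an empty list).
function gapRate(items: ClassifiedAutomation[]): number {
  if (items.length === 0) {
    return 0;
  }
  const unknown = items.filter((e) => e.execution_mode === "unknown").length;
  return unknown / items.length;
}

// An alert fires when the gap rate reaches or exceeds the threshold for its group.
function checkGapThresholds(all: ClassifiedAutomation[]): {
  securityRelevantAlert: boolean;
  internalInventoryAlert: boolean;
} {
  const internal = all.filter((e) => e.security_relevance === "internal_inventory");
  const relevant = all.filter((e) => e.security_relevance !== "internal_inventory");
  return {
    securityRelevantAlert: gapRate(relevant) >= GAP_THRESHOLDS.securityRelevant,
    internalInventoryAlert: gapRate(internal) >= GAP_THRESHOLDS.internalInventory,
  };
}
```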


Q3: Should we trust execution_count=0 or require manual verification?

Context: ALL 77 internal_inventory flows have execution_count_30d: 0. No execution_data_availability metadata to confirm this is accurate.

Options:

  • A. Trust it: assume connector is correct, proceed with analysis
  • B. Flag it: show warning "Execution count may be incomplete"
  • C. Block it: require INTEGRATOR to validate before accepting data

Recommendation: Option B (flag it) until INTEGRATOR confirms collection works (see §6).


9.3 For Architect

Q1: Should classification be a platform concern or remain connector-side?

Current: Connector computes execution_mode, security_relevance, risk_group. Platform ingests verbatim.

Alternative: Platform recomputes these during ingestion based on normalized properties.

Pros of Platform Classification:

  • Single source of truth
  • Consistent across all connectors
  • Easier to evolve classification logic (no connector updates needed)

Cons:

  • Duplicates logic between connector and platform
  • Connector loses autonomy
  • What if connector has better context (e.g., ServiceNow-specific trigger types)?

Recommendation: Hybrid — connector provides raw signals (trigger types, egress URLs), platform derives classification. Connector can provide hints, but platform is authoritative.


Q2: Should execution_data_availability be part of the NormalizedGraph schema?

Context: Proposed new field to distinguish "confirmed zero" from "data not collected".

Question: Should this be:

  • A. Required field in NormalizedGraph (connector MUST provide it)
  • B. Optional field (connector MAY provide it, platform infers if absent)
  • C. Platform-computed only (connector doesn't emit it, platform adds during ingestion)

Recommendation: Option A (required). Data quality is critical — connectors should explicitly declare availability.
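
To illustrate why the field matters, here is a sketch of how a consumer might interpret execution_count_30d once execution_data_availability is present. The `ExecutionEvidence` type and `interpretExecutionCount` function are hypothetical names introduced for this example.

```typescript
type ExecutionDataAvailability =
  | "available"
  | "partial"
  | "unavailable_no_log"
  | "unavailable_no_access"
  | "not_collected";

// Hypothetical result type that separates the cases a bare number cannot express.
type ExecutionEvidence =
  | { kind: "confirmed_zero" }
  | { kind: "observed"; count: number }
  | { kind: "at_least"; count: number } // partial data: the count is a lower bound
  | { kind: "unknown"; reason: ExecutionDataAvailability };

function interpretExecutionCount(
  count: number,
  availability: ExecutionDataAvailability
): ExecutionEvidence {
  switch (availability) {
    case "available":
      return count === 0 ? { kind: "confirmed_zero" } : { kind: "observed", count };
    case "partial":
      return { kind: "at_least", count };
    default:
      // No log table, no access, or collection skipped: the count carries no information.
      return { kind: "unknown", reason: availability };
  }
}
```

With this shape, a count of 0 is only reported as "confirmed zero" when the connector states that the data was actually collected.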


Q3: Should we support connector-to-platform data quality feedback?

Context: Connector knows when it hits API limits, permissions errors, or timeouts during data collection.

Proposed: Add collectionWarnings to NormalizedGraph:

export interface NormalizedGraph {
  // ... existing fields ...
  collectionWarnings?: CollectionWarning[];
}

export interface CollectionWarning {
  field: string; // "execution_count_30d"
  severity: "info" | "warning" | "error";
  message: string; // "API limit reached, execution count may be incomplete"
  affected_entities?: string[]; // nodeIds
}

Benefit: Platform can surface connector issues in UI, not just logs.

Recommendation: Yes — this closes the feedback loop between connector and platform.
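
One way the platform might consume these warnings is to index them by entity so the UI can show them next to the affected node. This is a sketch under assumptions: `indexWarningsByEntity` is a hypothetical helper, and filing warnings without affected_entities under a `"*"` key is a convention invented for this example.

```typescript
interface CollectionWarning {
  field: string;
  severity: "info" | "warning" | "error";
  message: string;
  affected_entities?: string[]; // nodeIds
}

// Assumed convention: warnings with no affected_entities apply to the whole graph.
const GRAPH_WIDE = "*";

// Groups warnings by nodeId. A warning listing several entities appears under each of them.
function indexWarningsByEntity(
  warnings: CollectionWarning[]
): Record<string, CollectionWarning[]> {
  const byEntity: Record<string, CollectionWarning[]> = Object.create(null);
  for (const w of warnings) {
    const targets =
      w.affected_entities && w.affected_entities.length > 0
        ? w.affected_entities
        : [GRAPH_WIDE];
    for (const nodeId of targets) {
      const list = byEntity[nodeId] ?? [];
      list.push(w);
      byEntity[nodeId] = list;
    }
  }
  return byEntity;
}
```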


9.4 For Integrator

Q1: What additional ServiceNow APIs would improve execution_count reliability?

Current: Uses sys_flow_context (flow-level) and sys_trigger (job-level).

Gaps:

  • Business rules: no execution log
  • Flows: only flow-level count, not action-level detail

Proposed Research:

  1. sys_hub_action_instance — action-level execution log (which flow steps ran)
  2. sys_log — generic script execution log (may contain business rule executions)
  3. sys_execution_tracker — long-running job tracker
  4. sys_audit — table history (indirect signal for business rule executions)

Question: Which of these are feasible with OAuth app permissions?

Expected Effort: 4-8 hours research + testing


Q2: Can we get last_modified_date for flows?

Context: Currently missing from entity properties. Would help identify recently-edited flows (potential new risk).

ServiceNow Field: sys_updated_on in sys_hub_flow table

Question: Already collected but not emitted, or not collected?

Action: Check connector code, add to properties if available.


Q3: Should connector emit a "data collection report" after each scan?

Proposed: After discovery, connector emits a summary:

{
  "collection_summary": {
    "flows_discovered": 83,
    "flows_with_execution_data": 0, // ← KEY METRIC
    "flows_skipped_no_permissions": 0,
    "execution_data_sources": ["sys_flow_context", "sys_trigger"],
    "collection_duration_seconds": 42,
    "api_calls_made": 156,
    "api_limits_hit": 0
  }
}

Benefit: Immediate visibility into connector health, not just entity data.

Recommendation: Yes — include in connector sync metadata.
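
A sketch of how the platform could turn such a summary into health warnings. The `collectionHealthWarnings` function and its message strings are hypothetical; the field names follow the proposed summary above.

```typescript
interface CollectionSummary {
  flows_discovered: number;
  flows_with_execution_data: number;
  flows_skipped_no_permissions: number;
  api_limits_hit: number;
}

// Returns human-readable warnings; an empty array means no obvious collection problem.
function collectionHealthWarnings(s: CollectionSummary): string[] {
  const warnings: string[] = [];
  // Flows were discovered but none has execution data: collection itself is suspect.
  if (s.flows_discovered > 0 && s.flows_with_execution_data === 0) {
    warnings.push("No discovered flow has execution data; execution collection may not be working");
  }
  if (s.flows_skipped_no_permissions > 0) {
    warnings.push(s.flows_skipped_no_permissions + " flows skipped due to missing permissions");
  }
  if (s.api_limits_hit > 0) {
    warnings.push("API limits hit " + s.api_limits_hit + " times; counts may be incomplete");
  }
  return warnings;
}
```

Applied to the example summary above (83 flows discovered, 0 with execution data), this yields a single warning about execution collection.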


10. Summary & Recommendations

10.1 Critical Issues (Blocking)

| Issue | Impact | Recommendation | Effort |
|---|---|---|---|
| Null Ambiguity | Can't distinguish "confirmed zero" from "not checked" | Add execution_data_availability to schema | 8-12h |
| 29% execution_mode Gap | Can't classify 1 in 3 automations | Expand trigger type allowlist + platform fallback | 4-6h |
| No Data Quality Metadata | Can't assess confidence in entity properties | Add data_quality to EntityDoc | 12-16h |

| Feature | Use Case | Effort |
|---|---|---|
| Automation Summary Endpoint | Dashboard stats, connector validation | 4-6h |
| Classification Override API | Manual review workflow | 8-12h |
| classification_status=incomplete Filter | Find entities needing review | 2-3h |
| Pre-Ingest Filter Removal | Preserve inventory completeness | 1h |

10.3 INTEGRATOR Action Items

  1. Priority 1: Validate execution data collection works (check for flows with execution_count_30d > 0)
  2. Priority 2: Research sys_hub_action_instance for action-level execution evidence
  3. Priority 3: Add last_modified_date to flow properties
  4. Priority 4: Emit data collection summary in sync metadata

10.4 Platform Schema Enhancements

// 1. Add to NormalizedNode properties (connector emits)
interface AutomationProperties {
  execution_data_availability: ExecutionDataAvailability;
  execution_data_collected_at?: string;
  execution_mode_confidence: "high" | "low" | "unknown";
}

// 2. Add to EntityDoc (platform computes)
interface EntityDoc {
  data_quality?: DataQualityReport;
  user_overrides?: UserOverrideMetadata;
}

// 3. Add to NormalizedGraph (connector emits)
interface NormalizedGraph {
  collectionWarnings?: CollectionWarning[];
}

10.5 API Additions

GET   /api/v1/automations/summary
GET   /api/v1/entities?classification_status=incomplete
PATCH /api/v1/entities/:id/classification

10.6 Answers to Core Questions

Q: Does execution_count_30d: 0 mean "we checked and found zero" or "we didn't check"?
A: Currently ambiguous. Recommendation: Add execution_data_availability field to make this explicit.

Q: Is a 29% execution_mode: "unknown" rate acceptable?
A: Acceptable for internal_inventory (hidden by default), blocking for security-relevant automations. Recommendation: Platform fallback for known subtypes (business_rule → autonomous).

Q: Should the platform have a fallback classification if the connector returns "unknown"?
A: Yes, with metadata indicating override. Use execution_mode_source: "platform_override" to track provenance.
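
A sketch of the platform fallback with provenance tracking. Only the business_rule → autonomous mapping comes from this analysis; the `applyPlatformFallback` function and the lookup table are hypothetical, and other subtypes are deliberately left as "unknown".

```typescript
type ExecutionModeValue = "autonomous" | "operator_assisted" | "human_triggered" | "unknown";
type ExecutionModeSource = "connector" | "platform_override" | "user_override";

interface ModeClassification {
  execution_mode: ExecutionModeValue;
  execution_mode_source: ExecutionModeSource;
}

// Fallback table keyed by identitySubtype. Only business_rule is stated in this document.
const SUBTYPE_FALLBACK: Record<string, ExecutionModeValue> = {
  business_rule: "autonomous",
};

function applyPlatformFallback(
  identitySubtype: string,
  connectorMode: ExecutionModeValue
): ModeClassification {
  // A concrete connector classification is kept as-is.
  if (connectorMode !== "unknown") {
    return { execution_mode: connectorMode, execution_mode_source: "connector" };
  }
  const fallback = Object.prototype.hasOwnProperty.call(SUBTYPE_FALLBACK, identitySubtype)
    ? SUBTYPE_FALLBACK[identitySubtype]
    : undefined;
  if (fallback !== undefined) {
    return { execution_mode: fallback, execution_mode_source: "platform_override" };
  }
  // No fallback known: keep "unknown" so the gap stays visible downstream.
  return { execution_mode: "unknown", execution_mode_source: "connector" };
}
```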


Appendix A: TypeScript Type Definitions

File: /Users/lucky/dev/securityv0/sv0-platform/src/domain/entities/automation-types.ts (new)

/**
 * Automation-specific types for identity entities.
 * These types provide structure for properties that were previously untyped (Record<string, unknown>).
 */

export type IdentitySubtype =
  | "flow_designer_flow"
  | "business_rule"
  | "scheduled_job"
  | "system_execution"
  | "oauth_app"
  | "service_principal";

export type ExecutionMode = "autonomous" | "operator_assisted" | "human_triggered" | "unknown";

export type SecurityRelevance =
  | "active_external" // Has external egress + execution evidence
  | "dormant_authority" // Has capability but no recent execution
  | "internal_inventory" // No external egress, no execution, unlinked
  | "unknown";

export type EgressCategory = "none" | "internal" | "external" | "cloud" | "llm" | "unknown";

export type ExecutionDataAvailability =
  | "available" // Data was collected, count is accurate
  | "partial" // Data was collected but incomplete (API limit, timeout)
  | "unavailable_no_log" // Source system has no execution log for this automation type
  | "unavailable_no_access" // Connector lacks permissions to query execution logs
  | "not_collected"; // Execution data collection was skipped

export interface AutomationProperties {
  // Identity classification
  identitySubtype: IdentitySubtype;
  automation_type: string; // "flow", "business_rule", "job", "script"

  // Trigger configuration
  triggerTypes?: string[];

  // Execution evidence
  execution_count_30d: number;
  execution_evidence_refs: string[];
  last_observed_execution_timestamp?: string | null;

  // Data quality metadata
  execution_data_availability: ExecutionDataAvailability;
  execution_data_collected_at?: string; // ISO 8601 timestamp
  execution_data_source?: string; // "sys_flow_context" | "sys_trigger" | "unavailable"
  execution_data_notes?: string;

  // Classification
  execution_mode: ExecutionMode;
  execution_mode_confidence: "high" | "low" | "unknown";
  execution_mode_source?: "connector" | "platform_override" | "user_override";
  security_relevance: SecurityRelevance;

  // Egress analysis
  egress_category: EgressCategory;
  egress_host?: string | null;
  egress_base_url?: string | null;
  endpoint_url?: string | null;

  // Identity binding
  identity_binding_status: "bound" | "unlinked";

  // Risk assessment
  risk_group: string; // "RG1" | "RG2" | "RG3" | "RG4" | "RG5"
  risk_group_label: string;
  risk_group_priority: string; // "P1" | "P2" | "P3" | "P4"

  // Ownership
  ownership_status: "valid" | "orphaned";
  sys_created_by?: string;
  sys_updated_by?: string;

  // Referenced data
  referenced_tables?: string[];
  data_domains?: string[];
}

export interface DataQualityReport {
  overall_score: number; // 0-100, weighted sum of field confidence scores
  field_confidence: Record<string, FieldConfidence>;
  warnings: string[];
  last_validated_at?: Date;
}

export interface FieldConfidence {
  level: "high" | "medium" | "low" | "unavailable";
  source: "connector" | "platform_derived" | "user_override" | "default";
  collected_at?: Date;
  notes?: string;
}

export interface UserOverrideMetadata {
  protected_fields: string[]; // Field names that won't be overwritten by connector sync
  overrides: Record<string, FieldOverride>;
}

export interface FieldOverride {
  field_name: string;
  original_value: unknown;
  override_value: unknown;
  override_reason: string;
  overridden_at: Date;
  overridden_by: string; // User ID from auth context
}

export interface CollectionWarning {
  field: string; // Property name that was affected
  severity: "info" | "warning" | "error";
  message: string;
  affected_entities?: string[]; // nodeIds of entities affected by this warning
}
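
DataQualityReport.overall_score is described as a weighted sum of field confidence scores, but the weights and per-level values are not defined in this analysis. The sketch below shows one possible computation; the numeric values in `LEVEL_SCORE`, the default weight of 1, and the `computeOverallScore` name are all assumptions made for illustration.

```typescript
type ConfidenceLevel = "high" | "medium" | "low" | "unavailable";

// Assumed numeric value per confidence level; not defined by this analysis.
const LEVEL_SCORE: Record<ConfidenceLevel, number> = {
  high: 100,
  medium: 60,
  low: 30,
  unavailable: 0,
};

// Weighted average of per-field confidence, rounded to an integer in 0-100.
// Fields without an explicit weight default to weight 1 (an assumption).
function computeOverallScore(
  levels: Record<string, ConfidenceLevel>,
  weights: Record<string, number> = {}
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const field of Object.keys(levels)) {
    const level = levels[field];
    if (level === undefined) {
      continue;
    }
    const w = weights[field] ?? 1;
    weighted += LEVEL_SCORE[level] * w;
    totalWeight += w;
  }
  return totalWeight === 0 ? 0 : Math.round(weighted / totalWeight);
}
```

Normalizing by the total weight keeps the score in the 0-100 range regardless of how many fields are tracked.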

Appendix B: API Endpoint Specifications

B.1 Automation Summary Endpoint

GET /api/v1/automations/summary

Query Parameters:

  • tenant_id: resolved from the auth context, not passed as a query parameter
  • source_system (optional): filter by source system

Response:

{
  "total_count": 92,
  "by_subtype": {
    "flow_designer_flow": 83,
    "business_rule": 2,
    "oauth_app": 3,
    "service_principal": 2,
    "system_execution": 2
  },
  "by_execution_mode": {
    "autonomous": 45,
    "operator_assisted": 10,
    "unknown": 22,
    "human_triggered": 0
  },
  "by_security_relevance": {
    "internal_inventory": 77,
    "dormant_authority": 9,
    "active_external": 1,
    "unknown": 5
  },
  "by_egress_category": {
    "none": 77,
    "external": 5,
    "llm": 3,
    "cloud": 2,
    "internal": 3,
    "unknown": 2
  },
  "with_execution_evidence": 0,
  "with_identity_binding": 9,
  "classification_gaps": {
    "execution_mode_unknown": 22,
    "trigger_types_empty": 5,
    "execution_data_unavailable": 2
  },
  "data_quality": {
    "overall_average_score": 72,
    "entities_with_warnings": 24,
    "low_confidence_count": 22
  }
}
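
The grouped counts in this response could be produced by a generic count-by helper, sketched below. `countBy` and `summarizeAutomations` are hypothetical names, only a subset of the response fields is computed, and treating execution_count_30d > 0 as "execution evidence" is an assumption.

```typescript
interface SummarizableEntity {
  properties: {
    identitySubtype: string;
    execution_mode: string;
    execution_count_30d: number;
  };
}

// Counts items per key, e.g. entities per execution_mode.
function countBy<T>(items: T[], key: (item: T) => string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const k = key(item);
    counts[k] = (counts[k] ?? 0) + 1;
  }
  return counts;
}

// Builds a partial summary: totals, two groupings, and the execution evidence count.
function summarizeAutomations(entities: SummarizableEntity[]) {
  return {
    total_count: entities.length,
    by_subtype: countBy(entities, (e) => e.properties.identitySubtype),
    by_execution_mode: countBy(entities, (e) => e.properties.execution_mode),
    // Assumption: "execution evidence" means a non-zero 30-day count.
    with_execution_evidence: entities.filter((e) => e.properties.execution_count_30d > 0).length,
  };
}
```

A production implementation would more likely push this grouping into the database as an aggregation rather than load every entity into memory.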

B.2 Classification Override Endpoint

PATCH /api/v1/entities/:id/classification

Request Body:

{
  "execution_mode": "autonomous",
  "override_reason": "Business rule always runs on record insert, not human-triggered"
}

Response:

{
  "entity_id": "027e16c40dc1009472308597",
  "previous_classification": {
    "execution_mode": "unknown",
    "execution_mode_source": "connector"
  },
  "updated_classification": {
    "execution_mode": "autonomous",
    "execution_mode_source": "user_override",
    "execution_mode_confidence": "high",
    "override_reason": "Business rule always runs on record insert, not human-triggered",
    "overridden_at": "2026-02-12T15:30:00Z",
    "overridden_by": "user-123"
  },
  "protected_fields": ["execution_mode"]
}
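
The override only holds if a later connector sync cannot revert it. The sketch below shows one way to record an override and then merge a sync while respecting protected_fields. `applyOverride` and `mergeConnectorSync` are hypothetical helpers, and the entity shape is simplified to a flat properties record.

```typescript
interface FieldOverride {
  field_name: string;
  original_value: unknown;
  override_value: unknown;
  override_reason: string;
  overridden_at: Date;
  overridden_by: string;
}

interface UserOverrideMetadata {
  protected_fields: string[];
  overrides: Record<string, FieldOverride>;
}

interface OverridableEntity {
  properties: Record<string, unknown>;
  user_overrides?: UserOverrideMetadata;
}

// Records a manual override and protects the field. Returns a new object; the input is not mutated.
function applyOverride(
  entity: OverridableEntity,
  field: string,
  value: unknown,
  reason: string,
  userId: string,
  now: Date = new Date()
): OverridableEntity {
  const prev = entity.user_overrides ?? { protected_fields: [], overrides: {} };
  const protectedFields =
    prev.protected_fields.indexOf(field) >= 0
      ? prev.protected_fields
      : prev.protected_fields.concat([field]);
  const record: FieldOverride = {
    field_name: field,
    original_value: entity.properties[field],
    override_value: value,
    override_reason: reason,
    overridden_at: now,
    overridden_by: userId,
  };
  return {
    properties: { ...entity.properties, [field]: value },
    user_overrides: {
      protected_fields: protectedFields,
      overrides: { ...prev.overrides, [field]: record },
    },
  };
}

// Applies connector-provided properties but skips any protected field.
function mergeConnectorSync(
  entity: OverridableEntity,
  incoming: Record<string, unknown>
): OverridableEntity {
  const protectedFields = entity.user_overrides ? entity.user_overrides.protected_fields : [];
  const merged: Record<string, unknown> = { ...entity.properties };
  for (const key of Object.keys(incoming)) {
    if (protectedFields.indexOf(key) < 0) {
      merged[key] = incoming[key];
    }
  }
  return { ...entity, properties: merged };
}
```

After an override of execution_mode, a sync that still reports "unknown" leaves the manual value in place while other properties such as execution_count_30d continue to update.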

B.3 Classification Status Filter

GET /api/v1/entities?classification_status=incomplete

Logic: Returns entities where ANY of:

  • properties.execution_mode === "unknown"
  • properties.security_relevance === "unknown"
  • properties.execution_data_availability === "not_collected"
  • (properties.triggerTypes ?? []).length === 0 AND properties.identitySubtype === "flow_designer_flow" (triggerTypes is optional, so a missing array counts as empty)

Response:

{
  "data": [
    {
      "_id": "027e16c40dc1009472308597",
      "entity_type": "identity",
      "properties": {
        "display_name": "Knowledge - Approval Publish",
        "execution_mode": "unknown",
        "triggerTypes": ["knowledge management"],
        "execution_data_availability": "available"
      },
      "data_quality": {
        "overall_score": 65,
        "warnings": ["execution_mode classification unknown - manual review recommended"]
      }
    }
  ],
  "cursor": null,
  "meta": {
    "total_count": 22
  }
}
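
The filter logic in B.3 can be expressed as a single predicate. This sketch uses a hypothetical `isClassificationIncomplete` function over a reduced property shape, and treats a missing triggerTypes array the same as an empty one because the field is optional in AutomationProperties.

```typescript
interface FilterableProperties {
  execution_mode: string;
  security_relevance: string;
  execution_data_availability: string;
  identitySubtype: string;
  triggerTypes?: string[];
}

// True when ANY of the four B.3 conditions holds.
function isClassificationIncomplete(p: FilterableProperties): boolean {
  const noTriggers = (p.triggerTypes ?? []).length === 0;
  return (
    p.execution_mode === "unknown" ||
    p.security_relevance === "unknown" ||
    p.execution_data_availability === "not_collected" ||
    (noTriggers && p.identitySubtype === "flow_designer_flow")
  );
}
```

The entity in the example response matches through the first condition alone: its execution_mode is "unknown" even though its execution data is "available".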

Appendix C: Implementation Checklist

Phase 1: Schema Enhancements (8-12 hours)

  • Add automation-types.ts with structured types
  • Add execution_data_availability to AutomationProperties
  • Add execution_mode_confidence to AutomationProperties
  • Add DataQualityReport to EntityDoc
  • Add UserOverrideMetadata to EntityDoc
  • Create MongoDB indexes for new fields

Phase 2: API Endpoints (12-16 hours)

  • Implement GET /api/v1/automations/summary
  • Implement PATCH /api/v1/entities/:id/classification
  • Add classification_status=incomplete query parameter
  • Add data quality computation during ingestion
  • Add protected_fields sync behavior

Phase 3: Connector Updates (INTEGRATOR) (8-12 hours)

  • Add execution_data_availability to transformer output
  • Add execution_mode_confidence to transformer output
  • Add execution_data_collected_at timestamp
  • Add collectionWarnings to NormalizedGraph
  • Validate execution data collection works (check for execution_count_30d > 0)
  • Research sys_hub_action_instance API

Phase 4: UI Updates (8-12 hours)

  • Add data quality badge to entity detail page
  • Add "Manually Classify" button for entities with execution_mode=unknown
  • Add classification override modal
  • Add "Show internal inventory" filter toggle
  • Add classification_status=incomplete filter to Automations page
  • Add automation summary dashboard widget

Phase 5: Testing (4-6 hours)

  • Unit tests for data quality computation
  • Integration tests for classification override
  • E2E test for incomplete classification filter
  • Connector test: verify execution_data_availability is populated
  • Manual test: override execution_mode, verify sync doesn't revert

END OF ANALYSIS