
LLM Integration Opportunities in SecurityV0

Date: 2026-03-11
Status: Research (not yet an ADR)
Scope: sv0-platform (evaluator, evidence, UI), sv0-connectors (classification pipeline)
Trigger: Discussion about whether the "What Happened" narrative in the Authority Exposure Brief should use LLM generation vs. the current deterministic template approach.


1. Executive Summary

SecurityV0 currently generates all natural language text deterministically: templates, hardcoded strings, and rule-based classifiers. This is intentional — security tooling requires auditability, predictability, and traceability to evidence.

The question is not "use LLM everywhere" but "where does semantic understanding produce meaningfully better results than rules, and where does determinism matter too much to give up?"

This document maps every text-generation and classification point across the platform and connectors, assesses each for LLM fit, and proposes an architecture that uses LLMs as an opt-in enrichment layer with deterministic fallbacks — running offline models where latency and privacy matter, and cloud models where reasoning depth matters.

Key conclusions:

  • The buildNarrative() "What Happened" summary: keep deterministic — it is already high-quality, correct, and auditable.
  • Connector classification pipeline (egress, origin, permission, script analysis): highest-value LLM target — rules have hard coverage limits and semantic gaps that only language understanding can fill.
  • Per-finding explanation and remediation: medium-value — LLM can add context not expressible in templates, but must stay grounded in evidence.
  • Architecture: local model-first for classification (privacy, speed, cost), cloud model only for complex reasoning tasks; always with a deterministic fallback path.

2. Current Text Generation Inventory

2.1 sv0-platform: Natural Language Points

| Location | File | What Is Generated | Method |
|---|---|---|---|
| Authority Exposure Brief, Section A | ui/src/pages/RiskClusterDetailPage.tsx buildNarrative() | "N identities accessed sensitive systems (domains) M times in 30d. Governance clause." | Template: action_phrase + governance_clause from cluster def + live numbers |
| Risk cluster card verdict | ui/src/components/PathRiskClusterCard.tsx buildVerdictSentence() | One-line summary per cluster card | Same template approach |
| Per-finding explanation | src/evaluator/rules/*.ts (16 rules) | deterministic_explanation field on each finding | Hardcoded string per rule type, some dynamic values injected |
| Remediation actions | src/evidence/remediation.ts + src/services/remediation-service.ts | action + rationale + reduction_effect per action | Switch on finding type, context-aware builders |
| Evidence pack markdown | src/evidence/markdown.ts | Full markdown export of evidence pack | Template-based formatting |
| Cluster remediation bullets | src/services/risk-cluster-service.ts RISK_CLUSTER_DEFS | 3-5 bullets per cluster type (7 types) | Hardcoded strings |
| Page-level static text | Various pages/*.tsx | Section labels, explainer text | Hardcoded JSX strings |

2.2 sv0-connectors: Classification Pipeline

| Stage | File | What It Classifies | Current Method |
|---|---|---|---|
| Egress classification | core/egress_classifier.py | Endpoint type: llm / external / internal / none / unknown | Hardcoded domain catalog + regex markers for dynamic URLs |
| Data origin (sensitivity) | core/origin_classifier.py | Data domain: hr / identity / customer / financial / unknown | Pattern matching on ServiceNow table names (sn_hr_*, sys_user*) |
| Ownership validation | core/ownership_validator.py | Status: valid / invalid / ambiguous | Deterministic rules on owner activity, count, type |
| Risk grouping | core/risk_grouper.py | Risk group: RG1–RG5 | Hardcoded matrix: egress × sensitivity |
| Permission canonicalization | core/permission_mapper.py | OAA type: DataRead / DataWrite / … / Uncategorized | Hardcoded mapping + fallback pattern-match |
| ARM role actions | shared/sv0_azure/arm_roles.py | Actions: read / write / delete / admin | Hardcoded for 40+ known roles, conservative fallback |
| Script analysis | adapters/servicenow_client.py | Table mutations, REST call targets from script code | Regex pattern matching (not AST) |
| Resource sensitivity | core/transformer.py | Sensitivity: restricted / confidential / internal / public | Hardcoded table list + domain mapping |

3. Where LLM Adds Real Value (and Where It Doesn't)

3.1 Do NOT use LLM — deterministic is correct

The buildNarrative() "What Happened" summary:

  • Text quality is already high ("3 autonomous identities accessed sensitive systems…")
  • Numbers and domains must be exact — LLM cannot improve on precision
  • Auditability: CISOs need to trace every word back to evidence; LLM phrasing variation undermines this
  • Cost and latency: adding an LLM call to every page load for no meaningful gain
  • Decision: keep deterministic. Improve the action_phrase / governance_clause vocabulary editorially if needed.

Finding deterministic_explanation field:

  • The field name itself signals the contract: it is deterministically derived from evidence
  • Changing this to LLM output would break the audit chain
  • Decision: keep deterministic. The explanation must be machine-reproducible.

Ownership validation (valid/invalid/ambiguous):

  • This is a three-state governance decision (valid / invalid / ambiguous) with clear rules
  • LLM "softening" this would introduce false confidence
  • Decision: keep deterministic. Rules are correct and complete.

3.2 HIGH VALUE — use LLM

A. Egress Endpoint Classification

Files: core/egress_classifier.py in both connectors

Current problem: The hardcoded LLM_CATALOG covers ~8 known providers. Dynamic URLs (built from variables: ${}, gs.getProperty, config lookups) always fall back to unknown. Unknown egress = invisible risk.

What LLM adds:

  • Classify novel endpoints by hostname semantics: "Is api.bedrock.us-east-1.amazonaws.com an LLM endpoint?"
  • Evaluate partial URL patterns: "What does https://${env.AI_GATEWAY}/v1/chat probably resolve to?"
  • Classify connection objects by description, display name, connection type metadata

Risk level: Low — LLM is expanding coverage of unknown cases, not overriding deterministic ones. Keep the existing catalog as primary; LLM fills the gap.

Recommended model: Local/offline (fast, no data egress, endpoint hostnames are not sensitive). Ollama with llama3.2 or phi-3-mini.
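The "catalog stays primary, LLM fills the gap" rule can be sketched as follows. This is a minimal illustration, not the connector's code: the catalog entries, the `classify_egress` name, and the `enrich` callable (standing in for a T1 model call) are all assumptions.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical stand-in for LLM_CATALOG; the real entries live in egress_classifier.py.
LLM_CATALOG = {"api.openai.com", "api.anthropic.com"}
ALLOWED = {"llm", "external", "internal", "none", "unknown"}


def classify_egress(url: str, enrich: Optional[Callable[[str], str]] = None) -> Tuple[str, str]:
    """Return (category, source). A catalog hit always wins; the enricher only sees unknowns."""
    try:
        host = urlparse(url).hostname or ""
    except ValueError:  # malformed netloc, e.g. unbalanced IPv6 brackets
        host = ""
    if host in LLM_CATALOG:
        return "llm", "rule"
    if enrich is not None:
        guess = enrich(url)
        # Accept the model's answer only if it is a known label and actually resolves the unknown.
        if guess in ALLOWED and guess != "unknown":
            return guess, "llm_enrichment"
    return "unknown", "rule"
```

Because the enricher is consulted only after the catalog misses, a wrong model answer can never downgrade an endpoint the rules already recognise.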


B. Data Origin / Sensitivity Classification from Table Names

Files: core/origin_classifier.py, core/transformer.py

Current problem: Coverage limited to tables that match sn_hr_*, sys_user*, customer_* etc. Custom tables, vendor extensions, and customer-specific naming fall through to unknown. Unknown domain = missing sensitivity signal.

What LLM adds:

  • Semantic table name interpretation: "What does x_acme_employee_onboarding likely contain?"
  • Context from automation description + table name together
  • Cross-reference: if a Business Rule fires on x_vendor_contracts and sends to an external endpoint, that is probably sensitive even without a matching pattern

Risk level: Low — again filling unknown coverage, not overriding explicit mappings.

Recommended model: Local/offline. Table names and descriptions are customer data; no cloud egress.
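As a rough sketch of how a table name and automation description might be combined into one constrained prompt: the wording, the `build_origin_prompt` name, and the 200-character cap are illustrative assumptions. The label set is taken from the classification table in section 2.2.

```python
DOMAINS = ("hr", "identity", "customer", "financial", "unknown")


def build_origin_prompt(table_name: str, description: str = "", max_len: int = 200) -> str:
    """Build a single-answer classification prompt from customer-controlled strings."""
    # Truncate so an oversized description cannot crowd out the instructions.
    table = table_name[:max_len]
    desc = description[:max_len]
    return (
        "Classify the data domain of a ServiceNow table.\n"
        f"Allowed answers: {', '.join(DOMAINS)}.\n"
        "Answer with exactly one allowed word.\n"
        # repr() quotes the values so they read as data rather than instructions.
        f"Table name: {table!r}\n"
        f"Automation description: {desc!r}\n"
    )


prompt = build_origin_prompt("x_acme_employee_onboarding", "Creates onboarding tasks for new hires")
```

Restricting the answer space to the existing domain labels keeps the model output mappable onto the same field the rules already populate.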


C. ServiceNow Script Analysis (Semantic Code Understanding)

Files: adapters/servicenow_client.py (analyze_script_mutations, analyze_script_queries)

Current problem: Script analysis is regex-based string matching. It misses:

  • Dynamic GlideRecord table names: gr.initialize(tableName) where tableName is a variable
  • Indirect REST calls: callMyHelper() where helper calls the external endpoint
  • Chained script includes
  • Any complexity beyond GlideRecord('literal_table_name')
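The coverage gap is easy to reproduce. The pattern below is an assumed approximation of literal-only matching, not the actual expression in servicenow_client.py:

```python
import re
from typing import List

# Assumed approximation of the current approach: only string-literal table names match.
LITERAL_TABLE = re.compile(r"""GlideRecord\(\s*['"]([A-Za-z0-9_]+)['"]\s*\)""")


def tables_from_literals(script: str) -> List[str]:
    """Return table names that appear as string literals in GlideRecord(...) calls."""
    return LITERAL_TABLE.findall(script)


literal_script = "var gr = new GlideRecord('sn_hr_core_profile'); gr.query();"
dynamic_script = "var t = gs.getProperty('x.table'); var gr = new GlideRecord(t); gr.query();"

print(tables_from_literals(literal_script))  # ['sn_hr_core_profile']
print(tables_from_literals(dynamic_script))  # []
```

The second script touches a table just as the first does, but the table name is resolved at runtime, so literal matching reports nothing.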

What LLM adds:

  • AST-level code understanding (JavaScript/GlideScript)
  • "What tables does this script touch?" with reasoning about variable resolution
  • "Does this script call any external HTTP endpoints?"
  • "What data does this script read vs. write?"

Risk level: Medium — script code is customer IP. Cloud model requires explicit tenant consent or PII stripping. Prefer local model; fall back to current regex if model unavailable.

Recommended model: Local/offline for privacy. codellama or deepseek-coder perform well on JavaScript. Cloud model (Claude) for complex multi-file chains if tenant has opted in.


D. Unknown Permission Canonicalization

Files: core/permission_mapper.py, shared/sv0_azure/arm_roles.py

Current problem: Unmapped Azure/ServiceNow permissions fall back to conservative defaults (["read", "write"]). Custom Azure RBAC roles (common in enterprise tenants) are unknown. This over-counts write permissions.

What LLM adds:

  • "What does the Azure permission Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action actually allow?" → classified as DataRead + admin-level
  • Custom RBAC role interpretation from role description
  • Cross-reference Microsoft docs for unknown permission strings

Risk level: Low — permission strings are not sensitive, cloud model acceptable.

Recommended model: Cloud (Claude Haiku or GPT-4o-mini). Permission strings benefit from up-to-date cloud model knowledge. Cache results aggressively (same permission → same classification, immutable).
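To illustrate why some permission strings resist rule-based mapping: Azure operation strings end in a verb segment (`read`, `write`, `delete`, `action`, or `*`), and `.../action` operations are provider-specific. The function below is an illustrative heuristic, not the logic in arm_roles.py; the `needs_interpretation` label and the `*` to `admin` mapping are assumptions.

```python
def arm_action_kind(permission: str) -> str:
    """Map the last segment of an Azure operation string to a coarse action kind."""
    suffix = permission.rsplit("/", 1)[-1].lower()
    if suffix in ("read", "write", "delete"):
        return suffix
    if suffix == "*":
        return "admin"  # assumption: a wildcard is treated as full control
    # ".../action" operations are provider-specific verbs; the suffix alone says nothing
    # about whether the call reads secrets, mutates state, or triggers a job.
    return "needs_interpretation"


print(arm_action_kind("Microsoft.Storage/storageAccounts/read"))
print(arm_action_kind("Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action"))
```

Only the second kind of string would be escalated to a model; the first is fully determined by its suffix.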


3.3 MEDIUM VALUE — use LLM with care

E. Per-Path Contextual Explanation (new field, not replacing deterministic_explanation)

Files: Would be a new contextual_summary field in FindingDoc or evidence pack

Current gap: The deterministic explanation says what happened (e.g., "no OWNED_BY relationship detected"). It does not say why this matters in context (e.g., "this unbound Foundry agent is routing data to an LLM while accessing HR profiles — the combination of unowned + sensitive + LLM egress is the highest-risk pattern in the system").

What LLM adds:

  • Cross-finding narrative: synthesize multiple findings on a path into a coherent risk story
  • Severity calibration rationale: "This is critical because X + Y + Z together"
  • Comparison context: "This path executed 47 times last month vs. 3 times the month before — the spike is anomalous"

Implementation constraint: Must be clearly labelled as AI-generated and supplementary to the deterministic fields. Must cite specific evidence refs. Cannot contradict the deterministic fields.
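The "must cite specific evidence refs" constraint can be enforced mechanically before a summary is stored. A minimal sketch, assuming a hypothetical inline citation syntax of the form `[ev:<id>]`:

```python
import re
from typing import List, Set, Tuple

# Hypothetical citation syntax; the real evidence-ref format is not defined in this document.
REF_PATTERN = re.compile(r"\[ev:([A-Za-z0-9_-]+)\]")


def validate_summary(summary: str, known_refs: Set[str]) -> Tuple[bool, List[str]]:
    """Accept a summary only if it cites at least one ref and every cited ref exists."""
    cited = REF_PATTERN.findall(summary)
    unknown = [ref for ref in cited if ref not in known_refs]
    ok = bool(cited) and not unknown
    return ok, unknown


summary = "Unowned agent reads HR profiles [ev:e1] and sends data to an LLM endpoint [ev:e2]."
print(validate_summary(summary, {"e1", "e2"}))  # (True, [])
print(validate_summary(summary, {"e1"}))        # (False, ['e2'])
```

A summary that fails validation would simply not be written, leaving the deterministic fields as the only explanation.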

Recommended model: Cloud (Claude Sonnet). This is a reasoning-heavy task that benefits from a capable model. Run async after findings are written (not in the hot path).


F. Remediation Advice Personalization

Files: src/evidence/remediation.ts, src/services/risk-cluster-service.ts

Current gap: Remediation bullets are per-cluster-type (7 types, hardcoded). They are correct but generic: "Assign an active owner" applies to every orphaned path regardless of context.

What LLM adds:

  • Context-specific phrasing: "Assign an owner from the HR Engineering team — this automation accesses sn_hr_core_profile and should be owned by someone accountable for HR data"
  • Priority adjustment based on execution volume: "With 681 executions in 30 days, this should be remediated this sprint, not next quarter"
  • Org-aware suggestions (if tenant metadata is available)

Implementation constraint: Same as per-path explanation: clearly labelled, supplementary, and not replacing the structured_actions[] array.


4. Architectural Design

4.1 LLM Layer Position

Connector Pipeline (Python)
┌──────────────────────────────────────────────────────────┐
│  Extract → Correlate → Classify → Transform → Submit     │
│                           ▲                              │
│               ┌───────────┴───────────┐                  │
│               │ Rule-based (primary)  │                  │
│               │ LLM enrichment        │ ← NEW: fills     │
│               │ (for unknowns only)   │   unknown cases  │
│               └───────────────────────┘                  │
└──────────────────────────────────────────────────────────┘

Platform Pipeline (TypeScript)
┌──────────────────────────────────────────────────────────┐
│  Ingest → Evaluate → Evidence → API → UI                 │
│                           ▲                              │
│               ┌───────────┴───────────┐                  │
│               │ Async enrichment      │ ← NEW: runs      │
│               │ (contextual_summary)  │   after write    │
│               └───────────────────────┘                  │
└──────────────────────────────────────────────────────────┘

Key principle: LLM is never in the hot/synchronous path for classification that has a deterministic answer. LLM only runs when:

  1. A rule returns unknown / uncategorized (connector classification)
  2. All deterministic fields are already written and we are adding supplementary context (platform)

4.2 Model Tier Strategy

| Tier | Model | When | Why |
|---|---|---|---|
| T0: Deterministic | No model | Rules produce a confident answer | Zero latency, full auditability, no cost |
| T1: Local/Offline | Ollama (llama3.2, phi-3.5-mini, deepseek-coder) | Classifying unknown outputs from rules; script analysis; table name semantics | No data egress, fast (<200ms on CPU for small inputs), no per-call cost, runs in container |
| T2: Cloud (efficient) | Claude Haiku / GPT-4o-mini | Permission string lookup; cases where T1 confidence is low | Better knowledge of Azure/ServiceNow APIs, cheap, cacheable |
| T3: Cloud (capable) | Claude Sonnet | Per-path contextual summary; complex multi-finding synthesis | Reasoning-heavy, async, not latency-sensitive |

Decision ladder per classification call:

1. Apply rule → confident result? → done (T0)
2. Rule returns unknown → try T1 local model → confidence > threshold? → done (T1)
3. T1 confidence low → try T2 cloud (if tenant has opted in and quota available) → done (T2)
4. T2 unavailable or quota exceeded → return deterministic fallback value → done (T0 fallback)

4.3 Offline / Local Model Setup

Runtime: Ollama — runs as a sidecar container or locally on the developer machine.

Models:

  • Text/classification: llama3.2:3b (fast, ~2GB, sufficient for classification tasks)
  • Code analysis: deepseek-coder:6.7b (JavaScript/GlideScript understanding)
  • Fallback if Ollama unavailable: rule-based result as-is

Deployment options:

# Option A: Sidecar container (production)
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
# Pulled once, models cached on the named volume

# Option B: Local dev
brew install ollama && ollama serve
# In a second terminal (ollama pull takes one model per invocation):
ollama pull llama3.2:3b
ollama pull deepseek-coder:6.7b

# Option C: None / degraded mode
# All LLM calls skip to fallback
# Connectors still run, all outputs deterministic

Privacy guarantee: T1 never sends data off-host. Customer table names, script code, and property values stay local.
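A minimal sketch of how a connector might talk to the local Ollama instance using only the standard library. It relies on Ollama's HTTP API (`GET /api/tags` to list models, `POST /api/generate` for a completion); the function names, timeouts, and default URL constant are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; an OLLAMA_URL env var could override it


def ollama_available(base_url: str = OLLAMA_URL, timeout: float = 1.0) -> bool:
    """Cheap liveness probe: list installed models."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, connection refused, and timeouts all derive from OSError
        return False


def build_generate_request(prompt: str, model: str = "llama3.2:3b",
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2:3b", base_url: str = OLLAMA_URL,
             timeout: float = 30.0) -> Optional[str]:
    """Return the model's text, or None so the caller can fall back to rules."""
    if not ollama_available(base_url):
        return None
    try:
        request = build_generate_request(prompt, model, base_url)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except (OSError, KeyError, ValueError):
        return None
```

Returning `None` on any failure keeps Option C (degraded mode) the default behaviour: the caller treats a missing answer exactly like a missing model.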


4.4 Cloud Model Integration

Provider: Anthropic Claude (primary), OpenAI (secondary / fallback)
Auth: Environment variable ANTHROPIC_API_KEY / OPENAI_API_KEY (pre-resolved from 1Password at container start, same pattern as GH_TOKEN)

Opt-in model: Off by default for customer data (scripts, table names with actual data). On by default for metadata-only tasks (permission strings, ARM role names — these are not sensitive).

Quota / rate limit handling:

class LLMEnricher:
    """Tiered classifier: rules first (T0), local model (T1), then cloud (T2).

    ClassificationResult, RateLimitError, and the rules / ollama / cloud
    clients are assumed to be provided by the surrounding connector code.
    """

    def classify(self, input: str, task: str) -> ClassificationResult:
        # T0: rule
        result = self.rules.classify(input)
        if result.confidence == "high":
            return result

        # T1: local
        if self.ollama.available():
            result = self.ollama.classify(input, task)
            if result.confidence >= self.threshold:
                return result

        # T2: cloud (if permitted and quota available)
        if self.cloud_permitted and not self.quota_exceeded:
            try:
                return self.cloud.classify(input, task)
            except RateLimitError:
                self.quota_exceeded = True  # skip T2 for the rest of this run
            except Exception:
                pass  # any other cloud failure falls through to the fallback

        # T0 fallback
        return self.rules.fallback_classification(input)

Caching: All LLM classification results are cached by input hash. Permission strings and table names are stable — cache is long-lived (7 days). Endpoint hostnames: 24h TTL.
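A minimal in-memory sketch of the hash-keyed cache with per-task TTLs described above. The class name, task keys, and injectable clock are assumptions; a Redis-backed version would follow the same shape.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple

# TTLs from the text: 7 days for stable inputs, 24h for endpoint hostnames.
TTL_SECONDS = {
    "permission": 7 * 24 * 3600,
    "table_name": 7 * 24 * 3600,
    "endpoint": 24 * 3600,
}


class ClassificationCache:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[str, float]] = {}

    @staticmethod
    def _key(task: str, value: str) -> str:
        # Hash task and input together so the same string under different tasks cannot collide.
        return hashlib.sha256(f"{task}\x00{value}".encode("utf-8")).hexdigest()

    def get(self, task: str, value: str) -> Optional[str]:
        entry = self._entries.get(self._key(task, value))
        if entry is None:
            return None
        result, expires_at = entry
        return result if self._clock() < expires_at else None

    def put(self, task: str, value: str, result: str) -> None:
        expires_at = self._clock() + TTL_SECONDS[task]
        self._entries[self._key(task, value)] = (result, expires_at)
```

Storing only the hash as the key also means raw customer strings are not kept in the cache index.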


4.5 Fallback Guarantees

The system must remain fully operational with zero LLM availability. This means:

| Scenario | Behavior |
|---|---|
| Ollama not installed / not running | T1 skipped, goes to T2 or T0 fallback |
| Cloud API key missing | T2 skipped, goes to T0 fallback |
| Cloud quota exhausted | quota_exceeded flag set, T2 skipped for remainder of run |
| Cloud rate limited | Exponential backoff (max 3 retries), then T0 fallback |
| Cloud returns malformed response | Validation error → T0 fallback |
| Local model returns low-confidence | T2 if available, else T0 fallback |
| All models unavailable | 100% deterministic output, same as today |

The fallback value is always the current behavior — not an error, not null. The connector produces a valid NormalizedGraph regardless.
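The "exponential backoff (max 3 retries), then T0 fallback" row could look roughly like this. `RateLimited` is a hypothetical stand-in for the provider SDK's rate-limit exception, and the 1-second base delay is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the provider SDK's rate-limit error."""


def call_with_backoff(call: Callable[[], T], fallback: Callable[[], T],
                      max_retries: int = 3, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try the cloud call; on rate limits retry with doubling delays, else fall back."""
    for attempt in range(max_retries + 1):  # one initial attempt plus max_retries retries
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                break
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
        except Exception:
            break  # malformed response, network error, etc.: do not retry
    return fallback()
```

The fallback is a callable rather than a value so that the deterministic path is only computed when it is actually needed.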

Observability: Each enriched field carries a _source metadata annotation:

{
  "egress_category": "llm",
  "_source": {
    "egress_category": "t1_local_llama3.2",
    "confidence": 0.91
  }
}

This allows downstream audit: "Was this classification rule-based or AI-inferred?"
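A small sketch of writing and querying the annotation, following the JSON shape shown above. The helper names and the `t0_rule` label for rule-based values are assumptions.

```python
from typing import Any, Dict, Optional


def annotate(record: Dict[str, Any], field: str, value: Any, source: str,
             confidence: Optional[float] = None) -> Dict[str, Any]:
    """Return a copy of record with field set and its provenance recorded under _source."""
    out = dict(record)
    out[field] = value
    src = dict(out.get("_source", {}))
    src[field] = source
    if confidence is not None:
        src["confidence"] = confidence
    out["_source"] = src
    return out


def is_ai_inferred(record: Dict[str, Any], field: str) -> bool:
    """Treat a missing annotation as rule-based, which matches pre-enrichment output."""
    return not record.get("_source", {}).get(field, "t0_rule").startswith("t0")


rec = annotate({}, "egress_category", "llm", "t1_local_llama3.2", 0.91)
print(is_ai_inferred(rec, "egress_category"))  # True
```

Note that in this shape `confidence` is a single key shared by all fields in the record; if several fields are enriched independently, a per-field confidence would need a nested structure.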


5. Implementation Roadmap

Phase 1

Goal: Reduce unknown egress and unknown data domain rates in connector output.

Tasks:

  1. Add LLMEnricher utility class to sv0-connectors/shared/sv0_common/
  2. Wire into egress_classifier.py — classify unknown endpoints via T1
  3. Wire into origin_classifier.py — classify unknown table names via T1
  4. Add _source annotation to all classified fields
  5. Add Ollama to local dev setup (docker-compose sidecar)
  6. Add OLLAMA_URL and ANTHROPIC_API_KEY to environment resolution

Effort: Medium — the classification interfaces are clean and isolated.


Phase 2 — Permission Interpretation (Cloud, Cached)

Goal: Reduce Uncategorized permission type rate; improve blast radius accuracy.

Tasks:

  1. Add permission classification cache (Redis or in-memory with TTL)
  2. Wire T2 cloud call into permission_mapper.py for unmapped permissions
  3. Extend arm_roles.py to handle custom RBAC roles via T2 + cache

Effort: Small — permission strings are not sensitive, no privacy concern.


Phase 3 — Platform Contextual Summaries (Async, Opt-in)

Goal: Add contextual_summary field to high-priority findings for deeper narrative.

Tasks:

  1. Add async enrichment job that runs after finding write
  2. Build prompt with: finding type, evidence refs, path metadata, cluster membership
  3. Write contextual_summary to finding doc
  4. Display in FindingDetail.tsx as a clearly-labelled "AI Insight" section
  5. Add tenant-level opt-in flag (features.ai_contextual_summaries)

Effort: Medium — requires async job infrastructure and UI changes.


Phase 4 — Script Code Analysis (Opt-in, Privacy-Gated)

Goal: Improve Business Rule and Script Include coverage by understanding script logic.

Tasks:

  1. Add script analysis via local deepseek-coder model
  2. Gate behind explicit tenant opt-in (customer code is sensitive)
  3. Parse: "what tables does this script touch, what external URLs does it call?"
  4. Merge results with existing regex-based analysis (LLM supplements, doesn't replace)

Effort: High — requires careful privacy handling and evaluation of model quality on GlideScript.


6. What to Keep Deterministic (Summary)

| Component | Reason |
|---|---|
| buildNarrative() / buildVerdictSentence() | Numbers must be exact; text is already professional; CISO auditability |
| deterministic_explanation on findings | Field contract requires reproducibility; named "deterministic" by design |
| Ownership validation (valid/invalid/ambiguous) | Three-state governance decision; rules are complete and correct |
| Risk group assignment (RG1–RG5) | Directly drives remediation priority; must be auditable |
| All severity scores | Regulatory and compliance implications; LLM severity variation unacceptable |
| Finding status transitions | Workflow decisions; must be user-controlled and auditable |

7. Open Questions

  1. Ollama in production: Do we deploy a sidecar per connector run, or a shared Ollama instance? What GPU/memory allocation?
  2. Model versioning: When a local model is updated, historical classifications may change. Do we re-classify, or lock by model version?
  3. Tenant opt-in UX: How do tenants enable/disable AI enrichment? Settings page toggle? Per-feature flags?
  4. Confidence threshold: What confidence score from the local model triggers escalation to cloud? Needs empirical calibration.
  5. Script analysis scope: Phase 4 requires sending customer code to a local model. Is this acceptable to all tenants? It likely needs a DPA clause.
  6. Evaluation dataset: Before shipping Phase 1, we need a labeled dataset of unknown egress URLs and table names to measure precision/recall improvement.
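For the evaluation dataset in item 6, per-label precision and recall can be computed with a few lines. The labels and sample data below are made up for illustration only.

```python
from typing import List, Tuple


def precision_recall(gold: List[str], predicted: List[str], label: str) -> Tuple[float, float]:
    """Per-label precision and recall over aligned gold / predicted label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data only: four hand-labelled endpoints and a model's guesses.
gold = ["llm", "llm", "external", "internal"]
pred = ["llm", "external", "llm", "internal"]
print(precision_recall(gold, pred, "llm"))  # (0.5, 0.5)
```

Running the same function on rule-only output and on rule-plus-T1 output gives the before/after comparison the question asks for.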

8. References

  • sv0-platform/ui/src/pages/RiskClusterDetailPage.tsx: buildNarrative() function
  • sv0-platform/src/services/risk-cluster-service.ts: RISK_CLUSTER_DEFS, 7 cluster types
  • sv0-platform/src/evaluator/rules/: 16 finding rule implementations
  • sv0-platform/src/evidence/remediation.ts: remediation action builders
  • sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/egress_classifier.py
  • sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/origin_classifier.py
  • sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/permission_mapper.py
  • sv0-connectors/shared/sv0_azure/sv0_azure/arm_roles.py
  • Ollama — local model serving
  • Anthropic Claude API — claude-haiku-4-5 for classification, claude-sonnet-4-6 for reasoning

-- Delta (sv0-delta)


Next Action

Status: research-complete
Decision needed from: PO (Ivan)
Options:

  1. Adopt — create GitHub issue in sv0-platform for Phase 1: LLMEnricher service, egress URL classifier (T1 Ollama → T2 Haiku fallback), unknown resolution in origin_classifier.py
  2. Defer — revisit after AWS connector is scoped (competing implementation bandwidth)
  3. Reject — deterministic rules sufficient for current customer base

Prompt injection risk: Any T1/T2 classification of customer-controlled strings must treat model output as untrusted — validate against allowlist, never eval. Flag for security review before Phase 1 ships.
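One way to treat model output as untrusted is to parse it as data and accept only allowlisted labels. The sketch below assumes the model is asked to reply with a JSON object containing a `label` key; that response shape and the `parse_label` name are assumptions.

```python
import json
from typing import FrozenSet

EGRESS_LABELS: FrozenSet[str] = frozenset({"llm", "external", "internal", "none", "unknown"})


def parse_label(raw_output: str, allowed: FrozenSet[str] = EGRESS_LABELS,
                default: str = "unknown") -> str:
    """Parse model output as JSON and return the label only if it is allowlisted."""
    try:
        payload = json.loads(raw_output)  # parsed as data; never eval'd or executed
    except (TypeError, ValueError):
        return default
    if not isinstance(payload, dict):
        return default
    label = payload.get("label")
    if isinstance(label, str) and label.strip().lower() in allowed:
        return label.strip().lower()
    return default


print(parse_label('{"label": "llm"}'))                        # llm
print(parse_label("Ignore previous instructions and ..."))    # unknown
```

Anything that is not a well-formed, allowlisted answer collapses to the same `unknown` value the rules would have produced, so an injected string can at worst leave a classification unresolved.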

GitHub Issue: https://github.com/SecurityV0/sv0-platform/issues/72