LLM Integration Opportunities in SecurityV0

Date: 2026-03-11
Status: Research (not yet an ADR)
Scope: sv0-platform (evaluator, evidence, UI), sv0-connectors (classification pipeline)
Trigger: Discussion about whether the "What Happened" narrative in the Authority Exposure Brief should use LLM generation vs. the current deterministic template approach.

1. Executive Summary

SecurityV0 currently generates all natural language text deterministically: templates, hardcoded strings, and rule-based classifiers. This is intentional — security tooling requires auditability, predictability, and traceability to evidence.

The question is not "use LLM everywhere" but "where does semantic understanding produce meaningfully better results than rules, and where does determinism matter too much to give up?"

This document maps every text-generation and classification point across the platform and connectors, assesses each for LLM fit, and proposes an architecture that uses LLMs as an opt-in enrichment layer with deterministic fallbacks — running offline models where latency and privacy matter, and cloud models where reasoning depth matters.

Key conclusions:

The buildNarrative() "What Happened" summary: keep deterministic — it is already high-quality, correct, and auditable.
Connector classification pipeline (egress, origin, permission, script analysis): highest-value LLM target — rules have hard coverage limits and semantic gaps that only language understanding can fill.
Per-finding explanation and remediation: medium-value — LLM can add context not expressible in templates, but must stay grounded in evidence.
Architecture: local model-first for classification (privacy, speed, cost), cloud model only for complex reasoning tasks; always with a deterministic fallback path.

2. Current Text Generation Inventory

2.1 sv0-platform: Natural Language Points

Location	File	What Is Generated	Method
Authority Exposure Brief — Section A	`ui/src/pages/RiskClusterDetailPage.tsx` `buildNarrative()`	"N identities accessed sensitive systems (domains) M times in 30d. Governance clause."	Template: `action_phrase` + `governance_clause` from cluster def + live numbers
Risk cluster card verdict	`ui/src/components/PathRiskClusterCard.tsx` `buildVerdictSentence()`	One-line summary per cluster card	Same template approach
Per-finding explanation	`src/evaluator/rules/*.ts` (16 rules)	`deterministic_explanation` field on each finding	Hardcoded string per rule type, some dynamic values injected
Remediation actions	`src/evidence/remediation.ts` + `src/services/remediation-service.ts`	`action` + `rationale` + `reduction_effect` per action	Switch on finding type, context-aware builders
Evidence pack markdown	`src/evidence/markdown.ts`	Full markdown export of evidence pack	Template-based formatting
Cluster remediation bullets	`src/services/risk-cluster-service.ts` `RISK_CLUSTER_DEFS`	3-5 bullets per cluster type (7 types)	Hardcoded strings
Page-level static text	Various `pages/*.tsx`	Section labels, explainer text	Hardcoded JSX strings

2.2 sv0-connectors: Classification Pipeline

Stage	File	What It Classifies	Current Method
Egress classification	`core/egress_classifier.py`	Endpoint type: llm / external / internal / none / unknown	Hardcoded domain catalog + regex markers for dynamic URLs
Data origin (sensitivity)	`core/origin_classifier.py`	Data domain: hr / identity / customer / financial / unknown	Pattern matching on ServiceNow table names (`sn_hr_`, `sys_user`)
Ownership validation	`core/ownership_validator.py`	Status: valid / invalid / ambiguous	Deterministic rules on owner activity, count, type
Risk grouping	`core/risk_grouper.py`	Risk group: RG1–RG5	Hardcoded matrix: egress × sensitivity
Permission canonicalization	`core/permission_mapper.py`	OAA type: DataRead / DataWrite / … / Uncategorized	Hardcoded mapping + fallback pattern-match
ARM role actions	`shared/sv0_azure/arm_roles.py`	Actions: read / write / delete / admin	Hardcoded for 40+ known roles, conservative fallback
Script analysis	`adapters/servicenow_client.py`	Table mutations, REST call targets from script code	Regex pattern matching (not AST)
Resource sensitivity	`core/transformer.py`	Sensitivity: restricted / confidential / internal / public	Hardcoded table list + domain mapping

3. Where LLM Adds Real Value (and Where It Doesn't)

3.1 Do NOT use LLM — deterministic is correct

The buildNarrative() "What Happened" summary:

Text quality is already high ("3 autonomous identities accessed sensitive systems…")
Numbers and domains must be exact — LLM cannot improve on precision
Auditability: CISOs need to trace every word back to evidence; LLM phrasing variation undermines this
Cost and latency: adding an LLM call to every page load for no meaningful gain
Decision: keep deterministic. Improve the action_phrase / governance_clause vocabulary editorially if needed.

Finding deterministic_explanation field:

The field name itself signals the contract: it is deterministically derived from evidence
Changing this to LLM output would break the audit chain
Decision: keep deterministic. The explanation must be machine-reproducible.

Ownership validation (valid/invalid/ambiguous):

This is a rule-based governance decision with three discrete outcomes (valid / invalid / ambiguous)
LLM "softening" this would introduce false confidence
Decision: keep deterministic. Rules are correct and complete.

3.2 HIGH VALUE — use LLM

A. Egress Endpoint Classification

Files: core/egress_classifier.py in both connectors

Current problem: The hardcoded LLM_CATALOG covers ~8 known providers. Dynamic URLs (built from variables: ${}, gs.getProperty, config lookups) always fall back to unknown. Unknown egress = invisible risk.

What LLM adds:

Classify novel endpoints by hostname semantics: "Is bedrock-runtime.us-east-1.amazonaws.com an LLM endpoint?"
Evaluate partial URL patterns: "What does https://${env.AI_GATEWAY}/v1/chat probably resolve to?"
Classify connection objects by description, display name, connection type metadata

Risk level: Low — LLM is expanding coverage of unknown cases, not overriding deterministic ones. Keep the existing catalog as primary; LLM fills the gap.

Recommended model: Local/offline (fast, no data egress, endpoint hostnames are not sensitive). Ollama with llama3.2 or phi-3.5-mini.
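
The catalog-first ordering described in this subsection can be sketched as below. This is a minimal illustration, not the contents of `core/egress_classifier.py`: the two catalog entries, the marker list, the `classify_egress` name, and the `llm_fallback` callable are all assumptions made for the example.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Illustrative entries only; the real LLM_CATALOG lives in core/egress_classifier.py.
LLM_CATALOG = {"api.openai.com", "api.anthropic.com"}
# Markers that indicate a URL is assembled at runtime (taken from the text above).
DYNAMIC_MARKERS = ("${", "gs.getProperty")
# Labels a model is allowed to return; anything else is discarded.
ALLOWED = {"llm", "external", "internal", "none"}


def classify_egress(
    url: str, llm_fallback: Optional[Callable[[str], str]] = None
) -> tuple[str, str]:
    """Return (category, source). The catalog always wins; the model only sees unknowns."""
    dynamic = any(marker in url for marker in DYNAMIC_MARKERS)
    if not dynamic:
        host = urlparse(url).hostname or ""
        if host in LLM_CATALOG:
            return "llm", "t0_rule"
    if llm_fallback is not None:
        label = llm_fallback(url)
        # Model output is untrusted: accept it only if it is a known label.
        if label in ALLOWED:
            return label, "t1_local"
    # Same value the connector produces today when no rule matches.
    return "unknown", "t0_fallback"
```

Because the model is consulted only after the catalog misses, removing `llm_fallback` reproduces the current behavior exactly.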

B. Data Origin / Sensitivity Classification from Table Names

Files: core/origin_classifier.py, core/transformer.py

Current problem: Coverage limited to tables that match sn_hr_*, sys_user*, customer_* etc. Custom tables, vendor extensions, and customer-specific naming fall through to unknown. Unknown domain = missing sensitivity signal.

What LLM adds:

Semantic table name interpretation: "What does x_acme_employee_onboarding likely contain?"
Context from automation description + table name together
Cross-reference: if a Business Rule fires on x_vendor_contracts and sends to an external endpoint, that is probably sensitive even without a matching pattern

Risk level: Low — again filling unknown coverage, not overriding explicit mappings.

Recommended model: Local/offline. Table names and descriptions are customer data; no cloud egress.

C. ServiceNow Script Analysis (Semantic Code Understanding)

Files: adapters/servicenow_client.py (analyze_script_mutations, analyze_script_queries)

Current problem: Script analysis is regex-based string matching. It misses:

Dynamic GlideRecord table names: new GlideRecord(tableName) where tableName is a variable
Indirect REST calls: callMyHelper() where helper calls the external endpoint
Chained script includes
Any complexity beyond GlideRecord('literal_table_name')
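
To make the gap concrete, here is a minimal sketch of a literal-only regex pass that also reports which `GlideRecord` constructions it could not resolve. The patterns and the `analyze_tables` helper are hypothetical and are not the ones in `adapters/servicenow_client.py`; they only illustrate where a rule has to give up and hand over to enrichment.

```python
import re

# Matches only string-literal table names, e.g. new GlideRecord('incident').
LITERAL_TABLE = re.compile(r"""new\s+GlideRecord\(\s*['"]([A-Za-z0-9_]+)['"]\s*\)""")
# Matches any GlideRecord construction and captures the raw argument text.
ANY_GLIDERECORD = re.compile(r"new\s+GlideRecord\(\s*([^)]*?)\s*\)")


def analyze_tables(script: str) -> dict:
    """Return resolved table names plus the arguments regex cannot resolve."""
    tables = sorted(set(LITERAL_TABLE.findall(script)))
    unresolved = [
        arg
        for arg in ANY_GLIDERECORD.findall(script)
        if arg and arg[0] not in "'\""  # not a quoted literal, so likely a variable
    ]
    return {
        "tables": tables,
        "unresolved_args": unresolved,
        "needs_enrichment": bool(unresolved),
    }


example = """
var gr = new GlideRecord('sn_hr_core_profile');
var tableName = gs.getProperty('x.target');
var gr2 = new GlideRecord(tableName);
"""
report = analyze_tables(example)
```

Here `report["tables"]` contains only `sn_hr_core_profile`, while `tableName` lands in `unresolved_args`; that second list is the input a model-based pass would be asked to reason about.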

What LLM adds:

Semantic code understanding beyond string matching (JavaScript/GlideScript), approximating what AST-level analysis would recover
"What tables does this script touch?" with reasoning about variable resolution
"Does this script call any external HTTP endpoints?"
"What data does this script read vs. write?"

Risk level: Medium — script code is customer IP. Cloud model requires explicit tenant consent or PII stripping. Prefer local model; fall back to current regex if model unavailable.

Recommended model: Local/offline for privacy. codellama or deepseek-coder perform well on JavaScript. Cloud model (Claude) for complex multi-file chains if tenant has opted in.

D. Unknown Permission Canonicalization

Files: core/permission_mapper.py, shared/sv0_azure/arm_roles.py

Current problem: Unmapped Azure/ServiceNow permissions fall back to conservative defaults (["read", "write"]). Custom Azure RBAC roles (common in enterprise tenants) are unknown. This over-counts write permissions.

What LLM adds:

"What does the Azure permission Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action actually allow?" → classified as DataRead + admin-level
Custom RBAC role interpretation from role description
Cross-reference Microsoft docs for unknown permission strings

Risk level: Low — permission strings are not sensitive, cloud model acceptable.

Recommended model: Cloud (Claude Haiku or GPT-4o-mini). Permission strings benefit from up-to-date cloud model knowledge. Cache results aggressively (same permission → same classification, immutable).

3.3 MEDIUM VALUE — use LLM with care

E. Per-Path Contextual Explanation (new field, not replacing `deterministic_explanation`)

Files: Would be a new contextual_summary field in FindingDoc or evidence pack

Current gap: The deterministic explanation says what happened (e.g., "no OWNED_BY relationship detected"). It does not say why this matters in context (e.g., "this unbound Foundry agent is routing data to an LLM while accessing HR profiles — the combination of unowned + sensitive + LLM egress is the highest-risk pattern in the system").

What LLM adds:

Cross-finding narrative: synthesize multiple findings on a path into a coherent risk story
Severity calibration rationale: "This is critical because X + Y + Z together"
Comparison context: "This path executed 47 times last month vs. 3 times the month before — the spike is anomalous"

Implementation constraint: Must be clearly labelled as AI-generated and supplementary to the deterministic fields. Must cite specific evidence refs. Cannot contradict the deterministic fields.
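
The three constraints above lend themselves to a mechanical check before a summary is stored. The sketch below assumes a hypothetical shape for both documents (`generated_by`, `evidence_refs`, and `severity` keys); the real FindingDoc schema may differ.

```python
def validate_contextual_summary(summary: dict, finding: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means the summary is acceptable."""
    problems = []
    known_refs = set(finding.get("evidence_refs", []))
    cited = summary.get("evidence_refs") or []

    # Constraint 1: must be labelled as AI-generated.
    if summary.get("generated_by") != "ai":
        problems.append("summary is not labelled as AI-generated")

    # Constraint 2: must cite evidence, and only evidence attached to this finding.
    if not cited:
        problems.append("summary cites no evidence refs")
    unknown = sorted(set(cited) - known_refs)
    if unknown:
        problems.append(f"summary cites unknown evidence refs: {unknown}")

    # Constraint 3: must not contradict deterministic fields (severity shown as one example).
    if "severity" in summary and summary["severity"] != finding.get("severity"):
        problems.append("summary severity contradicts the deterministic severity")

    return problems
```

A summary that fails any check would simply be dropped, leaving the finding with its deterministic fields only.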

Recommended model: Cloud (Claude Sonnet). This is a reasoning-heavy task that benefits from a capable model. Run async after findings are written (not in the hot path).

F. Remediation Advice Personalization

Files: src/evidence/remediation.ts, src/services/risk-cluster-service.ts

Current gap: Remediation bullets are per-cluster-type (7 types, hardcoded). They are correct but generic: "Assign an active owner" applies to every orphaned path regardless of context.

What LLM adds:

Context-specific phrasing: "Assign an owner from the HR Engineering team — this automation accesses sn_hr_core_profile and should be owned by someone accountable for HR data"
Priority adjustment based on execution volume: "With 681 executions in 30 days, this should be remediated this sprint, not next quarter"
Org-aware suggestions (if tenant metadata is available)

Implementation constraint: Same as per-path explanation: clearly labelled, supplementary, and not replacing the structured_actions[] array.

4. Architectural Design

4.1 LLM Layer Position

Connector Pipeline (Python)
┌──────────────────────────────────────────────────────────┐
│  Extract → Correlate → Classify → Transform → Submit     │
│                          ▲                               │
│              ┌───────────┴───────────┐                   │
│              │  Rule-based (primary) │                   │
│              │  LLM enrichment       │  ← NEW: fills     │
│              │  (for unknowns only)  │    unknown cases  │
│              └───────────────────────┘                   │
└──────────────────────────────────────────────────────────┘

Platform Pipeline (TypeScript)
┌──────────────────────────────────────────────────────────┐
│  Ingest → Evaluate → Evidence → API → UI                 │
│                          ▲                               │
│              ┌───────────┴───────────┐                   │
│              │  Async enrichment     │  ← NEW: runs      │
│              │  (contextual_summary) │    after write    │
│              └───────────────────────┘                   │
└──────────────────────────────────────────────────────────┘

Key principle: LLM is never in the hot/synchronous path for classification that has a deterministic answer. LLM only runs when:

A rule returns unknown / uncategorized (connector classification)
All deterministic fields are already written and we are adding supplementary context (platform)

4.2 Model Tier Strategy

Tier	Model	When	Why
T0 — Deterministic	No model	Rules produce a confident answer	Zero latency, full auditability, no cost
T1 — Local/Offline	Ollama (`llama3.2`, `phi-3.5-mini`, `deepseek-coder`)	Classifying `unknown` outputs from rules; script analysis; table name semantics	No data egress, fast (<200ms on CPU for small inputs), no per-call cost, runs in container
T2 — Cloud (efficient)	Claude Haiku / GPT-4o-mini	Permission string lookup; cases where T1 confidence is low	Better knowledge of Azure/ServiceNow APIs, cheap, cacheable
T3 — Cloud (capable)	Claude Sonnet	Per-path contextual summary; complex multi-finding synthesis	Reasoning-heavy, async, not latency-sensitive

Decision ladder per classification call:

Apply rule → confident result? → done (T0)
Rule returns unknown → try T1 local model → confidence > threshold? → done (T1)
T1 confidence low → try T2 cloud (if tenant has opted in and quota available) → done (T2)
T2 unavailable or quota exceeded → return deterministic fallback value → done (T0 fallback)

4.3 Offline / Local Model Setup

Runtime: Ollama — runs as a sidecar container or locally on the developer machine.

Models:

Text/classification: llama3.2:3b (fast, ~2GB, sufficient for classification tasks)
Code analysis: deepseek-coder:6.7b (JavaScript/GlideScript understanding)
Fallback if Ollama unavailable: rule-based result as-is

Deployment options:

Option A: Sidecar container (production)
  docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
  Pulled once, models cached on the ollama volume

Option B: Local dev
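  brew install ollama && ollama serve    # leave running; use a second shell for the pulls
  ollama pull llama3.2:3b                # pull accepts one model per invocation
  ollama pull deepseek-coder:6.7b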

Option C: None / degraded mode
  All LLM calls skip to fallback
  Connectors still run, all outputs deterministic

Privacy guarantee: T1 never sends data off-host. Customer table names, script code, and property values stay local.
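
A T1 call against the local sidecar could look like the sketch below. It uses Ollama's `/api/generate` endpoint with `stream` disabled and `format` set to `json`; the prompt contract (a JSON object with a `label` key), the helper names, and the 5-second timeout are assumptions for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default; would come from OLLAMA_URL in practice


def build_request(model: str, prompt: str) -> dict:
    """Body for POST /api/generate: single non-streamed JSON reply, temperature 0."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",
        "options": {"temperature": 0},
    }


def parse_label(raw_body: bytes, allowed: set) -> str:
    """Extract the label from an Ollama reply; anything unexpected becomes 'unknown'."""
    try:
        outer = json.loads(raw_body)
        inner = json.loads(outer["response"])  # the model's own JSON, carried as a string
        label = inner.get("label")
        return label if label in allowed else "unknown"
    except (ValueError, KeyError, TypeError, AttributeError):
        return "unknown"


def classify_with_ollama(model: str, prompt: str, allowed: set, timeout: float = 5.0) -> str:
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(build_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return parse_label(response.read(), allowed)
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors: Ollama absent means fallback.
        return "unknown"
```

Returning `unknown` on every failure path keeps Option C (degraded mode) identical to a host where Ollama was never installed.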

4.4 Cloud Model Integration

Provider: Anthropic Claude (primary), OpenAI (secondary / fallback)
Auth: Environment variable ANTHROPIC_API_KEY / OPENAI_API_KEY (pre-resolved from 1Password at container start, same pattern as GH_TOKEN)

Opt-in model: Off by default for customer data (scripts, table names with actual data). On by default for metadata-only tasks (permission strings, ARM role names — these are not sensitive).

Quota / rate limit handling:

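from __future__ import annotations  # ClassificationResult is defined elsewhere

import logging

from anthropic import RateLimitError  # openai.RateLimitError for the secondary provider

log = logging.getLogger(__name__)


class LLMEnricher:
    # rules, ollama, cloud, threshold and cloud_permitted are injected at
    # construction; quota_exceeded starts False and is per-run state.

    def classify(self, value: str, task: str) -> ClassificationResult:
        # T0: rule. Assumes rules report a numeric confidence (1.0 for a catalog
        # or pattern hit, 0.0 for unknown) so every tier is compared the same way.
        result = self.rules.classify(value)
        if result.confidence >= self.threshold:
            return result

        # T1: local
        if self.ollama.available():
            result = self.ollama.classify(value, task)
            if result.confidence >= self.threshold:
                return result

        # T2: cloud (if permitted and quota available)
        if self.cloud_permitted and not self.quota_exceeded:
            try:
                return self.cloud.classify(value, task)
            except RateLimitError:
                self.quota_exceeded = True  # skip T2 for the rest of this run
            except Exception:
                # Any other cloud failure is logged, then falls through to T0.
                log.exception("T2 classification failed; using T0 fallback")

        # T0 fallback: the same value the connector produces today
        return self.rules.fallback_classification(value)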

Caching: All LLM classification results are cached by input hash. Permission strings and table names are stable — cache is long-lived (7 days). Endpoint hostnames: 24h TTL.
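
A minimal in-memory version of this cache is sketched below, keyed by a SHA-256 hash of task plus input and using the two TTLs stated above. The class name, the task names, and the injectable clock are illustrative assumptions; Phase 2 leaves open whether the real store is Redis or in-memory.

```python
import hashlib
import time

# TTLs from the text: stable inputs live 7 days, endpoint hostnames 24 hours.
TTL_SECONDS = {
    "permission": 7 * 24 * 3600,
    "table_name": 7 * 24 * 3600,
    "endpoint": 24 * 3600,
}


class ClassificationCache:
    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    @staticmethod
    def _key(task: str, value: str) -> str:
        # Hash task and input together so identical strings in different tasks do not collide.
        return hashlib.sha256(f"{task}\x00{value}".encode("utf-8")).hexdigest()

    def get(self, task: str, value: str):
        key = self._key(task, value)
        entry = self._entries.get(key)
        if entry is None:
            return None
        label, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return label

    def put(self, task: str, value: str, label: str) -> None:
        expires_at = self._clock() + TTL_SECONDS[task]
        self._entries[self._key(task, value)] = (label, expires_at)
```

Storing only the hash as the key means the cache itself never holds customer table names or hostnames in clear text.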

4.5 Fallback Guarantees

The system must remain fully operational with zero LLM availability. This means:

Scenario	Behavior
Ollama not installed / not running	T1 skipped, goes to T2 or T0 fallback
Cloud API key missing	T2 skipped, goes to T0 fallback
Cloud quota exhausted	`quota_exceeded` flag set, T2 skipped for remainder of run
Cloud rate limited	Exponential backoff (max 3 retries), then T0 fallback
Cloud returns malformed response	Validation error → T0 fallback
Local model returns low-confidence	T2 if available, else T0 fallback
All models unavailable	100% deterministic output — same as today

The fallback value is always the current behavior — not an error, not null. The connector produces a valid NormalizedGraph regardless.
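
The "rate limited" row in the table can be sketched as a small retry wrapper. The `RateLimited` exception, the 1-second base delay, and the `call_with_backoff` name are placeholders; a real implementation would catch the provider SDK's own rate-limit error.

```python
import time


class RateLimited(Exception):
    """Placeholder for the provider SDK's rate-limit exception."""


def call_with_backoff(call, fallback, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Try `call`; on rate limiting retry up to `max_retries` times, then use `fallback`."""
    for attempt in range(max_retries + 1):  # one initial attempt plus the retries
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                break  # retries exhausted
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
        except Exception:
            break  # malformed response, network error, etc.: do not retry
    return fallback()
```

Passing the deterministic fallback as a callable keeps the guarantee above explicit: every exit from the wrapper yields a valid classification, never an error or null.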

Observability: Each enriched field carries a _source metadata annotation:

{
  "egress_category": "llm",
  "_source": {
    "egress_category": "t1_local_llama3.2",
    "confidence": 0.91
  }
}

This allows downstream audit: "Was this classification rule-based or AI-inferred?"
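
A helper that writes this annotation, and the matching audit check, might look like the following. The function names and the `t1_` / `t2_` / `t3_` prefix convention are assumptions derived from the example value `t1_local_llama3.2`.

```python
from typing import Optional

AI_TIER_PREFIXES = ("t1_", "t2_", "t3_")


def annotate(doc: dict, field: str, value, tier: str, confidence: Optional[float] = None) -> dict:
    """Return a copy of `doc` with `field` set and its provenance recorded under `_source`."""
    out = dict(doc)
    out[field] = value
    source = dict(out.get("_source", {}))
    source[field] = tier
    if confidence is not None:
        source["confidence"] = confidence
    out["_source"] = source
    return out


def is_ai_inferred(doc: dict, field: str) -> bool:
    """True if the field was produced by a model tier rather than a rule."""
    tier = doc.get("_source", {}).get(field, "t0_rule")
    return tier.startswith(AI_TIER_PREFIXES)


enriched = annotate({"name": "flow-1"}, "egress_category", "llm", "t1_local_llama3.2", 0.91)
```

Fields with no `_source` entry are treated as rule-based, so existing documents remain valid without a migration.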

5. Implementation Roadmap

Phase 1 — Connector Classification (Recommended First Step)

Goal: Reduce unknown egress and unknown data domain rates in connector output.

Tasks:

Add LLMEnricher utility class to sv0-connectors/shared/sv0_common/
Wire into egress_classifier.py — classify unknown endpoints via T1
Wire into origin_classifier.py — classify unknown table names via T1
Add _source annotation to all classified fields
Add Ollama to local dev setup (docker-compose sidecar)
Add OLLAMA_URL and ANTHROPIC_API_KEY to environment resolution

Effort: Medium — the classification interfaces are clean and isolated.

Phase 2 — Permission Interpretation (Cloud, Cached)

Goal: Reduce Uncategorized permission type rate; improve blast radius accuracy.

Tasks:

Add permission classification cache (Redis or in-memory with TTL)
Wire T2 cloud call into permission_mapper.py for unmapped permissions
Extend arm_roles.py to handle custom RBAC roles via T2 + cache

Effort: Small — permission strings are not sensitive, no privacy concern.

Phase 3 — Platform Contextual Summaries (Async, Opt-in)

Goal: Add contextual_summary field to high-priority findings for deeper narrative.

Tasks:

Add async enrichment job that runs after finding write
Build prompt with: finding type, evidence refs, path metadata, cluster membership
Write contextual_summary to finding doc
Display in FindingDetail.tsx as a clearly-labelled "AI Insight" section
Add tenant-level opt-in flag (features.ai_contextual_summaries)

Effort: Medium — requires async job infrastructure and UI changes.

Phase 4 — Script Code Analysis (Opt-in, Privacy-Gated)

Goal: Improve Business Rule and Script Include coverage by understanding script logic.

Tasks:

Add script analysis via local deepseek-coder model
Gate behind explicit tenant opt-in (customer code is sensitive)
Parse: "what tables does this script touch, what external URLs does it call?"
Merge results with existing regex-based analysis (LLM supplements, doesn't replace)

Effort: High — requires careful privacy handling and evaluation of model quality on GlideScript.

6. What to Keep Deterministic (Summary)

Component	Reason
`buildNarrative()` / `buildVerdictSentence()`	Numbers must be exact; text is already professional; CISO auditability
`deterministic_explanation` on findings	Field contract requires reproducibility; named "deterministic" by design
Ownership validation (valid/invalid/ambiguous)	Discrete governance decision (three outcomes); rules are complete and correct
Risk group assignment (RG1–RG5)	Directly drives remediation priority; must be auditable
All severity scores	Regulatory and compliance implications; LLM severity variation unacceptable
Finding status transitions	Workflow decisions; must be user-controlled and auditable

7. Open Questions

Ollama in production: Do we deploy a sidecar per connector run, or a shared Ollama instance? What GPU/memory allocation?
Model versioning: When a local model is updated, historical classifications may change. Do we re-classify, or lock by model version?
Tenant opt-in UX: How do tenants enable/disable AI enrichment? Settings page toggle? Per-feature flags?
Confidence threshold: What confidence score from the local model triggers escalation to cloud? Needs empirical calibration.
Script analysis scope: Phase 4 requires sending customer code to a local model. Is this acceptable to all tenants? It likely needs a DPA clause.
Evaluation dataset: Before shipping Phase 1, we need a labeled dataset of unknown egress URLs and table names to measure precision/recall improvement.
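
For the evaluation dataset question, per-label precision and recall could be computed as in the sketch below. The sample labels are invented purely to exercise the function and are not measurements.

```python
def precision_recall(gold: list, predicted: list, positive: str) -> tuple:
    """Precision and recall for one label, given parallel gold and predicted lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented example: four endpoints, hand-labelled vs. classifier output.
gold = ["llm", "llm", "external", "llm"]
predicted = ["llm", "unknown", "llm", "llm"]
precision, recall = precision_recall(gold, predicted, positive="llm")
```

Running the same function over rule-only output and rule-plus-T1 output on one labelled set would give the before/after comparison the question asks for.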

8. References

sv0-platform/ui/src/pages/RiskClusterDetailPage.tsx — buildNarrative() function
sv0-platform/src/services/risk-cluster-service.ts — RISK_CLUSTER_DEFS, 7 cluster types
sv0-platform/src/evaluator/rules/ — 16 finding rule implementations
sv0-platform/src/evidence/remediation.ts — remediation action builders
sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/egress_classifier.py
sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/origin_classifier.py
sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/permission_mapper.py
sv0-connectors/shared/sv0_azure/sv0_azure/arm_roles.py
Ollama — local model serving
Anthropic Claude API — claude-haiku-4-5 for classification, claude-sonnet-4-6 for reasoning

-- Delta (sv0-delta)

Next Action

Status: research-complete
Decision needed from: PO (Ivan)
Options:

Adopt — create GitHub issue in sv0-platform for Phase 1: LLMEnricher service, egress URL classifier (T1 Ollama → T2 Haiku fallback), unknown resolution in origin_classifier.py
Defer — revisit after AWS connector is scoped (competing implementation bandwidth)
Reject — deterministic rules sufficient for current customer base

Prompt injection risk: Any T1/T2 classification of customer-controlled strings must treat model output as untrusted — validate against allowlist, never eval. Flag for security review before Phase 1 ships.

GitHub Issue: https://github.com/SecurityV0/sv0-platform/issues/72
