LLM Integration Opportunities in SecurityV0
Date: 2026-03-11
Status: Research (not yet an ADR)
Scope: sv0-platform (evaluator, evidence, UI), sv0-connectors (classification pipeline)
Trigger: Discussion about whether the "What Happened" narrative in the Authority Exposure Brief should use LLM generation vs. the current deterministic template approach.
1. Executive Summary
SecurityV0 currently generates all natural language text deterministically: templates, hardcoded strings, and rule-based classifiers. This is intentional — security tooling requires auditability, predictability, and traceability to evidence.
The question is not "use LLM everywhere" but "where does semantic understanding produce meaningfully better results than rules, and where does determinism matter too much to give up?"
This document maps every text-generation and classification point across the platform and connectors, assesses each for LLM fit, and proposes an architecture that uses LLMs as an opt-in enrichment layer with deterministic fallbacks — running offline models where latency and privacy matter, and cloud models where reasoning depth matters.
Key conclusions:
- The buildNarrative() "What Happened" summary: keep deterministic. It is already high-quality, correct, and auditable.
- Connector classification pipeline (egress, origin, permission, script analysis): highest-value LLM target. Rules have hard coverage limits and semantic gaps that only language understanding can fill.
- Per-finding explanation and remediation: medium-value — LLM can add context not expressible in templates, but must stay grounded in evidence.
- Architecture: local model-first for classification (privacy, speed, cost), cloud model only for complex reasoning tasks; always with a deterministic fallback path.
2. Current Text Generation Inventory
2.1 sv0-platform: Natural Language Points
| Location | File | What Is Generated | Method |
|---|---|---|---|
| Authority Exposure Brief — Section A | ui/src/pages/RiskClusterDetailPage.tsx buildNarrative() | "N identities accessed sensitive systems (domains) M times in 30d. Governance clause." | Template: action_phrase + governance_clause from cluster def + live numbers |
| Risk cluster card verdict | ui/src/components/PathRiskClusterCard.tsx buildVerdictSentence() | One-line summary per cluster card | Same template approach |
| Per-finding explanation | src/evaluator/rules/*.ts (16 rules) | deterministic_explanation field on each finding | Hardcoded string per rule type, some dynamic values injected |
| Remediation actions | src/evidence/remediation.ts + src/services/remediation-service.ts | action + rationale + reduction_effect per action | Switch on finding type, context-aware builders |
| Evidence pack markdown | src/evidence/markdown.ts | Full markdown export of evidence pack | Template-based formatting |
| Cluster remediation bullets | src/services/risk-cluster-service.ts RISK_CLUSTER_DEFS | 3-5 bullets per cluster type (7 types) | Hardcoded strings |
| Page-level static text | Various pages/*.tsx | Section labels, explainer text | Hardcoded JSX strings |
2.2 sv0-connectors: Classification Pipeline
| Stage | File | What It Classifies | Current Method |
|---|---|---|---|
| Egress classification | core/egress_classifier.py | Endpoint type: llm / external / internal / none / unknown | Hardcoded domain catalog + regex markers for dynamic URLs |
| Data origin (sensitivity) | core/origin_classifier.py | Data domain: hr / identity / customer / financial / unknown | Pattern matching on ServiceNow table names (sn_hr_*, sys_user*) |
| Ownership validation | core/ownership_validator.py | Status: valid / invalid / ambiguous | Deterministic rules on owner activity, count, type |
| Risk grouping | core/risk_grouper.py | Risk group: RG1–RG5 | Hardcoded matrix: egress × sensitivity |
| Permission canonicalization | core/permission_mapper.py | OAA type: DataRead / DataWrite / … / Uncategorized | Hardcoded mapping + fallback pattern-match |
| ARM role actions | shared/sv0_azure/arm_roles.py | Actions: read / write / delete / admin | Hardcoded for 40+ known roles, conservative fallback |
| Script analysis | adapters/servicenow_client.py | Table mutations, REST call targets from script code | Regex pattern matching (not AST) |
| Resource sensitivity | core/transformer.py | Sensitivity: restricted / confidential / internal / public | Hardcoded table list + domain mapping |
3. Where LLM Adds Real Value (and Where It Doesn't)
3.1 Do NOT use LLM — deterministic is correct
The buildNarrative() "What Happened" summary:
- Text quality is already high ("3 autonomous identities accessed sensitive systems…")
- Numbers and domains must be exact — LLM cannot improve on precision
- Auditability: CISOs need to trace every word back to evidence; LLM phrasing variation undermines this
- Cost and latency: adding an LLM call to every page load for no meaningful gain
- Decision: keep deterministic. Improve the action_phrase / governance_clause vocabulary editorially if needed.
Finding deterministic_explanation field:
- The field name itself signals the contract: it is deterministically derived from evidence
- Changing this to LLM output would break the audit chain
- Decision: keep deterministic. The explanation must be machine-reproducible.
Ownership validation (valid/invalid/ambiguous):
- This is a discrete governance decision (valid / invalid / ambiguous) with clear rules
- LLM "softening" this would introduce false confidence
- Decision: keep deterministic. Rules are correct and complete.
3.2 HIGH VALUE — use LLM
A. Egress Endpoint Classification
Files: core/egress_classifier.py in both connectors
Current problem: The hardcoded LLM_CATALOG covers ~8 known providers. Dynamic URLs (built from variables: ${}, gs.getProperty, config lookups) always fall back to unknown. Unknown egress = invisible risk.
What LLM adds:
- Classify novel endpoints by hostname semantics: "Is api.bedrock.us-east-1.amazonaws.com an LLM endpoint?"
- Evaluate partial URL patterns: "What does https://${env.AI_GATEWAY}/v1/chat probably resolve to?"
- Classify connection objects by description, display name, connection type metadata
Risk level: Low — LLM is expanding coverage of unknown cases, not overriding deterministic ones. Keep the existing catalog as primary; LLM fills the gap.
Recommended model: Local/offline (fast, no data egress, endpoint hostnames are not sensitive). Ollama with llama3.2 or phi-3-mini.
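A minimal sketch of the catalog-first flow described above. It assumes a two-entry stand-in for LLM_CATALOG and a local model exposed as a plain callable; EgressResult, classify_egress, fake_model, and the tier labels are illustrative names, not the connector's actual API. A catalog hit is final, and the model is consulted only for hosts the catalog does not cover.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical stand-in for the real LLM_CATALOG; entries are illustrative.
LLM_CATALOG = frozenset({"api.openai.com", "api.anthropic.com"})
CATEGORIES = frozenset({"llm", "external", "internal", "none"})


@dataclass(frozen=True)
class EgressResult:
    category: str
    source: str  # "t0_rule" or "t1_local" (assumed tier labels)


def classify_egress(url: str, model: Optional[Callable[[str], str]] = None) -> EgressResult:
    host = urlparse(url).hostname or ""
    # T0: a catalog hit is final and is never overridden by the model.
    if host in LLM_CATALOG:
        return EgressResult("llm", "t0_rule")
    # T1: ask the local model only about hosts the catalog does not cover.
    if model is not None and host:
        guess = model(host).strip().lower()
        if guess in CATEGORIES:
            return EgressResult(guess, "t1_local")
    # No usable answer: keep today's behavior.
    return EgressResult("unknown", "t0_rule")


def fake_model(host: str) -> str:
    """Stub standing in for a local model call."""
    return "llm" if "bedrock" in host else "unknown"
```

With the stub, classify_egress("https://api.bedrock.us-east-1.amazonaws.com/x", fake_model) is attributed to the local tier, while the same call without a model stays unknown, which mirrors the degraded mode.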
B. Data Origin / Sensitivity Classification from Table Names
Files: core/origin_classifier.py, core/transformer.py
Current problem: Coverage limited to tables that match sn_hr_*, sys_user*, customer_* etc. Custom tables, vendor extensions, and customer-specific naming fall through to unknown. Unknown domain = missing sensitivity signal.
What LLM adds:
- Semantic table name interpretation: "What does x_acme_employee_onboarding likely contain?"
- Context from automation description + table name together
- Cross-reference: if a Business Rule fires on x_vendor_contracts and sends to an external endpoint, that is probably sensitive even without a matching pattern
Risk level: Low — again filling unknown coverage, not overriding explicit mappings.
Recommended model: Local/offline. Table names and descriptions are customer data; no cloud egress.
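As a rough illustration of how a connector could ask a local Ollama instance about a table name, the sketch below builds a request for Ollama's /api/generate endpoint on its default port. The prompt wording, the "domain" response key, and the function names build_request and classify_table are assumptions made for this example; any failure returns unknown, which matches the rule-based result today.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port
DOMAINS = ("hr", "identity", "customer", "financial", "unknown")


def build_request(table_name: str, description: str, model: str = "llama3.2:3b") -> dict:
    """Build an /api/generate payload asking for a single JSON label."""
    prompt = (
        "Classify the data domain of a ServiceNow table.\n"
        f"Table name: {table_name}\n"
        f"Automation description: {description}\n"
        'Reply with JSON of the form {"domain": "<label>"}, where <label> is one of: '
        + ", ".join(DOMAINS)
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,   # one JSON response instead of a token stream
        "format": "json",  # ask Ollama to constrain the output to JSON
        "options": {"temperature": 0},
    }


def classify_table(table_name: str, description: str, timeout: float = 5.0) -> str:
    """Return a domain label, or "unknown" if the local model cannot answer."""
    body = json.dumps(build_request(table_name, description)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.loads(response.read())
        label = json.loads(payload["response"])["domain"]
    except (OSError, ValueError, KeyError, TypeError):
        return "unknown"  # degraded mode: same result as the rule-based path
    # Treat the model's answer as untrusted: accept known labels only.
    return label if label in DOMAINS else "unknown"
```

Nothing here leaves the host: the request goes to the local Ollama endpoint, and a missing or stopped Ollama process simply yields unknown.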
C. ServiceNow Script Analysis (Semantic Code Understanding)
Files: adapters/servicenow_client.py (analyze_script_mutations, analyze_script_queries)
Current problem: Script analysis is regex-based string matching. It misses:
- Dynamic GlideRecord table names: gr.initialize(tableName) where tableName is a variable
- Indirect REST calls: callMyHelper() where helper calls the external endpoint
- Chained script includes
- Any complexity beyond GlideRecord('literal_table_name')
What LLM adds:
- AST-level code understanding (JavaScript/GlideScript)
- "What tables does this script touch?" with reasoning about variable resolution
- "Does this script call any external HTTP endpoints?"
- "What data does this script read vs. write?"
Risk level: Medium — script code is customer IP. Cloud model requires explicit tenant consent or PII stripping. Prefer local model; fall back to current regex if model unavailable.
Recommended model: Local/offline for privacy. codellama or deepseek-coder perform well on JavaScript. Cloud model (Claude) for complex multi-file chains if tenant has opted in.
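To make the regex coverage gap concrete, here is a small sketch. The patterns are illustrative and are not the ones in servicenow_client.py; literal_tables and needs_semantic_analysis are hypothetical helper names. The first function finds only literal table names, and the second flags scripts that contain a non-literal GlideRecord argument, which could serve as the trigger for handing a script to the local code model.

```python
import re

# Matches only the literal form: new GlideRecord('table_name')
LITERAL_GLIDERECORD = re.compile(r"""new\s+GlideRecord\(\s*['"](\w+)['"]\s*\)""")
# Matches any constructor call so the argument can be inspected.
ANY_GLIDERECORD = re.compile(r"new\s+GlideRecord\(\s*([^)]*?)\s*\)")


def literal_tables(script: str) -> set:
    """Table names a literal-only regex can recover."""
    return set(LITERAL_GLIDERECORD.findall(script))


def needs_semantic_analysis(script: str) -> bool:
    """True if any GlideRecord constructor takes a non-literal argument."""
    args = ANY_GLIDERECORD.findall(script)
    return any(arg and arg[0] not in "'\"" for arg in args)


SCRIPT = """
var gr = new GlideRecord('incident');
var tableName = gs.getProperty('x_acme.target_table');
var dyn = new GlideRecord(tableName);
"""
```

For SCRIPT, the literal pattern recovers only incident; the table behind tableName is invisible to it, so the script is flagged for semantic analysis.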
D. Unknown Permission Canonicalization
Files: core/permission_mapper.py, shared/sv0_azure/arm_roles.py
Current problem: Unmapped Azure/ServiceNow permissions fall back to conservative defaults (["read", "write"]). Custom Azure RBAC roles (common in enterprise tenants) are unknown. This over-counts write permissions.
What LLM adds:
- "What does the Azure permission Microsoft.MachineLearningServices/workspaces/connections/listsecrets/action actually allow?" → classified as DataRead + admin-level
- Custom RBAC role interpretation from role description
- Cross-reference Microsoft docs for unknown permission strings
Risk level: Low — permission strings are not sensitive, cloud model acceptable.
Recommended model: Cloud (Claude Haiku or GPT-4o-mini). Permission strings benefit from up-to-date cloud model knowledge. Cache results aggressively (same permission → same classification, immutable).
3.3 MEDIUM VALUE — use LLM with care
E. Per-Path Contextual Explanation (new field, not replacing deterministic_explanation)
Files: Would be a new contextual_summary field in FindingDoc or evidence pack
Current gap: The deterministic explanation says what happened (e.g., "no OWNED_BY relationship detected"). It does not say why this matters in context (e.g., "this unbound Foundry agent is routing data to an LLM while accessing HR profiles — the combination of unowned + sensitive + LLM egress is the highest-risk pattern in the system").
What LLM adds:
- Cross-finding narrative: synthesize multiple findings on a path into a coherent risk story
- Severity calibration rationale: "This is critical because X + Y + Z together"
- Comparison context: "This path executed 47 times last month vs. 3 times the month before — the spike is anomalous"
Implementation constraint: Must be clearly labelled as AI-generated and supplementary to the deterministic fields. Must cite specific evidence refs. Cannot contradict the deterministic fields.
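One way the "must cite specific evidence refs" constraint could be enforced before a summary is stored is sketched below. The [ev:<id>] citation syntax and the function name summary_problems are assumptions for illustration; the document does not define a citation format. The check cannot detect contradictions with the deterministic fields, only missing or unknown citations.

```python
import re
from typing import List, Set

# Assumed citation syntax: [ev:<evidence-ref-id>]
CITATION = re.compile(r"\[ev:([A-Za-z0-9_.-]+)\]")


def summary_problems(summary: str, known_refs: Set[str]) -> List[str]:
    """Return reasons to reject an AI-generated summary (empty list = accept)."""
    cited = set(CITATION.findall(summary))
    problems = []
    if not cited:
        problems.append("summary cites no evidence refs")
    # Every cited ref must exist on the finding; anything else is a hallucination.
    unknown = sorted(cited - known_refs)
    if unknown:
        problems.append("summary cites unknown evidence refs: " + ", ".join(unknown))
    return problems
```

A summary that fails the check would simply not be written, leaving the finding with its deterministic fields only.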
Recommended model: Cloud (Claude Sonnet). This is a reasoning-heavy task that benefits from a capable model. Run async after findings are written (not in the hot path).
F. Remediation Advice Personalization
Files: src/evidence/remediation.ts, src/services/risk-cluster-service.ts
Current gap: Remediation bullets are per-cluster-type (7 types, hardcoded). They are correct but generic: "Assign an active owner" applies to every orphaned path regardless of context.
What LLM adds:
- Context-specific phrasing: "Assign an owner from the HR Engineering team: this automation accesses sn_hr_core_profile and should be owned by someone accountable for HR data"
- Priority adjustment based on execution volume: "With 681 executions in 30 days, this should be remediated this sprint, not next quarter"
- Org-aware suggestions (if tenant metadata is available)
Implementation constraint: Same as per-path explanation: clearly labelled, supplementary, and not replacing the structured_actions[] array.
4. Architectural Design
4.1 LLM Layer Position
Connector Pipeline (Python)
┌──────────────────────────────────────────────────────────┐
│  Extract → Correlate → Classify → Transform → Submit     │
│                           ▲                              │
│               ┌───────────┴───────────┐                  │
│               │ Rule-based (primary)  │                  │
│               │ LLM enrichment        │ ← NEW: fills     │
│               │ (for unknowns only)   │   unknown cases  │
│               └───────────────────────┘                  │
└──────────────────────────────────────────────────────────┘
Platform Pipeline (TypeScript)
┌──────────────────────────────────────────────────────────┐
│  Ingest → Evaluate → Evidence → API → UI                 │
│                           ▲                              │
│               ┌───────────┴───────────┐                  │
│               │ Async enrichment      │ ← NEW: runs      │
│               │ (contextual_summary)  │   after write    │
│               └───────────────────────┘                  │
└──────────────────────────────────────────────────────────┘
Key principle: LLM is never in the hot/synchronous path for classification that has a deterministic answer. LLM only runs when:
- A rule returns unknown / uncategorized (connector classification)
- All deterministic fields are already written and we are adding supplementary context (platform)
4.2 Model Tier Strategy
| Tier | Model | When | Why |
|---|---|---|---|
| T0 — Deterministic | No model | Rules produce a confident answer | Zero latency, full auditability, no cost |
| T1 — Local/Offline | Ollama (llama3.2, phi-3.5-mini, deepseek-coder) | Classifying unknown outputs from rules; script analysis; table name semantics | No data egress, fast (<200ms on CPU for small inputs), no per-call cost, runs in container |
| T2 — Cloud (efficient) | Claude Haiku / GPT-4o-mini | Permission string lookup; cases where T1 confidence is low | Better knowledge of Azure/ServiceNow APIs, cheap, cacheable |
| T3 — Cloud (capable) | Claude Sonnet | Per-path contextual summary; complex multi-finding synthesis | Reasoning-heavy, async, not latency-sensitive |
Decision ladder per classification call:
1. Apply rule → confident result? → done (T0)
2. Rule returns unknown → try T1 local model → confidence > threshold? → done (T1)
3. T1 confidence low → try T2 cloud (if tenant has opted in and quota available) → done (T2)
4. T2 unavailable or quota exceeded → return deterministic fallback value → done (T0 fallback)
4.3 Offline / Local Model Setup
Runtime: Ollama — runs as a sidecar container or locally on the developer machine.
Models:
- Text/classification: llama3.2:3b (fast, ~2GB, sufficient for classification tasks)
- Code analysis: deepseek-coder:6.7b (JavaScript/GlideScript understanding)
- Fallback if Ollama unavailable: rule-based result as-is
Deployment options:
Option A: Sidecar container (production)
    docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
    Image pulled once; models cached on the named volume
Option B: Local dev
    brew install ollama && ollama serve
    ollama pull llama3.2:3b
    ollama pull deepseek-coder:6.7b
    (run the pull commands in a second shell while the server is up; ollama pull takes one model per call)
Option C: None / degraded mode
    All LLM calls skip to fallback
    Connectors still run; all outputs are deterministic
Privacy guarantee: T1 never sends data off-host. Customer table names, script code, and property values stay local.
4.4 Cloud Model Integration
Provider: Anthropic Claude (primary), OpenAI (secondary / fallback)
Auth: Environment variable ANTHROPIC_API_KEY / OPENAI_API_KEY (pre-resolved from 1Password at container start, same pattern as GH_TOKEN)
Opt-in model: Off by default for customer data (scripts, table names with actual data). On by default for metadata-only tasks (permission strings, ARM role names — these are not sensitive).
Quota / rate limit handling:
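class LLMEnricher:
    # ClassificationResult, RateLimitError, and the rules / ollama / cloud
    # clients are defined elsewhere; this sketch shows only the tier ordering.

    def classify(self, value: str, task: str) -> ClassificationResult:
        # T0: deterministic rule. A confident rule result is always final.
        result = self.rules.classify(value)
        if result.confidence == "high":
            return result

        # T1: local model, accepted only at or above the configured threshold.
        if self.ollama.available():
            result = self.ollama.classify(value, task)
            if result.confidence >= self.threshold:
                return result

        # T2: cloud model (only if permitted and quota is available).
        if self.cloud_permitted and not self.quota_exceeded:
            try:
                return self.cloud.classify(value, task)
            except RateLimitError:
                self.quota_exceeded = True  # skip T2 for the rest of this run
            except Exception:
                pass  # any other cloud failure falls through to T0

        # T0 fallback: today's deterministic value, never an error or null.
        return self.rules.fallback_classification(value)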
Caching: All LLM classification results are cached by input hash. Permission strings and table names are stable — cache is long-lived (7 days). Endpoint hostnames: 24h TTL.
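A minimal in-memory sketch of a hash-keyed cache with a per-cache TTL, as described above. ClassificationCache and its method names are hypothetical; including the task in the key is an assumption made so the same string classified for two different tasks does not collide. The clock is injectable so expiry can be exercised without waiting.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ClassificationCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def key(task: str, value: str) -> str:
        # Hash of task + input; the NUL separator avoids ambiguous joins.
        return hashlib.sha256(f"{task}\x00{value}".encode("utf-8")).hexdigest()

    def get(self, task: str, value: str) -> Optional[str]:
        cache_key = self.key(task, value)
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        expires_at, result = entry
        if self._clock() >= expires_at:
            del self._entries[cache_key]  # expired: force a fresh classification
            return None
        return result

    def put(self, task: str, value: str, result: str) -> None:
        self._entries[self.key(task, value)] = (self._clock() + self._ttl, result)


# TTLs taken from the text: 7 days for stable strings, 24h for hostnames.
permission_cache = ClassificationCache(ttl_seconds=7 * 24 * 3600)
hostname_cache = ClassificationCache(ttl_seconds=24 * 3600)
```

A process-local dictionary like this is lost on restart; a shared store would be needed if classifications should survive across connector runs.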
4.5 Fallback Guarantees
The system must remain fully operational with zero LLM availability. This means:
| Scenario | Behavior |
|---|---|
| Ollama not installed / not running | T1 skipped, goes to T2 or T0 fallback |
| Cloud API key missing | T2 skipped, goes to T0 fallback |
| Cloud quota exhausted | quota_exceeded flag set, T2 skipped for remainder of run |
| Cloud rate limited | Exponential backoff (max 3 retries), then T0 fallback |
| Cloud returns malformed response | Validation error → T0 fallback |
| Local model returns low-confidence | T2 if available, else T0 fallback |
| All models unavailable | 100% deterministic output — same as today |
The fallback value is always the current behavior — not an error, not null. The connector produces a valid NormalizedGraph regardless.
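The "exponential backoff (max 3 retries), then T0 fallback" row could look roughly like the sketch below. RateLimited is a hypothetical stand-in for the provider SDK's rate-limit exception, and the 0.5 s base delay is an assumed value, not a setting taken from the connectors. The sleep function is a parameter so the retry schedule can be checked without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the cloud SDK's rate-limit error."""


def call_with_backoff(
    call: Callable[[], T],
    fallback: T,
    max_retries: int = 3,
    base_delay: float = 0.5,  # assumed value; doubles on every retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                break  # retries exhausted
            sleep(base_delay * (2 ** attempt))
    # T0 fallback: the caller still receives a valid deterministic value.
    return fallback
```

Because the fallback is an ordinary value of the same type as a successful result, callers need no special error handling for the rate-limited case.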
Observability: Each enriched field carries a _source metadata annotation:
{
  "egress_category": "llm",
  "_source": {
    "egress_category": "t1_local_llama3.2",
    "confidence": 0.91
  }
}
This allows downstream audit: "Was this classification rule-based or AI-inferred?"
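A small sketch of how the annotation could be written and later queried. The helper names annotate and is_ai_inferred are hypothetical, and the convention that rule-based sources start with "t0" is an assumption for this example; records without any _source entry are treated as rule-based.

```python
from typing import Optional


def annotate(record: dict, field: str, value: str, tier: str,
             confidence: Optional[float] = None) -> dict:
    """Return a copy of record with field set and its origin recorded."""
    enriched = dict(record)
    enriched[field] = value
    source = {field: tier}
    if confidence is not None:
        source["confidence"] = confidence
    enriched["_source"] = source
    return enriched


def is_ai_inferred(record: dict, field: str) -> bool:
    """Audit helper: was this field produced by a model tier (T1 or above)?"""
    tier = record.get("_source", {}).get(field)
    # Assumed convention: rule-based tiers are labelled with a "t0" prefix.
    return tier is not None and not tier.startswith("t0")
```

Calling annotate({}, "egress_category", "llm", "t1_local_llama3.2", 0.91) produces the same shape as the JSON shown above.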
5. Implementation Roadmap
Phase 1 — Connector Classification (Recommended First Step)
Goal: Reduce unknown egress and unknown data domain rates in connector output.
Tasks:
- Add LLMEnricher utility class to sv0-connectors/shared/sv0_common/
- Wire into egress_classifier.py: classify unknown endpoints via T1
- Wire into origin_classifier.py: classify unknown table names via T1
- Add _source annotation to all classified fields
- Add Ollama to local dev setup (docker-compose sidecar)
- Add OLLAMA_URL and ANTHROPIC_API_KEY to environment resolution
Effort: Medium — the classification interfaces are clean and isolated.
Phase 2 — Permission Interpretation (Cloud, Cached)
Goal: Reduce Uncategorized permission type rate; improve blast radius accuracy.
Tasks:
- Add permission classification cache (Redis or in-memory with TTL)
- Wire T2 cloud call into permission_mapper.py for unmapped permissions
- Extend arm_roles.py to handle custom RBAC roles via T2 + cache
Effort: Small — permission strings are not sensitive, no privacy concern.
Phase 3 — Platform Contextual Summaries (Async, Opt-in)
Goal: Add contextual_summary field to high-priority findings for deeper narrative.
Tasks:
- Add async enrichment job that runs after finding write
- Build prompt with: finding type, evidence refs, path metadata, cluster membership
- Write contextual_summary to finding doc
- Display in FindingDetail.tsx as a clearly-labelled "AI Insight" section
- Add tenant-level opt-in flag (features.ai_contextual_summaries)
Effort: Medium — requires async job infrastructure and UI changes.
Phase 4 — Script Code Analysis (Opt-in, Privacy-Gated)
Goal: Improve Business Rule and Script Include coverage by understanding script logic.
Tasks:
- Add script analysis via local deepseek-coder model
- Gate behind explicit tenant opt-in (customer code is sensitive)
- Parse: "what tables does this script touch, what external URLs does it call?"
- Merge results with existing regex-based analysis (LLM supplements, doesn't replace)
Effort: High — requires careful privacy handling and evaluation of model quality on GlideScript.
6. What to Keep Deterministic (Summary)
| Component | Reason |
|---|---|
| buildNarrative() / buildVerdictSentence() | Numbers must be exact; text is already professional; CISO auditability |
| deterministic_explanation on findings | Field contract requires reproducibility; named "deterministic" by design |
| Ownership validation (valid/invalid/ambiguous) | Discrete governance decision; rules are complete and correct |
| Risk group assignment (RG1–RG5) | Directly drives remediation priority; must be auditable |
| All severity scores | Regulatory and compliance implications; LLM severity variation unacceptable |
| Finding status transitions | Workflow decisions; must be user-controlled and auditable |
7. Open Questions
- Ollama in production: Do we deploy a sidecar per connector run, or a shared Ollama instance? What GPU/memory allocation?
- Model versioning: When a local model is updated, historical classifications may change. Do we re-classify, or lock by model version?
- Tenant opt-in UX: How do tenants enable/disable AI enrichment? Settings page toggle? Per-feature flags?
- Confidence threshold: What confidence score from the local model triggers escalation to cloud? Needs empirical calibration.
- Script analysis scope: Phase 4 sends customer code to a local model. Is this acceptable to all tenants? Likely needs a DPA clause.
- Evaluation dataset: Before shipping Phase 1, we need a labeled dataset of unknown egress URLs and table names to measure precision/recall improvement.
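For the evaluation dataset question, per-class precision and recall over (truth, predicted) pairs can be computed as in this sketch. The function name and the pair format are assumptions for illustration; the sample pairs in the usage note are invented to show the arithmetic and are not measurements.

```python
from typing import List, Tuple


def precision_recall(pairs: List[Tuple[str, str]], positive: str) -> Tuple[float, float]:
    """Precision and recall for one class over (truth, predicted) pairs."""
    tp = sum(1 for truth, pred in pairs if truth == positive and pred == positive)
    fp = sum(1 for truth, pred in pairs if truth != positive and pred == positive)
    fn = sum(1 for truth, pred in pairs if truth == positive and pred != positive)
    # Report 0.0 when the class was never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, with pairs [("llm", "llm"), ("llm", "unknown"), ("external", "llm"), ("external", "external")] and positive="llm", there is one true positive, one false positive, and one false negative, so both precision and recall are 0.5. Running the same pairs through the rule-only path and the rule-plus-T1 path gives the before/after comparison.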
8. References
- sv0-platform/ui/src/pages/RiskClusterDetailPage.tsx: buildNarrative() function
- sv0-platform/src/services/risk-cluster-service.ts: RISK_CLUSTER_DEFS, 7 cluster types
- sv0-platform/src/evaluator/rules/: 16 finding rule implementations
- sv0-platform/src/evidence/remediation.ts: remediation action builders
- sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/egress_classifier.py
- sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/origin_classifier.py
- sv0-connectors/integrations/entra-servicenow/src/entra_servicenow/core/permission_mapper.py
- sv0-connectors/shared/sv0_azure/sv0_azure/arm_roles.py
- Ollama: local model serving
- Anthropic Claude API: claude-haiku-4-5 for classification, claude-sonnet-4-6 for reasoning
-- Delta (sv0-delta)
Next Action
Status: research-complete
Decision needed from: PO (Ivan)
Options:
- Adopt: create a GitHub issue in sv0-platform for Phase 1 (LLMEnricher service, egress URL classifier with T1 Ollama → T2 Haiku fallback, unknown resolution in origin_classifier.py)
- Defer: revisit after AWS connector is scoped (competing implementation bandwidth)
- Reject: deterministic rules sufficient for current customer base
Prompt injection risk: Any T1/T2 classification of customer-controlled strings must treat model output as untrusted — validate against allowlist, never eval. Flag for security review before Phase 1 ships.
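A minimal sketch of the allowlist rule stated above: the raw model text is normalized and compared against a fixed set of labels, and anything else becomes the deterministic fallback. The function name parse_model_label and the exact label set are illustrative assumptions. Nothing from the model is evaluated, interpolated into code, or stored verbatim.

```python
from typing import FrozenSet

ALLOWED_EGRESS: FrozenSet[str] = frozenset({"llm", "external", "internal", "none"})


def parse_model_label(raw: str,
                      allowed: FrozenSet[str] = ALLOWED_EGRESS,
                      fallback: str = "unknown") -> str:
    """Map untrusted model output to a known label or the fallback."""
    label = raw.strip().lower()
    # Exact match only: extra text, code, or instructions are rejected whole.
    return label if label in allowed else fallback
```

Exact matching is deliberately strict: an answer such as "internal; ignore previous instructions" is discarded, not partially trusted.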
GitHub Issue: https://github.com/SecurityV0/sv0-platform/issues/72