ADR-011: ELK.js Graph Layout Engine

Status

Accepted (2026-02-22)

Supersedes: Dagre layout decision documented in 00-overview.md (2026-01-26).

Context

The platform's Graph Explorer uses ReactFlow (@xyflow/react v12) for interactive graph visualization. The original layout engine was Dagre (Sugiyama-based DAG layout), chosen in Phase 1 for its simplicity and determinism.

The scaling problem

Dagre places all same-rank nodes into a single vertical column. This was acceptable at MVP scale (10-30 nodes) but becomes unusable as tenant graphs grow:

15 identities → a single column 900px+ tall
20 resources → stacked vertically, requiring excessive scrolling
50+ total nodes → the graph becomes a narrow vertical strip, far taller than wide, unreadable without zooming

The root cause is fundamental to Dagre's algorithm: it has no concept of distributing same-rank nodes across multiple sub-layers or rows. The nodesep parameter controls spacing but not arrangement.

What we tried

Increasing nodesep/ranksep: Makes the problem worse (larger gaps, same single-column layout)
Post-Dagre position redistribution: Nodes spread into grids per rank, but edges route terribly because Dagre's crossing minimization assumed single-column ranks
Filtering to reduce node count: Helps but doesn't solve the core problem — security teams need the full graph for investigation

Why now

W1 (Exposure Discovery) introduces authority-path graphs with 8 workloads, 5 identities, 17 resources, and dozens of edges per tenant. Real customer environments will have 50-200+ entities. The layout must handle this scale.

Decision

Replace Dagre with ELK.js (elkjs npm package) as the sole graph layout engine across all graph components (GraphCanvas, MiniGraph).

Why ELK.js

ELK (Eclipse Layout Kernel) is a mature Java layout library ported to JavaScript. Its layered algorithm (org.eclipse.elk.layered) uses a 5-phase pipeline that directly addresses our limitations:

Cycle Breaking — handles edge direction conflicts
Layer Assignment — assigns nodes to ranks (like Dagre)
Crossing Minimization — LAYER_SWEEP strategy, significantly better than Dagre
Node Placement — NETWORK_SIMPLEX distributes same-layer nodes across sub-rows
Edge Routing — orthogonal routing with crossing avoidance by default; our configuration overrides this with POLYLINE (see Layout configuration)

Key capability: Partitioning

ELK's elk.partitioning.activate option maps directly to our executionLayer() concept. Each node is assigned a partitioning.partition value (0-6), and ELK guarantees:

Nodes within the same partition are placed in the same rank column (or adjacent sub-columns)
Partition ordering is preserved left-to-right
Nodes within a partition are distributed across sub-rows to minimize height and edge crossings

This replaces our current workaround of invisible constraint edges between layer representatives.

Layout configuration

Both layout modes (execution flow and neighborhood) share a compact base with mode-specific overrides:

const COMPACT_BASE = {
  "elk.algorithm": "layered",
  "elk.direction": "RIGHT",
  "elk.edgeRouting": "POLYLINE",          // No vertical channel reservation
  "elk.spacing.nodeNode": "20",
  "elk.spacing.edgeNode": "10",
  "elk.layered.spacing.nodeNodeBetweenLayers": "80",
  "elk.layered.spacing.edgeNodeBetweenLayers": "20",
  "elk.layered.spacing.edgeEdgeBetweenLayers": "10",
  "elk.layered.crossingMinimization.strategy": "LAYER_SWEEP",
  "elk.layered.nodePlacement.strategy": "NETWORK_SIMPLEX",
  "elk.layered.considerModelOrder.strategy": "NODES_AND_EDGES",
  "elk.layered.compaction.postCompaction.strategy": "EDGE_LENGTH",
  "elk.separateConnectedComponents": "false",
};

// Execution flow: adds partitioning for causal left-to-right ordering
const execFlowOptions = { ...COMPACT_BASE, "elk.partitioning.activate": "true" };

// Neighborhood: uses base config as-is (no partitioning)
const neighborhoodOptions = { ...COMPACT_BASE };

Critical spacing lessons learned (2026-02-22):

POLYLINE edge routing is essential — the default ORTHOGONAL reserves vertical channels between nodes for edge bends, inflating vertical spread by 3-5x even with tight nodeNode spacing
NETWORK_SIMPLEX node placement produces compact columns; BRANDES_KOEPF tries to align nodes with neighbors in adjacent layers, spreading them out
separateConnectedComponents: false prevents extra gaps between disconnected subgraphs

Per-node partition assignment (reuses existing executionLayer() logic):

Partition	Entity Types	Position
0	Trigger resources, owners	Leftmost
1	Workloads, workload-subtype identities
2	Connections, credentials, OAuth apps
3	Service principals, managed identities
4	Roles
5	Permissions
6	Non-trigger resources	Rightmost
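
The partition table can be applied while building the JSON graph that is handed to ELK. The following is a minimal sketch, not the project's actual code: the `EntityKind` names, the `GraphEntity`/`GraphEdge` shapes, the `buildExecFlowGraph` helper, and the default node size are all assumptions made for illustration (the real mapping lives in executionLayer()). Only the option keys `elk.partitioning.activate` and `elk.partitioning.partition` come from ELK itself.

```typescript
// Hypothetical entity kinds and shapes, for illustration only.
type EntityKind =
  | "trigger_resource" | "owner"
  | "workload" | "workload_identity"
  | "connection" | "credential" | "oauth_app"
  | "service_principal" | "managed_identity"
  | "role"
  | "permission"
  | "resource";

interface GraphEntity { id: string; kind: EntityKind }
interface GraphEdge { id: string; source: string; target: string }

// Mirrors the partition table: 0 is leftmost, 6 is rightmost.
const PARTITION_BY_KIND: Record<EntityKind, number> = {
  trigger_resource: 0, owner: 0,
  workload: 1, workload_identity: 1,
  connection: 2, credential: 2, oauth_app: 2,
  service_principal: 3, managed_identity: 3,
  role: 4,
  permission: 5,
  resource: 6,
};

type LayoutOptions = Record<string, string>;

interface ElkInputGraph {
  id: string;
  layoutOptions: LayoutOptions;
  children: { id: string; width: number; height: number; layoutOptions: LayoutOptions }[];
  edges: { id: string; sources: string[]; targets: string[] }[];
}

// Builds the ELK JSON input for the execution-flow mode.
function buildExecFlowGraph(
  entities: readonly GraphEntity[],
  edges: readonly GraphEdge[],
  baseOptions: LayoutOptions,
  nodeSize = { width: 180, height: 48 }, // assumed node dimensions
): ElkInputGraph {
  return {
    id: "root",
    layoutOptions: { ...baseOptions, "elk.partitioning.activate": "true" },
    children: entities.map((e) => ({
      id: e.id,
      ...nodeSize,
      // ELK layout option values are passed as strings.
      layoutOptions: { "elk.partitioning.partition": String(PARTITION_BY_KIND[e.kind]) },
    })),
    edges: edges.map((e) => ({ id: e.id, sources: [e.source], targets: [e.target] })),
  };
}
```

Using a `Record<EntityKind, number>` means the compiler flags any entity kind that is added later without a partition, which is one way to keep the table and the code in sync.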

Async layout with Web Worker

Unlike Dagre (synchronous), ELK returns a Promise. The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes.

The layout is wrapped in a useElkLayout() hook that returns:

nodes / edges — positioned ReactFlow elements
isLayouting — boolean for loading state
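
To illustrate how the hook's positioned output could be derived, here is a small sketch that copies ELK's computed coordinates onto ReactFlow-style nodes. The `FlowNode` and `ElkLaidOutNode` shapes and the `applyElkPositions` name are simplified stand-ins, not the real library types or project code. It assumes a flat (non-compound) graph, where ELK's `x`/`y` are top-left coordinates relative to the root.

```typescript
// Simplified stand-ins for the ELK result node and the ReactFlow node.
interface ElkLaidOutNode { id: string; x?: number; y?: number }
interface FlowNode<D> { id: string; position: { x: number; y: number }; data: D }

// Returns new node objects carrying ELK's coordinates.
// Nodes that ELK did not return keep their previous position.
function applyElkPositions<D>(
  flowNodes: readonly FlowNode<D>[],
  elkChildren: readonly ElkLaidOutNode[],
): FlowNode<D>[] {
  const laidOut = new Map<string, ElkLaidOutNode>();
  for (const child of elkChildren) laidOut.set(child.id, child);

  return flowNodes.map((node) => {
    const placed = laidOut.get(node.id);
    if (!placed) return node;
    return { ...node, position: { x: placed.x ?? 0, y: placed.y ?? 0 } };
  });
}
```

Returning new objects rather than mutating the inputs fits React's state model, where a changed reference is what triggers a re-render.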

Loading state handling: The sync→async transition introduces a brief empty canvas flash. To prevent this:

Show a spinner overlay while isLayouting === true
Keep the previous layout visible underneath during re-layout (don't clear nodes before new positions arrive)
Only swap to new positions once ELK completes
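
The loading-state rules above, combined with stale-request cancellation, can be sketched without any framework. Everything here is illustrative: `LatestRequestGuard`, `createLayoutRunner`, and the `layoutFn` parameter (a stand-in for the real ELK call) are hypothetical names. In a React hook, `onChange` would typically be a state setter.

```typescript
// Tracks which layout request is the most recent one.
class LatestRequestGuard {
  private current = 0;
  begin(): number {
    this.current += 1;
    return this.current;
  }
  isCurrent(token: number): boolean {
    return token === this.current;
  }
}

interface LayoutState<T> { result: T | null; isLayouting: boolean }

// layoutFn stands in for the asynchronous ELK layout call.
function createLayoutRunner<G, T>(
  layoutFn: (graph: G) => Promise<T>,
  onChange: (state: LayoutState<T>) => void,
): (graph: G) => Promise<void> {
  const guard = new LatestRequestGuard();
  let last: T | null = null;

  return async function run(graph: G): Promise<void> {
    const token = guard.begin();
    // Keep the previous layout visible while the new one is computed.
    onChange({ result: last, isLayouting: true });
    try {
      const result = await layoutFn(graph);
      // A newer request started in the meantime: drop this stale result.
      if (!guard.isCurrent(token)) return;
      last = result;
      onChange({ result, isLayouting: false });
    } catch (err) {
      // On failure, fall back to the previous layout and clear the spinner.
      if (guard.isCurrent(token)) onChange({ result: last, isLayouting: false });
      throw err;
    }
  };
}
```

This is a "latest wins" policy: superseded requests still run to completion in the worker, but their results are never shown. Actually aborting in-flight work would need additional support from the worker side.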

What stays the same

ReactFlow (@xyflow/react v12) remains the rendering layer
Node rendering: EntityNode component, colors, finding badges, data domain tags — unchanged
Edge styling: Color-coded by relationship type, dashed/solid/dotted — unchanged
BFS path highlighting: Depth-limited neighbor highlighting — unchanged
Determinism: All inputs sorted lexicographically before layout (E3 pattern preserved). ELK's considerModelOrder.strategy: "NODES_AND_EDGES" ensures stable output for identical sorted input — critical for evidence-grade screenshots
Execution flow edge reversal: EXEC_FLOW_REVERSE_EDGES set logic preserved
Filter sidebar: Entity type, findings, relationship type, source system filters — unchanged
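
As an illustration of the determinism item above, the sketch below sorts nodes and edges before layout. It assumes the sort key is the element `id`, which the ADR does not specify, and `sortForLayout` is a hypothetical helper name. A plain code-unit comparison is used instead of `localeCompare` so the order does not depend on the runtime's locale.

```typescript
interface HasId { id: string }
interface EdgeLike extends HasId { source: string; target: string }

// Code-unit comparison: the same input order on every machine.
function byId(a: HasId, b: HasId): number {
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

// Returns sorted copies; the caller's arrays are left untouched.
function sortForLayout<N extends HasId, E extends EdgeLike>(
  nodes: readonly N[],
  edges: readonly E[],
): { nodes: N[]; edges: E[] } {
  return {
    nodes: [...nodes].sort(byId),
    edges: [...edges].sort(byId),
  };
}
```

Because considerModelOrder takes the input order into account, feeding ELK the same sorted arrays is what makes repeated layouts of the same graph comparable.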

Future: Compound graph containers (Phase 2)

ELK natively supports compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN"). After the base migration, visual group containers can be added per partition:

Labeled containers ("Identities (12)", "Resources (20)")
Expand/collapse via ReactFlow's parentId + hidden pattern
Auto-collapse when total node count exceeds threshold
ELK routes edges across group boundaries correctly

This is deferred to a follow-up implementation.

Alternatives Considered

Keep Dagre with post-layout redistribution

Redistribute same-rank nodes into a grid after Dagre computes positions. Edges route poorly because Dagre's crossing minimization assumed single-column ranks. Band-aid, not a solution.

d3-force with rank constraints

Strong forceX pins nodes to rank columns, forceCollide prevents overlap, forceManyBody spreads same-rank nodes. Produces organic/physics-based layouts that are harder to read for security causal chain analysis. No edge routing. Non-deterministic without explicit seeding.

Swim lane layout (manual)

Partition canvas into horizontal bands by entity type. High implementation effort. No automated edge crossing minimization. Nodes can be dragged out of lanes. Grouping by entity type breaks causal ordering since types span multiple ranks.

Keep Dagre for small graphs, ELK for large

Maintaining two layout codepaths (two edge-mapping functions, two position transforms, two sets of edge cases) is not worth the marginal benefit. ELK handles small graphs equally well — <5ms for 10 nodes.

Consequences

Positive

Graphs with 50-200+ nodes become usable — same-rank nodes distributed, not stacked
Better edge crossing minimization (5-phase pipeline vs Dagre's simpler heuristic)
Native partitioning replaces constraint-edge workaround — cleaner code
Foundation for compound graph containers (Phase 2)
Single layout engine for all graph contexts (GraphCanvas, MiniGraph)
Web Worker from day one — UI thread never blocked regardless of graph size

Negative

Bundle size increase: +1.4MB for elkjs (acceptable for internal platform)
Async layout adds minor complexity (hook + cancellation pattern)
dagre dependency removed — any Dagre-specific behavior is lost (none identified)

Migration scope

Component	Change
`ui/package.json`	Remove `@dagrejs/dagre`, add `elkjs`
`ui/src/components/graph/layout.ts`	Replace Dagre layout with async ELK layout
`ui/src/components/graph/useElkLayout.ts`	New async hook with stale-request cancellation
`ui/src/components/graph/constants.ts`	ELK option config (shared COMPACT_BASE)
`ui/src/components/graph/GraphCanvas.tsx`	Consume async hook, layout visible entities only
`ui/src/components/graph/MiniGraph.tsx`	Consume async layout hook
`ui/src/components/AuthorityPathDiagram.tsx`	Async ELK via `buildAuthorityPathLayout()`
`src/storage/mongo/adapters/subgraph-adapter.ts`	Filter inbound-direction duplicate edges
Architecture docs	Update 00-overview.md, this ADR

Performance budget

All sizes use Web Worker (elk-worker.min.js). UI thread is never blocked.

Graph size	Expected layout time	UX
<30 nodes	<10ms	Instant (no visible spinner)
30-100 nodes	10-100ms	Instant to near-instant
100-200 nodes	100-300ms	Brief spinner overlay, previous layout visible
200+ nodes	300ms+	Spinner overlay + loading indicator

Lessons Learned (2026-02-22 tuning session)

Three problems caused the initial ELK layout to appear barely better than Dagre. Fixing each one produced a significant improvement:

1. Duplicate edges from inbound relationships (backend bug)

Entities store both inbound and outbound relationships (by design, for bidirectional traversal). The SubgraphAdapter iterated over all relationships without skipping those marked direction: "inbound", so every relationship appeared as two visual edges (A→B OWNED_BY and B→A OWNED_BY). Fix: skip relationships where rel.properties?.direction === "inbound" in all four traversal loops (neighborhood forward/reverse, execution flow forward/reverse). This cut the edge count by roughly 50%.
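
A minimal sketch of the direction filter follows. The `Entity`, `Relationship`, and `VisualEdge` shapes and the `toVisualEdges` helper are assumptions for illustration. Only the `rel.properties?.direction === "inbound"` check is taken from the fix described above.

```typescript
// Hypothetical shapes; the real adapter's types may differ.
interface Relationship {
  type: string;
  targetId: string;
  properties?: { direction?: "inbound" | "outbound" };
}
interface Entity { id: string; relationships: Relationship[] }
interface VisualEdge { id: string; source: string; target: string; type: string }

// Emits one visual edge per relationship, using only the outbound copy.
function toVisualEdges(entities: readonly Entity[]): VisualEdge[] {
  const edges: VisualEdge[] = [];
  for (const entity of entities) {
    for (const rel of entity.relationships) {
      // The inbound copy mirrors an outbound one stored on the other entity.
      if (rel.properties?.direction === "inbound") continue;
      edges.push({
        id: `${entity.id}-${rel.type}-${rel.targetId}`,
        source: entity.id,
        target: rel.targetId,
        type: rel.type,
      });
    }
  }
  return edges;
}
```

With entity A holding an outbound OWNED_BY to B and entity B holding the mirrored inbound record, this yields a single A→B edge instead of two.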

2. Layout computed on all entities, not visible subset (frontend bug)

GraphCanvas passed allEntities (the full unfiltered superset) to ELK, then hid filtered-out nodes with hidden: true. ELK computed positions for ALL nodes, so visible nodes had massive gaps where hidden ones reserved space. Fix: run ELK on only the visible entities array. This was the single biggest improvement.
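
The fix amounts to filtering before layout rather than hiding after it. The sketch below is illustrative only: `visibleSubgraph` and the `isVisible` predicate are hypothetical names, and the real filter logic lives in GraphCanvas. Edges are dropped when either endpoint is filtered out, so the layout input never references a missing node.

```typescript
interface NodeLike { id: string }
interface EdgeRef { source: string; target: string }

// Returns only what should be laid out: visible nodes and the edges between them.
function visibleSubgraph<N extends NodeLike, E extends EdgeRef>(
  allNodes: readonly N[],
  allEdges: readonly E[],
  isVisible: (node: N) => boolean,
): { nodes: N[]; edges: E[] } {
  const nodes = allNodes.filter(isVisible);
  const visibleIds = new Set(nodes.map((n) => n.id));
  const edges = allEdges.filter(
    (e) => visibleIds.has(e.source) && visibleIds.has(e.target),
  );
  return { nodes, edges };
}
```

The result is what gets passed to the layout step, so hidden nodes no longer reserve space in the computed positions.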

3. ORTHOGONAL edge routing inflates vertical spacing

ELK's default ORTHOGONAL routing reserves vertical channels between nodes for right-angle edge bends. Even with nodeNode: 5, nodes were spread hundreds of pixels apart. POLYLINE routing eliminates channel reservation — edges draw as straight line segments, and ReactFlow handles the actual rendering anyway.

Future: compound graph containers

The reference UX pattern (swimlane columns with headers like "51 ServiceNow Roles" and expandable "41 more...") requires ELK compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN") with ReactFlow's parentId grouping. This is the next major step for graph readability at scale (100+ nodes). See "Future: Compound graph containers" above.

Future: vertex splitting for hub nodes

For hub nodes with many incoming and many outgoing edges (e.g., an owner entity), academic research supports "vertex splitting": creating two visual copies of the node (a source copy for outgoing edges and a sink copy for incoming edges). This eliminates back-edges and keeps a clean left-to-right flow. The technique is well-studied (Henry et al. 2008 IEEE InfoVis, Ahmed et al. 2023), but no layout library implements it automatically. It requires a pre-processing transform before passing the graph to ELK.
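
One possible shape for that pre-processing transform is sketched below. It is a simplified illustration, not the method from the cited papers. The hub criterion (at least `minDegree` incoming and `minDegree` outgoing edges, with `minDegree` assumed to be 1 or more), the `__out`/`__in` id suffixes, and the `splitHubNodes` name are all assumptions.

```typescript
interface Edge { id: string; source: string; target: string }
interface SplitNode { id: string; originalId: string }

// Replaces each hub with a source copy (outgoing edges) and a sink copy (incoming edges).
function splitHubNodes(
  nodeIds: readonly string[],
  edges: readonly Edge[],
  minDegree: number,
): { nodes: SplitNode[]; edges: Edge[] } {
  const inDegree = new Map<string, number>();
  const outDegree = new Map<string, number>();
  for (const e of edges) {
    outDegree.set(e.source, (outDegree.get(e.source) ?? 0) + 1);
    inDegree.set(e.target, (inDegree.get(e.target) ?? 0) + 1);
  }
  const isHub = (id: string): boolean =>
    (inDegree.get(id) ?? 0) >= minDegree && (outDegree.get(id) ?? 0) >= minDegree;

  const nodes: SplitNode[] = [];
  for (const id of nodeIds) {
    if (isHub(id)) {
      nodes.push({ id: `${id}__out`, originalId: id }); // source copy
      nodes.push({ id: `${id}__in`, originalId: id }); // sink copy
    } else {
      nodes.push({ id, originalId: id });
    }
  }

  // Outgoing edges leave from the source copy; incoming edges arrive at the sink copy.
  const rewired = edges.map((e) => ({
    ...e,
    source: isHub(e.source) ? `${e.source}__out` : e.source,
    target: isHub(e.target) ? `${e.target}__in` : e.target,
  }));
  return { nodes, edges: rewired };
}
```

Keeping `originalId` on every node lets the rendering layer show both copies with the same label and map clicks back to the underlying entity.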

When to Reconsider

If bundle size becomes critical (e.g., public-facing SaaS with aggressive load time targets), consider lazy-loading elkjs
If layout quality for very large graphs (500+ nodes) is insufficient, investigate ELK's stress or force algorithms as alternatives to layered
If real-time collaborative editing is added, investigate incremental layout (ELK does not support this natively — would need delta-based re-layout)
