ADR-011: ELK.js Graph Layout Engine
Status
Accepted (2026-02-22)
Supersedes: Dagre layout decision documented in 00-overview.md (2026-01-26).
Context
The platform's Graph Explorer uses ReactFlow (@xyflow/react v12) for interactive graph visualization. The original layout engine was Dagre (Sugiyama-based DAG layout), chosen in Phase 1 for its simplicity and determinism.
The scaling problem
Dagre places all same-rank nodes into a single vertical column. This was acceptable at MVP scale (10-30 nodes) but becomes unusable as tenant graphs grow:
- 15 identities → a single column 900px+ tall
- 20 resources → stacked vertically, requiring excessive scrolling
- 50+ total nodes → the graph becomes a narrow vertical strip, far taller than wide, unreadable without zooming
The root cause is fundamental to Dagre's algorithm: it has no concept of distributing same-rank nodes across multiple sub-layers or rows. The nodesep parameter controls spacing but not arrangement.
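As a rough illustration of why a single column grows so quickly, the height of one rank is approximately n node heights plus n - 1 gaps. The node height and nodesep values below are assumed for the example and are not taken from the platform's configuration.

```typescript
// Height of a single-column rank: n node boxes plus (n - 1) gaps of `nodesep`.
function singleColumnHeight(count: number, nodeHeight: number, nodesep: number): number {
  if (count <= 0) return 0;
  return count * nodeHeight + (count - 1) * nodesep;
}

const NODE_HEIGHT = 44; // assumed node height in px
const NODESEP = 20;     // assumed Dagre nodesep in px

const identitiesHeight = singleColumnHeight(15, NODE_HEIGHT, NODESEP); // 15*44 + 14*20 = 940
const resourcesHeight = singleColumnHeight(20, NODE_HEIGHT, NODESEP);  // 20*44 + 19*20 = 1260
```

Raising nodesep only increases the second term, so larger spacing makes the column taller without changing its shape.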
What we tried
- Increasing nodesep/ranksep: Makes the problem worse (larger gaps, same single-column layout)
- Post-Dagre position redistribution: Nodes spread into grids per rank, but edges route terribly because Dagre's crossing minimization assumed single-column ranks
- Filtering to reduce node count: Helps but doesn't solve the core problem — security teams need the full graph for investigation
Why now
W1 (Exposure Discovery) introduces authority-path graphs with 8 workloads, 5 identities, 17 resources, and dozens of edges per tenant. Real customer environments will have 50-200+ entities. The layout must handle this scale.
Decision
Replace Dagre with ELK.js (elkjs npm package) as the sole graph layout engine across all graph components (GraphCanvas, MiniGraph).
Why ELK.js
ELK (Eclipse Layout Kernel) is a mature Java layout library ported to JavaScript. Its layered algorithm (org.eclipse.elk.layered) uses a 5-phase pipeline that directly addresses our limitations:
- Cycle Breaking: handles edge direction conflicts
- Layer Assignment: assigns nodes to ranks (like Dagre)
- Crossing Minimization: LAYER_SWEEP strategy, significantly better than Dagre
- Node Placement: NETWORK_SIMPLEX distributes same-layer nodes across sub-rows
- Edge Routing: orthogonal routing with crossing avoidance
Key capability: Partitioning
ELK's elk.partitioning.activate option maps directly to our executionLayer() concept. Each node is assigned a partitioning.partition value (0-6), and ELK guarantees:
- Nodes within the same partition are placed in the same rank column (or adjacent sub-columns)
- Partition ordering is preserved left-to-right
- Nodes within a partition are distributed across sub-rows to minimize height and edge crossings
This replaces our current workaround of invisible constraint edges between layer representatives.
Layout configuration
Both modes share a compact base with mode-specific overrides:
const COMPACT_BASE = {
  "elk.algorithm": "layered",
  "elk.direction": "RIGHT",
  "elk.edgeRouting": "POLYLINE", // No vertical channel reservation
  "elk.spacing.nodeNode": "20",
  "elk.spacing.edgeNode": "10",
  "elk.layered.spacing.nodeNodeBetweenLayers": "80",
  "elk.layered.spacing.edgeNodeBetweenLayers": "20",
  "elk.layered.spacing.edgeEdgeBetweenLayers": "10",
  "elk.layered.crossingMinimization.strategy": "LAYER_SWEEP",
  "elk.layered.nodePlacement.strategy": "NETWORK_SIMPLEX",
  "elk.layered.considerModelOrder.strategy": "NODES_AND_EDGES",
  "elk.layered.compaction.postCompaction.strategy": "EDGE_LENGTH",
  "elk.separateConnectedComponents": "false",
};

// Execution flow: adds partitioning for causal left-to-right ordering
const execFlowOptions = { ...COMPACT_BASE, "elk.partitioning.activate": "true" };

// Neighborhood: uses base config as-is (no partitioning)
const neighborhoodOptions = { ...COMPACT_BASE };
Critical spacing lessons learned (2026-02-22):
- POLYLINE edge routing is essential: the default ORTHOGONAL reserves vertical channels between nodes for edge bends, inflating vertical spread by 3-5x even with tight nodeNode spacing
- NETWORK_SIMPLEX node placement produces compact columns; BRANDES_KOEPF tries to align nodes with neighbors in adjacent layers, spreading them out
- separateConnectedComponents: false prevents extra gaps between disconnected subgraphs
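The following is a minimal sketch of how ReactFlow-style nodes and edges might be converted into the ELK JSON graph that these options apply to, and how the computed positions might be copied back. The interfaces, the function names buildElkGraph and applyPositions, and the fallback dimensions are assumptions made for illustration; the real code would use the types shipped with @xyflow/react and elkjs.

```typescript
// Minimal local shapes so the sketch is self-contained (hypothetical, simplified).
interface FlowNode { id: string; position: { x: number; y: number }; width?: number; height?: number }
interface FlowEdge { id: string; source: string; target: string }
interface ElkChild { id: string; width: number; height: number; x?: number; y?: number }
interface ElkEdge { id: string; sources: string[]; targets: string[] }
interface ElkGraph {
  id: string;
  layoutOptions: Record<string, string>;
  children: ElkChild[];
  edges: ElkEdge[];
}

const DEFAULT_WIDTH = 180; // assumed fallback node size
const DEFAULT_HEIGHT = 44;

// Build the root graph: ELK needs explicit node dimensions and edges expressed
// as `sources` / `targets` arrays.
function buildElkGraph(
  nodes: FlowNode[],
  edges: FlowEdge[],
  layoutOptions: Record<string, string>,
): ElkGraph {
  return {
    id: "root",
    layoutOptions,
    children: nodes.map((n) => ({
      id: n.id,
      width: n.width ?? DEFAULT_WIDTH,
      height: n.height ?? DEFAULT_HEIGHT,
    })),
    edges: edges.map((e) => ({ id: e.id, sources: [e.source], targets: [e.target] })),
  };
}

// Copy ELK's computed top-left coordinates back onto the original nodes.
// Nodes that ELK did not return keep their previous position.
function applyPositions(nodes: FlowNode[], laidOut: ElkGraph): FlowNode[] {
  const byId = new Map(laidOut.children.map((c) => [c.id, c] as const));
  return nodes.map((n) => {
    const c = byId.get(n.id);
    return c ? { ...n, position: { x: c.x ?? 0, y: c.y ?? 0 } } : n;
  });
}
```

With elkjs, the built graph would be passed to elk.layout(graph), which resolves to the same structure with x and y populated on each child.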
Per-node partition assignment (reuses existing executionLayer() logic):
| Partition | Entity Types | Position |
|---|---|---|
| 0 | Trigger resources, owners | Leftmost |
| 1 | Workloads, workload-subtype identities | |
| 2 | Connections, credentials, OAuth apps | |
| 3 | Service principals, managed identities | |
| 4 | Roles | |
| 5 | Permissions | |
| 6 | Non-trigger resources | Rightmost |
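A minimal sketch of how the table above could translate into per-node ELK options. The entity kind names and the lookup table are hypothetical stand-ins for the existing executionLayer() logic, whose actual signature is not shown in this ADR; only the elk.partitioning.partition option key comes from ELK itself.

```typescript
// Hypothetical entity kinds mirroring the table rows; the real code derives the
// layer from executionLayer() rather than a static lookup.
type EntityKind =
  | "trigger_resource" | "owner"
  | "workload" | "workload_identity"
  | "connection" | "credential" | "oauth_app"
  | "service_principal" | "managed_identity"
  | "role" | "permission" | "resource";

const PARTITION_BY_KIND: Record<EntityKind, number> = {
  trigger_resource: 0, owner: 0,
  workload: 1, workload_identity: 1,
  connection: 2, credential: 2, oauth_app: 2,
  service_principal: 3, managed_identity: 3,
  role: 4,
  permission: 5,
  resource: 6,
};

interface PartitionedElkNode {
  id: string;
  width: number;
  height: number;
  layoutOptions: Record<string, string>;
}

// ELK reads layout options as strings, so the partition index is stringified.
function toPartitionedNode(id: string, kind: EntityKind, width = 180, height = 44): PartitionedElkNode {
  return {
    id,
    width,
    height,
    layoutOptions: { "elk.partitioning.partition": String(PARTITION_BY_KIND[kind]) },
  };
}
```

The per-node option only takes effect when the root graph sets elk.partitioning.activate to "true", as execFlowOptions does.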
Async layout with Web Worker
Unlike Dagre (synchronous), ELK returns a Promise. The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes.
The layout is wrapped in a useElkLayout() hook that returns:
- nodes / edges: positioned ReactFlow elements
- isLayouting: boolean for loading state
Loading state handling: The sync→async transition introduces a brief empty canvas flash. To prevent this:
- Show a spinner overlay while isLayouting === true
- Keep the previous layout visible underneath during re-layout (don't clear nodes before new positions arrive)
- Only swap to new positions once ELK completes
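The loading-state rules above can be sketched as a framework-free "latest request wins" runner. This is an assumption about what a hook like useElkLayout() might wrap internally, not its actual implementation; createLatestOnlyRunner and its types are hypothetical names.

```typescript
type LayoutFn<I, O> = (input: I) => Promise<O>;

interface LatestOnlyRunner<I> {
  run(input: I): Promise<void>;
  isLayouting(): boolean;
}

// Each call takes a ticket. Only the result of the most recent ticket is
// delivered, so a slow, stale layout can never overwrite a newer one.
// The caller keeps rendering its previous result until onResult fires.
function createLatestOnlyRunner<I, O>(
  layout: LayoutFn<I, O>,
  onResult: (result: O) => void,
): LatestOnlyRunner<I> {
  let latestTicket = 0;
  let settledTicket = 0;
  return {
    async run(input: I): Promise<void> {
      const ticket = ++latestTicket;
      try {
        const result = await layout(input);
        if (ticket === latestTicket) onResult(result); // drop stale results
      } finally {
        // Only the latest request ends the loading state; errors propagate to the caller.
        if (ticket === latestTicket) settledTicket = ticket;
      }
    },
    isLayouting: () => latestTicket !== settledTicket,
  };
}
```

In a React hook, onResult would set the positioned nodes and edges in state, and isLayouting would drive the spinner overlay.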
What stays the same
- ReactFlow (@xyflow/react v12) remains the rendering layer
- Node rendering: EntityNode component, colors, finding badges, data domain tags (unchanged)
- Edge styling: Color-coded by relationship type, dashed/solid/dotted (unchanged)
- BFS path highlighting: Depth-limited neighbor highlighting (unchanged)
- Determinism: All inputs sorted lexicographically before layout (E3 pattern preserved). ELK's considerModelOrder.strategy: "NODES_AND_EDGES" ensures stable output for identical sorted input, which is critical for evidence-grade screenshots
- Execution flow edge reversal: EXEC_FLOW_REVERSE_EDGES set logic preserved
- Filter sidebar: Entity type, findings, relationship type, source system filters (unchanged)
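A minimal sketch of the lexicographic pre-sort that the determinism guarantee relies on. The sort keys (node id, then edge source/target/id) and the function name sortForLayout are assumptions; the ADR only states that inputs are sorted before layout.

```typescript
interface HasId { id: string }
interface EdgeLike extends HasId { source: string; target: string }

// Plain code-unit comparison: unlike localeCompare, it does not depend on the
// runtime's locale, so the order is identical on every machine.
const byCodeUnit = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

// Returns sorted copies; the inputs are not mutated.
function sortForLayout<N extends HasId, E extends EdgeLike>(
  nodes: readonly N[],
  edges: readonly E[],
): { nodes: N[]; edges: E[] } {
  return {
    nodes: [...nodes].sort((a, b) => byCodeUnit(a.id, b.id)),
    edges: [...edges].sort(
      (a, b) =>
        byCodeUnit(a.source, b.source) ||
        byCodeUnit(a.target, b.target) ||
        byCodeUnit(a.id, b.id),
    ),
  };
}
```

Because ELK is configured to respect model order, feeding it the same sorted arrays is what makes repeated layouts of the same graph match.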
Future: Compound graph containers (Phase 2)
ELK natively supports compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN"). After the base migration, visual group containers can be added per partition:
- Labeled containers ("Identities (12)", "Resources (20)")
- Expand/collapse via ReactFlow's parentId + hidden pattern
- Auto-collapse when total node count exceeds threshold
- ELK routes edges across group boundaries correctly
This is deferred to a follow-up implementation.
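As a sketch of what the deferred work might look like, the function below groups leaf nodes into one labeled container per partition, in the nested children shape that ELK's JSON format uses for compound graphs. The names groupByPartition, LeafNode, and ContainerNode and the container id scheme are hypothetical, and the root graph would still need elk.hierarchyHandling set to "INCLUDE_CHILDREN" as described above.

```typescript
interface LeafNode { id: string; width: number; height: number; partition: number }
interface ContainerNode {
  id: string;
  labels: { text: string }[];
  children: { id: string; width: number; height: number }[];
}

// Group leaves by partition and emit one container per group, ordered by
// partition index, with a "Name (count)" label.
function groupByPartition(nodes: LeafNode[], names: Record<number, string>): ContainerNode[] {
  const groups = new Map<number, LeafNode[]>();
  for (const node of nodes) {
    const members = groups.get(node.partition) ?? [];
    members.push(node);
    groups.set(node.partition, members);
  }
  return Array.from(groups.entries())
    .sort(([a], [b]) => a - b)
    .map(([partition, members]) => ({
      id: `group-${partition}`,
      labels: [{ text: `${names[partition] ?? `Partition ${partition}`} (${members.length})` }],
      children: members.map(({ id, width, height }) => ({ id, width, height })),
    }));
}
```

On the ReactFlow side, each child would reference its container through parentId so that collapsing the container can hide its members.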
Alternatives Considered
Keep Dagre with post-layout redistribution
Redistribute same-rank nodes into a grid after Dagre computes positions. Edges route poorly because Dagre's crossing minimization assumed single-column ranks. Band-aid, not a solution.
d3-force with rank constraints
Strong forceX pins nodes to rank columns, forceCollide prevents overlap, forceManyBody spreads same-rank nodes. Produces organic/physics-based layouts that are harder to read for security causal chain analysis. No edge routing. Non-deterministic without explicit seeding.
Swim lane layout (manual)
Partition canvas into horizontal bands by entity type. High implementation effort. No automated edge crossing minimization. Nodes can be dragged out of lanes. Grouping by entity type breaks causal ordering since types span multiple ranks.
Keep Dagre for small graphs, ELK for large
Maintaining two layout codepaths (two edge-mapping functions, two position transforms, two sets of edge cases) is not worth the marginal benefit. ELK handles small graphs equally well — <5ms for 10 nodes.
Consequences
Positive
- Graphs with 50-200+ nodes become usable — same-rank nodes distributed, not stacked
- Better edge crossing minimization (5-phase pipeline vs Dagre's simpler heuristic)
- Native partitioning replaces constraint-edge workaround — cleaner code
- Foundation for compound graph containers (Phase 2)
- Single layout engine for all graph contexts (GraphCanvas, MiniGraph)
- Web Worker from day one — UI thread never blocked regardless of graph size
Negative
- Bundle size increase: +1.4MB for elkjs (acceptable for internal platform)
- Async layout adds minor complexity (hook + cancellation pattern)
- dagre dependency removed: any Dagre-specific behavior is lost (none identified)
Migration scope
| Component | Change |
|---|---|
| ui/package.json | Remove @dagrejs/dagre, add elkjs |
| ui/src/components/graph/layout.ts | Replace Dagre layout with async ELK layout |
| ui/src/components/graph/useElkLayout.ts | New async hook with stale-request cancellation |
| ui/src/components/graph/constants.ts | ELK option config (shared COMPACT_BASE) |
| ui/src/components/graph/GraphCanvas.tsx | Consume async hook, layout visible entities only |
| ui/src/components/graph/MiniGraph.tsx | Consume async layout hook |
| ui/src/components/AuthorityPathDiagram.tsx | Async ELK via buildAuthorityPathLayout() |
| src/storage/mongo/adapters/subgraph-adapter.ts | Filter inbound-direction duplicate edges |
| Architecture docs | Update 00-overview.md, this ADR |
Performance budget
All sizes use Web Worker (elk-worker.min.js). UI thread is never blocked.
| Graph size | Expected layout time | UX |
|---|---|---|
| <30 nodes | <10ms | Instant (no visible spinner) |
| 30-100 nodes | 10-100ms | Instant to near-instant |
| 100-200 nodes | 100-300ms | Brief spinner overlay, previous layout visible |
| 200+ nodes | 300ms+ | Spinner overlay + loading indicator |
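The budget table can be expressed as a small lookup, for example to flag regressions in tests or telemetry. This is an illustrative sketch: the function names are hypothetical, and because the table's ranges touch at 100 and 200 nodes, the boundaries are treated as half-open here by assumption.

```typescript
interface LayoutBudget {
  maxMs: number | null; // null: no upper bound stated in the table
  showSpinner: boolean;
}

// Assumed boundary handling: [0, 30), [30, 100), [100, 200), [200, inf).
function budgetFor(nodeCount: number): LayoutBudget {
  if (nodeCount < 30) return { maxMs: 10, showSpinner: false };
  if (nodeCount < 100) return { maxMs: 100, showSpinner: false };
  if (nodeCount < 200) return { maxMs: 300, showSpinner: true };
  return { maxMs: null, showSpinner: true };
}

// True when a measured layout time is over the budget for that graph size.
function exceedsBudget(nodeCount: number, elapsedMs: number): boolean {
  const { maxMs } = budgetFor(nodeCount);
  return maxMs !== null && elapsedMs > maxMs;
}
```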
Lessons Learned (2026-02-22 tuning session)
Three problems caused the initial ELK layout to appear barely better than Dagre. Fixing each one was a significant win:
1. Duplicate edges from inbound relationships (backend bug)
Entities store both inbound and outbound relationships (by design, for bidirectional traversal). The SubgraphAdapter iterated all relationships without filtering out those marked direction: "inbound", so every relationship appeared as two visual edges (A→B OWNED_BY and B→A OWNED_BY). Fix: skip relationships where rel.properties?.direction === "inbound" in all four traversal loops (neighborhood forward/reverse, execution flow forward/reverse). This cut the edge count by roughly 50%.
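A minimal sketch of the direction filter. The Entity, Relationship, and VisualEdge shapes, the targetId field, and the edge id format are hypothetical simplifications of the stored documents; only the rel.properties?.direction === "inbound" check is taken from the fix described above.

```typescript
interface Relationship {
  type: string;
  targetId: string; // hypothetical field name for the other endpoint
  properties?: { direction?: "inbound" | "outbound" };
}
interface Entity { id: string; relationships: Relationship[] }
interface VisualEdge { id: string; source: string; target: string; type: string }

// Emit one visual edge per relationship, skipping the inbound mirror records
// that exist only to support reverse traversal.
function outboundEdges(entities: Entity[]): VisualEdge[] {
  const edges: VisualEdge[] = [];
  for (const entity of entities) {
    for (const rel of entity.relationships) {
      if (rel.properties?.direction === "inbound") continue;
      edges.push({
        id: `${entity.id}-${rel.type}-${rel.targetId}`,
        source: entity.id,
        target: rel.targetId,
        type: rel.type,
      });
    }
  }
  return edges;
}
```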
2. Layout computed on all entities, not visible subset (frontend bug)
GraphCanvas passed allEntities (the full unfiltered superset) to ELK, then hid filtered-out nodes with hidden: true. ELK computed positions for ALL nodes, so visible nodes had massive gaps where hidden ones reserved space. Fix: run ELK on only the visible entities array. This was the single biggest improvement.
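A minimal sketch of restricting the layout input to the visible subset. The function name visibleSubgraph and the predicate parameter are hypothetical; the point is that both the hidden nodes and any edge touching a hidden node are removed before the graph reaches ELK, since an edge whose endpoint is absent from the graph cannot be laid out.

```typescript
interface GraphNode { id: string }
interface GraphEdge { id: string; source: string; target: string }

// Keep only visible nodes, then keep only edges whose two endpoints survived.
function visibleSubgraph<N extends GraphNode, E extends GraphEdge>(
  nodes: N[],
  edges: E[],
  isVisible: (node: N) => boolean,
): { nodes: N[]; edges: E[] } {
  const visibleNodes = nodes.filter(isVisible);
  const visibleIds = new Set(visibleNodes.map((n) => n.id));
  const visibleEdges = edges.filter((e) => visibleIds.has(e.source) && visibleIds.has(e.target));
  return { nodes: visibleNodes, edges: visibleEdges };
}
```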
3. ORTHOGONAL edge routing inflates vertical spacing
ELK's default ORTHOGONAL routing reserves vertical channels between nodes for right-angle edge bends. Even with nodeNode: 5, nodes were spread hundreds of pixels apart. POLYLINE routing eliminates channel reservation — edges draw as straight line segments, and ReactFlow handles the actual rendering anyway.
Future: compound graph containers
The reference UX pattern (swimlane columns with headers like "51 ServiceNow Roles" and expandable "41 more...") requires ELK compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN") with ReactFlow's parentId grouping. This is the next major step for graph readability at scale (100+ nodes). See "Future: Compound graph containers" above.
Future: vertex splitting for hub nodes
For hub nodes with many incoming and many outgoing edges (e.g., an owner entity), academic research supports "vertex splitting": creating two visual copies, a source copy for outgoing edges and a sink copy for incoming edges. This eliminates back-edges and keeps a clean left-to-right flow. The technique is well-studied (Henry et al. 2008 IEEE InfoVis, Ahmed et al. 2023), but we found no layout library that implements it automatically; it requires a pre-processing transform before passing the graph to ELK.
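A minimal sketch of such a pre-processing transform, under stated assumptions: a node counts as a hub when both its in-degree and out-degree reach a caller-chosen threshold, and the copies are named with hypothetical __out and __in suffixes. It is an illustration of the idea, not the method from the cited papers.

```typescript
interface SplitEdge { id: string; source: string; target: string }

// Replace each hub with a source copy (carries outgoing edges) and a sink copy
// (carries incoming edges), and rewire the edges accordingly.
function splitHubs(
  nodeIds: string[],
  edges: SplitEdge[],
  minDegree: number,
): { nodeIds: string[]; edges: SplitEdge[] } {
  const inDeg = new Map<string, number>();
  const outDeg = new Map<string, number>();
  for (const e of edges) {
    outDeg.set(e.source, (outDeg.get(e.source) ?? 0) + 1);
    inDeg.set(e.target, (inDeg.get(e.target) ?? 0) + 1);
  }
  const isHub = (id: string): boolean =>
    (inDeg.get(id) ?? 0) >= minDegree && (outDeg.get(id) ?? 0) >= minDegree;
  const outCopy = (id: string): string => `${id}__out`;
  const inCopy = (id: string): string => `${id}__in`;

  const splitNodeIds: string[] = [];
  for (const id of nodeIds) {
    if (isHub(id)) splitNodeIds.push(outCopy(id), inCopy(id));
    else splitNodeIds.push(id);
  }
  const splitEdges = edges.map((e) => ({
    ...e,
    source: isHub(e.source) ? outCopy(e.source) : e.source,
    target: isHub(e.target) ? inCopy(e.target) : e.target,
  }));
  return { nodeIds: splitNodeIds, edges: splitEdges };
}
```

The renderer would then need to present both copies as the same entity, for example by sharing a label and selection state.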
When to Reconsider
- If bundle size becomes critical (e.g., public-facing SaaS with aggressive load time targets), consider lazy-loading elkjs
- If layout quality for very large graphs (500+ nodes) is insufficient, investigate ELK's stress or force algorithms as alternatives to layered
- If real-time collaborative editing is added, investigate incremental layout (ELK does not support this natively; it would need delta-based re-layout)
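If lazy-loading is ever needed, one possible shape is a memoized loader around a dynamic import, so the bundle is fetched only when the first graph is opened. The helper name lazyOnce is hypothetical, and the import path in the comment is an assumption about how elkjs would be loaded in this codebase.

```typescript
// Wrap an async loader so it runs at most once; concurrent callers share the
// same in-flight promise. A failed load clears the cache so it can be retried.
function lazyOnce<T>(loader: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = loader().catch((err: unknown) => {
        cached = undefined;
        throw err;
      });
    }
    return cached;
  };
}

// Hypothetical usage (not executed here):
// const loadElk = lazyOnce(() => import("elkjs/lib/elk.bundled.js"));
```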