ADR-011: ELK.js Graph Layout Engine

Status

Accepted (2026-02-22)

Supersedes: Dagre layout decision documented in 00-overview.md (2026-01-26).


Context

The platform's Graph Explorer uses ReactFlow (@xyflow/react v12) for interactive graph visualization. The original layout engine was Dagre (Sugiyama-based DAG layout), chosen in Phase 1 for its simplicity and determinism.

The scaling problem

Dagre places all same-rank nodes into a single vertical column. This was acceptable at MVP scale (10-30 nodes) but becomes unusable as tenant graphs grow:

  • 15 identities → a single column 900px+ tall
  • 20 resources → stacked vertically, requiring excessive scrolling
  • 50+ total nodes → the graph becomes a narrow vertical strip, far taller than it is wide, unreadable without zooming

The root cause is fundamental to Dagre's algorithm: it has no concept of distributing same-rank nodes across multiple sub-layers or rows. The nodesep parameter controls spacing but not arrangement.

What we tried

  • Increasing nodesep/ranksep: Makes the problem worse (larger gaps, same single-column layout)
  • Post-Dagre position redistribution: Nodes spread into grids per rank, but edges route terribly because Dagre's crossing minimization assumed single-column ranks
  • Filtering to reduce node count: Helps but doesn't solve the core problem — security teams need the full graph for investigation

Why now

W1 (Exposure Discovery) introduces authority-path graphs with 8 workloads, 5 identities, 17 resources, and dozens of edges per tenant. Real customer environments will have 50-200+ entities. The layout must handle this scale.


Decision

Replace Dagre with ELK.js (elkjs npm package) as the sole graph layout engine across all graph components (GraphCanvas, MiniGraph).

Why ELK.js

ELK (Eclipse Layout Kernel) is a mature Java layout library ported to JavaScript. Its layered algorithm (org.eclipse.elk.layered) uses a 5-phase pipeline that directly addresses our limitations:

  1. Cycle Breaking: handles edge direction conflicts
  2. Layer Assignment: assigns nodes to ranks (like Dagre)
  3. Crossing Minimization: LAYER_SWEEP strategy, significantly better than Dagre
  4. Node Placement: NETWORK_SIMPLEX distributes same-layer nodes across sub-rows
  5. Edge Routing: orthogonal routing with crossing avoidance

Key capability: Partitioning

ELK's elk.partitioning.activate option maps directly to our executionLayer() concept. Each node is assigned an elk.partitioning.partition value (0-6), and ELK guarantees:

  • Nodes within the same partition are placed in the same rank column (or adjacent sub-columns)
  • Partition ordering is preserved left-to-right
  • Nodes within a partition are distributed across sub-rows to minimize height and edge crossings

This replaces our current workaround of invisible constraint edges between layer representatives.
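
As a rough illustration of how partition values reach ELK, the sketch below builds an ELK JSON input graph in which every node carries its own elk.partitioning.partition option. The types are local stand-ins for the ELK JSON format rather than imports from elkjs, and buildPartitionedGraph, GraphEntity, GraphEdge, and the node dimensions are hypothetical names and values chosen for this example.

```typescript
// Local stand-ins for the ELK JSON graph format (not imported from elkjs).
interface ElkInputNode {
  id: string;
  width: number;
  height: number;
  layoutOptions?: Record<string, string>;
}
interface ElkInputEdge {
  id: string;
  sources: string[];
  targets: string[];
}
interface ElkInputGraph {
  id: string;
  layoutOptions: Record<string, string>;
  children: ElkInputNode[];
  edges: ElkInputEdge[];
}

// Hypothetical application-side shapes.
interface GraphEntity { id: string; type: string }
interface GraphEdge { id: string; source: string; target: string }

function buildPartitionedGraph(
  entities: GraphEntity[],
  edges: GraphEdge[],
  partitionOf: (entity: GraphEntity) => number,
  nodeSize = { width: 180, height: 48 }, // assumed node dimensions
): ElkInputGraph {
  return {
    id: "root",
    layoutOptions: {
      "elk.algorithm": "layered",
      "elk.direction": "RIGHT",
      "elk.partitioning.activate": "true",
    },
    // Each node declares its own partition; ELK option values are strings.
    children: entities.map((e) => ({
      id: e.id,
      ...nodeSize,
      layoutOptions: { "elk.partitioning.partition": String(partitionOf(e)) },
    })),
    edges: edges.map((e) => ({ id: e.id, sources: [e.source], targets: [e.target] })),
  };
}
```

The partition is set per node while elk.partitioning.activate is set once on the root, so no extra edges are needed to express layer order.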

Layout configuration

Both modes share a compact base with mode-specific overrides:

const COMPACT_BASE = {
  "elk.algorithm": "layered",
  "elk.direction": "RIGHT",
  "elk.edgeRouting": "POLYLINE", // No vertical channel reservation
  "elk.spacing.nodeNode": "20",
  "elk.spacing.edgeNode": "10",
  "elk.layered.spacing.nodeNodeBetweenLayers": "80",
  "elk.layered.spacing.edgeNodeBetweenLayers": "20",
  "elk.layered.spacing.edgeEdgeBetweenLayers": "10",
  "elk.layered.crossingMinimization.strategy": "LAYER_SWEEP",
  "elk.layered.nodePlacement.strategy": "NETWORK_SIMPLEX",
  "elk.layered.considerModelOrder.strategy": "NODES_AND_EDGES",
  "elk.layered.compaction.postCompaction.strategy": "EDGE_LENGTH",
  "elk.separateConnectedComponents": "false",
};

// Execution flow: adds partitioning for causal left-to-right ordering
const execFlowOptions = { ...COMPACT_BASE, "elk.partitioning.activate": "true" };

// Neighborhood: uses base config as-is (no partitioning)
const neighborhoodOptions = { ...COMPACT_BASE };

Critical spacing lessons learned (2026-02-22):

  • POLYLINE edge routing is essential — the default ORTHOGONAL reserves vertical channels between nodes for edge bends, inflating vertical spread by 3-5x even with tight nodeNode spacing
  • NETWORK_SIMPLEX node placement produces compact columns; BRANDES_KOEPF tries to align nodes with neighbors in adjacent layers, spreading them out
  • separateConnectedComponents: false prevents extra gaps between disconnected subgraphs

Per-node partition assignment (reuses existing executionLayer() logic):

| Partition | Entity Types | Position |
|---|---|---|
| 0 | Trigger resources, owners | Leftmost |
| 1 | Workloads, workload-subtype identities | |
| 2 | Connections, credentials, OAuth apps | |
| 3 | Service principals, managed identities | |
| 4 | Roles | |
| 5 | Permissions | |
| 6 | Non-trigger resources | Rightmost |
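
The partition assignment above can be sketched as a small lookup function. This is not the actual executionLayer() implementation: partitionFor, the EntityKind names, and the isTrigger and identitySubtype fields are assumptions made for illustration, and identities without a workload subtype are assumed to fall into partition 3.

```typescript
// Hypothetical entity kinds; the real type names may differ.
type EntityKind =
  | "resource" | "owner" | "workload" | "identity"
  | "connection" | "credential" | "oauth_app"
  | "role" | "permission";

interface LayerInput {
  kind: EntityKind;
  isTrigger?: boolean; // only meaningful for resources
  identitySubtype?: "workload" | "service_principal" | "managed_identity";
}

// Returns the partition (0 = leftmost, 6 = rightmost) for one entity.
function partitionFor(e: LayerInput): number {
  switch (e.kind) {
    case "resource":
      return e.isTrigger ? 0 : 6;
    case "owner":
      return 0;
    case "workload":
      return 1;
    case "identity":
      // Workload-subtype identities sit with workloads; others are assumed to be partition 3.
      return e.identitySubtype === "workload" ? 1 : 3;
    case "connection":
    case "credential":
    case "oauth_app":
      return 2;
    case "role":
      return 4;
    case "permission":
      return 5;
  }
}
```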

Async layout with Web Worker

Unlike Dagre (synchronous), ELK returns a Promise. The layout uses the Web Worker variant from day one (elkjs/lib/elk-worker.min.js) — since the API is async either way, using the worker costs no extra complexity and keeps the UI thread free for all graph sizes.

The layout is wrapped in a useElkLayout() hook that returns:

  • nodes / edges — positioned ReactFlow elements
  • isLayouting — boolean for loading state
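
One step inside such a hook is copying ELK's computed coordinates back onto the ReactFlow nodes. The sketch below shows that step in isolation, assuming ELK's x/y values are used directly as ReactFlow's top-left position; applyElkPositions and the two local interfaces are illustrative stand-ins, not types from elkjs or @xyflow/react.

```typescript
// Local stand-ins for an ELK result child and a ReactFlow node.
interface ElkLaidOutNode { id: string; x?: number; y?: number }
interface FlowNode<T = unknown> {
  id: string;
  position: { x: number; y: number };
  data: T;
}

function applyElkPositions<T>(
  nodes: FlowNode<T>[],
  laidOut: ElkLaidOutNode[],
): FlowNode<T>[] {
  const byId = new Map(laidOut.map((n) => [n.id, n] as const));
  return nodes.map((node) => {
    const placed = byId.get(node.id);
    // Nodes ELK did not place keep their previous position.
    if (!placed || placed.x === undefined || placed.y === undefined) return node;
    return { ...node, position: { x: placed.x, y: placed.y } };
  });
}
```

The function returns new node objects rather than mutating the input, which fits React's state update model.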

Loading state handling: The sync→async transition introduces a brief empty canvas flash. To prevent this:

  • Show a spinner overlay while isLayouting === true
  • Keep the previous layout visible underneath during re-layout (don't clear nodes before new positions arrive)
  • Only swap to new positions once ELK completes
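
The three rules above, combined with stale-request cancellation, can be modeled as a small reducer. This is a minimal sketch and not the actual useElkLayout() implementation; layoutReducer, LayoutState, and the requestId scheme are assumptions made for illustration.

```typescript
interface Positioned { id: string; x: number; y: number }

interface LayoutState {
  nodes: Positioned[];   // last completed layout, kept visible during re-layout
  isLayouting: boolean;  // drives the spinner overlay
  latestRequest: number; // id of the most recently started layout run
}

type LayoutAction =
  | { type: "start"; requestId: number }
  | { type: "complete"; requestId: number; nodes: Positioned[] };

const initialLayoutState: LayoutState = { nodes: [], isLayouting: false, latestRequest: 0 };

function layoutReducer(state: LayoutState, action: LayoutAction): LayoutState {
  switch (action.type) {
    case "start":
      // Do not clear nodes: the previous layout stays on screen under the spinner.
      return { ...state, isLayouting: true, latestRequest: action.requestId };
    case "complete":
      // A result from a superseded request is stale and is ignored.
      if (action.requestId !== state.latestRequest) return state;
      return { nodes: action.nodes, isLayouting: false, latestRequest: state.latestRequest };
  }
}
```

In a hook, "start" would be dispatched before calling ELK and "complete" when its Promise resolves. Because each run carries its own requestId, a slow earlier run cannot overwrite a newer result.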

What stays the same

  • ReactFlow (@xyflow/react v12) remains the rendering layer
  • Node rendering: EntityNode component, colors, finding badges, data domain tags — unchanged
  • Edge styling: Color-coded by relationship type, dashed/solid/dotted — unchanged
  • BFS path highlighting: Depth-limited neighbor highlighting — unchanged
  • Determinism: All inputs sorted lexicographically before layout (E3 pattern preserved). ELK's considerModelOrder.strategy: "NODES_AND_EDGES" ensures stable output for identical sorted input — critical for evidence-grade screenshots
  • Execution flow edge reversal: EXEC_FLOW_REVERSE_EDGES set logic preserved
  • Filter sidebar: Entity type, findings, relationship type, source system filters — unchanged
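
The determinism requirement above depends on feeding ELK the same input order every time. A minimal sketch of that pre-sort follows, assuming nodes are keyed by id and edges by source, then target, then id. sortForLayout and its input shapes are hypothetical, and plain code-unit comparison is used instead of localeCompare so the order does not depend on the runtime locale.

```typescript
interface SortableNode { id: string }
interface SortableEdge { id: string; source: string; target: string }

// Locale-independent string comparison.
const byCodeUnit = (a: string, b: string): number => (a < b ? -1 : a > b ? 1 : 0);

function sortForLayout<N extends SortableNode, E extends SortableEdge>(
  nodes: readonly N[],
  edges: readonly E[],
): { nodes: N[]; edges: E[] } {
  return {
    // Copy before sorting so the caller's arrays are left untouched.
    nodes: [...nodes].sort((a, b) => byCodeUnit(a.id, b.id)),
    edges: [...edges].sort(
      (a, b) =>
        byCodeUnit(a.source, b.source) ||
        byCodeUnit(a.target, b.target) ||
        byCodeUnit(a.id, b.id),
    ),
  };
}
```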

Future: Compound graph containers (Phase 2)

ELK natively supports compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN"). After the base migration, visual group containers can be added per partition:

  • Labeled containers ("Identities (12)", "Resources (20)")
  • Expand/collapse via ReactFlow's parentId + hidden pattern
  • Auto-collapse when total node count exceeds threshold
  • ELK routes edges across group boundaries correctly

This is deferred to a follow-up implementation.
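
For orientation only, the sketch below shows the data shape such containers might take in ELK's JSON format: one parent node per partition, labeled with its member count, under a root that sets elk.hierarchyHandling. It is not part of the current implementation. groupIntoContainers, the partition display names, and the node dimensions are hypothetical, and edges are omitted for brevity.

```typescript
// Local stand-ins for the ELK JSON format (not imported from elkjs).
interface LeafNode { id: string; width: number; height: number }
interface ContainerNode {
  id: string;
  labels: { text: string }[];
  children: LeafNode[];
}
interface CompoundGraph {
  id: string;
  layoutOptions: Record<string, string>;
  children: ContainerNode[];
}

function groupIntoContainers(
  nodes: { id: string; partition: number }[],
  partitionNames: Record<number, string>, // hypothetical display names
  nodeSize = { width: 180, height: 48 },  // assumed node dimensions
): CompoundGraph {
  // Bucket nodes by partition.
  const groups = new Map<number, LeafNode[]>();
  for (const n of nodes) {
    const bucket = groups.get(n.partition) ?? [];
    bucket.push({ id: n.id, ...nodeSize });
    groups.set(n.partition, bucket);
  }
  // One labeled container per partition, ordered by partition number.
  const children = Array.from(groups.entries())
    .sort(([a], [b]) => a - b)
    .map(([partition, members]) => ({
      id: `group-${partition}`,
      labels: [{ text: `${partitionNames[partition] ?? `Partition ${partition}`} (${members.length})` }],
      children: members,
    }));
  return {
    id: "root",
    layoutOptions: {
      "elk.algorithm": "layered",
      "elk.hierarchyHandling": "INCLUDE_CHILDREN",
    },
    children,
  };
}
```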


Alternatives Considered

Keep Dagre with post-layout redistribution

Redistribute same-rank nodes into a grid after Dagre computes positions. Edges route poorly because Dagre's crossing minimization assumed single-column ranks. Band-aid, not a solution.

d3-force with rank constraints

Strong forceX pins nodes to rank columns, forceCollide prevents overlap, forceManyBody spreads same-rank nodes. Produces organic/physics-based layouts that are harder to read for security causal chain analysis. No edge routing. Non-deterministic without explicit seeding.

Swim lane layout (manual)

Partition canvas into horizontal bands by entity type. High implementation effort. No automated edge crossing minimization. Nodes can be dragged out of lanes. Grouping by entity type breaks causal ordering since types span multiple ranks.

Keep Dagre for small graphs, ELK for large

Maintaining two layout codepaths (two edge-mapping functions, two position transforms, two sets of edge cases) is not worth the marginal benefit. ELK handles small graphs equally well — <5ms for 10 nodes.


Consequences

Positive

  • Graphs with 50-200+ nodes become usable — same-rank nodes distributed, not stacked
  • Better edge crossing minimization (5-phase pipeline vs Dagre's simpler heuristic)
  • Native partitioning replaces constraint-edge workaround — cleaner code
  • Foundation for compound graph containers (Phase 2)
  • Single layout engine for all graph contexts (GraphCanvas, MiniGraph)
  • Web Worker from day one — UI thread never blocked regardless of graph size

Negative

  • Bundle size increase: +1.4MB for elkjs (acceptable for internal platform)
  • Async layout adds minor complexity (hook + cancellation pattern)
  • dagre dependency removed — any Dagre-specific behavior is lost (none identified)

Migration scope

| Component | Change |
|---|---|
| ui/package.json | Remove @dagrejs/dagre, add elkjs |
| ui/src/components/graph/layout.ts | Replace Dagre layout with async ELK layout |
| ui/src/components/graph/useElkLayout.ts | New async hook with stale-request cancellation |
| ui/src/components/graph/constants.ts | ELK option config (shared COMPACT_BASE) |
| ui/src/components/graph/GraphCanvas.tsx | Consume async hook, layout visible entities only |
| ui/src/components/graph/MiniGraph.tsx | Consume async layout hook |
| ui/src/components/AuthorityPathDiagram.tsx | Async ELK via buildAuthorityPathLayout() |
| src/storage/mongo/adapters/subgraph-adapter.ts | Filter inbound-direction duplicate edges |
| Architecture docs | Update 00-overview.md, this ADR |

Performance budget

All sizes use Web Worker (elk-worker.min.js). UI thread is never blocked.

| Graph size | Expected layout time | UX |
|---|---|---|
| <30 nodes | <10ms | Instant (no visible spinner) |
| 30-100 nodes | 10-100ms | Instant to near-instant |
| 100-200 nodes | 100-300ms | Brief spinner overlay, previous layout visible |
| 200+ nodes | 300ms+ | Spinner overlay + loading indicator |

Lessons Learned (2026-02-22 tuning session)

Three issues made the initial ELK layout look barely better than Dagre. Fixing each one was a significant win:

1. Duplicate edges from inbound relationships (backend bug)

Entities store both inbound and outbound relationships (by design, for bidirectional traversal). The SubgraphAdapter iterated all relationships without skipping those marked direction: "inbound", so every relationship appeared as two visual edges (A→B OWNED_BY and B→A OWNED_BY). Fix: skip relationships where rel.properties?.direction === "inbound" in all four traversal loops (neighborhood forward/reverse, execution flow forward/reverse). This cut the edge count by roughly 50%.
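
A minimal sketch of the direction filter, reduced to a single traversal loop. The rel.properties?.direction check follows the fix described above. The StoredEntity and Relationship shapes, the targetId field, and the outboundEdges name are assumptions made for illustration and may not match the real adapter.

```typescript
// Hypothetical storage shapes; only properties.direction is taken from the text above.
interface Relationship {
  type: string;
  targetId: string;
  properties?: { direction?: "inbound" | "outbound" };
}
interface StoredEntity { id: string; relationships: Relationship[] }
interface VisualEdge { id: string; source: string; target: string; type: string }

function outboundEdges(entities: StoredEntity[]): VisualEdge[] {
  const edges: VisualEdge[] = [];
  for (const entity of entities) {
    for (const rel of entity.relationships) {
      // The inbound copy mirrors an outbound record stored on the other entity.
      if (rel.properties?.direction === "inbound") continue;
      edges.push({
        id: `${entity.id}-${rel.type}-${rel.targetId}`,
        source: entity.id,
        target: rel.targetId,
        type: rel.type,
      });
    }
  }
  return edges;
}
```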

2. Layout computed on all entities, not visible subset (frontend bug)

GraphCanvas passed allEntities (the full unfiltered superset) to ELK, then hid filtered-out nodes with hidden: true. ELK computed positions for ALL nodes, so visible nodes had massive gaps where hidden ones reserved space. Fix: run ELK on only the visible entities array. This was the single biggest improvement.
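
The fix can be sketched as a filter applied before layout, so ELK only ever sees what will be drawn. In this sketch visibleSubgraph, the isVisible predicate, and the entity and edge shapes are hypothetical. Edges are kept only when both endpoints remain visible, on the assumption that an edge to a node absent from the layout input is not wanted.

```typescript
interface CanvasEntity { id: string }
interface CanvasEdge { id: string; source: string; target: string }

function visibleSubgraph<N extends CanvasEntity, E extends CanvasEdge>(
  allEntities: readonly N[],
  allEdges: readonly E[],
  isVisible: (entity: N) => boolean,
): { entities: N[]; edges: E[] } {
  const entities = allEntities.filter(isVisible);
  const visibleIds = new Set(entities.map((e) => e.id));
  // Drop edges that would dangle once hidden nodes are removed.
  const edges = allEdges.filter(
    (e) => visibleIds.has(e.source) && visibleIds.has(e.target),
  );
  return { entities, edges };
}
```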

3. ORTHOGONAL edge routing inflates vertical spacing

ELK's default ORTHOGONAL routing reserves vertical channels between nodes for right-angle edge bends. Even with nodeNode: 5, nodes were spread hundreds of pixels apart. POLYLINE routing eliminates channel reservation — edges draw as straight line segments, and ReactFlow handles the actual rendering anyway.

Future: compound graph containers

The reference UX pattern (swimlane columns with headers like "51 ServiceNow Roles" and expandable "41 more...") requires ELK compound graphs (elk.hierarchyHandling: "INCLUDE_CHILDREN") with ReactFlow's parentId grouping. This is the next major step for graph readability at scale (100+ nodes). See "Future: Compound graph containers" above.

Future: vertex splitting for hub nodes

For hub nodes with many incoming and many outgoing edges (e.g., an owner entity), academic research supports "vertex splitting": creating two visual copies of the node, a source copy that carries the outgoing edges and a sink copy that carries the incoming ones. This eliminates back-edges and keeps a clean left-to-right flow. The technique is well-studied (Henry et al. 2008 IEEE InfoVis, Ahmed et al. 2023), but we found no layout library that implements it automatically, so it requires a pre-processing transform before the graph is passed to ELK.
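
A minimal sketch of such a pre-processing transform. It follows the source/sink description above rather than the exact method of either cited paper. The splitHubs name, the degree threshold, and the "#src" and "#sink" id suffixes are assumptions made for illustration.

```typescript
interface DirectedEdge { id: string; source: string; target: string }

function splitHubs(
  nodeIds: string[],
  edges: DirectedEdge[],
  minDegree = 3, // assumed threshold: split only if both in- and out-degree reach it
): { nodes: string[]; edges: DirectedEdge[] } {
  const inDeg = new Map<string, number>();
  const outDeg = new Map<string, number>();
  for (const e of edges) {
    outDeg.set(e.source, (outDeg.get(e.source) ?? 0) + 1);
    inDeg.set(e.target, (inDeg.get(e.target) ?? 0) + 1);
  }
  const hubs = new Set<string>();
  for (const id of nodeIds) {
    if ((inDeg.get(id) ?? 0) >= minDegree && (outDeg.get(id) ?? 0) >= minDegree) {
      hubs.add(id);
    }
  }
  // Replace each hub with a source copy and a sink copy.
  const nodes: string[] = [];
  for (const id of nodeIds) {
    if (hubs.has(id)) nodes.push(`${id}#src`, `${id}#sink`);
    else nodes.push(id);
  }
  // Outgoing edges leave the source copy; incoming edges enter the sink copy.
  const rewired = edges.map((e) => ({
    ...e,
    source: hubs.has(e.source) ? `${e.source}#src` : e.source,
    target: hubs.has(e.target) ? `${e.target}#sink` : e.target,
  }));
  return { nodes, edges: rewired };
}
```

Both copies would render as the same entity. How they are visually linked, for example with a shared highlight on hover, is left open here.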


When to Reconsider

  • If bundle size becomes critical (e.g., public-facing SaaS with aggressive load time targets), consider lazy-loading elkjs
  • If layout quality for very large graphs (500+ nodes) is insufficient, investigate ELK's stress or force algorithms as alternatives to layered
  • If real-time collaborative editing is added, investigate incremental layout (ELK does not support this natively — would need delta-based re-layout)
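
For the lazy-loading option in the first bullet, one possible shape is a memoized loader that triggers the import on first use and reuses the same Promise afterwards. lazyOnce is a hypothetical helper, and the commented usage line assumes elkjs is installed and the bundler supports dynamic import().

```typescript
// Wraps an async loader so it runs at most once; later calls reuse the same Promise.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) pending = load();
    return pending;
  };
}

// Possible usage (not executed here):
// const loadElk = lazyOnce(() => import("elkjs/lib/elk.bundled.js"));
// const { default: ELK } = await loadElk();
```

Note that a rejected load is cached as well, so a real implementation would probably reset the cached Promise on failure to allow a retry.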