Multi-Account AWS Connector Architecture

Spec-deviation note (revision-1): This design deviates from sv0-connectors#32's acceptance criterion that says "for each target account, assume role chain: bootstrap → org role → per-account role." Instead, this design never chains STS: every per-account assume happens directly from bootstrap creds, and the org role is used only for ListAccounts and the OU walk. The deviation is justified (STS chaining adds latency, complicates credential refresh, and provides no security benefit in our threat model), but the original #32 AC needs an amendment comment before this design is implemented. Filed as a coordination action in the umbrella's Coord-7 (revision-1).

TL;DR

Today's AWS connector scans exactly one account: the one its assumed role lives in (AWS_ROLE_ARN). MediaPro is a 3-account org and Lab 2 (nimbus-security / nimbus-workloads / nimbus-data) cannot be built without honoring --accounts. This proposal restructures the connector around a (account × service-category) cell as the unit of work, with unchained role assumption (bootstrap creds directly assume the optional org-discovery role and each per-account SecurityV0ReadOnly role), per-cell partial-failure isolation, and a StackSet-deployable spoke role. Account discovery is via AWS Organizations when available and an explicit list otherwise. The graph gets one aws_account node per discovered account, OU membership encoded as BELONGS_TO, and cross-account role assumption emitted as a single TRUSTS edge between role nodes (so Stream 3's stitcher can collapse workload-A → role-A → role-B → resource-B into one authority path).

Problem

The single-account connector has four concrete failures against the MediaPro-shaped customer:

  1. --accounts is parsed nowhere. sv0-connectors#32 flagged it; cli/main.py accepts no such flag. scan() calls aws_client.get_account_id() once, then iterates regions only. Workloads in mp-workloads are invisible to a connector that assumed into mp-security.
  2. Monolithic scan, monolithic failure. AWSConnector.scan() wraps the whole region loop in a single try/except that re-raises (cli/main.py:308-310). One Bedrock-region throttle aborts IAM, Lambda, S3, ECS, etc. for every account. Enterprise security teams will not approve a 4-hour read window for "all of AWS or nothing."
  3. No service-category scoping. Every scan extracts every category (IAM + Lambda + Bedrock + Secrets + S3 + DynamoDB + SNS + Step Functions + EventBridge + ECS + ECR + CloudTrail). There is no --services iam knob. A buyer who wants nightly IAM scans + weekly Bedrock cannot have it.
  4. Cross-account authority paths fail open. When a workload assumes a role in another account, the trust-policy parser sees the trusted account ID but no role node exists in that account because the connector never scanned it. The transformer creates a placeholder external_aws_account resource node (transformer.py:1175-1190) and the path dead-ends at an account chip instead of resolving into the actual destination role's permissions.

Current state

What ships today

File │ Behavior
─────────────────────────────────────────────────────────
integrations/aws/src/sv0_aws/cli/main.py │ CLI accepts --regions, --skip-cloudtrail, no --accounts, no --services. scan() is a single linear pass over one account.
integrations/aws/src/sv0_aws/adapters/aws_client.py │ One boto3.Session. _assume_role() assumes AWS_ROLE_ARN once, caches credentials with refresh-5-min-before-expiry. No multi-target STS. paginate_with_backoff correctly resumes mid-pagination on throttling and uses retry_mode="adaptive".
integrations/aws/src/sv0_aws/config.py │ AWS_ORGANIZATION_ROLE_ARN is read from env but never used by anything. CLOUDTRAIL_BUCKET_LAYOUT already supports organization (W1.3 phase 2): partial multi-account groundwork in CloudTrail only.
integrations/aws/src/sv0_aws/extractors/*.py │ Each extractor takes account_id as a parameter, but the entire CLI passes the same current_account_id to all of them. The extractor signatures already permit per-account scoping; the orchestrator does not exploit it.
integrations/aws/src/sv0_aws/core/transformer.py │ _transform_accounts() exists and emits aws_account nodes (lines 400-438). aws_account placeholder nodes for trusted external accounts also exist (lines 1170-1190). BELONGS_TO workload→account edges and BELONGS_TO account→OU edges are not emitted.
integrations/aws/cfn/securityv0-readonly-role.yaml │ Single CFN template, parameterized on ExternalId and SecurityV0AccountId. No StackSet wrapper, no aws:PrincipalOrgID condition, no per-region considerations beyond the IAM-is-global default. Permission set already covers all categories planned below, so it is re-used as-is.

Gap vs sv0-connectors#32

#32 acceptance criteria, status:

Criterion │ Status │ Notes
─────────────────────────────────────────────────────────
--accounts 111,222,333 scans three explicit accounts │ Not built │ CLI does not parse the flag
AWS_ORGANIZATION_ROLE_ARN triggers ListAccounts auto-discovery │ Not built │ env var is read, never consumed
Partial failures don't abort whole scan │ Not built │ single try/except in scan()
scanScope.sourceSystems includes per-account identifiers │ Partial │ scanScope.sourceSystems is hardcoded ["aws_iam","aws_lambda",...], never per-account
Cross-account workload→role trace is one authority path │ Not built │ placeholder external_aws_account dead-ends the path
CFN updated for per-account role deployment │ Not built │ StackSet wrapper missing

Other follow-ups

  • sv0-connectors#57 — CloudTrail org-trail layout already wired in config (CLOUDTRAIL_BUCKET_LAYOUT=organization) and CloudTrailExtractor accepts organization_id. The remaining gap is the rest of the connector following CloudTrail's lead.
  • sv0-platform#309 — research-only on cross-tenant rate-limiting. Tactical PR (adaptive retry, structured throttle logs) already in aws_client.py. Multi-account budget here MUST stay within those tactical mitigations and not require new platform-side coordination.
  • sv0-connectors#89 (P0-8) — multi-account is on the pre-client P0 list and explicitly listed as the alternative to "scope pilot to one account." MediaPro will not accept the latter.

Design proposal

Account discovery & role-chain auth

Two modes, one code path.

  1. Organizations mode — set AWS_ORGANIZATION_ROLE_ARN to a role in the management/delegated-admin account that has organizations:ListAccounts, organizations:DescribeOrganization, organizations:ListOrganizationalUnitsForParent, organizations:ListParents. The connector assumes this role first, paginates ListAccounts, builds the OU tree, then for each ACTIVE account assumes the per-account spoke role. CLI: sv0-aws scan --all --discover-org.
  2. Explicit list mode — the operator supplies --accounts 111111111111,222222222222,333333333333. No org role required; the connector only needs bootstrap creds plus a spoke role in each listed account. Compatible with ephemeral / pilot accounts that aren't part of an Org. CLI: sv0-aws scan --all --accounts 111,222,333.

Role chain.

bootstrap creds (env / instance / SSO)
├─ optional: AssumeRole AWS_ORGANIZATION_ROLE_ARN (Organizations API surface)
└─ for each target account:
     AssumeRole arn:aws:iam::<account_id>:role/SecurityV0ReadOnly
     (ExternalId required, sts:ExternalId on trust policy)

We deliberately do not chain: every per-account assume happens directly from the bootstrap session, never from the org-role session. STS chained sessions are capped at 1 hour regardless of MaxSessionDuration (per 2026-03-30-aws-integration-strategy.md §Phase 0), and the Organizations API call is a one-shot at the start of the run, so there is no upside to chaining.

Per-account role ARN convention. arn:aws:iam::<account_id>:role/SecurityV0ReadOnly is the default. Override is a single env var, AWS_SPOKE_ROLE_NAME=SecurityV0ReadOnly, applied uniformly. We deliberately do not support per-account ARN overrides in v1 — uniform naming is what makes StackSet deployment a one-step operation. Customers with naming-convention objections get an opt-out in v2.

Credential lifetime. AssumeRole returns 1-hour credentials. We cache per (account_id, region) in a dict[tuple[str, str], CachedCreds]. Refresh logic mirrors today's _are_credentials_valid() — refresh 5 minutes before expiry. A long IAM scan against 50 accounts may need to re-assume; we tolerate that.
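
The refresh rule above can be sketched as follows. This is a minimal illustration, not the connector's code: the fields of CachedCreds, the CredentialCache class, and the injected assume callable (a stand-in for the real STS call) are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable

REFRESH_MARGIN = timedelta(minutes=5)  # refresh 5 minutes before expiry


@dataclass(frozen=True)
class CachedCreds:
    # Hypothetical shape; real fields would come from the STS response.
    access_key_id: str
    expiration: datetime


class CredentialCache:
    """Caches credentials per (account_id, region) and re-assumes near expiry."""

    def __init__(
        self,
        assume: Callable[[str, str], CachedCreds],
        now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    ):
        self._assume = assume  # stand-in for the STS AssumeRole call
        self._now = now        # injectable clock so tests can control time
        self._cache: dict[tuple[str, str], CachedCreds] = {}

    def get(self, account_id: str, region: str) -> CachedCreds:
        key = (account_id, region)
        creds = self._cache.get(key)
        # Re-assume when nothing is cached or expiry is within the margin.
        if creds is None or creds.expiration - self._now() <= REFRESH_MARGIN:
            creds = self._assume(account_id, region)
            self._cache[key] = creds
        return creds
```

With 1-hour credentials, a lookup at minute 54 is served from the cache and a lookup at minute 56 triggers a re-assume.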

Account denies AssumeRole — log assume_role_denied account_id=X error_code=AccessDenied, mark the cell status: failed, reason: assume_role_denied, do not retry, and continue. The account is reported in scanScope.errors.permissionDenied. Critically: a denied account does not block discovery of other accounts.

Service-category scoping

The category set (initial v1):

Category │ Extractors / APIs │ Per-account TPS budget
─────────────────────────────────────────────────────────
iam │ iam:GetAccountAuthorizationDetails, iam:GetCredentialReport, iam:GetServiceLastAccessedDetails │ ~3 TPS, single global call (heavy)
lambda │ lambda:ListFunctions, GetFunction, GetPolicy, ListEventSourceMappings, per region │ ~5 TPS per region
bedrock │ bedrock:ListAgents/Get*, ListKnowledgeBases, ListFlows, ListGuardrails, GetModelInvocationLoggingConfiguration │ ~5 TPS per region
ecs_ecr │ ecs:ListClusters/DescribeServices/DescribeTaskDefinition, ecr:DescribeRepositories │ ~10 TPS per region
step_functions │ states:ListStateMachines/DescribeStateMachine │ ~5 TPS per region
eventbridge │ events:ListRules/ListTargetsByRule/ListConnections/ListDestinations │ ~5 TPS per region
s3 │ s3:ListAllMyBuckets/GetBucket* │ global list + per-bucket (1-2 TPS practical)
secrets │ secretsmanager:ListSecrets/DescribeSecret, ssm:DescribeParameters │ ~5 TPS per region
dynamodb_sns │ dynamodb:ListTables/DescribeTable, sns:ListTopics/GetTopicAttributes │ ~10 TPS per region
cloudtrail │ cloudtrail:DescribeTrails, S3 archive scan via CloudTrailExtractor │ already isolated per workload, ~30s/Lambda
access_analyzer │ access-analyzer:ListAnalyzers/ListFindings │ ~5 TPS per region
config │ config:DescribeConfigRules (delegated admin only) │ ~3 TPS per region

ecs+ecr and dynamodb+sns are paired because they always co-load in customer demos and split-billing the TPS budget for them adds no value.

Scope object (consumed from Stream 1's ScanScopeDoc). Per the locked Stream 1 ↔ Stream 2 contract: AWS-specific keys (account_ids[], regions[], optional discovery/exclude fields) live inside scope_keys; service_categories[] lives outside scope_keys, at the top of the ScanScope document, and is validated by the platform against ConnectorInstance.discovered_capabilities.service_categories_available. The on-disk Mongo document (Stream 1's ScanScopeDoc) splits the two:

{
  "scope_keys": {
    "account_ids": ["111111111111", "222222222222", "333333333333"],
    "regions": ["us-east-1", "eu-west-1"]
    // optional fields, all inside scope_keys:
    // "discover_org": true, // exclusive with explicit account_ids
    // "exclude_account_ids": ["999999999999"], // applied after discovery
    // "exclude_ous": ["ou-aaaa-suspended"]
  },
  "service_categories": ["iam", "lambda", "bedrock"]
}

The flat object the AWS connector executor sees after Stream 1 unwraps ScanScopeDoc for the CLI is { account_ids, regions, service_categories, ...optional discovery fields } — that flattened form is what the rest of this document refers to as "the scope object."
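
As a rough illustration of that unwrapping, assuming the document arrives as a plain dict shaped like the example above (the function name flatten_scope is hypothetical, and the real unwrapping is owned by Stream 1):

```python
from typing import Any


def flatten_scope(doc: dict[str, Any]) -> dict[str, Any]:
    """Flatten a ScanScopeDoc-shaped dict into the flat scope object."""
    scope_keys = dict(doc.get("scope_keys", {}))
    # discover_org is documented as exclusive with an explicit account list.
    if scope_keys.get("discover_org") and scope_keys.get("account_ids"):
        raise ValueError("discover_org is exclusive with explicit account_ids")
    flat = dict(scope_keys)
    # service_categories lives at the top level, outside scope_keys.
    flat["service_categories"] = list(doc.get("service_categories", []))
    return flat
```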

Unit-of-work cell: (account_id, service_category). Region is inside a cell, not a third dimension, because every category-extractor today already iterates regions internally (extract_lambda_by_region, extract_bedrock_entities_by_region, etc.). Adding region as a third axis would force extractor refactors with no scheduling benefit (regions for one category in one account share the same STS session and share the same per-service quota).

A scan of {accounts: 3, services: 4} therefore produces 12 cells. Each cell:

  • has its own STS-credentials handle (cached per account; re-used across cells in that account)
  • has its own try/except boundary
  • emits its own per-cell connector-report row
  • can succeed (available), partially succeed (partial), or fail (unavailable_no_access / unavailable_not_enabled)
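
The cell enumeration is a plain Cartesian product. A small sketch, where the Cell tuple and build_cells are illustrative names rather than existing connector code:

```python
from itertools import product
from typing import NamedTuple


class Cell(NamedTuple):
    account_id: str
    service_category: str


def build_cells(account_ids, service_categories):
    """One cell per (account, category); regions stay inside each cell."""
    return [Cell(a, c) for a, c in product(account_ids, service_categories)]
```

Three accounts and four categories yield the 12 cells mentioned above.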

Permission declarations. Each category declares its required IAM actions in code (extractors/<name>_extractor.py:REQUIRED_ACTIONS = frozenset({...})). At scan start, the connector unions REQUIRED_ACTIONS across service_categories and runs iam:SimulatePrincipalPolicy (or falls back to a smoke test per category) to short-circuit cells whose role lacks permissions. This bounds wasted AssumeRole churn.
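
A sketch of the union-and-short-circuit step. The action sets below are abbreviated and hypothetical, and allowed_actions stands in for whatever the pre-flight check (policy simulation or smoke test) reports; that input shape is an assumption.

```python
# Hypothetical declarations for illustration; the real sets would be defined
# in each extractor module as REQUIRED_ACTIONS.
REQUIRED_ACTIONS = {
    "iam": frozenset({"iam:GetAccountAuthorizationDetails", "iam:GetCredentialReport"}),
    "lambda": frozenset({"lambda:ListFunctions", "lambda:GetPolicy"}),
}


def required_actions(service_categories):
    """Union the declared actions across the requested categories."""
    return frozenset().union(*(REQUIRED_ACTIONS[c] for c in service_categories))


def runnable_categories(service_categories, allowed_actions):
    """Keep only categories whose declared actions are all allowed.

    `allowed_actions` is the assumed output of the pre-flight check.
    """
    return [c for c in service_categories if REQUIRED_ACTIONS[c] <= allowed_actions]
```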

Parallelism & rate-limit budgets

Per-tenant scheduler. A single ThreadPoolExecutor(max_workers=N) (default N=4) drives all cells. Threads, not async — every extractor is boto3-blocking and the threading overhead is dwarfed by network I/O. Workers are budgeted per account, not per category: never run >2 cells concurrently against the same account, because per-account IAM TPS is the tightest binding constraint. Per-region category cells against different accounts are independent and parallelize freely.

account 111     account 222     account 333
├─ iam          ├─ iam          ├─ iam
└─ lambda       └─ lambda       └─ lambda
   (≤2)            (≤2)            (≤2)

global concurrency cap: 6 (= 3 accounts × 2)
worker pool: min(global_cap, max_workers)
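
One way to enforce the per-account cap is a semaphore per account wrapped around each cell. The sketch below assumes cells are (account_id, service_category) tuples and that run_cell is a caller-supplied callable; both are illustrative. It also simplifies in two ways: a worker blocked on a semaphore still occupies a pool thread, and failure isolation is left to run_cell itself.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

PER_ACCOUNT_CAP = 2  # never more than 2 concurrent cells per account


def run_cells(cells, run_cell, max_workers=4):
    """Run cells on a thread pool with a per-account concurrency cap."""
    cells = list(cells)
    accounts = {account_id for account_id, _ in cells}
    gates = {a: threading.Semaphore(PER_ACCOUNT_CAP) for a in accounts}
    global_cap = len(accounts) * PER_ACCOUNT_CAP
    workers = max(1, min(global_cap, max_workers))

    def guarded(cell):
        account_id, category = cell
        with gates[account_id]:  # blocks while 2 cells of this account run
            return run_cell(account_id, category)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with cells.
        return dict(zip(cells, pool.map(guarded, cells)))
```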

Per-service rate-limit budgets. Reuse paginate_with_backoff (already cursor-resuming + adaptive retry, per aws_client.py:175-304). No new token bucket — sv0-platform#309 is parked research and we do not block multi-account on it. We DO add per-cell budget tracking: each cell records api_calls_made, throttle_events, wall_time_seconds in its connector-report row. This is the data Stream 4 needs to size cost expectations and #309 needs to inform its design.

Backpressure. Throttling of _assume_role calls on STS (~100 TPS per account) is plausible at scale. The bootstrap session therefore gets a dedicated STS client that uses the same adaptive retry config. If STS still fails after retries for a given target, the cell is marked failed, but the account is not removed from subsequent cells (because the per-(account,region) credential cache may still hold valid creds for sibling cells).

Partial-failure propagation.

cell outcome → connector-report status → ScanRun roll-up
─────────────────────────────────────────────────────────
available │ ok │ contributes to ok
partial │ partial │ ScanRun = partial
empty │ ok (recordCount=0) │ contributes to ok
failed │ unavailable_no_access │ ScanRun = partial (NOT failed unless ALL cells fail)

A ScanRun is failed only if every cell failed, because by definition we have nothing to ingest. Otherwise it is ok (all available) or partial (mixed). This is the contract Stream 1's ScanRun schema must accept; we flag it as an assumption.
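
The roll-up rule is small enough to state as code. A sketch, assuming cell outcomes are the plain strings from the left column of the table (roll_up is an illustrative name):

```python
def roll_up(cell_outcomes):
    """Roll per-cell outcomes up to a ScanRun status."""
    outcomes = list(cell_outcomes)
    if not outcomes:
        raise ValueError("a ScanRun needs at least one cell")
    if all(o == "failed" for o in outcomes):
        return "failed"   # nothing to ingest
    if any(o in ("failed", "partial") for o in outcomes):
        return "partial"  # mixed results
    return "ok"           # "available" and "empty" both count as ok
```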

Node/edge emission for cross-account

This is the contract Stream 3 will consume. Schemas are in NormalizedGraph form (camelCase, matches sv0-platform/src/ingestion/types.ts).

aws_account resource node (one per discovered account)

{
  "nodeId": "aws_account:<account_id>", // already emitted today
  "nodeType": "resource",
  "sourceSystem": "aws_organizations:<account_id>",
  "sourceId": "<account_id>",
  "displayName": "<account_name or 'AWS Account <id>'>",
  "status": "active", // or "suspended" from DescribeAccount
  "properties": {
    "accountId": "<account_id>",
    "accountName": "<account_name>",
    "ouId": "ou-aaaa-bbbbbbbb", // already emitted
    "ouPath": "Root/Production/Workloads", // already emitted
    "isManagementAccount": false, // already emitted
    "organizationId": "o-abcdef0123", // already emitted
    "subtype": "aws_account", // already emitted
    // NEW additions for Stream 3 stitching:
    "accountPurpose": "workloads", // hint from name pattern: security|workloads|data|sandbox|management|unknown
    "discoveredVia": "organizations" | "explicit_list",
    "joinedMethod": "INVITED" | "CREATED",
    "joinedTimestamp": "2026-01-15T..."
  }
}

aws_ou resource node (NEW — currently missing)

{
  "nodeId": "aws_ou:<ou_id>",
  "nodeType": "resource",
  "sourceSystem": "aws_organizations:<organization_id>",
  "sourceId": "<ou_id>",
  "displayName": "<ou_name>",
  "status": "active",
  "properties": {
    "ouId": "ou-aaaa-bbbbbbbb",
    "ouName": "Production",
    "ouPath": "Root/Production",
    "parentId": "r-abcd" | "ou-...",
    "organizationId": "o-abcdef0123",
    "subtype": "aws_ou"
  }
}

BELONGS_TO edges (NEW — currently missing)

Two patterns:

// workload / identity / resource → its owning account
{
  "edgeId": "belongs_to:<source_node_id>:<account_id>",
  "edgeType": "BELONGS_TO",
  "sourceNodeId": "aws_lambda:222222222222:us-east-1:claims-reconcile",
  "targetNodeId": "aws_account:222222222222",
  "properties": { "boundary": "account" }
}

// account → OU
{
  "edgeId": "belongs_to:<account_id>:<ou_id>",
  "edgeType": "BELONGS_TO",
  "sourceNodeId": "aws_account:222222222222",
  "targetNodeId": "aws_ou:ou-aaaa-bbbbbbbb",
  "properties": { "boundary": "ou" }
}

Cross-account TRUSTS / ASSUMES_ROLE edges

Today TrustPolicyParser already extracts aws_accounts from a role's AssumeRolePolicyDocument. The current transformer (lines 1170-1190) creates a placeholder external_aws_account and a generic edge — this hides the real path. New emission:

// Role A in account-A trusts role B in account-B (when both roles exist in scan)
{
  "edgeId": "trusts:aws_iam_role:222222222222:cross-account-data-reader:aws_iam_role:333333333333:data-orchestrator",
  "edgeType": "TRUSTS",
  "sourceNodeId": "aws_iam_role:222222222222:cross-account-data-reader", // the trusting role (target of AssumeRole)
  "targetNodeId": "aws_iam_role:333333333333:data-orchestrator", // the trusted principal (caller)
  "properties": {
    "boundary": "cross_account",
    "trustingAccountId": "222222222222",
    "trustedAccountId": "333333333333",
    "externalId": true, // condition `sts:ExternalId` present
    "principalOrgIdCondition": "o-abc", // if `aws:PrincipalOrgID` condition present
    "trustPolicyHash": "<sha256 of AssumeRolePolicyDocument>"
  }
}

If only the trusting role is in-scope (the trusted account is not scanned), we still emit the edge but the target is the existing external_aws_account placeholder. Stream 3's stitcher uses trustedAccountId to back-fill the edge if a later sibling-connector scan brings the other account into the graph.
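
The real-node-or-placeholder decision can be sketched like this. The function name, the known_node_ids lookup set, and the placeholder node ID format (external_aws_account:<account_id>) are assumptions; the source only names the placeholder type.

```python
def trust_edge(trusting_role_id, trusted_account_id, trusted_role_name, known_node_ids):
    """Build a TRUSTS edge, using the real role node when it is in the scan."""
    role_node = f"aws_iam_role:{trusted_account_id}:{trusted_role_name}"
    if role_node in known_node_ids:
        target = role_node
    else:
        # Assumed placeholder ID format for an account that was not scanned.
        target = f"external_aws_account:{trusted_account_id}"
    # Node IDs look like aws_iam_role:<account_id>:<role_name>.
    trusting_account_id = trusting_role_id.split(":")[1]
    return {
        "edgeId": f"trusts:{trusting_role_id}:{target}",
        "edgeType": "TRUSTS",
        "sourceNodeId": trusting_role_id,
        "targetNodeId": target,
        "properties": {
            "boundary": "cross_account",
            "trustingAccountId": trusting_account_id,
            "trustedAccountId": trusted_account_id,
        },
    }
```

Keeping trustedAccountId on the edge in both branches is what lets a later scan back-fill the placeholder target.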

Federation edges (AWS role trusts external IdP)

This is what makes Entra-SP-via-OIDC visible to Stream 3. Trust-policy parser already emits oidc_providers and saml_providers. New edge:

// AWS role trusts an OIDC provider (GitHub Actions shown; Entra follows the same shape)
{
  "edgeId": "trusts_federation:aws_iam_role:222222222222:gh-deploy:oidc:token.actions.githubusercontent.com",
  "edgeType": "TRUSTS",
  "sourceNodeId": "aws_iam_role:222222222222:gh-deploy",
  "targetNodeId": "external_oidc_provider:token.actions.githubusercontent.com",
  "properties": {
    "boundary": "cross_system",
    "providerType": "oidc",
    "providerUrl": "https://token.actions.githubusercontent.com",
    "audience": "sts.amazonaws.com",
    "subjectClaim": "repo:nimbus/inframgmt:ref:refs/heads/main", // condition StringEquals on token.actions.githubusercontent.com:sub
    "trustPolicyHash": "<sha256>"
  }
}

For Entra-SP federation specifically, the providerUrl will be https://sts.windows.net/<tenant_id>/ or https://login.microsoftonline.com/<tenant_id>/v2.0. Stream 3 correlates external_oidc_provider:sts.windows.net/<tenant_id> with the matching entra_service_principal:<tenant_id>:<sp_id> node coming from the Entra connector — the <tenant_id> substring is the join key.
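
Extracting the join key from providerUrl might look like the sketch below. It assumes the tenant ID is a GUID in the first path segment of one of the two hosts named above; entra_tenant_id is an illustrative helper, not an existing function.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Entra tenant IDs are GUIDs: 36 characters of hex digits and hyphens.
_TENANT_RE = re.compile(r"^/([0-9a-fA-F-]{36})(?:/|$)")
_ENTRA_HOSTS = {"sts.windows.net", "login.microsoftonline.com"}


def entra_tenant_id(provider_url: str) -> Optional[str]:
    """Return the tenant ID from an Entra issuer URL, or None if not Entra."""
    parsed = urlparse(provider_url)
    if parsed.hostname not in _ENTRA_HOSTS:
        return None
    match = _TENANT_RE.match(parsed.path)
    return match.group(1).lower() if match else None
```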

Source-record fingerprints

Every node and edge that originates from an AWS API response carries a properties.sourceFingerprint deterministic hash:

sourceFingerprint = sha256(
    source_system_id      // "aws_iam:222222222222"
    + ":"
    + source_record_id    // role ARN
    + ":"
    + source_field_path   // "AssumeRolePolicyDocument"
)

This is what Stream 3 uses as the stable join key when stitching across connectors and across re-scans. Every cross-account TRUSTS edge above carries trustPolicyHash (sha256 of the document body) so the stitcher can detect "same trust, scanned from both sides" without a fragile ARN+ARN exact match.
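
Both hashes can be sketched with the standard library. The fingerprint follows the formula above directly; the canonical-JSON step for trustPolicyHash (sorted keys, compact separators) is an assumption, since the source only says "sha256 of the document body".

```python
import hashlib
import json


def source_fingerprint(source_system_id, source_record_id, source_field_path):
    """sha256 over the three identifiers joined with ':'."""
    joined = ":".join((source_system_id, source_record_id, source_field_path))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def trust_policy_hash(policy_document):
    """sha256 of the trust policy, canonicalized so key order does not matter.

    Canonicalization is an assumed choice, not something the source specifies.
    """
    canonical = json.dumps(policy_document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Canonicalizing before hashing matters if the same trust policy is read through two code paths that serialize keys in a different order.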

CloudFormation / Terraform deployment

StackSet template (NEW: cfn/securityv0-readonly-role-stackset.yaml)

A thin wrapper around the existing securityv0-readonly-role.yaml:

  • Permission model: SERVICE_MANAGED — no per-account admin role required because the management account / delegated admin handles deployment. Customers without delegated admin fall back to SELF_MANAGED with AWSCloudFormationStackSetAdministrationRole / AWSCloudFormationStackSetExecutionRole (the standard pre-StackSet bootstrap pair).
  • Deployment targets: OrganizationalUnitIds (configurable). The MediaPro pilot deploys to a single OU containing the 3 pilot accounts.
  • Auto-deployment: Enabled: true so newly-added accounts in the OU automatically get the role. This is what makes ephemeral-account workflows (Stream 4) tractable — terraform apply adds an account, the StackSet fires, the role exists by the time terraform output returns the account ID.
  • Capabilities: CAPABILITY_NAMED_IAM (we name the role SecurityV0ReadOnly).
  • Stack instances per region: IAM roles are global, so the stack only needs to deploy in one region per account (typically us-east-1). The existing CFN already creates a global IAM role + managed policy.

The trust policy in the existing securityv0-readonly-role.yaml MUST add an aws:PrincipalOrgID condition alongside the existing sts:ExternalId for the org-mode deployment. This prevents a leaked ExternalId from being usable outside the org. Single-account / sandbox deployments without an Org ID fall back to ExternalId-only.

# delta to AssumeRolePolicyDocument:
Condition:
  StringEquals:
    'sts:ExternalId': !Ref ExternalId
    'aws:PrincipalOrgID': !Ref OrganizationId # NEW, optional via !If

Terraform module (NEW: sv0-demo-labs/shared/securityv0-spoke-role/)

For Lab 2 / Stream 4. Provides the same role as a Terraform module that can be for_each-ed across a terraform-aws-modules/account instantiation. Stream 4's scaffold will look like:

module "spoke_role" {
  for_each        = toset(local.pilot_account_ids)
  source          = "../../shared/securityv0-spoke-role"
  account_id      = each.key
  external_id     = var.sv0_external_id
  organization_id = data.aws_organizations_organization.this.id
  providers       = { aws = aws.member[each.key] }
}

This composes with the IaC up/scan/teardown lifecycle: terraform apply brings up accounts and roles, the connector scan runs, terraform destroy tears everything down. No manual click step.

Permission boundary

The existing SecurityV0ReadOnlyPolicy is already a "one explicit-allow per service" managed policy. We add a permission boundary on the role itself (SecurityV0ReadOnlyBoundary) that explicitly denies every write verb — Deny *:Put*, Deny *:Create*, Deny *:Delete*, Deny *:Update*, plus an explicit Deny secretsmanager:GetSecretValue and Deny ssm:GetParameter (read-but-leaks-secret guards already in the policy as exclusions, hardened here as permission-boundary denies). This is the artifact security teams want to see in their CFN review packet.

CLI / API surface

Honored end-to-end

# Existing single-account behavior (unchanged default)
sv0-aws scan --all

# Explicit account list
sv0-aws scan --all --accounts 111111111111,222222222222,333333333333

# Org auto-discovery
AWS_ORGANIZATION_ROLE_ARN=arn:aws:iam::management:role/SV0OrgDiscovery \
sv0-aws scan --all --discover-org

# Service-category subset (one or many)
sv0-aws scan --all --accounts 111,222,333 --services iam
sv0-aws scan --all --accounts 111,222,333 --services iam,bedrock

# Combined region + service + account scoping
sv0-aws scan --all --accounts 111,222,333 --regions us-east-1 --services iam,lambda

Integration with Stream 1's ConnectorInstance / ScanScope

In production the connector is invoked by Stream 1's worker, not from a human terminal. The CLI gains a --scope-json <file> flag that takes a JSON-serialized ScanScope (the AWS-specific extension above) and is mutually exclusive with --accounts / --regions / --services. The worker writes the scope to a tempfile and execs the connector. This is a forward-compatible bridge — when Stream 1 ships a Python entry-point that takes the scope object directly, the CLI wrapper falls away.
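
A sketch of the flag validation with argparse. The parser layout is hypothetical (the real CLI has more flags and subcommands). A manual check is used because argparse's add_mutually_exclusive_group would also make --accounts, --regions, and --services exclusive with each other, which is not the intent.

```python
import argparse


def build_parser():
    # Hypothetical, reduced parser; only the scoping flags are shown.
    parser = argparse.ArgumentParser(prog="sv0-aws scan")
    parser.add_argument("--scope-json", metavar="FILE")
    parser.add_argument("--accounts")
    parser.add_argument("--regions")
    parser.add_argument("--services")
    return parser


def parse_scan_args(argv):
    """Parse args and reject --scope-json combined with per-flag scoping."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.scope_json and (args.accounts or args.regions or args.services):
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("--scope-json is mutually exclusive with "
                     "--accounts/--regions/--services")
    return args
```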

If Stream 1's exact ConnectorInstance schema is not yet published, we treat the scope object above as the working contract. The umbrella plan should reconcile field names if Stream 1 picks different ones.

Connector-report shape per (account × category) cell

The NormalizedGraph already has evidenceCompleteness.sources and scanScope. Multi-account requires sub-keying by (account_id, category):

{
  "evidenceCompleteness": {
    "sources": {
      "aws_iam:111111111111": { "status": "available", "recordCount": 142, "apiCallCount": 7 },
      "aws_iam:222222222222": { "status": "unavailable_no_access", "recordCount": 0, "errorCode": "AssumeRoleAccessDenied" },
      "aws_iam:333333333333": { "status": "available", "recordCount": 89, "apiCallCount": 5 },
      "aws_lambda:111111111111": { "status": "available", "recordCount": 23 },
      "aws_lambda:222222222222": { "status": "partial", "recordCount": 17, "errorCode": "ThrottlingMaxRetries" },
      "aws_lambda:333333333333": { "status": "available", "recordCount": 8 }
      // ...
    }
  },
  "scanScope": {
    "mode": "targeted",
    "sourceSystems": [
      "aws_iam:111111111111", "aws_iam:222222222222", "aws_iam:333333333333",
      "aws_lambda:111111111111", "aws_lambda:222222222222", "aws_lambda:333333333333"
    ],
    "errors": {
      "errorsEncountered": 1,
      "permissionDenied": ["222222222222"]
    }
  }
}

The aws_iam:222222222222 cell failing does NOT remove aws_iam entities from the graph for accounts 111 and 333 — scanScope.sourceSystems lists every successful (category, account) pair. The platform's diff engine treats each cell independently for delete-eligibility.

Migration / backward compat

Existing single-account scans. A scan with no --accounts and no --discover-org falls through to a single-account run (one cell per category) against sts:GetCallerIdentity().Account. The connector-report shape changes: what was evidenceCompleteness.sources["aws_iam"] becomes evidenceCompleteness.sources["aws_iam:<account_id>"]. This is a one-time platform-side migration: the diff engine and UI already key on full source-system strings, so the colon suffix is transparent. A six-week "double-key" emit (both aws_iam and aws_iam:<id>) cushions the cutover.
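
For the single-account case, the double-key emit could be as simple as the sketch below. rekey_sources is a hypothetical helper, and the double_key argument stands in for a cutover feature flag.

```python
def rekey_sources(legacy_sources, account_id, double_key=False):
    """Re-key {"aws_iam": row} to {"aws_iam:<account_id>": row}.

    With double_key=True the legacy key is emitted as well, so consumers that
    still read the old shape keep working during the cutover.
    """
    out = {}
    for source, row in legacy_sources.items():
        out[f"{source}:{account_id}"] = row
        if double_key:
            out[source] = row
    return out
```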

Tenants without AWS Organizations. Org mode requires AWS_ORGANIZATION_ROLE_ARN. Tenants without an Org use explicit-list mode. There is no auto-fallback — being silently dropped from "scanning my whole org" to "scanning one account" is a worse failure mode than a clear "no org role configured, supply --accounts" error.

Compatibility with sv0-demo-lab-1. Lab 1 is a single AWS account scanned by a SecurityV0ReadOnly role within that same account. After this change, Lab 1 still works exactly as today: no --accounts, no --discover-org, the connector resolves to a one-cell-per-category single-account scan. The only visible change is the cell-keyed evidenceCompleteness shape, which Lab 1's tests must update to match.

Implementation plan

TDD-style. Group under phases. Repos: 🔵 = sv0-connectors, 🟢 = sv0-demo-labs/shared, 🟡 = sv0-platform (one tiny shape allowance only).

Phase 1: Org discovery + role-chain auth

  • 🔵 T1.1 — Write failing test: AWSClientAdapter.assume_role_into(account_id) returns a per-account-cached session; second call within 55min returns same creds. Then implement.
  • 🔵 T1.2 — Write failing test: OrganizationsDiscovery.list_active_accounts() returns an iterator of AWSAccount with ou_path populated from ListParents + DescribeOrganizationalUnit. Mock boto3 with moto or botocore.stub.Stubber. Then implement under sv0_aws/discovery/organizations.py.
  • 🔵 T1.3 — Write failing test: assume_role_into("999999999999") on AccessDenied returns a CellOutcome.failed(reason="assume_role_denied") instead of raising. Then implement (refactor _assume_role to accept target ARN parameter).
  • 🔵 T1.4 — Add --accounts and --discover-org flags to cli/main.py; mutually exclusive validation. Test argparse exit behavior.

Phase 2: Service-category scoping in CLI

  • 🔵 T2.1 — Define ServiceCategory enum + CATEGORY_TO_EXTRACTORS mapping. Test that the union of all category extractor sets equals the current monolithic extractor set (no regression).
  • 🔵 T2.2 — Add --services flag (comma-separated, validates against enum). Default = "all". Test that --services iam builds an executor that runs only IAM extractors.
  • 🔵 T2.3 — Add --scope-json flag that is mutually exclusive with --accounts/--regions/--services. Test round-trip: write scope file → exec connector → resulting scanScope.sourceSystems matches input.

Phase 3: Parallel per-(account × category) execution

  • 🔵 T3.1 — Write failing test: CellExecutor runs 12 cells (3 acc × 4 cat), one cell raising mid-extract; the 11 surviving cells produce data and the 1 failing cell produces a failed connector-report row. No exception escapes. Then implement under sv0_aws/orchestrator/cell_executor.py.
  • 🔵 T3.2 — Write failing test: scheduler runs at most 2 cells concurrently against the same account. Use a barrier or counter to assert. Then implement.
  • 🔵 T3.3 — Write failing test: per-cell api_calls_made and wall_time_seconds are populated in the connector-report row. Then implement (instrument extractors via a context manager).
  • 🔵 T3.4 — Reshape evidenceCompleteness.sources keys from aws_iam to aws_iam:<account_id>. Add a feature flag EVIDENCE_DOUBLE_KEY=true that emits both during the cutover. Update connector tests.

Phase 4: Cross-account node/edge emission

  • 🔵 T4.1 — Write failing test: a graph with workloads in 3 accounts emits exactly 3 aws_account nodes and 1 BELONGS_TO edge per workload. Then implement BELONGS_TO workload→account in transformer.
  • 🔵 T4.2 — Write failing test: when discover_org=true, the graph emits one aws_ou node per discovered OU and BELONGS_TO account→OU edges. Then implement.
  • 🔵 T4.3 — Write failing test: a role in account A whose trust policy lists account B (and account B IS in scope) emits a TRUSTS edge from aws_iam_role:A:roleA to aws_iam_role:B:roleB with boundary: cross_account and a trustPolicyHash. Then implement (refactor _create_trust_edges to look up real role nodes when present, fall back to external_aws_account when not).
  • 🔵 T4.4 — Write failing test: an OIDC-trusted role (Entra sts.windows.net/<tenant> issuer) emits a TRUSTS edge to an external_oidc_provider node with the tenant ID extractable from providerUrl. Then implement.
  • 🔵 T4.5 — Write failing test: every cross-account / federation edge carries a sourceFingerprint and trustPolicyHash. Then implement.

Phase 5: StackSet template + Terraform module

  • 🔵 T5.1 — Add aws:PrincipalOrgID condition support to cfn/securityv0-readonly-role.yaml behind a conditional parameter. Add cfn-lint to the connector CI.
  • 🔵 T5.2 — Add cfn/securityv0-readonly-role-stackset.yaml (SERVICE_MANAGED + auto-deployment, OUId-parameterized). Validate with aws cloudformation validate-template.
  • 🟢 T5.3 — Add Terraform module sv0-demo-labs/shared/securityv0-spoke-role/{main.tf,variables.tf,outputs.tf} that mirrors the CFN. terraform validate passes.
  • 🔵 T5.4 — Write integrations/aws/SETUP.md (referenced by README per sv0-connectors#89 P0-9 docs item) covering: explicit-list deploy, StackSet deploy, ExternalId rotation, troubleshooting AssumeRole denials.

Phase 6: Hardening + docs

  • 🔵 T6.1 — iam:SimulatePrincipalPolicy-based pre-flight permission check per (account, category). On failure → cell short-circuits with unavailable_no_access, no API calls made.
  • 🔵 T6.2 — Permission boundary SecurityV0ReadOnlyBoundary added to CFN + Terraform. Test that aws iam simulate-custom-policy denies iam:CreateRole.
  • 🟡 T6.3 — Platform diff engine: confirm (connector_id, source_system) keying tolerates the colon suffix aws_iam:<account>. Add a regression test if missing.

Total: 23 tasks across 6 phases.

Validation criteria

After Phase 1: sv0-aws scan --all --accounts 111,222,333 issues exactly 3 sts:AssumeRole calls (or 0 if cached), one per account, never chained. Failed AssumeRole on account 222 does not block 111 and 333.

After Phase 2: sv0-aws scan --all --services iam --accounts 111 issues IAM API calls only — lambda:ListFunctions, bedrock:ListAgents, etc. are absent from the captured boto3 trace.

After Phase 3: scanning {accounts: 3, services: 4} produces 12 connector-report rows in evidenceCompleteness.sources. Wall-clock time is ≤ 1.6× the longest single-cell time (proves parallelism, with overhead for serial-per-account capping). Failing one cell mid-flight (for example, by raising inside its extractor; os.kill signals a whole process and cannot target a single thread) results in 11 successful cells and 1 failed cell, never an escaped exception.
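
The isolation property above (twelve cells, one failure, no escaped exception) can be sketched with the standard library alone. This is an illustrative sketch, not the connector's scheduler: run_cells and the shape of the result rows are hypothetical, while the pool size of 4 and the per-account cap of 2 are taken from the open questions in this document.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Cell = Tuple[str, str]  # (account_id, service_category)


def run_cells(
    cells: List[Cell],
    extract: Callable[[str, str], int],
    max_workers: int = 4,
    per_account_cap: int = 2,
) -> Dict[Cell, dict]:
    """Run every cell, converting any extractor exception into a failed row."""
    # One semaphore per account limits how many of its cells run at once.
    gates = {acct: threading.Semaphore(per_account_cap) for acct, _ in cells}

    def run_one(cell: Cell) -> dict:
        account_id, category = cell
        with gates[account_id]:
            try:
                return {"status": "ok", "items": extract(account_id, category)}
            except Exception as exc:
                # The failure stays inside this cell's row.
                return {"status": "failed", "error": f"{type(exc).__name__}: {exc}"}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {cell: pool.submit(run_one, cell) for cell in cells}
        return {cell: fut.result() for cell, fut in futures.items()}
```

Because run_one never raises, fut.result() cannot propagate an extractor error, which is the behavior the criterion asks a test to confirm.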

After Phase 4: scanning the Lab 2 3-account topology (mp-security, mp-workloads, mp-data) emits:

  • exactly 3 aws_account nodes
  • ≥ 3 BELONGS_TO edges from scanned workloads to accounts (one per workload to its account)
  • 1 aws_ou node per OU in the path
  • 1 BELONGS_TO edge per account to its OU
  • 1 TRUSTS edge for the mp-workloads-to-mp-data cross-account role assume — with both source and target being real role nodes (NOT placeholder external_aws_account)
  • the platform's path materializer renders this as a single authority path: Bedrock-agent → action-Lambda → Lambda-role → cross-account-trusts-edge → data-role → S3-bucket
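
A few of the Phase 4 expectations can be checked mechanically. The sketch below assumes a simplified graph representation (nodes as dicts with id and type, edges as dicts with type, source, and target) that is hypothetical and may not match NormalizedGraph. The node and edge type names aws_account, aws_ou, external_aws_account, BELONGS_TO, and TRUSTS come from this document.

```python
from typing import List


def check_phase4_shape(nodes: List[dict], edges: List[dict],
                       expected_accounts: int = 3) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems: List[str] = []
    node_type = {n["id"]: n["type"] for n in nodes}

    accounts = [nid for nid, t in node_type.items() if t == "aws_account"]
    if len(accounts) != expected_accounts:
        problems.append(f"expected {expected_accounts} aws_account nodes, found {len(accounts)}")

    # Every account should have one BELONGS_TO edge to an aws_ou node.
    for acct in accounts:
        to_ou = [e for e in edges
                 if e["type"] == "BELONGS_TO" and e["source"] == acct
                 and node_type.get(e["target"]) == "aws_ou"]
        if len(to_ou) != 1:
            problems.append(f"account {acct} has {len(to_ou)} BELONGS_TO edges to an OU")

    # TRUSTS edges must not touch the placeholder account node type.
    for e in edges:
        if e["type"] != "TRUSTS":
            continue
        for end in ("source", "target"):
            if node_type.get(e[end]) == "external_aws_account":
                problems.append(f"TRUSTS edge {end} {e[end]} is a placeholder node")
    return problems
```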

After Phase 5: a terraform apply against 3 fresh AWS accounts deploys the spoke role into all 3 in one apply, and sv0-aws scan --all --accounts <those-3> succeeds with no manual click step. The CFN StackSet equivalent succeeds via aws cloudformation create-stack-instances.

After Phase 6: iam:SimulatePrincipalPolicy against the SecurityV0ReadOnly role with iam:CreateRole returns implicitDeny (the boundary blocks the action even if a future policy attempts to allow it).

Cost / API-call budget for Stream 4 sizing. Per scan of one account, default service set:

| Category | Calls (rough) | Notes |
|---|---|---|
| iam | 5–15 | one paginated GetAccountAuthorizationDetails, plus GenerateCredentialReport, plus per-role GetServiceLastAccessedDetails for top N |
| lambda | 1 + 3·(#functions) per region | list + get + getPolicy + listEventSourceMappings |
| bedrock | 5–20 per region | list + describe per agent / KB / flow |
| s3 | 1 + 8·(#buckets) | global list, then per-bucket policy/encryption/etc. |
| secrets | 1 + 2·(#secrets) per region | list + describe + getResourcePolicy |
| dynamodb_sns | 1 + 1·(#tables + #topics) per region | |
| step_functions | 1 + 1·(#machines) per region | |
| eventbridge | 1 + 1·(#rules + #connections) per region | |
| ecs_ecr | 5–30 per region | clusters + services + taskDefs + repos |
| cloudtrail | 30–300 per Lambda | S3 archive scan, budget-bounded to 600s/Lambda by default |
| access_analyzer | 2 + 1·(#findings) per region | |
| config | ~5 per region | rare in pilot accounts |

A small Lab-2-sized account (10 Lambdas, 2 Bedrock agents, 5 buckets, 5 secrets) costs ~150 API calls for the steady-state set excluding CloudTrail. CloudTrail evidence dominates — budget another ~3000 calls per account if the 30-day evidence window is on. Total per-scan order of magnitude: ~3,000 API calls per account; ~10,000 for the 3-account Lab 2; ~$0.05 in CloudTrail LookupEvents charges per scan, negligible against the rest of AWS pricing.
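
The arithmetic behind these estimates can be reproduced from the table's formulas. The sketch below is a rough calculator under stated assumptions: a single region, zero tables, topics, state machines, rules, connections, and Access Analyzer findings unless other_resources is set, and the low and high ends of each range carried through as a (low, high) pair. The function names are hypothetical.

```python
from typing import Tuple


def steady_state_calls(lambdas: int, buckets: int, secrets: int,
                       regions: int = 1, other_resources: int = 0) -> Tuple[int, int]:
    """(low, high) API-call estimate per account, excluding CloudTrail."""
    s3 = 1 + 8 * buckets                      # global, not per region
    global_low, global_high = 5 + s3, 15 + s3  # iam is 5-15

    per_region_fixed = (
        (1 + 3 * lambdas)          # lambda
        + (1 + 2 * secrets)        # secrets
        + 3 + other_resources      # dynamodb_sns, step_functions, eventbridge
        + 2                        # access_analyzer with zero findings assumed
        + 5                        # config
    )
    per_region_low = per_region_fixed + 5 + 5     # bedrock low + ecs_ecr low
    per_region_high = per_region_fixed + 20 + 30  # bedrock high + ecs_ecr high

    return (global_low + regions * per_region_low,
            global_high + regions * per_region_high)


def cloudtrail_calls(lambdas: int) -> Tuple[int, int]:
    """(low, high) CloudTrail evidence calls at 30-300 per Lambda."""
    return 30 * lambdas, 300 * lambdas


if __name__ == "__main__":
    print(steady_state_calls(lambdas=10, buckets=5, secrets=5))  # (108, 158)
    print(cloudtrail_calls(10))                                  # (300, 3000)
```

For the Lab-2-sized account this gives 108 to 158 steady-state calls in one region, the same ballpark as the ~150 quoted above, and up to 3,000 CloudTrail calls, which is why CloudTrail dominates the per-account total.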

Open questions

  1. Stream 1's ConnectorInstance.scanScope exact field names. RESOLVED in umbrella revision-1 contract lock: Stream 1 ScanScopeDoc.scope_keys = { account_ids: string[], regions: string[] } (always plural arrays); service_categories[] is a top-level field on ScanScopeDoc, validated by the platform against ConnectorInstance.discovered_capabilities.service_categories_available.
  2. Per-account per-region rate-limit budgets vs per-account global. I'm using a per-account cap of 2 concurrent cells as the simplest correct lower bound; finer per-(account,region,category) bucketing is sv0-platform#309 territory and explicitly out of scope here.
  3. Should the org-discovery role be one ARN (the management account) or a list (delegated admin per service)? Today: one ARN. AWS best practice is delegated admin per service (Config, Security Hub, Access Analyzer can each have a different delegated admin). Proposal: v1 supports one org-discovery ARN. If a customer needs delegated-admin granularity, they fall back to explicit --accounts. Revisit in v2.
  4. Is aws:PrincipalOrgID mandatory or optional? Mandatory in org-mode would prevent ExternalId-only sandbox usage. I propose: optional, but emitted as a warning in validate if absent in org-mode.
  5. Cross-account TRUSTS edge direction. I chose "source=trusting role, target=trusted principal" because that mirrors how path-materializer.ts traverses today (caller → callee). If Stream 3's stitcher prefers the opposite, this is a one-line transformer change.
  6. Scheduler implementation. ThreadPoolExecutor(max_workers=4) with semaphores, or a small asyncio orchestrator over thread-pool-wrapped extractors? The former is simpler, the latter composes better with eventual platform-side worker integration. I propose ThreadPoolExecutor for Phase 3, asyncio refactor when Stream 1's worker ships.

References

  • integrations/aws/src/sv0_aws/cli/main.py — current CLI entry point, single-account scan loop
  • integrations/aws/src/sv0_aws/adapters/aws_client.py — _assume_role(), paginate_with_backoff, adaptive retry config
  • integrations/aws/src/sv0_aws/config.py — AWS_ORGANIZATION_ROLE_ARN (declared, unused), CloudTrail org-trail config
  • integrations/aws/src/sv0_aws/core/transformer.py:400-438 — existing _transform_accounts
  • integrations/aws/src/sv0_aws/core/transformer.py:1170-1190 — placeholder external_aws_account (to be refactored)
  • integrations/aws/src/sv0_aws/core/trust_policy_parser.py — already extracts trusted accounts / OIDC providers / SAML providers
  • integrations/aws/cfn/securityv0-readonly-role.yaml — single-account CFN, basis for StackSet wrapper
  • sv0-platform/src/ingestion/types.ts — ScanScope, NormalizedGraph, EvidenceCompletenessReport
  • sv0-demo-labs/labs/sv0-demo-lab-1/main.tf — single-account Lab 1 (must continue to work)
  • sv0-connectors#32 — multi-account acceptance criteria (this design fulfills)
  • sv0-connectors#57 — CloudTrail org-trail multi-account discovery (already partially landed in CloudTrail extractor; rest of the connector follows here)
  • sv0-connectors#89 P0-8 — pre-client P0 epic listing multi-account as alternative to "scope pilot to one account"
  • sv0-platform#309 — multi-tenant connector throttling research (parked; this design respects its tactical mitigations and does not block on it)
  • sv0-documentation#195 — MediaPro pilot readiness umbrella, multi-account on the must-ship list
  • docs/architecture/research/2026-03-30-aws-integration-strategy.md §2, §7, §Phase 0 — multi-account customer shape, delegated-admin posture, STS chain limits
  • docs/architecture/research/2026-03-11-aws-connector-research.md — earlier groundwork
  • docs/plans/2026-04-08-demo-lab-plan.md §"Lab 2 — Nimbus Enterprise" — 3-account topology this connector must serve
  • docs/architecture/05-connectors.md — connector interface invariants (ScanScope, NormalizedGraph, evidence completeness)