Multi-Account AWS Connector Architecture
Spec-deviation note (revision-1): This design deviates from sv0-connectors#32's acceptance criterion that says "for each target account, assume role chain: bootstrap → org role → per-account role." This design instead never chains STS: every per-account assume happens directly from bootstrap creds, and the org role is used only for the ListAccounts/OU walk. The deviation is justified (STS chaining adds latency, complicates credential refresh, and provides no security benefit in our threat model), but the original #32 AC needs an amendment comment before this design is implemented. Filed as a coordination action in the umbrella's Coord-7 (revision-1).
TL;DR
Today's AWS connector scans exactly one account: the one its assumed role lives in (AWS_ROLE_ARN). MediaPro is a 3-account org and Lab 2 (nimbus-security / nimbus-workloads / nimbus-data) cannot be built without honoring --accounts. This proposal restructures the connector around a (account × service-category) cell as the unit of work, with non-chained role assumption (bootstrap → per-account SecurityV0ReadOnly, plus an optional org-discovery role used only for account and OU listing), per-cell partial-failure isolation, and a StackSet-deployable spoke role. Account discovery is via AWS Organizations when available and an explicit list otherwise. The graph gains one aws_account node per discovered account, OU membership encoded as BELONGS_TO, and cross-account role assumption emitted as a single TRUSTS edge between role nodes (so Stream 3's stitcher can collapse workload-A → role-A → role-B → resource-B into one authority path).
Problem
The single-account connector has four concrete failures against the MediaPro-shaped customer:
- --accounts is parsed nowhere. sv0-connectors#32 flagged it; cli/main.py accepts no such flag. scan() calls aws_client.get_account_id() once, then iterates regions only. Workloads in mp-workloads are invisible to a connector that assumed into mp-security.
- Monolithic scan, monolithic failure. AWSConnector.scan() wraps the whole region loop in a single try/except that re-raises (cli/main.py:308-310). One Bedrock-region throttle aborts IAM, Lambda, S3, ECS, etc. for every account. Enterprise security teams will not approve a 4-hour read window for "all of AWS or nothing."
- No service-category scoping. Every scan extracts every category (IAM + Lambda + Bedrock + Secrets + S3 + DynamoDB + SNS + Step Functions + EventBridge + ECS + ECR + CloudTrail). There is no --services iam knob. A buyer who wants nightly IAM scans + weekly Bedrock cannot have it.
- Cross-account authority paths fail open. When a workload assumes a role in another account, the trust-policy parser sees the trusted account ID but no role node exists in that account because the connector never scanned it. The transformer creates a placeholder external_aws_account resource node (transformer.py:1175-1190) and the path dead-ends at an account chip instead of resolving into the actual destination role's permissions.
Current state
What ships today
| File | Behavior |
|---|---|
| integrations/aws/src/sv0_aws/cli/main.py | CLI accepts --regions, --skip-cloudtrail, no --accounts, no --services. scan() is a single linear pass over one account. |
| integrations/aws/src/sv0_aws/adapters/aws_client.py | One boto3.Session. _assume_role() assumes AWS_ROLE_ARN once, caches credentials with refresh-5-min-before-expiry. No multi-target STS. paginate_with_backoff correctly resumes mid-pagination on throttling and uses retry_mode="adaptive". |
| integrations/aws/src/sv0_aws/config.py | AWS_ORGANIZATION_ROLE_ARN is read from env but never used by anything. CLOUDTRAIL_BUCKET_LAYOUT already supports organization (W1.3 phase 2): partial multi-account groundwork in CloudTrail only. |
| integrations/aws/src/sv0_aws/extractors/*.py | Each extractor takes account_id as a parameter, but the entire CLI passes the same current_account_id to all of them. The extractor signatures already permit per-account scoping; the orchestrator does not exploit it. |
| integrations/aws/src/sv0_aws/core/transformer.py | _transform_accounts() exists and emits aws_account nodes (lines 400-438). aws_account placeholder nodes for trusted external accounts also exist (lines 1170-1190). BELONGS_TO workload→account edges and BELONGS_TO account→OU edges are not emitted. |
| integrations/aws/cfn/securityv0-readonly-role.yaml | Single CFN template, parameterized on ExternalId and SecurityV0AccountId. No StackSet wrapper, no aws:PrincipalOrgID condition, no per-region considerations beyond the IAM-is-global default. Permission set already covers all categories planned below; re-use as-is. |
Gap vs sv0-connectors#32
#32 acceptance criteria, status:
| Criterion | Status | Notes |
|---|---|---|
| --accounts 111,222,333 scans three explicit accounts | Not built | CLI does not parse the flag |
| AWS_ORGANIZATION_ROLE_ARN triggers ListAccounts auto-discovery | Not built | env var is read, never consumed |
| Partial failures don't abort whole scan | Not built | single try/except in scan() |
| scanScope.sourceSystems includes per-account identifiers | Partial | scanScope.sourceSystems is hardcoded ["aws_iam","aws_lambda",...], never per-account |
| Cross-account workload→role trace is one authority path | Not built | placeholder external_aws_account dead-ends the path |
| CFN updated for per-account role deployment | Not built | StackSet wrapper missing |
Other follow-ups
- sv0-connectors#57: CloudTrail org-trail layout already wired in config (CLOUDTRAIL_BUCKET_LAYOUT=organization) and CloudTrailExtractor accepts organization_id. The remaining gap is the rest of the connector following CloudTrail's lead.
- sv0-platform#309: research-only on cross-tenant rate-limiting. Tactical PR (adaptive retry, structured throttle logs) already in aws_client.py. Multi-account budget here MUST stay within those tactical mitigations and not require new platform-side coordination.
- sv0-connectors#89 (P0-8): multi-account is on the pre-client P0 list and explicitly listed as the alternative to "scope pilot to one account." MediaPro will not accept the latter.
Design proposal
Account discovery & role-chain auth
Two modes, one code path.
- Organizations mode: set AWS_ORGANIZATION_ROLE_ARN to a role in the management/delegated-admin account that has organizations:ListAccounts, organizations:DescribeOrganization, organizations:ListOrganizationalUnitsForParent, organizations:ListParents. The connector assumes this role first, paginates ListAccounts, builds the OU tree, then for each ACTIVE account assumes the per-account spoke role. CLI: sv0-aws scan --all --discover-org.
- Explicit list mode: the operator supplies --accounts 111111111111,222222222222,333333333333. No org role required; the connector only needs bootstrap creds plus a spoke role in each listed account. Compatible with ephemeral / pilot accounts that aren't part of an Org. CLI: sv0-aws scan --all --accounts 111,222,333.
Role chain.
bootstrap creds (env / instance / SSO)
└─ optional: AssumeRole AWS_ORGANIZATION_ROLE_ARN (Organizations API surface)
└─ for each target account:
AssumeRole arn:aws:iam::<account_id>:role/SecurityV0ReadOnly
(ExternalId required, sts:ExternalId on trust policy)
We deliberately do not chain: every per-account assume happens directly from the bootstrap session, never from the org-role session. STS chained sessions are capped at 1 hour regardless of MaxSessionDuration (per 2026-03-30-aws-integration-strategy.md §Phase 0), and the Organizations API call is a one-shot at the start of the run, so there is no upside to chaining.
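A minimal sketch of the non-chained assume, assuming a boto3-style STS client built from the bootstrap session is passed in. The helper names `spoke_role_arn` and `assume_spoke_role` and the session-name format are hypothetical; the keyword arguments (`RoleArn`, `RoleSessionName`, `ExternalId`, `DurationSeconds`) are the standard STS AssumeRole parameters.

```python
from typing import Any, Protocol


class StsLike(Protocol):
    """Anything exposing the call shape of boto3's sts.assume_role(**kwargs)."""

    def assume_role(self, **kwargs: Any) -> dict: ...


def spoke_role_arn(account_id: str, role_name: str = "SecurityV0ReadOnly") -> str:
    """Build the uniform per-account spoke role ARN."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_spoke_role(bootstrap_sts: StsLike, account_id: str, external_id: str) -> dict:
    """Assume the spoke role directly from the bootstrap session (never the org-role session)."""
    response = bootstrap_sts.assume_role(
        RoleArn=spoke_role_arn(account_id),
        RoleSessionName=f"sv0-scan-{account_id}",  # hypothetical naming scheme
        ExternalId=external_id,
        DurationSeconds=3600,  # 1-hour credentials, as described above
    )
    return response["Credentials"]
```

Passing the client in keeps the function testable with a stub and makes it explicit which session the assume originates from.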
Per-account role ARN convention. arn:aws:iam::<account_id>:role/SecurityV0ReadOnly is the default. Override is a single env var, AWS_SPOKE_ROLE_NAME=SecurityV0ReadOnly, applied uniformly. We deliberately do not support per-account ARN overrides in v1 — uniform naming is what makes StackSet deployment a one-step operation. Customers with naming-convention objections get an opt-out in v2.
Credential lifetime. AssumeRole returns 1-hour credentials. We cache per (account_id, region) in a dict[tuple[str, str], CachedCreds]. Refresh logic mirrors today's _are_credentials_valid() — refresh 5 minutes before expiry. A long IAM scan against 50 accounts may need to re-assume; we tolerate that.
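The cache described above could look roughly like the following. This is a sketch, not the existing `_are_credentials_valid()` logic: the `CredentialCache` class, its `fetch` callback, and the injectable clock are assumptions made so the refresh rule can be exercised without AWS.

```python
import threading
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Tuple

REFRESH_MARGIN = timedelta(minutes=5)  # refresh 5 minutes before expiry


@dataclass(frozen=True)
class CachedCreds:
    access_key_id: str
    secret_access_key: str
    session_token: str
    expiration: datetime  # timezone-aware UTC


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class CredentialCache:
    """Caches credentials per (account_id, region) and re-fetches near expiry."""

    def __init__(
        self,
        fetch: Callable[[str, str], CachedCreds],
        now: Callable[[], datetime] = _utc_now,
    ) -> None:
        self._fetch = fetch
        self._now = now
        self._lock = threading.Lock()  # cells run on worker threads
        self._entries: Dict[Tuple[str, str], CachedCreds] = {}

    def get(self, account_id: str, region: str) -> CachedCreds:
        key = (account_id, region)
        with self._lock:
            cached = self._entries.get(key)
            if cached is None or self._now() >= cached.expiration - REFRESH_MARGIN:
                cached = self._fetch(account_id, region)
                self._entries[key] = cached
            return cached
```

With 1-hour credentials, a second `get` within 55 minutes returns the cached entry and a call at or after the 55-minute mark triggers a re-assume.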
Account denies AssumeRole — log assume_role_denied account_id=X error_code=AccessDenied, mark the cell status: failed, reason: assume_role_denied, do not retry, and continue. The account is reported in scanScope.errors.permissionDenied. Critically: a denied account does not block discovery of other accounts.
Service-category scoping
The category set (initial v1):
| Category | Extractors / APIs | Per-account TPS budget |
|---|---|---|
| iam | iam:GetAccountAuthorizationDetails, iam:GetCredentialReport, iam:GetServiceLastAccessedDetails | ~3 TPS, single global call (heavy) |
| lambda | lambda:ListFunctions, GetFunction, GetPolicy, ListEventSourceMappings, per region | ~5 TPS per region |
| bedrock | bedrock:ListAgents/Get*, ListKnowledgeBases, ListFlows, ListGuardrails, GetModelInvocationLoggingConfiguration | ~5 TPS per region |
| ecs_ecr | ecs:ListClusters/DescribeServices/DescribeTaskDefinition, ecr:DescribeRepositories | ~10 TPS per region |
| step_functions | states:ListStateMachines/DescribeStateMachine | ~5 TPS per region |
| eventbridge | events:ListRules/ListTargetsByRule/ListConnections/ListDestinations | ~5 TPS per region |
| s3 | s3:ListAllMyBuckets/GetBucket* | global list + per-bucket (1-2 TPS practical) |
| secrets | secretsmanager:ListSecrets/DescribeSecret, ssm:DescribeParameters | ~5 TPS per region |
| dynamodb_sns | dynamodb:ListTables/DescribeTable, sns:ListTopics/GetTopicAttributes | ~10 TPS per region |
| cloudtrail | cloudtrail:DescribeTrails, S3 archive scan via CloudTrailExtractor | already isolated per workload, ~30s/Lambda |
| access_analyzer | access-analyzer:ListAnalyzers/ListFindings | ~5 TPS per region |
| config | config:DescribeConfigRules (delegated admin only) | ~3 TPS per region |
ecs+ecr and dynamodb+sns are paired because they always co-load in customer demos and split-billing the TPS budget for them adds no value.
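One way to pin the category set down in code is an enum plus a parser for the comma-separated `--services` value. `ServiceCategory` is the name used in the implementation plan; `parse_services` and its error wording are hypothetical, and treating `all` as "every category" is assumed from the planned default.

```python
from enum import Enum
from typing import List


class ServiceCategory(str, Enum):
    IAM = "iam"
    LAMBDA = "lambda"
    BEDROCK = "bedrock"
    ECS_ECR = "ecs_ecr"
    STEP_FUNCTIONS = "step_functions"
    EVENTBRIDGE = "eventbridge"
    S3 = "s3"
    SECRETS = "secrets"
    DYNAMODB_SNS = "dynamodb_sns"
    CLOUDTRAIL = "cloudtrail"
    ACCESS_ANALYZER = "access_analyzer"
    CONFIG = "config"


def parse_services(raw: str) -> List[ServiceCategory]:
    """Parse a comma-separated category list; 'all' selects every category."""
    if raw.strip().lower() == "all":
        return list(ServiceCategory)
    selected: List[ServiceCategory] = []
    for token in raw.split(","):
        name = token.strip().lower()
        if not name:
            continue  # tolerate stray commas
        try:
            category = ServiceCategory(name)
        except ValueError:
            valid = ", ".join(c.value for c in ServiceCategory)
            raise ValueError(f"unknown service category {name!r}; valid: {valid}") from None
        if category not in selected:  # de-duplicate while keeping order
            selected.append(category)
    if not selected:
        raise ValueError("no service categories given")
    return selected
```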
Scope object (consumed from Stream 1's ScanScopeDoc). Per the locked Stream 1 ↔ Stream 2 contract: AWS-specific keys (account_ids[], regions[], optional discovery/exclude fields) live inside scope_keys; service_categories[] lives outside scope_keys, at the top of the ScanScope document, and is validated by the platform against ConnectorInstance.discovered_capabilities.service_categories_available. The on-disk Mongo document (Stream 1's ScanScopeDoc) splits the two:
{
"scope_keys": {
"account_ids": ["111111111111", "222222222222", "333333333333"],
"regions": ["us-east-1", "eu-west-1"]
// optional fields, all inside scope_keys:
// "discover_org": true, // exclusive with explicit account_ids
// "exclude_account_ids": ["999999999999"], // applied after discovery
// "exclude_ous": ["ou-aaaa-suspended"]
},
"service_categories": ["iam", "lambda", "bedrock"]
}
The flat object the AWS connector executor sees after Stream 1 unwraps ScanScopeDoc for the CLI is { account_ids, regions, service_categories, ...optional discovery fields } — that flattened form is what the rest of this document refers to as "the scope object."
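The unwrapping step can be illustrated with a few lines of Python. This is only a sketch of the mapping described above: `flatten_scope` is a hypothetical name, and where the real unwrap lives (Stream 1's worker or the CLI) is not settled here.

```python
from typing import Any, Dict


def flatten_scope(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Merge scope_keys and the top-level service_categories into one flat scope object."""
    scope_keys = dict(doc.get("scope_keys", {}))
    # discover_org is exclusive with an explicit account_ids list.
    if scope_keys.get("discover_org") and scope_keys.get("account_ids"):
        raise ValueError("discover_org is exclusive with explicit account_ids")
    flat = dict(scope_keys)
    flat["service_categories"] = list(doc.get("service_categories", []))
    return flat
```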
Unit-of-work cell — (account_id, service_category). Region is inside a cell, not a third dimension, because every category-extractor today already iterates regions internally (extract_lambda_by_region, extract_bedrock_entities_by_region, etc.). Adding region as a third axis would force extractor refactors with no scheduling benefit (regions for one category in one account share the same STS session and share the same per-service quota).
A scan of {accounts: 3, services: 4} therefore produces 12 cells. Each cell:
- has its own STS-credentials handle (cached per account; re-used across cells in that account)
- has its own try/except boundary
- emits its own per-cell connector-report row
- can succeed (available), be partial (partial), or fail (unavailable_no_access / unavailable_not_enabled)
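The cell model above can be sketched as a cross product plus a per-cell error boundary. `Cell`, `build_cells`, and `run_cell` are hypothetical names, and the row fields are simplified relative to the connector-report shape described later.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Cell:
    account_id: str
    service_category: str


def build_cells(account_ids: Iterable[str], service_categories: Iterable[str]) -> List[Cell]:
    """Cross product of accounts and categories; region iteration stays inside the cell."""
    return [Cell(a, c) for a, c in product(account_ids, service_categories)]


def run_cell(cell: Cell, extract: Callable[[Cell], int]) -> dict:
    """Run one cell inside its own try/except boundary and return its report row."""
    try:
        record_count = extract(cell)
    except Exception as exc:  # deliberate catch-all: one cell must not abort the others
        return {"cell": cell, "outcome": "failed", "reason": type(exc).__name__}
    return {"cell": cell, "outcome": "available", "recordCount": record_count}
```

Three accounts and four categories yield 12 cells, and an extractor that raises produces a failed row instead of propagating.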
Permission declarations. Each category declares its required IAM actions in code (extractors/<name>_extractor.py:REQUIRED_ACTIONS = frozenset({...})). At scan start, the connector unions REQUIRED_ACTIONS across service_categories and runs iam:SimulatePrincipalPolicy (or falls back to a smoke test per category) to short-circuit cells whose role lacks permissions. This bounds wasted AssumeRole churn.
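As an illustration of the union and short-circuit step, assuming the per-extractor `REQUIRED_ACTIONS` sets are collected into one mapping (the mapping layout and both helper names are hypothetical; the action names are taken from the category table):

```python
from typing import Dict, FrozenSet, Iterable, List

REQUIRED_ACTIONS: Dict[str, FrozenSet[str]] = {
    "iam": frozenset({
        "iam:GetAccountAuthorizationDetails",
        "iam:GetCredentialReport",
        "iam:GetServiceLastAccessedDetails",
    }),
    "lambda": frozenset({
        "lambda:ListFunctions",
        "lambda:GetFunction",
        "lambda:GetPolicy",
        "lambda:ListEventSourceMappings",
    }),
}


def union_required_actions(categories: Iterable[str]) -> FrozenSet[str]:
    """All actions to check in the pre-flight for the selected categories."""
    actions = set()
    for category in categories:
        actions |= REQUIRED_ACTIONS[category]
    return frozenset(actions)


def short_circuited(categories: Iterable[str], denied_actions: Iterable[str]) -> List[str]:
    """Categories whose cells should be skipped because a required action is denied."""
    denied = set(denied_actions)
    return [c for c in categories if REQUIRED_ACTIONS[c] & denied]
```

The `denied_actions` input stands in for whatever the policy simulation or smoke test reports as not allowed.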
Parallelism & rate-limit budgets
Per-tenant scheduler. A single ThreadPoolExecutor(max_workers=N) (default N=4) drives all cells. Threads, not async — every extractor is boto3-blocking and the threading overhead is dwarfed by network I/O. Workers are budgeted per account, not per category: never run >2 cells concurrently against the same account, because per-account IAM TPS is the tightest binding constraint. Per-region category cells against different accounts are independent and parallelize freely.
account 111 account 222 account 333
├─ iam ├─ iam ├─ iam
└─ lambda └─ lambda └─ lambda
(≤2) (≤2) (≤2)
global concurrency cap: 6 (= 3 accounts × 2)
worker pool: min(global_cap, max_workers)
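A sketch of the scheduler shape under these rules, using one semaphore per account to enforce the cap of 2 and sizing the pool as min(global_cap, max_workers). `run_cells` and the tuple-based `Cell` alias are hypothetical; `work` is assumed to catch its own errors, as the per-cell boundary requires.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterable, List, Tuple

Cell = Tuple[str, str]  # (account_id, service_category)


def run_cells(
    cells: Iterable[Cell],
    work: Callable[[Cell], Any],
    max_workers: int = 4,
    per_account_cap: int = 2,
) -> List[Any]:
    """Run every cell on a thread pool, never more than per_account_cap at once per account."""
    cell_list = list(cells)
    accounts = {account_id for account_id, _ in cell_list}
    gates = {a: threading.BoundedSemaphore(per_account_cap) for a in accounts}
    global_cap = len(accounts) * per_account_cap
    pool_size = max(1, min(global_cap, max_workers))

    def guarded(cell: Cell) -> Any:
        with gates[cell[0]]:  # blocks while the account already has 2 cells running
            return work(cell)

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(guarded, cell_list))  # results keep input order
```

One limitation of this sketch: a worker waiting on an account's semaphore still occupies a pool slot, so interleaving cells across accounts before submission would keep more workers busy.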
Per-service rate-limit budgets. Reuse paginate_with_backoff (already cursor-resuming + adaptive retry, per aws_client.py:175-304). No new token bucket — sv0-platform#309 is parked research and we do not block multi-account on it. We DO add per-cell budget tracking: each cell records api_calls_made, throttle_events, wall_time_seconds in its connector-report row. This is the data Stream 4 needs to size cost expectations and #309 needs to inform its design.
Backpressure. _assume_role throttling on STS (~100 TPS per account) is plausible at scale. The bootstrap session adds an STS retry layer using the same adaptive config. If STS still fails after retries for a given target, the cell is marked failed, but the account is not removed from subsequent cells (because the per-(account, region) credential cache may still hold valid creds for sibling cells).
Partial-failure propagation.
cell outcome → connector-report status → ScanRun roll-up
─────────────────────────────────────────────────────────
available │ ok │ contributes to ok
partial │ partial │ ScanRun = partial
empty │ ok (recordCount=0) │ contributes to ok
failed │ unavailable_no_access │ ScanRun = partial (NOT failed unless ALL cells fail)
A ScanRun is failed only if every cell failed, because by definition we have nothing to ingest. Otherwise it is ok (all available) or partial (mixed). This is the contract Stream 1's ScanRun schema must accept; we flag it as an assumption.
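The roll-up rule is small enough to state as a function. This sketch uses the cell-outcome labels from the table above; `rollup_scan_run` is a hypothetical name, and rejecting an empty cell list is an assumption.

```python
from typing import Iterable


def rollup_scan_run(cell_outcomes: Iterable[str]) -> str:
    """Collapse per-cell outcomes into a ScanRun status: ok, partial, or failed."""
    outcomes = list(cell_outcomes)
    if not outcomes:
        raise ValueError("a ScanRun needs at least one cell")
    if all(o == "failed" for o in outcomes):
        return "failed"  # nothing to ingest
    if all(o in ("available", "empty") for o in outcomes):
        return "ok"
    return "partial"  # any mix that includes a partial or failed cell
```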
Node/edge emission for cross-account
This is the contract Stream 3 will consume. Schemas are in NormalizedGraph form (camelCase, matches sv0-platform/src/ingestion/types.ts).
aws_account resource node (one per discovered account)
{
"nodeId": "aws_account:<account_id>", // already emitted today
"nodeType": "resource",
"sourceSystem": "aws_organizations:<account_id>",
"sourceId": "<account_id>",
"displayName": "<account_name or 'AWS Account <id>'>",
"status": "active", // or "suspended" from DescribeAccount
"properties": {
"accountId": "<account_id>",
"accountName": "<account_name>",
"ouId": "ou-aaaa-bbbbbbbb", // already emitted
"ouPath": "Root/Production/Workloads", // already emitted
"isManagementAccount": false, // already emitted
"organizationId": "o-abcdef0123", // already emitted
"subtype": "aws_account", // already emitted
// NEW additions for Stream 3 stitching:
"accountPurpose": "workloads", // hint from name pattern: security|workloads|data|sandbox|management|unknown
"discoveredVia": "organizations" | "explicit_list",
"joinedMethod": "INVITED" | "CREATED",
"joinedTimestamp": "2026-01-15T..."
}
}
aws_ou resource node (NEW — currently missing)
{
"nodeId": "aws_ou:<ou_id>",
"nodeType": "resource",
"sourceSystem": "aws_organizations:<organization_id>",
"sourceId": "<ou_id>",
"displayName": "<ou_name>",
"status": "active",
"properties": {
"ouId": "ou-aaaa-bbbbbbbb",
"ouName": "Production",
"ouPath": "Root/Production",
"parentId": "r-abcd" | "ou-...",
"organizationId": "o-abcdef0123",
"subtype": "aws_ou"
}
}
BELONGS_TO edges (NEW — currently missing)
Two patterns:
// workload / identity / resource → its owning account
{
"edgeId": "belongs_to:<source_node_id>:<account_id>",
"edgeType": "BELONGS_TO",
"sourceNodeId": "aws_lambda:222222222222:us-east-1:claims-reconcile",
"targetNodeId": "aws_account:222222222222",
"properties": { "boundary": "account" }
}
// account → OU
{
"edgeId": "belongs_to:<account_id>:<ou_id>",
"edgeType": "BELONGS_TO",
"sourceNodeId": "aws_account:222222222222",
"targetNodeId": "aws_ou:ou-aaaa-bbbbbbbb",
"properties": { "boundary": "ou" }
}
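Both edge patterns can be produced by small builders. The function names are hypothetical; the field layout and ID formats follow the two JSON examples above.

```python
def belongs_to_account(source_node_id: str, account_id: str) -> dict:
    """Edge from a workload / identity / resource node to its owning account."""
    return {
        "edgeId": f"belongs_to:{source_node_id}:{account_id}",
        "edgeType": "BELONGS_TO",
        "sourceNodeId": source_node_id,
        "targetNodeId": f"aws_account:{account_id}",
        "properties": {"boundary": "account"},
    }


def account_belongs_to_ou(account_id: str, ou_id: str) -> dict:
    """Edge from an account node to the OU that contains it."""
    return {
        "edgeId": f"belongs_to:{account_id}:{ou_id}",
        "edgeType": "BELONGS_TO",
        "sourceNodeId": f"aws_account:{account_id}",
        "targetNodeId": f"aws_ou:{ou_id}",
        "properties": {"boundary": "ou"},
    }
```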
Cross-account TRUSTS / ASSUMES_ROLE edges
Today TrustPolicyParser already extracts aws_accounts from a role's AssumeRolePolicyDocument. The current transformer (lines 1170-1190) creates a placeholder external_aws_account and a generic edge — this hides the real path. New emission:
// Role A in account-A trusts role B in account-B (when both roles exist in scan)
{
"edgeId": "trusts:aws_iam_role:222222222222:cross-account-data-reader:aws_iam_role:333333333333:data-orchestrator",
"edgeType": "TRUSTS",
"sourceNodeId": "aws_iam_role:222222222222:cross-account-data-reader", // the trusting role (target of AssumeRole)
"targetNodeId": "aws_iam_role:333333333333:data-orchestrator", // the trusted principal (caller)
"properties": {
"boundary": "cross_account",
"trustingAccountId": "222222222222",
"trustedAccountId": "333333333333",
"externalId": true, // condition `sts:ExternalId` present
"principalOrgIdCondition": "o-abc", // if `aws:PrincipalOrgID` condition present
"trustPolicyHash": "<sha256 of AssumeRolePolicyDocument>"
}
}
If only the trusting role is in-scope (the trusted account is not scanned), we still emit the edge but the target is the existing external_aws_account placeholder. Stream 3's stitcher uses trustedAccountId to back-fill the edge if a later sibling-connector scan brings the other account into the graph.
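A sketch of the lookup-then-fallback logic. Two things here are assumptions rather than the transformer's actual behavior: the placeholder node ID format `external_aws_account:<account_id>`, and hashing a canonical JSON serialization (sorted keys, no whitespace) so the same policy hashes identically from either side. `build_trusts_edge` is a hypothetical name.

```python
import hashlib
import json
from typing import Collection, Optional


def trust_policy_hash(policy_document: dict) -> str:
    """sha256 over a canonical JSON form of the AssumeRolePolicyDocument (assumed canonicalization)."""
    canonical = json.dumps(policy_document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_trusts_edge(
    trusting_role_node_id: str,  # e.g. "aws_iam_role:222222222222:cross-account-data-reader"
    trusted_account_id: str,
    trusted_role_name: Optional[str],  # None when the policy names only the account
    known_node_ids: Collection[str],
    policy_document: dict,
) -> dict:
    """Target the real role node when it is in the scan, else the account placeholder."""
    trusting_account_id = trusting_role_node_id.split(":")[1]
    candidate = (
        f"aws_iam_role:{trusted_account_id}:{trusted_role_name}" if trusted_role_name else None
    )
    if candidate is not None and candidate in known_node_ids:
        target = candidate
    else:
        target = f"external_aws_account:{trusted_account_id}"  # assumed placeholder ID format
    return {
        "edgeId": f"trusts:{trusting_role_node_id}:{target}",
        "edgeType": "TRUSTS",
        "sourceNodeId": trusting_role_node_id,
        "targetNodeId": target,
        "properties": {
            "boundary": "cross_account",
            "trustingAccountId": trusting_account_id,
            "trustedAccountId": trusted_account_id,
            "trustPolicyHash": trust_policy_hash(policy_document),
        },
    }
```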
Federation edges (AWS role trusts external IdP)
This is what makes Entra-SP-via-OIDC visible to Stream 3. Trust-policy parser already emits oidc_providers and saml_providers. New edge:
// AWS role trusts an OIDC provider (GitHub Actions shown here; Entra follows the same shape)
{
"edgeId": "trusts_federation:aws_iam_role:222222222222:gh-deploy:oidc:token.actions.githubusercontent.com",
"edgeType": "TRUSTS",
"sourceNodeId": "aws_iam_role:222222222222:gh-deploy",
"targetNodeId": "external_oidc_provider:token.actions.githubusercontent.com",
"properties": {
"boundary": "cross_system",
"providerType": "oidc",
"providerUrl": "https://token.actions.githubusercontent.com",
"audience": "sts.amazonaws.com",
"subjectClaim": "repo:nimbus/inframgmt:ref:refs/heads/main", // condition StringEquals on token.actions.githubusercontent.com:sub
"trustPolicyHash": "<sha256>"
}
}
For Entra-SP federation specifically, the providerUrl will be https://sts.windows.net/<tenant_id>/ or https://login.microsoftonline.com/<tenant_id>/v2.0. Stream 3 correlates external_oidc_provider:sts.windows.net/<tenant_id> with the matching entra_service_principal:<tenant_id>:<sp_id> node coming from the Entra connector — the <tenant_id> substring is the join key.
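Extracting the join key from the two issuer URL shapes named above might look like this. `entra_tenant_id` is a hypothetical helper, and the pattern assumes the tenant ID is a 36-character GUID; other issuer forms are deliberately not matched.

```python
import re
from typing import Optional

_ENTRA_ISSUER = re.compile(
    r"^https://(?:sts\.windows\.net/(?P<v1>[0-9a-fA-F-]{36})/?"
    r"|login\.microsoftonline\.com/(?P<v2>[0-9a-fA-F-]{36})/v2\.0/?)$"
)


def entra_tenant_id(provider_url: str) -> Optional[str]:
    """Return the lowercased tenant ID for an Entra issuer URL, or None for other providers."""
    match = _ENTRA_ISSUER.match(provider_url.strip())
    if match is None:
        return None
    return (match.group("v1") or match.group("v2")).lower()
```

A non-Entra provider such as the GitHub Actions issuer returns None, so the stitcher would simply skip the Entra join for that edge.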
Source-record fingerprints
Every node and edge that originates from an AWS API response carries a properties.sourceFingerprint deterministic hash:
sourceFingerprint = sha256(
source_system_id // "aws_iam:222222222222"
+ ":"
+ source_record_id // role ARN
+ ":"
+ source_field_path // "AssumeRolePolicyDocument"
)
This is what Stream 3 uses as the stable join key when stitching across connectors and across re-scans. Every cross-account TRUSTS edge above carries trustPolicyHash (sha256 of the document body) so the stitcher can detect "same trust, scanned from both sides" without a fragile ARN+ARN exact match.
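In Python the fingerprint formula above reduces to a few lines with `hashlib`. The function name is hypothetical, and UTF-8 encoding of the joined string is an assumption.

```python
import hashlib


def source_fingerprint(source_system_id: str, source_record_id: str, source_field_path: str) -> str:
    """sha256 of system ID, record ID, and field path joined by ':' (hex digest)."""
    payload = ":".join((source_system_id, source_record_id, source_field_path))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the inputs are fixed identifiers rather than response bodies, the same role scanned twice yields the same fingerprint.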
CloudFormation / Terraform deployment
StackSet template (NEW: cfn/securityv0-readonly-role-stackset.yaml)
A thin wrapper around the existing securityv0-readonly-role.yaml:
- Permission model: SERVICE_MANAGED. No per-account admin role is required because the management account / delegated admin handles deployment. Customers without delegated admin fall back to SELF_MANAGED with AWSCloudFormationStackSetAdministrationRole / AWSCloudFormationStackSetExecutionRole (the standard pre-StackSet bootstrap pair).
- Deployment targets: OrganizationalUnitIds (configurable). The MediaPro pilot deploys to a single OU containing the 3 pilot accounts.
- Auto-deployment: Enabled: true so newly-added accounts in the OU automatically get the role. This is what makes ephemeral-account workflows (Stream 4) tractable: terraform apply adds an account, the StackSet fires, the role exists by the time terraform output returns the account ID.
- Capabilities: CAPABILITY_NAMED_IAM (we name the role SecurityV0ReadOnly).
- Stack instances per region: IAM roles are global, so the stack only needs to deploy in one region per account (typically us-east-1). The existing CFN already creates a global IAM role + managed policy.
The trust policy in the existing securityv0-readonly-role.yaml MUST add an aws:PrincipalOrgID condition alongside the existing sts:ExternalId for the org-mode deployment. This prevents a leaked ExternalId from being usable outside the org. Single-account / sandbox deployments without an Org ID fall back to ExternalId-only.
# delta to AssumeRolePolicyDocument:
Condition:
StringEquals:
'sts:ExternalId': !Ref ExternalId
'aws:PrincipalOrgID': !Ref OrganizationId # NEW, optional via !If
Terraform module (NEW: sv0-demo-labs/shared/securityv0-spoke-role/)
For Lab 2 / Stream 4. Provides the same role as a Terraform module that can be for_each-ed across a terraform-aws-modules/account instantiation. Stream 4's scaffold will look like:
module "spoke_role" {
for_each = toset(local.pilot_account_ids)
source = "../../shared/securityv0-spoke-role"
account_id = each.key
external_id = var.sv0_external_id
organization_id = data.aws_organizations_organization.this.id
providers = { aws = aws.member[each.key] }
}
This composes with the IaC up/scan/teardown lifecycle: terraform apply brings up accounts and roles, the connector scan runs, terraform destroy tears everything down. No manual click step.
Permission boundary
The existing SecurityV0ReadOnlyPolicy is already a "one explicit-allow per service" managed policy. We add a permission boundary on the role itself (SecurityV0ReadOnlyBoundary) that explicitly denies every write verb — Deny *:Put*, Deny *:Create*, Deny *:Delete*, Deny *:Update*, plus an explicit Deny secretsmanager:GetSecretValue and Deny ssm:GetParameter (read-but-leaks-secret guards already in the policy as exclusions, hardened here as permission-boundary denies). This is the artifact security teams want to see in their CFN review packet.
CLI / API surface
Honored end-to-end
# Existing single-account behavior (unchanged default)
sv0-aws scan --all
# Explicit account list
sv0-aws scan --all --accounts 111111111111,222222222222,333333333333
# Org auto-discovery
AWS_ORGANIZATION_ROLE_ARN=arn:aws:iam::management:role/SV0OrgDiscovery \
sv0-aws scan --all --discover-org
# Service-category subset (one or many)
sv0-aws scan --all --accounts 111,222,333 --services iam
sv0-aws scan --all --accounts 111,222,333 --services iam,bedrock
# Combined region + service + account scoping
sv0-aws scan --all --accounts 111,222,333 --regions us-east-1 --services iam,lambda
Integration with Stream 1's ConnectorInstance / ScanScope
In production the connector is invoked by Stream 1's worker, not from a human terminal. The CLI gains a --scope-json <file> flag that takes a JSON-serialized ScanScope (the AWS-specific extension above) and is mutually exclusive with --accounts / --regions / --services. The worker writes the scope to a tempfile and execs the connector. This is a forward-compatible bridge — when Stream 1 ships a Python entry-point that takes the scope object directly, the CLI wrapper falls away.
If Stream 1's exact ConnectorInstance schema is not yet published, we treat the scope object above as the working contract. The umbrella plan should reconcile field names if Stream 1 picks different ones.
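The flag interactions could be validated along these lines with `argparse`. This is a sketch only: the real `cli/main.py` layout is not shown in this document, so `build_parser`, `scope_conflict`, and `parse_scan_args` are hypothetical names, and all values are kept as raw strings.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sv0-aws scan")
    parser.add_argument("--all", action="store_true")
    parser.add_argument("--accounts", help="comma-separated account IDs")
    parser.add_argument("--discover-org", action="store_true")
    parser.add_argument("--regions", help="comma-separated regions")
    parser.add_argument("--services", help="comma-separated service categories")
    parser.add_argument("--scope-json", help="path to a JSON-serialized ScanScope")
    return parser


def scope_conflict(args: argparse.Namespace) -> Optional[str]:
    """Return an error message when mutually exclusive flags are combined, else None."""
    if args.accounts and args.discover_org:
        return "--accounts and --discover-org are mutually exclusive"
    if args.scope_json and (args.accounts or args.regions or args.services):
        return "--scope-json cannot be combined with --accounts/--regions/--services"
    return None


def parse_scan_args(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    problem = scope_conflict(args)
    if problem:
        parser.error(problem)  # prints usage to stderr and exits with status 2
    return args
```

Keeping the conflict check in a separate pure function makes the exit behavior easy to test without capturing stderr.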
Connector-report shape per (account × category) cell
The NormalizedGraph already has evidenceCompleteness.sources and scanScope. Multi-account requires sub-keying by (account_id, category):
{
"evidenceCompleteness": {
"sources": {
"aws_iam:111111111111": { "status": "available", "recordCount": 142, "apiCallCount": 7 },
"aws_iam:222222222222": { "status": "unavailable_no_access", "recordCount": 0, "errorCode": "AssumeRoleAccessDenied" },
"aws_iam:333333333333": { "status": "available", "recordCount": 89, "apiCallCount": 5 },
"aws_lambda:111111111111": { "status": "available", "recordCount": 23 },
"aws_lambda:222222222222": { "status": "partial", "recordCount": 17, "errorCode": "ThrottlingMaxRetries" },
"aws_lambda:333333333333": { "status": "available", "recordCount": 8 }
// ...
}
},
"scanScope": {
"mode": "targeted",
"sourceSystems": [
"aws_iam:111111111111", "aws_iam:222222222222", "aws_iam:333333333333",
"aws_lambda:111111111111", "aws_lambda:222222222222", "aws_lambda:333333333333"
],
"errors": {
"errorsEncountered": 1,
"permissionDenied": ["222222222222"]
}
}
}
The aws_iam:222222222222 cell failing does NOT remove aws_iam entities from the graph for accounts 111 and 333 — scanScope.sourceSystems lists every successful (category, account) pair. The platform's diff engine treats each cell independently for delete-eligibility.
Migration / backward compat
Existing single-account scans. A scan with no --accounts and no --discover-org falls through to a single-account run (one cell per category) against sts:GetCallerIdentity().Account. The connector-report shape changes: what was evidenceCompleteness.sources["aws_iam"] becomes evidenceCompleteness.sources["aws_iam:<account_id>"]. This is a one-time platform-side migration: the diff engine and UI already key on full source-system strings, so the colon suffix is transparent. A six-week "double-key" emit (both aws_iam and aws_iam:<id>) cushions the cutover.
Tenants without AWS Organizations. Org mode requires AWS_ORGANIZATION_ROLE_ARN. Tenants without an Org use explicit-list mode. There is no auto-fallback — being silently dropped from "scanning my whole org" to "scanning one account" is a worse failure mode than a clear "no org role configured, supply --accounts" error.
Compatibility with sv0-demo-lab-1. Lab 1 is a single AWS account scanned by a SecurityV0ReadOnly role within that same account. After this change, Lab 1 still works exactly as today: no --accounts, no --discover-org, the connector resolves to a one-cell-per-category single-account scan. The only visible change is the cell-keyed evidenceCompleteness shape, which Lab 1's tests must update to match.
Implementation plan
TDD-style. Group under phases. Repos: 🔵 = sv0-connectors, 🟢 = sv0-demo-labs/shared, 🟡 = sv0-platform (one tiny shape allowance only).
Phase 1: Org discovery + role-chain auth
- 🔵 T1.1. Write failing test: AWSClientAdapter.assume_role_into(account_id) returns a per-account-cached session; second call within 55min returns same creds. Then implement.
- 🔵 T1.2. Write failing test: OrganizationsDiscovery.list_active_accounts() returns an iterator of AWSAccount with ou_path populated from ListParents + DescribeOrganizationalUnit. Mock boto3 with moto or botocore.stub.Stubber. Then implement under sv0_aws/discovery/organizations.py.
- 🔵 T1.3. Write failing test: assume_role_into("999999999999") on AccessDenied returns a CellOutcome.failed(reason="assume_role_denied") instead of raising. Then implement (refactor _assume_role to accept target ARN parameter).
- 🔵 T1.4. Add --accounts and --discover-org flags to cli/main.py; mutually exclusive validation. Test argparse exit behavior.
Phase 2: Service-category scoping in CLI
- 🔵 T2.1. Define ServiceCategory enum + CATEGORY_TO_EXTRACTORS mapping. Test that the union of all category extractor sets equals the current monolithic extractor set (no regression).
- 🔵 T2.2. Add --services flag (comma-separated, validates against enum). Default = "all". Test that --services iam builds an executor that runs only IAM extractors.
- 🔵 T2.3. Add --scope-json flag that overrides --accounts/--regions/--services. Test round-trip: write scope file → exec connector → resulting scanScope.sourceSystems matches input.
Phase 3: Parallel per-(account × category) execution
- 🔵 T3.1. Write failing test: CellExecutor runs 12 cells (3 acc × 4 cat), one cell raising mid-extract; the 11 surviving cells produce data and the 1 failing cell produces a failed connector-report row. No exception escapes. Then implement under sv0_aws/orchestrator/cell_executor.py.
- 🔵 T3.2. Write failing test: scheduler runs at most 2 cells concurrently against the same account. Use a barrier or counter to assert. Then implement.
- 🔵 T3.3. Write failing test: per-cell api_calls_made and wall_time_seconds are populated in the connector-report row. Then implement (instrument extractors via a context manager).
- 🔵 T3.4. Reshape evidenceCompleteness.sources keys from aws_iam to aws_iam:<account_id>. Add a feature flag EVIDENCE_DOUBLE_KEY=true that emits both during the cutover. Update connector tests.
Phase 4: Cross-account node/edge emission
- 🔵 T4.1. Write failing test: a graph with workloads in 3 accounts emits exactly 3 aws_account nodes and 1 BELONGS_TO edge per workload. Then implement BELONGS_TO workload→account in transformer.
- 🔵 T4.2. Write failing test: when discover_org=true, the graph emits one aws_ou node per discovered OU and BELONGS_TO account→OU edges. Then implement.
- 🔵 T4.3. Write failing test: a role in account A whose trust policy lists account B (and account B IS in scope) emits a TRUSTS edge from aws_iam_role:A:roleA to aws_iam_role:B:roleB with boundary: cross_account and a trustPolicyHash. Then implement (refactor _create_trust_edges to look up real role nodes when present, fall back to external_aws_account when not).
- 🔵 T4.4. Write failing test: an OIDC-trusted role (Entra sts.windows.net/<tenant> issuer) emits a TRUSTS edge to an external_oidc_provider node with the tenant ID extractable from providerUrl. Then implement.
- 🔵 T4.5. Write failing test: every cross-account / federation edge carries a sourceFingerprint and trustPolicyHash. Then implement.
Phase 5: StackSet template + Terraform module
- 🔵 T5.1. Add aws:PrincipalOrgID condition support to cfn/securityv0-readonly-role.yaml behind a conditional parameter. Add cfn-lint to the connector CI.
- 🔵 T5.2. Add cfn/securityv0-readonly-role-stackset.yaml (SERVICE_MANAGED + auto-deployment, OUId-parameterized). Validate with aws cloudformation validate-template.
- 🟢 T5.3. Add Terraform module sv0-demo-labs/shared/securityv0-spoke-role/{main.tf,variables.tf,outputs.tf} that mirrors the CFN. terraform validate passes.
- 🔵 T5.4. Write integrations/aws/SETUP.md (referenced by README per sv0-connectors#89 P0-9 docs item) covering: explicit-list deploy, StackSet deploy, ExternalId rotation, troubleshooting AssumeRole denials.
Phase 6: Hardening + docs
- 🔵 T6.1. iam:SimulatePrincipalPolicy-based pre-flight permission check per (account, category). On failure → cell short-circuits with unavailable_no_access, no API calls made.
- 🔵 T6.2. Permission boundary SecurityV0ReadOnlyBoundary added to CFN + Terraform. Test that aws iam simulate-custom-policy denies iam:CreateRole.
- 🟡 T6.3. Platform diff engine: confirm (connector_id, source_system) keying tolerates the colon suffix aws_iam:<account>. Add a regression test if missing.
Total: 23 tasks across 6 phases.
Validation criteria
After Phase 1: sv0-aws scan --all --accounts 111,222,333 issues exactly 3 sts:AssumeRole calls (or 0 if cached), one per account, never chained. Failed AssumeRole on account 222 does not block 111 and 333.
After Phase 2: sv0-aws scan --all --services iam --accounts 111 issues IAM API calls only — lambda:ListFunctions, bedrock:ListAgents, etc. are absent from the captured boto3 trace.
After Phase 3: scanning {accounts: 3, services: 4} produces 12 connector-report rows in evidenceCompleteness.sources. Wall-clock time is ≤ 1.6× the longest single-cell time (proves parallelism, with overhead for serial-per-account capping). Forcing one cell to fail mid-flight (for example, by injecting an exception into its extractor, since os.kill signals processes and cannot target a single thread) results in 11 successful cells and 1 failed cell, and never an escaped exception.
After Phase 4: scanning the Lab 2 3-account topology (mp-security, mp-workloads, mp-data) emits:
- exactly 3 aws_account nodes
- ≥ 3 BELONGS_TO edges from scanned workloads (one per workload, to its account)
- 1 aws_ou node per OU in the path
- 1 BELONGS_TO edge per account to its OU
- 1 TRUSTS edge for the mp-workloads-to-mp-data cross-account role assume, with both source and target being real role nodes (NOT placeholder external_aws_account)
- the platform's path materializer renders this as a single authority path:
  Bedrock-agent → action-Lambda → Lambda-role → cross-account-trusts-edge → data-role → S3-bucket
After Phase 5: a terraform apply against 3 fresh AWS accounts deploys the spoke role into all 3 in one apply, and sv0-aws scan --all --accounts <those-3> succeeds with no manual click step. The CFN StackSet equivalent succeeds via aws cloudformation create-stack-instances.
After Phase 6: iam:SimulatePrincipalPolicy against the SecurityV0ReadOnly role with iam:CreateRole returns implicitDeny (boundary blocks even if some future policy attempt allows it).
Cost / API-call budget for Stream 4 sizing. Per scan of one account, default service set:
| Category | Calls (rough) | Notes |
|---|---|---|
| iam | 5–15 | one paginated GetAccountAuthorizationDetails, plus GenerateCredentialReport, plus per-role GetServiceLastAccessedDetails for top N |
| lambda | 1 + 3·(#functions) per region | list + get + getPolicy + listEventSourceMappings |
| bedrock | 5–20 per region | list + describe per agent / KB / flow |
| s3 | 1 + 8·(#buckets) | global list, then per-bucket policy/encryption/etc. |
| secrets | 1 + 2·(#secrets) per region | list + describe + getResourcePolicy |
| dynamodb_sns | 1 + 1·(#tables + #topics) per region | |
| step_functions | 1 + 1·(#machines) per region | |
| eventbridge | 1 + 1·(#rules + #connections) per region | |
| ecs_ecr | 5–30 per region | clusters + services + taskDefs + repos |
| cloudtrail | 30–300 per Lambda | S3 archive scan, budget-bounded to 600s/Lambda by default |
| access_analyzer | 2 + 1·(#findings) per region | |
| config | ~5 per region | rare in pilot accounts |
A small Lab-2-sized account (10 Lambdas, 2 Bedrock agents, 5 buckets, 5 secrets) costs ~150 API calls for the steady-state set excluding CloudTrail. CloudTrail evidence dominates — budget another ~3000 calls per account if the 30-day evidence window is on. Total per-scan order of magnitude: ~3,000 API calls per account; ~10,000 for the 3-account Lab 2; ~$0.05 in CloudTrail LookupEvents charges per scan, negligible against the rest of AWS pricing.
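The per-account figure can be reproduced from the table as a worked calculation. The sketch below is illustrative only: the function names are hypothetical, it assumes a single region, and it assumes zero tables, topics, state machines, rules, connections, and findings, since the Lab-2-sized example does not specify them. Ranged rows are carried as (low, high) pairs.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low, high) API-call estimate


def steady_state_calls(
    lambdas: int,
    buckets: int,
    secrets: int,
    tables_and_topics: int = 0,
    state_machines: int = 0,
    rules_and_connections: int = 0,
    findings: int = 0,
) -> Dict[str, Range]:
    """Per-category estimates for one account in one region, CloudTrail excluded."""

    def exact(n: int) -> Range:
        return (n, n)

    return {
        "iam": (5, 15),
        "lambda": exact(1 + 3 * lambdas),
        "bedrock": (5, 20),
        "s3": exact(1 + 8 * buckets),
        "secrets": exact(1 + 2 * secrets),
        "dynamodb_sns": exact(1 + tables_and_topics),
        "step_functions": exact(1 + state_machines),
        "eventbridge": exact(1 + rules_and_connections),
        "ecs_ecr": (5, 30),
        "access_analyzer": exact(2 + findings),
        "config": (5, 5),
    }


def total(ranges: Dict[str, Range]) -> Range:
    """Sum the low and high bounds across categories."""
    return (
        sum(low for low, _ in ranges.values()),
        sum(high for _, high in ranges.values()),
    )


def cloudtrail_calls(lambdas: int) -> Range:
    """CloudTrail row of the table: 30-300 calls per Lambda."""
    return (30 * lambdas, 300 * lambdas)


lab2 = steady_state_calls(lambdas=10, buckets=5, secrets=5)
print(total(lab2))           # (108, 158)
print(cloudtrail_calls(10))  # (300, 3000)
```

Under those assumptions the steady-state set lands between 108 and 158 calls, which brackets the ~150 estimate. The CloudTrail upper bound of 3,000 matches the per-account budget quoted above.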
Open questions
- Stream 1's ConnectorInstance.scanScope exact field names. RESOLVED in umbrella revision-1 contract lock: in Stream 1, ScanScopeDoc.scope_keys = { account_ids: string[], regions: string[] } (always plural arrays); service_categories[] is a top-level field on ScanScopeDoc, validated by the platform against ConnectorInstance.discovered_capabilities.service_categories_available.
- Per-account per-region rate-limit budgets vs per-account global. I'm using per-account-2-cell-cap as the simplest correct lower bound; finer per-(account,region,category) bucketing is sv0-platform#309 territory and explicitly out of scope here.
- Should the org-discovery role be one ARN (the management account) or a list (delegated admin per service)? Today: one ARN. AWS best practice is delegated admin per service (Config, Security Hub, Access Analyzer can each have a different delegated admin). Proposal: v1 supports one org-discovery ARN. If a customer needs delegated-admin granularity, they fall back to explicit --accounts. Revisit in v2.
- Is aws:PrincipalOrgID mandatory or optional? Making it mandatory in org-mode would prevent ExternalId-only sandbox usage. I propose: optional, but emitted as a warning in validate if absent in org-mode.
- Cross-account TRUSTS edge direction. I chose "source=trusting role, target=trusted principal" because that mirrors how path-materializer.ts traverses today (caller → callee). If Stream 3's stitcher prefers the opposite, this is a one-line transformer change.
- Scheduler implementation. ThreadPoolExecutor(max_workers=4) with semaphores, or a small asyncio orchestrator over thread-pool-wrapped extractors? The former is simpler, the latter composes better with eventual platform-side worker integration. I propose ThreadPoolExecutor for Phase 3, asyncio refactor when Stream 1's worker ships.
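The resolved ScanScopeDoc shape from the first question can be mirrored on the connector side roughly as follows. This is a hypothetical Python mirror of a platform-side type: the dataclasses and the unsupported_categories helper are illustrative names, and only the field names (scope_keys, account_ids, regions, service_categories) come from the contract described above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ScopeKeys:
    # Always plural arrays, even for a single account or region.
    account_ids: List[str] = field(default_factory=list)
    regions: List[str] = field(default_factory=list)


@dataclass
class ScanScopeDoc:
    scope_keys: ScopeKeys
    # Top-level field, not nested under scope_keys.
    service_categories: List[str] = field(default_factory=list)


def unsupported_categories(scope: ScanScopeDoc, available: Set[str]) -> List[str]:
    """Return requested categories missing from service_categories_available."""
    return [c for c in scope.service_categories if c not in available]
```

An empty result means the requested scope is valid against the connector's discovered capabilities. A non-empty result lists the categories to reject or warn about.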
References
- integrations/aws/src/sv0_aws/cli/main.py — current CLI entry point, single-account scan loop
- integrations/aws/src/sv0_aws/adapters/aws_client.py — _assume_role(), paginate_with_backoff, adaptive retry config
- integrations/aws/src/sv0_aws/config.py — AWS_ORGANIZATION_ROLE_ARN (declared, unused), CloudTrail org-trail config
- integrations/aws/src/sv0_aws/core/transformer.py:400-438 — existing _transform_accounts
- integrations/aws/src/sv0_aws/core/transformer.py:1170-1190 — placeholder external_aws_account (to be refactored)
- integrations/aws/src/sv0_aws/core/trust_policy_parser.py — already extracts trusted accounts / OIDC providers / SAML providers
- integrations/aws/cfn/securityv0-readonly-role.yaml — single-account CFN, basis for StackSet wrapper
- sv0-platform/src/ingestion/types.ts — ScanScope, NormalizedGraph, EvidenceCompletenessReport
- sv0-demo-labs/labs/sv0-demo-lab-1/main.tf — single-account Lab 1 (must continue to work)
- sv0-connectors#32 — multi-account acceptance criteria (this design fulfills)
- sv0-connectors#57 — CloudTrail org-trail multi-account discovery (already partially landed in CloudTrail extractor; rest of the connector follows here)
- sv0-connectors#89 P0-8 — pre-client P0 epic listing multi-account as alternative to "scope pilot to one account"
- sv0-platform#309 — multi-tenant connector throttling research (parked; this design respects its tactical mitigations and does not block on it)
- sv0-documentation#195 — MediaPro pilot readiness umbrella, multi-account on the must-ship list
- docs/architecture/research/2026-03-30-aws-integration-strategy.md §2, §7, §Phase 0 — multi-account customer shape, delegated-admin posture, STS chain limits
- docs/architecture/research/2026-03-11-aws-connector-research.md — earlier groundwork
- docs/plans/2026-04-08-demo-lab-plan.md §"Lab 2 — Nimbus Enterprise" — 3-account topology this connector must serve
- docs/architecture/05-connectors.md — connector interface invariants (ScanScope, NormalizedGraph, evidence completeness)