IAM that scales: human roles, machine roles, blast radius

A practical mental model for designing IAM that survives growth, without becoming the team that rubber-stamps wildcards.

Published April 25, 2026 · 3 min read
  • IAM
  • Security
  • Cloud
  • Architecture

Every IAM problem I have walked into is the same problem in disguise: someone, three quarters ago, granted a wildcard “to unblock a deploy”, and nobody felt safe taking it back. Multiply that by twenty quarters and forty engineers, and you get the org you read about in incident reports.

IAM that scales is not “least privilege purity”. It is a system a busy team can keep clean while still shipping. Three rules.

1. Separate human roles from machine roles. Religiously.

A human signs in, opens a console, types a thing, walks away. A machine wakes up, makes 12 API calls, dies. They have completely different needs and completely different blast radii. Stop putting them in the same role.

For humans:

  • SSO, always. No long-lived IAM users.
  • A small number of named roles tied to job function: viewer, developer, oncall, admin. No personal roles.
  • Console access only. No programmatic credentials in someone’s ~/.aws/.
  • MFA non-negotiable. Hardware keys for admin.

For machines:

  • One role per workload. Not per team, not per environment, per workload. The s3-uploader-svc workload has its own role.
  • No long-lived secrets. Use OIDC / instance roles / workload identity / IRSA — the cloud-native equivalent for your platform.
  • Policies are infra-as-code, reviewed in PRs, signed off by an engineer who runs the workload.

When humans and machines share a role, you get the worst of both: humans inherit the workload’s noise; the workload inherits the human’s blast radius.
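Since machine policies are reviewed in PRs, a small check in CI can catch the wildcard before a reviewer has to. What follows is a minimal sketch, not a complete linter: it assumes AWS-style JSON policy documents, only looks at Allow statements, and ignores NotAction, NotResource, and conditions. The function name find_wildcards and the bucket name are hypothetical.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """Policy fields may hold a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcards(policy: dict) -> list[str]:
    """Return one finding per wildcard action or resource in Allow statements."""
    findings = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action")):
            # "*" grants everything; "s3:*" grants a whole service.
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {index}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource")):
            # Only a bare "*" is flagged; a prefix such as "bucket/incoming/*" is scoped.
            if resource == "*":
                findings.append(f"statement {index}: wildcard resource")
    return findings


# Hypothetical policy for the uploader workload: one action, one bucket, one prefix.
UPLOADER_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-uploads/incoming/*",
        }
    ],
}

if __name__ == "__main__":
    for finding in find_wildcards(UPLOADER_POLICY):
        print(finding)
```

Run against the uploader policy, the check reports nothing. Run against a policy with Action "s3:*" and Resource "*", it reports both, which is the conversation you want to have in the PR rather than three quarters later.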

2. Optimise for revocation, not for grants

Granting access is easy. Revoking access is the part that breaks.

Design with revocation as the first-class operation:

  • Group everything into revocable units. A team leaves: revoke the team’s group, not 47 personal grants. A workload is decommissioned: delete its role, and the rest of the policy graph stays clean.
  • Time-bound elevated access. “I need admin for 30 minutes to debug” is a workflow (for example, aws sts assume-role with a session policy passed via --policy and a short --duration-seconds). It is not a Slack request and a permanent grant.
  • Periodically diff actual usage vs granted permissions. AWS calls this “IAM Access Analyzer”. GCP calls it “Policy Intelligence”. Whatever it is called in your cloud, look at it. Quarterly.
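The time-bound elevation workflow can be sketched as a function that builds the request rather than a human who remembers to clean up. This is an illustration under assumptions: the keys match the parameters of the STS AssumeRole API (RoleArn, RoleSessionName, DurationSeconds, Policy), but build_elevation_request, the session naming scheme, and the 60-minute ceiling are hypothetical choices, and the role's own maximum session duration still applies.

```python
import json

MIN_SESSION_SECONDS = 900  # STS AssumeRole rejects sessions shorter than 15 minutes


def build_elevation_request(
    role_arn: str,
    user: str,
    minutes: int,
    allowed_actions: list[str],
    resources: list[str],
    max_minutes: int = 60,  # assumed team ceiling, not an AWS constant
) -> dict:
    """Build keyword arguments for a short-lived, narrowed AssumeRole call."""
    duration = minutes * 60
    if duration < MIN_SESSION_SECONDS:
        raise ValueError("session must last at least 15 minutes")
    if minutes > max_minutes:
        raise ValueError(f"elevation is capped at {max_minutes} minutes")
    # The session policy can only narrow the role: effective permissions are
    # the intersection of the role's policies and this document.
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": allowed_actions, "Resource": resources}
        ],
    }
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"elevated-{user}",
        "DurationSeconds": duration,
        "Policy": json.dumps(session_policy),
    }


if __name__ == "__main__":
    request = build_elevation_request(
        role_arn="arn:aws:iam::123456789012:role/admin",  # placeholder account ID
        user="alice",
        minutes=30,
        allowed_actions=["rds:DescribeDBInstances"],
        resources=["*"],
    )
    print(request["DurationSeconds"])
    # With boto3 installed, the dict could be passed on as:
    # boto3.client("sts").assume_role(**request)
```

The point of the design is that revocation is built in: the credentials expire on their own, so nobody has to feel safe taking them back.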

You will never write the perfect policy on day one. You can build a system that is easy to clean.
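The quarterly usage diff is, at its core, a set difference per role. Here is a rough sketch, assuming you have already exported two things from your cloud's tooling: the actions each role is granted and the actions each role was actually seen using. The function name, the input shape, and the sample data are hypothetical, and the comparison is by exact action name, so wildcards would need expanding first.

```python
def revocation_candidates(
    granted: dict[str, set[str]],
    used: dict[str, set[str]],
) -> dict[str, list[str]]:
    """Map each role to the granted actions it never used in the review window."""
    report = {}
    for role, actions in granted.items():
        # A role with no recorded usage at all has every action flagged.
        unused = actions - used.get(role, set())
        if unused:
            report[role] = sorted(unused)
    return report


# Hypothetical export for one review window.
GRANTED = {
    "s3-uploader-svc": {"s3:PutObject", "s3:DeleteObject"},
    "report-builder": {"s3:GetObject"},
}
USED = {
    "s3-uploader-svc": {"s3:PutObject"},
    "report-builder": {"s3:GetObject"},
}

if __name__ == "__main__":
    for role, actions in revocation_candidates(GRANTED, USED).items():
        print(role, actions)
```

A role that appears in the report with every one of its actions listed is usually a decommissioned workload, which makes it a candidate for deleting the role rather than trimming the policy.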

3. Blast radius is a number you can write down

Before granting any role to any human or workload, ask: “if this credential leaks tomorrow, what is the worst thing it can do, and how loud is the alarm?”

A useful exercise: write down, for each named role, the answer to those two questions in one sentence each. Pin it to the role definition.

| Role | Worst thing | Alarm |
| --- | --- | --- |
| developer | Deploy a new version of any non-prod service | CI shows it; commit links to the PR |
| oncall | Read prod data, restart prod pods | All actions audited; oncall schedule is public |
| admin | Anything | Hardware key; activity → Slack channel; quarterly review |
| s3-uploader-svc | Upload to one bucket, with one prefix | CloudTrail data event; per-prefix metric |

If you cannot fit the answers in one line, the role is doing too much. Split it.
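One way to pin those answers to the role definition is to keep them as data next to the policy and let CI complain when an answer stops fitting on one line. The sketch below is an assumption-heavy illustration: RoleCard, needs_split, and the 80-character limit are hypothetical, and the entries are copied from the table above.

```python
from dataclasses import dataclass

MAX_ANSWER_CHARS = 80  # assumed threshold for "fits in one line"


@dataclass(frozen=True)
class RoleCard:
    name: str
    worst_thing: str  # worst outcome if the credential leaks
    alarm: str        # how the misuse would be noticed


def needs_split(card: RoleCard, limit: int = MAX_ANSWER_CHARS) -> bool:
    """True when either answer spans several lines or exceeds the limit."""
    answers = (card.worst_thing, card.alarm)
    return any("\n" in answer or len(answer) > limit for answer in answers)


ROLE_CARDS = [
    RoleCard("developer", "Deploy a new version of any non-prod service",
             "CI shows it; commit links to the PR"),
    RoleCard("oncall", "Read prod data, restart prod pods",
             "All actions audited; oncall schedule is public"),
    RoleCard("admin", "Anything",
             "Hardware key; activity → Slack channel; quarterly review"),
    RoleCard("s3-uploader-svc", "Upload to one bucket, with one prefix",
             "CloudTrail data event; per-prefix metric"),
]

if __name__ == "__main__":
    for card in ROLE_CARDS:
        if needs_split(card):
            print(f"{card.name}: answers no longer fit on one line, consider splitting")
```

Character count is a crude proxy for "doing too much", so treat a failure as a prompt to review the role, not as a verdict.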


Most IAM disasters are not failures of permission models. They are failures of attention. The team stopped paying attention to the policy graph because the policy graph stopped being readable. Keep it readable, keep revocation cheap, and keep the blast radius for each credential a number a human can hold in their head.

That is IAM that lasts.