IAM that scales: human roles, machine roles, blast radius
A practical mental model for designing IAM that survives growth without turning into the team that rubber-stamps wildcards.
- IAM
- Security
- Cloud
- Architecture
Every IAM problem I have walked into is the same problem in disguise: someone, three quarters ago, granted a wildcard “to unblock a deploy”, and nobody felt safe taking it back. Multiply that by twenty quarters and forty engineers, and you get the org you read about in incident reports.
IAM that scales is not “least privilege purity”. It is a system a busy team can keep clean while still shipping. Three rules.
1. Separate human roles from machine roles. Religiously.
A human signs in, opens a console, types a thing, walks away. A machine wakes up, makes 12 API calls, dies. They have completely different needs and completely different blast radii. Stop putting them in the same role.
For humans:
- SSO, always. No long-lived IAM users.
- A small number of named roles tied to job function: `viewer`, `developer`, `oncall`, `admin`. No personal roles.
- Console access only. No programmatic credentials in someone’s `~/.aws/`.
- MFA non-negotiable. Hardware keys for `admin`.
For machines:
- One role per workload. Not per team, not per environment, per workload. The S3-uploader-service has its own role.
- No long-lived secrets. Use OIDC / instance roles / workload identity / IRSA — the cloud-native equivalent for your platform.
- Policies are infra-as-code, reviewed in PRs, signed off by an engineer who runs the workload.
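As a minimal sketch of "one role per workload" kept as reviewable code: the snippet below builds a policy for a single uploader workload and runs a wildcard check that could gate a PR. The function names (`uploader_policy`, `find_wildcards`), the bucket, and the prefix are hypothetical. Only the JSON shape follows the standard AWS IAM policy grammar.

```python
import json


def uploader_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy allowing uploads to one prefix of one bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "UploadToOnePrefix",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


def find_wildcards(policy: dict) -> list[str]:
    """Return Allow-statement actions or resources that are too broad.

    Flags any action containing "*" and any resource that is exactly "*".
    A scoped resource such as "arn:aws:s3:::bucket/prefix/*" is accepted.
    """
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # IAM allows a bare string in place of a one-element list.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        findings += [f"action {a}" for a in actions if "*" in a]
        findings += [f"resource {r}" for r in resources if r == "*"]
    return findings


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    policy = uploader_policy("example-uploads", "incoming")
    print(json.dumps(policy, indent=2))
    print(find_wildcards(policy))
```

A check like this is deliberately crude. Its job is to make the wildcard visible in review, so that whoever approves the PR has to say yes to it explicitly.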
When humans and machines share a role, you get the worst of both: humans inherit the workload’s noise; the workload inherits the human’s blast radius.
2. Optimise for revocation, not for grants
Granting access is easy. Revoking access is the part that breaks.
Design with revocation as the first-class operation:
- Group everything into revocable units. A team leaves: revoke the team’s group, not 47 personal grants. A workload is decommissioned: delete its role, and the rest of the policy graph stays clean.
- Time-bound elevated access. “I need admin for 30 minutes to debug” is a workflow (`aws sts assume-role` with a session policy and a max duration). It is not a Slack request and a permanent grant.
- Periodically diff actual usage vs granted permissions. AWS calls this “Access Analyzer”. GCP calls it “Policy Intelligence”. Whatever it is called in your cloud, look at it. Quarterly.
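The time-bound elevation workflow above can be sketched as a small helper that prepares the arguments for an STS AssumeRole call. The helper name `build_elevation_request`, the role ARN, and the example session policy are assumptions for illustration. The keys `RoleArn`, `RoleSessionName`, `DurationSeconds`, and `Policy` are the real AssumeRole parameters.

```python
import json
import re

MIN_SECONDS = 900  # STS rejects AssumeRole sessions shorter than 15 minutes


def build_elevation_request(role_arn: str, user: str, minutes: int,
                            session_policy: dict) -> dict:
    """Build keyword arguments for a time-boxed STS AssumeRole call.

    The session policy can only narrow what the role already allows;
    the effective permissions are the intersection of the two.
    """
    seconds = minutes * 60
    if seconds < MIN_SECONDS:
        raise ValueError("STS sessions must last at least 15 minutes")
    # Session names accept only word characters and "+=,.@-", max 64 chars.
    session_name = re.sub(r"[^\w+=,.@-]", "-", f"elevated-{user}",
                          flags=re.ASCII)[:64]
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": seconds,
        "Policy": json.dumps(session_policy),
    }


if __name__ == "__main__":
    # Hypothetical debug session: read logs only, for 30 minutes.
    debug_only = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["logs:GetLogEvents", "logs:FilterLogEvents"],
            "Resource": "*",
        }],
    }
    request = build_elevation_request(
        "arn:aws:iam::123456789012:role/admin", "alice", 30, debug_only)
    print(request["RoleSessionName"], request["DurationSeconds"])
    # With boto3 installed and credentials configured, the call would be:
    #   boto3.client("sts").assume_role(**request)
```

The requested duration is also capped by the maximum session duration configured on the role itself, so the role definition remains the upper bound whatever the caller asks for.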
You will never write the perfect policy on day one. You can build a system that is easy to clean.
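The quarterly usage-versus-grants diff reduces to a set comparison once the data is exported. The sketch below assumes you already have two lists: the action patterns a role grants, and the actions it was actually observed using (for example from your cloud's audit log or analyzer export). The function name `unused_grants` is hypothetical, and nothing here calls a cloud API.

```python
from fnmatch import fnmatchcase


def unused_grants(granted: list[str], used: list[str]) -> list[str]:
    """Return granted action patterns that no observed action matched.

    Matching is case-insensitive and honours "*" and "?" wildcards,
    mirroring how IAM action names are compared.
    """
    used_lower = [action.lower() for action in used]
    return [
        pattern
        for pattern in granted
        if not any(fnmatchcase(action, pattern.lower()) for action in used_lower)
    ]


if __name__ == "__main__":
    granted = ["s3:PutObject", "s3:DeleteObject", "sqs:*"]
    used = ["s3:putobject", "sqs:SendMessage"]
    print(unused_grants(granted, used))  # ['s3:DeleteObject']
```

Anything the function returns is a candidate for removal, not proof. A permission used once a year at audit time still looks unused in a 90-day window.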
3. Blast radius is a number you can write down
Before granting any role to any human or workload, ask: “if this credential leaks tomorrow, what is the worst thing it can do, and how loud is the alarm?”
A useful exercise: write down, for each named role, the answer to those two questions in one sentence each. Pin it to the role definition.
| Role | Worst thing | Alarm |
|---|---|---|
| `developer` | Deploy a new version of any non-prod service | CI shows it; commit links to the PR |
| `oncall` | Read prod data, restart prod pods | All actions audited; oncall schedule is public |
| `admin` | Anything | Hardware key; activity → Slack channel; quarterly review |
| `s3-uploader-svc` | Upload to one bucket, with one prefix | CloudTrail data event; per-prefix metric |
If you cannot fit the answers in one line, the role is doing too much. Split it.
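The one-line rule is easy to enforce mechanically if the table lives next to the role definitions. This is a sketch under assumptions: the `BlastRadius` record, the `roles_to_split` helper, and the 100-character budget for "one line" are all hypothetical choices, not part of any cloud tooling.

```python
from dataclasses import dataclass

MAX_CHARS = 100  # assumed budget for "one line"; tune to taste


@dataclass(frozen=True)
class BlastRadius:
    role: str
    worst_thing: str
    alarm: str


def roles_to_split(entries: list[BlastRadius],
                   max_chars: int = MAX_CHARS) -> list[str]:
    """Return roles whose answers are missing, multi-line, or too long."""
    flagged = []
    for entry in entries:
        for answer in (entry.worst_thing, entry.alarm):
            text = answer.strip()
            if not text or "\n" in text or len(text) > max_chars:
                flagged.append(entry.role)
                break  # one bad answer is enough to flag the role
    return flagged


if __name__ == "__main__":
    entries = [
        BlastRadius("s3-uploader-svc",
                    "Upload to one bucket, with one prefix",
                    "CloudTrail data event; per-prefix metric"),
        # Hypothetical role with no alarm written down.
        BlastRadius("everything-svc", "Read prod data\nDelete buckets", ""),
    ]
    print(roles_to_split(entries))  # ['everything-svc']
```

A length check cannot judge whether an answer is honest, but it does catch the two common failures: the role nobody documented and the role that needs a paragraph.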
Most IAM disasters are not failures of permission models. They are failures of attention. The team stopped paying attention to the policy graph because the policy graph stopped being readable. Keep it readable, keep revocation cheap, and keep the blast radius for each credential a number a human can hold in their head.
That is IAM that lasts.