
Boring cloud architectures that survive growth

Three principles I keep coming back to when teams ask me what their cloud should look like once it stops being a prototype.

Published April 30, 2026 · 3 min read
  • Cloud
  • Architecture
  • Postgres
  • Reliability

A startup’s first cloud diagram is a Pinterest board. Six services, two queues, a Lambda named “magic”, a CDN that does too much, and a Notion page that explains it. Six months in, the diagram is the same — except now there are three people on call and an unspoken agreement not to touch the CDN config.

Boring cloud is the opposite of that. It is the architecture you can hand to a new senior engineer and have them productive on day three. Here are the three rules I lean on.

1. One database, until it hurts

I have lost count of the times I have seen a team split their data across three databases before they had three customers. “Postgres for users, DynamoDB for events, Redis for sessions” — written on a whiteboard while still on the trial tier.

Almost every product can ride a single Postgres for the first 18 months of growth. Postgres does JSON, full-text search, time series via partitions, queues via SKIP LOCKED, and vector search via pgvector. The payoff of monoculture is lower operational cognitive load, which is the resource that runs out first when a team grows.
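The queue claim, for instance, needs nothing beyond plain SQL. A minimal sketch of what a worker would run, assuming a hypothetical `jobs` table with `id`, `status`, and `payload` columns (the statements here are defined as strings so any driver, psycopg or asyncpg, could execute them verbatim):

```python
# Sketch of a Postgres-backed job queue built on FOR UPDATE SKIP LOCKED.
# The table name and columns are assumptions for illustration.

ENQUEUE_SQL = """
INSERT INTO jobs (payload, status)
VALUES (%(payload)s, 'pending');
"""

# Each worker claims exactly one pending job. SKIP LOCKED makes
# concurrent workers skip rows another transaction has already locked,
# so no two workers ever process the same job.
DEQUEUE_SQL = """
UPDATE jobs
SET status = 'running'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, payload;
"""
```

Run the dequeue inside a transaction and delete (or mark done) on commit; a crashed worker's lock simply evaporates and the job becomes claimable again.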

Add a second store only when you have a measured reason: a sustained query that the planner cannot rescue, a write pattern that breaks vacuum, or a workload (full-text on tens of millions of rows) that demands a specialised engine. Pick that store carefully. The point is not to never split — it is to split with evidence.

2. Stateless compute, statefully observed

Compute should be replaceable. If a single pod or container holds onto state — a cache, a session, a counter — it is a future incident. Push state down to the database (or a dedicated cache like Redis), keep compute boring, and your scaling problem becomes “add a replica” instead of “carefully drain everything”.
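The difference is easiest to see with a counter. Held in process memory, it dies with the pod and diverges across replicas; pushed to a shared store, any replica can serve any request. A toy sketch, with a dict standing in for something like Redis (the store interface here is invented for illustration):

```python
# A dict stands in for a shared store such as Redis; the point is the
# shape of the code, not the backend.
class Store:
    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    def incr(self, key: str) -> int:
        # Redis INCR is atomic; this single-process stand-in does not
        # need to be, which is fine for a sketch.
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]

store = Store()  # represents the external store, not process state

def handle_request(user: str) -> int:
    """Stateless handler: all state lives in the store, none in the process."""
    return store.incr(f"requests:{user}")
```

Because the handler holds nothing, "add a replica" really is the whole scaling story; draining a node is just stopping it.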

The flip side: those replaceable workloads must be observed individually. Every request gets a request id; every job gets a job id; every span ties back to the user that triggered it. When something blows up at 03:14, you should be able to read the trace of one request and know which line of code did it. Not infer. Read.

A traceparent header propagated end-to-end is more valuable than three monitoring vendors.

3. The platform is its boundaries

The interesting part of an architecture is not what is inside it — every cloud has compute, storage, queues. The interesting part is the boundaries you draw. Where does the public internet end? Where does PII stop being allowed to flow? Which services may call which other services?

Boundaries are cheap to encode in IaC and expensive to discover during incidents. So encode them:

  • A single VPC with subnets that map to trust levels (public / app / data).
  • One IAM principal per service, with the smallest sensible policy.
  • Egress via a single NAT or proxy, logged.
  • Secrets in a secrets manager, never in env files committed to anything.

Most “AWS bills exploded” stories I have walked into trace back to fuzzy boundaries. A staging environment that could reach production. A Lambda with an inherited role. A bucket with an inherited policy.


Boring cloud does not mean small cloud. It means a cloud whose surprises happen on purpose. New things land deliberately, the team understands what is in production, and the on-call rotation rotates instead of accumulating in one person’s calendar.

That is the architecture worth growing into.
