The Systems That Held Up
Every system I've built that survived production had one thing in common: the architecture wasn't clever. The decisions were clear, the constraints were respected, and when things changed — they always change — the system bent instead of breaking. The ones that didn't survive? Someone (often me) made assumptions early that turned out to be wrong, and the design made those assumptions expensive to fix.
This is the first in a series on how I approach system design.
Start With the Problem, Not the Solution
Before drawing any boxes or arrows, I try to answer a few things: What problem are we solving? Who are the users? What does success look like? And just as important — what is not required right now?
I once spent a week designing a service that would aggregate data from three upstream systems into a unified view. Clean architecture, well-separated concerns. Then someone asked a basic question: "Do all three systems update at the same frequency?" They didn't. One was near-real-time, one was batched daily, and one was manually updated by ops. The "unified view" was a fiction — the data was never consistent at any given point in time. We scrapped the design and built something much simpler with explicit staleness indicators.
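What "explicit staleness indicators" meant in practice was roughly this: each source carries its own freshness timestamp, and the view reports it instead of hiding it. A minimal sketch, with hypothetical source names and freshness budgets (the originals aren't in this article):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical per-source freshness budgets, mirroring the three update
# cadences described above: near-real-time, daily batch, manual ops updates.
FRESHNESS_BUDGET = {
    "events": timedelta(minutes=5),
    "billing": timedelta(days=1),
    "accounts": timedelta(days=30),
}

@dataclass
class SourceSnapshot:
    source: str
    payload: dict
    as_of: datetime  # when this source was last refreshed

    def is_stale(self, now: datetime) -> bool:
        return now - self.as_of > FRESHNESS_BUDGET[self.source]

def unified_view(snapshots: list[SourceSnapshot], now: datetime) -> dict:
    """Return each source's data with an explicit as_of and staleness flag,
    instead of pretending the combined view is consistent at one instant."""
    return {
        s.source: {
            "data": s.payload,
            "as_of": s.as_of.isoformat(),
            "stale": s.is_stale(now),
        }
        for s in snapshots
    }
```

The key design choice is that staleness is per-source and visible to the caller, so the UI can say "accounts data is 45 days old" rather than implying a single consistent snapshot.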
Understand the Data First
The most common blind spot in system design isn't picking the wrong database or the wrong protocol. It's assuming the data you need exists and is clean.
A time-series dashboard I designed assumed continuous data — daily aggregations over two years of records. A quick query showed that roughly 40% of historical records had created_at set to null, left behind by a migration 18 months earlier. That single query killed the original aggregation logic and split the pipeline into two paths: one for clean data, one for historical records using a derived date.
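The two-path split can be sketched in a few lines. This is illustrative, not the original pipeline: the real date derivation was domain-specific, and using updated_at as the fallback here is an assumption for the example.

```python
from datetime import datetime

def derive_date(record: dict) -> datetime:
    """Fallback for records whose created_at was nulled by the old migration.
    updated_at is a stand-in here; the real derivation was domain-specific."""
    return record["updated_at"]

def split_records(records: list[dict]) -> tuple[list, list]:
    """Route each record to one of two aggregation paths: clean records
    keyed on created_at, historical records keyed on a derived date."""
    clean, derived = [], []
    for r in records:
        if r.get("created_at") is not None:
            clean.append((r["created_at"], r))
        else:
            derived.append((derive_date(r), r))
    return clean, derived
```

The point isn't the routing itself — it's that the split exists at all, because a single aggregation path would silently drop or misdate 40% of the history.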
I'll go deeper into data-first design thinking in a follow-up article.
Define Constraints Early
A design that works in theory but falls apart under real infrastructure constraints isn't a valid design. Time, cost, team expertise, and pod memory limits are design inputs, not obstacles.
I built a real-time log processing system with a zero-drop requirement. The challenge wasn't the pipeline logic — it was that file sizes varied from a few kilobytes to hundreds of megabytes, and pods had fixed memory limits. Designing for the full range meant either over-provisioning every pod for the rare large files, wasting memory on the 99.9% that were small, or OOM-killing on the large ones. I set an 8MB threshold based on production distributions: files under it went through the real-time pipeline, the 0.1% above it routed to a separate workflow. The infrastructure constraints — not the business requirements — dictated that two-path split.
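The routing decision itself is trivial — the work was in picking the threshold from real distributions. A sketch of the dispatch, with illustrative path names:

```python
SIZE_THRESHOLD = 8 * 1024 * 1024  # 8 MB cutoff, chosen from production size distributions

def route_log_file(size_bytes: int) -> str:
    """Decide the processing path from file size alone, before any contents
    are read, so a pod never loads a file its memory limit can't hold."""
    if size_bytes <= SIZE_THRESHOLD:
        return "realtime"    # in-memory pipeline; covers ~99.9% of files
    return "large-file"      # separate workflow for the oversized tail
```

Checking size before reading is the whole trick: the memory limit is respected by construction, not by hoping large files are rare enough.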
Think in Growth Multipliers
Instead of designing for today alone, I think in stages:
| Scale | What Changes | Example |
|---|---|---|
| 1x | Keep it simple — one service, one database | Monolith with direct DB queries |
| 2-5x | Optimize the obvious hot paths | Add indexes, cache repeated reads |
| 10x | Decouple what's bottlenecking | Move async work to queues, split read/write paths |
| 100x | Rethink boundaries | Separate services, shard data, accept eventual consistency |
The goal isn't to design for 100x on day one. It's to make sure the 1x design doesn't actively block you from getting to 10x. If adding an index or swapping a synchronous call for a queue gets you through the next growth stage, the architecture is doing its job.
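The 10x move — swapping a synchronous call for a queue — can be sketched without any new infrastructure. This is a toy in-process version (all names are illustrative); in production the queue would be a broker like SQS or Kafka, but the shape of the change is the same:

```python
import queue

# Take slow work off the request path: enqueue it and let a worker
# drain the queue, instead of calling the slow operation inline.
work_queue: "queue.Queue[dict]" = queue.Queue()

def handle_request(payload: dict) -> str:
    work_queue.put(payload)   # enqueue instead of doing the slow work inline
    return "accepted"         # respond immediately; work happens async

def process(job: dict) -> None:
    pass  # placeholder for the formerly-synchronous slow call

def drain_worker() -> int:
    """Worker loop body: process everything currently queued."""
    handled = 0
    while not work_queue.empty():
        process(work_queue.get())
        handled += 1
    return handled
```

Notice what the 1x design had to get right for this to be cheap: the slow call was already behind a function boundary, so moving it behind a queue didn't force a rewrite.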
Evaluate Trade-offs Explicitly
Every design choice has costs. The skill isn't avoiding trade-offs — it's making them visible.
When we evaluated how to build a configurable onboarding system, three patterns kept surfacing in our design discussions:
| Choice | Pick When | What It Costs You |
|---|---|---|
| Backend-driven UI | Business needs to change flows without deploys | Debugging moves from code to config; UI and backend couple tightly |
| Configurable workflows | Rules change faster than your release cycle | Validation complexity grows with every new rule path |
| Microservices | Teams need to deploy and scale independently | You inherit distributed tracing, network failures, and operational overhead |
We went with a backend-driven UI for the onboarding system. It gave us the flexibility to change flows without frontend deploys — but it also shifted complexity into configuration validation and debugging. When something broke, the issue was rarely in the code. It was in a config rule that interacted with another rule in a way nobody anticipated. We shipped all three patterns across different projects, and each came with costs we had to manage.
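One mitigation for config-rule interactions is to validate rule sets at load time, so conflicts surface in CI rather than in production. A minimal sketch, with a hypothetical rule shape (step, next_step, when) that is not the actual system's schema:

```python
def validate_rules(rules: list[dict]) -> list[str]:
    """Return human-readable errors for rule combinations that conflict.
    Catches two interaction classes: dead-end routing and duplicate conditions."""
    errors = []
    seen_steps = {r["step"] for r in rules}
    for r in rules:
        # A rule routing to a step nobody defines is a silent dead end.
        target = r.get("next_step")
        if target is not None and target not in seen_steps:
            errors.append(f"rule for {r['step']!r} routes to undefined step {target!r}")
    # Two rules on the same step with identical conditions is exactly the
    # kind of interaction nobody anticipates until it fires in production.
    by_step: dict = {}
    for r in rules:
        by_step.setdefault(r["step"], []).append(r.get("when"))
    for step, conditions in by_step.items():
        if len(conditions) != len(set(conditions)):
            errors.append(f"step {step!r} has duplicate conditions")
    return errors
```

A validator like this doesn't remove the debugging cost of backend-driven UI — it just moves a slice of it from incident response to code review.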
Design for Failure, Not Just Scale
Systems don't just grow — they fail. And the failures that hurt most aren't the dramatic crashes. They're the silent ones: a system that keeps running but produces bad data, a deploy that partially succeeds, a state change that can't be rolled back cleanly.
A service config rollout went out mid-deploy. Half the in-flight requests used the old schema, half picked up the new one. No versioning on the config payload meant downstream consumers couldn't tell which responses were valid and which were malformed. The service never crashed — it kept running, producing data that looked correct but wasn't. Rolling back and reconciling the mixed-version output took longer than the deploy itself.
Versioning, graceful rollouts, and knowing what valid state looks like at every point in the pipeline — I build for these from the start now, not after the first incident.
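The versioning discipline is cheap to sketch: stamp every config payload with a schema version, reject what you can't parse, and migrate old shapes in one explicit place. Field names here are illustrative, not the actual service's schema:

```python
SUPPORTED_VERSIONS = {1, 2}

def parse_config(payload: dict) -> dict:
    """Parse a versioned config payload. Unversioned or unknown payloads
    fail loudly instead of silently producing plausible-but-wrong data."""
    version = payload.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported config schema_version: {version!r}")
    if version == 1:
        # Migrate the old shape to the current one in a single, explicit place.
        payload = {"schema_version": 2, "flags": payload.get("features", {})}
    return payload
```

With this in place, the mixed-version rollout described above becomes a loud, attributable failure — consumers on the old code reject v2 payloads immediately — instead of hours of reconciling output that merely looked correct.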
Communicate the Design
A good design that nobody understands is a liability. I think in layers:
- Level 1: High-level flow — for everyone. What goes in, what comes out, who's affected.
- Level 2: Services and data flow — for engineers. How components interact, where data lives, what protocols connect them.
- Level 3: Implementation details — only when needed. Specific algorithms, data structures, retry policies.
I've derailed design reviews by jumping straight to Level 3 — retry backoff strategies, partition key selection — and watching the room lose the thread within minutes. We hadn't agreed on the Level 1 flow yet. Starting at the wrong layer doesn't just waste time — it creates the illusion of alignment on details while the big picture is still unresolved.
Takeaway
Over time, this has become the mental model I default to: problem → data → constraints → growth → trade-offs → failure → communication. Not every design needs all seven steps, but skipping one usually shows up later as a surprise. Good systems aren't the ones that scale the furthest — they're the ones that fail predictably, recover cleanly, and evolve without forcing a rewrite.