The First Million is the Hardest
When you're building for a handful of users, almost any architecture works. Rails with SQLite? Sure. A single Heroku dyno? Why not. The problems start when success arrives—and it always arrives faster than you expect.
We've had the privilege of scaling several platforms from zero to millions of users. Here's what we wish we'd known from the start.
Lesson 1: Your Database is the Bottleneck
Every scaling story eventually becomes a database story. The queries that took 5ms with 1,000 rows now take 5 seconds with 10 million. The joins that seemed elegant are now killing your servers.
What works:
- Read replicas from day one
- Aggressive caching at the application layer
- Strategic denormalization (yes, really)
- Connection pooling with something like PgBouncer
What doesn't:
- "We'll optimize later"
- Assuming your ORM knows best
- Ignoring query plans
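The read-replica pattern above can be sketched as a thin router that sends reads to replicas and everything else to the primary. This is an illustrative sketch, not a library API: the pool objects stand in for real connection pools (e.g. `pg.Pool` instances behind PgBouncer), and the `isReadQuery` heuristic is an assumption for demonstration.

```javascript
// Sketch of read/write routing across a primary and its replicas.
// The pools are stand-ins for real connection pools (assumption).
function createRouter(primary, replicas) {
  let next = 0;
  // Naive heuristic (assumption): treat SELECTs as safe to route to replicas.
  const isReadQuery = (sql) => /^\s*select\b/i.test(sql);
  return {
    query(sql, params) {
      if (isReadQuery(sql) && replicas.length > 0) {
        // Round-robin across replicas to spread read load.
        const replica = replicas[next++ % replicas.length];
        return replica.query(sql, params);
      }
      // Writes (and anything ambiguous) always go to the primary.
      return primary.query(sql, params);
    },
  };
}
```

In practice the routing decision also has to account for replication lag: a read that must observe a just-committed write still belongs on the primary.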
Lesson 2: Async Everything
The fastest API response is the one that doesn't wait. Any operation that can happen asynchronously should happen asynchronously.
// Instead of this
await sendWelcomeEmail(user);
await createAnalyticsEvent(user);
await notifySlack(user);

// Do this
await Promise.all([
  queue.add('email:welcome', { userId: user.id }),
  queue.add('analytics:signup', { userId: user.id }),
  queue.add('slack:notify', { userId: user.id }),
]);
Background jobs aren't just for heavy lifting—they're for anything that isn't directly needed for the response.
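On the other side of the queue, a worker consumes those jobs. The snippet above assumes a BullMQ-style `queue.add` API; as a hedged illustration of the producer/worker split, here is a minimal in-memory equivalent — a sketch of the shape, not a production queue (a real one would be Redis-backed and survive restarts):

```javascript
// Minimal in-memory job queue illustrating the enqueue/worker split.
// All names here are illustrative, not a real queue library's API.
function createQueue() {
  const handlers = new Map();
  const pending = [];
  return {
    // Producer side: add() returns immediately; no work happens inline.
    add(name, payload) {
      pending.push({ name, payload });
    },
    // Worker side: register one handler per job name.
    process(name, handler) {
      handlers.set(name, handler);
    },
    // Drain loop; a real worker would poll or block on the broker.
    async drain() {
      while (pending.length > 0) {
        const job = pending.shift();
        const handler = handlers.get(job.name);
        if (handler) await handler(job.payload);
      }
    },
  };
}
```

The point of the split: the request handler's latency is just the enqueue, while the handler registered via `process` runs on worker machines at its own pace.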
Lesson 3: Cache Like Your Life Depends On It
At scale, cache invalidation isn't just one of the two hard problems in computer science—it's your entire job.
Our caching strategy:
- Edge caching for static assets and API responses that rarely change
- Application cache (Redis) for computed data and session state
- Request-level memoization for data accessed multiple times per request
The key insight: cache at the highest level possible. An edge-cached response never hits your servers at all.
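The innermost layer, request-level memoization, can be as small as a per-request `Map` keyed by argument. A minimal sketch (the wrapped loader is hypothetical); note it caches the promise itself, so concurrent calls within one request share a single fetch:

```javascript
// Per-request memoization: the cache lives only as long as the request.
function memoizePerRequest(fn) {
  const cache = new Map();
  return async (key) => {
    if (!cache.has(key)) {
      // Store the promise, not the value, so concurrent callers
      // awaiting the same key trigger exactly one underlying fetch.
      cache.set(key, fn(key));
    }
    return cache.get(key);
  };
}

// Hypothetical usage: wrap once at the start of each request.
// const loadUser = memoizePerRequest((id) => db.users.findById(id));
```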
Lesson 4: Observability is Non-Negotiable
You cannot improve what you cannot measure. Before we scale anything, we ensure we can answer:
- What's the p99 latency of every endpoint?
- Where is time being spent in each request?
- What's the error rate, and what's causing errors?
- How close are we to resource limits?
Tools we love: Datadog for APM, Sentry for errors, custom dashboards for business metrics.
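Computing p99 from raw samples is a useful sanity check even with an APM in place. A minimal nearest-rank percentile sketch (a production agent would use a streaming histogram instead of sorting every sample):

```javascript
// Nearest-rank percentile over recorded latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```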
Lesson 5: Plan for Failure
At scale, failure isn't a possibility—it's a certainty. Servers will crash. Databases will hiccup. Third-party APIs will time out.
Design for graceful degradation:
- Circuit breakers for external dependencies
- Fallback content when caches miss
- Queue-based processing that can retry
- Feature flags to disable problematic features instantly
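The first item above, a circuit breaker, can be sketched as a wrapper that fast-fails after a run of consecutive errors and allows a retry once a cooldown passes. The threshold and cooldown values here are illustrative, and real implementations add a half-open probe state:

```javascript
// Minimal circuit breaker: open after `threshold` consecutive failures,
// fast-fail while open, then allow calls again after `cooldownMs`.
function circuitBreaker(fn, { threshold = 5, cooldownMs = 30000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return async (...args) => {
    if (failures >= threshold && Date.now() - openedAt < cooldownMs) {
      // Open state: fail fast instead of hammering a struggling dependency.
      throw new Error('circuit open');
    }
    try {
      const result = await fn(...args);
      failures = 0; // Any success closes the circuit.
      return result;
    } catch (err) {
      failures++;
      if (failures >= threshold) openedAt = Date.now();
      throw err;
    }
  };
}
```

Paired with fallback content, the open circuit becomes a degraded-but-fast response rather than a cascading timeout.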
The Payoff
Building for scale requires upfront investment, but the payoff is profound: systems that handle traffic spikes without paging engineers, that grow with your business instead of constraining it, that let you sleep soundly at night.
That's what we build. Every time.