The First Million is the Hardest
When you're building for a handful of users, almost any architecture works. Rails with SQLite? Sure. A single Heroku dyno? Why not. The problems start when success arrives—and it always arrives faster than you expect.
We've had the privilege of scaling several platforms from zero to millions of users. Here's what we wish we'd known from the start.
Lesson 1: Your Database is the Bottleneck
Every scaling story eventually becomes a database story. The queries that took 5ms with 1,000 rows now take 5 seconds with 10 million. The joins that seemed elegant are now killing your servers.
What works:
- Read replicas from day one
- Aggressive caching at the application layer
- Strategic denormalization (yes, really)
- Connection pooling with something like PgBouncer
What doesn't:
- "We'll optimize later"
- Assuming your ORM knows best
- Ignoring query plans
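The read-replica pattern above can be sketched as a thin router that sends reads to replicas and everything else to the primary. This is an illustrative sketch, not a library API: the pool objects stand in for real connection pools (e.g. `pg.Pool` instances behind PgBouncer), and the `isReadQuery` heuristic is an assumption for demonstration.

```javascript
// Sketch of read/write routing across a primary and its replicas.
// The pools are stand-ins for real connection pools (assumption).
function createRouter(primary, replicas) {
  let next = 0;
  // Naive heuristic (assumption): treat SELECTs as safe to route to replicas.
  const isReadQuery = (sql) => /^\s*select\b/i.test(sql);
  return {
    query(sql, params) {
      if (isReadQuery(sql) && replicas.length > 0) {
        // Round-robin across replicas to spread read load.
        const replica = replicas[next++ % replicas.length];
        return replica.query(sql, params);
      }
      // Writes (and anything ambiguous) always go to the primary.
      return primary.query(sql, params);
    },
  };
}
```

In practice the routing decision also has to account for replication lag: a read that must observe a just-committed write still belongs on the primary.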
Lesson 2: Async Everything
The fastest API response is the one that doesn't wait. Any operation that can happen asynchronously should happen asynchronously.
// Instead of this
await sendWelcomeEmail(user);
await createAnalyticsEvent(user);
await notifySlack(user);

// Do this
await Promise.all([
  queue.add('email:welcome', { userId: user.id }),
  queue.add('analytics:signup', { userId: user.id }),
  queue.add('slack:notify', { userId: user.id }),
]);
Background jobs aren't just for heavy lifting—they're for anything that isn't directly needed for the response.
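On the other side of the queue, a worker consumes those jobs. The snippet above assumes a BullMQ-style `queue.add` API; as a hedged illustration of the producer/worker split, here is a minimal in-memory equivalent — a sketch of the shape, not a production queue (a real one would be Redis-backed and survive restarts):

```javascript
// Minimal in-memory job queue illustrating the enqueue/worker split.
// All names here are illustrative, not a real queue library's API.
function createQueue() {
  const handlers = new Map();
  const pending = [];
  return {
    // Producer side: add() returns immediately; no work happens inline.
    add(name, payload) {
      pending.push({ name, payload });
    },
    // Worker side: register one handler per job name.
    process(name, handler) {
      handlers.set(name, handler);
    },
    // Drain loop; a real worker would poll or block on the broker.
    async drain() {
      while (pending.length > 0) {
        const job = pending.shift();
        const handler = handlers.get(job.name);
        if (handler) await handler(job.payload);
      }
    },
  };
}
```

The point of the split: the request handler's latency is just the enqueue, while the handler registered via `process` runs on worker machines at its own pace.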
Lesson 3: Cache Like Your Life Depends On It
At scale, cache invalidation isn't just one of the two hard problems in computer science—it's your entire job.
Our caching strategy:
- Edge caching for static assets and API responses that rarely change
- Application cache (Redis) for computed data and session state
- Request-level memoization for data accessed multiple times per request
The key insight: cache at the highest level possible. An edge-cached response never hits your servers at all.
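The innermost layer, request-level memoization, can be as small as a per-request `Map` keyed by argument. A minimal sketch (the wrapped loader is hypothetical); note it caches the promise itself, so concurrent calls within one request share a single fetch:

```javascript
// Per-request memoization: the cache lives only as long as the request.
function memoizePerRequest(fn) {
  const cache = new Map();
  return async (key) => {
    if (!cache.has(key)) {
      // Store the promise, not the value, so concurrent callers
      // awaiting the same key trigger exactly one underlying fetch.
      cache.set(key, fn(key));
    }
    return cache.get(key);
  };
}

// Hypothetical usage: wrap once at the start of each request.
// const loadUser = memoizePerRequest((id) => db.users.findById(id));
```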
Lesson 4: Observability is Non-Negotiable
You cannot improve what you cannot measure. Before we scale anything, we ensure we can answer:
- What's the p99 latency of every endpoint?
- Where is time being spent in each request?
- What's the error rate, and what's causing errors?
- How close are we to resource limits?
Tools we love: Datadog for APM, Sentry for errors, custom dashboards for business metrics.
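Computing p99 from raw samples is a useful sanity check even with an APM in place. A minimal nearest-rank percentile sketch (a production agent would use a streaming histogram instead of sorting every sample):

```javascript
// Nearest-rank percentile over recorded latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```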
Lesson 5: Plan for Failure
At scale, failure isn't a possibility—it's a certainty. Servers will crash. Databases will hiccup. Third-party APIs will time out.
Design for graceful degradation:
- Circuit breakers for external dependencies
- Fallback content when caches miss
- Queue-based processing that can retry
- Feature flags to disable problematic features instantly
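The first item above, a circuit breaker, can be sketched as a wrapper that fast-fails after a run of consecutive errors and allows a retry once a cooldown passes. The threshold and cooldown values here are illustrative, and real implementations add a half-open probe state:

```javascript
// Minimal circuit breaker: open after `threshold` consecutive failures,
// fast-fail while open, then allow calls again after `cooldownMs`.
function circuitBreaker(fn, { threshold = 5, cooldownMs = 30000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return async (...args) => {
    if (failures >= threshold && Date.now() - openedAt < cooldownMs) {
      // Open state: fail fast instead of hammering a struggling dependency.
      throw new Error('circuit open');
    }
    try {
      const result = await fn(...args);
      failures = 0; // Any success closes the circuit.
      return result;
    } catch (err) {
      failures++;
      if (failures >= threshold) openedAt = Date.now();
      throw err;
    }
  };
}
```

Paired with fallback content, the open circuit becomes a degraded-but-fast response rather than a cascading timeout.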
The Payoff
Building for scale requires upfront investment, but the payoff is profound: systems that handle traffic spikes without paging engineers, that grow with your business instead of constraining it, that let you sleep soundly at night.
That's what we build. Every time.