Technology · Software Development · Kubernetes · DevOps
The Challenge

A growing SaaS platform's monolithic architecture was causing slow deployments, frequent outages, and blocking feature development across teams.

One massive codebase doing everything: user accounts, billing, the core product, integrations. It had clearly worked well enough when the company was small, but they had millions of users now and around 40 engineers, all tripping over each other in the same repo.

Deployments took the better part of a day. A dodgy change in the billing module could, and regularly did, take down user-facing features. Teams queued up behind each other waiting to ship. A single failed deployment meant rolling back everyone’s work, not just the person who broke it. Production incidents were happening weekly, sometimes more.

The platform had outgrown its architecture, and it was costing them real money.

The Solution

We went with the strangler fig pattern. No big-bang rewrite. We had a long conversation with their lead architect about whether to attempt a clean cutover, but the risk was too high for a platform serving live traffic around the clock. Incremental was the only sensible path.
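In practice, the strangler fig pattern comes down to a routing layer: migrated path prefixes go to new services, and everything else keeps hitting the monolith until its turn comes. A minimal sketch of that idea (the service names and URLs here are illustrative, not the client's actual setup):

```python
# Strangler-fig routing sketch: migrated path prefixes are sent to new
# services; anything not yet migrated falls through to the monolith.
# All backend names/URLs below are hypothetical.

MIGRATED_ROUTES = {
    "/billing": "http://billing-service",
    "/users": "http://user-service",
}
MONOLITH = "http://legacy-monolith"

def route(path: str) -> str:
    """Return the backend that should handle this request path."""
    for prefix, backend in MIGRATED_ROUTES.items():
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return MONOLITH  # not yet migrated: the monolith keeps serving it
```

As each domain is carved out, it gains an entry in the routing table; the monolith shrinks without ever needing a single cutover day.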

We mapped out the core business domains with their team: user management, billing, the main platform features, and integrations. Each one became its own service with clear API boundaries and its own database. We were deliberate about making service ownership match their actual team structure, so each team owned their service from development through to production support.

Everything ran on Kubernetes for orchestration, scaling, and automated recovery. Each service got its own CI/CD pipeline.

Key technical decisions:

  • Event-driven communication through Kafka, rather than synchronous REST calls between services. We considered gRPC initially but the team had more experience with event streams, and the loose coupling suited their domain better.
  • Independent databases per service. This one caused real debate. Their DBA was understandably nervous about losing the ability to run cross-domain queries, but the shared database was one of the biggest sources of coupling and deployment risk.
  • Kubernetes with auto-scaling to handle traffic spikes without someone being paged at 3am.
  • Structured logging and distributed tracing, because once you have 12 services, working out where a request failed becomes a genuine problem.
  • An API gateway as a single entry point for clients.
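The event-driven choice in the first bullet is easiest to see in miniature: producers publish domain events to a topic, and any number of consumers react, with no service calling another directly. The sketch below uses a toy in-memory bus as a stand-in for Kafka; in production the bus would be a Kafka topic and each handler a separate service.

```python
from collections import defaultdict

# Toy in-memory event bus standing in for Kafka topics. Producers publish
# domain events; subscribers react without the producer knowing about them.
class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
audit_log = []

# Hypothetical consumers: billing and integrations both react to a signup,
# but the user service never calls either of them directly.
bus.subscribe("user.signed_up", lambda e: audit_log.append(("billing", e["user_id"])))
bus.subscribe("user.signed_up", lambda e: audit_log.append(("integrations", e["user_id"])))

bus.publish("user.signed_up", {"user_id": 42})
```

This is the loose coupling the team wanted: adding a new consumer of signups means subscribing to the topic, not editing the user service.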

About three months into the migration, we hit something nobody had documented. The billing service had a hard dependency on a legacy payment processor callback that went through the monolith’s session layer. It was not in any architecture diagram. We only found it because a subset of payment confirmations started silently failing in staging. That set us back about two weeks while we built a compatibility shim and sorted out the routing. It was a good reminder that large monoliths always have surprises buried in them.
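The shape of that fix, in heavily simplified form: intercept the legacy processor's callback at the monolith boundary, resolve the session context it still expects, and forward the confirmation to the new billing service. Every name in this sketch is invented for illustration; it shows the bridging pattern, not the actual shim.

```python
# Hypothetical sketch of the compatibility shim (all names invented): the
# legacy payment processor still calls back through the monolith's session
# layer, so the shim translates that callback into a billing-service call.

def legacy_session_lookup(session_token):
    # Stand-in for the monolith's session layer.
    return {"account_id": "acct-" + session_token}

def billing_service_confirm(account_id, payment_id):
    # Stand-in for the new billing service's confirmation endpoint.
    return {"account_id": account_id, "payment_id": payment_id, "status": "confirmed"}

def shim_handle_callback(callback):
    """Bridge a legacy processor callback to the new billing service."""
    session = legacy_session_lookup(callback["session_token"])
    return billing_service_confirm(session["account_id"], callback["payment_id"])

result = shim_handle_callback({"session_token": "s1", "payment_id": "pay-9"})
```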

We ran both systems in parallel for the full six months. New features went straight into microservices while the monolith continued handling existing traffic. Once a service was stable and carrying real load, we shifted traffic across gradually.
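Gradual traffic shifting of this kind is typically a weighted split at the routing layer: start a service at a small percentage, watch it under real load, then dial it up. A stripped-down sketch of the mechanism (percentages and names are illustrative):

```python
import random

def pick_backend(weight_new: float, rng=random.random) -> str:
    """Route roughly weight_new of traffic to the new service, the rest to the monolith."""
    return "new-service" if rng() < weight_new else "monolith"

# Dial the weight up in stages as confidence grows, e.g. 5% -> 25% -> 100%.
counts = {"new-service": 0, "monolith": 0}
rng = random.Random(0)  # seeded for a repeatable demonstration
for _ in range(1000):
    counts[pick_backend(0.25, rng.random)] += 1
```

The same knob turns the other way, too: if error rates climb after a shift, the weight drops back and the monolith quietly absorbs the traffic again.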

The Results

The numbers told a clear story, but honestly the thing that stood out most was watching their teams start moving independently for the first time in years.

  • Deployment times: down from 4–6 hours (with cross-team coordination) to 10–15 minutes per service
  • Incident response: mean time to resolution dropped from over 2 hours to 20 to 30 minutes
  • Deployment frequency: went from 2 or 3 times a week to over 20 deployments daily across all services
  • Outage duration: platform-wide outages lasting 30+ minutes became localised service issues resolved in 5 to 10 minutes
  • Developer velocity: feature development accelerated by 40%, mostly from eliminating the cross-team coordination overhead

The billing team stopped waiting on user management. The integrations team shipped updates without going anywhere near core platform code. Teams could experiment, break things safely, and iterate quickly.

A bug in one service no longer took down the whole platform. Better monitoring, alerting, and automated recovery made production calmer and more predictable. Their on-call engineers were noticeably less stressed by the end.

Six months after the migration wrapped up, they were running 12 independent services on Kubernetes with sub-minute recovery times. The only incident was a brief service interruption in month three, resolved in under five minutes; there was no further unplanned downtime during the final quarter. The monolith was still there, handling a few legacy features, but it was no longer the thing holding them back.

Facing something similar?

Tell us what you are dealing with and we will let you know how we can help.

Get In Touch