Writing · February 15, 2026
Migrating 100+ Servers to Azure for H&M — Zero Outages
Lessons from leading the H&M Azure migration at IBM — phased planning, Terraform landing zones, blue-green DNS, and what I'd do differently.
When IBM tasked my team with migrating over 100 application servers for H&M to Azure, the hardest part wasn't writing Terraform — it was the sequencing. One wrong move and a retail giant's operations go dark on a Monday morning.
Here's exactly how we did it, what nearly broke us, and what I'd do differently today.
The constraint that changed everything
H&M's engineering teams operate across time zones. Any outage during business hours — even 10 minutes — meant escalations, revenue impact, and very unhappy stakeholders. The mandate was clear: zero weekend work, zero business-hour outages.
That single constraint shaped every technical decision we made.
Phase 1 — Assess before you touch anything
Before a single Terraform file was written, we spent three weeks in discovery:
- Catalogued every server: OS, app runtime, dependencies, network rules
- Identified "chattiness" — which servers talked to each other constantly
- Mapped external dependencies: on-prem databases, third-party APIs, VPN tunnels
- Flagged stateful apps that needed special handling (session stores, file mounts)
The output was a migration wave plan — groups of servers that could move together without breaking dependencies. Wave 1 was the smallest, lowest-risk group. Wave 5 was the scary stuff.
Phase 2 — Build the landing zone first
We never migrated a single server until the Azure landing zone was rock solid. This included:
- Hub-and-spoke VNet topology with peering to on-prem via ExpressRoute
- Private endpoints for all PaaS services — no public IPs on anything backend
- Azure Policy assignments to enforce tagging, region locks, and SKU constraints
- Log Analytics workspace wired to every resource from day one
- IAM roles scoped to least-privilege — no contributor-level service principals
The Terraform for this was modular. One module per resource type, versioned in a private registry. Every environment (dev, staging, prod) was a composition of the same modules with different variable files.
module "vnet" {
source = "internal-registry/vnet/azure"
version = "2.1.0"
name = "hm-prod-vnet"
address_space = ["10.10.0.0/16"]
resource_group_name = module.rg.name
location = var.location
}
Phase 3 — The migration pipeline
Each wave followed the same pattern:
- Replicate — Azure Migrate replicates the VM continuously, no cutover yet
- Test cutover — spin up the replica in an isolated VNet, run smoke tests
- Approve — wave lead signs off on test results
- Production cutover — flip DNS, monitor for 30 minutes, declare success
- Decommission — original server stays warm for 72 hours, then deleted
The DNS flip was the magic. Instead of changing IPs in application configs (risky, slow), we updated Azure Private DNS zones. Cutover time per server: under 2 minutes. Rollback time: under 2 minutes. That's the number that let us sleep at night.
What nearly broke us
Undocumented dependencies. Three servers in wave 3 had a hardcoded IP reference to an on-prem NFS share that nobody knew about. We caught it in the test cutover when file writes started failing silently. The fix took 4 hours; catching it in prod would have taken 4 days.
Lesson: run your smoke tests against real workloads, not just ping checks. We added a synthetic transaction test to every wave after that.
Monitoring from day one
We instrumented before migration, not after. Every replicated VM had the Azure Monitor agent installed during replication. By the time we did the DNS flip, we had 2 weeks of baseline metrics to compare against.
Alerts were set to page if CPU, memory, or error rates deviated more than 20% from baseline within the first hour post-cutover.
Results
- 100+ servers migrated across 5 waves
- Zero business-hour outages
- Zero weekend work after wave 1
- Mean cutover time: 8 minutes per server
- One rollback in wave 2 (caught and resolved in 11 minutes)
What I'd do differently
Start instrumentation earlier. By wave 3 we wanted application-level traces that hadn't been set up yet. Azure Application Insights should be day-one infrastructure, not an afterthought.
Automate the dependency scan. We did this manually in discovery. Tools like Azure Migrate's dependency analysis or Turbonomic can do this automatically and catch the undocumented links before they bite you in a test cutover.
Involve app teams earlier. The engineers who built the apps knew things the architecture diagrams didn't show. A 30-minute call with each app team in week one would have saved us days of discovery work.
Have questions about your own migration? Connect with me on LinkedIn.
Comments & Reactions