How GitHub’s outages stalled Microsoft’s AI coding lead

OraCore Editors

May 27, 2026 · 17 min read

I break down why GitHub’s outages and Azure migration pain weakened Microsoft’s AI coding edge, and give you a copy-ready playbook.

I've been watching Microsoft’s AI coding story for a while, and honestly, it kept feeling weird. The ingredients were all there: GitHub, Copilot, Azure, OpenAI money, developer mindshare. On paper, that should have been a straight line to dominance. But the product kept wobbling. Users were getting blocked by outages. Teams were shipping AI features into an infrastructure that sounded stretched thin. And the worst part? The company that owns one of the most important developer platforms on the planet looked like it was losing the room to Cursor and Claude Code while still talking like it had the lead.

That’s the part that bothered me. I’ve seen plenty of teams talk themselves into a platform advantage that doesn’t survive contact with actual usage. GitHub had the distribution. It had the habit. It had the default status. But defaults are fragile when reliability slips. If your coding platform starts acting like a bottleneck, developers don’t sit around and philosophize about strategy. They open another tab. They move. They complain in public. They stop trusting the thing that used to feel inevitable.

What CNBC published on May 22, 2026, is basically the story of that trust breaking down. The piece is by Jordan Novet. The article doesn’t just say GitHub had outages. It shows how those outages collided with an overloaded migration, leadership churn, and a fast-moving AI coding market that doesn’t wait for anyone.

GitHub didn’t lose because AI was weak. It lost because reliability got embarrassing.

“We have not met our own availability standards,” Vlad Fedorov, GitHub’s technology chief, wrote in a March blog post.

That line says more than a dozen product slides ever could. What this actually means is simple: if your platform is the place developers go to work, then uptime is not a support metric. It is the product. Once that slips, every AI feature you add starts looking like garnish on a broken kitchen.

The CNBC piece says GitHub suffered over a dozen incidents lasting more than an hour since March. That’s not a one-off. That’s a pattern users can feel. And developers are brutally practical here. They don’t care that the roadmap is ambitious if the service blocks them “for hours per day,” as Mitchell Hashimoto of HashiCorp put it in his blog post referenced by CNBC. When the core workflow breaks, trust drains fast.

I ran into this exact dynamic years ago on a team that kept stacking smart features onto a flaky internal platform. Every new enhancement looked great in demos. Then the platform would stall during peak usage and the whole team would quietly route around it. That’s what GitHub is fighting now. AI coding is a force multiplier only if the underlying system can absorb the load. If it can’t, the AI just amplifies the pain.

How to apply it: if you run a developer platform, treat availability as a feature with its own roadmap, its own owners, and its own release gates. Don’t bury it under “platform health.” Put a hard budget on downtime, error rate, and recovery time. Publish the number internally. If you can’t defend it, you don’t have a platform strategy, you have hope.

Track availability by workflow, not just service.
Set an error budget for developer-facing actions like clone, push, review, and merge.
Make incident review part of product planning, not just ops cleanup.
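
One way to make the error-budget idea concrete is the rough sketch below. It assumes you already count attempts and failures per developer-facing action over some review window. The `WorkflowStats` type, the 99.9% target, and the sample numbers are all hypothetical and are not GitHub figures.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStats:
    name: str    # developer-facing action, e.g. "push" (hypothetical label)
    total: int   # attempts in the review window
    failed: int  # failed attempts in the same window


def budget_remaining(stats: WorkflowStats, slo: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures yet, 0.0 means the budget is used up,
    and a negative value means the workflow has overspent it.
    """
    allowed_failures = (1.0 - slo) * stats.total
    if allowed_failures <= 0:
        # No traffic, or a 100% target: any failure overspends the budget.
        return 1.0 if stats.failed == 0 else float("-inf")
    return 1.0 - stats.failed / allowed_failures


def overspent(workflows: list[WorkflowStats], slo: float) -> list[str]:
    """Names of workflows whose failures exceed what the SLO allows."""
    return [w.name for w in workflows if budget_remaining(w, slo) < 0]


# Hypothetical weekly numbers, for illustration only.
week = [
    WorkflowStats("clone", total=500_000, failed=200),
    WorkflowStats("push", total=100_000, failed=50),
    WorkflowStats("merge", total=20_000, failed=30),
]
print(overspent(week, slo=0.999))  # workflows that should block new launches
```

Tracking the budget per workflow rather than per service is the point here: in the sample data the overall failure rate looks fine, but merge alone has spent more than its allowance.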

The Azure migration problem is not a side quest. It is the bottleneck.

CNBC reports that GitHub’s drawn-out migration to Microsoft Azure limited its computing capacity, and that GitHub leaders had considered moving heavily to Azure before those plans were shelved or delayed. That is the kind of corporate sentence that hides a very unsexy truth: migration debt can become product debt fast.

What this actually means is that GitHub was trying to scale AI-era demand while still living with infrastructure decisions made for a different era. Vlad Fedorov said in March that 12.5% of GitHub traffic was going through a region of Azure data centers in Iowa, with a goal to serve 50% of traffic from Azure by July. That’s not “fully migrated.” That’s a halfway house under pressure. And halfway houses are where queues form, capacity gets weird, and everyone starts negotiating for exceptions.

I’ve seen this movie. A company says it’s “moving to the cloud,” which sounds tidy until you realize the move is really a multiyear negotiation between old systems, new systems, legal constraints, and whatever the business can tolerate this quarter. The problem is that AI coding demand doesn’t wait for your migration plan to catch up. When usage spikes, the platform either has headroom or it doesn’t. If it doesn’t, you get outages, throttling, and angry users who do not care about your architectural roadmap.

How to apply it: if you’re migrating infrastructure while shipping a usage-heavy AI product, stop pretending the migration is invisible. It isn’t. Build a capacity model that includes peak AI usage, not just baseline traffic. Then decide what gets paused when the system is under strain. If you can’t answer that, your migration is already steering the product.

Measure peak-load capacity separately from average traffic.
Keep a migration risk register tied to customer-facing incidents.
Do not launch new AI features into a platform that is already running hot.
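
A capacity model does not have to be elaborate to be useful. The sketch below is a minimal version, assuming you can estimate total capacity, peak baseline traffic, and AI burst traffic in requests per second. The function names, the 20% headroom floor, and every number are assumptions for illustration.

```python
def headroom(capacity_rps: float, baseline_rps: float, ai_burst_rps: float) -> float:
    """Spare capacity under the given load, as a fraction of total capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return (capacity_rps - baseline_rps - ai_burst_rps) / capacity_rps


def safe_to_launch(
    capacity_rps: float,
    peak_rps: float,
    ai_burst_rps: float,
    min_headroom: float = 0.2,  # assumed policy floor, not an industry standard
) -> bool:
    """True only if peak traffic plus AI bursts still leaves the required headroom."""
    return headroom(capacity_rps, peak_rps, ai_burst_rps) >= min_headroom


# Hypothetical figures, for illustration only.
capacity = 10_000
average = 4_000
peak = 6_000
ai_burst = 3_000

print(headroom(capacity, average, 0))            # the view from averages alone
print(headroom(capacity, peak, ai_burst))        # the view at peak with AI load
print(safe_to_launch(capacity, peak, ai_burst))  # launch gate under the assumed floor
```

The two `headroom` calls are the whole argument in miniature: the same platform looks comfortable on average traffic and tight once peak load and AI bursts are counted together.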

Distribution only matters when the product earns the habit

Microsoft had the distribution advantage. GitHub was already the home base for developers. That should have made Copilot the obvious default. But defaults are sticky only when they keep paying rent. The CNBC article notes that GitHub has six times more developers than when Microsoft bought it in 2018, and that it remains a major hub in the DevOps market. Still, newer tools like Cursor and Anthropic’s Claude Code have pulled ahead in momentum.

What this actually means is that distribution is not the same thing as preference. GitHub could be everywhere and still feel second-best if the experience is slower, noisier, or less reliable than the alternatives. Cursor overtook GitHub Copilot in market share about a year ago among Ramp customers, according to Ramp’s data cited by CNBC. That’s a pretty blunt signal. The old incumbent didn’t just get challenged. It got passed.

I think this is where a lot of platform companies fool themselves. They assume the existing audience will tolerate enough friction because the tool is already inside the workflow. Sometimes that’s true. Then a smaller competitor shows up with a tighter loop, faster product iteration, and fewer excuses, and suddenly “good enough” looks lazy. AI coding is especially unforgiving because the user is judging the tool every few keystrokes. There’s no long sales cycle to hide behind. The product has to feel useful immediately.

How to apply it: if you own a default product, act like you’re the challenger. Ship against the sharper competitor, not your own installed base. Watch churn in behavior, not just account counts. If users are opening a competitor’s app for the same task, you’re not winning by default anymore.

Useful signals to watch:

How often users switch tools mid-task.
Whether your product is the first tab or the fallback tab.
Whether new features reduce steps, or just add more buttons.
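
The first two signals can be approximated from a simple event log. The sketch below assumes you have task-level telemetry, for example from an opt-in editor log or an internal study, as chronologically ordered `(task_id, tool)` pairs. That data source, the tool names, and both function names are hypothetical.

```python
from collections import defaultdict


def mid_task_switch_rate(events: list[tuple[str, str]]) -> float:
    """Share of tasks in which the user touched more than one tool."""
    tools_by_task: dict[str, set[str]] = defaultdict(set)
    for task_id, tool in events:
        tools_by_task[task_id].add(tool)
    if not tools_by_task:
        return 0.0
    switched = sum(1 for tools in tools_by_task.values() if len(tools) > 1)
    return switched / len(tools_by_task)


def first_tab_share(events: list[tuple[str, str]], tool: str) -> float:
    """Share of tasks where `tool` was the first tool the user reached for."""
    first_tool: dict[str, str] = {}
    for task_id, used in events:
        # Events are assumed to be in time order, so the first hit wins.
        first_tool.setdefault(task_id, used)
    if not first_tool:
        return 0.0
    return sum(1 for t in first_tool.values() if t == tool) / len(first_tool)


# Hypothetical log: (task_id, tool), in the order the tools were opened.
log = [
    ("t1", "ours"), ("t1", "other"),
    ("t2", "ours"),
    ("t3", "other"), ("t3", "ours"),
    ("t4", "ours"),
]
print(mid_task_switch_rate(log))     # how often users change tools mid-task
print(first_tab_share(log, "ours"))  # how often we are the first tab
```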

Leadership churn makes infrastructure pain feel bigger than it is

The article points out that Thomas Dohmke stepped down as GitHub CEO in August, Julia Liuson retired in April, and key GitHub VPs moved into other Microsoft divisions. That matters because technical problems get interpreted through organizational chaos. If the product is flaky and the leadership bench looks scrambled, users do not assume there is a clean recovery plan. They assume drift.

What this actually means is that outages are never just outages once the org starts wobbling. They become evidence. Evidence that the team is stretched. Evidence that priorities are changing. Evidence that nobody is fully in charge. You can survive one bad week. It’s much harder to survive a period where every bad week feels connected to the last one.

I’ve worked in orgs where the engineering team could have fixed the issue, but the company kept reassigning ownership, so every postmortem turned into a political document. That’s how confidence dies. Not with one giant failure. With a long series of “we’re working on it” updates that never quite restore the feeling that someone owns the problem end to end.

CNBC also quotes Armin Ronacher, creator of Flask, saying people are tired of the instability, product churn, Copilot noise, unclear leadership, and the feeling that the platform is no longer primarily designed for the community that made it valuable. That’s not a random complaint. That’s the kind of sentence developers repeat when they’ve already mentally moved on.

How to apply it: if your product is suffering and your org is also in flux, over-communicate ownership. Name the person accountable. Name the incident owner. Name the migration owner. Name the support path. If users can’t tell who is steering, they will assume no one is.

Assign a single owner for reliability, not a committee.
Publish incident timelines and recovery steps in plain language.
Keep leadership changes from becoming product ambiguity.

AI coding markets move fast, and Copilot stopped feeling like the pace setter

CNBC cites OpenAI saying 4 million people were actively using Codex in April, up from 3 million less than two weeks earlier. It also notes Anthropic’s Claude Code surge and Cursor’s momentum. Meanwhile, Microsoft said in January that GitHub Copilot had 4.7 million paid subscribers, up 75% from a year earlier. Those numbers are impressive, but they don’t automatically translate to leadership.

What this actually means is that the market is not awarding points for being early anymore. Copilot was announced in 2021, and being first used to matter a lot. Now the bar is higher. Developers want tools that feel responsive, fit into their workflow, and keep improving. If your product is the known quantity while competitors are the ones shipping the features people talk about, you start sounding like the old standard instead of the new one.

I keep coming back to this because it’s a classic platform trap. A company gets credit for inventing the category, then spends too long defending its original shape. But AI coding is not a museum. Developers don’t care who got there first if someone else is making their day easier right now. The market rewards the tool that reduces friction in the moment.

How to apply it: don’t measure AI product success only by signups or paid subs. Measure whether the product is becoming the default for actual coding sessions. Look at retention by task, not just by account. If users subscribe but still prefer another tool for real work, your lead is more cosmetic than operational.

Practical checks:

Compare active usage to paid adoption.
Track how often AI suggestions are accepted without edits.
Watch whether users adopt the newest features or ignore them.
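
The first two checks reduce to a pair of ratios. Here is a minimal sketch, assuming you log each suggestion with whether it was accepted and whether the user edited it afterward. The `Suggestion` type, its field names, and the sample numbers are assumptions, not any vendor's telemetry schema.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    accepted: bool             # the user took the suggestion
    edited_after_accept: bool  # the user then changed the inserted code


def active_to_paid_ratio(weekly_active_users: int, paid_seats: int) -> float:
    """Weekly active users per paid seat; well below 1.0 suggests shelfware."""
    if paid_seats <= 0:
        raise ValueError("paid_seats must be positive")
    return weekly_active_users / paid_seats


def unedited_acceptance_rate(suggestions: list[Suggestion]) -> float:
    """Share of all shown suggestions that were accepted and left untouched."""
    if not suggestions:
        return 0.0
    clean = sum(1 for s in suggestions if s.accepted and not s.edited_after_accept)
    return clean / len(suggestions)


# Hypothetical sample, for illustration only.
shown = [
    Suggestion(accepted=True, edited_after_accept=False),
    Suggestion(accepted=True, edited_after_accept=True),
    Suggestion(accepted=False, edited_after_accept=False),
    Suggestion(accepted=True, edited_after_accept=False),
]
print(active_to_paid_ratio(weekly_active_users=1_200, paid_seats=2_000))
print(unedited_acceptance_rate(shown))
```

Neither number means much on its own. The useful reading is the trend: a growing paid count with a flat or falling active ratio is the cosmetic lead described above.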

Multi-cloud is a survival move, not a strategy trophy

GitHub now relies on Amazon, Google, Microsoft, and Oracle for cloud infrastructure in addition to its own facilities, according to the article. Fedorov wrote that GitHub started working on a path to multi-cloud while migrating out of smaller custom data centers. That’s not the neat, centralized story Microsoft would probably prefer to tell. It’s a practical response to pressure.

What this actually means is that resilience sometimes requires admitting your preferred architecture cannot absorb the current load. Multi-cloud is expensive, messy, and operationally annoying. But if the alternative is repeated outages, the “clean” architecture is just a nicer way to fail.

I’ve always thought teams talk about multi-cloud too romantically. In practice, it’s often the result of pain, not ideology. One cloud runs hot. One region gets saturated. One dependency becomes a single point of failure. Suddenly the architecture that was supposed to simplify everything becomes the thing you have to hedge against. GitHub sounds like it is there now.

How to apply it: use multi-cloud only where it buys you a real resilience benefit. Don’t spread workloads everywhere for optics. Put it where failure hurts: auth, storage, code review, and deployment paths. Then test failover like you mean it.

If you’re building on top of someone else’s platform, this is also a reminder to keep exit options alive. The companies that moved fastest away from GitHub’s pain were not the ones with the prettiest architecture deck. They were the ones with a decent fallback and a willingness to use it.

The template you can copy

# Reliability-first AI coding platform playbook

## What we are optimizing for
- Developer trust before feature count
- Availability before AI novelty
- Capacity before launch velocity
- Clear ownership before org reshuffles

## 1) Define the product as a workflow, not a feature
The product is not "Copilot" or "agentic coding." The product is:
- open repo
- edit code
- generate suggestion
- review suggestion
- merge change
- recover when something breaks

For each step, define:
- owner
- SLO
- top failure modes
- escalation path

## 2) Set hard reliability budgets
Track these weekly:
- uptime
- p95 latency
- failed pushes
- failed merges
- incident count over 1 hour
- mean time to recover

Rules:
- no new AI feature ships if core workflow incidents exceed the budget
- no launch if recovery tools are untested
- no migration milestone counts as complete until traffic is stable for 30 days

## 3) Capacity plan for AI load, not old load
Model traffic under:
- normal usage
- peak weekday usage
- incident recovery traffic
- AI-agent burst traffic
- regional failover traffic

For each scenario, answer:
- what breaks first?
- what degrades gracefully?
- what gets throttled?
- what gets paused?

## 4) Keep migration visible
Create a migration dashboard with:
- current traffic split
- remaining legacy dependencies
- blocked moves
- capacity risks
- customer impact

If the migration is delaying product reliability, say so plainly.

## 5) Treat competitors as your product spec
Every quarter, compare your product against the sharpest alternative.
Ask:
- Which task is faster elsewhere?
- Which step is less annoying elsewhere?
- Which failure is more tolerable elsewhere?
- Which feature users mention without being prompted?

Then pick the top 3 gaps and ship against them.

## 6) Write the ownership memo
Use this exact format:
- Incident owner:
- Infrastructure owner:
- Product owner:
- Migration owner:
- Support owner:
- Escalation channel:
- Customer communication lead:

## 7) Rollout checklist
Before shipping a new AI coding feature:
- load test passed
- rollback tested
- support docs updated
- incident playbook updated
- human review path confirmed
- capacity headroom verified
- customer comms drafted

## 8) Fallback plan for customers
If the primary platform is degraded:
- expose status clearly
- provide read-only access if possible
- preserve user work automatically
- offer export paths
- document recovery steps in one page

## 9) Weekly operating review
Answer these every week:
- What broke?
- What was the user-facing impact?
- What did we learn about capacity?
- What did we delay because of reliability?
- What did competitors ship that we need to match?

## 10) Decision rule
If a feature improves AI novelty but increases outage risk, it does not ship until the reliability gap is closed.

This is the part I’d actually hand to a team. It’s not glamorous, but it forces the right conversation. If your AI coding product is losing trust, the fix is not more hype. It is more headroom, clearer ownership, and a product definition that starts with the workflow developers care about.
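
If you want the rules in sections 2 and 10 of the template to be enforced rather than remembered, they can be encoded as a release gate. The sketch below is one possible shape. The `ReleaseState` fields and the wording of the blocker messages are hypothetical, and the 30-day default simply mirrors the template.

```python
from dataclasses import dataclass


@dataclass
class ReleaseState:
    long_incidents: int           # incidents over 1 hour in the current period
    incident_budget: int          # maximum such incidents the team has agreed to
    recovery_tools_tested: bool   # rollback and recovery paths were exercised
    increases_outage_risk: bool   # the feature adds load or new failure modes
    reliability_gap_closed: bool  # known reliability issues are resolved


def blockers(state: ReleaseState) -> list[str]:
    """Return every reason the feature must not ship; empty means clear to ship."""
    reasons = []
    if state.long_incidents > state.incident_budget:
        reasons.append("core workflow incidents exceed the budget")
    if not state.recovery_tools_tested:
        reasons.append("recovery tools are untested")
    if state.increases_outage_risk and not state.reliability_gap_closed:
        reasons.append("feature raises outage risk while the reliability gap is open")
    return reasons


def migration_milestone_complete(stable_days: int, required_days: int = 30) -> bool:
    """A milestone counts only after traffic has been stable long enough."""
    return stable_days >= required_days


# Hypothetical state, for illustration only.
state = ReleaseState(
    long_incidents=3,
    incident_budget=1,
    recovery_tools_tested=True,
    increases_outage_risk=True,
    reliability_gap_closed=False,
)
for reason in blockers(state):
    print("BLOCKED:", reason)
print(migration_milestone_complete(stable_days=12))
```

Returning the full list of reasons, instead of a single yes or no, keeps the weekly review honest about what exactly is holding a launch back.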

The original reporting is from CNBC’s article “Microsoft’s GitHub was positioned to win the AI coding race. Outages got in the way”. My breakdown is derivative of that reporting and the public sources it cites, plus my own read of how platform reliability usually fails in the real world.
