2026-05-19
🔴 CRITICAL
railway · google-cloud · outage · automation · infrastructure · false-positive

Google's Automated System Suspends Railway's Cloud Account, Triggering 8-Hour Outage

What Happened

On May 19, 2026, Google Cloud Platform suspended Railway's production account as part of an automated, platform-wide enforcement action. The suspension was incorrect — but the automation didn't ask, and there was no human in the loop to catch it before it fired.

The blast radius was total. Railway's control plane API runs on GCP, and when the account went dark, so did the routing tables that Railway's edge proxies depend on. Workloads on Railway Metal and AWS kept running, but once the edge proxies' cached routing tables expired (~1 hour later), every workload in every region became unreachable.

The result: an ~8-hour platform-wide outage affecting the dashboard, API, OAuth login, and all customer workloads.

Railway's status posts on X during the May 19–20 outage: "Google Cloud has blocked our account, making some Railway services unavailable... we have access to some of our Google Cloud–hosted infrastructure and are working to restore the rest of the service." A follow-up confirmed that the dashboard was unavailable and all running services were down.

How It Happened

  • 22:10 UTC — Monitoring detects API health failures; 22:11 the dashboard starts returning 503s and users can't log in
  • 22:19 UTC — Root cause identified: the GCP account had been suspended
  • 22:22 UTC — Railway files a P0 ticket with Google Cloud
  • 22:29 UTC — GCP access restored — but services stayed offline because recovery had to be sequenced
  • 23:54 UTC — All persistent disks restored
  • 01:30 UTC (May 20) — Compute instances begin recovering; networking restored
  • 04:00 UTC — API, dashboard, and OAuth confirmed operational
  • 06:14 → 07:58 UTC — Incident moved to monitoring, then fully resolved
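
To make the gaps in the timeline above easier to read, here is a minimal sketch that computes elapsed time from first detection to each milestone. The timestamps are copied from the timeline; the assumption that every entry after 01:30 UTC falls on May 20, and the milestone names themselves, are illustrative rather than taken from Railway's report.

```python
from datetime import datetime, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


# Milestones copied from the timeline; names are hypothetical labels.
# Assumption: entries after 01:30 UTC are on May 20.
MILESTONES = {
    "detected": utc(19, "22:10"),
    "access_restored": utc(19, "22:29"),
    "disks_restored": utc(19, "23:54"),
    "api_operational": utc(20, "04:00"),
    "monitoring": utc(20, "06:14"),
    "resolved": utc(20, "07:58"),
}


def minutes_since_detection(name: str) -> int:
    """Whole minutes between first detection and the named milestone."""
    delta = MILESTONES[name] - MILESTONES["detected"]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    for name in MILESTONES:
        minutes = minutes_since_detection(name)
        print(f"{name:16s} +{minutes // 60}h{minutes % 60:02d}m")
```

Under those assumptions, account access came back 19 minutes after detection, but the API, dashboard, and OAuth were not confirmed operational until 5h50m in. The move to monitoring falls roughly eight hours after detection, and full resolution 9h48m after.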

Secondary fallout: GitHub rate-limited Railway's integrations during the retry surge, and Terms-of-Service acceptance records were reset.

Why This Matters

The headline detail isn't that a single API failed — it's how a single account got switched off. A consequential, high-blast-radius action (suspend an entire production cloud account) was taken by an automated system with no human review, and it was wrong. Whether that enforcement engine is "AI" in the buzzword sense or just classical abuse-detection heuristics is unconfirmed — but the failure mode is the same one we keep cataloguing: automation acting at machine speed on a false positive, with humans only able to react after the damage.

Railway put the customer reality bluntly:

"Your customers don't care whether the failure was Google or Railway; they see your product."

There's also a hard architectural lesson buried here: Railway's data plane spanned multiple providers (Metal, AWS), but the control plane was a single point of failure on GCP. Multi-cloud workloads don't help if the thing that tells them where to route still lives in one vendor's account.
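
One way to surface this kind of hidden dependency is to write the dependency map down and ask, for each vendor account, what goes dark if it is suspended. The sketch below is a deliberately simplified model: the component names and the shape of `REQUIREMENTS` are assumptions made for illustration, not Railway's actual architecture, and the recursion assumes the map has no cycles.

```python
# Vendor accounts that can be suspended independently.
VENDORS = {"gcp", "aws", "metal"}

# Hypothetical dependency map. Each component lists requirement groups:
# a group is satisfied when ANY member is up, and the component is up
# only when ALL of its groups are satisfied. Members are vendor accounts
# or other components.
REQUIREMENTS: dict[str, list[set[str]]] = {
    "control_plane": [{"gcp"}],
    "routing_tables": [{"control_plane"}],
    "edge_proxies": [{"aws", "metal"}, {"routing_tables"}],
    "workload_reachability": [{"aws", "metal"}, {"edge_proxies"}],
}


def is_up(name: str, down_vendors: set[str]) -> bool:
    """Return True if a vendor or component is still available."""
    if name in VENDORS:
        return name not in down_vendors
    return all(
        any(is_up(member, down_vendors) for member in group)
        for group in REQUIREMENTS[name]
    )


def blast_radius(vendor: str) -> set[str]:
    """Components that go down when one vendor account is suspended."""
    return {c for c in REQUIREMENTS if not is_up(c, {vendor})}


def total_outage_vendors() -> list[str]:
    """Vendor accounts whose loss alone takes every component down."""
    return sorted(v for v in VENDORS if blast_radius(v) == set(REQUIREMENTS))


if __name__ == "__main__":
    for vendor in sorted(VENDORS):
        print(vendor, "->", sorted(blast_radius(vendor)))
    print("single accounts causing total outage:", total_outage_vendors())
```

In this toy model, losing AWS or Metal alone takes nothing down because each has a substitute, while losing GCP takes down everything, including reachability of workloads that never ran there.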

Lessons Learned

  • Automated enforcement needs a human circuit-breaker — Suspending an entire production account should not be a fully automated, instant action against an established customer
  • Control-plane dependencies are hidden single points of failure — Your workloads can survive a provider outage and still go dark if routing/control lives there
  • Cached state buys time, not safety — Railway's ~1-hour routing cache delayed total failure but didn't prevent it
  • "It was the vendor" is not an answer customers accept — Resilience to a provider's mistakes is your responsibility, not theirs
  • Recovery is sequential, not instant — Even after access was restored in minutes, disks → compute → networking took hours to bring back
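
The "cached state buys time, not safety" point can be illustrated with a small routing cache that either drops its table when the TTL expires or keeps serving the last known-good copy when a refresh fails. This is a sketch under stated assumptions: `RoutingCache`, `simulate`, the hostname, and the address are all hypothetical, and nothing here describes how Railway's edge proxies are actually implemented.

```python
import time
from typing import Callable, Optional


class RoutingCache:
    """Caches a routing table and optionally serves it stale on refresh failure."""

    def __init__(
        self,
        fetch: Callable[[], dict[str, str]],
        ttl_seconds: float,
        serve_stale: bool,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale
        self._clock = clock
        self._table: Optional[dict[str, str]] = None
        self._fetched_at = 0.0

    def lookup(self, host: str) -> Optional[str]:
        now = self._clock()
        expired = self._table is None or now - self._fetched_at >= self._ttl
        if expired:
            try:
                self._table = self._fetch()
                self._fetched_at = now
            except ConnectionError:
                # Control plane unreachable. Strict mode discards the expired
                # table; stale mode keeps routing on the last known-good copy.
                if not self._serve_stale:
                    self._table = None
        return None if self._table is None else self._table.get(host)


def simulate(serve_stale: bool) -> list[Optional[str]]:
    """Warm the cache, take the control plane down, then look up twice."""
    now = [0.0]
    control_plane_up = [True]

    def fetch() -> dict[str, str]:
        if not control_plane_up[0]:
            raise ConnectionError("control plane unreachable")
        return {"app.example.test": "10.0.0.7"}  # hypothetical route

    cache = RoutingCache(fetch, 3600, serve_stale, clock=lambda: now[0])
    results = [cache.lookup("app.example.test")]   # cache warmed
    control_plane_up[0] = False
    now[0] = 1800.0                                 # still inside the TTL
    results.append(cache.lookup("app.example.test"))
    now[0] = 3601.0                                 # TTL has expired
    results.append(cache.lookup("app.example.test"))
    return results
```

With `serve_stale=False` the third lookup returns `None`, which mirrors the delayed failure described above; with `serve_stale=True` the route survives the expiry. The trade-off is that stale routes can point at instances that no longer exist, and each lookup after expiry retries the fetch, so a real implementation would also need a staleness limit and retry throttling.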

Prevention Checklist

  • [ ] Map every control-plane dependency and identify which single vendor account could take you fully offline
  • [ ] Remove single-provider dependencies from the data plane's hot path (routing, edge proxies)
  • [ ] Treat third-party automated enforcement as a threat model — have an escalation path and P0 contact in place before you need it
  • [ ] Extend high-availability state (DB shards, routing tables) across multiple providers
  • [ ] Increase routing-table cache TTLs / add fallback routing so a control-plane outage degrades gracefully
  • [ ] Rehearse sequenced recovery (disks → compute → networking) so order-of-operations is known under pressure
  • [ ] Keep customer-facing status comms ready — outages caused by vendors are still your incident
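
For the "rehearse sequenced recovery" item, one option is to encode the order of operations as data so it can be checked and rehearsed rather than remembered. The sketch below uses Python's standard `graphlib.TopologicalSorter`; the stage names follow the disks → compute → networking sequence from the timeline, while `RECOVERY_DEPENDENCIES`, `run_recovery`, and the `restore` / `is_healthy` callbacks are hypothetical stand-ins for real runbook steps.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Stage -> stages that must be healthy first (hypothetical runbook data).
RECOVERY_DEPENDENCIES: dict[str, set[str]] = {
    "persistent_disks": set(),
    "compute_instances": {"persistent_disks"},
    "networking": {"compute_instances"},
    "api_dashboard_oauth": {"networking"},
}


def recovery_order(deps: dict[str, set[str]]) -> list[str]:
    """Return stages with every prerequisite ahead of its dependents.

    TopologicalSorter raises graphlib.CycleError if the map is circular.
    """
    return list(TopologicalSorter(deps).static_order())


def run_recovery(
    deps: dict[str, set[str]],
    restore: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Restore stages in order, stopping at the first unhealthy stage."""
    completed: list[str] = []
    for stage in recovery_order(deps):
        restore(stage)
        if not is_healthy(stage):
            raise RuntimeError(f"{stage} is unhealthy; not starting dependents")
        completed.append(stage)
    return completed


if __name__ == "__main__":
    # Dry run: print each step and treat every stage as healthy.
    run_recovery(
        RECOVERY_DEPENDENCIES,
        restore=lambda stage: print("restoring", stage),
        is_healthy=lambda stage: True,
    )
```

Running the dry run during a game day exercises the ordering without touching infrastructure, and a circular dependency in the map fails loudly before an incident rather than during one.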

Original Source: Railway — Incident Report: May 19, 2026 GCP Account Outage

Railway's status posts: @Railway on X — "Google Cloud has blocked our account, making some Railway services unavailable."

Note: The "AI" attribution here is intentionally cautious. Railway describes a Google automated account action; whether the underlying enforcement is machine-learning-driven or rule-based has not been confirmed. The incident is included as a study in automated, no-human-in-the-loop decisions with outsized impact.