Post-mortem for the May 16th, 2025 incident

On May 16th, 2025, Mergify experienced a significant service disruption from 06:00 UTC to 09:51 UTC. This post-mortem outlines the incident's context, our response and resolution, and the steps we are taking to prevent future occurrences.

Context

Mergify's infrastructure is hosted on Google Cloud Platform (GCP), using Cloud Run with Direct VPC egress to connect our services securely to internal VPC resources. This setup is designed to ensure optimal application performance and security.
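For illustration, a Cloud Run service with Direct VPC egress is typically deployed along the following lines; the service, image, network, and subnet names below are hypothetical placeholders, not our actual configuration:

# Minimal sketch of a Cloud Run deployment with Direct VPC egress.
# All names here are placeholders for illustration only.
gcloud run deploy engine \
  --image=us-docker.pkg.dev/example-project/example-repo/engine:latest \
  --region=us-central1 \
  --network=internal-vpc \
  --subnet=cloud-run-subnet \
  --vpc-egress=private-ranges-only

With --vpc-egress=private-ranges-only, traffic to private IP ranges is routed directly through the attached subnet rather than through a connector; a problem in that network path can therefore keep new instances from ever becoming healthy, which is consistent with the failure mode we observed.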

At 06:00 UTC on May 16th, our monitoring systems alerted us to a major outage affecting all our services, including the dashboard and engine. Initial investigation indicated that all Cloud Run containers had been scaled down to zero and that new instances were failing to start because their health checks did not pass.

Investigation

Upon receiving the alerts, our team promptly initiated an incident response. Early diagnostics revealed that the issue was not application-related but stemmed from the underlying infrastructure. Specifically, Cloud Run services were unable to start new instances because health checks failed, and no application logs were emitted, indicating that the failure occurred before application startup.

We attempted several recovery actions, including redeploying resources and recreating the subnet, but these efforts were unsuccessful. By 08:15 UTC, we escalated the issue to Google Cloud Support. At 08:47 UTC, GCP acknowledged an ongoing incident affecting the Cloud Run Direct VPC Egress feature and provided workarounds. Before we could implement them, however, GCP resolved the underlying network issue at 09:51 UTC.

Resolution

Once GCP resolved the network issue, our services began to recover. We closely monitored the system to ensure stability and confirmed that all services were operational. The incident was officially closed at 11:05 UTC.

Learnings and Future Improvements

This incident highlighted a critical dependency on a single GCP region and the need for improved disaster recovery strategies. To enhance our resilience, we plan to implement multi-region deployment capabilities to allow rapid failover in case of regional outages.
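As a rough sketch of that direction (all service, region, and resource names below are hypothetical, not a description of our final design), a common GCP pattern is to deploy the same Cloud Run service in several regions and attach each one, via a serverless network endpoint group, to a single global external load balancer, which steers traffic away from an unhealthy region:

# Hypothetical example: one serverless NEG per regional Cloud Run deployment.
gcloud compute network-endpoint-groups create engine-neg-us \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=engine

gcloud compute network-endpoint-groups create engine-neg-eu \
  --region=europe-west1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=engine

# One global backend service fronting both regional deployments.
gcloud compute backend-services create engine-backend \
  --global \
  --load-balancing-scheme=EXTERNAL_MANAGED

gcloud compute backend-services add-backend engine-backend \
  --global \
  --network-endpoint-group=engine-neg-us \
  --network-endpoint-group-region=us-central1

gcloud compute backend-services add-backend engine-backend \
  --global \
  --network-endpoint-group=engine-neg-eu \
  --network-endpoint-group-region=europe-west1

The remaining load balancer pieces (URL map, target proxy, forwarding rule) and the data-layer replication needed for a real failover are omitted here; those are the harder part of the work ahead.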

These measures aim to reduce recovery time and maintain service availability during unforeseen infrastructure failures.

We apologize for the inconvenience caused by this outage and appreciate your understanding as we work to strengthen our systems against future incidents.
