Handling unexpected third-party API changes
As you may guess, Mergify relies heavily on third-party APIs, such as the Stripe API and the GitHub API, and on their behavior. As with any third-party service, we have to deal with many things to ensure our integration never breaks.
The first one that comes to mind is network failures or temporary outages. These issues are common and easy to work around; usually, we only need to retry the API calls.
At Mergify, we store GitHub events in a temporary database and use classic exponential back-off to retry until everything works again. But outside of outages, the other thing we have to deal with is API changes.
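As an illustration, exponential back-off fits in a few lines of Python. This is a minimal sketch, not Mergify's actual code: `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names, and the delay values are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 5xx)."""


def backoff_delays(count, base=1.0, cap=60.0):
    """Return `count` capped exponential delays: base, 2*base, 4*base, ..."""
    return [min(cap, base * 2 ** n) for n in range(count)]


def call_with_retries(func, attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted."""
    delays = backoff_delays(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(delays[attempt])  # wait longer after each failure
```

The `sleep` parameter is injectable only so that the waiting can be replaced in tests; in production it would stay at `time.sleep`.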
Tracking API changes
An API is a contract that users and companies rely on to build tools. It's common sense that you should never break such a pact. Any evolution or breaking change should be tracked, versioned, and announced. It should be up to users to choose when they move to the next API version, so they can test and validate their integration first.
There are different methods to let people select which API version they want to use.
At Mergify, we like the work done by Stripe in this area. Their policy is clear: they use a versioned API and have well-defined procedures for upgrading or rolling back. GitHub, on the other hand, doesn't have a firm policy about API versioning yet... but it's coming! (Spoiler alert: we tested it, and it looks excellent.)
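For instance, Stripe lets a client pin a version on each request with the `Stripe-Version` header. Here is a minimal sketch of that idea; the helper name `pinned_headers`, the API key, and the version string are placeholders chosen for illustration.

```python
def pinned_headers(api_key, api_version):
    """Build request headers that pin one explicit Stripe API version."""
    return {
        "Authorization": f"Bearer {api_key}",
        # Stripe applies the API version named in this request header.
        "Stripe-Version": api_version,
    }


# Placeholder values: use your own key and the version you validated against.
headers = pinned_headers("sk_test_placeholder", "2022-11-15")
```

Pinning the version in code means an upgrade becomes an explicit, reviewable change rather than something that happens to you.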
Unfortunately, even when you follow all these good practices, your integration may still break. Here is the story of one of our latest incidents.
Breaking Change
Everything started on a Friday because incidents are more fun on that day.
A Mergify user asked why the draft pull requests that Mergify creates were not always deleted. Our support team replied that this was not the expected behavior and gathered the logs and the scenario leading to this situation.
The support engineer opened an internal ticket for the development team to look at the issue. A Mergify engineer wrote an integration test with the scenario and confirmed that our application did not delete the pull request. So far, so good: we have reproduced the bug, so it should be straightforward to write a fix and release it by Monday.
The following day, Saturday evening, a couple of customers complained about some pull requests that Mergify was not queueing. Looking at the logs, we noticed unexpected HTTP errors when creating the draft pull request for the speculative checks of the merge queues. The code complained that the pull request Mergify was trying to make already existed.
Indeed, everyone knows you can't create two pull requests with the same head branch. We know it. We even have functional tests for this situation. So, how is that possible?
We started to run some existing integration tests manually and discovered that they no longer passed, even though we hadn't changed anything in our codebase.
Fixing the Issue
Here is the thing: when Mergify creates a pull request for a merge queue, it starts by preparing a branch with the speculative future main branch and opens a pull request from it. Next, once the CI reports its status on this speculative branch, Mergify deletes the branch. GitHub automatically closes the associated pull request, and Mergify continues processing the pull requests in the queue.
That Friday, GitHub changed its behavior: it no longer automatically closed a pull request whose branch had been deleted. That's why Mergify failed when it tried to recreate a pull request for the same branch: the previous one was still open.
A quick look at our logs revealed when GitHub rolled out this change and who was impacted by it.
We immediately escalated this issue to the GitHub support team and started to implement a workaround. Fortunately, the fix is simple: close the pull request ourselves. It just costs one more request.
It took a few hours to write the fix and test it thoroughly, as this impacts a critical part of Mergify. Our engineers deployed the workaround on Sunday morning, and all the impacted merge queues resumed their processing correctly.
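For the curious, the workaround can be sketched against the public GitHub REST API as follows. It is an illustration under assumptions, not Mergify's production code: the helper names `cleanup_requests` and `send` are hypothetical, and closing the pull request before deleting the branch is just one possible ordering.

```python
import json
import urllib.request

API = "https://api.github.com"


def cleanup_requests(owner, repo, pull_number, branch):
    """Return the (method, url, body) calls, in the order they are sent."""
    return [
        # Extra request: close the pull request explicitly instead of
        # relying on GitHub closing it when its head branch disappears.
        ("PATCH", f"{API}/repos/{owner}/{repo}/pulls/{pull_number}",
         {"state": "closed"}),
        # Then delete the speculative branch, as before.
        ("DELETE", f"{API}/repos/{owner}/{repo}/git/refs/heads/{branch}",
         None),
    ]


def send(method, url, body, token):
    """Send one call with urllib; needs network access and a valid token."""
    data = json.dumps(body).encode() if body is not None else None
    request = urllib.request.Request(url, data=data, method=method)
    request.add_header("Authorization", f"Bearer {token}")
    request.add_header("Accept", "application/vnd.github+json")
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the list of calls separate from the code that sends them makes the sequence easy to assert in a test without touching the network.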
Reflecting
Having your API provider break one of its API behaviors without any warning is one of the worst-case scenarios, and it hit us at a pretty lousy time. Supposing there's even a good time.
This is an excellent example of how breaking an API can go beyond an OpenAPI schema change.
GitHub broke one of its API contracts without changing anything in the API URL, payloads, or headers. To this day, we still don't know if this was an intended change from GitHub or a nasty bug, though we bet it's the latter. We're still waiting on an appropriate response from GitHub.
Relying on third-party APIs is always full of surprises, even when using some of the best providers like GitHub.
That's one of the reasons Mergify puts a lot of effort into functional and integration testing. We strive to discover problems as early as possible and to quickly reproduce this kind of behavior change in our third-party integrations, so we can mitigate it fast.
This is not the first time one of our providers has broken something unexpectedly. It's usually not visible to our users, but depending on the behavior change, they may unfortunately notice it.
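One way to catch this class of change is a small contract check that asserts the provider's behavior itself rather than our own code. The sketch below is hypothetical: `FakeGitHub` is an in-memory stand-in so that the example runs offline, while a real check would run the same steps against the live API.

```python
class FakeGitHub:
    """In-memory stand-in for the provider, used only for this example."""

    def __init__(self, closes_pull_on_branch_delete):
        self.closes_pull_on_branch_delete = closes_pull_on_branch_delete
        self.pull_state = None

    def create_pull(self, branch):
        self.pull_state = "open"

    def delete_branch(self, branch):
        # The behavior under test: does deleting the head branch close
        # the pull request that was opened from it?
        if self.closes_pull_on_branch_delete:
            self.pull_state = "closed"


def branch_delete_closes_pull(client, branch="contract-check/tmp"):
    """Return True when deleting the head branch closes its pull request."""
    client.create_pull(branch)
    client.delete_branch(branch)
    return client.pull_state == "closed"
```

Run on a schedule against the real service, a check like this would fail as soon as the provider's behavior drifts, even when no schema, URL, or header has changed.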