How to handle checks timeout in an evolutive merge queue?

Fabien Martinet

Merge queues are a core component of Mergify. We regularly ship new features that open up new use cases, and the core mechanics of the queue have to evolve to support them.

When Mergify tests a pull request to ensure that it fits cleanly into the main branch, many things can affect how long the process takes. This is most obvious with a feature like the merge queue freeze, which directly changes the queue’s timing. Features like this require the way Mergify monitors and handles time to evolve as well.

How it was ⏰

Let’s dive into the mechanics of our merge queues. As explained in this post, to ensure the best testing and security possible, Mergify may use single isolated pull requests, batches, or speculative checks to prevent conflicts during the pull request validation process.

Algorithmically speaking, we create an object that we call a “train,” composed of “train cars.” Each train car represents either a single pull request or a batch of several pull requests. Mergify then tests and validates each car according to the checks and conditions defined in the configuration.
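To make the vocabulary concrete, here is a minimal sketch of a train and its cars. The class names, fields, and pull request numbers are hypothetical and chosen only for illustration; they are not Mergify's actual internals.

```python
from dataclasses import dataclass, field


@dataclass
class TrainCar:
    """One unit of validation: a single pull request or a batch of several."""

    pull_requests: list[int]

    @property
    def is_batch(self) -> bool:
        # A car holding more than one pull request is a batch.
        return len(self.pull_requests) > 1


@dataclass
class Train:
    """Ordered sequence of cars waiting to be tested and merged."""

    cars: list[TrainCar] = field(default_factory=list)

    def add_car(self, *pull_requests: int) -> TrainCar:
        car = TrainCar(list(pull_requests))
        self.cars.append(car)
        return car


train = Train()
train.add_car(101)            # a single pull request (hypothetical number)
train.add_car(102, 103, 104)  # a batch of three pull requests
```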

Checking whether the CI and tests have timed out is one of the customizable conditions that our merge queues offer. Until recently, Mergify handled timeouts by measuring how long it took for a car to be both tested and merged.

If the merge didn’t occur within the time window defined in the conditions, the timeout was raised, and the pull requests in the car were unqueued. This approach assumed that a car is merged as soon as it has been tested and validated, which held until we introduced new features to our merge queues. Anything that can block the merge of a fully tested and validated car, like the freeze feature mentioned above, invalidates that assumption.
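The problem can be illustrated with a simplified sketch of the previous behavior. The function name, its parameters, and the 30-minute timeout are assumptions made for this example, not Mergify's real code: the point is only that a clock running until the merge keeps ticking while a freeze holds back a car whose CI has already passed.

```python
from datetime import datetime, timedelta


def timed_out_before_merge(
    created_at: datetime, now: datetime, merged: bool, timeout: timedelta
) -> bool:
    """Previous behavior, simplified: the clock runs until the merge happens."""
    return not merged and now - created_at > timeout


created_at = datetime(2023, 1, 1, 12, 0)
timeout = timedelta(minutes=30)

# Suppose the CI finished after 10 minutes, but a freeze blocks the merge.
# 45 minutes after creation the car is still not merged, so the timeout
# fires even though the checks themselves were fast enough.
now = created_at + timedelta(minutes=45)
spurious_timeout = timed_out_before_merge(created_at, now, merged=False, timeout=timeout)
```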

How we needed it to be 💡

To adapt our timeout mechanics to these new behaviors, we needed to change the scope of the measurement. We were assessing the timeout at the queue rule level, like any other queue condition, whereas we actually needed to process this check locally, inside each car, to exclude the merge from the scope.

Algorithmically, the timer starts when the car is created, which is also when the checks are launched. Since the merge is no longer a relevant point in time, timeout checking had to focus solely on the CI checks, the only condition with a time dimension. CI checks account for nearly 99.9% of the time spent validating a car, which makes them the only parameter worth timing; the time taken by the others is negligible.

How we’ve done it 🛠

Following these observations, we chose to process the timeout locally inside the car instead of treating it as a common queue rule. To do that, we designed a set of variables and functions that let us calculate the car’s runtime and compare it to the desired timeout. Each time the car’s status is updated, this code is triggered, which lets us adapt the behavior to the situation. Because we focus the measurement on the CI checks, we can now distinguish the case where the checks have succeeded from every other case.

In the successful case, Mergify never looks at the timeout again. Only in the other cases does Mergify check whether the CI is still within the desired period.
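The per-car check described here could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `CIState` values, the `TimedCar` class, its fields, and the `on_status_update` method are hypothetical names, not Mergify's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class CIState(Enum):
    # Hypothetical states, used only for this sketch.
    PENDING = "pending"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class TimedCar:
    created_at: datetime       # the timer starts when the car is created
    checks_timeout: timedelta  # allowed duration for the CI checks
    ci_state: CIState = CIState.PENDING
    timed_out: bool = False

    def on_status_update(self, ci_state: CIState, now: datetime) -> None:
        """Re-evaluate the timeout each time the car's status changes."""
        self.ci_state = ci_state
        if ci_state is CIState.SUCCESS:
            # Successful case: the timeout is not evaluated,
            # however long the merge itself is delayed.
            return
        if now - self.created_at > self.checks_timeout:
            self.timed_out = True


start = datetime(2023, 1, 1, 12, 0)

# CI passes after 10 minutes; a freeze then delays the merge for two hours.
frozen = TimedCar(created_at=start, checks_timeout=timedelta(minutes=30))
frozen.on_status_update(CIState.SUCCESS, start + timedelta(minutes=10))
frozen.on_status_update(CIState.SUCCESS, start + timedelta(hours=2))

# CI is still pending after 31 minutes: the timeout is reached.
slow = TimedCar(created_at=start, checks_timeout=timedelta(minutes=30))
slow.on_status_update(CIState.PENDING, start + timedelta(minutes=31))
```

In this sketch the frozen car is never flagged, because the merge delay is outside the measured scope, while the slow car is flagged as soon as a status update arrives past the deadline.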

With this mechanism, Mergify measures and assesses the timeout the way we intended. Separating the cases prevents the code from checking the timeout once it is no longer needed, and stops the car immediately when the timeout is reached.

By removing the merge as a reference point, this change of scope prevents unwanted interactions between the timeout and new features, and makes the mechanism more future-proof.

And who knows, maybe some of these new features might interest you! Stay tuned!