Flaky tests: What are they and how to classify them?

What is a flaky test? This is a big question, since automated testing is key to CI/CD. To truly answer it, you need to understand what makes a test flaky and know the different types of flaky tests, so that you can classify them.

Nowadays, automated testing is a key and unavoidable practice that allows you and your teams to deliver continuous improvements to your project. While determinism is one of the core principles of automated testing, you can never fully achieve it. Every test is designed to run and deliver a predictable outcome, yet there are times when it does not meet that expectation. When that happens, you are probably facing what we call flaky tests.

What is a flaky test?

A flaky test can be seen as a bi-state object: one that both passes and fails periodically, given the same test configuration. Throughout your project’s life, you will never be able to avoid dealing with them, as they tend to multiply as your test suites grow.
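As a minimal, hypothetical illustration (the function and the threshold below are invented for the example), here is what a flaky test can look like in Python: nothing changes between runs, yet the assertion sometimes holds and sometimes does not.

```python
import random
import time

def fetch_latest_order():
    # Stands in for a call whose latency varies between runs,
    # e.g. a database query or an HTTP request.
    time.sleep(random.uniform(0.05, 0.3))
    return {"status": "confirmed"}

def test_order_is_confirmed_quickly():
    start = time.monotonic()
    order = fetch_latest_order()
    elapsed = time.monotonic() - start

    assert order["status"] == "confirmed"
    # Flaky: this threshold depends on machine load and luck, so the
    # same test both passes and fails across strictly identical runs.
    assert elapsed < 0.2
```

Run it a handful of times and you will see both outcomes: exactly the bi-state behaviour described above.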

A key struggle is that flaky tests lead to unreliable CI test suites, reducing the confidence you can have in your testing and degrading the overall development experience. A little flakiness is bearable and barely noticeable. But with a lot of it, your whole test suite becomes obsolete, losing its value and letting bugs and bad code slip through.

Flaky tests are not easy to catch, due to their evasive nature. Their frequency is always a struggle to deal with: some fail often, while others are so rare that they go undetected. Moreover, tests can be flaky as a direct result of the testing environment they run in. This is also why flaky tests are not all bad news: they can originate from multiple sources, and some of them can help you find bugs you would never have found without their flakiness, unveiling badly designed infrastructure or test environments.

Flaky tests are more common than they should be. Source: Geek & Poke via Testinium

On the cost side, flakiness takes a lot of time, money, and resources to deal with, while also slowing the project’s progress and decreasing the team’s trust in the development process. Since costs are high, fixing the critical and large flaky tests is worthwhile, but the return on investment inevitably decreases as the number of remaining flaky tests shrinks.

This leaves you in a paradoxical situation, where suppressing all flaky tests is neither the best solution nor really achievable. In reality, there is a balance, a tolerance threshold. So, how much flakiness can you handle? That is the key question you have to answer.

How to classify them?

To reach that balance, classification seems inevitable. Objectively, there is no single solution: the categorisation system can be adapted and modified according to your company’s needs and vision. Still, it is an essential step toward better use of the resources (people, time, and money) needed to handle flaky tests.

The goal is to separate flaky tests into groups, according to their origin, in order to prioritise them. Two categories stand out across software projects (see the sketch after the list):

  • Independent flaky tests: Tests that fail on their own, whether run outside or inside the test suite. Because they are easy to reproduce, they are easier to notice, debug, and solve.
  • Systemic flaky tests: Tests that fail due to environmental issues, shared state, or even their order in the test suite. They are far more delicate to detect and debug, since their behaviour can change as the system or workflow evolves.

Flaky tests can be hidden everywhere, at every level. Source: MethodPoet
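To make the distinction concrete, here is a hedged Python sketch (all names invented for the example): the first test is flaky on its own, while the last one only fails depending on what ran before it.

```python
import random

# Module-level shared state, e.g. a cache or a global configuration.
_cache = {}

def test_independent_flaky():
    # Independent: fails by itself roughly once in ten runs, no matter
    # what the rest of the suite does. Re-running this single test is
    # enough to reproduce the failure.
    assert random.random() > 0.1

def test_populates_cache():
    _cache["user"] = "alice"
    assert _cache["user"] == "alice"

def test_systemic_flaky():
    # Systemic: only passes if test_populates_cache ran first and left
    # its data behind. Run alone, or in a different order, it fails.
    assert _cache.get("user") == "alice"
```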

While tests can be flaky for countless reasons, some are more recurrent than others. Identifying these origins leads to a more efficient categorisation, allowing you to create sub-categories or new ones. Here are some of the most frequently encountered causes of flakiness, a few of which are illustrated by a sketch after the list:

  • Accessing resources that are not strictly required: Tests that, for example, download a file outside the tested unit, or use a system tool that could be mocked.
  • Insufficient isolation: Tests that do not work on copies of shared resources can run into race conditions or resource contention when run in parallel. Tests that change system state or interact with databases should always clean up after themselves, otherwise other tests might fail because of these interactions.
  • Concurrency: Several parallel threads interacting in an undesirable way (data races, deadlocks, etc.).
  • Test order dependency: Tests that deterministically fail or pass depending on the order they are run in within the test suite.
  • Network: Tests relying on network connectivity, which can never be fully controlled.
  • Time: Tests relying on the system clock can be non-deterministic and hard to reproduce when they fail.
  • Input/Output operations: A test can be flaky if it does not properly release and close the resources it has accessed.
  • Asynchronous waits and un-synchronized or poorly synchronized invocations: Tests that make asynchronous calls but do not properly wait for the result. Fixed sleep periods should be avoided, since the required waiting time can differ between environments.
  • Accessing systems or services that are not perfectly stable: Mock external services as much as possible, in order to avoid depending on external, uncontrolled factors.
  • Usage of random number generation: When generating random numbers or other random objects, log the generated value (or the seed) to avoid needlessly difficult reproduction of the test failure.
  • Unordered collections: Avoid making assumptions about element order in an unordered collection.
  • Hardcoded values: Tests that use constant values for elements or mechanics that might change over time.
  • Too restrictive testing range: When asserting on an output range, some valid outcomes may not have been considered, making the test fail when they occur.
  • Environment dependency: The test outcome can simply vary depending on the environment the test runs on.
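
The sketch below (Python, with the helper names being assumptions made up for illustration) contrasts three of these causes with the usual way to harden them: polling instead of a fixed sleep, logging the random seed, and asserting on an unordered collection without assuming any order.

```python
import random
import threading
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll until condition() returns True instead of sleeping a fixed time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def test_asynchronous_wait():
    done = threading.Event()

    def slow_task():
        # Duration varies between machines and runs.
        time.sleep(random.uniform(0.1, 0.5))
        done.set()

    threading.Thread(target=slow_task).start()
    # Flaky version: time.sleep(0.2); assert done.is_set()
    # Hardened version: poll with a generous timeout instead.
    assert wait_until(done.is_set, timeout=5.0)

def test_random_generation_with_logged_seed():
    # Logging the seed lets a failing run be replayed deterministically.
    seed = random.randrange(2**32)
    print(f"random seed: {seed}")
    rng = random.Random(seed)
    sample = rng.sample(range(100), 5)
    assert len(set(sample)) == 5

def test_unordered_collection():
    tags = {"ci", "flaky", "tests"}  # sets guarantee no iteration order
    # Flaky version: assert list(tags)[0] == "ci"
    # Hardened version: compare without assuming any order.
    assert tags == {"tests", "flaky", "ci"}
```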

Knowing which failure causes occur most often in your project will help you separate and prioritise the affected tests. Solving them then becomes easier, since there is already existing data on each case. That data can point the person fixing the test in the right direction, saving time and potentially unveiling more global weak points in the project, such as poor writing practices, environment flaws, or bad workflow processes.

Conclusion

Now that we know what we are talking about, it should be simpler to identify flaky tests and separate them into groups, in order to simplify their handling. It will also be easier to set up processes to fix and prevent them (link to next article) in a more optimal way. We hope this information will help you in your fight against flaky tests, as it has helped us at Mergify! Stay tuned!

Ready to automate your GitHub workflow?
We save you time by prioritizing and automatically merging, commenting, rebasing, updating, labeling, backporting, closing, and assigning your pull requests.