Understanding Flaky Test Examples: A Developer's Guide to Test Reliability

Huguette Miramar

When Good Tests Go Bad: Real-World Flaky Test Examples

Every developer has encountered them - those maddening tests that seem to have a mind of their own, passing one minute and failing the next without any code changes. These flaky tests can seriously slow down development and erode confidence in your test suite. Let's look at some real cases where tests went wrong and what we can learn from them.

Time-Dependent Failures: The Case of the Monthly Breakdown

Time dependencies are a classic source of test flakiness. One team discovered their CI pipeline would mysteriously break during the first few days of each month. After investigation, they found a statistics calculation test that relied on the current date. The test worked fine most of the time but failed when crossing month boundaries. What made this particularly tricky was its sporadic nature - the failures were easy to dismiss since they only happened briefly each month. The team lived with this issue for years before finally fixing it, showing how long seemingly minor flakiness can linger once a team learns to shrug it off.
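
To make the fix concrete, here is a minimal sketch of how such a test can be pinned to a fixed date using Jest's fake timers. The `monthlyTotal` function and the expected value are hypothetical stand-ins for the team's real statistics calculation.

```typescript
// Illustrative Jest sketch: freeze the clock so a date-based calculation is
// tested against a known date instead of whatever "now" happens to be.
import { monthlyTotal } from "./stats"; // hypothetical module under test

describe("monthly statistics", () => {
  beforeEach(() => {
    jest.useFakeTimers();
    // Freeze time right at a month boundary, the exact case that used to flake.
    jest.setSystemTime(new Date("2024-03-31T23:59:00Z"));
  });

  afterEach(() => {
    jest.useRealTimers();
  });

  it("attributes records to the correct month at the boundary", () => {
    // 42 is a placeholder for the value expected at the frozen date.
    expect(monthlyTotal()).toBe(42);
  });
});
```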

Race Conditions: The Unseen Dependencies

Some of the most challenging flaky tests emerge from race conditions between different parts of the code. A team might add what seems like an isolated new feature, only to see unrelated tests start failing intermittently. This often happens when the new code modifies shared resources or state in ways that conflict with existing tests. Finding these issues requires careful attention to how different parts of the codebase interact. A seemingly independent test might actually depend on specific timing or state that's no longer guaranteed.

Environmental Dependencies: The CI/Local Discrepancy

The classic "works on my machine" problem frequently shows up in flaky tests. Tests pass perfectly in development but fail randomly in CI. The root cause often lies in environmental differences - maybe the CI server uses a different version of a key library, or a network resource that's available locally is intermittently unavailable in CI. For example, one team found their tests would fail when a database connection was configured differently between environments. This highlights why testing across multiple environments is essential for catching these issues early.

The Subtleties of Async Operations: Promises and Pitfalls

Asynchronous code brings its own special brand of test flakiness. Take a test that makes an API call - if it doesn't properly wait for the response before checking results, it might pass when the network is fast but fail under load or latency. These timing-sensitive failures are particularly frustrating because they're hard to reproduce consistently. Success depends on carefully managing async operations, implementing proper wait conditions, and setting appropriate timeouts.
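
As a sketch of the pitfall and the fix, compare these two Jest-style tests; `fetchUser` is a hypothetical API client that returns a Promise.

```typescript
import { fetchUser } from "./api"; // hypothetical async API client

// Flaky: the test never waits, so the assertion inside .then() runs after the
// test has already finished (or not at all), hiding or randomizing failures.
it("loads the user (flaky)", () => {
  fetchUser("42").then((user) => {
    expect(user.id).toBe("42");
  });
});

// Stable: await the promise so the assertion always runs against a real result.
it("loads the user (stable)", async () => {
  const user = await fetchUser("42");
  expect(user.id).toBe("42");
});
```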

By studying these real examples of flaky tests, we can better spot similar patterns in our own code. While completely eliminating test flakiness may be impossible, understanding common causes helps us write more reliable tests from the start. Stable tests are worth the effort - they're essential for maintaining development speed and keeping CI/CD pipelines running smoothly.

The Ripple Effect: How Global State Creates Testing Chaos

Testing code that interacts with global state can quickly become messy and unpredictable. When multiple tests share resources, environment variables, and configurations, changes in one test can unknowingly break others. This creates a ripple effect where fixing one issue leads to new failures elsewhere, making debugging feel like an endless game of whack-a-mole.

Shared Resources: The Mutable Menace

Consider a suite of tests that all access the same database. One test expects a record to contain certain data. If another test changed that data earlier, the first test fails - not because of a bug in the code, but because of shared state between tests. Take a shopping cart application as an example: Test A reduces an item's inventory count to zero while Test B tries to add that same item to a cart. Test B fails because Test A depleted the stock, even though both tests work fine in isolation.
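
One common remedy is to give every test its own freshly seeded data. The sketch below assumes hypothetical helpers (`createTestDb`, `seedItem`, `stockOf`) and shop functions; the point is the `beforeEach` reset, not the specific API.

```typescript
import { createTestDb } from "./testDb"; // hypothetical test database helper
import { addToCart, reduceInventory } from "./shop"; // hypothetical code under test

let db: ReturnType<typeof createTestDb>;

beforeEach(async () => {
  // Each test gets its own freshly seeded item, so a test that drains stock
  // can no longer break a test that adds the item to a cart.
  db = createTestDb();
  await db.seedItem({ sku: "mug-01", stock: 5 });
});

it("reduces inventory to zero", async () => {
  await reduceInventory(db, "mug-01", 5);
  expect(await db.stockOf("mug-01")).toBe(0);
});

it("adds an in-stock item to the cart", async () => {
  const cart = await addToCart(db, "mug-01");
  expect(cart.items).toContain("mug-01");
});
```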

Environment Variables: Hidden Dependencies

Environment variables pose a similar challenge. Tests often rely on specific environment settings, but those settings can change unexpectedly. For instance, Test A needs an environment variable pointing to a test server at localhost:3000. Test B comes along and changes that variable to point to a different server. Now Test A fails because it's hitting the wrong endpoint. These hidden dependencies between tests are particularly frustrating to track down since the connection isn't obvious in the code.
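
A simple guard is to snapshot and restore `process.env` around every test, as sketched below; the `API_URL` variable and its value are illustrative assumptions.

```typescript
// Snapshot the real environment once, then give each test a throwaway copy.
const ORIGINAL_ENV = process.env;

beforeEach(() => {
  // Work on a copy so mutations never touch the shared environment object.
  process.env = { ...ORIGINAL_ENV, API_URL: "http://localhost:3000" };
});

afterEach(() => {
  // Restore the original, so a test that rewrites API_URL cannot leak into
  // the next one.
  process.env = ORIGINAL_ENV;
});

it("talks to the local test server", () => {
  expect(process.env.API_URL).toBe("http://localhost:3000");
});
```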

Global Configurations: The Silent Saboteur

Global configurations like singleton objects wreak similar havoc. When one test modifies a global setting, it affects all subsequent tests that assume default values. A real-world example: Test A sets the logging level to DEBUG for detailed output. Test B, which expects minimal logging, then fails because it's flooded with log messages that slow down execution. Each test worked perfectly alone but conflicts when run together due to shared global state.
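
A lightweight fix is to restore the global default after every test. This sketch assumes a hypothetical `logger` singleton with a mutable `level` property.

```typescript
import { logger } from "./logger"; // hypothetical singleton

const defaultLevel = logger.level;

afterEach(() => {
  // Restore the default so a test that bumps verbosity to DEBUG cannot flood
  // the tests that run after it.
  logger.level = defaultLevel;
});

it("produces detailed output at DEBUG level", () => {
  logger.level = "debug";
  // ...assertions that rely on verbose logging...
});
```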

Identifying and Isolating Global State Dependencies

Finding and fixing these global state issues requires careful investigation of how tests interact. Tools that monitor state changes can help pinpoint where tests collide. More importantly, proper isolation techniques prevent the problems in the first place. Instead of letting tests share a real database, use mocks or test doubles that provide consistent, predictable responses. Rather than modifying actual environment variables, create isolated environments for each test. By keeping tests truly independent, you'll spend less time debugging mysterious failures and more time shipping reliable code.
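
For example, a Jest-style sketch of replacing an external payment call with a controlled test double might look like this; the `payments` and `checkout` modules are hypothetical.

```typescript
import { charge } from "./payments"; // hypothetical external dependency
import { checkout } from "./checkout"; // hypothetical code under test

jest.mock("./payments"); // replace the real module with an auto-mock

it("completes checkout when the charge succeeds", async () => {
  // The mock always behaves the same way, so no network flakiness can leak in.
  (charge as jest.Mock).mockResolvedValue({ status: "ok" });

  const result = await checkout({ total: 20 });
  expect(result.paid).toBe(true);
  expect(charge).toHaveBeenCalledWith(expect.objectContaining({ amount: 20 }));
});
```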

Inside Facebook's Battle Against Flaky Tests

When you have millions of tests running constantly like Facebook (now Meta) does, even a small percentage of flaky tests can cause major headaches. Flaky tests - those that sometimes pass and sometimes fail without code changes - can waste developers' time and erode trust in the testing process. To tackle this challenge at scale, Facebook developed the Probabilistic Flakiness Score (PFS), a smart way to measure and manage unreliable tests.

Understanding the Probabilistic Flakiness Score (PFS)

The PFS goes beyond simple pass/fail tracking to provide deeper insights into test reliability. It answers a key question: If a test fails, what's the likelihood it would have passed if run again with the exact same code and setup? By using conditional probability, the PFS helps Facebook's teams identify which tests are most prone to random failures. This data-driven approach allows them to focus their fixes where they'll have the biggest impact.
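
The exact PFS formula isn't reproduced here, but the underlying idea, a conditional probability over retries of the same code, can be sketched roughly like this:

```typescript
// Illustrative only, not Meta's actual PFS formula. The idea: of all runs
// that failed, how many passed when retried on identical code and setup?
// A high ratio suggests the failures were flaky rather than real.
interface RetriedFailure {
  passedOnRetry: boolean;
}

function flakinessScore(failures: RetriedFailure[]): number {
  if (failures.length === 0) return 0;
  const recovered = failures.filter((f) => f.passedOnRetry).length;
  // Approximates P(pass on retry | initial failure).
  return recovered / failures.length;
}

// Example: 8 of 10 recorded failures passed on retry -> score 0.8, likely flaky.
const history = Array(10)
  .fill(null)
  .map((_, i) => ({ passedOnRetry: i < 8 }));
console.log(flakinessScore(history)); // 0.8
```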

Key Metrics for Measuring Test Reliability

While the PFS is Facebook's main tool for managing flaky tests, they track several other important metrics:

  • Flakiness Rate: A straightforward measure of how often tests fail unexpectedly. This provides a quick way to spot problem areas in the test suite.
  • Test Run Time: Longer-running tests tend to be more prone to flakiness. Monitoring test duration helps identify tests that might need optimization.
  • Impact on Developer Workflow: Facebook looks at how flaky tests affect day-to-day work. They track time spent investigating and rerunning failed tests to understand the real cost of flakiness.

Implementing the PFS: Practical Considerations

While Facebook operates at massive scale, teams of any size can apply the principles behind the PFS. Here's how to adapt their approach:

  • Gather Historical Test Data: Look at past test runs to spot patterns in flaky behavior. This information forms the foundation for calculating flakiness scores.
  • Categorize Flaky Tests: Group tests by what causes them to fail - whether it's timing issues, database problems, or other factors. This helps you develop targeted solutions.
  • Prioritize Fixes Based on Impact: Focus first on fixing the flaky tests that block important work or slow down development. This gives you the best return on your testing efforts.

Building Robust Test Automation Frameworks

Facebook's success with flaky tests isn't just about measurement - it's also about prevention through good test design:

  • Test Isolation: Each test should run independently, without relying on other tests. This makes it easier to find and fix problems when tests fail.
  • Environment Consistency: Running tests in consistent environments reduces random failures caused by different setups. This helps eliminate "it works on my machine" problems.
  • Effective Use of Mocks and Stubs: Replace external dependencies with controlled test doubles to reduce unpredictable behavior in tests.

By combining smart measurement through the PFS with solid testing practices, Facebook keeps their massive test suite running smoothly. While most teams won't face challenges at Facebook's scale, these same principles can help any development team build more reliable tests and spend less time chasing flaky failures.

Learning From the Open Source Community

Open source communities offer invaluable insights into handling flaky tests, even without the resources of large companies like Facebook. By examining how these projects tackle test unreliability in the real world, teams of any size can learn practical solutions. The public nature of open source development - where code reviews and issue discussions happen in the open - creates unique opportunities to learn from both successes and failures.

Common Flaky Test Patterns in Open Source

Open source projects face distinct testing challenges due to their collaborative nature and diverse contributor base. Asynchronous operations often cause flaky tests, especially in projects involving network calls or multiple threads. For example, a test might check for an event triggered after a network request; if it doesn't properly wait for the event, it passes when the network is fast but fails under high latency. Another common issue arises from race conditions in multi-threaded code, where tests pass or fail unpredictably based on timing.

Successful Resolution Strategies: Real-World Examples

Open source projects have developed effective approaches to flaky tests that other teams can adopt. Many projects run tests across multiple environments to catch issues early, addressing the classic "works on my machine" problem at scale. Test isolation is another key strategy - using mocks and stubs to prevent tests from affecting each other. For instance, one open source database project fixed a flaky test by mocking database interactions rather than depending on specific configurations. This isolation eliminated the unstable dependency completely.

Balancing Test Reliability and Contribution Accessibility

Open source maintainers face a unique challenge: keeping tests reliable while staying welcoming to new contributors. Too-strict testing requirements can discourage participation, but loose standards lead to test instability. Successful projects address this by providing clear test-writing guidelines and documentation. They also use tools to track flaky tests and prioritize fixes based on impact. Some projects "quarantine" flaky tests - allowing them to run without blocking contributions while keeping them visible for future fixes. This balanced approach maintains project momentum while steadily improving test quality through both technical solutions and community engagement. These practices make open source projects valuable examples for any team working to write more reliable tests.

GitHub's Winning Strategy for Test Stability

At GitHub, managing flaky tests was a critical challenge that demanded immediate attention. Running a platform of GitHub's scale means that even a small percentage of unreliable tests can waste developer time and shake confidence in the test suite. Unlike isolated test failures that consistently point to real bugs, flaky tests that pass and fail unpredictably make it hard to trust test results.

The Three Pillars of GitHub's Approach

GitHub developed a practical solution based on three core elements: intelligent test retries, data analysis with Stan, and targeted developer alerts. Here's how each piece works together to catch and fix flaky tests:

  • Smart Retries: GitHub moved beyond simple test rerunning to create a smarter retry system. By studying how tests fail - whether with consistent or varying error messages - the system can tell which retries are worth attempting. For example, if a test always fails with the same error, it likely points to a real bug rather than flakiness. But if failures seem random, a retry makes sense. This selective approach saves time by focusing retries where they're most likely to help, similar to Facebook's test analysis but adapted for GitHub's needs.
  • Statistical Analysis with Stan: To better understand test reliability patterns, GitHub uses Stan, a statistical programming language. Stan helps model the probability that a test is flaky based on its past behavior. Consider a test that passes 9 out of 10 times - while a 90% success rate might seem good enough, Stan can reveal if that one failure is truly random or hints at deeper problems. This detailed analysis helps GitHub spot subtle patterns in test behavior and avoid false alarms. A simplified sketch of this kind of estimate appears after this list.
  • Targeted Developer Notifications: When the system identifies a flaky test, it alerts the right developers with specific details about failure patterns and flakiness probability. By sending focused notifications with actionable data, GitHub ensures flaky tests get fixed instead of being ignored as background noise. The system also ranks notifications by impact, so developers can tackle the most disruptive issues first.
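
GitHub's actual Stan models aren't shown here, but the flavor of this statistical analysis can be sketched with a simple Beta-Binomial estimate of a test's true pass rate; the 0.99 reliability threshold and the crude sampling approach are illustrative choices, not GitHub's.

```typescript
// With a uniform Beta(1, 1) prior, observing `passes` and `failures` gives a
// Beta(passes + 1, failures + 1) posterior over the test's true pass rate.
function posteriorPassRate(passes: number, failures: number) {
  const alpha = passes + 1;
  const beta = failures + 1;
  const mean = alpha / (alpha + beta);

  // How much posterior mass sits below a 0.99 pass rate, i.e. evidence that
  // the test is not fully reliable. Approximated by sampling.
  const samples = 10_000;
  let belowThreshold = 0;
  for (let i = 0; i < samples; i++) {
    if (sampleBeta(alpha, beta) < 0.99) belowThreshold++;
  }
  return { mean, probNotReliable: belowThreshold / samples };
}

// Beta sample via two Gamma draws; fine for integer shape parameters.
function sampleBeta(a: number, b: number): number {
  const x = sampleGammaInt(a);
  const y = sampleGammaInt(b);
  return x / (x + y);
}

// Gamma(k, 1) for integer k is a sum of k exponential draws.
function sampleGammaInt(k: number): number {
  let sum = 0;
  for (let i = 0; i < k; i++) sum += -Math.log(1 - Math.random());
  return sum;
}

// A test that passed 9 of 10 runs: posterior mean is about 0.83, and nearly
// all of the posterior mass sits below a 0.99 pass rate.
console.log(posteriorPassRate(9, 1));
```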

From Theory to Practice: Implementation and Results

After putting these techniques into action, GitHub saw major improvements in test stability. Their previously high rate of flaky tests dropped significantly. The combination of smart retries, statistical modeling, and targeted alerts helped them focus on real issues instead of chasing false failures. As a result, developers now spend less time debugging tests and have more confidence in their test results. GitHub's success shows how a systematic, data-driven approach can effectively tackle test flakiness, even in large, complex codebases.

Your Action Plan for Flaky Test Prevention

After examining how flaky tests impact teams from startups to giants like Facebook and GitHub, let's create a practical plan to prevent and fix these testing challenges. This guide will help you build more stable and reliable tests.

Building a Foundation of Reliable Tests

The key to eliminating flakiness starts with writing solid tests from day one. Here are essential best practices to follow:

  • Atomic Tests: Keep each test focused on checking one specific thing. Just like building with Lego blocks, smaller independent tests are easier to fix when something breaks. For example, instead of cramming login and profile updates into one test, split them into separate ones. When a test fails, you'll know exactly which part had issues.
  • Explicit Dependencies: Make it crystal clear what each test needs to run - whether that's external services, test data, or specific settings. Think of it like a cooking recipe that lists every ingredient upfront. No surprises means fewer random failures.
  • Data Management: Take control of your test data. Create fresh, predictable data for each test using fixtures or factory methods instead of sharing data between tests. It's like giving each chef their own prep station - no mix-ups, no contamination, just clean results every time.
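
As a sketch of that data-management point, a small factory function can give each test its own predictable record; the `User` shape and defaults here are assumptions for illustration.

```typescript
// Factory that builds a fresh, predictable user for every test.
interface User {
  id: string;
  email: string;
  active: boolean;
}

let counter = 0;

function buildUser(overrides: Partial<User> = {}): User {
  counter += 1;
  return {
    id: `user-${counter}`,
    email: `user-${counter}@example.test`,
    active: true,
    ...overrides,
  };
}

it("deactivating a user does not affect other tests", () => {
  const user = buildUser({ active: false }); // this test's own copy
  expect(user.active).toBe(false);
});
```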

Handling Timing Issues and Asynchronous Operations

Async code often causes flaky tests. These strategies will help you avoid common timing problems:

  • Proper Waits and Timeouts: Set clear wait times for async operations to finish before checking results. Like setting a kitchen timer when baking, this ensures you check at the right moment - not too early or late. This is especially important for API calls that return data at varying speeds. A small wait-helper sketch follows this list.
  • Synchronization Mechanisms: For code running in parallel, use proper sync tools (locks, semaphores) to prevent race conditions. Think of it as traffic lights managing busy intersections - keeping everything flowing smoothly without crashes.
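
Here is a small sketch of an explicit wait with a hard timeout; real test libraries ship their own `waitFor`-style helpers, so treat this hand-rolled version purely as an illustration of the idea.

```typescript
// Poll a condition until it holds, failing loudly once the timeout expires,
// instead of sleeping for a fixed amount of time and hoping it was enough.
async function waitUntil(
  condition: () => boolean,
  timeoutMs = 2000,
  intervalMs = 50
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

it("shows the result once the async job finishes", async () => {
  const job = { done: false };
  setTimeout(() => (job.done = true), 100); // stand-in for real async work

  await waitUntil(() => job.done); // checks every 50ms, fails fast at 2s
  expect(job.done).toBe(true);
});
```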

Implementing Effective Isolation

Keeping tests separate prevents them from interfering with each other:

  • Mocking and Stubbing: Replace external systems with predictable test doubles. This prevents flakiness from things like network issues or service outages. For instance, swap out real database calls with mocks that always behave consistently.
  • Environment Control: Run tests in clean, consistent environments. Using containers or VMs gives each test run a fresh start, avoiding the classic "it works on my machine" problem.

Continuous Improvement and Monitoring

Preventing flaky tests requires ongoing attention:

  • Flaky Test Detection and Reporting: Use tools to automatically spot and track flaky tests. Monitor failure rates to focus on fixing the most disruptive issues first.
  • Regular Test Reviews: Just like code reviews improve code quality, reviewing tests helps catch potential flakiness before it becomes a problem.
  • Root Cause Analysis: When you find a flaky test, dig deep to understand why it's unstable. Fix the core issue, not just the symptoms.

Practical Checklist For Evaluating Your Existing Test Suite

  • Test Isolation: Are tests independent of each other? Do they share state or resources?
  • Environment Consistency: Do tests run in consistent environments across different machines and CI systems?
  • Asynchronous Operations: Are asynchronous operations handled correctly with proper waits and timeouts?
  • Test Data Management: Is test data created and managed effectively? Is it predictable and consistent?
  • Dependency Management: Are all test dependencies explicitly declared and managed?

By following these practices, you'll build a more reliable test suite that gives your team confidence in your code. Want to improve your CI/CD process and reduce flaky tests? Check out Mergify for smooth, reliable code integration.