How to Fix Flaky Tests — Complete Diagnosis Guide (2026)

Nishtha Chauhan | 12 min read

What Are Flaky Tests?

A flaky test is a test that produces inconsistent results — passing on one run and failing on another — without any change to the underlying code. Flaky tests are one of the most damaging problems in modern software development pipelines. They erode developer trust, slow down CI/CD pipelines, hide real bugs, and waste hours of engineering time.

According to Google's engineering research, flaky tests are present in virtually every large codebase at scale. Their internal data showed that roughly 1 in 7 tests in large repositories eventually becomes flaky. For teams running thousands of tests per day, even a 1% flakiness rate creates constant noise and delays.

The key characteristic that defines a flaky test: non-determinism. The test outcome depends on something other than the code under test — timing, environment state, network availability, external services, or test execution order.


Why Flaky Tests Are a Critical Problem

Before diving into diagnosis and fixes, it's worth understanding why flaky tests deserve serious attention:

  • They mask real failures.

    When tests flicker, developers begin ignoring failures. A genuine regression can slip through because the team assumed the failure was "just flakiness."

  • They slow CI/CD pipelines.

    Retrying flaky tests on every build adds minutes or hours to your deployment cycle.

  • They destroy team morale.

    Nothing is more frustrating than a red build that magically turns green on re-run.

  • They increase infrastructure costs.

    Unnecessary retries consume compute time, especially on cloud CI platforms billed by the minute.

  • They indicate deeper architectural problems.

    A high flakiness rate is often a symptom of poor test isolation, bad resource management, or fragile external dependencies.

The Flaky Test Diagnosis Framework

Fixing flaky tests starts with a systematic diagnosis. Randomly patching tests without understanding the root cause leads to recurring failures. Use this five-step diagnosis framework:

Step 1 — Reproduce the Flakiness

You cannot fix what you cannot reproduce. Run the test in isolation repeatedly to confirm it is actually flaky:

# Run a specific test 50 times to surface flakiness
for i in $(seq 1 50); do pytest tests/test_payment.py::test_charge_card; done

Tools like pytest-repeat and Go's -count flag help stress-test individual tests, and Jest's --runInBand removes worker parallelism as a variable while you investigate. If the test only fails in CI but never locally, the environment difference itself is a clue — focus on environment parity.
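With the pytest-repeat plugin installed, the same stress test can be expressed in code rather than a shell loop; a minimal sketch:

import pytest

# Requires the pytest-repeat plugin; runs this test 50 times in one session
@pytest.mark.repeat(50)
def test_charge_card():
    ...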

Step 2 — Classify the Failure Type

Log every failure message carefully. Group failures by error type:

  • TimeoutError → likely async/timing issue

  • AssertionError on shared state → likely test order dependency

  • ConnectionRefusedError → likely external service or port conflict

  • StaleElementReferenceException → likely UI/browser automation race condition

  • Random data mismatch → likely missing seed or random value

Classification narrows your root cause search dramatically.
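If you have a pile of CI failure logs, even a small script can do the grouping. A sketch, assuming failures were exported to a text log (the file name and regex are illustrative):

import collections
import re

failure_counts = collections.Counter()
with open("ci_failures.log") as f:  # illustrative log export
    for line in f:
        match = re.search(r"\b(\w+(?:Error|Exception))\b", line)
        if match:
            failure_counts[match.group(1)] += 1

# Print the most common failure types first
for error_type, count in failure_counts.most_common():
    print(f"{error_type}: {count}")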

Step 3 — Isolate the Test

Run the test in complete isolation — separate process, fresh database, no shared singletons. If the test passes in isolation but fails in suite, you have a test order dependency or shared state pollution problem.

Step 4 — Add Verbose Logging and Timestamps

Insert timestamps, thread IDs, and intermediate state logging before each assertion. Flakiness often leaves a trace — a race condition shows up as a millisecond difference, a resource leak shows up as a gradually degrading value.
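A minimal sketch of that instrumentation (submit_job and poll_status are illustrative placeholders for whatever your test exercises):

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s.%(msecs)03d [%(threadName)s] %(message)s",
    datefmt="%H:%M:%S",
)

def test_job_completes():
    logging.debug("submitting job")
    job = submit_job()  # placeholder for the code under test
    logging.debug("job id=%s, polling for completion", job.id)
    status = poll_status(job.id)  # placeholder helper
    logging.debug("observed status=%r", status)
    assert status == "done"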

Step 5 — Check Environmental Factors

Compare CI and local environments: OS, timezone, locale, CPU count, available memory, and network latency. Many flaky tests are deterministic on a developer's MacBook but non-deterministic on a 2-core Linux CI container.

Root Causes of Flaky Tests — With Fixes

1. Async and Timing Issues (Most Common)

Problem: Tests that rely on sleep(), fixed delays, or assume operations complete within a certain time window are inherently fragile. A slow CI machine, a garbage collection pause, or a noisy neighbor in a shared cloud environment can violate these assumptions.

Symptoms:

  • TimeoutError on async operations

  • Tests fail intermittently under load

  • Failures more frequent on CI than locally

Fix — Use Explicit Waits and Polling:

# BAD: Fixed sleep is fragile
import time

time.sleep(2)
assert db.record_exists(record_id)

# GOOD: Poll until the condition is met or the timeout expires.
# tenacity retries on any exception, including the AssertionError below.
import tenacity

@tenacity.retry(stop=tenacity.stop_after_delay(10), wait=tenacity.wait_fixed(0.2))
def wait_for_record(record_id):
    assert db.record_exists(record_id)

wait_for_record(record_id)

For browser automation (Selenium, Playwright, Cypress), always use explicit waits (waitForElement, waitUntil) instead of sleep. Cypress's built-in retry-ability handles most async UI flakiness out of the box.
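As a concrete example, a minimal Playwright (Python) sketch; the URL and selectors are illustrative:

from playwright.sync_api import sync_playwright, expect

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/checkout")  # illustrative URL
    page.click("#pay-button")
    # expect() polls until the condition holds or its timeout expires;
    # no sleep() needed
    expect(page.locator("#receipt")).to_be_visible(timeout=10_000)
    browser.close()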

For async JavaScript tests, always await every Promise and avoid fire-and-forget patterns in test setup:

// BAD
beforeEach(() => {
  db.seed(); // returns a Promise but is not awaited
});

// GOOD
beforeEach(async () => {
  await db.seed();
});

2. Test Order Dependency and Shared State

Problem: Tests that depend on a specific execution order, or that leave side effects (data in a database, files on disk, global variables) for subsequent tests, are a major source of flakiness. Most test runners do not guarantee execution order, and parallel execution makes this worse.

Symptoms:

  • Test passes alone but fails in suite

  • Failures depend on which tests ran before

  • Randomizing test order (--randomly-seed) surfaces failures

Fix — Enforce Test Isolation:

Every test must own its setup and teardown. Use transactions that are rolled back after each test, temporary directories, and in-memory databases:

# Django example — wrap each test in a transaction
from django.test import TestCase  # rolls the DB back after each test

class PaymentTest(TestCase):
    def setUp(self):
        self.user = User.objects.create(email="test@example.com")

    def test_charge(self):
        result = charge(self.user, 100)
        self.assertTrue(result.success)
        # DB is rolled back automatically after each test

For global state (singletons, module-level caches), use mocks or explicit reset functions in tearDown:

def tearDown(self):
    cache.clear()
    config.reset_to_defaults()

3. Race Conditions in Concurrent Code

Problem: Tests for concurrent or multi-threaded code frequently exhibit race conditions. The test interleaves thread execution differently on each run.

Symptoms:

  • Failures in tests that exercise queues, workers, or async event processing

  • Inconsistent counts, unexpected None values, partial writes

Fix — Control Concurrency in Tests:

Use barriers, semaphores, or mock executors to make concurrent operations deterministic:

# Use a ThreadPoolExecutor with a controlled thread count
# and join all futures before asserting
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_task, t) for t in tasks]
    results = [f.result() for f in futures]  # .result() blocks until done

assert len(results) == len(tasks)

In Go, use sync.WaitGroup and ensure all goroutines complete before assertions. Never assert on goroutine output without synchronization.
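The same idea in pure Python: a threading.Barrier makes the interleaving reproducible instead of timing-dependent. A minimal sketch (worker and its values are illustrative):

import threading

barrier = threading.Barrier(2)  # both workers must arrive before either proceeds
results = []
lock = threading.Lock()

def worker(value):
    barrier.wait()  # synchronize the start of the critical section
    with lock:
        results.append(value)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # join before asserting, so there is no race on results

assert sorted(results) == [0, 1]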

4. External Dependencies — Network, APIs, and Databases

Problem: Tests that call real external services (third-party APIs, payment gateways, email providers) are non-deterministic by definition. Network latency varies, rate limits kick in, external services have their own outages.

Symptoms:

  • Tests only fail at certain times of day

  • Failures correlate with CI machine network latency spikes

  • Error messages reference timeouts or HTTP 429/503

Fix — Mock or Stub External Dependencies:

# Using unittest.mock to stub HTTP calls
from unittest.mock import patch, MagicMock

@patch("myapp.payments.stripe.charge")
def test_process_payment(mock_charge):
    mock_charge.return_value = MagicMock(id="ch_123", status="succeeded")
    result = process_payment(user_id=1, amount=5000)
    assert result.transaction_id == "ch_123"
    mock_charge.assert_called_once_with(amount=5000, currency="usd")

For integration tests that must use real services, use contract testing (e.g., Pact) to verify API compatibility without live calls in every run.
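Where patching the client object is awkward, you can stub at the HTTP layer instead. A sketch using the responses library (the endpoint URL is illustrative):

import requests
import responses

@responses.activate
def test_fetch_exchange_rate():
    responses.add(
        responses.GET,
        "https://api.example.com/rates/usd",  # illustrative endpoint
        json={"rate": 1.08},
        status=200,
    )
    resp = requests.get("https://api.example.com/rates/usd", timeout=5)
    assert resp.json()["rate"] == 1.08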

5. Random and Non-Deterministic Data

Problem: Tests that generate random data without seeding the random number generator produce different inputs on each run. A bug that only manifests for certain input values will cause intermittent failures.

Fix — Seed All Random Number Generators:

import random

import numpy as np
import pytest
from faker import Faker

@pytest.fixture(autouse=True)
def seed_random():
    random.seed(42)
    np.random.seed(42)  # for numpy
    Faker.seed(42)      # for factories (factory_boy delegates to Faker)
    yield

Log the seed value at the start of each test run. When a failure is reported, testers can reproduce it exactly by re-using the same seed.
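A minimal sketch of that pattern; the TEST_SEED variable name is an illustrative convention, not a standard:

import os
import random
import time

# Reuse an externally supplied seed when reproducing; otherwise pick one and log it
seed = int(os.environ.get("TEST_SEED", str(int(time.time()))))
print(f"test run seed: {seed}")  # shows up in the test log for later reproduction
random.seed(seed)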

6. File System and Resource Leaks

Problem: Tests that write to shared paths (/tmp/output.csv), leave open file handles, or fail to clean up ports and sockets cause conflicts when tests run in parallel.

Fix — Use Temporary Directories:

import os
import tempfile

import pytest

@pytest.fixture
def tmp_dir():
    with tempfile.TemporaryDirectory() as d:
        yield d

def test_export_csv(tmp_dir):
    output_path = os.path.join(tmp_dir, "output.csv")
    export_data(output_path)
    assert os.path.exists(output_path)
    # tmp_dir is deleted automatically after the test

For port conflicts in server tests, use port=0 to let the OS assign a free port, or use tools like pytest-asyncio with isolated event loops.
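A quick sketch of the port-0 trick with a plain socket; the same idea applies to any test server:

import socket

# Bind to port 0 and let the OS pick a free port, so parallel
# test workers never fight over a hard-coded port number
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))
    free_port = s.getsockname()[1]

print(f"server under test can listen on port {free_port}")

Better still, have the server under test bind port 0 itself and report the port it received; probing for a free port and binding it later leaves a small window for another process to grab it.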

7. Timezone and Locale Sensitivity

Problem: Tests that compare dates, times, or locale-formatted strings often break in CI environments configured with a different timezone or locale than the developer's machine.

Fix — Always Use UTC and Explicit Locale:

# Freeze time for date-dependent tests
from freezegun import freeze_time

@freeze_time("2025-01-15 12:00:00")
def test_invoice_due_date():
    invoice = create_invoice(terms_days=30)
    assert invoice.due_date == date(2025, 2, 14)

Set TZ=UTC explicitly in your CI environment configuration and use timezone-aware datetime objects throughout your codebase.
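For example, timezone-aware values compare by absolute instant, so assertions behave identically on any machine (a minimal sketch):

from datetime import datetime, timezone

created_at = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)  # aware, unambiguous
now = datetime.now(timezone.utc)

# Comparing two aware datetimes is TZ-independent; mixing naive and
# aware values raises TypeError, which catches mistakes early
assert created_at <= now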

Flaky Test Detection Tools and Strategies

Automatic Flakiness Detection in CI

Modern CI platforms support built-in flakiness detection:

  • GitHub Actions: Use test result annotations and re-run only failed jobs

  • BuildKite / CircleCI: Flaky test dashboards with historical pass rates

  • pytest-flakefinder: Runs each test multiple times to detect flakiness locally

  • Jest --detectOpenHandles: Surfaces async resource leaks in JavaScript

  • Gradle: The test-retry plugin reruns failing tests to separate flaky tests from genuine failures in JVM projects

Quarantining Flaky Tests

When a test is confirmed flaky but cannot be fixed immediately, quarantine it rather than deleting it or ignoring it silently:

@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_background_job_completes():
    # Known flaky — ticket #4521 tracks the fix
    ...

Use pytest-rerunfailures to auto-retry flaky tests a fixed number of times. Track quarantined tests in a dedicated dashboard and enforce a policy that quarantined tests must be fixed within a sprint.

Flaky Tests in CI/CD Pipelines — Best Practices

  1. Track flakiness rate as a metric. Measure and alert on tests with a pass rate below 99%.

  2. Never merge code that increases the flakiness rate.

  3. Run tests in parallel with randomized order (pytest-randomly, jest --randomize) to surface hidden dependencies.

  4. Use test sharding to distribute load and reduce timeout-related flakiness.

  5. Enforce hermetic builds — tests should not depend on external network calls in unit/integration suites.

  6. Separate slow and fast tests — run fast unit tests on every commit, reserve slow integration tests for pre-merge.

  7. Log test durations — a test that suddenly takes 3x longer is about to become flaky (see the sketch after this list).
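pytest already reports the slowest tests via --durations=N. For an always-on duration budget, a small conftest.py hook works; the 2-second threshold here is an illustrative choice:

# conftest.py
import time

import pytest

SLOW_THRESHOLD_SECONDS = 2.0  # illustrative budget; tune per suite

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_call(item):
    start = time.monotonic()
    yield
    elapsed = time.monotonic() - start
    if elapsed > SLOW_THRESHOLD_SECONDS:
        # Flag tests drifting toward their timeouts before they turn flaky
        print(f"\nSLOW TEST: {item.nodeid} took {elapsed:.2f}s")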

Flaky Test Prevention: Writing Reliable Tests from the Start

The best way to fix flaky tests is to prevent them. Adopt these principles when writing new tests:

| Principle | Implementation |
| --- | --- |
| Determinism | Seed all randomness; freeze time; mock external calls |
| Isolation | Each test creates and destroys its own state |
| Idempotency | Tests can be run any number of times with the same result |
| Speed | Fast tests reduce reliance on timeouts |
| Specificity | Assert on exact, known values — not ranges or approximations |
| Hermeticity | No network calls, no shared global state, no file system side effects |

Summary — Flaky Test Diagnosis Checklist

Use this checklist when investigating a flaky test:

  • Can you reproduce the flakiness by running the test 50+ times?

  • Does the test pass when run in isolation (single test, fresh process)?

  • Does randomizing test execution order change the failure rate?

  • Are there any sleep() or fixed delay calls in the test or its fixtures?

  • Does the test make real network calls or depend on external services?

  • Are random values seeded consistently?

  • Are all async operations properly awaited?

  • Are temporary files, ports, and database records cleaned up after the test?

  • Is the test sensitive to system timezone or locale?

  • Does the test involve multi-threading or concurrent code without proper synchronization?

Conclusion

Flaky tests are not a minor inconvenience — they are a systemic risk to software quality and delivery speed. The path to fixing them is the same as fixing any complex bug: reproduce reliably, classify the failure, isolate the cause, and apply a targeted fix.

The root causes almost always fall into one of the categories covered in this guide: timing issues, shared state, external dependencies, non-determinism, or environment differences. A disciplined diagnosis process, combined with strong test isolation practices and automated flakiness detection in CI, will eliminate the vast majority of flaky test problems in your codebase.

Invest the time to fix flaky tests properly. Your CI pipeline, your team's morale, and your users' confidence in your releases all depend on it.