How to Fix Flaky Tests — Complete Diagnosis Guide (2026)

- What Are Flaky Tests?
- Why Flaky Tests Are a Critical Problem
- The Flaky Test Diagnosis Framework
- Root Causes of Flaky Tests — With Fixes
- Flaky Test Detection Tools and Strategies
- Flaky Tests in CI/CD Pipelines — Best Practices
- Flaky Test Prevention: Writing Reliable Tests from the Start
- Summary — Flaky Test Diagnosis Checklist
- Conclusion
What Are Flaky Tests?
A flaky test is a test that produces inconsistent results — passing on one run and failing on another — without any change to the underlying code. Flaky tests are one of the most damaging problems in modern software development pipelines. They erode developer trust, slow down CI/CD pipelines, hide real bugs, and waste hours of engineering time.
According to Google's engineering research, flaky tests are present in virtually every large codebase at scale. Their internal data showed that roughly 1 in 7 tests in large repositories eventually becomes flaky. For teams running thousands of tests per day, even a 1% flakiness rate creates constant noise and delays.
The key characteristic that defines a flaky test: non-determinism. The test outcome depends on something other than the code under test — timing, environment state, network availability, external services, or test execution order.

Why Flaky Tests Are a Critical Problem
Before diving into diagnosis and fixes, it's worth understanding why flaky tests deserve serious attention:
They mask real failures.
When tests flicker, developers begin ignoring failures. A genuine regression can slip through because the team assumed the failure was "just flakiness."
They slow CI/CD pipelines.
Retrying flaky tests on every build adds minutes or hours to your deployment cycle.
They destroy team morale.
Nothing is more frustrating than a red build that magically turns green on re-run.
They increase infrastructure costs.
Unnecessary retries consume compute time, especially on cloud CI platforms billed by the minute.
They indicate deeper architectural problems.
A high flakiness rate is often a symptom of poor test isolation, bad resource management, or fragile external dependencies.
The Flaky Test Diagnosis Framework
Fixing flaky tests starts with a systematic diagnosis. Randomly patching tests without understanding the root cause leads to recurring failures. Use this five-step diagnosis framework:
Step 1 — Reproduce the Flakiness
You cannot fix what you cannot reproduce. Run the test in isolation repeatedly to confirm it is actually flaky:
```shell
# Run a specific test 50 times to surface flakiness
for i in $(seq 1 50); do pytest tests/test_payment.py::test_charge_card; done
```
Tools like pytest-repeat and Go's -count flag help stress-test individual tests; running Jest serially with --runInBand helps rule out parallel execution as the cause. If the test only fails in CI but never locally, the environment difference itself is a clue — focus on environment parity.
Step 2 — Classify the Failure Type
Log every failure message carefully. Group failures by error type:
- TimeoutError → likely async/timing issue
- AssertionError on shared state → likely test order dependency
- ConnectionRefusedError → likely external service or port conflict
- StaleElementReferenceException → likely UI/browser automation race condition
- Random data mismatch → likely unseeded random value
Classification narrows your root cause search dramatically.
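As a sketch, the grouping step can be automated with a few lines of Python. The failure messages below are invented for illustration; in practice you would feed in messages scraped from CI logs:

```python
from collections import Counter

def classify_failures(failure_messages):
    """Group raw failure messages by their leading exception name."""
    counts = Counter()
    for msg in failure_messages:
        # The exception name is typically the first token, e.g. "TimeoutError: ..."
        exc_name = msg.split(":", 1)[0].strip()
        counts[exc_name] += 1
    return counts

failures = [
    "TimeoutError: operation timed out after 30s",
    "TimeoutError: operation timed out after 30s",
    "AssertionError: expected 3 rows, got 2",
]
print(classify_failures(failures))  # Counter({'TimeoutError': 2, 'AssertionError': 1})
```

Even a rough histogram like this tells you immediately whether you are chasing one root cause or several.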
Step 3 — Isolate the Test
Run the test in complete isolation — separate process, fresh database, no shared singletons. If the test passes in isolation but fails in suite, you have a test order dependency or shared state pollution problem.
Step 4 — Add Verbose Logging and Timestamps
Insert timestamps, thread IDs, and intermediate state logging before each assertion. Flakiness often leaves a trace — a race condition shows up as a millisecond difference, a resource leak shows up as a gradually degrading value.
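One way to do this in Python is a logging format that includes millisecond timestamps and thread names. The checkout function below is a hypothetical stand-in for your code under test:

```python
import logging

# Millisecond timestamps and thread names make interleavings visible in test logs
logging.basicConfig(
    format="%(asctime)s.%(msecs)03d [%(threadName)s] %(message)s",
    datefmt="%H:%M:%S",
    level=logging.DEBUG,
)
log = logging.getLogger(__name__)

def checkout(cart_id):
    # Hypothetical operation under test; log before and after each step
    log.debug("starting checkout for cart %s", cart_id)
    result = {"cart": cart_id, "status": "ok"}  # placeholder for real work
    log.debug("finished checkout for cart %s", cart_id)
    return result
```

With this format, two log lines a few milliseconds apart in the wrong order are usually all the evidence a race condition leaves behind.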
Step 5 — Check Environmental Factors
Compare CI and local environments: OS, timezone, locale, CPU count, available memory, and network latency. Many flaky tests are deterministic on a developer's MacBook but non-deterministic on a 2-core Linux CI container.
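A small diagnostic snippet printed at the start of a CI run makes these comparisons concrete; the set of fields below is a suggested starting point:

```python
import locale
import os
import platform
import time

# Capture the environment facts that most often differ between CI and local runs
env_report = {
    "os": platform.platform(),
    "python": platform.python_version(),
    "cpu_count": os.cpu_count(),
    "timezone": time.tzname,
    "locale": locale.getlocale(),
    "TZ": os.environ.get("TZ"),
}
for key, value in env_report.items():
    print(f"{key}: {value}")
```

Diffing this report between a passing local run and a failing CI run often points straight at the culprit.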
Root Causes of Flaky Tests — With Fixes
1. Async and Timing Issues (Most Common)
Problem: Tests that rely on sleep(), fixed delays, or assume operations complete within a certain time window are inherently fragile. A slow CI machine, a garbage collection pause, or a noisy neighbor in a shared cloud environment can violate these assumptions.
Symptoms:
- TimeoutError on async operations
- Tests fail intermittently under load
- Failures more frequent on CI than locally
Fix — Use Explicit Waits and Polling:
```python
# BAD: fixed sleep is fragile
import time

time.sleep(2)
assert db.record_exists(id)

# GOOD: poll until the condition is met or a timeout expires
import tenacity

@tenacity.retry(stop=tenacity.stop_after_delay(10), wait=tenacity.wait_fixed(0.2))
def wait_for_record(id):
    assert db.record_exists(id)

wait_for_record(record_id)
```
For browser automation (Selenium, Playwright, Cypress), always use explicit waits (waitForElement, waitUntil) instead of sleep. Cypress's built-in retry-ability handles most async UI flakiness out of the box.
For async JavaScript tests, always await every Promise and avoid fire-and-forget patterns in test setup:
```javascript
// BAD
beforeEach(() => {
  db.seed(); // returns a Promise but is not awaited
});

// GOOD
beforeEach(async () => {
  await db.seed();
});
```
2. Test Order Dependency and Shared State
Problem: Tests that depend on a specific execution order, or that leave side effects (data in a database, files on disk, global variables) for subsequent tests, are a major source of flakiness. Most test runners do not guarantee execution order, and parallel execution makes this worse.
Symptoms:
- Test passes alone but fails in suite
- Failures depend on which tests ran before
- Randomizing test order (--randomly-seed) surfaces failures
Fix — Enforce Test Isolation:
Every test must own its setup and teardown. Use transactions that are rolled back after each test, temporary directories, and in-memory databases:
```python
# Django example — wrap each test in a transaction
from django.test import TestCase  # automatically rolls back the DB after each test

class PaymentTest(TestCase):
    def setUp(self):
        self.user = User.objects.create(email="test@example.com")

    def test_charge(self):
        result = charge(self.user, 100)
        self.assertTrue(result.success)
    # DB is rolled back automatically after each test
```
For global state (singletons, module-level caches), use mocks or explicit reset functions in tearDown:
```python
def tearDown(self):
    cache.clear()
    config.reset_to_defaults()
```
3. Race Conditions in Concurrent Code
Problem: Tests for concurrent or multi-threaded code frequently exhibit race conditions. The test interleaves thread execution differently on each run.
Symptoms:
- Failures in tests that exercise queues, workers, or async event processing
- Inconsistent counts, unexpected None values, partial writes
Fix — Control Concurrency in Tests:
Use barriers, semaphores, or mock executors to make concurrent operations deterministic:
```python
from concurrent.futures import ThreadPoolExecutor

# Use a ThreadPoolExecutor with a controlled thread count
# and join all futures before asserting
with ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_task, t) for t in tasks]
    results = [f.result() for f in futures]  # .result() blocks until done

assert len(results) == len(tasks)
```
In Go, use sync.WaitGroup and ensure all goroutines complete before assertions. Never assert on goroutine output without synchronization.
4. External Dependencies — Network, APIs, and Databases
Problem: Tests that call real external services (third-party APIs, payment gateways, email providers) are non-deterministic by definition. Network latency varies, rate limits kick in, external services have their own outages.
Symptoms:
- Tests only fail at certain times of day
- Failures correlate with CI machine network latency spikes
- Error messages reference timeouts or HTTP 429/503
Fix — Mock or Stub External Dependencies:
```python
# Using unittest.mock to stub HTTP calls
from unittest.mock import patch, MagicMock

@patch("myapp.payments.stripe.charge")
def test_process_payment(mock_charge):
    mock_charge.return_value = MagicMock(id="ch_123", status="succeeded")
    result = process_payment(user_id=1, amount=5000)
    assert result.transaction_id == "ch_123"
    mock_charge.assert_called_once_with(amount=5000, currency="usd")
```
For integration tests that must use real services, use contract testing (e.g., Pact) to verify API compatibility without live calls in every run.
5. Random and Non-Deterministic Data
Problem: Tests that generate random data without seeding the random number generator produce different inputs on each run. A bug that only manifests for certain input values will cause intermittent failures.
Fix — Seed All Random Number Generators:
```python
import random

import numpy as np
import pytest
from faker import Faker

@pytest.fixture(autouse=True)
def seed_random():
    random.seed(42)      # stdlib random
    np.random.seed(42)   # numpy
    Faker.seed(42)       # factories (factory_boy / Faker)
    yield
```
Log the seed value at the start of each test run. When a failure is reported, testers can reproduce it exactly by re-using the same seed.
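A minimal sketch of seed logging and replay; the helper name and the generated values are illustrative:

```python
import random

def run_with_seed(seed=None):
    """Seed the RNG and report the seed so any failure can be replayed exactly."""
    if seed is None:
        seed = random.randrange(2**32)
    print(f"test run seed: {seed}")  # surface this in CI output
    random.seed(seed)
    return [random.randint(0, 9) for _ in range(5)]

first = run_with_seed(1234)
replay = run_with_seed(1234)  # the same seed reproduces the same inputs
assert first == replay
```

When a CI failure report includes the printed seed, anyone can rerun the test with that seed and get the exact inputs that triggered the bug.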
6. File System and Resource Leaks
Problem: Tests that write to shared paths (/tmp/output.csv), leave open file handles, or fail to clean up ports and sockets cause conflicts when tests run in parallel.
Fix — Use Temporary Directories:
```python
import os
import tempfile

import pytest

@pytest.fixture
def tmp_dir():
    with tempfile.TemporaryDirectory() as d:
        yield d
    # tmp_dir is deleted automatically after the test

def test_export_csv(tmp_dir):
    output_path = os.path.join(tmp_dir, "output.csv")
    export_data(output_path)
    assert os.path.exists(output_path)
```
For port conflicts in server tests, use port=0 to let the OS assign a free port, or use tools like pytest-asyncio with isolated event loops.
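For example, a small helper can ask the OS for a free port before starting a test server. There is a brief window between closing the probe socket and binding the server, but this is far safer than hard-coded ports:

```python
import socket

def get_free_port():
    """Ask the OS for an ephemeral port instead of hard-coding one."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 = let the OS choose
        return s.getsockname()[1]

port = get_free_port()
print(f"test server can bind to port {port}")
```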
7. Timezone and Locale Sensitivity
Problem: Tests that compare dates, times, or locale-formatted strings often break in CI environments configured with a different timezone or locale than the developer's machine.
Fix — Always Use UTC and Explicit Locale:
```python
# Freeze time for date-dependent tests
from datetime import date

from freezegun import freeze_time

@freeze_time("2025-01-15 12:00:00")
def test_invoice_due_date():
    invoice = create_invoice(terms_days=30)
    assert invoice.due_date == date(2025, 2, 14)
```
Set TZ=UTC explicitly in your CI environment configuration and use timezone-aware datetime objects throughout your codebase.
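The difference between naive and timezone-aware datetimes can be seen directly in Python's standard library:

```python
from datetime import datetime, timezone

# A naive "now" depends on the machine's timezone; an aware UTC value does not
naive = datetime.now()              # varies with the local TZ: avoid in tests
aware = datetime.now(timezone.utc)  # the same instant on every machine

assert naive.tzinfo is None
assert aware.tzinfo is not None
assert aware.utcoffset().total_seconds() == 0
```

Assertions built on aware UTC values produce the same result on a developer laptop in Tokyo and a CI container set to UTC.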
Flaky Test Detection Tools and Strategies
Automatic Flakiness Detection in CI
Modern CI platforms support built-in flakiness detection:
- GitHub Actions: use test result annotations and re-run only failed jobs
- BuildKite / CircleCI: flaky test dashboards with historical pass rates
- pytest-flakefinder: runs each test multiple times to detect flakiness locally
- Jest --detectOpenHandles: surfaces async resource leaks in JavaScript
- Gradle Test Retry plugin: retries failed tests and flags those that pass on retry as flaky in JVM projects
Quarantining Flaky Tests
When a test is confirmed flaky but cannot be fixed immediately, quarantine it rather than deleting it or ignoring it silently:
```python
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_background_job_completes():
    # Known flaky — ticket #4521 tracks the fix
    ...
```
Use pytest-rerunfailures to auto-retry flaky tests a fixed number of times. Track quarantined tests in a dedicated dashboard and enforce a policy that quarantined tests must be fixed within a sprint.
Flaky Tests in CI/CD Pipelines — Best Practices
- Track flakiness rate as a metric. Measure and alert on tests with a pass rate below 99%. Never merge code that increases the flakiness rate.
- Run tests in parallel with randomized order (pytest-randomly, jest --randomize) to surface hidden dependencies.
- Use test sharding to distribute load and reduce timeout-related flakiness.
- Enforce hermetic builds — tests should not depend on external network calls in unit/integration suites.
- Separate slow and fast tests — run fast unit tests on every commit, reserve slow integration tests for pre-merge.
- Log test durations — a test that suddenly takes 3x longer is about to become flaky.
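The pass-rate metric from the first practice can be computed from CI history with a few lines of Python; the history structure and the 99% threshold below are illustrative:

```python
def pass_rate(results):
    """results: list of booleans (True = pass) from recent CI runs of one test."""
    return sum(results) / len(results)

def flaky_tests(history, threshold=0.99):
    """Flag tests whose recent pass rate falls below the threshold."""
    return sorted(
        name for name, results in history.items()
        if pass_rate(results) < threshold
    )

history = {
    "test_login": [True] * 100,                    # 100% pass rate
    "test_checkout": [True] * 97 + [False] * 3,    # 97%: below threshold
}
print(flaky_tests(history))  # ['test_checkout']
```

Feeding this from your CI provider's test-result API turns "that test feels flaky" into a number you can alert on.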
Flaky Test Prevention: Writing Reliable Tests from the Start
The best way to fix flaky tests is to prevent them. Adopt these principles when writing new tests:
| Principle | Implementation |
| --- | --- |
| Determinism | Seed all randomness; freeze time; mock external calls |
| Isolation | Each test creates and destroys its own state |
| Idempotency | Tests can be run any number of times with the same result |
| Speed | Fast tests reduce reliance on timeouts |
| Specificity | Assert on exact, known values — not ranges or approximations |
| Hermetic | No network calls, no shared global state, no file system side effects |
Summary — Flaky Test Diagnosis Checklist
Use this checklist when investigating a flaky test:
- Can you reproduce the flakiness by running the test 50+ times?
- Does the test pass when run in isolation (single test, fresh process)?
- Does randomizing test execution order change the failure rate?
- Are there any sleep() or fixed-delay calls in the test or its fixtures?
- Does the test make real network calls or depend on external services?
- Are random values seeded consistently?
- Are all async operations properly awaited?
- Are temporary files, ports, and database records cleaned up after the test?
- Is the test sensitive to system timezone or locale?
- Does the test involve multi-threading or concurrent code without proper synchronization?
Conclusion
Flaky tests are not a minor inconvenience — they are a systemic risk to software quality and delivery speed. The path to fixing them is the same as fixing any complex bug: reproduce reliably, classify the failure, isolate the cause, and apply a targeted fix.
The root causes almost always fall into one of the categories covered in this guide: timing issues, shared state, external dependencies, non-determinism, or environment differences. A disciplined diagnosis process, combined with strong test isolation practices and automated flakiness detection in CI, will eliminate the vast majority of flaky test problems in your codebase.
Invest the time to fix flaky tests properly. Your CI pipeline, your team's morale, and your users' confidence in your releases all depend on it.



