How to Fix Flaky Tests in Mobile Test Automation — Complete Diagnosis Guide (2026)

- What Are Flaky Tests?
- Why Flaky Tests Are a Critical Problem
- The Flaky Test Diagnosis Framework
- Why Mobile Flaky Tests Happen
- Root Causes of Flaky Tests — With Fixes
- How to Reduce Test Automation Maintenance Surface
- Flaky Test Detection Tools and Strategies
- Flaky Tests in CI/CD Pipelines — Best Practices
- Flaky Test Prevention: Writing Reliable Tests from the Start
- Flaky Test Diagnosis Checklist
- Conclusion
What Are Flaky Tests?
A flaky test is a test that produces inconsistent results — passing on one run and failing on another — without any change to the underlying code. Flaky tests are one of the most damaging problems in modern software development pipelines because they erode developer trust, slow down CI/CD, hide real bugs, and waste engineering time.
In mobile test automation, flaky tests are even harder to diagnose. The same user flow can pass on one Android device and fail on another because of device fragmentation, OS dialogs, permission prompts, keyboard behavior, network variance, animations, or locator drift. A checkout test might fail not because the checkout is broken, but because the keyboard covered the CTA, the device was on a slower network, or an Appium locator no longer matched the updated UI hierarchy.
That is why learning how to fix flaky tests is not just about adding retries or increasing timeouts. The right approach is to identify the real source of non-determinism, isolate it, and reduce the maintenance surface of your test automation suite.
Flaky tests are present in virtually every large codebase at scale. Google's testing team has reported that almost 16% of its tests, roughly 1 in 7, show some level of flakiness. For teams running thousands of tests per day, even a 1% flakiness rate creates constant noise and delays.
The key characteristic that defines a flaky test is non-determinism. The test outcome depends on something other than the code under test — timing, environment state, network availability, external services, device state, OS behavior, or test execution order.

Why Flaky Tests Are a Critical Problem
Before diving into diagnosis and fixes, it's worth understanding why flaky tests deserve serious attention:
They mask real failures. When tests flicker, developers begin ignoring failures. A genuine regression can slip through because the team assumed the failure was "just flakiness."
They slow CI/CD pipelines. Retrying flaky tests on every build adds minutes or hours to your deployment cycle.
They destroy team morale. Nothing is more frustrating than a red build that magically turns green on re-run.
They increase infrastructure costs. Unnecessary retries consume compute time, especially on cloud CI platforms billed by the minute.
They indicate deeper architectural problems. A high flakiness rate is often a symptom of poor test isolation, bad resource management, or fragile external dependencies.
Why Mobile Flaky Tests Are More Expensive
Mobile flaky tests create extra debugging work because the failure surface is much larger than web or backend testing. A failed mobile run can depend on the device model, OS version, screen size, app build, permission state, network profile, keyboard behavior, animation timing, or whether the app was installed fresh before the test.
For mobile teams, the real cost is not just the failed test. It is the investigation that follows. Engineers and QA teams have to answer the same questions again and again:
Did the app actually break?
Did the test break?
Did the locator change?
Did a permission dialog block the flow?
Did the keyboard hide the button?
Did this only happen on one device or OS version?
Did the backend respond slowly?
Did another test leave the app in a dirty state?
This is why mobile flaky tests quickly become a test automation maintenance problem. When a team spends every sprint fixing locators, updating waits, debugging screenshots, and re-running the same flows manually, the automation suite stops feeling like leverage and starts feeling like another product to maintain.
The Flaky Test Diagnosis Framework
Fixing flaky tests starts with systematic diagnosis. Randomly patching tests without understanding the root cause leads to recurring failures. For mobile teams, the diagnosis process must include device state, app state, OS behavior, permissions, network, and locator stability.
Use this five-step framework.
Step 1 — Reproduce the Flakiness
You cannot fix what you cannot reproduce. Run the test repeatedly to confirm that it is actually flaky and not a one-off infrastructure failure.
For general automated tests, run the same test in isolation multiple times:
# Run a specific test 50 times to surface flakiness
for i in $(seq 1 50); do pytest tests/test_payment.py::test_charge_card; done
For mobile flaky tests, reproduce the failure across the same conditions where it originally failed:
Same device model
Same OS version
Same app build
Same test account
Same network profile
Same permission state
Same install state: fresh install vs existing app session
Same execution environment: local, emulator, simulator, real device, or cloud device
If the test fails only on one device or OS version, treat that as a major clue. Mobile failures are often deterministic within the right environment but invisible everywhere else.
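The repeated-run loop above can also be scripted so that it reports a failure count instead of scrolling output. This is a minimal sketch: `count_failures` and `pytest_command` are hypothetical helper names, and the sketch assumes the test runner exits non-zero when a test fails.

```python
import subprocess
import sys


def pytest_command(test_id):
    """Build the argv for running one pytest test ID quietly."""
    return [sys.executable, "-m", "pytest", test_id, "-q"]


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(command, capture_output=True)
        if completed.returncode != 0:
            failures += 1
    return failures
```

Calling `count_failures(pytest_command("tests/test_payment.py::test_charge_card"), 50)` gives a number you can compare before and after a fix, which is harder to argue with than "it seems better now."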
Step 2 — Classify the Failure Type
Log every failure message carefully and group failures by error type. Classification narrows your root cause search.
Common flaky test failure patterns include:
TimeoutError → likely async, network, or timing issue
AssertionError on shared state → likely test order dependency
ConnectionRefusedError → likely external service or environment issue
StaleElementReferenceException → likely UI automation race condition
Random data mismatch → likely missing seed or non-deterministic test data
For mobile test automation, also classify failures using mobile-specific categories:
Element not found → locator drift, slow screen load, wrong screen, or hidden element
Tap hits wrong target → screen size, scroll position, density, or animation issue
Permission popup blocks flow → missing permission handling or first-run state issue
Keyboard covers CTA → input behavior or viewport issue
Test passes locally but fails in CI → device state, app build, network, or environment mismatch
Test fails after UI copy/layout change → brittle locator or text-based assertion
Test fails only on one device → device fragmentation or OS-specific behavior
Do not label everything as “flaky” too quickly. First decide whether it is an app bug, automation bug, environment issue, data issue, or device-specific issue.
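Grouping failures by type can be partly automated. The sketch below maps raw error messages to a first-guess category using ordered regular expressions. The patterns and category names are illustrative assumptions, not a standard taxonomy, and a human should still confirm the category from screenshots and logs.

```python
import re

# Ordered (pattern, category) rules; the first match wins.
# Both the patterns and the category names are illustrative assumptions.
RULES = [
    (r"permission|allow .* to access", "permission dialog"),
    (r"NoSuchElement|element not found|could not be located", "locator drift or wrong screen"),
    (r"StaleElementReference", "UI race condition"),
    (r"Timeout|timed out", "timing, async, or network"),
    (r"ConnectionRefused|HTTP (429|503)", "external service or environment"),
]


def classify_failure(message):
    """Return a first-guess failure category for an error message."""
    for pattern, category in RULES:
        if re.search(pattern, message, flags=re.IGNORECASE):
            return category
    return "unclassified"
```

Anything that lands in "unclassified" is a prompt to add a rule or to look more closely, not a reason to label the test flaky.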
Step 3 — Isolate the Test
Run the test in complete isolation — separate process, fresh state, clean app install, and controlled test data. If the test passes alone but fails in the full suite, you likely have test order dependency or shared state pollution.
For mobile tests, check these isolation questions:
Was the correct APK or IPA installed?
Was the app launched from the expected state?
Was the user already logged in from a previous test?
Were permissions already granted on one device but not another?
Did another test leave items in cart, change profile data, or modify local storage?
Did the test rely on cached API data?
Did the previous test leave the app on a different screen?
Was the same build used locally and in CI?
If your team runs multiple builds across environments, link build version and test result together. A large number of flaky test investigations are wasted simply because the team was debugging the wrong app build.
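When a test passes alone but fails in the full suite, the next question is which earlier test pollutes the state. One way to narrow that down is a binary search over the tests that ran before the failing one. The sketch below assumes a single polluting test and a caller-supplied function, here called `victim_fails_after` (a hypothetical name), that runs the given tests followed by the failing test and reports whether it failed.

```python
def find_polluter(candidates, victim_fails_after):
    """Binary-search `candidates` for the one test that makes the victim fail.

    Assumes exactly one polluter whose effect does not depend on other tests.
    Returns None if running all candidates first does not reproduce the failure.
    """
    suspects = list(candidates)
    if not victim_fails_after(suspects):
        return None
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half = suspects[:middle]
        # If the first half alone reproduces the failure, the polluter is there;
        # otherwise (under the single-polluter assumption) it is in the second half.
        if victim_fails_after(first_half):
            suspects = first_half
        else:
            suspects = suspects[middle:]
    return suspects[0]
```

With 64 preceding tests this takes about seven suite runs instead of 64, at the cost of the single-polluter assumption.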
Step 4 — Add Logs, Screenshots, and Timestamps
Flaky tests leave traces. You need enough evidence to reconstruct what happened before the failure.
For general tests, add timestamps, thread IDs, request IDs, and intermediate state logs before each assertion.
For mobile tests, capture:
Screenshot before failure
Screenshot after every major step
Video recording of the test run
Device logs
App logs
Network logs, where possible
Current screen hierarchy
App version and build number
Device model
OS version
Permission state
Network state
Test account used
Exact failed step
Do not rely only on the final error message. A mobile test might fail with “element not found,” but the real issue could be a permission popup, loading spinner, keyboard overlay, or wrong app state three steps earlier.
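One way to make this evidence consistent is to record it in a single structured object per failure. The sketch below is an assumption about shape, not a required schema: the `FailureEvidence` class and its field names are hypothetical and cover only a subset of the items listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class FailureEvidence:
    """Context captured at the moment a mobile test step fails."""
    test_name: str
    failed_step: str
    app_build: str
    device_model: str
    os_version: str
    permission_state: dict
    network_profile: str
    screenshot_path: str
    # Timestamp in UTC so records from different machines can be compared.
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        """Serialize the record so it can be attached to the CI run."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the JSON next to the screenshot and video means a reviewer can see the build, device, and failed step without opening the CI logs.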
Step 5 — Check Environmental Factors
Compare local, CI, emulator, simulator, and real-device environments. Many flaky tests are deterministic in one environment and non-deterministic in another.
For general tests, compare:
OS
Timezone
Locale
CPU count
Available memory
Network latency
Dependency versions
For mobile tests, also compare:
Device model
OS version
Screen size and pixel density
Orientation
Animation settings
Keyboard type
Permission state
App install state
Battery saver / low-power mode
Push notifications
System dialogs
Network profile
Device farm configuration
The goal is not to make every environment identical. The goal is to know which variables affect the test so you can either control them or test them intentionally.
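A simple way to find the variables that matter is to snapshot each environment as a dictionary and diff a passing run against a failing one. The helper name `diff_environments` and the keys used in the example are hypothetical.

```python
def diff_environments(passing, failing):
    """Return {variable: (passing_value, failing_value)} for every difference."""
    keys = sorted(set(passing) | set(failing))
    return {
        key: (passing.get(key), failing.get(key))
        for key in keys
        if passing.get(key) != failing.get(key)
    }
```

For example, diffing `{"os": "Android 14", "animations": "off"}` against `{"os": "Android 14", "animations": "on"}` leaves only the `animations` entry, which is the variable to control or test intentionally.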
Why Mobile Flaky Tests Happen
Mobile flaky tests happen because mobile apps run in a messier environment than most web or backend systems. The UI is affected by the device, OS, keyboard, network, permissions, gestures, animations, and real user interruptions. A test that looks stable on one device can become unreliable across a real mobile device matrix.
Here are the most common causes.
1. Permissions and First-Run App State
Mobile apps often behave differently on first launch. A fresh install might show permission prompts for camera, location, photos, contacts, notifications, microphone, or Bluetooth. An already-used device might skip those prompts entirely.
This creates flaky tests when the automation assumes one state but gets another.
Example:
On Device A, location permission is already granted, so the test continues.
On Device B, the OS permission dialog appears and blocks the next tap.
The test fails with an element-not-found error even though the app itself is working.
Fix this by making permission state explicit. Either pre-grant permissions before the test or handle permission prompts as part of the flow. Do not let permission state depend on what happened in a previous test run.
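On Android, one way to pre-grant permissions is `adb shell pm grant`, which works for runtime permissions the app declares in its manifest. The sketch below only builds the commands. The function name `build_grant_commands` and the package name in the example are hypothetical, and iOS needs a different mechanism.

```python
def build_grant_commands(package, permissions, serial=None):
    """Build one `adb shell pm grant` argv per permission.

    `serial` selects a specific device via `adb -s`; omit it for a single device.
    """
    base = ["adb"] + (["-s", serial] if serial else [])
    return [base + ["shell", "pm", "grant", package, perm] for perm in permissions]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in the setup step of a test run. Appium's UiAutomator2 driver also offers an `autoGrantPermissions` capability for a similar purpose.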
2. Device Fragmentation
Device fragmentation is one of the biggest reasons mobile flaky tests are harder than web tests. Android alone varies across OEMs, screen sizes, OS versions, navigation modes, keyboards, permission behavior, and system UI overlays. iOS is more controlled, but iOS version differences, device sizes, safe areas, and app tracking prompts can still change how a flow behaves.
A flow can pass on one device and fail on another because:
The CTA is below the fold
The keyboard covers the button
A bottom sheet renders at a different height
A permission dialog has different wording
A scroll lands in a slightly different position
A loading animation takes longer on a slower device
A tap coordinate hits a different element on a different screen size
This is why mobile test automation should not be validated on only one emulator. Use a realistic device matrix for smoke, regression, and release testing.
3. OS Dialogs and System Interruptions
System-level dialogs are outside your app, but they can still break your tests. Permission popups, app update prompts, biometric prompts, notification permission dialogs, battery warnings, and system alerts can interrupt a flow.
These failures often get misclassified as locator problems because the target element is technically not visible. But the real problem is that the OS is blocking the app.
Fix this by adding explicit handling for known system dialogs and by resetting device state before major test runs. For high-value regression flows, capture screenshots or video so the team can see whether an OS dialog interrupted the test.
4. Keyboard Behavior
Keyboard behavior is a common source of mobile flaky tests. A form test may enter text correctly, but then fail because the keyboard covers the submit button.
This happens when:
The test does not hide the keyboard before tapping the CTA
The keyboard layout differs across devices
The “Done” action behaves differently across OS versions
The screen does not scroll after the input field is focused
The CTA is visible in one viewport but hidden in another
Fix this by making keyboard handling explicit. After text input, hide the keyboard or wait for the CTA to become visible and tappable. Avoid assuming that the same tap target will remain visible after text entry on every device.
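A minimal sketch of explicit keyboard handling, assuming a driver object that exposes `is_keyboard_shown()` and `hide_keyboard()` as the Appium Python client does. The helper name `dismiss_keyboard_if_shown` is hypothetical, and hiding the keyboard can behave differently across platforms and keyboard apps.

```python
def dismiss_keyboard_if_shown(driver):
    """Hide the on-screen keyboard if it is visible.

    Returns True if the keyboard was dismissed, False if it was not showing.
    """
    if driver.is_keyboard_shown():
        driver.hide_keyboard()
        return True
    return False
```

Calling this after every text entry and before tapping the CTA removes one device-dependent variable from the flow.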
5. Network Variance
Mobile apps are highly sensitive to network conditions. Slow APIs, retries, cached responses, offline states, and partial loading states can all create flaky tests.
A test might fail because:
The app is still loading data
The backend response is slow
A retry banner appears
The app falls back to cached data
A spinner blocks the next interaction
The assertion runs before the UI updates
Fix this by waiting for user-visible state, not fixed time. Wait for the loader to disappear, the expected screen to render, the button to become enabled, or the success state to appear. Where needed, validate backend/API state separately instead of relying only on UI timing.
6. Animations, Gestures, and Timing
Mobile UI is full of motion: transitions, bottom sheets, carousels, scrolling, swipe gestures, skeleton loaders, snackbars, and animated navigation. These can make a test flaky if the automation interacts before the screen is stable.
Common symptoms include:
Tap happens before transition finishes
Swipe does not travel far enough
Scroll inertia moves the target
Bottom sheet is still animating
Assertion runs before content settles
Element exists but is not yet tappable
Fix this by waiting for stable UI state. Avoid blind sleep() calls. Wait until the target is visible, enabled, and stable before interacting.
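"Stable" can be made concrete by polling a value, such as an element's position, until it stops changing. The sketch below is framework-agnostic: `read_value` is any callable you supply (for example, one that returns an element's coordinates, or None while the element is absent), and the default timings are assumptions to tune per suite.

```python
import time


def wait_until_stable(read_value, timeout=10.0, interval=0.2, required_matches=3,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll read_value() until it returns the same non-None value
    `required_matches` times in a row; raise TimeoutError otherwise."""
    deadline = clock() + timeout
    last = None
    matches = 0
    while clock() < deadline:
        current = read_value()
        if current is None:
            matches = 0          # element absent: start counting again
        elif current == last:
            matches += 1         # same value as the previous poll
        else:
            matches = 1          # first sighting of a new value
        last = current
        if matches >= required_matches:
            return current
        sleep(interval)
    raise TimeoutError(f"value did not stabilise within {timeout}s (last={last!r})")
```

Unlike a fixed sleep, this returns as soon as the animation settles on a fast device and keeps waiting on a slow one, up to the timeout.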
7. Appium Locator Drift
Appium locator drift is one of the most common maintenance problems in mobile automation. Appium tests often depend on accessibility IDs, resource IDs, XPath, visible text, or UI hierarchy. When product teams update the interface, even harmless changes can break tests.
Examples:
A developer renames a resource ID
A designer changes button copy
A new wrapper view changes the XPath
A layout refactor moves the element in the hierarchy
A screen redesign changes accessibility labels
A dynamic element gets a different ID in each run
The app may still work perfectly, but the test fails because the locator no longer points to the right object.
Fix this by preferring stable accessibility labels, avoiding brittle XPath chains, and testing user-visible outcomes instead of internal UI structure. If your team spends every sprint fixing Appium locators, it may be time to reduce reliance on locator-heavy automation for broad mobile regression coverage.
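Brittle XPath chains can be caught in review with a small heuristic check. The sketch below flags three patterns that tend to break after layout changes. The function name, the depth threshold, and the rules themselves are assumptions, and splitting on `/` is a rough approximation that will misjudge XPaths with slashes inside attribute values.

```python
import re


def xpath_brittleness(xpath, max_depth=3):
    """Return a list of reasons why an XPath locator is likely to be brittle."""
    reasons = []
    steps = [step for step in re.split(r"/+", xpath) if step]
    if len(steps) > max_depth:
        reasons.append("deep hierarchy chain")
    if re.search(r"\[\d+\]", xpath):
        reasons.append("positional index")
    if xpath.startswith("/") and not xpath.startswith("//"):
        reasons.append("absolute path from root")
    return reasons
```

An empty list does not prove a locator is stable. It only means none of these specific warning signs were found.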
Root Causes of Flaky Tests — With Fixes
1. Async and Timing Issues (Most Common)
Problem: Tests that rely on sleep(), fixed delays, or assume operations complete within a certain time window are inherently fragile. A slow CI machine, a garbage collection pause, or a noisy neighbor in a shared cloud environment can violate these assumptions.
Symptoms:
TimeoutError on async operations
Tests fail intermittently under load
Failures more frequent on CI than locally
Fix — Use Explicit Waits and Polling:
import time
import tenacity

# BAD: Fixed sleep is fragile
time.sleep(2)
assert db.record_exists(record_id)

# GOOD: Poll until condition is met or timeout
# (tenacity retries on any exception by default, including AssertionError)
@tenacity.retry(stop=tenacity.stop_after_delay(10), wait=tenacity.wait_fixed(0.2))
def wait_for_record(record_id):
    assert db.record_exists(record_id)

wait_for_record(record_id)
For browser automation (Selenium, Playwright, Cypress), use explicit waits, such as Selenium's WebDriverWait or Playwright's auto-waiting locators, instead of fixed sleeps. Cypress's built-in retry-ability handles most async UI flakiness out of the box.
For mobile test automation, timing issues usually show up as tap failures, missing elements, stuck loaders, or assertions that run before the screen has settled. Avoid waiting for a fixed number of seconds after every tap. Instead, wait for a clear UI state: a screen title, enabled CTA, completed loading state, disappeared spinner, visible success message, or stable element position. Fixed sleeps make mobile flaky tests worse because device speed, animation timing, and network latency vary across every run.
For async JavaScript tests, always await every Promise and avoid fire-and-forget patterns in test setup:
// BAD
beforeEach(() => {
  db.seed(); // returns a Promise but not awaited
});

// GOOD
beforeEach(async () => {
  await db.seed();
});
2. Test Order Dependency and Shared State
Problem: Tests that depend on a specific execution order, or that leave side effects (data in a database, files on disk, global variables) for subsequent tests, are a major source of flakiness. Most test runners do not guarantee execution order, and parallel execution makes this worse.
Symptoms:
Test passes alone but fails in suite
Failures depend on which tests ran before
Randomizing test order (for example, with pytest-randomly and its --randomly-seed option) surfaces failures
Fix — Enforce Test Isolation:
Every test must own its setup and teardown. Use transactions that are rolled back after each test, temporary directories, and in-memory databases:
# Django example: TestCase wraps each test in a transaction
# (User and charge come from the application under test)
from django.test import TestCase  # Automatically rolls back DB after each test

class PaymentTest(TestCase):
    def setUp(self):
        self.user = User.objects.create(email="test@example.com")

    def test_charge(self):
        result = charge(self.user, 100)
        self.assertTrue(result.success)
    # DB is rolled back automatically after each test
For global state (singletons, module-level caches), use mocks or explicit reset functions in tearDown:
def tearDown(self):
    cache.clear()
    config.reset_to_defaults()
For mobile tests, shared state often comes from the app itself. One test may leave the user logged in, grant permissions, add items to cart, change profile data, modify local storage, or leave the app on a different screen. The next test then starts from a state it did not create. To fix this, define the starting state for every mobile flow: fresh install, logged-out state, logged-in state, seeded account, pre-granted permissions, or clean cart. Do not let one test inherit state from another unless that dependency is intentional.
3. Race Conditions in Concurrent Code
Problem: Tests for concurrent or multi-threaded code frequently exhibit race conditions. The test interleaves thread execution differently on each run.
Symptoms:
Failures in tests that exercise queues, workers, or async event processing
Inconsistent counts, unexpected None values, partial writes
Fix — Control Concurrency in Tests:
Use barriers, semaphores, or mock executors to make concurrent operations deterministic:
from concurrent.futures import ThreadPoolExecutor

# Use a ThreadPoolExecutor with a controlled thread count
# (max_workers=1 runs the tasks one at a time) and join all futures before asserting
with ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_task, t) for t in tasks]
    results = [f.result() for f in futures]  # .result() blocks until done

assert len(results) == len(tasks)
In Go, use sync.WaitGroup and ensure all goroutines complete before assertions. Never assert on goroutine output without synchronization.
4. External Dependencies — Network, APIs, and Databases
Problem: Tests that call real external services (third-party APIs, payment gateways, email providers) are non-deterministic by definition. Network latency varies, rate limits kick in, external services have their own outages.
Symptoms:
Tests only fail at certain times of day
Failures correlate with CI machine network latency spikes
Error messages reference timeouts or HTTP 429/503
Fix — Mock or Stub External Dependencies:
# Using unittest.mock to stub HTTP calls
from unittest.mock import patch, MagicMock

@patch("myapp.payments.stripe.charge")
def test_process_payment(mock_charge):
    mock_charge.return_value = MagicMock(id="ch_123", status="succeeded")
    result = process_payment(user_id=1, amount=5000)
    assert result.transaction_id == "ch_123"
    mock_charge.assert_called_once_with(amount=5000, currency="usd")
For integration tests that must use real services, use contract testing (e.g., Pact) to verify API compatibility without live calls in every run.
For mobile flaky tests, network variance is often hidden behind UI symptoms. A button may not appear because an API response is slow. A list may render old data because the app used cache. A checkout flow may fail because the backend retried silently. When a mobile UI test fails, check whether the backend state and UI state actually match. For critical flows, combine UI evidence with API or database validation so the team knows whether the app failed, the backend failed, or the automation simply moved too early.
5. Random and Non-Deterministic Data
Problem: Tests that generate random data without seeding the random number generator produce different inputs on each run. A bug that only manifests for certain input values will cause intermittent failures.
Fix — Seed All Random Number Generators:
import random

import numpy as np
import pytest
from faker import Faker

@pytest.fixture(autouse=True)
def seed_random():
    random.seed(42)
    yield

# For numpy
np.random.seed(42)

# For Faker-generated test data
faker = Faker()
Faker.seed(42)
Log the seed value at the start of each test run. When a failure is reported, testers can reproduce it exactly by re-using the same seed.
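A fixed seed such as 42 makes runs repeatable but always explores the same inputs. A common alternative is to pick a fresh seed per run, log it, and allow an override for reproduction. The sketch below assumes an environment variable named `TEST_SEED`, which is a hypothetical name, and seeds only Python's `random` module.

```python
import os
import random


def resolve_seed(env=os.environ):
    """Use TEST_SEED if set; otherwise draw a fresh 32-bit seed."""
    raw = env.get("TEST_SEED")
    if raw is not None:
        return int(raw)
    return random.SystemRandom().randrange(2**32)


def seed_everything(env=os.environ):
    """Seed the random module and print the seed so a failure can be replayed."""
    seed = resolve_seed(env)
    random.seed(seed)
    print(f"TEST_SEED={seed}")
    return seed
```

When a run fails, copying the printed `TEST_SEED=...` line into the environment of the next run replays the same random inputs.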
6. File System and Resource Leaks
Problem: Tests that write to shared paths (/tmp/output.csv), leave open file handles, or fail to clean up ports and sockets cause conflicts when tests run in parallel.
Fix — Use Temporary Directories:
import os
import tempfile

import pytest

@pytest.fixture
def tmp_dir():
    with tempfile.TemporaryDirectory() as d:
        yield d

def test_export_csv(tmp_dir):
    output_path = os.path.join(tmp_dir, "output.csv")
    export_data(output_path)
    assert os.path.exists(output_path)
    # tmp_dir is deleted automatically after the test
For port conflicts in server tests, use port=0 to let the OS assign a free port, or use tools like pytest-asyncio with isolated event loops.
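The port=0 approach can be sketched with the standard `socket` module: binding to port 0 asks the OS for any free port, and `getsockname()` reports which one was assigned. The helper name `bind_to_free_port` is hypothetical.

```python
import socket


def bind_to_free_port(host="127.0.0.1"):
    """Bind a listening TCP socket to an OS-assigned free port.

    Returns (socket, port). Keep the socket open while the port is in use,
    and close it when the test finishes.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))      # port 0 lets the OS choose
    sock.listen()
    return sock, sock.getsockname()[1]
```

Returning the bound socket, rather than only the port number, avoids the gap in which another process could claim the port between discovery and use.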
7. Timezone and Locale Sensitivity
Problem: Tests that compare dates, times, or locale-formatted strings often break in CI environments configured with a different timezone or locale than the developer's machine.
Fix — Always Use UTC and Explicit Locale:
# Freeze time for date-dependent tests
from datetime import date

from freezegun import freeze_time

@freeze_time("2025-01-15 12:00:00")
def test_invoice_due_date():
    invoice = create_invoice(terms_days=30)
    assert invoice.due_date == date(2025, 2, 14)
Set TZ=UTC explicitly in your CI environment configuration and use timezone-aware datetime objects throughout your codebase.
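To illustrate why timezone-aware objects matter, the sketch below computes a due date from an aware timestamp by normalizing to UTC first. The function `due_date_utc` is a hypothetical stand-in for application logic and rejects naive datetimes outright.

```python
from datetime import datetime, timedelta, timezone


def due_date_utc(issued_at, terms_days):
    """Return the due date in UTC for a timezone-aware issue timestamp."""
    if issued_at.tzinfo is None:
        raise ValueError("issued_at must be timezone-aware")
    return (issued_at.astimezone(timezone.utc) + timedelta(days=terms_days)).date()
```

An invoice issued at 20:00 on 15 January in UTC-8 is already 16 January in UTC, so its 30-day due date is 15 February rather than 14 February. A test that ignores the offset would pass or fail depending on where it runs.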
Mobile apps often render dates, currencies, addresses, phone numbers, and language strings based on device locale. If your test expects exact visible text, it may pass on one device and fail on another. For mobile regression tests, either set device locale/timezone explicitly or assert the behavior instead of hardcoding locale-sensitive copy.
8. Mobile UI and Locator Drift
Problem: Mobile UI changes frequently. Labels, IDs, hierarchy, copy, layout, and component structure can shift across releases. In Appium-heavy suites, this creates recurring test automation maintenance because the test is tied to how the screen is implemented, not what the user is trying to do.
Symptoms:
Element not found after a routine UI change
Tap lands on the wrong element
Test breaks after button copy changes
Same flow fails only on one screen size
XPath breaks after a layout refactor
Accessibility ID changes break multiple tests
The app works manually, but automation fails
Fix — Reduce Locator Fragility:
Prefer stable accessibility labels over brittle XPath. Avoid deeply nested selectors that depend on UI hierarchy. Assert meaningful user outcomes instead of incidental implementation details. During UI refactors, review affected test flows and update labels intentionally.
If your team keeps fixing the same locators every sprint, the problem is not just a bad selector. The problem is that your automation surface is too tightly coupled to UI implementation. For broad mobile regression coverage, consider shifting high-value user flows away from locator-heavy scripts and toward intent-driven test execution.
How to Reduce Test Automation Maintenance Surface
The best way to fix flaky tests is not to keep patching the same failures forever. The better approach is to reduce the maintenance surface of the automation suite.
Test automation maintenance increases when tests are too tightly coupled to implementation details. This is especially common in mobile automation, where a simple UI change can break multiple Appium locators even when the product behavior is still correct.
To reduce maintenance surface, start with these rules:
Automate stable, high-value user flows first: login, onboarding, search, checkout, payment, profile update, core regression, and critical error states.
Avoid automating every tiny UI variation through brittle end-to-end scripts.
Prefer assertions on user-visible outcomes over assertions on internal structure.
Avoid brittle XPath chains where possible.
Use stable accessibility labels for important elements.
Keep setup reusable: test data, user accounts, permissions, app builds, and environment configuration.
Separate smoke tests, regression tests, exploratory coverage, and edge-case validation.
Run broad smoke coverage across a smaller device matrix, then deeper regression on representative devices.
Review recurring flaky tests every sprint and ask whether the test should be fixed, rewritten, moved lower in the test pyramid, or removed.
For mobile teams, reducing maintenance surface also means choosing the right execution model. If a flow changes every sprint, locator-heavy automation will keep breaking every sprint. For those flows, intent-driven execution can be more useful than maintaining fragile scripts line by line.
Flaky Test Detection Tools and Strategies
Automatic Flakiness Detection in CI
CI platforms and test tools offer several ways to detect flakiness:
GitHub Actions: Use test result annotations and re-run only failed jobs
Buildkite / CircleCI: Flaky test dashboards with historical pass rate
pytest-flakefinder: Runs each test multiple times to detect flakiness locally
Jest --detectOpenHandles: Surfaces async resource leaks in JavaScript
Gradle's --rerun-tasks: Forces tests to re-run instead of being skipped as up to date, so repeated runs can expose flakiness in JVM projects
Mobile Flaky Test Detection
Mobile flaky test detection should track more than pass/fail status. A failed mobile test needs enough context to explain whether the app failed, the automation failed, or the environment changed.
Track these dimensions for every mobile test run:
Test name
Failed step
App build version
Device model
OS version
Screen size
Network profile
Permission state
Fresh install vs existing session
Test account
Screenshot at failure
Video recording
Device logs
App logs
Failure category
Useful failure categories include:
Locator drift
Permission dialog
Keyboard issue
Network delay
Animation/timing issue
Backend/API mismatch
Device-specific issue
Dirty app state
Real product bug
Reruns can help confirm whether a failure is flaky, but they should not become the strategy. If a test needs to be rerun three times to pass, it is still broken. Use reruns to collect signal, not to hide instability.
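Once run history carries these dimensions, flaky candidates can be found mechanically: a test is suspect when the same grouping, for example the same test on the same build, has both passes and failures. The sketch below assumes each run is a dictionary with a boolean `passed` key plus whatever dimension keys you group by. The key names are hypothetical.

```python
from collections import defaultdict


def find_flaky(runs, group_by=("test", "build")):
    """Return {group_key: pass_rate} for groups with mixed pass/fail results."""
    outcomes = defaultdict(list)
    for run in runs:
        key = tuple(run[name] for name in group_by)
        outcomes[key].append(run["passed"])
    report = {}
    for key, results in outcomes.items():
        # Consistently passing or consistently failing groups are not flaky.
        if any(results) and not all(results):
            report[key] = sum(results) / len(results)
    return report
```

Changing `group_by` to include the device or OS version shows whether a test is flaky everywhere or deterministic on one device, which points to a different fix.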
Quarantining Flaky Tests
When a test is confirmed flaky but cannot be fixed immediately, quarantine it rather than deleting it or ignoring it silently:
@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_background_job_completes():
    # Known flaky: ticket #4521 tracks the fix
    ...
Use pytest-rerunfailures to auto-retry flaky tests a fixed number of times. Track quarantined tests in a dedicated dashboard and enforce a policy that quarantined tests must be fixed within a sprint.
For mobile teams, quarantining should also include a failure category. Do not quarantine a test as simply “flaky.” Mark whether it failed because of locator drift, device fragmentation, permission state, keyboard behavior, network variance, OS dialog, dirty app state, or a suspected product bug. This makes the backlog actionable instead of turning into a graveyard of ignored tests.
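A quarantine list with categories and a deadline can be as simple as a small data structure checked in CI. The sketch below is one possible shape. The class names, the category values, the example ticket ID, and the 14-day limit standing in for "within a sprint" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class FailureCategory(Enum):
    LOCATOR_DRIFT = "locator drift"
    PERMISSION_STATE = "permission state"
    KEYBOARD = "keyboard behavior"
    NETWORK = "network variance"
    DIRTY_STATE = "dirty app state"
    SUSPECTED_BUG = "suspected product bug"


@dataclass
class QuarantineEntry:
    test_id: str
    category: FailureCategory
    ticket: str
    quarantined_on: date


def overdue(entries, today, max_age_days=14):
    """Return entries that have been quarantined longer than the allowed age."""
    limit = timedelta(days=max_age_days)
    return [entry for entry in entries if today - entry.quarantined_on > limit]
```

Failing the build, or at least posting a warning, when `overdue` returns anything keeps the quarantine from becoming permanent.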
Flaky Tests in CI/CD Pipelines — Best Practices
Track flakiness rate as a metric. Measure and alert on tests with a pass rate below 99%. Track this by test case, suite, device, OS version, and app build.
Never merge code that increases the flakiness rate. A flaky test may look like a QA problem, but it affects the whole engineering system. If a change makes the test suite less trustworthy, treat that as a release risk.
Separate app failures, automation failures, and infrastructure failures. A failed mobile test should not automatically be marked as a product bug. Classify whether the failure came from the app, the automation layer, the backend, the device, the network, or the CI environment.
Run smoke tests on every PR. Keep PR-level mobile automation focused on high-signal flows: launch, login, core navigation, checkout/payment, onboarding, and one or two business-critical actions.
Run broader device coverage nightly or before release. Do not run every test across every device on every commit. Use a smaller representative matrix for fast feedback and broader Android/iOS coverage for nightly or pre-release validation.
Pin app build, OS version, and device profile for reproducibility. If a test fails, the team should know exactly which build, device, OS version, and environment produced the failure.
Capture evidence for every failed mobile run. Store screenshots, video, app logs, device logs, failed step, app version, and device metadata. Without evidence, teams waste time guessing.
Use test sharding carefully. Sharding helps reduce runtime, but it can also expose shared state and order dependency issues. If sharding changes failure rate, investigate test isolation.
Do not let retries become your flaky-test strategy. Retrying once can help confirm non-determinism. Retrying repeatedly until the suite turns green only hides the problem and trains the team to ignore failures.
Review recurring flaky tests every sprint. The highest-value flaky test work is usually in the top recurring failures. Fix those first instead of randomly patching one-off failures.
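The "retry once to confirm, not to hide" rule from this list can be encoded so that a pass on retry is reported as flaky rather than green. A minimal sketch, assuming the test is a callable that raises on failure. The function name and the three status strings are hypothetical.

```python
def run_with_confirmation(test_fn):
    """Run a test, retrying at most once, and report what actually happened.

    Returns "passed" (first attempt passed), "flaky" (failed, then passed),
    or "failed" (failed twice).
    """
    def attempt():
        try:
            test_fn()
            return True
        except Exception:   # AssertionError is a subclass of Exception
            return False

    if attempt():
        return "passed"
    return "flaky" if attempt() else "failed"
```

Reporting "flaky" as its own status keeps the signal visible in dashboards instead of letting a retry quietly turn the build green.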
Flaky Test Prevention: Writing Reliable Tests from the Start
The best way to fix flaky tests is to prevent them. Adopt these principles when writing new tests:
Principle | Implementation |
Determinism | Seed all randomness; freeze time; mock external calls |
Isolation | Each test creates and destroys its own state |
Idempotency | Tests can be run any number of times with the same result |
Speed | Fast tests reduce reliance on timeouts |
Specificity | Assert on exact, known values — not ranges or approximations |
Hermetic | No network calls, no shared global state, no file system side effects |
For mobile test automation, add these prevention principles as well:
Principle | Mobile implementation |
Device control | Define target devices, OS versions, network profile, orientation, and screen size before the run |
Permission control | Pre-grant permissions or handle permission prompts explicitly inside the test |
Build control | Install and record the correct APK or IPA before execution |
UI stability | Wait for visible, enabled, stable UI state instead of fixed delays |
Keyboard handling | Hide the keyboard or wait for the CTA to become tappable after text input |
Locator resilience | Prefer stable accessibility labels and avoid brittle XPath chains |
Evidence capture | Store screenshots, video, logs, failed step, app version, and device metadata |
App state control | Define whether the flow starts from fresh install, logged-out state, logged-in state, or seeded account state |
Flaky Test Diagnosis Checklist
Use this checklist when investigating a flaky test.
General Flaky Test Checklist
Can you reproduce the flakiness by running the test repeatedly?
Does the test pass when run in isolation?
Does the failure rate change when test execution order is randomized?
Are there any sleep() or fixed delay calls in the test or fixtures?
Does the test make real network calls or depend on external services?
Are random values seeded consistently?
Are all async operations properly awaited?
Are temporary files, ports, and database records cleaned up after the test?
Is the test sensitive to timezone or locale?
Does the test involve multi-threading or concurrent code without proper synchronization?
Does the failure happen only in CI?
Does the failure happen only under parallel execution?
Mobile Flaky Test Checklist
Use this checklist specifically for mobile flaky tests:
Did the test fail on one device, OS version, or screen size only?
Was the correct APK or IPA build installed?
Was the app started from the expected state: fresh install, logged out, logged in, or seeded?
Were permissions pre-granted or handled inside the test?
Did an OS dialog, notification, biometric prompt, app update prompt, or permission popup appear?
Did the keyboard cover the element being tapped?
Did the test wait for actual UI state, or did it rely on fixed sleeps?
Did a network delay leave the app in loading, retry, offline, or cached state?
Did an animation, bottom sheet, carousel, or scroll inertia affect the tap?
Did an Appium locator break because of copy, hierarchy, or accessibility ID changes?
Did the same flow pass when run manually on the same device?
Are screenshots, video, app logs, and device logs attached to the failure?
Is the failure an app bug, automation bug, backend issue, device issue, or environment issue?
Has this same flaky test failed in previous sprints?
Conclusion
Flaky tests are not a minor inconvenience. They are a systemic risk to software quality, delivery speed, and team trust. The path to fixing them is the same as fixing any complex bug: reproduce reliably, classify the failure, isolate the cause, collect evidence, and apply a targeted fix.
For mobile teams, the problem is even sharper. Mobile flaky tests can come from device fragmentation, permissions, OS dialogs, keyboard behavior, network variance, animations, app state, or Appium locator drift. That means the fix is not simply “add a longer wait” or “rerun the test.” The fix is disciplined diagnosis plus a smaller, more reliable automation surface.
If your team is tired of fixing the same locators every sprint, Quash helps you run mobile flows from plain-language intent on real devices. With Test Paths and reruns, your team can reduce repeated maintenance and focus on whether the app actually works — not whether another brittle script broke again.



