How to Fix Flaky Tests in Mobile Test Automation — Complete Diagnosis Guide (2026)

Nishtha chauhan

What Are Flaky Tests?

A flaky test is a test that produces inconsistent results — passing on one run and failing on another — without any change to the underlying code. Flaky tests are one of the most damaging problems in modern software development pipelines because they erode developer trust, slow down CI/CD, hide real bugs, and waste engineering time.

In mobile test automation, flaky tests are even harder to diagnose. The same user flow can pass on one Android device and fail on another because of device fragmentation, OS dialogs, permission prompts, keyboard behavior, network variance, animations, or locator drift. A checkout test might fail not because the checkout is broken, but because the keyboard covered the CTA, the device was on a slower network, or an Appium locator no longer matched the updated UI hierarchy.

That is why learning how to fix flaky tests is not just about adding retries or increasing timeouts. The right approach is to identify the real source of non-determinism, isolate it, and reduce the maintenance surface of your test automation suite.

Flaky tests show up in virtually every large codebase. Google's testing team has reported that almost 16% of its tests show some level of flakiness, roughly 1 in 6. For teams running thousands of tests per day, even a 1% flakiness rate creates constant noise and delays.

The key characteristic that defines a flaky test is non-determinism. The test outcome depends on something other than the code under test — timing, environment state, network availability, external services, device state, OS behavior, or test execution order.



Why Flaky Tests Are a Critical Problem

Before diving into diagnosis and fixes, it's worth understanding why flaky tests deserve serious attention:

  • They mask real failures. When tests flicker, developers begin ignoring failures. A genuine regression can slip through because the team assumed the failure was "just flakiness."

  • They slow CI/CD pipelines. Retrying flaky tests on every build adds minutes or hours to your deployment cycle.

  • They destroy team morale. Nothing is more frustrating than a red build that magically turns green on re-run.

  • They increase infrastructure costs. Unnecessary retries consume compute time, especially on cloud CI platforms billed by the minute.

  • They indicate deeper architectural problems. A high flakiness rate is often a symptom of poor test isolation, bad resource management, or fragile external dependencies.

Why Mobile Flaky Tests Are More Expensive

Mobile flaky tests create extra debugging work because the failure surface is much larger than web or backend testing. A failed mobile run can depend on the device model, OS version, screen size, app build, permission state, network profile, keyboard behavior, animation timing, or whether the app was installed fresh before the test.

For mobile teams, the real cost is not just the failed test. It is the investigation that follows. Engineers and QA teams have to answer the same questions again and again:

  • Did the app actually break?

  • Did the test break?

  • Did the locator change?

  • Did a permission dialog block the flow?

  • Did the keyboard hide the button?

  • Did this only happen on one device or OS version?

  • Did the backend respond slowly?

  • Did another test leave the app in a dirty state?

This is why mobile flaky tests quickly become a test automation maintenance problem. When a team spends every sprint fixing locators, updating waits, debugging screenshots, and re-running the same flows manually, the automation suite stops feeling like leverage and starts feeling like another product to maintain.


The Flaky Test Diagnosis Framework

Fixing flaky tests starts with systematic diagnosis. Randomly patching tests without understanding the root cause leads to recurring failures. For mobile teams, the diagnosis process must include device state, app state, OS behavior, permissions, network, and locator stability.

Use this five-step framework.

Step 1 — Reproduce the Flakiness

You cannot fix what you cannot reproduce. Run the test repeatedly to confirm that it is actually flaky and not a one-off infrastructure failure.

For general automated tests, run the same test in isolation multiple times:

# Run a specific test 50 times to surface flakiness
for i in $(seq 1 50); do pytest tests/test_payment.py::test_charge_card; done

For mobile flaky tests, reproduce the failure across the same conditions where it originally failed:

  • Same device model

  • Same OS version

  • Same app build

  • Same test account

  • Same network profile

  • Same permission state

  • Same install state: fresh install vs existing app session

  • Same execution environment: local, emulator, simulator, real device, or cloud device

If the test fails only on one device or OS version, treat that as a major clue. Mobile failures are often deterministic within the right environment but invisible everywhere else.
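To turn repeated runs into a number you can track, a small wrapper can count failures instead of leaving you to scan terminal output. The sketch below is a minimal illustration: `measure_flakiness` and `FlakinessReport` are hypothetical names, and it assumes the test command signals failure through a non-zero exit code, as pytest does.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class FlakinessReport:
    runs: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.runs if self.runs else 0.0

    @property
    def is_flaky(self) -> bool:
        # Flaky means mixed results: at least one pass and at least one failure.
        return 0 < self.failures < self.runs


def measure_flakiness(command: list[str], runs: int = 50) -> FlakinessReport:
    """Run `command` `runs` times, counting non-zero exit codes as failures."""
    failures = 0
    for _ in range(runs):
        completed = subprocess.run(command, capture_output=True)
        if completed.returncode != 0:
            failures += 1
    return FlakinessReport(runs=runs, failures=failures)
```

For example, `measure_flakiness([sys.executable, "-m", "pytest", "tests/test_payment.py::test_charge_card"])` repeats the pytest target from the shell loop. A report where `is_flaky` is false and every run failed points to a deterministic failure rather than flakiness.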

Step 2 — Classify the Failure Type

Log every failure message carefully and group failures by error type. Classification narrows your root cause search.

Common flaky test failure patterns include:

  • TimeoutError → likely async, network, or timing issue

  • AssertionError on shared state → likely test order dependency

  • ConnectionRefusedError → likely external service or environment issue

  • StaleElementReferenceException → likely UI automation race condition

  • Random data mismatch → likely missing seed or non-deterministic test data

For mobile test automation, also classify failures using mobile-specific categories:

  • Element not found → locator drift, slow screen load, wrong screen, or hidden element

  • Tap hits wrong target → screen size, scroll position, density, or animation issue

  • Permission popup blocks flow → missing permission handling or first-run state issue

  • Keyboard covers CTA → input behavior or viewport issue

  • Test passes locally but fails in CI → device state, app build, network, or environment mismatch

  • Test fails after UI copy/layout change → brittle locator or text-based assertion

  • Test fails only on one device → device fragmentation or OS-specific behavior

Do not label everything as “flaky” too quickly. First decide whether it is an app bug, automation bug, environment issue, data issue, or device-specific issue.
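Grouping failures by type is easier when the first pass is automated. The following is a rough sketch of a rule-based classifier; the patterns and category names are illustrative assumptions, not a standard taxonomy, and should be adapted to the messages your own suite produces.

```python
import re
from collections import Counter

# Ordered rules: the first matching pattern wins, so more specific
# mobile causes are listed before the generic timeout rule.
RULES = [
    (re.compile(r"permission", re.I), "permission dialog"),
    (re.compile(r"keyboard", re.I), "keyboard issue"),
    (re.compile(r"StaleElementReference", re.I), "UI race condition"),
    (re.compile(r"NoSuchElement|element not found", re.I), "locator drift or wrong screen"),
    (re.compile(r"ConnectionRefused|HTTP (429|503)", re.I), "external service or environment"),
    (re.compile(r"Timeout", re.I), "async, network, or timing"),
]


def classify_failure(message: str) -> str:
    """Map one failure message to a coarse category."""
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    return "unclassified"


def group_failures(messages: list[str]) -> Counter:
    """Count how many failures fall into each category."""
    return Counter(classify_failure(m) for m in messages)
```

Anything that lands in "unclassified" still needs a human to read the logs; the classifier only narrows where to look first.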

Step 3 — Isolate the Test

Run the test in complete isolation — separate process, fresh state, clean app install, and controlled test data. If the test passes alone but fails in the full suite, you likely have test order dependency or shared state pollution.

For mobile tests, check these isolation questions:

  • Was the correct APK or IPA installed?

  • Was the app launched from the expected state?

  • Was the user already logged in from a previous test?

  • Were permissions already granted on one device but not another?

  • Did another test leave items in cart, change profile data, or modify local storage?

  • Did the test rely on cached API data?

  • Did the previous test leave the app on a different screen?

  • Was the same build used locally and in CI?

If your team runs multiple builds across environments, link build version and test result together. A large number of flaky test investigations are wasted simply because the team was debugging the wrong app build.

Step 4 — Add Logs, Screenshots, and Timestamps

Flaky tests leave traces. You need enough evidence to reconstruct what happened before the failure.

For general tests, add timestamps, thread IDs, request IDs, and intermediate state logs before each assertion.

For mobile tests, capture:

  • Screenshot before failure

  • Screenshot after every major step

  • Video recording of the test run

  • Device logs

  • App logs

  • Network logs, where possible

  • Current screen hierarchy

  • App version and build number

  • Device model

  • OS version

  • Permission state

  • Network state

  • Test account used

  • Exact failed step

Do not rely only on the final error message. A mobile test might fail with “element not found,” but the real issue could be a permission popup, loading spinner, keyboard overlay, or wrong app state three steps earlier.
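One way to make this evidence consistent is to attach a structured record to every failed run. The sketch below assumes a hypothetical `FailureEvidence` shape covering a subset of the items listed above; the field names are illustrative and not tied to any particular reporting tool.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class FailureEvidence:
    test_name: str
    failed_step: str
    app_build: str
    device_model: str
    os_version: str
    network_profile: str
    permission_state: dict
    screenshot_paths: list = field(default_factory=list)
    # Timestamp in UTC so records from different machines sort correctly.
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the record so it can be stored next to logs and video."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing the build, device, and failed step in one record means a later investigation does not depend on someone remembering which run produced which screenshot.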

Step 5 — Check Environmental Factors

Compare local, CI, emulator, simulator, and real-device environments. Many flaky tests are deterministic in one environment and non-deterministic in another.

For general tests, compare:

  • OS

  • Timezone

  • Locale

  • CPU count

  • Available memory

  • Network latency

  • Dependency versions

For mobile tests, also compare:

  • Device model

  • OS version

  • Screen size and pixel density

  • Orientation

  • Animation settings

  • Keyboard type

  • Permission state

  • App install state

  • Battery saver / low-power mode

  • Push notifications

  • System dialogs

  • Network profile

  • Device farm configuration

The goal is not to make every environment identical. The goal is to know which variables affect the test so you can either control them or test them intentionally.
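A quick way to see which variables differ between a passing and a failing run is to diff their recorded environments. This is a minimal sketch; the keys and values in `local_run` and `ci_run` are made-up examples.

```python
def diff_environments(passing: dict, failing: dict) -> dict:
    """Return {variable: (passing_value, failing_value)} for every difference."""
    differences = {}
    for key in sorted(set(passing) | set(failing)):
        if passing.get(key) != failing.get(key):
            # A variable recorded on only one side shows up as None on the other.
            differences[key] = (passing.get(key), failing.get(key))
    return differences


# Illustrative data only.
local_run = {"device": "Pixel 7", "os": "Android 14", "animations": "off", "network": "wifi"}
ci_run = {"device": "Pixel 7", "os": "Android 13", "animations": "on", "network": "wifi", "install": "fresh"}
```

Here `diff_environments(local_run, ci_run)` reports the OS version, animation setting, and install state as the candidates worth controlling first.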


Why Mobile Flaky Tests Happen

Mobile flaky tests happen because mobile apps run in a messier environment than most web or backend systems. The UI is affected by the device, OS, keyboard, network, permissions, gestures, animations, and real user interruptions. A test that looks stable on one device can become unreliable across a real mobile device matrix.

Here are the most common causes.

1. Permissions and First-Run App State

Mobile apps often behave differently on first launch. A fresh install might show permission prompts for camera, location, photos, contacts, notifications, microphone, or Bluetooth. An already-used device might skip those prompts entirely.

This creates flaky tests when the automation assumes one state but gets another.

Example:

  • On Device A, location permission is already granted, so the test continues.

  • On Device B, the OS permission dialog appears and blocks the next tap.

  • The test fails with an element-not-found error even though the app itself is working.

Fix this by making permission state explicit. Either pre-grant permissions before the test or handle permission prompts as part of the flow. Do not let permission state depend on what happened in a previous test run.
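On Android, one way to pre-grant runtime permissions is `adb shell pm grant` before the test starts. The sketch below only builds and runs those commands; the device serial and package name are placeholders, and it assumes `adb` is on the PATH and the permissions are declared in the app manifest. Appium's UiAutomator2 driver also offers an `appium:autoGrantPermissions` capability that may be simpler if you already configure capabilities per run.

```python
import subprocess


def build_grant_commands(serial: str, package: str, permissions: list[str]) -> list[list[str]]:
    """Build one `adb shell pm grant` command per runtime permission."""
    return [
        ["adb", "-s", serial, "shell", "pm", "grant", package, permission]
        for permission in permissions
    ]


def grant_permissions(serial: str, package: str, permissions: list[str]) -> None:
    """Run the grant commands; raises CalledProcessError if adb reports a failure."""
    for command in build_grant_commands(serial, package, permissions):
        subprocess.run(command, check=True)
```

A call such as `grant_permissions("emulator-5554", "com.example.app", ["android.permission.ACCESS_FINE_LOCATION"])` would run before app launch, so the flow never depends on what a previous test left behind.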

2. Device Fragmentation

Device fragmentation is one of the biggest reasons mobile flaky tests are harder than web tests. Android alone varies across OEMs, screen sizes, OS versions, navigation modes, keyboards, permission behavior, and system UI overlays. iOS is more controlled, but iOS version differences, device sizes, safe areas, and app tracking prompts can still change how a flow behaves.

A flow can pass on one device and fail on another because:

  • The CTA is below the fold

  • The keyboard covers the button

  • A bottom sheet renders at a different height

  • A permission dialog has different wording

  • A scroll lands in a slightly different position

  • A loading animation takes longer on a slower device

  • A tap coordinate hits a different element on a different screen size

This is why mobile test automation should not be validated on only one emulator. Use a realistic device matrix for smoke, regression, and release testing.

3. OS Dialogs and System Interruptions

System-level dialogs are outside your app, but they can still break your tests. Permission popups, app update prompts, biometric prompts, notification permission dialogs, battery warnings, and system alerts can interrupt a flow.

These failures often get misclassified as locator problems because the target element is technically not visible. But the real problem is that the OS is blocking the app.

Fix this by adding explicit handling for known system dialogs and by resetting device state before major test runs. For high-value regression flows, capture screenshots or video so the team can see whether an OS dialog interrupted the test.

4. Keyboard Behavior

Keyboard behavior is a common source of mobile flaky tests. A form test may enter text correctly, but then fail because the keyboard covers the submit button.

This happens when:

  • The test does not hide the keyboard before tapping the CTA

  • The keyboard layout differs across devices

  • The “Done” action behaves differently across OS versions

  • The screen does not scroll after the input field is focused

  • The CTA is visible in one viewport but hidden in another

Fix this by making keyboard handling explicit. After text input, hide the keyboard or wait for the CTA to become visible and tappable. Avoid assuming that the same tap target will remain visible after text entry on every device.
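As a rough sketch of explicit keyboard handling, the helper below dismisses the keyboard only when it is actually showing, then taps. It is written against two small protocols rather than a real driver; the method names `is_keyboard_shown()` and `hide_keyboard()` mirror the Appium Python client's keyboard helpers, but verify them against your client version. `tap_with_keyboard_dismissed` itself is a hypothetical name.

```python
from typing import Protocol


class KeyboardDriver(Protocol):
    def is_keyboard_shown(self) -> bool: ...
    def hide_keyboard(self) -> None: ...


class Tappable(Protocol):
    def click(self) -> None: ...


def tap_with_keyboard_dismissed(driver: KeyboardDriver, element: Tappable) -> None:
    """Hide the on-screen keyboard if it is visible, then tap the element."""
    if driver.is_keyboard_shown():
        # Calling hide when no keyboard is shown can raise on some platforms,
        # so check first instead of hiding unconditionally.
        driver.hide_keyboard()
    element.click()
```

Routing every post-input tap through one helper keeps the keyboard logic in a single place instead of scattered across individual tests.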

5. Network Variance

Mobile apps are highly sensitive to network conditions. Slow APIs, retries, cached responses, offline states, and partial loading states can all create flaky tests.

A test might fail because:

  • The app is still loading data

  • The backend response is slow

  • A retry banner appears

  • The app falls back to cached data

  • A spinner blocks the next interaction

  • The assertion runs before the UI updates

Fix this by waiting for user-visible state, not fixed time. Wait for the loader to disappear, the expected screen to render, the button to become enabled, or the success state to appear. Where needed, validate backend/API state separately instead of relying only on UI timing.

6. Animations, Gestures, and Timing

Mobile UI is full of motion: transitions, bottom sheets, carousels, scrolling, swipe gestures, skeleton loaders, snackbars, and animated navigation. These can make a test flaky if the automation interacts before the screen is stable.

Common symptoms include:

  • Tap happens before transition finishes

  • Swipe does not travel far enough

  • Scroll inertia moves the target

  • Bottom sheet is still animating

  • Assertion runs before content settles

  • Element exists but is not yet tappable

Fix this by waiting for stable UI state. Avoid blind sleep() calls. Wait until the target is visible, enabled, and stable before interacting.
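"Stable" can be made concrete by polling a property such as an element's position until it stops changing. The helper below is a framework-agnostic sketch: `wait_until_stable` is a hypothetical name, and `read_value` stands in for whatever your driver exposes (for example, a function returning an element's coordinates, or None while the element is missing).

```python
import time


def wait_until_stable(read_value, *, timeout=10.0, interval=0.2, required_matches=3):
    """Poll read_value() until it returns the same non-None value
    `required_matches` times in a row, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    last = None
    streak = 0
    while True:
        current = read_value()
        if current is not None and current == last:
            streak += 1
        else:
            # Value changed (or is missing): restart the streak.
            streak = 1 if current is not None else 0
        last = current
        if streak >= required_matches:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(f"value did not stabilize within {timeout}s (last={current!r})")
        time.sleep(interval)
```

Unlike a fixed sleep, this returns as soon as the animation settles on a fast device and keeps waiting, up to the timeout, on a slow one.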

7. Appium Locator Drift

Appium locator drift is one of the most common maintenance problems in mobile automation. Appium tests often depend on accessibility IDs, resource IDs, XPath, visible text, or UI hierarchy. When product teams update the interface, even harmless changes can break tests.

Examples:

  • A developer renames a resource ID

  • A designer changes button copy

  • A new wrapper view changes the XPath

  • A layout refactor moves the element in the hierarchy

  • A screen redesign changes accessibility labels

  • A dynamic element gets a different ID in each run

The app may still work perfectly, but the test fails because the locator no longer points to the right object.

Fix this by preferring stable accessibility labels, avoiding brittle XPath chains, and testing user-visible outcomes instead of internal UI structure. If your team spends every sprint fixing Appium locators, it may be time to reduce reliance on locator-heavy automation for broad mobile regression coverage.


Root Causes of Flaky Tests — With Fixes

1. Async and Timing Issues (Most Common)

Problem: Tests that rely on sleep() or fixed delays, or that assume operations complete within a certain time window, are inherently fragile. A slow CI machine, a garbage collection pause, or a noisy neighbor in a shared cloud environment can violate these assumptions.

Symptoms:

  • TimeoutError on async operations

  • Tests fail intermittently under load

  • Failures more frequent on CI than locally

Fix — Use Explicit Waits and Polling:

import time
import tenacity

# BAD: Fixed sleep is fragile
time.sleep(2)
assert db.record_exists(record_id)

# GOOD: Poll until the condition is met or the timeout expires.
# reraise=True surfaces the final AssertionError instead of tenacity.RetryError.
@tenacity.retry(stop=tenacity.stop_after_delay(10), wait=tenacity.wait_fixed(0.2), reraise=True)
def wait_for_record(record_id):
    assert db.record_exists(record_id)

wait_for_record(record_id)

For browser automation (Selenium, Playwright, Cypress), use explicit waits such as Selenium's WebDriverWait or Playwright's locator.waitFor() instead of sleep. Cypress retries most commands and assertions automatically, which handles much async UI flakiness out of the box.

For mobile test automation, timing issues usually show up as tap failures, missing elements, stuck loaders, or assertions that run before the screen has settled. Avoid waiting for a fixed number of seconds after every tap. Instead, wait for a clear UI state: a screen title, enabled CTA, completed loading state, disappeared spinner, visible success message, or stable element position. Fixed sleeps make mobile flaky tests worse because device speed, animation timing, and network latency vary across every run.

For async JavaScript tests, always await every Promise and avoid fire-and-forget patterns in test setup:

// BAD
beforeEach(() => {
  db.seed(); // returns a Promise but is not awaited
});

// GOOD
beforeEach(async () => {
  await db.seed();
});

2. Test Order Dependency and Shared State

Problem: Tests that depend on a specific execution order, or that leave side effects (data in a database, files on disk, global variables) for subsequent tests, are a major source of flakiness. Most test runners do not guarantee execution order, and parallel execution makes this worse.

Symptoms:

  • Test passes alone but fails in suite

  • Failures depend on which tests ran before

  • Randomizing test order (for example, with pytest-randomly and a different --randomly-seed) surfaces failures

Fix — Enforce Test Isolation:

Every test must own its setup and teardown. Use transactions that are rolled back after each test, temporary directories, and in-memory databases:

# Django example: TestCase wraps each test in a transaction
from django.test import TestCase  # automatically rolls back the DB after each test

class PaymentTest(TestCase):
    def setUp(self):
        self.user = User.objects.create(email="test@example.com")

    def test_charge(self):
        result = charge(self.user, 100)
        self.assertTrue(result.success)
        # DB is rolled back automatically after each test

For global state (singletons, module-level caches), use mocks or explicit reset functions in tearDown:

def tearDown(self):
    cache.clear()
    config.reset_to_defaults()

For mobile tests, shared state often comes from the app itself. One test may leave the user logged in, grant permissions, add items to cart, change profile data, modify local storage, or leave the app on a different screen. The next test then starts from a state it did not create. To fix this, define the starting state for every mobile flow: fresh install, logged-out state, logged-in state, seeded account, pre-granted permissions, or clean cart. Do not let one test inherit state from another unless that dependency is intentional.

3. Race Conditions in Concurrent Code

Problem: Tests for concurrent or multi-threaded code frequently exhibit race conditions. The scheduler interleaves thread execution differently on each run, so the same test can observe different outcomes.

Symptoms:

  • Failures in tests that exercise queues, workers, or async event processing

  • Inconsistent counts, unexpected None values, partial writes

Fix — Control Concurrency in Tests:

Use barriers, semaphores, or mock executors to make concurrent operations deterministic:

from concurrent.futures import ThreadPoolExecutor

# Use a ThreadPoolExecutor with a controlled thread count
# and join all futures before asserting
with ThreadPoolExecutor(max_workers=1) as executor:
    futures = [executor.submit(process_task, t) for t in tasks]
    results = [f.result() for f in futures]  # .result() blocks until done
assert len(results) == len(tasks)

In Go, use sync.WaitGroup and ensure all goroutines complete before assertions. Never assert on goroutine output without synchronization.
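When the code under test really does run on another thread, the test can wait on an explicit completion signal instead of sleeping. The sketch below uses `threading.Event` rather than a barrier or semaphore, simply because it is the smallest primitive that fits; `run_in_background` is a hypothetical stand-in for your own worker code.

```python
import threading


def run_in_background(task, results: list) -> threading.Event:
    """Run task() on a worker thread and return an Event set when it finishes."""
    done = threading.Event()

    def worker():
        try:
            results.append(task())
        finally:
            # Always signal completion, even if task() raised.
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done


def test_background_task_completes():
    results = []
    done = run_in_background(lambda: 21 * 2, results)
    # Block until the worker signals, with an upper bound instead of a fixed sleep.
    assert done.wait(timeout=5), "worker did not finish in time"
    assert results == [42]
```

The timeout only bounds how long a broken test can hang; on a healthy run the assertion proceeds as soon as the worker signals.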

4. External Dependencies — Network, APIs, and Databases

Problem: Tests that call real external services (third-party APIs, payment gateways, email providers) are non-deterministic by definition. Network latency varies, rate limits kick in, external services have their own outages.

Symptoms:

  • Tests only fail at certain times of day

  • Failures correlate with CI machine network latency spikes

  • Error messages reference timeouts or HTTP 429/503

Fix — Mock or Stub External Dependencies:

# Using unittest.mock to stub HTTP calls
from unittest.mock import patch, MagicMock
@patch("myapp.payments.stripe.charge")
def test_process_payment(mock_charge):
mock_charge.return_value = MagicMock(id="ch_123", status="succeeded")
result = process_payment(user_id=1, amount=5000)
assert result.transaction_id == "ch_123"
mock_charge.assert_called_once_with(amount=5000, currency="usd")

For integration tests that must use real services, use contract testing (e.g., Pact) to verify API compatibility without live calls in every run.

For mobile flaky tests, network variance is often hidden behind UI symptoms. A button may not appear because an API response is slow. A list may render old data because the app used cache. A checkout flow may fail because the backend retried silently. When a mobile UI test fails, check whether the backend state and UI state actually match. For critical flows, combine UI evidence with API or database validation so the team knows whether the app failed, the backend failed, or the automation simply moved too early.

5. Random and Non-Deterministic Data

Problem: Tests that generate random data without seeding the random number generator produce different inputs on each run. A bug that only manifests for certain input values will cause intermittent failures.

Fix — Seed All Random Number Generators:

import random

import numpy as np
import pytest
from faker import Faker

@pytest.fixture(autouse=True)
def seed_random():
    random.seed(42)     # Python's built-in RNG
    np.random.seed(42)  # NumPy's legacy global RNG
    Faker.seed(42)      # shared RNG behind Faker-generated data
    yield

Log the seed value at the start of each test run. When a failure is reported, testers can reproduce it exactly by re-using the same seed.
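A fixed seed always exercises the same inputs. One common variation is to pick a fresh seed per run, log it, and allow an override when reproducing a failure. The sketch below assumes a hypothetical `TEST_SEED` environment variable; the name is an illustration, not a pytest convention.

```python
import os
import random
import secrets


def resolve_seed(environ=os.environ, var: str = "TEST_SEED") -> int:
    """Use the seed from the environment if present, otherwise pick a new one."""
    raw = environ.get(var)
    return int(raw) if raw is not None else secrets.randbelow(2**32)


def seed_run(environ=os.environ) -> int:
    """Seed Python's RNG and log the seed so a failing run can be replayed."""
    seed = resolve_seed(environ)
    random.seed(seed)
    print(f"random seed: {seed} (re-run with TEST_SEED={seed})")
    return seed
```

Calling `seed_run()` once at session start gives varied inputs across runs while keeping every individual run replayable from its log line.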

6. File System and Resource Leaks

Problem: Tests that write to shared paths (/tmp/output.csv), leave open file handles, or fail to clean up ports and sockets cause conflicts when tests run in parallel.

Fix — Use Temporary Directories:

import os
import tempfile

import pytest

@pytest.fixture
def tmp_dir():
    with tempfile.TemporaryDirectory() as d:
        yield d

def test_export_csv(tmp_dir):
    output_path = os.path.join(tmp_dir, "output.csv")
    export_data(output_path)
    assert os.path.exists(output_path)
    # tmp_dir is deleted automatically after the test

For port conflicts in server tests, bind to port 0 so the OS assigns a free port, then read the assigned port back before connecting. For async tests, a plugin such as pytest-asyncio can give each test its own event loop.
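Letting the OS pick the port looks like this with the standard library. `bind_free_port` is a hypothetical helper name; the behavior of binding to port 0 is standard socket semantics.

```python
import socket


def bind_free_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind a listening TCP socket to an OS-assigned port and return both."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen()
    port = sock.getsockname()[1]  # read back the port that was assigned
    return sock, port
```

Keeping the socket open and handing it to the server under test avoids the gap in which another process could grab the port between discovery and use.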

7. Timezone and Locale Sensitivity

Problem: Tests that compare dates, times, or locale-formatted strings often break in CI environments configured with a different timezone or locale than the developer's machine.

Fix — Always Use UTC and Explicit Locale:

# Freeze time for date-dependent tests
from freezegun import freeze_time
@freeze_time("2025-01-15 12:00:00")
def test_invoice_due_date():
invoice = create_invoice(terms_days=30)
assert invoice.due_date == date(2025, 2, 14)

Set TZ=UTC explicitly in your CI environment configuration and use timezone-aware datetime objects throughout your codebase.

Mobile apps often render dates, currencies, addresses, phone numbers, and language strings based on device locale. If your test expects exact visible text, it may pass on one device and fail on another. For mobile regression tests, either set device locale/timezone explicitly or assert the behavior instead of hardcoding locale-sensitive copy.
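To see why a hardcoded date can pass on one device and fail on another, consider a single instant rendered in two timezones. The sketch below uses fixed UTC offsets to stay dependency-free; real code would normally use `zoneinfo` so daylight-saving rules apply. The zone constants and `local_calendar_date` are illustrative names.

```python
from datetime import date, datetime, timedelta, timezone

UTC = timezone.utc
# Fixed offsets for illustration only; they ignore daylight-saving changes.
INDIA = timezone(timedelta(hours=5, minutes=30))
US_PACIFIC_WINTER = timezone(timedelta(hours=-8))


def local_calendar_date(instant: datetime, tz: timezone) -> date:
    """Return the calendar date a device in `tz` would display for `instant`."""
    if instant.tzinfo is None:
        raise ValueError("instant must be timezone-aware")
    return instant.astimezone(tz).date()


# One moment in time, stored unambiguously in UTC.
order_placed = datetime(2025, 1, 15, 22, 0, tzinfo=UTC)
```

A device in India shows this order on January 16, while a device on US Pacific time shows January 15, so an assertion on the exact rendered date only holds if the test also pins the device timezone.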

8. Mobile UI and Locator Drift

Problem: Mobile UI changes frequently. Labels, IDs, hierarchy, copy, layout, and component structure can shift across releases. In Appium-heavy suites, this creates recurring test automation maintenance because the test is tied to how the screen is implemented, not what the user is trying to do.

Symptoms:

  • Element not found after a routine UI change

  • Tap lands on the wrong element

  • Test breaks after button copy changes

  • Same flow fails only on one screen size

  • XPath breaks after a layout refactor

  • Accessibility ID changes break multiple tests

  • The app works manually, but automation fails

Fix — Reduce Locator Fragility:

Prefer stable accessibility labels over brittle XPath. Avoid deeply nested selectors that depend on UI hierarchy. Assert meaningful user outcomes instead of incidental implementation details. During UI refactors, review affected test flows and update labels intentionally.
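A lightweight lint can flag the most fragile XPath locators before they break. The heuristic below is a sketch under stated assumptions: the depth threshold of four levels and the three warning types are arbitrary choices for illustration, and `xpath_brittleness` is a hypothetical helper, not part of Appium.

```python
import re


def xpath_brittleness(xpath: str) -> list[str]:
    """Return human-readable warnings for patterns that tend to break on UI changes."""
    warnings = []
    # Count path steps; long chains depend on the exact view hierarchy.
    depth = len([step for step in xpath.split("/") if step])
    if depth > 4:
        warnings.append(f"deep hierarchy chain ({depth} levels)")
    # Positional indexes such as [2] break when siblings are added or reordered.
    if re.search(r"\[\d+\]", xpath):
        warnings.append("positional index")
    # Exact text matches break on copy changes and localization.
    if re.search(r"@text\s*=|text\(\)\s*=", xpath):
        warnings.append("exact visible text match")
    return warnings
```

Running this over a suite's locators during code review gives a rough list of candidates to replace with accessibility IDs. It is a string heuristic, so a slash inside an attribute value would inflate the depth count.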

If your team keeps fixing the same locators every sprint, the problem is not just a bad selector. The problem is that your automation surface is too tightly coupled to UI implementation. For broad mobile regression coverage, consider shifting high-value user flows away from locator-heavy scripts and toward intent-driven test execution.


How to Reduce Test Automation Maintenance Surface

The best way to fix flaky tests is not to keep patching the same failures forever. The better approach is to reduce the maintenance surface of the automation suite.

Test automation maintenance increases when tests are too tightly coupled to implementation details. This is especially common in mobile automation, where a simple UI change can break multiple Appium locators even when the product behavior is still correct.

To reduce maintenance surface, start with these rules:

  • Automate stable, high-value user flows first: login, onboarding, search, checkout, payment, profile update, core regression, and critical error states.

  • Avoid automating every tiny UI variation through brittle end-to-end scripts.

  • Prefer assertions on user-visible outcomes over assertions on internal structure.

  • Avoid brittle XPath chains where possible.

  • Use stable accessibility labels for important elements.

  • Keep setup reusable: test data, user accounts, permissions, app builds, and environment configuration.

  • Separate smoke tests, regression tests, exploratory coverage, and edge-case validation.

  • Run broad smoke coverage across a smaller device matrix, then deeper regression on representative devices.

  • Review recurring flaky tests every sprint and ask whether the test should be fixed, rewritten, moved lower in the test pyramid, or removed.

For mobile teams, reducing maintenance surface also means choosing the right execution model. If a flow changes every sprint, locator-heavy automation will keep breaking every sprint. For those flows, intent-driven execution can be more useful than maintaining fragile scripts line by line.


Flaky Test Detection Tools and Strategies

Automatic Flakiness Detection in CI

CI platforms and test tooling can help surface flaky tests:

  • GitHub Actions: Use test result annotations and re-run only failed jobs

  • Buildkite / CircleCI: Flaky test dashboards with historical pass rate

  • pytest-flakefinder: Runs each test multiple times to detect flakiness locally

  • Jest --detectOpenHandles: Surfaces async resource leaks in JavaScript

  • Gradle's --rerun-tasks: Forces tests to re-execute instead of being skipped as up-to-date, so repeated runs can expose flakiness in JVM projects

Mobile Flaky Test Detection

Mobile flaky test detection should track more than pass/fail status. A failed mobile test needs enough context to explain whether the app failed, the automation failed, or the environment changed.

Track these dimensions for every mobile test run:

  • Test name

  • Failed step

  • App build version

  • Device model

  • OS version

  • Screen size

  • Network profile

  • Permission state

  • Fresh install vs existing session

  • Test account

  • Screenshot at failure

  • Video recording

  • Device logs

  • App logs

  • Failure category

Useful failure categories include:

  • Locator drift

  • Permission dialog

  • Keyboard issue

  • Network delay

  • Animation/timing issue

  • Backend/API mismatch

  • Device-specific issue

  • Dirty app state

  • Real product bug

Reruns can help confirm whether a failure is flaky, but they should not become the strategy. If a test needs to be rerun three times to pass, it is still broken. Use reruns to collect signal, not to hide instability.
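Once each run is recorded with its device and OS, pass rates can be computed per combination rather than per test alone. The sketch below assumes run records are plain dictionaries with `test`, `device`, `os`, and `passed` keys; that shape and the function names are illustrative assumptions.

```python
from collections import defaultdict


def pass_rates(runs: list[dict], keys=("test", "device", "os")) -> dict:
    """Compute the pass rate for each combination of the given dimensions."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for run in runs:
        group = tuple(run[k] for k in keys)
        totals[group][1] += 1
        if run["passed"]:
            totals[group][0] += 1
    return {group: passed / total for group, (passed, total) in totals.items()}


def flaky_groups(runs: list[dict], keys=("test", "device", "os"), threshold=0.99) -> dict:
    """Return groups with mixed results below the threshold.

    A pass rate of exactly 0 is excluded: a test that always fails is
    broken deterministically, not flaky.
    """
    return {g: r for g, r in pass_rates(runs, keys).items() if 0 < r < threshold}
```

Grouping by device and OS makes it visible when a test is stable everywhere except one device, which a single suite-wide pass rate would hide.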

Quarantining Flaky Tests

When a test is confirmed flaky but cannot be fixed immediately, quarantine it rather than deleting it or ignoring it silently:

import pytest

@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_background_job_completes():
    # Known flaky: ticket #4521 tracks the fix
    ...

The flaky marker above comes from the pytest-rerunfailures plugin, which auto-retries a failing test a fixed number of times. Track quarantined tests in a dedicated dashboard and enforce a policy that quarantined tests must be fixed within a sprint.

For mobile teams, quarantining should also include a failure category. Do not quarantine a test as simply “flaky.” Mark whether it failed because of locator drift, device fragmentation, permission state, keyboard behavior, network variance, OS dialog, dirty app state, or a suspected product bug. This makes the backlog actionable instead of turning into a graveyard of ignored tests.
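The fix-within-a-sprint policy is easier to enforce when quarantine entries carry a category, a ticket, and a date. The sketch below assumes a hypothetical `Quarantine` record and a 14-day sprint; both are illustrative choices.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Quarantine:
    test_name: str
    category: str  # e.g. "locator drift", "permission dialog", "network delay"
    ticket: str
    quarantined_on: date


def overdue(entries: list[Quarantine], today: date,
            max_age: timedelta = timedelta(days=14)) -> list[Quarantine]:
    """Return quarantined tests that have exceeded the allowed age."""
    return [e for e in entries if today - e.quarantined_on > max_age]
```

A CI job could call `overdue(...)` daily and fail, or post a reminder, when the list is non-empty, which keeps the quarantine from becoming permanent.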


Flaky Tests in CI/CD Pipelines — Best Practices

  1. Track flakiness rate as a metric. Measure and alert on tests with a pass rate below 99%. Track this by test case, suite, device, OS version, and app build.

  2. Never merge code that increases the flakiness rate. A flaky test may look like a QA problem, but it affects the whole engineering system. If a change makes the test suite less trustworthy, treat that as a release risk.

  3. Separate app failures, automation failures, and infrastructure failures. A failed mobile test should not automatically be marked as a product bug. Classify whether the failure came from the app, the automation layer, the backend, the device, the network, or the CI environment.

  4. Run smoke tests on every PR. Keep PR-level mobile automation focused on high-signal flows: launch, login, core navigation, checkout/payment, onboarding, and one or two business-critical actions.

  5. Run broader device coverage nightly or before release. Do not run every test across every device on every commit. Use a smaller representative matrix for fast feedback and broader Android/iOS coverage for nightly or pre-release validation.

  6. Pin app build, OS version, and device profile for reproducibility. If a test fails, the team should know exactly which build, device, OS version, and environment produced the failure.

  7. Capture evidence for every failed mobile run. Store screenshots, video, app logs, device logs, failed step, app version, and device metadata. Without evidence, teams waste time guessing.

  8. Use test sharding carefully. Sharding helps reduce runtime, but it can also expose shared state and order dependency issues. If sharding changes failure rate, investigate test isolation.

  9. Do not let retries become your flaky-test strategy. Retrying once can help confirm non-determinism. Retrying repeatedly until the suite turns green only hides the problem and trains the team to ignore failures.

  10. Review recurring flaky tests every sprint. The highest-value flaky test work is usually in the top recurring failures. Fix those first instead of randomly patching one-off failures.


Flaky Test Prevention: Writing Reliable Tests from the Start

The best way to fix flaky tests is to prevent them. Adopt these principles when writing new tests:

Principle | Implementation
--- | ---
Determinism | Seed all randomness; freeze time; mock external calls
Isolation | Each test creates and destroys its own state
Idempotency | Tests can be run any number of times with the same result
Speed | Fast tests reduce reliance on timeouts
Specificity | Assert on exact, known values, not ranges or approximations
Hermetic | No network calls, no shared global state, no file system side effects

For mobile test automation, add these prevention principles as well:

Principle | Mobile implementation
--- | ---
Device control | Define target devices, OS versions, network profile, orientation, and screen size before the run
Permission control | Pre-grant permissions or handle permission prompts explicitly inside the test
Build control | Install and record the correct APK or IPA before execution
UI stability | Wait for visible, enabled, stable UI state instead of fixed delays
Keyboard handling | Hide the keyboard or wait for the CTA to become tappable after text input
Locator resilience | Prefer stable accessibility labels and avoid brittle XPath chains
Evidence capture | Store screenshots, video, logs, failed step, app version, and device metadata
App state control | Define whether the flow starts from fresh install, logged-out state, logged-in state, or seeded account state


Flaky Test Diagnosis Checklist

Use this checklist when investigating a flaky test.

General Flaky Test Checklist

  • Can you reproduce the flakiness by running the test repeatedly?

  • Does the test pass when run in isolation?

  • Does the failure rate change when test execution order is randomized?

  • Are there any sleep() or fixed delay calls in the test or fixtures?

  • Does the test make real network calls or depend on external services?

  • Are random values seeded consistently?

  • Are all async operations properly awaited?

  • Are temporary files, ports, and database records cleaned up after the test?

  • Is the test sensitive to timezone or locale?

  • Does the test involve multi-threading or concurrent code without proper synchronization?

  • Does the failure happen only in CI?

  • Does the failure happen only under parallel execution?

Mobile Flaky Test Checklist

Use this checklist specifically for mobile flaky tests:

  • Did the test fail on one device, OS version, or screen size only?

  • Was the correct APK or IPA build installed?

  • Was the app started from the expected state: fresh install, logged out, logged in, or seeded?

  • Were permissions pre-granted or handled inside the test?

  • Did an OS dialog, notification, biometric prompt, app update prompt, or permission popup appear?

  • Did the keyboard cover the element being tapped?

  • Did the test wait for actual UI state, or did it rely on fixed sleeps?

  • Did a network delay leave the app in loading, retry, offline, or cached state?

  • Did an animation, bottom sheet, carousel, or scroll inertia affect the tap?

  • Did an Appium locator break because of copy, hierarchy, or accessibility ID changes?

  • Did the same flow pass when run manually on the same device?

  • Are screenshots, video, app logs, and device logs attached to the failure?

  • Is the failure an app bug, automation bug, backend issue, device issue, or environment issue?

  • Has this same flaky test failed in previous sprints?


Conclusion

Flaky tests are not a minor inconvenience. They are a systemic risk to software quality, delivery speed, and team trust. The path to fixing them is the same as fixing any complex bug: reproduce reliably, classify the failure, isolate the cause, collect evidence, and apply a targeted fix.

For mobile teams, the problem is even sharper. Mobile flaky tests can come from device fragmentation, permissions, OS dialogs, keyboard behavior, network variance, animations, app state, or Appium locator drift. That means the fix is not simply “add a longer wait” or “rerun the test.” The fix is disciplined diagnosis plus a smaller, more reliable automation surface.

If your team is tired of fixing the same locators every sprint, Quash helps you run mobile flows from plain-language intent on real devices. With Test Paths and reruns, your team can reduce repeated maintenance and focus on whether the app actually works — not whether another brittle script broke again.
