
Your QA Team Didn't Miss Those Bugs. The Entire Industry Did.

When Anthropic unveiled Claude Mythos Preview and launched Project Glasswing, the coverage was predictably breathless. A frontier AI model. Government briefings. Emergency meetings with major banks. The tech press treated it as a geopolitical thriller.

But underneath the drama is a quieter, more consequential story — one that matters to every engineering team that ships software, regardless of whether they work on operating systems or mobile apps.

Anthropic says Mythos Preview identified thousands of high-severity vulnerabilities across major operating systems and browsers, including, according to the company, a critical flaw in OpenBSD that had gone undetected since approximately 1999. The model reportedly found it not by running faster scripts or broader scans, but by reasoning about the code — forming hypotheses about where intent and implementation diverged, and probing those gaps autonomously.

That detail is not a footnote about AI capability. It is a direct statement about the structural limits of every testing methodology the software industry has relied on for the past three decades.

This is a testing story. The protagonist is not the model. It is the paradigm the model just walked around.

The OpenBSD finding and what it reveals about the limits of test automation

According to Anthropic's Project Glasswing announcement, OpenBSD — widely regarded as one of the most security-hardened operating systems in existence, used to run firewalls, routers, and critical infrastructure worldwide — harbored a vulnerability that the company says dates to around 1999. If accurate, that flaw survived the dot-com crash, the rise of agile methodologies, a decade of DevOps transformation, the shift-left movement, continuous integration pipelines, and countless penetration testing engagements conducted by some of the most skilled security engineers in the industry.

Anthropic and its partners are clear that Mythos did not find this by executing a more comprehensive version of what existing tools already do. The model read the code, reasoned about what the program was trying to accomplish, formed a hypothesis about where the implementation could fail, and confirmed it — a process closer to expert security research than to automated scanning.

That distinction is the entire point. And understanding it requires being honest about what automated testing was designed to do in the first place.


What traditional test automation can and cannot detect

Modern test automation is fundamentally deterministic. You define an expected behavior. You write a script that checks whether that behavior occurs. You run the script. Pass or fail.
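The scripted model can be reduced to a few lines. The sketch below is purely illustrative (the function, its prices-in-cents convention, and the test names are all invented), but it captures the shape: expected behavior encoded up front, then a binary pass/fail check.

```python
# Hypothetical function under test; prices are integer cents to keep
# the arithmetic exact.
def checkout_total(items):
    """Sum (price_cents, quantity) pairs for a cart."""
    return sum(price * qty for price, qty in items)

def test_checkout_total():
    # The expected behavior is defined in advance; the script only
    # verifies that it still holds. Pass or fail.
    assert checkout_total([(999, 2), (500, 1)]) == 2498
    assert checkout_total([]) == 0

test_checkout_total()
print("pass")
```

Every assertion here encodes something an engineer already knew to check. That is the model's strength, and also the boundary of what it can ever report on.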

This model is extraordinarily effective at one specific problem: verifying that things you already know about continue to work. Regression testing. Smoke testing. UI flow validation. Checking that the checkout flow still functions correctly after last Tuesday's deployment. According to Katalon's State of Quality Report 2025, which surveyed more than 1,500 QA professionals, around 45% of teams identify regression testing as their most automated testing type. Scripts genuinely excel here. Solid regression testing for mobile apps can catch a substantial share of the defects that fall within expected user paths — and for teams shipping weekly or daily, that coverage is not optional.

But a script cannot test what it has not been told to test. It cannot ask, "What if the core assumption behind this function is wrong?" It cannot notice that a piece of code doing its job quietly for years might be doing it in a way that is subtly exploitable under conditions no one thought to encode as a test case.
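A toy illustration of that blind spot, with an entirely hypothetical discount function: every scripted assertion passes, yet an input no one anticipated produces behavior no one defined.

```python
# Hypothetical function under test; prices are integer cents.
def apply_discount(total_cents, code):
    """Apply an optional discount code to a cart total."""
    if code == "SAVE10":
        return total_cents * 90 // 100
    return total_cents

# The scripted suite exercises only the paths someone anticipated,
# and it passes cleanly:
assert apply_discount(10000, "SAVE10") == 9000
assert apply_discount(10000, "") == 10000

# But no test was ever written for a negative total, because no one
# defined what one means. The behavior exists anyway, and ships:
print(apply_discount(-10000, "SAVE10"))  # -9000, a negative charge
```

The green suite and the latent defect coexist indefinitely. Nothing in the scripted model will ever surface the third call, because nothing told it to make that call.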

This is not a failure of the QA engineers who wrote those scripts. It is a structural constraint of the scripted testing model itself. You cannot automate curiosity. You cannot script for unknown unknowns. That is not a criticism — it is simply a description of what the model was built to do, and where its ceiling sits.

If you are new to automation and want to understand the full landscape before encountering its edges, learning test automation from scratch is the right place to start. But even the most thorough foundation in scripted automation will eventually bring you to the same structural wall.

The industry has known this for decades yet kept scaling scripts anyway

The cost of that structural gap is measurable. The Consortium for Information & Software Quality (CISQ) estimated in its 2022 report that poor software quality cost the US economy at least $2.41 trillion that year. That figure includes cybercrime losses driven by existing software vulnerabilities, supply chain failures tied to open-source deficiencies, and approximately $1.52 trillion in accumulated technical debt — the cost of deficiencies that have been allowed to compound rather than fixed.

To put that in perspective, the CISQ number is roughly double the US federal budget deficit for the same year. The report's author, Herb Krasner, was direct: "finding and fixing bugs is the largest single expense component in a software development lifecycle." The cybercrime component of that cost alone rose 64% between 2020 and 2021, and another 42% between 2021 and 2022. Open-source supply chain vulnerabilities rose 650% over the same window.

It is worth being precise about the historical comparison. NIST's 2002 study estimated that inadequate software testing infrastructure cost the US economy $59.5 billion annually. CISQ's 2022 figure is not a direct update to that number — the studies measure different scopes, methodologies, and eras. But the directional story is unambiguous. The problem has not improved as tooling improved. It has accelerated as software has become the substrate of everything.

Meanwhile, the industry's primary response to that acceleration was to write more scripts, faster. Katalon's 2025 survey found that 73% of testers use scripting or automation for functional and regression testing. Continuous testing adoption surged from 16% to over 50% in 2025. Substantial QA budgets are flowing into automation tooling. The investment is real and the coverage gains are genuine — but they are gains on known ground. The unknown surface has kept growing, largely untouched.

Understanding the limitations of scriptless and scripted test automation approaches side by side is useful here, because the comparison clarifies why the problem is architectural rather than a question of which framework you are using.

The biggest limitation of automated testing: Behavior no one defined

There is a concept in security research called the attack surface — the total set of ways a system can be exploited. For most software teams, automated test coverage maps to something narrower: the expected interaction surface, which is the set of behaviors engineers imagined, the flows users were designed to take, and the outcomes the product team formally specified.

The gap between those two surfaces is where long-surviving vulnerabilities live.

A 2024 analysis in the developer community described the problem plainly: automated test suites "excel at regression testing and checking known workflows, but struggle with edge cases that weren't anticipated in test scripts, real user behavior that doesn't follow predetermined paths, and deeper defects such as memory leaks or race conditions." Autonomous testing approaches, the same analysis noted, represent "a fundamental breakthrough" because they can "explore applications intelligently, like real users would, discovering edge cases and unexpected behaviors" — rather than marching down a predetermined list.

This is not a new diagnosis. Engineers have understood the limits of scripted coverage for as long as scripted coverage has existed. The reason the industry never closed this gap is not lack of awareness. It is lack of tooling capable of reasoning rather than merely executing. The combinatorial space of possible application states in any modern codebase is too large for manual exploratory testing to cover at scale. And until recently, AI systems were not capable of the kind of multi-step contextual reasoning needed to probe that space intelligently.

The challenge is particularly sharp in mobile environments, where device fragmentation, OS version variation, and dynamic UI states mean that scripted selectors break constantly and real-device behavior diverges sharply from emulators. Thinking through how to improve mobile test coverage with AI makes the gap concrete — the surface area of what could go wrong on a real device under real conditions is vastly larger than any scripted suite can feasibly represent.

According to Anthropic, Mythos Preview's cybersecurity capabilities were not the result of specialized security training. They emerged from the model becoming sufficiently capable at general code reasoning, logical inference, and autonomous execution. That is the development that matters — not the specific model or the security context, but the demonstration that AI has crossed a threshold where it can reason about code the way an expert human does, rather than pattern-matching against known signatures.

Project Glasswing and what it signals about AI software testing

Anthropic's response to its own findings was to launch Project Glasswing, a coordinated effort to deploy Mythos Preview for defensive purposes before its capabilities could be replicated by less responsible actors. The initiative launched with twelve partner organizations: Amazon Web Services, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Each partner is using the model to identify and patch vulnerabilities in their own critical systems, with findings intended to benefit the broader industry.

CrowdStrike's statement on joining the consortium is worth quoting directly: "The window between a vulnerability being discovered and being exploited by an adversary has collapsed — what once took months now happens in minutes with AI."

That is not a prediction. It is a description of the current condition. And it carries a direct implication for product engineering teams well outside the security space.

If Anthropic and its partners are right that AI-powered reasoning can surface vulnerabilities that survived years of human review, the same class of capability will eventually be brought to bear on any sufficiently complex application. The question every engineering team now faces is whether they use that capability proactively — to find issues before release — or reactively, after users find them first.

What this means for QA teams right now

The most important reframe here: none of this is an indictment of QA engineers. The teams who have spent years building Selenium suites, Appium frameworks, and CI pipelines were doing exactly what the industry asked them to do, with the tools the industry gave them. They found the bugs those tools were structurally capable of finding. That is not a failure. That is the system working as designed.

The problem is that the design was always insufficient for a class of defects that require not scripted verification but genuine reasoning about intent — bugs that exist in the gap between what the code was supposed to do and what it actually does when a user or an adversary approaches it from an angle no one anticipated.

Katalon's 2025 survey found that 82% of QA professionals still rely on manual testing daily, even as automation adoption has surged. That is not a sign of resistance to automation. It is a sign that practitioners know, from experience, that scripts have a ceiling — and that beyond the ceiling, human judgment is the only available fallback. The problem is that human judgment at scale is not economically or operationally viable in a modern release cycle.

The best practices for AI-driven mobile testing workflows reflect this shift in real terms: teams that have moved beyond pure scripted automation are not abandoning their existing suites, they are extending their reach into the behavioral space that scripts cannot cover. That extension is where the most consequential quality improvements are now being found.

From scripted QA to intent-driven testing: where the industry goes next

The testing industry is not short on automation. What it has historically lacked is reasoning — the ability to go beyond the paths engineers anticipated and examine the gaps between what software is supposed to do and what it actually does under conditions no one thought to specify.

That gap compounds at scale. A QA team maintaining a large scripted suite faces a structural tax: every new feature demands new scripts, every UI change breaks existing selectors, and the surface of unknown behavior grows faster than any team can manually cover. The result is that manual testing and scripted automation end up doing parallel, partial jobs — each covering what the other cannot, neither covering everything.

Generating test cases from PRDs, Figma designs, and code is one way AI is already reducing that overhead, allowing teams to convert specifications into test coverage without starting from a blank script. But generation addresses the efficiency problem. It does not resolve the deeper issue: tests anchored to what you already imagined still cannot find what you didn't.

Intent-driven testing is the structural answer to that problem. Instead of writing a script that checks a specific selector on a specific screen, a tester describes what a user should be able to accomplish — in plain language — and the system determines how to execute and validate that goal against the live application. This changes not just how tests are written, but what can be tested. An intent-driven system adapts when layouts change, validates backend behavior during the same run as frontend interaction, navigates dynamic states without breaking, and finds failures in flows that were never formally specified.
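As a deliberately minimal sketch of the idea — not any real product's API — the toy below models an app as a graph of screens and actions, states a goal, and lets a search find the steps. Because the steps are derived at run time rather than hard-coded, the same intent survives layout changes that would break a selector script.

```python
from collections import deque

# Toy app model (all screen and action names are invented): each screen
# maps available actions to the screen they lead to.
APP = {
    "home":    {"tap search": "search", "tap cart": "cart"},
    "search":  {"tap result": "product", "tap back": "home"},
    "product": {"tap add to cart": "cart", "tap back": "search"},
    "cart":    {"tap checkout": "order_placed", "tap back": "home"},
}

def achieve(goal_screen, start="home"):
    """Breadth-first search for a sequence of actions reaching the goal.

    Stands in for the reasoning step: the 'test' states WHAT to reach,
    and the system works out HOW, against the current app structure.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if screen == goal_screen:
            return path
        for action, nxt in APP.get(screen, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # intent is unachievable: itself a meaningful finding

# Intent: "a user should be able to place an order."
print(achieve("order_placed"))  # ['tap cart', 'tap checkout']
```

If "tap checkout" later moves behind the product screen, the intent still resolves — the search simply finds the new path — whereas a script pinned to the old selectors would fail for reasons that have nothing to do with product quality. Real intent-driven systems replace this toy graph search with model-driven reasoning over a live UI, but the division of labor is the same: the tester supplies the goal, the system supplies the steps.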

AI-powered mobile app testing built on this model — including platforms like Quash, which combines natural language test creation with AI execution on real devices — is beginning to close the gap that scripted automation has always left open. The shift is less about replacing existing QA practice and more about extending what QA practice can reach. Scripts remain the right tool for deterministic, high-frequency regression coverage where behavior is well-defined and stable. But the unexplored surface — the part of every application that has never been expressed as a test case — now has a credible answer for the first time.

The full picture of where testing is shifting — from scripts to intent — is worth understanding not as a product decision but as a strategic one. The ceiling on what scripted automation can find is fixed. The ceiling on what reasoning-capable AI can find is not.

Key takeaways

The Glasswing disclosure is a testing story, not just a security story. The class of bugs Anthropic says Mythos found are the same class that scripted automation has always been structurally unable to reach — deep, context-dependent defects that require reasoning, not pattern-matching.

The industry's testing investment has been well-directed but architecturally incomplete. Scripts protect known ground with genuine effectiveness. They cannot explore unknown ground by design. That is not a flaw in any specific tool — it is a constraint of the scripted model.

The cost of that constraint is documented and large. The CISQ estimated at least $2.41 trillion in poor software quality costs in the US in 2022 alone, with $1.52 trillion in accumulated technical debt. The problem scales with software complexity, not with team quality.

The shift is from verification toward reasoning. The next generation of QA automation is not about faster scripts or higher coverage percentages on known paths. It is about systems that understand what software is supposed to do — and find where it does not — without needing to be told exactly where to look.

Teams that make that shift will find defects their competitors are still shipping. The ones that wait will keep learning about those defects from their users.

FAQs

What are the limitations of automated testing?

Automated testing is highly effective at verifying known, expected behavior — regression flows, smoke tests, and defined UI validations. Its structural limitation is that it cannot test what it has not been told to test. Scripts cannot reason about intent, surface edge cases that fall outside predefined paths, or find vulnerabilities that emerge from how code behaves under conditions that were never specified. This is not a quality problem with any specific tool — it is an architectural constraint of the scripted model. The CISQ estimated that the aggregate cost of this gap in the US alone reached at least $2.41 trillion in 2022, driven by existing vulnerabilities, technical debt, and supply chain failures in open-source software.

Can AI replace test automation scripts?

Not entirely — and that framing misses the point. Scripts remain the right tool for deterministic, high-frequency regression coverage where expected behavior is well-defined and stable. What AI-based testing changes is the scope of what can be tested. Intent-driven AI systems can explore application behavior without predefined scripts, adapt when interfaces change, validate backend responses in the same run as frontend interactions, and find failures in flows that were never explicitly scripted. The practical shift is that AI expands what QA practice can reach, rather than replacing the scripted coverage that already works. The most resilient QA programs will use both — scripts for known ground, AI reasoning for the rest.

What is intent-driven testing?

Intent-driven testing defines test coverage by describing what a user should be able to accomplish — in natural language — rather than by scripting specific steps, selectors, or assertions. The AI system interprets that intent, executes the appropriate actions against the real application, and validates whether the outcome matches the goal. Because execution is driven by reasoning rather than predetermined logic, intent-driven systems handle dynamic UI states, adapt to layout changes without breaking, and surface failures in flows that were never formally specified. It is the approach most directly suited to closing the gap between what scripted automation covers and what users actually encounter.

Related Blogs

  1. Regression testing:

    https://quashbugs.com/blog/regression-testing-mobile-apps

  2. Scriptless test automation:

    https://quashbugs.com/blog/scriptless-test-automation

  3. AI-driven mobile testing workflow / best practices:

    https://quashbugs.com/blog/ai-mobile-testing-best-practices

  4. AI-powered mobile app testing:

    https://quashbugs.com/blog/mobile-testing-with-ai

  5. AI test case generation from PRD, Figma, and code:

    https://quashbugs.com/blog/ai-test-case-generation-prd-figma-code

  6. Improving mobile test coverage with AI:

    https://quashbugs.com/blog/improving-mobile-test-coverage-ai

  7. Ultimate guide to mobile app testing:

    https://quashbugs.com/blog/mobile-app-testing-tools-2025-ultimate-guide

  8. Learn test automation from scratch:

    https://quashbugs.com/blog/learn-test-automation-beginners

  9. From scripts to intent:

    https://quashbugs.com/blog/from-scripts-to-intent-what-changed-in-mahoraga-v2