Exploring AI Quality Assurance in the Agentic World: My Initial Thoughts


If you are a QA professional who has been digging into the AgentForce phenomenon, it may look like a fascinating, if slightly daunting, frontier. We often envision a future where customer interactions are smooth, with a powerful set of systems working together under the hood to make operations effortless for the operator and to leave the customer amazed at the precision, right?

A future where the workday isn’t a confusing sequence of tasks, but something that tells you what to do, whom to speak to, and what to do if things don’t work out as planned.

Well, what if this isn’t some dream, but a reality arriving with the next release on our Salesforce org this weekend? That thought alone sends shivers down my spine. AgentForce, it seems, is really happening now, and it promises a level of automation we have not seen before. These under-the-hood systems are branded as “digital labour”, and they are reshaping the enterprise landscape.

These agents are, well, quite magical. And this is where your AI Quality Assurance instincts may kick in, prompting the question: “What happens when that magic falters?” Failures in agent automation, from what I’ve gathered, won’t just lead to escalating costs; they can trigger a cascade of negative effects. We’re talking about disruptions in operations, a loss of customer trust, and even some ugly ethical dilemmas.

So, it raises the question: Are we truly prepared for this shift?

I think the answers lie in our tried-and-true software development disciplines and the time-tested rigour of QA best practices. As these deployments gain momentum, businesses are recognising that testing isn’t just a project phase; it’s genuinely the key to unlocking AgentForce’s full potential.

It appears LLM builders globally are already grappling with the unpredictability of their models’ outputs, and agents built upon these models inherit that same weakness. So, how, then, do we ensure our “delivery is great again” in this new age of AI in software testing?

How We Approach AI Quality Assurance in Testing Agents

This is where things get particularly interesting. Unlike traditional software, AI agents interpret and respond to natural language inputs. Their responses are drawn from a mix of LLM outputs and the institutional knowledge stored in Salesforce, which inherently makes their behaviour less predictable.

Without comprehensive AI Quality Assurance, these agents could easily produce inaccurate, biased, or inconsistent responses, leading to less-than-ideal user experiences. My research has led me to explore the best practices and AI testing tools that Salesforce provides to support a refined QA process for AgentForce.
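
One way to make that unpredictability measurable is to send the same utterance several times and check how often the replies agree. The sketch below is only an illustration of the idea: `fake_agent` is a hypothetical stand-in for a real agent call (it is not a Salesforce API), and exact string matching is an assumed, deliberately crude comparison.

```python
import itertools
from collections import Counter
from typing import Callable


def consistency_rate(ask: Callable[[str], str], utterance: str, runs: int = 5) -> float:
    """Share of runs that return the most common (normalised) reply."""
    replies = [ask(utterance).strip().lower() for _ in range(runs)]
    most_common_count = Counter(replies).most_common(1)[0][1]
    return most_common_count / runs


# Hypothetical stand-in for a real agent: cycles through canned replies,
# one of which differs from the rest.
_canned = itertools.cycle([
    "Your order has shipped.",
    "Your order has shipped.",
    "Your order has shipped.",
    "Your order has shipped.",
    "I am not sure.",
])


def fake_agent(utterance: str) -> str:
    return next(_canned)


rate = consistency_rate(fake_agent, "Where is my order?", runs=5)
print(f"consistency: {rate:.0%}")  # prints "consistency: 80%"
```

Real agents often paraphrase the same answer, so a practical version would likely compare meaning rather than exact text.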

Key Testing Tools at Our Disposal

Salesforce has provided some solid AI testing tools to get us started:

AgentForce Testing Center

This is a low-code platform that allows admins and developers to simulate real-world business interactions with AI agents. It’s quite versatile, supporting various communication channels like email, WhatsApp, and experience sites, which means we can achieve comprehensive testing across different mediums.

A standout feature is its ability to automatically generate test cases based on common user queries, significantly reducing manual effort. We can also simulate diverse scenarios in controlled environments to observe agent responses and even perform bias detection to ensure fairness. Using the AgentForce Testing Center is now a critical part of AI Quality Assurance strategies.

Testing API

For larger-scale testing, the Testing API allows developers to programmatically create and execute tests. This approach, I think, is particularly beneficial for assessing numerous “utterances” (a term for user inputs) and evaluating agent performance over time.

Its key benefits include efficiency, by automating repetitive tasks; consistency, by ensuring uniform testing conditions; and seamless integration with CI/CD pipelines for continuous AI QA automation.
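
To make the idea of batch-testing utterances concrete, here is a minimal sketch of what such a suite could look like. It does not use the actual Testing API; `UtteranceCase`, `run_suite`, and `stub_classify` are hypothetical names invented for illustration, and the stub simply stands in for the agent under test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UtteranceCase:
    utterance: str       # what the simulated user says
    expected_topic: str  # topic we expect the agent to select


def run_suite(classify: Callable[[str], str], cases: List[UtteranceCase]) -> dict:
    """Run every case the same way and collect the failures."""
    failures = []
    for case in cases:
        actual = classify(case.utterance)
        if actual != case.expected_topic:
            failures.append((case.utterance, case.expected_topic, actual))
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}


# Hypothetical stand-in for the agent; a real run would call the agent under test.
def stub_classify(utterance: str) -> str:
    text = utterance.lower()
    if "refund" in text:
        return "Refunds"
    if "order" in text:
        return "Order Status"
    return "General"


cases = [
    UtteranceCase("Where is my order?", "Order Status"),
    UtteranceCase("I want a refund", "Refunds"),
    UtteranceCase("Can I get my money back?", "Refunds"),  # phrasing the stub misses
]
report = run_suite(stub_classify, cases)
print(report["passed"], "of", report["total"], "passed")  # prints "2 of 3 passed"
```

In a CI/CD pipeline, a non-empty `failures` list could be used to fail the build, which is where the consistency benefit shows up: every run applies the same cases in the same way.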

AgentForce DX

Much like the standard Salesforce Developer Experience (DX), AgentForce DX is a developer-centric tool for advanced testing and debugging. This means developers can conduct extensive unit testing as part of their AI testing framework before handing off the agent for QA validation, which seems like a sensible approach.

Best Practices: The AI Agent Testing Pyramid

Our traditional testing models typically focus on unit, integration, and end-to-end tests. The AI testing framework within Salesforce’s AgentForce offers a structured methodology that balances these different levels of complexity, aiming for a more reliable deployment of AI agents. The pyramid is built on those same three layers.

By starting with unit tests, developers can evaluate individual components of the AI agent – think its reasoning engine and APIs – to ensure each part functions correctly. This, to me, is crucial for catching issues early in the development process, ultimately saving time and costs.

Integration tests then assess how these components work together, ensuring the agent interacts effectively with external systems and data sources.

At the very top of the pyramid, end-to-end tests simulate actual user interactions, providing a holistic view of how the agent performs across various scenarios.

This level of testing is absolutely vital for evaluating the agent’s ability to handle complex, multi-turn conversations and interpret ambiguous user inputs. By adopting the AI testing framework, teams can address potential issues at each stage of development, from the smallest unit to full interaction flows.

This layered approach not only boosts testing efficiency but also ensures thorough AI model testing before deployment, reducing error risks and improving overall user satisfaction.
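
As a rough illustration of the two lower layers, the sketch below tests a single action on its own (unit) and then the reply logic together with that action (integration). Everything here is hypothetical: `get_order_status` and `answer_order_query` are invented stand-ins, not AgentForce components.

```python
from typing import Dict, Optional


# Hypothetical action an agent might invoke: looks up an order's status.
def get_order_status(orders: Dict[str, str], order_id: str) -> Optional[str]:
    return orders.get(order_id)


# Hypothetical agent step that wraps the action's result in a reply.
def answer_order_query(orders: Dict[str, str], order_id: str) -> str:
    status = get_order_status(orders, order_id)
    if status is None:
        return f"I could not find order {order_id}."
    return f"Order {order_id} is {status}."


def test_unit_action_returns_status() -> None:
    # Unit layer: the action alone, against in-memory data.
    assert get_order_status({"A1": "shipped"}, "A1") == "shipped"
    assert get_order_status({}, "A1") is None


def test_integration_reply_uses_action() -> None:
    # Integration layer: the reply logic together with the action.
    orders = {"A1": "shipped"}
    assert answer_order_query(orders, "A1") == "Order A1 is shipped."
    assert answer_order_query(orders, "B2") == "I could not find order B2."


test_unit_action_returns_status()
test_integration_reply_uses_action()
print("unit and integration checks passed")
```

The cheap, deterministic checks sit at the bottom, so the slower and less predictable end-to-end conversations at the top only need to cover what the lower layers cannot.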

Not Without Its Challenges: Limitations in AgentForce Testing

While Salesforce’s AgentForce Testing Center provides robust tools for testing AI agents, it’s important to acknowledge some limitations. The Testing Center, for example, primarily focuses on isolated utterances. This might not fully reflect real-world conversational flows, potentially leading us to miss issues in complex multi-turn dialogues.
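
To show what an isolated-utterance test can miss, here is a small sketch of a multi-turn check in which the second turn only makes sense given the first. `StubAgent` is a hypothetical stand-in that remembers an order number between turns; it is an assumption for illustration, not how AgentForce is implemented.

```python
from typing import List, Optional, Tuple


class StubAgent:
    """Hypothetical agent that remembers the order number across turns."""

    def __init__(self) -> None:
        self.order_id: Optional[str] = None

    def reply(self, utterance: str) -> str:
        words = utterance.replace("?", "").split()
        ids = [w for w in words if w.startswith("#")]
        if ids:
            self.order_id = ids[0]
            return f"Got it, looking at order {self.order_id}."
        if "cancel" in utterance.lower():
            if self.order_id is None:
                return "Which order do you mean?"
            return f"Order {self.order_id} has been cancelled."
        return "How can I help?"


def run_conversation(agent: StubAgent, turns: List[Tuple[str, str]]) -> List[int]:
    """Return the indexes of turns whose reply lacks the expected phrase."""
    failed = []
    for i, (utterance, expected_phrase) in enumerate(turns):
        if expected_phrase not in agent.reply(utterance):
            failed.append(i)
    return failed


turns = [
    ("I have a question about order #123", "#123"),
    ("Please cancel it", "Order #123 has been cancelled"),  # depends on turn 1
]
failed_turns = run_conversation(StubAgent(), turns)
print("failed turns:", failed_turns)  # prints "failed turns: []"
```

Sent on its own, "Please cancel it" would get a clarifying question instead, which is exactly the kind of context dependence that single-utterance tests do not exercise.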

Additionally, handling the sheer variability of natural language inputs can be quite challenging, which could result in gaps in test coverage. Simulating dynamic contexts, where external data and user interactions are constantly shifting, also proves difficult, making it hard to test agents under truly real-world conditions.

Another significant constraint is the resource-intensive nature of large-scale testing, which demands substantial computational power and personnel.

Identifying subtle biases in agents is yet another challenge, as these biases can be highly context-dependent and difficult to detect with standard tests. Moreover, integrating agents with existing enterprise systems might uncover issues not captured in controlled test environments.

Finally, agents can sometimes struggle with interpreting complex or ambiguous user intentions, making such test scenarios harder to design and execute effectively. Despite these limitations, a well-planned, comprehensive testing strategy can help address these issues and improve agent performance.

And just to add another layer of complexity, there’s the cost of maintaining sandboxes for functional and regression testing. Salesforce charges up to 8 cents per action (they call them Flex Credits) in sandbox environments.

This can really escalate the costs of comprehensive testing, particularly for large-scale or long-term testing cycles. These charges can accumulate quickly as testing demands grow, making budget management and strategic planning crucial for effective AI Quality Assurance.
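
A quick back-of-the-envelope calculation shows how fast this adds up. The sketch below assumes the flat upper-bound rate of 8 cents per action quoted above; the suite size, actions per case, and number of runs are made-up illustrative numbers, and actual pricing may differ.

```python
RATE_CENTS_PER_ACTION = 8  # upper-bound rate quoted above; actual pricing may differ


def estimated_cost_dollars(test_cases: int, actions_per_case: int, runs: int) -> float:
    """Worst-case cost of running a suite `runs` times, in dollars."""
    total_actions = test_cases * actions_per_case * runs
    return total_actions * RATE_CENTS_PER_ACTION / 100


# Illustrative numbers only: 500 cases, 3 actions each, 20 regression runs.
print(estimated_cost_dollars(500, 3, 20))  # prints 2400.0
```

Under these assumptions, 30,000 actions come to $2,400, so trimming redundant cases or running the full suite less often has a direct effect on the bill.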

Concluding Thoughts: A New Horizon for QA

So, that’s my initial deep dive into AgentForce testing. It’s clear that while the tools are powerful, this new landscape presents unique challenges for QA. My research suggests we’re stepping into an era where traditional QA best practices are more crucial than ever, but must be applied with a new understanding of AI in software testing.

We’re not just testing software anymore; we’re testing intelligence. Agile, adaptable testing and continuous learning will help ensure these agents don’t just work, but work reliably and ethically, completing the circle of effective AI QA automation and robust AI model testing.

What are your thoughts on integrating these new AI testing paradigms into our existing workflows? Any ideas on how we can mitigate some of these challenges, especially around those sandbox costs?
